
What is Elasticsearch?

Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene. In the context of web scraping, it commonly serves as the sink for unstructured text, logs, and complex JSON documents where full-text search, fuzzy matching, and near-real-time aggregations are required. While not a traditional relational database, its inverted index architecture makes it a widely used choice for querying massive datasets of scraped product catalogs, news articles, and job postings.

Databases · Full-Text Search · Lucene · Inverted Index · NoSQL
// 02 — definitions

Search at scale.

Why storing scraped text in Postgres eventually breaks, and how an inverted index solves the read-heavy workload of unstructured data.


TL;DR

Elasticsearch is a distributed document store optimized for search. Instead of scanning rows, it uses an inverted index to map words to the documents containing them. For scraping pipelines, it's the ideal sink for text-heavy payloads like articles, reviews, and product descriptions where downstream consumers need sub-second search across billions of records.

01 · Definition & structure
Elasticsearch is a distributed, RESTful search engine built on top of the Apache Lucene library. It stores data as JSON documents and uses an inverted index to enable rapid full-text search. A cluster consists of one or more nodes, which hold indices. Indices are divided into shards, which can be distributed across the nodes to parallelize operations and provide high availability.
02 · The inverted index
A relational database answering a substring query such as LIKE '%laptop%' typically has to scan every row, because a B-tree index cannot serve a leading-wildcard match. Elasticsearch instead builds an inverted index. When a scraped document is ingested, an analyzer tokenizes the text into individual terms, and the index maps each term to the list of document IDs that contain it. When you search for "laptop", Elasticsearch doesn't scan documents; it looks up "laptop" in the index and retrieves the associated document IDs directly.
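The lookup described above can be sketched in a few lines of Python. This is a toy model, not how Lucene stores its index: it splits on whitespace instead of running a real analyzer, and the sample documents are made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, *terms):
    """Return IDs of documents containing every query term (boolean AND)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()

# Illustrative scraped product titles (hypothetical data).
docs = {
    1: "gaming laptop with backlit keyboard",
    2: "laptop sleeve in grey felt",
    3: "mechanical keyboard with red switches",
}
index = build_inverted_index(docs)
print(sorted(search(index, "laptop")))              # [1, 2]
print(sorted(search(index, "laptop", "keyboard")))  # [1]
```

The cost of a query depends on the length of the matching term lists, not on the total number of documents, which is why the approach holds up as the index grows.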
03 · Sharding and replication
To handle datasets larger than a single server's capacity, Elasticsearch splits indices into primary shards. Each primary shard can have one or more replica shards for redundancy and increased read throughput. For scraping pipelines, this means you can ingest terabytes of historical data across multiple nodes simultaneously, scaling horizontally as your dataset grows.
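As a rough illustration of how documents spread across primary shards, the sketch below hashes a routing key and takes it modulo the shard count. The use of zlib.crc32 is an assumption made to keep the example in the standard library; Elasticsearch uses its own Murmur3-based hash and a slightly more involved formula.

```python
import zlib

def route_to_shard(routing_key: str, num_primary_shards: int) -> int:
    """Pick a primary shard for a document from its routing key.

    zlib.crc32 is a stand-in hash for illustration only.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards

# Hypothetical document IDs spread over a 3-shard index.
doc_ids = [f"review-{n}" for n in range(10_000)]
counts = [0] * 3
for doc_id in doc_ids:
    counts[route_to_shard(doc_id, 3)] += 1
print(counts)  # one counter per shard
```

Because the target shard is derived from the primary shard count, that count is fixed when the index is created and can only be changed later by splitting, shrinking, or reindexing.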
04 · How DataFlirt uses it
We use Elasticsearch as the serving layer for our high-volume text datasets (like global news feeds and product review aggregations). Raw scraped data lands in S3 and Snowflake, while a transformed, search-optimized subset is synced to Elasticsearch. This allows our clients to query millions of scraped records via our delivery APIs with sub-50ms latency, utilizing complex boolean filters and aggregations that would choke a standard SQL database.
05 · The mapping explosion problem
By default, Elasticsearch automatically detects and maps new fields in ingested JSON. If you scrape a site that uses dynamic keys (e.g., "attribute_color_red": true, "attribute_color_blue": true), Elasticsearch creates a new field mapping for every key. Mappings live in the cluster state, which is propagated to all nodes, so the bloat spreads cluster-wide and can eventually cause an OutOfMemory crash. The index.mapping.total_fields.limit setting (1000 by default) acts as a backstop by rejecting documents that would exceed it, but the safer fix is to always use strict mappings for scraped data.
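One way to guard against this before documents reach the cluster is sketched below. The mapping, the field names, and both helper functions are hypothetical; the flattened field type is a real Elasticsearch type that stores an object with arbitrary keys under a single mapped field, but whether it suits a given dataset is a judgment call.

```python
# Hypothetical index body for a scraped product record.
STRICT_MAPPING = {
    "mappings": {
        "dynamic": "strict",  # reject documents that contain unmapped fields
        "properties": {
            "product_id": {"type": "keyword"},
            "title": {"type": "text"},
            "attributes": {"type": "flattened"},  # arbitrary keys, one mapped field
        },
    }
}

def unknown_fields(doc: dict, mapping: dict = STRICT_MAPPING) -> list:
    """Return the top-level keys of doc that the mapping does not declare."""
    known = mapping["mappings"]["properties"].keys()
    return sorted(key for key in doc if key not in known)

def fold_dynamic_keys(doc: dict, prefix: str = "attribute_") -> dict:
    """Move prefix-named dynamic keys under the single 'attributes' field."""
    folded = {k: v for k, v in doc.items() if not k.startswith(prefix)}
    extras = {k[len(prefix):]: v for k, v in doc.items() if k.startswith(prefix)}
    if extras:
        folded["attributes"] = extras
    return folded

scraped = {
    "product_id": "B08FX123",
    "title": "Desk lamp",
    "attribute_color_red": True,
    "attribute_color_blue": True,
}
print(unknown_fields(scraped))                     # ['attribute_color_blue', 'attribute_color_red']
print(unknown_fields(fold_dynamic_keys(scraped)))  # []
```

Checking in the indexing worker means an unexpected key surfaces as a pipeline warning rather than as a rejected write or a silently growing mapping.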
// 03 — the search math

How relevance is scored.

Elasticsearch doesn't just return matches; it ranks them. The underlying Lucene engine uses BM25 (an evolution of TF-IDF) to score document relevance, which DataFlirt's delivery APIs expose for client queries.

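Term Frequency (TF) = freq / (freq + k1 · (1 - b + b · dl / avgdl))
How often a term appears in a scraped document (freq), normalized by the document's length (dl) relative to the average length in the index (avgdl). k1 and b are tuning parameters; Lucene defaults to 1.2 and 0.75. · Okapi BM25 Algorithm
Inverse Document Frequency (IDF) = log(1 + (N - n + 0.5) / (n + 0.5))
Penalizes common words (like 'the') and boosts rare words across the index. N is the total number of documents and n is the number that contain the term. · Lucene Scoring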
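Cluster Health State = active_primary_shards / total_expected_primary_shards
Must be 1.0 for the cluster to be Yellow or better. Green additionally requires every replica shard to be allocated; Red means at least one primary shard is unassigned, so part of the index cannot be read or written during high-volume ingestion. · Elasticsearch Cluster API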
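To make the two scoring formulas above concrete, here is a minimal Python transcription. The document counts in the example are invented, and the final score shown is only the product of IDF and TF for a single term; a real query score also folds in boosts and sums over all query terms.

```python
import math

def bm25_idf(total_docs: int, docs_with_term: int) -> float:
    """IDF = log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))

def bm25_tf(freq: float, doc_len: float, avg_doc_len: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """TF = freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * doc_len / avg_doc_len))

def bm25_score(freq, doc_len, avg_doc_len, total_docs, docs_with_term):
    """Single-term score as IDF times TF (boosts omitted)."""
    return bm25_idf(total_docs, docs_with_term) * bm25_tf(freq, doc_len, avg_doc_len)

# Hypothetical index of one million reviews, average length 100 tokens.
rare = bm25_score(freq=2, doc_len=100, avg_doc_len=100,
                  total_docs=1_000_000, docs_with_term=50)
common = bm25_score(freq=2, doc_len=100, avg_doc_len=100,
                    total_docs=1_000_000, docs_with_term=900_000)
print(rare > common)  # True: the rare term contributes far more to the score
```

Note that the TF component saturates below 1.0 no matter how often the term repeats, which keeps keyword-stuffed pages from dominating results.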
// 04 — indexing a scraped record

From raw JSON to searchable index.

A live trace of a scraped product review being ingested into an Elasticsearch cluster, showing the analyzer pipeline breaking text into searchable tokens.

REST API · Bulk Ingest · BM25
// POST /reviews/_doc/
{
"product_id": "B08FX123",
"review_text": "Battery life is terrible but screen is great.",
"rating": 2
}

// Lucene analyzer pipeline (standard)
tokenizer: ["Battery", "life", "is", "terrible", "but", "screen", "is", "great"]
filter_lowercase: ["battery", "life", "is", "terrible", "but", "screen", "is", "great"]

// Inverted index update
term "battery" -> doc_id: 891422
term "terrible" -> doc_id: 891422

// Cluster response
_index: "reviews_v3"
_id: "891422"
result: "created" // 201 Created
_shards: { total: 2, successful: 2, failed: 0 }
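The analyzer steps in the trace can be approximated as follows. This is a sketch: the real standard tokenizer follows Unicode text segmentation rules, whereas the regex here simply grabs runs of word characters, which gives the same result for this review.

```python
import re
from collections import Counter

def standard_like_analyze(text: str) -> list:
    """Approximate the standard analyzer: split into word tokens, then lowercase."""
    tokens = re.findall(r"\w+", text)       # tokenizer step (regex approximation)
    return [tok.lower() for tok in tokens]  # lowercase filter step

tokens = standard_like_analyze("Battery life is terrible but screen is great.")
print(tokens)
# ['battery', 'life', 'is', 'terrible', 'but', 'screen', 'is', 'great']

# Per-document term frequencies, the 'freq' input to BM25 scoring.
term_freqs = Counter(tokens)
print(term_freqs["is"])  # 2
```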
// 05 — performance bottlenecks

Where clusters fall over.

Elasticsearch is incredibly fast for reads, but write-heavy scraping workloads can overwhelm a poorly configured cluster. These are the most common failure modes we see when scaling ingestion.

INGESTION RATE · Up to 50k docs/sec
HEAP SIZE · Max 50% of RAM
UPDATED · 2026-05-19
01 · Mapping explosion
Dynamic field bloat · Scraping arbitrary JSON creates too many fields, exhausting heap memory.
02 · Refresh interval thrashing
Disk I/O bottleneck · The default 1s refresh writes a new small segment every second during heavy writes, and all of those segments must then be merged.
03 · Oversharding
Cluster state overhead · Too many small shards (under 10GB) waste CPU on cluster management.
04 · JVM Garbage Collection
Stop-the-world pauses · Heavy bulk indexing triggers long GC pauses, dropping node connections.
05 · Unoptimized mappings
Storage bloat · Storing exact-match IDs as 'text' instead of 'keyword' wastes index space.
// 06 — DataFlirt's search architecture

Write-optimized during crawls, read-optimized for delivery.

We isolate ingestion from delivery. During a high-volume scrape, we route writes to dedicated indexing nodes with disabled replicas and extended refresh intervals to maximize throughput. Once the scrape completes, we force a segment merge, enable replicas, and route the index to read-optimized nodes. This prevents heavy scraping jobs from degrading search latency for downstream API consumers.
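The settings changes around a crawl might look like the sequence below, expressed as plain request tuples so the sketch stays independent of any client library. The index name, the helper names, and the choice of one replica and a single merged segment are assumptions for illustration; moving the index to read-optimized nodes depends on cluster-specific allocation attributes and is left out.

```python
def write_phase_settings(refresh_interval: str = "30s") -> dict:
    """Index settings for heavy ingestion: no replicas, infrequent refresh."""
    return {"index": {"number_of_replicas": 0, "refresh_interval": refresh_interval}}

def read_phase_settings(replicas: int = 1) -> dict:
    """Index settings for serving: replicas on, refresh back to the 1s default."""
    return {"index": {"number_of_replicas": replicas, "refresh_interval": "1s"}}

def crawl_lifecycle(index_name: str, replicas: int = 1) -> list:
    """Ordered (method, path, body) requests to issue around a crawl."""
    return [
        ("PUT", f"/{index_name}/_settings", write_phase_settings()),
        # ... bulk indexing for the crawl happens between these steps ...
        ("POST", f"/{index_name}/_forcemerge?max_num_segments=1", None),
        ("PUT", f"/{index_name}/_settings", read_phase_settings(replicas)),
    ]

for method, path, body in crawl_lifecycle("news_2026"):  # hypothetical index name
    print(method, path, body)
```

Running the index without replicas during the crawl trades durability for throughput, so this pattern assumes the raw data can be replayed from another store if a node is lost mid-ingest.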

Cluster node topology

Live metrics from a dedicated Elasticsearch cluster handling a daily news scraping pipeline.

cluster.name df-news-prod-01
cluster.status green
nodes.ingest 4 × i3.4xlarge (active)
nodes.data 12 × r5.2xlarge
index.refresh 30s (write-optimized)
jvm.heap_usage 78% (high)
search.latency_p95 42ms


// 07 — FAQ

Common questions.

About Elasticsearch architecture, mapping strategies, ingestion performance, and how DataFlirt scales search for scraped datasets.

Should I use Elasticsearch as my primary database for scraped data?
No. Elasticsearch is a search engine, not a system of record. It lacks ACID transactions and robust relational joins. Best practice is to store the raw scraped payload in a data lake (like S3) or a relational database (like PostgreSQL), and sync the searchable fields to Elasticsearch via a CDC (Change Data Capture) pipeline or a dedicated indexing worker.
What is a mapping explosion and how do I prevent it?
A mapping explosion occurs when you ingest scraped JSON with dynamic mapping enabled, and the source site introduces thousands of unique keys (e.g., dynamic attribute names). Elasticsearch creates a new field in the cluster state for every key, eventually crashing the master node. Prevent this by defining your mapping upfront and setting dynamic to "strict" (documents with unmapped fields are rejected) or false (unmapped fields are kept in _source but are not indexed or searchable).
How does DataFlirt handle bulk indexing without dropping documents?
We use the _bulk API with exponential backoff and size our batches by bytes (typically 5-15MB) rather than by document count. During massive backfills we also temporarily raise the refresh_interval to 30 seconds, or disable refresh entirely with -1, so the cluster does not waste I/O creating and merging a new small segment every second.
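A minimal sketch of byte-sized batching and a backoff schedule follows. The function names, the 10MB default, and the jittered delay formula are assumptions for illustration rather than DataFlirt's actual implementation, and the code only builds payloads; it does not send them.

```python
import json
import random

def bulk_batches(docs, index_name, max_bytes=10 * 1024 * 1024):
    """Yield newline-delimited _bulk payloads, each capped at max_bytes where possible."""
    lines, size = [], 0
    for doc in docs:
        action = json.dumps({"index": {"_index": index_name}})
        entry = f"{action}\n{json.dumps(doc)}\n"  # action line + source line
        entry_size = len(entry.encode("utf-8"))
        if lines and size + entry_size > max_bytes:
            yield "".join(lines)
            lines, size = [], 0
        lines.append(entry)
        size += entry_size
    if lines:
        yield "".join(lines)

def backoff_delays(retries=5, base=1.0, cap=60.0):
    """Exponential backoff with full jitter: a random delay up to base * 2**attempt."""
    return [random.uniform(0, min(cap, base * 2 ** attempt)) for attempt in range(retries)]

# Hypothetical reviews, batched with a small cap so the split is visible.
docs = [{"review_text": "x" * 100} for _ in range(50)]
batches = list(bulk_batches(docs, "reviews", max_bytes=1000))
print(len(batches), max(len(b.encode("utf-8")) for b in batches))
print(backoff_delays())
```

A sender built on this would typically retry a batch when the cluster answers HTTP 429, and would also inspect the per-item results in the response, since _bulk can report individual failures inside an otherwise successful reply.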
What's the difference between 'text' and 'keyword' field types?
A text field is passed through an analyzer (tokenized and lowercased by the default standard analyzer, and stemmed only if you configure a language analyzer) and is used for full-text search (e.g., finding "battery" in a review). A keyword field is stored exactly as-is and is used for exact filtering, sorting, and aggregations (e.g., filtering by a specific product_id or category).
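The behavioral difference can be imitated in plain Python. The multi-field mapping below is a common pattern for indexing the same value both ways, with a hypothetical field name; the two match functions are simplified stand-ins for an analyzed match query and an exact term filter.

```python
import re

# Hypothetical multi-field mapping: 'category' is analyzed for search, while
# 'category.raw' keeps the exact string for filters, sorting, and aggregations.
CATEGORY_MAPPING = {
    "properties": {
        "category": {
            "type": "text",
            "fields": {"raw": {"type": "keyword"}},
        }
    }
}

def _analyze(value: str) -> set:
    """Lowercased word tokens, a rough stand-in for the standard analyzer."""
    return set(re.findall(r"\w+", value.lower()))

def text_matches(stored: str, query: str) -> bool:
    """Analyzed match: true if any query token appears among the stored tokens."""
    return bool(_analyze(stored) & _analyze(query))

def keyword_matches(stored: str, query: str) -> bool:
    """Exact match: the whole value must be identical, including case."""
    return stored == query

value = "Laptop Accessories"
print(text_matches(value, "laptop"))                 # True
print(keyword_matches(value, "laptop"))              # False
print(keyword_matches(value, "Laptop Accessories"))  # True
```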
Is scraping data directly into Elasticsearch legally risky?
The storage medium doesn't change the legal status of the data. However, Elasticsearch's ability to rapidly aggregate and surface insights from scraped data can amplify the commercial impact of the dataset. Ensure you have the right to process and store the data, particularly if it contains PII, which triggers GDPR/CCPA compliance requirements regardless of where it sits.
How many shards should my scraped index have?
Aim for shard sizes between 10GB and 50GB. If you are scraping 5GB of data a month, a single primary shard is perfectly fine. Oversharding (creating 5 shards for a 1GB index) is a common anti-pattern that wastes JVM heap on cluster state management and slows down query execution.
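As a back-of-the-envelope aid, the sizing rule above reduces to a one-line calculation. The 30GB target is an assumption picked from the middle of the 10GB to 50GB range, and the function name is hypothetical.

```python
import math

def primary_shard_count(expected_index_gb: float, target_shard_gb: float = 30.0) -> int:
    """Smallest number of primary shards that keeps each shard near the target size."""
    return max(1, math.ceil(expected_index_gb / target_shard_gb))

print(primary_shard_count(5))    # 1
print(primary_shard_count(600))  # 20
```

Size the index for the data it is expected to hold over its lifetime rather than its size on day one, since the primary shard count cannot be changed in place.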