
What is an Inverted Index?

An inverted index is the core data structure behind virtually every full-text search engine. It maps each unique term back to the documents that contain it. Instead of scanning rows to find a word, it looks up the word to find the rows. For scraping pipelines that extract millions of text-heavy records (product descriptions, news articles, or reviews), an inverted index is what makes the delivered dataset quickly queryable rather than just a dead archive in S3.


// 02 — definitions

Words to rows.

The fundamental shift from row-based storage to term-based lookup, which is what enables millisecond-scale text search across terabytes of scraped data.


TL;DR

An inverted index flips the standard database model. Instead of storing documents that contain words, it stores a dictionary of words that point to documents. It's the engine inside Elasticsearch, OpenSearch, and Lucene, and it's effectively required for any pipeline delivering unstructured text that needs to be searchable at scale.

01 Definition & structure

An inverted index consists of two main parts: a Term Dictionary (a sorted list of all unique words found across the dataset) and Posting Lists (arrays of document IDs where each word appears). When you search for "wireless headphones", the engine doesn't scan documents. It looks up "wireless" and "headphones" in the dictionary, retrieves their posting lists, and intersects them to find documents containing both words.
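
The lookup-and-intersect flow described above can be sketched in a few lines of Python. This is a minimal illustration, not how Lucene or Elasticsearch store data: the function names `build_index` and `search_all` and the sample documents are assumptions made for the example, and the "analysis" here is just lowercasing and whitespace splitting.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a sorted posting list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sorted keys play the role of the term dictionary; values are posting lists.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

def search_all(index, query):
    """Return IDs of documents that contain every query term (AND semantics)."""
    lists = [index.get(term, []) for term in query.lower().split()]
    if not lists:
        return []
    result = set(lists[0])
    for posting in lists[1:]:
        result &= set(posting)  # intersect posting lists
    return sorted(result)

# Illustrative sample data
docs = {
    1: "wireless headphones with noise cancelling",
    2: "wired headphones",
    3: "wireless charging pad",
}
index = build_index(docs)
hits = search_all(index, "wireless headphones")
print(hits)
```

Only document 1 contains both terms, so it is the only hit; no document text is scanned at query time.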

02 The analysis pipeline

Raw text cannot be indexed directly. It must pass through an analysis chain. First, character filters strip HTML tags. Second, a tokenizer splits the string into individual words. Finally, token filters lowercase the text, remove stop words ("the", "and"), and stem words to their root form ("running" becomes "run"). The resulting tokens are what actually get written to the inverted index.
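
A toy version of that chain is sketched below, assuming a regex-based character filter and tokenizer, a tiny illustrative stop-word list, and a deliberately naive suffix stripper. Production engines use proper stemmers such as Porter or Snowball; `naive_stem` is only a stand-in, and all function names are hypothetical.

```python
import re
from html import unescape

STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is"}  # tiny illustrative list

def strip_html(text):
    """Character filter: replace tags with spaces and decode HTML entities."""
    return unescape(re.sub(r"<[^>]+>", " ", text))

def tokenize(text):
    """Tokenizer: keep runs of ASCII letters and digits, split on everything else."""
    return re.findall(r"[A-Za-z0-9]+", text)

def naive_stem(token):
    """Toy stemmer: strip one common suffix, then undo a doubled final letter."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]  # "runn" -> "run"
            return stem
    return token

def analyze(text):
    """Run the full chain: character filter, tokenizer, then token filters."""
    tokens = tokenize(strip_html(text))
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]

tokens = analyze("<p>The runner is <b>running</b> and jumping</p>")
print(tokens)
```

The order matters: stop words are removed after lowercasing so that "The" and "the" are treated alike, and stemming runs last so that it only sees tokens that will actually be indexed.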

03 Why B-Trees fail at text

Standard relational databases index columns with B-trees, which keep values in sorted order. If you search for a string starting with "Apple" (LIKE 'Apple%'), the B-tree narrows the search to a small sorted range in O(log N) steps. But if you search for a word buried in the middle of a paragraph (LIKE '%Apple%'), the sort order doesn't help. The database must fall back to a full table scan, reading every row from disk. An inverted index solves this by making every word a primary lookup key.
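
The difference can be illustrated with a sorted Python list standing in for a B-tree. This is a simplification under stated assumptions: a real B-tree is a paged on-disk structure, and the helper names and sample titles are made up for the example. The point is only that sorted order supports prefix lookups but not substring lookups.

```python
from bisect import bisect_left

# A sorted list stands in for the ordered keys of a B-tree index.
titles = sorted([
    "Apple MacBook Pro",
    "Apple Watch",
    "Banana Stand",
    "Case for Apple iPhone",
])

def prefix_search(sorted_values, prefix):
    """Binary-search to the first candidate, then walk the contiguous matches."""
    start = bisect_left(sorted_values, prefix)
    matches = []
    for value in sorted_values[start:]:
        if not value.startswith(prefix):
            break  # sorted order guarantees no further matches
        matches.append(value)
    return matches

def substring_search(values, needle):
    """Sorted order gives no shortcut here, so every value is inspected."""
    return [v for v in values if needle in v]

prefix_hits = prefix_search(titles, "Apple")
substring_hits = substring_search(titles, "Apple")
print(prefix_hits)
print(substring_hits)
```

The prefix search stops as soon as it leaves the "Apple" range, while the substring search has to touch all four titles to find "Case for Apple iPhone".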

04 How DataFlirt indexes text

We don't just deliver raw JSON for text-heavy pipelines. We map the extracted data to Elasticsearch templates. We apply custom synonym filters for industry-specific jargon (e.g., mapping "k8s" to "kubernetes" in tech job postings) and configure the index refresh intervals to handle our high-throughput batch inserts without thrashing the cluster's I/O.
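
As a rough sketch of what such a configuration might look like, the Python dict below builds an Elasticsearch index-settings body with a synonym token filter and a longer refresh interval. It is not DataFlirt's actual template: the names `jargon_synonyms` and `job_posting_text` and the "30s" value are assumptions chosen for illustration.

```python
import json

# Hypothetical index settings; filter and analyzer names are illustrative.
index_settings = {
    "settings": {
        # Refresh less often than the 1s default to ease bulk-insert load.
        "index": {"refresh_interval": "30s"},
        "analysis": {
            "filter": {
                "jargon_synonyms": {
                    "type": "synonym",
                    # Solr-style rule: rewrite "k8s" to "kubernetes"
                    "synonyms": ["k8s => kubernetes"],
                }
            },
            "analyzer": {
                "job_posting_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # lowercase first so the synonym rule matches "K8s" too
                    "filter": ["lowercase", "jargon_synonyms"],
                }
            },
        },
    }
}

print(json.dumps(index_settings, indent=2))
```

A body like this would typically be sent when the index is created, since analyzers cannot be changed on an open index without closing it or reindexing.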

05 The storage tradeoff

Speed costs space. To support phrase queries (matching an exact sequence of words), the inverted index must store not just the document ID but also the position of every word in the document. To support sorting and aggregations, the engine also builds columnar doc values alongside it. A fully featured index can easily double the storage footprint of your raw scraped data.
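
To see why positions are needed, here is a minimal positional index in Python. It is a sketch under simple assumptions (whitespace tokens, in-memory dicts, hypothetical function names), but it shows that a phrase match has to check adjacent positions, which a plain list of document IDs cannot answer.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Build term -> {doc_id: [token positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Return doc IDs where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents containing every term can match.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        later = [set(index[t][doc_id]) for t in terms[1:]]
        for start in index[terms[0]][doc_id]:
            # Term i+1 of the phrase must sit exactly i+1 tokens after the start.
            if all(start + i + 1 in later[i] for i in range(len(later))):
                hits.append(doc_id)
                break
    return hits

# Illustrative sample data
docs = {
    1: "noise cancelling wireless headphones",
    2: "wireless charger for headphones",
}
index = build_positional_index(docs)
phrase_hits = phrase_search(index, "wireless headphones")
print(phrase_hits)
```

Both documents contain both words, but only document 1 has them side by side. Every occurrence of every term now carries an extra integer, which is where the additional storage goes.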

// 03 — search math

How relevance is scored.

Finding the document is only half the battle; ranking it is the other. Engines built on inverted indices, such as Lucene and Elasticsearch, default to BM25, which scores relevance from term frequency, inverse document frequency, and document length.

Term Frequency (TF): TF(t, d) = f_{t,d} / len(d)

How often term t appears in document d (f_{t,d}), relative to the document's length. More occurrences = higher score. Source: Information Retrieval Basics

Inverse Document Frequency (IDF): IDF(t) = log(1 + (N - n_t + 0.5) / (n_t + 0.5))

Rarity of the term across the entire corpus, where N is the total number of documents and n_t is the number that contain the term. 'The' scores near zero; 'MacBook' scores high. Source: Okapi BM25 Algorithm

Index Size Overhead: S_idx ≈ 0.65 × S_raw

A heavily analyzed text index often consumes 60-70% of the raw text size in storage. Source: Elasticsearch Sizing Guidelines
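
The term-frequency and IDF pieces above combine into a BM25 score roughly as sketched below. This uses the classic Okapi form with the commonly quoted defaults k1 = 1.2 and b = 0.75, and whitespace tokenization. The function name and sample documents are assumptions, and real engines differ in details such as how document length is stored.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with classic Okapi BM25."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # n_t: number of documents that contain each term at least once
    doc_freq = Counter(term for toks in tokenized.values() for term in set(toks))

    scores = {}
    for doc_id, toks in tokenized.items():
        counts = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            f = counts[term]
            if f == 0:
                continue
            n_t = doc_freq[term]
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            # Length normalization: long documents need more hits for the same score.
            norm = f + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * f * (k1 + 1) / norm
        scores[doc_id] = score
    return scores

# Illustrative sample data
docs = {
    "a": "the macbook pro",
    "b": "the macbook macbook air is the lightest macbook",
    "c": "the pro stand for the desk",
}
scores = bm25_scores("macbook", docs)
print(scores)
```

Document "c" never mentions the term and scores zero. Document "b" outranks "a" because it repeats the term, but by less than three times, since BM25 saturates term frequency and penalizes the longer document.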

// 04 — the analysis phase

From raw string to posting list.

Before a scraped product title enters the index, it passes through an analyzer. Here is the trace of a raw string being tokenized, filtered, and mapped to a posting list.


// input record
doc_id: "prod_8821"
raw_text: "Apple MacBook Pro 16-inch (M2 Max)"

// 1. character filter
strip_html: "Apple MacBook Pro 16-inch (M2 Max)"

// 2. tokenizer
standard_tokens: ["Apple", "MacBook", "Pro", "16", "inch", "M2", "Max"]

// 3. token filters
lowercase: ["apple", "macbook", "pro", "16", "inch", "m2", "max"]
stop_words: ["apple", "macbook", "pro", "16", "inch", "m2", "max"] // no stops found
synonyms: ["apple", "macbook", "pro", "16", "inch", "m2", "max", "mac"] // injected 'mac'

// 4. posting list updates (term -> [doc_ids])
index.write: "macbook" -> [..., "prod_8821", ...]
index.write: "m2" -> [..., "prod_8821", ...]
status: indexed // available for search in 1000ms

// 05 — index bloat

What consumes index storage.

Inverted indices are fast but space-hungry. Here is where the storage budget goes when indexing a typical scraped e-commerce catalog with full-text search enabled.

DATASET: 10M product records
RAW SIZE: 42 GB JSON
INDEX SIZE: 28 GB

Posting Lists (doc IDs): The actual mapping of terms to documents.

Term Frequencies & Positions (metadata): Required for phrase queries and proximity search.

Stored Fields (raw data): The original JSON document returned on match.

Doc Values (columnar): Column-oriented structures used for sorting and aggregations.

Term Dictionary (FST): The finite-state transducer, a compressed trie-like structure, holding the unique vocabulary.

// 06 — search infrastructure

Searchable on arrival, not just dumped in a bucket.

For clients buying text-heavy datasets, raw JSON is often insufficient. DataFlirt can route extracted records directly into a managed Elasticsearch or OpenSearch cluster. We handle the mapping definitions, custom analyzers for domain-specific jargon, and index lifecycle management, so your team can start querying the data the second the pipeline commits it.

index.mapping.json

Standard mapping configuration for a scraped product catalog.

index.name · df_products_v3
field.title · text · english_analyzer
field.sku · keyword · exact match only
field.price · scaled_float
field.description · text · positions_omitted
refresh_interval · 30s · batch optimized
cluster.status · green
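
The field rows of the mapping listed above might translate into an Elasticsearch mapping body along these lines. This is a hedged reading, not the actual template: it assumes "english_analyzer" refers to the built-in `english` analyzer, that "positions_omitted" corresponds to `index_options: "freqs"`, and that prices use a `scaling_factor` of 100, none of which the listing states.

```python
import json

# Hypothetical mapping body; see the assumptions in the paragraph above.
df_products_v3_mapping = {
    "mappings": {
        "properties": {
            # Full-text field analyzed with the built-in English analyzer
            "title": {"type": "text", "analyzer": "english"},
            # keyword fields are not analyzed, so only exact matches hit
            "sku": {"type": "keyword"},
            # scaled_float requires a scaling_factor; 100 keeps two decimals (assumption)
            "price": {"type": "scaled_float", "scaling_factor": 100},
            # "freqs" indexes doc IDs and term frequencies but no positions
            "description": {"type": "text", "index_options": "freqs"},
        }
    }
}

print(json.dumps(df_products_v3_mapping, indent=2))
```

With positions omitted on `description`, term and boolean queries still work on that field, but phrase queries against it do not.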


// 07 — FAQ

Common questions.

About text search, index overhead, database selection, and how DataFlirt delivers searchable datasets.


What is the difference between an inverted index and a B-tree?

A B-tree index (standard in PostgreSQL/MySQL) is great for exact matches or prefix matches (LIKE 'term%'). It cannot help with substring search (LIKE '%term%'), so the database has to scan every row. An inverted index parses the text into individual words and maps each word to the rows that contain it, so a lookup costs one dictionary probe plus a read of that term's posting list instead of an O(N) scan over all rows.

Does DataFlirt provide indexed data or just raw files?

Both. By default, we deliver structured JSON, CSV, or Parquet files to your S3/GCS bucket. For enterprise clients, we can also sink data directly into a managed Elasticsearch, OpenSearch, or Algolia cluster, complete with optimized mappings and custom analyzers tailored to your specific dataset.

How do you handle multi-language scraping in an index?

We use language detection at the extraction layer. The detected language dictates which analyzer the index uses. A French product description gets routed to a French stemmer and stop-word filter, while an English one gets English rules. Mixing languages in a single text field without language-specific analyzers destroys search relevance.
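
A minimal sketch of that routing step is shown below. It assumes each record already carries an ISO 639-1 code in a `lang` field set by an upstream detector (hypothetical, not shown), and it maps that code to one of Elasticsearch's built-in language analyzers. The per-language field naming is an illustrative choice, not a description of DataFlirt's schema.

```python
# Names of built-in Elasticsearch language analyzers, keyed by ISO 639-1 code.
ANALYZER_BY_LANGUAGE = {
    "en": "english",
    "fr": "french",
    "de": "german",
    "es": "spanish",
}

def route_record(record, default_analyzer="standard"):
    """Pick a target field and analyzer for one scraped record.

    record["lang"] is assumed to come from an upstream language detector.
    """
    lang = record.get("lang", "")
    analyzer = ANALYZER_BY_LANGUAGE.get(lang, default_analyzer)
    # Known languages get their own field so each can use its own analyzer.
    field = f"description_{lang}" if lang in ANALYZER_BY_LANGUAGE else "description"
    return {"field": field, "analyzer": analyzer, "text": record["text"]}

routed = route_record({"lang": "fr", "text": "Casque sans fil à réduction de bruit"})
print(routed["field"], routed["analyzer"])
```

Records in an unrecognized language fall back to a generic field and the `standard` analyzer rather than being stemmed with the wrong rules.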

What is the storage overhead of an inverted index?

It is significant. Depending on how aggressively you index (storing term positions for phrase matching, enabling n-grams for partial matching), the index can be 50% to 150% of the size of the raw text data. We optimize this by disabling position tracking on long description fields where exact phrase matching isn't required.

Can I update an inverted index in real-time?

Yes, but it's expensive. Inverted indices are immutable at the segment level. When you update a document, the engine marks the old one as deleted and writes a new one. High-frequency updates cause segment fragmentation and force heavy background merging. For scraping pipelines, we prefer micro-batching inserts every 30-60 seconds.
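
The update behavior described above can be modeled with a toy class. It is a conceptual sketch only, assuming in-memory dicts instead of Lucene's real segment format; the class and method names are hypothetical. It shows why an update leaves a dead copy behind until a merge rewrites the segments.

```python
class SegmentedIndex:
    """Toy model of append-only segments with delete markers."""

    def __init__(self):
        self.segments = []    # each segment is {doc_id: text}, never modified after flush
        self.buffer = {}      # documents waiting for the next flush
        self.deleted = set()  # (segment_number, doc_id) pairs marked as deleted

    def upsert(self, doc_id, text):
        # An update never rewrites a segment: mark old copies deleted, buffer the new one.
        for seg_no, segment in enumerate(self.segments):
            if doc_id in segment:
                self.deleted.add((seg_no, doc_id))
        self.buffer[doc_id] = text

    def flush(self):
        """Write the buffered micro-batch as one new immutable segment."""
        if self.buffer:
            self.segments.append(dict(self.buffer))
            self.buffer = {}

    def merge(self):
        """Rewrite all segments into one, dropping documents marked as deleted."""
        live = {}
        for seg_no, segment in enumerate(self.segments):
            for doc_id, text in segment.items():
                if (seg_no, doc_id) not in self.deleted:
                    live[doc_id] = text
        self.segments = [live] if live else []
        self.deleted = set()

idx = SegmentedIndex()
idx.upsert("prod_1", "old title")
idx.flush()
idx.upsert("prod_1", "new title")
idx.flush()
segments_before = len(idx.segments)
idx.merge()
print(segments_before, idx.segments)
```

Two single-document flushes produce two segments, one of which holds only a dead copy. Batching many upserts per flush yields fewer, larger segments and less merge work, which is the motivation for micro-batching.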

Why not just use PostgreSQL with pg_trgm or tsvector?

For small datasets (under 1M rows), PostgreSQL's built-in text search is perfectly fine. At scraping scale (10M+ rows, heavy text fields), Postgres text search becomes CPU-bound and difficult to scale horizontally. Dedicated inverted index engines like Elasticsearch distribute the index across multiple nodes natively.

