What is Full-Text Search?

Q: Why can't I just use SQL LIKE '%term%' for scraped text?

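Standard SQL LIKE queries with a leading wildcard cannot use an ordinary B-tree index, so they require a full table scan. If you have 10 million scraped product descriptions, the database has to read every single row to find the match. Full-text search uses an inverted index, replacing that O(N) scan with a near-constant-time term lookup followed by a read of only the matching documents.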

Full-Text Search is a database technique that matches search criteria against all the words in every stored document, rather than just matching exact strings or indexed columns. For data pipelines, it's the layer that turns millions of scraped raw text blobs (news articles, product descriptions, legal filings) into queryable intelligence. Without it, finding a specific phrase across a terabyte of unstructured HTML is an I/O nightmare; with it, the same query typically returns in milliseconds.

Databases · Inverted Index · Elasticsearch · Information Retrieval · Tokenization

// 02 — definitions

Beyond exact matches.

How databases map unstructured text into highly optimized structures for rapid retrieval, and why standard SQL LIKE queries fail at scale.

TL;DR

Full-text search relies on an inverted index to map words to their locations across millions of documents. It handles stemming, synonyms, and relevance scoring (like BM25). When you scrape 50 million product reviews, dumping them into a relational table with no text index leaves them effectively unsearchable; piping them into Elasticsearch or ClickHouse with text indexing makes them queryable in milliseconds.

01 Definition & structure

Full-Text Search is a technique for searching the documents stored in a database. Unlike simple string matching, it uses an inverted index to map every unique word to the documents that contain it. Before indexing, text passes through an analyzer that tokenizes the string, removes stop words (like "and", "the"), and stems words to their root form (e.g., "running" becomes "run"). This lets the engine match on normalized word forms, so a query for "run" also finds "running", rather than comparing byte for byte.
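To make the structure concrete, here is a minimal Python sketch of an inverted index. Everything in it is illustrative: the stop-word list is tiny, `naive_stem` is a crude stand-in for a real stemmer such as Porter, and the sample documents are made up.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list; real analyzers ship much longer ones.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}


def naive_stem(token: str) -> str:
    """Crude suffix stripping, a stand-in for a real stemmer (assumption)."""
    for suffix in ("ning", "ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text: str) -> list:
    """Lowercase, split on non-alphanumerics, drop stop words, stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def build_index(docs: dict) -> dict:
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


docs = {
    "d1": "Running shoes for the trail",
    "d2": "The runner is running late",
    "d3": "Trail maps and guides",
}
index = build_index(docs)
print(sorted(index["run"]))    # → ['d1', 'd2']
print(sorted(index["trail"]))  # → ['d1', 'd3']
```

A lookup is now a dictionary access on the term, and the cost depends on how many documents contain that term rather than on the size of the corpus.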

02 How it works in practice

When you ingest a scraped article, the database doesn't just store the string. It breaks the text into tokens, normalizes them, and updates the inverted index. When a user queries "fast cars", the engine tokenizes the query, looks up "fast" and "car" in the index, combines the matching document IDs (an intersection when every term is required, a union otherwise), calculates a relevance score (usually BM25) based on term frequency, term rarity, and document length, and returns the ranked results, all in milliseconds.
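The query path can be sketched in a few lines of Python. This is a simplified illustration, not how any particular engine implements it: the analyzer only lowercases and strips a plural "s", and the score is a plain sum of term frequencies used as a placeholder for BM25. The `require_all` flag and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict


def analyze(text):
    """Simplified analyzer: lowercase, split on whitespace, strip a plural 's'."""
    tokens = text.lower().split()
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def build_postings(docs):
    """term -> Counter({doc_id: term frequency})"""
    postings = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in analyze(text):
            postings[term][doc_id] += 1
    return postings


def search(postings, query, require_all=True):
    """Look up each query term, combine doc IDs, and rank the matches."""
    terms = analyze(query)
    doc_sets = [set(postings.get(t, ())) for t in terms]
    if not doc_sets:
        return []
    # AND semantics intersect the posting lists; OR semantics union them.
    matched = set.intersection(*doc_sets) if require_all else set.union(*doc_sets)
    # Placeholder score: summed term frequency (a real engine would use BM25).
    scored = [
        (sum(postings.get(t, {}).get(doc_id, 0) for t in terms), doc_id)
        for doc_id in matched
    ]
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


docs = {
    "a1": "fast cars and fast bikes",
    "a2": "cars are fast",
    "a3": "slow cars",
}
postings = build_postings(docs)
print(search(postings, "fast cars"))                     # → ['a1', 'a2']
print(search(postings, "fast cars", require_all=False))  # → ['a1', 'a2', 'a3']
```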

03 Analyzers and Tokenization

The secret to good full-text search is the analyzer pipeline. It consists of three parts:

Character filters — strip out HTML tags or convert special characters.
Tokenizers — split the text into individual words (usually by whitespace or punctuation).
Token filters — lowercase everything, remove stop words, and apply stemming or lemmatization.

If your analyzer is configured poorly, a search for "iPhone 15" might fail to match "iphone-15".
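The three stages above can be sketched as plain Python functions. This is an illustration under simple assumptions (regex-based HTML stripping, a tiny stop-word list, and a hypothetical `split_on_hyphen` switch), not the configuration of any specific engine.

```python
import html
import re

STOP_WORDS = {"a", "an", "and", "the", "of"}  # illustrative subset


def char_filter(text):
    """Stage 1: strip HTML tags and decode entities such as &amp;."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return html.unescape(no_tags)


def tokenize(text, split_on_hyphen):
    """Stage 2: split into word tokens, optionally keeping hyphenated words whole."""
    pattern = r"[A-Za-z0-9]+" if split_on_hyphen else r"[A-Za-z0-9-]+"
    return re.findall(pattern, text)


def token_filter(tokens):
    """Stage 3: lowercase and drop stop words."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t not in STOP_WORDS]


def analyze(text, split_on_hyphen=True):
    return token_filter(tokenize(char_filter(text), split_on_hyphen))


# Splitting on hyphens lets the query and the document agree on tokens.
print(analyze("<b>iphone-15</b>"))  # → ['iphone', '15']
print(analyze("iPhone 15"))         # → ['iphone', '15']

# Keeping hyphenated tokens whole reproduces the mismatch described above.
print(analyze("<b>iphone-15</b>", split_on_hyphen=False))  # → ['iphone-15']
```

The same analyzer has to run at index time and at query time; the mismatch appears as soon as the two sides tokenize differently.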

04 How DataFlirt handles it

We treat search indexing as a first-class citizen in our delivery pipelines. Instead of handing clients raw text files, we can stream scraped data directly into managed Elasticsearch or OpenSearch clusters. Our ingestion workers handle the heavy lifting of HTML stripping, language detection, and custom tokenization, ensuring that by the time the data lands in your cluster, it is perfectly formatted for high-performance querying.

05 Did you know?

PostgreSQL has excellent built-in full-text search capabilities. Using tsvector and tsquery types, you can build a highly capable search engine without needing to deploy a separate Elasticsearch cluster. For datasets under 10-20 million rows, Postgres is often more than enough and drastically simplifies your infrastructure.
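As a rough sketch of what that looks like from Python, the snippet below queries Postgres full-text search through a DB-API connection. The `articles` table, its `id` and `body` columns, and the index name are hypothetical, and the `%s` placeholders assume a psycopg-style driver.

```python
# Run once: a GIN expression index so the WHERE clause below avoids a full scan.
# Table and column names (articles, id, body) are assumptions for illustration.
CREATE_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS articles_body_fts
    ON articles USING GIN (to_tsvector('english', body));
"""

# plainto_tsquery parses free-form user input; @@ is the match operator;
# ts_rank orders the matches by relevance.
SEARCH_SQL = """
SELECT id, ts_rank(to_tsvector('english', body), query) AS rank
FROM articles, plainto_tsquery('english', %s) AS query
WHERE to_tsvector('english', body) @@ query
ORDER BY rank DESC
LIMIT %s;
"""


def search_articles(conn, user_query, limit=10):
    """Return (id, rank) rows for the best matches, using a DB-API connection."""
    cur = conn.cursor()
    try:
        cur.execute(SEARCH_SQL, (user_query, limit))
        return cur.fetchall()
    finally:
        cur.close()
```

For larger tables, a stored `tsvector` column is a common alternative to the expression index, since it avoids re-parsing the text when rows are ranked.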

// 03 — the math

How relevance is scored.

When a query hits a full-text index, the engine doesn't just return matches; it ranks them. BM25 is the default relevance-scoring algorithm in Lucene and Elasticsearch and the de facto industry standard.

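Term Frequency (TF) = f(q, D) / (f(q, D) + k₁ · (1 − b + b · (|D| / avgdl)))

Saturates term frequency so repeating a word 100 times doesn't infinitely boost score. Source: Okapi BM25

Inverse Document Frequency (IDF) = log(1 + (N − n(q) + 0.5) / (n(q) + 0.5))

Penalizes common words (like 'the') and boosts rare words. Source: Information Retrieval Theory

Here f(q, D) is the number of times term q occurs in document D, |D| is the length of D in tokens, avgdl is the average document length in the index, N is the total number of documents, and n(q) is the number of documents containing q. k₁ and b are tuning parameters (Lucene defaults to k₁ = 1.2 and b = 0.75). A document's score for a query is the sum of IDF × TF over the query terms.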
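The two formulas translate almost directly into Python. The sketch below is a straightforward transcription for illustration, with k1 = 1.2 and b = 0.75 assumed as defaults and documents represented as pre-tokenized lists; it is not a copy of any engine's implementation. The classic Okapi form also multiplies the TF component by (k1 + 1), a constant that does not change the ranking.

```python
import math


def bm25_idf(n_docs, doc_freq):
    """IDF = log(1 + (N - n(q) + 0.5) / (n(q) + 0.5))"""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """TF = f / (f + k1 * (1 - b + b * |D| / avgdl)); always below 1."""
    return freq / (freq + k1 * (1 - b + b * doc_len / avg_doc_len))


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Sum IDF * TF over the query terms that occur in the document."""
    avg_doc_len = sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for term in query_terms:
        freq = doc_tokens.count(term)
        if freq == 0:
            continue
        doc_freq = sum(1 for d in corpus if term in d)
        score += bm25_idf(len(corpus), doc_freq) * bm25_tf(
            freq, len(doc_tokens), avg_doc_len, k1, b
        )
    return score


corpus = [
    ["quick", "brown", "fox", "jump", "lazi", "dog"],
    ["fox", "fox", "fox", "hunt"],
    ["lazi", "dog", "sleep"],
]
for doc in corpus:
    print(doc, round(bm25_score(["fox", "jump"], doc, corpus), 3))
```

Saturation is easy to see by calling `bm25_tf` with a growing `freq`: the value climbs towards 1 but never reaches it, so the hundredth repetition of a word adds almost nothing.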

DataFlirt Indexing Latency: T_index = Bytes / (Workers × Throughput_node)

Our Elasticsearch clusters ingest scraped text at ~45 MB/s per node. Source: Internal SLO
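As a quick worked example of the formula, the snippet below estimates the time to index 1 TB of text. The 45 MB/s figure comes from the line above; the four workers (one per node) and the 1 TB volume are hypothetical inputs chosen only for the arithmetic.

```python
MB = 1_000_000
TB = 1_000_000 * MB


def indexing_time_seconds(total_bytes, workers, throughput_bytes_per_sec):
    """T_index = Bytes / (Workers x Throughput_node)"""
    return total_bytes / (workers * throughput_bytes_per_sec)


# Assumed inputs: 1 TB of text, 4 workers, 45 MB/s per worker node.
seconds = indexing_time_seconds(1 * TB, workers=4, throughput_bytes_per_sec=45 * MB)
print(f"{seconds:.0f} s, about {seconds / 3600:.2f} hours")  # → 5556 s, about 1.54 hours
```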

// 04 — the pipeline

Indexing scraped text at scale.

A trace of a DataFlirt pipeline extracting raw article text, tokenizing it, and pushing it into an Elasticsearch cluster for full-text querying.

Elasticsearch · BM25 · Tokenization

// 1. raw scraped document
doc.id: "art_98214"
doc.text: "The quick brown foxes are jumping over the lazy dogs."

// 2. analyzer pipeline
step.char_filter: strip_html
step.tokenizer: [The, quick, brown, foxes, are, jumping, over, the, lazy, dogs]
step.token_filter: [quick, brown, fox, jump, lazi, dog] // lowercase, stop-words removed, stemmed

// 3. inverted index update
index.term: "fox" -> doc_ids: [art_98214, art_1102, ...]
index.term: "jump" -> doc_ids: [art_98214, art_5519, ...]

// 4. search query execution
query: "fox jumping"
engine.match: found in art_98214
engine.score: 4.821 // BM25 relevance
latency: 12ms

// 05 — performance bottlenecks

Where text search slows down.

Full-text search is CPU and memory intensive. These are the primary factors that degrade query performance when indexing massive scraped datasets.

DATASET SIZE: 10TB+
ENGINE: Elasticsearch
UPDATED: 2026-05-19

High cardinality fields

Memory bloat · Too many unique terms exhaust JVM heap space.

Deep pagination

I/O spike · Fetching the 10,000th page of results requires sorting all previous matches.

Complex analyzers

CPU bound · Heavy stemming and n-gram generation slow down ingestion.

Index fragmentation

Disk seek · Every segment is searched separately, so too many small segments slow queries until they are merged.

Stop-word bloat

Index size · Failing to filter common words inflates the inverted index.
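The deep pagination cost in the list above is easy to see in a small sketch. The Python below is a toy model over in-memory hits, not engine code: offset paging has to rank the top (offset + size) hits before discarding most of them, while cursor paging (the idea behind parameters such as Elasticsearch's `search_after`) only ranks hits that sort after the last one already seen. Function names and the sample hits are made up.

```python
import heapq


def page_by_offset(hits, offset, size):
    """Offset paging: rank the top (offset + size) hits, then drop the first `offset`."""
    top = heapq.nlargest(offset + size, hits)  # work grows with page depth
    return top[offset:]


def page_after(hits, cursor, size):
    """Cursor paging: keep only hits ranked below the last one seen."""
    remaining = (h for h in hits if cursor is None or h < cursor)
    return heapq.nlargest(size, remaining)  # heap never exceeds `size`


# Hits are (score, doc_id) tuples; higher scores rank first.
hits = [(float(i), f"doc{i}") for i in range(100)]

page1 = page_after(hits, None, 10)
page2 = page_after(hits, page1[-1], 10)
page3 = page_after(hits, page2[-1], 10)

# Same third page either way, but the offset version ranked 30 hits to return 10.
print(page3 == page_by_offset(hits, 20, 10))  # → True
```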

// 06 — our architecture

Searchable on arrival, indexing at the speed of scraping.

When DataFlirt delivers unstructured text data, we don't just dump raw JSON into an S3 bucket and leave the indexing to you. For clients requiring immediate searchability, our delivery pipeline routes scraped text directly through an NLP analyzer and into a managed Elasticsearch or ClickHouse cluster. The data becomes queryable moments after it's scraped.

Text Ingestion Node

Live metrics from a DataFlirt indexing worker processing scraped news articles.

node.role ingest-worker-04

throughput 4,200 docs/sec

analyzer english_standard

index.size 1.4 TB

heap.usage 68%

rejected.docs 0

query.latency.p99 18ms

// 07 — FAQ

Common questions.

About full-text search, inverted indices, relevance scoring, and how DataFlirt handles massive text ingestion.

What is an inverted index?

It's the core data structure behind full-text search. Instead of mapping documents to words (like a book), it maps words to documents (like an index at the back of a book). When you search for "battery", the engine instantly knows exactly which documents contain it.

Does full-text search handle typos and synonyms?

Yes, if configured correctly. Analyzers can apply phonetic matching and synonym dictionaries during ingestion, querying, or both, while fuzzy search (using Levenshtein distance) is applied at query time. Together they ensure "color" matches "colour" and "sneaker" matches "shoe".
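A minimal Python sketch of both ideas follows: a standard dynamic-programming Levenshtein distance, plus a query-expansion step that adds synonyms and near-spellings. The `SYNONYMS` table, the `expand_query` helper, and the one-edit threshold are illustrative assumptions, and scanning the whole vocabulary is only workable for toy inputs.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


# Hypothetical synonym dictionary; real ones are curated per domain.
SYNONYMS = {"sneaker": {"shoe"}, "shoe": {"sneaker"}}


def expand_query(term, vocabulary, max_edits=1):
    """Return the term plus its synonyms and any indexed term within max_edits."""
    fuzzy = {v for v in vocabulary if levenshtein(term, v) <= max_edits}
    return {term} | SYNONYMS.get(term, set()) | fuzzy


vocabulary = {"colour", "collar", "shoe", "sneaker"}
print(levenshtein("color", "colour"))              # → 1
print(sorted(expand_query("color", vocabulary)))   # → ['color', 'colour']
print(sorted(expand_query("sneaker", vocabulary))) # → ['shoe', 'sneaker']
```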

How does DataFlirt handle indexing for massive scrapes?

We decouple extraction from indexing. Scraped records are pushed to a Kafka queue, where dedicated ingestion workers apply text analyzers and bulk-insert into the search cluster. This prevents slow database I/O from bottlenecking the scraping fleet.
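The decoupling pattern can be sketched with Python's standard library. This is a toy illustration, not DataFlirt's code: a bounded `queue.Queue` stands in for the Kafka topic, a thread stands in for an ingestion worker, and `bulk_insert` is a hypothetical callback where a real worker would call the search cluster's bulk API.

```python
import queue
import threading

BATCH_SIZE = 3   # illustrative; real bulk requests are far larger
SENTINEL = None  # marks the end of the stream


def scraper(out_queue, records):
    """Producer: push scraped records and move on without waiting for indexing."""
    for record in records:
        out_queue.put(record)
    out_queue.put(SENTINEL)


def ingestion_worker(in_queue, bulk_insert):
    """Consumer: collect records into batches and hand each batch to bulk_insert."""
    batch = []
    while True:
        record = in_queue.get()
        if record is SENTINEL:
            break
        batch.append(record)
        if len(batch) >= BATCH_SIZE:
            bulk_insert(batch)
            batch = []
    if batch:  # flush the final partial batch
        bulk_insert(batch)


indexed_batches = []
buffer = queue.Queue(maxsize=100)  # bounded, so a slow indexer applies backpressure
records = [{"id": i, "text": f"article {i}"} for i in range(7)]

worker = threading.Thread(target=ingestion_worker, args=(buffer, indexed_batches.append))
worker.start()
scraper(buffer, records)
worker.join()

print([len(batch) for batch in indexed_batches])  # → [3, 3, 1]
```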

Is it legal to index copyrighted text I've scraped?

Indexing publicly available text for internal analysis or search (like a search engine does) is often defended as fair use in the US, provided you aren't republishing the copyrighted material wholesale. The answer varies by jurisdiction and by how the text was accessed, so always consult counsel for your specific use case.

Should I use Elasticsearch, PostgreSQL, or ClickHouse for text?

Postgres is usually fine for smaller corpora (roughly under 10 GB of text, or 10-20 million rows). Elasticsearch is the gold standard for complex relevance scoring and fuzzy matching. ClickHouse is increasingly popular for log analytics and structured data that requires fast, brute-force text filtering without the JVM overhead of Elasticsearch.

Related glossary terms

Inverted Index

Elasticsearch

ClickHouse

Information Extraction