
What is Semantic Search?

Semantic search is an information retrieval technique that matches queries to documents based on contextual meaning rather than exact keyword overlap. For scraping pipelines feeding AI models, it requires transforming extracted text into dense vector embeddings and indexing them in a vector database. It is the foundational retrieval layer for RAG applications, allowing systems to find relevant scraped data even when the user's vocabulary completely diverges from the source text.

Vector Embeddings · RAG · Cosine Similarity · Information Retrieval · NLP
// 02 — definitions

Meaning over
keywords.

The shift from lexical matching to dense vector retrieval, and why modern data pipelines must output embeddings alongside raw text.


TL;DR

Semantic search replaces or complements traditional BM25 keyword matching with neural network embeddings. By mapping scraped text into a high-dimensional vector space, queries can retrieve conceptually similar records even when they share no keywords. It's the engine behind modern RAG architectures, but it front-loads compute: every chunk must be embedded at ingestion time before it can be searched.

01 Definition & structure
Semantic search is a retrieval method that uses machine learning models to understand the intent and contextual meaning of a query, rather than relying on exact keyword matching. It works by converting both the scraped documents and the user's query into dense numerical arrays called embeddings. Documents that share similar concepts will have vectors that are geometrically close to each other in a high-dimensional space.
02 How it works in practice
The workflow is split into two phases. During ingestion, a scraper extracts text, cleans it, splits it into chunks, passes those chunks through an embedding model (such as OpenAI's text-embedding-3-small), and stores the resulting vectors in a database. During querying, the user's search string is passed through the same embedding model, and the database performs a nearest-neighbor search to return the most conceptually similar chunks.
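The two phases can be sketched in a few lines of Python. This is an illustration only, not DataFlirt's pipeline: `toy_embed`, `CONCEPTS`, and `InMemoryIndex` are hypothetical names, the hand-written concept table stands in for a trained embedding model, the second record (`rev_2`) is invented for the demo, and the brute-force scan stands in for a vector database.

```python
import math
import re
from typing import Callable, List, Tuple

Vector = List[float]

# Hypothetical stand-in for an embedding model: each known word is mapped
# to a hand-picked "concept" axis. A real pipeline would call a trained model.
CONCEPTS = {
    "battery": 0, "power": 0, "charge": 0,
    "abysmal": 1, "poor": 1, "died": 1, "bad": 1,
    "screen": 2, "display": 2,
    "bright": 3, "sharp": 3, "great": 3,
}


def toy_embed(text: str) -> Vector:
    """Count how many words of the text fall on each concept axis."""
    vec = [0.0] * 4
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        axis = CONCEPTS.get(word)
        if axis is not None:
            vec[axis] += 1.0
    return vec


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat "no known concepts" as no similarity
    return dot / (norm_a * norm_b)


class InMemoryIndex:
    """Brute-force index; the same embed function is used in both phases."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self.embed = embed
        self.rows: List[Tuple[str, Vector]] = []

    def ingest(self, chunk_id: str, text: str) -> None:
        # Ingestion phase: embed once, store the vector.
        self.rows.append((chunk_id, self.embed(text)))

    def query(self, text: str, k: int = 1) -> List[Tuple[str, float]]:
        # Query phase: embed the query, rank stored vectors by similarity.
        q = self.embed(text)
        scored = sorted(((cosine(q, v), cid) for cid, v in self.rows), reverse=True)
        return [(cid, score) for score, cid in scored[:k]]


index = InMemoryIndex(toy_embed)
index.ingest("rev_98214", "The battery life is abysmal, died in 2 hours.")
index.ingest("rev_2", "The screen is bright and sharp.")  # invented record
top = index.query("poor power duration", k=1)
print(top)
```

The query shares no words with the battery review, yet it ranks first because both texts land on the same concept axes. With a real model the same structure holds, provided ingestion and querying use the same model.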
03 The role of chunking
Embedding an entire webpage into a single vector dilutes the specific facts contained within it. To make semantic search effective, scraped text must be broken down into smaller segments (chunks), typically a few hundred tokens each (around 256–512). Good chunking strategies respect document structure, splitting by paragraphs or markdown headers, and include overlap between chunks so that context isn't lost at the boundaries.
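A minimal sketch of structure-aware chunking with overlap follows. It assumes paragraphs are separated by blank lines and uses whitespace-separated words as a rough proxy for model tokens; `chunk_paragraphs` and its parameters are hypothetical names, and a production chunker would count tokens with the embedding model's own tokenizer.

```python
from typing import List


def chunk_paragraphs(text: str, max_tokens: int = 300, overlap: int = 30) -> List[str]:
    """Pack whole paragraphs into chunks of at most max_tokens words.

    Each new chunk starts with the last `overlap` words of the previous
    one, so context is carried across the boundary.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    chunks: List[str] = []
    current: List[str] = []
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        if not words:
            continue
        # Close the current chunk if this paragraph would overflow it.
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.extend(words)
        # A single oversized paragraph is split into fixed windows.
        while len(current) > max_tokens:
            chunks.append(" ".join(current[:max_tokens]))
            current = current[max_tokens - overlap:]
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "a1 a2 a3 a4 a5\n\nb1 b2 b3 b4 b5\n\nc1 c2 c3 c4 c5"
chunks = chunk_paragraphs(sample, max_tokens=8, overlap=2)
for chunk in chunks:
    print(chunk)
```

In the sample, each chunk after the first begins with the last two words of the chunk before it, which is the overlap that keeps boundary context available to the embedding model.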
04 How DataFlirt handles it
We integrate embedding generation directly into our extraction pipelines. Instead of delivering raw JSON that your data engineering team has to process, DataFlirt cleans the HTML, chunks the text according to your RAG requirements, calls the embedding API, and upserts the vectors directly into your vector database. This reduces pipeline latency and eliminates an entire ETL step for AI teams.
05 Did you know?
Semantic search is affected by the "curse of dimensionality." As the number of embedding dimensions grows (e.g., 1536 or 3072), the distances from a query to random vectors concentrate around the same value, so exact spatial indexes lose their ability to prune the search and drift toward scanning every vector. This is one reason modern vector databases use Approximate Nearest Neighbor (ANN) algorithms like HNSW, trading a small amount of recall for large gains in query speed.
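The distance-concentration effect is easy to observe with random data. The sketch below is a toy experiment on uniformly random points, not on real embeddings, and `relative_contrast` is a hypothetical helper: it measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import math
import random


def relative_contrast(dim: int, n_points: int = 200, seed: int = 0) -> float:
    """(max distance - min distance) / min distance from one random query."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


for dim in (2, 32, 512):
    print(dim, round(relative_contrast(dim), 3))
```

The contrast shrinks as the dimension grows: in high dimensions the nearest and farthest random points sit at nearly the same distance, which is what makes pruning-based exact indexes ineffective. Real embeddings are not uniformly random, so the effect is usually milder in practice.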
// 03 — the math

How similarity
is scored.

Semantic search relies on geometric distance in high-dimensional space. DataFlirt's embedding pipelines use cosine similarity as the default metric for matching queries against scraped document vectors.

Cosine Similarity = SC(A,B) = (A · B) / (||A|| × ||B||)
Measures the cosine of the angle between two vectors: 1 means the same direction, 0 means orthogonal, -1 means opposite. · Standard NLP metric
Recall@K = Relevant_Retrieved_in_Top_K / Total_Relevant
The fraction of all relevant documents that appear in the top K results. · Information Retrieval SLO
Pipeline Embedding Cost = (Tokens / 1,000) × Cost_Per_1k + Vector_DB_Storage
Ingestion cost scales linearly with scraped volume. · DataFlirt pricing model
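The three formulas translate directly into code. This is a minimal sketch with hypothetical function names; the cost function divides the token count by 1,000 because the price is quoted per 1,000 tokens, and it treats vector storage as a single flat amount supplied by the caller.

```python
import math
from typing import Sequence, Set


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """(A . B) / (||A|| * ||B||); ranges from -1 to 1."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    retrieved = set(ranked_ids[:k])
    return len(retrieved & relevant_ids) / len(relevant_ids)


def embedding_cost(tokens: int, cost_per_1k: float, storage_cost: float) -> float:
    """(tokens / 1000) * price per 1k tokens, plus vector storage."""
    return tokens / 1000 * cost_per_1k + storage_cost


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(recall_at_k(["a", "b", "c", "d"], {"a", "d", "z"}, k=2))
```

Scaling a vector does not change its cosine similarity to another vector, which is why the metric suits embeddings whose magnitudes carry little meaning.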
// 04 — embedding pipeline trace

From raw HTML
to searchable vector.

A live trace of a scraped product review being cleaned, chunked, embedded, and indexed into a vector database for semantic retrieval.

text-embedding-3-small · chunking · Pinecone
edge.dataflirt.io — live
CAPTURED
// 1. input record
record.id: "rev_98214"
record.text: "The battery life is abysmal, died in 2 hours."

// 2. chunking & normalisation
chunker.strategy: "sentence"
chunker.tokens: 14

// 3. embedding generation
model: "text-embedding-3-small"
api.latency: 112ms
vector.dimensions: 1536
vector.preview: [0.012, -0.045, 0.881, ...]

// 4. vector database upsert
db.target: "idx-electronics-prod"
upsert.status: 200 OK
index.latency: 45ms

// 5. semantic query test
query: "poor power duration"
match.score: 0.892 --- HIGH CONFIDENCE
// 05 — retrieval constraints

Where semantic
search degrades.

Ranked by frequency of retrieval failures in production RAG pipelines. Poor chunking strategy is the dominant failure mode, destroying context before the embedding model even sees the text.

PIPELINES · 120+ active
METRIC · Recall@10
UPDATED · 2026-05-19

01 Suboptimal text chunking · Destroys context boundaries
02 Embedding model mismatch · General models for niche domains
03 Dirty scraped data · HTML boilerplate polluting vectors
04 Lack of hybrid search · Ignoring exact keyword matches
05 Vector dimension bloat · High latency and storage costs
// 06 — our pipeline

Embed at the edge,

search at scale.

DataFlirt doesn't just deliver raw JSON. For AI teams, we run the embedding generation directly within the extraction pipeline. By chunking and vectorizing the text immediately after DOM parsing, we eliminate the need for downstream ETL jobs. We deliver ready-to-query vectors directly to your Pinecone, Milvus, or Qdrant index, complete with metadata payloads for hybrid filtering.

Embedding job health

Live status of an integrated scrape-and-embed pipeline.

job.id embed-tech-042
model text-embedding-3-small
records.processed 45,102
tokens.embedded 14.2M
chunking.strategy recursive_character
db.upserts 45,102
pipeline.status active

// 07 — FAQ

Common
questions.

About vector embeddings, chunking strategies, hybrid search, and how DataFlirt integrates semantic retrieval into scraping pipelines.

What is the difference between lexical and semantic search?
Lexical search (like BM25) looks for exact word matches. Semantic search converts text to dense vectors and measures geometric distance, allowing it to find "battery died" when you search for "poor power duration". Lexical is better for exact IDs; semantic is better for concepts.
Do I need semantic search for all scraped data?
No. If you are scraping structured data like prices, SKUs, or exact names, lexical search or standard SQL is faster and more accurate. Semantic search is designed for unstructured text: reviews, articles, documentation, and forum posts.
How does chunking affect semantic search?
If you embed a whole 10-page article as one vector, specific details get diluted. If you chunk it into single sentences, you lose broader context. Finding the right chunk size (usually 256–512 tokens with overlap) is critical for retrieval accuracy.
Can DataFlirt deliver directly to my vector database?
Yes. We support native upserts to Pinecone, Qdrant, Milvus, and pgvector. We handle the batching, retry logic, and metadata attachment so your RAG application always has fresh, searchable data without you running a separate embedding ETL.
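For readers who run their own delivery step, batching with retry can be sketched as below. This is a generic pattern, not DataFlirt's implementation and not any vendor's client API: `upsert_with_retry` and `upsert_fn` are hypothetical names, `upsert_fn` stands for whatever call writes one batch to your vector database, and the record fields are illustrative.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def upsert_with_retry(
    records: Sequence[dict],
    upsert_fn: Callable[[List[dict]], None],
    batch_size: int = 100,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send records in batches, retrying a failed batch with exponential backoff."""
    sent = 0
    for batch in batched(records, batch_size):
        for attempt in range(1, max_attempts + 1):
            try:
                upsert_fn(batch)
                sent += len(batch)
                break
            except ConnectionError:
                if attempt == max_attempts:
                    raise  # give up after the final attempt
                sleep(base_delay * 2 ** (attempt - 1))
    return sent


# Demo with a fake writer that fails once, then succeeds.
calls: List[int] = []


def flaky_upsert(batch: List[dict]) -> None:
    calls.append(len(batch))
    if len(calls) == 1:
        raise ConnectionError("simulated transient failure")


records = [
    {"id": f"chunk-{i}", "values": [0.0, 0.1], "metadata": {"source": "example"}}
    for i in range(5)
]
sent = upsert_with_retry(records, flaky_upsert, batch_size=2, sleep=lambda _s: None)
print(sent, calls)
```

Retrying a whole batch is only safe when the write is an upsert keyed by a stable ID, so a batch that partially succeeded before the failure is overwritten rather than duplicated.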
What is hybrid search?
Hybrid search combines semantic vector search with traditional keyword search. It solves the main weakness of pure semantic search: failing to find exact part numbers, acronyms, or specific proper nouns that the embedding model might smooth over.
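One common way to merge the two result lists is Reciprocal Rank Fusion (RRF), sketched below. The source does not say which fusion method DataFlirt or any particular database uses, so treat this as one option among several; the document IDs are invented, and `k = 60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Score each document by the sum of 1 / (k + rank) over all rankings."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["sku-AX-4410", "doc-7", "doc-2"]  # e.g. from a BM25 index
vector_hits = ["doc-2", "doc-7", "doc-9"]         # e.g. from a vector index
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents found by both retrievers rise to the top, while the exact-match SKU that only the keyword index found still appears in the fused list instead of being lost.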
Are there copyright issues with embedding scraped data?
Embedding text is generally considered a transformative computational process, but storing the original copyrighted text as metadata alongside the vector carries the same legal considerations as standard web scraping. Always adhere to robots.txt and fair use principles.