
What is a Vector Database?

A vector database is a specialized storage engine designed to index, store, and query high-dimensional numerical representations of data, known as embeddings. Unlike relational databases, which match exact values, vector databases perform similarity searches by calculating the distance between vectors in a multi-dimensional space. For modern scraping pipelines feeding AI models, it is the delivery sink that turns raw scraped text into quickly retrievable context for Retrieval-Augmented Generation (RAG) applications.

Databases · Embeddings · RAG · Similarity Search · AI Scraping
// 02 — definitions

Search by
meaning.

How scraped text, images, and HTML are mathematically mapped so AI applications can retrieve them based on semantic intent rather than exact keyword matches.


TL;DR

A vector database stores high-dimensional arrays (embeddings) generated by machine learning models. It uses algorithms like HNSW (Hierarchical Navigable Small World) to perform Approximate Nearest Neighbor (ANN) searches in milliseconds. If your scraping pipeline feeds a Large Language Model, the data almost certainly lands here.

01 Definition & structure
A vector database stores data as high-dimensional vectors (arrays of floating-point numbers). Each dimension represents a latent feature of the data learned by a machine learning model. Alongside the vector, the database stores a payload of metadata (the original text, URLs, timestamps) so that when a vector is matched, the application can retrieve the human-readable content.
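As a rough illustration of the vector-plus-payload structure described above, here is a minimal sketch. The `VectorRecord` class and its field names are hypothetical, not any specific database's schema. The sample values are borrowed from the ingestion trace further down this page, with a 4-dimensional toy vector standing in for a real embedding.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class VectorRecord:
    """One stored item: an ID, the embedding, and a metadata payload."""
    id: str
    vector: list[float]  # the embedding, one float per dimension
    payload: dict[str, Any] = field(default_factory=dict)  # text, URLs, timestamps


# Hypothetical record; a real embedding would have hundreds or thousands of dimensions.
record = VectorRecord(
    id="rev_99281a",
    vector=[0.012, -0.044, 0.081, -0.002],
    payload={
        "text": "Battery life is terrible but the screen is gorgeous.",
        "source": "amazon_reviews",
        "scraped_at": 1716124800,
    },
)

# The vector is what gets matched; the payload is what the application reads back.
print(record.payload["text"])
```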
02 How it works in practice
You pass scraped text to an embedding model, which returns a vector, and you store that vector in the database. When a user submits a query, the application embeds the query using the exact same model. The database then finds the stored vectors closest to the query vector. Rather than comparing the query against every stored vector, index structures such as HNSW or IVF-PQ narrow the search to a small candidate set and return the closest matches in milliseconds.
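The embed, store, and query loop can be sketched in a few lines. This is a toy under loud assumptions: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, and `search` does an exact scan over every vector, which is the work an HNSW or IVF-PQ index is designed to avoid.

```python
import math

VOCAB = ["battery", "screen", "price", "shipping"]  # toy vocabulary (assumption)


def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: counts vocabulary words."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query: str, store: dict[str, list[float]], k: int = 1) -> list[str]:
    """Exact scan over every stored vector; an ANN index would visit only a subset."""
    q = toy_embed(query)  # the query goes through the same embedding function
    ranked = sorted(store, key=lambda doc_id: cosine(q, store[doc_id]), reverse=True)
    return ranked[:k]


docs = {
    "a": "battery drains fast and battery gets hot",
    "b": "screen is bright and price is fair",
    "c": "shipping was slow",
}
store = {doc_id: toy_embed(text) for doc_id, text in docs.items()}
top = search("how long does the battery last", store)
print(top)  # ['a']
```

The important detail is that documents and queries share one embedding function. Vectors from two different models are not comparable.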
03 The chunking problem
You cannot embed a 10,000-word scraped article into a single vector without losing semantic detail. The text must be split into overlapping chunks (e.g., 500 tokens) before embedding. Each chunk becomes its own vector in the database, tied together by a shared document ID in the metadata. Poor chunking strategy is a common cause of bad RAG retrieval.
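A minimal sketch of overlapping chunking with a shared document ID might look like the following. The function names, the 50-token overlap, and the `doc_id#n` chunk ID format are illustrative assumptions, and a whitespace split stands in for a real model tokenizer.

```python
def chunk_tokens(tokens: list[str], size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def chunk_document(doc_id: str, text: str, size: int = 500, overlap: int = 50) -> list[dict]:
    """Return one dict per chunk, all carrying the same doc_id."""
    tokens = text.split()  # whitespace split as a stand-in for a model tokenizer
    return [
        {"id": f"{doc_id}#{i}", "doc_id": doc_id, "text": " ".join(chunk)}
        for i, chunk in enumerate(chunk_tokens(tokens, size, overlap))
    ]


article = " ".join(f"w{i}" for i in range(1200))  # toy 1,200-word document
chunks = chunk_document("article_1", article)
print(len(chunks), chunks[1]["text"].split()[0])  # 3 w450
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some tokens twice.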
04 How DataFlirt handles it
We integrate chunking and embedding directly into the delivery pipeline. We do not just hand you a JSON file of scraped text; we hand you a fully populated, query-ready vector index. Our delivery workers handle the API rate limits of the embedding providers and manage the batch upserts to Pinecone, Qdrant, or Milvus automatically.
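To give a feel for what handling rate limits and batch upserts involves, here is a generic sketch of batching with exponential backoff. It is not DataFlirt's implementation: `upsert_all`, the `send` callable, and the use of `RuntimeError` as a stand-in for a provider's rate-limit error are all assumptions made for illustration.

```python
import time
from typing import Callable, Iterator, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upsert_all(
    records: Sequence,
    send: Callable[[Sequence], None],
    batch_size: int = 1000,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send every batch, retrying with exponential backoff on RuntimeError."""
    sent = 0
    for batch in batched(records, batch_size):
        for attempt in range(max_attempts):
            try:
                send(batch)
                sent += len(batch)
                break
            except RuntimeError:
                if attempt == max_attempts - 1:
                    raise  # give up after the final attempt
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return sent


# Hypothetical flaky client: the first call fails as if it were rate limited.
calls = []


def fake_send(batch: Sequence) -> None:
    calls.append(len(batch))
    if len(calls) == 1:
        raise RuntimeError("429 Too Many Requests")


total = upsert_all(list(range(2500)), fake_send, sleep=lambda _s: None)
print(total, calls)  # 2500 [1000, 1000, 1000, 500]
```

Passing `sleep` as a parameter keeps the demo instant. A production worker would use the real delay and usually add jitter.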
05 Did you know?
The "curse of dimensionality" means that as vector dimensions increase (e.g., 1536 for OpenAI's text-embedding-3-small), the distances from a query to random points become nearly indistinguishable, so the tree-style indexes that make exact search fast in low dimensions stop pruning effectively. Exact K-Nearest Neighbors (KNN) then degrades to comparing the query against every stored vector, which is too slow at scale. This is why vector databases rely on Approximate Nearest Neighbor (ANN) algorithms that trade a small amount of accuracy for large speed gains.
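The distance-concentration effect is easy to observe empirically. The sketch below uses uniformly random points, which is an assumption: real embeddings are not uniform, so treat it as an illustration of the trend rather than a measurement of any model. `relative_contrast` is a hypothetical helper name.

```python
import math
import random


def relative_contrast(dim: int, n_points: int = 200, seed: int = 0) -> float:
    """(farthest - nearest) / nearest distance from one random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


low = relative_contrast(dim=2)
high = relative_contrast(dim=1536)
print(f"dim=2: {low:.2f}  dim=1536: {high:.2f}")
```

In 2 dimensions the farthest point is many times farther away than the nearest one. In 1536 dimensions the gap shrinks to a small fraction of the nearest distance, which leaves a spatial index very little to prune on.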
// 03 — distance metrics

How do you measure
semantic similarity?

Vector databases do not use B-trees. They calculate the mathematical distance between the query vector and the stored vectors. Cosine similarity is the default for most text-embedding models like OpenAI's text-embedding-3-small.

Cosine Similarity: $S_C(A, B) = \dfrac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$
Measures the angle between vectors. 1 = same direction, 0 = orthogonal, -1 = opposite direction. (Standard NLP metric)
Euclidean Distance (L2): $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
Measures straight-line distance. Highly sensitive to vector magnitude. (Geometric distance)
DataFlirt Embedding Throughput: $T_{\text{embed}} = \dfrac{\text{tokens per batch}}{\text{batch latency}}$
Our pipeline syncs scraped records directly to Pinecone or Milvus at ~4k records/sec. (DataFlirt Delivery SLO)
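The two distance metrics above can be written directly from their formulas. The sketch below uses small hand-picked vectors (assumptions chosen so the arithmetic is easy to check) to show why cosine similarity ignores magnitude while Euclidean distance does not.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """(A . B) / (||A|| * ||B||); undefined for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def euclidean_distance(p: list[float], q: list[float]) -> float:
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the magnitude
c = [3.0, -1.5, 0.0]  # orthogonal to a: 1*3 + 2*(-1.5) + 3*0 = 0

print(round(cosine_similarity(a, b), 6))   # 1.0: the angle is zero
print(round(euclidean_distance(a, b), 4))  # 3.7417: the length difference still counts
print(cosine_similarity(a, c))             # 0.0: orthogonal
```

When embeddings are normalized to unit length, the two metrics produce the same ranking, so the choice matters most for unnormalized vectors.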
// 04 — vector ingestion trace

From scraped HTML
to indexed vector.

A live trace of a DataFlirt pipeline extracting a product review, generating a 1536-dimensional embedding, and upserting it into a vector database.

text-embedding-3-small · Pinecone · HNSW index
// 1. raw scraped record
record.id: "rev_99281a"
record.text: "Battery life is terrible but the screen is gorgeous."

// 2. embedding generation
model: "text-embedding-3-small"
tokens: 12
vector.dim: 1536
vector.sample: [0.012, -0.044, 0.081, ... -0.002]

// 3. metadata payload assembly
meta.source: "amazon_reviews"
meta.sentiment: "mixed"
meta.scraped_at: 1716124800

// 4. vector db upsert
index.name: "df-retail-rag-prod"
upsert.status: 200 OK
latency.embed: 142ms
latency.index: 48ms
// 05 — performance bottlenecks

Where vector pipelines
lose milliseconds.

Vector databases are fast, but the end-to-end pipeline from scraping to retrieval is bottlenecked by embedding generation, network I/O, and index updates.

AVG LATENCY: 190ms end-to-end
INDEX TYPE: HNSW
UPDATED: 2026-05-19
01 Embedding model latency
API bottleneck · Calls to OpenAI/Cohere dominate pipeline time
02 Index rebuild / compaction
Compute heavy · HNSW graph updates during high-volume upserts
03 Metadata filtering
Query overhead · Pre-filtering vs post-filtering trade-offs
04 Network payload size
Bandwidth · 1536-dim float arrays are heavy to transmit
05 Memory pressure
Infrastructure · Vectors must fit in RAM for sub-50ms search
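The payload-size and memory-pressure items come down to simple arithmetic. A back-of-the-envelope sketch, assuming vectors are stored as 32-bit floats (4 bytes per value) and ignoring index overhead such as HNSW graph links and metadata:

```python
def raw_vector_bytes(n_vectors: int, dim: int = 1536, bytes_per_value: int = 4) -> int:
    """Bytes needed for the raw float values alone (float32 assumed)."""
    return n_vectors * dim * bytes_per_value


per_vector = raw_vector_bytes(1)       # 1536 * 4 = 6144 bytes
million = raw_vector_bytes(1_000_000)  # 6,144,000,000 bytes

print(f"{per_vector / 1024:.1f} KiB per vector, {million / 1e9:.2f} GB per million vectors")
# 6.0 KiB per vector, 6.14 GB per million vectors
```

Actual memory use is higher once the index structure and payloads are included, so this figure is a floor rather than a sizing guide.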
// 06 — delivery architecture

Scrape to vector,
without the middleware mess.

Most data teams scrape to S3, run a daily Airflow job to chunk the text, call an embedding API, and finally push to Milvus or Pinecone. DataFlirt collapses these steps. Our delivery layer natively supports embedding generation on the fly. We chunk the extracted text, generate the vectors using your preferred model, and upsert directly to your vector database index in near real-time. Within moments of a page being scraped, its content is available for semantic search.

Vector Delivery Sink

Live configuration for a continuous RAG pipeline.

sink.type: Pinecone Serverless
embedding.model: text-embedding-3-small
chunking.strategy: recursive_character
chunk.size: 512 tokens
vector.dimensions: 1536
metadata.schema: strict_validation
upsert.batch_size: 1000 records
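One way to guard a configuration like this is to load it into a typed object and check it before any records flow. The sketch below is illustrative only: `VectorSinkConfig`, its snake_case field names, and the `problems` method are hypothetical renderings of the keys above, not a DataFlirt API. The one check it shows, that the index dimension matches the model's default output size, catches a mismatch that would otherwise surface as failed upserts.

```python
from dataclasses import dataclass, replace

# Default output size of the model named in the configuration.
MODEL_DIMENSIONS = {"text-embedding-3-small": 1536}


@dataclass(frozen=True)
class VectorSinkConfig:
    sink_type: str
    embedding_model: str
    chunking_strategy: str
    chunk_size_tokens: int
    vector_dimensions: int
    metadata_schema: str
    upsert_batch_size: int

    def problems(self) -> list[str]:
        """Return human-readable problems; an empty list means nothing was flagged."""
        found = []
        expected = MODEL_DIMENSIONS.get(self.embedding_model)
        if expected is not None and expected != self.vector_dimensions:
            found.append(
                f"{self.embedding_model} emits {expected} dims, "
                f"index expects {self.vector_dimensions}"
            )
        if self.chunk_size_tokens <= 0:
            found.append("chunk_size_tokens must be positive")
        if self.upsert_batch_size <= 0:
            found.append("upsert_batch_size must be positive")
        return found


config = VectorSinkConfig(
    sink_type="Pinecone Serverless",
    embedding_model="text-embedding-3-small",
    chunking_strategy="recursive_character",
    chunk_size_tokens=512,
    vector_dimensions=1536,
    metadata_schema="strict_validation",
    upsert_batch_size=1000,
)
mismatched = replace(config, vector_dimensions=768)

print(config.problems())      # []
print(mismatched.problems())  # one message about the dimension mismatch
```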

// 07 — FAQ

Common
questions.

About vector databases, embedding generation, RAG pipelines, and how DataFlirt delivers scraped data directly to semantic search indexes.

What is the difference between a vector database and Elasticsearch?
Elasticsearch traditionally uses an inverted index with BM25 ranking for lexical (keyword) matching. If you search "laptop", it looks for documents containing the term "laptop". A vector database performs semantic search. If you search "portable computer", it finds "laptop" because their embeddings are close in vector space. Note: Elasticsearch now supports vector search, but native vector databases (Pinecone, Milvus, Qdrant) are purpose-built for high-dimensional Approximate Nearest Neighbor (ANN) search at massive scale.
Do I need a vector database if I am just scraping product prices?
No. Relational databases (PostgreSQL) or data warehouses (Snowflake, BigQuery) are far better for structured, exact-match data like prices, SKUs, and inventory counts. Vector databases are specifically for unstructured data (text, images, or audio) that you intend to retrieve using AI models.
Can DataFlirt handle the embedding generation, or do I need to do it?
We handle it. You provide the API key for your preferred provider (OpenAI, Cohere, HuggingFace), and our delivery layer chunks the scraped text, generates the embeddings, and writes directly to your vector index. You skip the intermediate ETL step entirely.
What happens if the scraping pipeline updates an existing record?
We perform an upsert based on the unique record ID (e.g., the product URL or article hash). The old vector and metadata are overwritten with the new embedding and payload, so your RAG application retrieves the current version of the record rather than stale content.
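The overwrite-by-ID behaviour can be sketched with a dictionary. `ToyIndex` is a hypothetical in-memory stand-in, not a real client, and deriving the ID from a SHA-256 hash of the URL is one assumed convention among several.

```python
import hashlib


def record_id(url: str) -> str:
    """Stable ID derived from the URL, so re-scrapes map to the same row."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]


class ToyIndex:
    """In-memory stand-in for a vector index keyed by record ID."""

    def __init__(self) -> None:
        self._rows: dict[str, tuple[list[float], dict]] = {}

    def upsert(self, rid: str, vector: list[float], payload: dict) -> bool:
        """Insert or overwrite a row; return True when an old row was replaced."""
        replaced = rid in self._rows
        self._rows[rid] = (vector, payload)
        return replaced

    def get(self, rid: str) -> tuple[list[float], dict]:
        return self._rows[rid]

    def __len__(self) -> int:
        return len(self._rows)


index = ToyIndex()
rid = record_id("https://example.com/product/42")  # hypothetical URL
first = index.upsert(rid, [0.1, 0.2], {"note": "in stock"})
second = index.upsert(rid, [0.3, 0.1], {"note": "sold out"})
print(len(index), index.get(rid)[1]["note"])  # 1 sold out
```

One caveat when records are chunked: if a re-scraped document yields fewer chunks than before, the leftover chunk IDs are not overwritten and need an explicit delete.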
Are there privacy concerns with storing scraped data in vector databases?
Yes. If you scrape Personally Identifiable Information (PII) and embed it, that PII is encoded in the vector and is typically also stored verbatim in the accompanying metadata payload. If a RAG application retrieves that chunk as context, the LLM can leak the PII to the end user. We recommend strict PII masking in the extraction layer before the embedding step occurs.
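A minimal sketch of masking before embedding, using two regular expressions. The patterns and the `[EMAIL]` / `[PHONE]` placeholders are assumptions for illustration and only catch simple, well-formed cases. Names, postal addresses, and free-form identifiers need dedicated PII tooling.

```python
import re

# Deliberately simple patterns: they will miss unusual formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


sample = "Contact jane.doe@example.com or +1 (555) 010-9999 for a refund."
masked = mask_pii(sample)
print(masked)  # Contact [EMAIL] or [PHONE] for a refund.
```

Masking has to happen before the text reaches the embedding model. Once the original string is embedded and stored in the payload, removing it means re-embedding the record.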
How does metadata filtering work with vector search?
Most modern vector databases support single-stage filtering. You can query "find vectors semantically similar to X, but only where the metadata field 'category' equals 'electronics'". We ensure all scraped structured data (dates, authors, categories) is attached to the vector payload as metadata to enable precise pre-filtering.
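To see why the placement of the filter matters, the sketch below contrasts filtering before the similarity ranking with filtering after it. The rows, 2-dimensional vectors, and function names are toy assumptions, and the scan is exact rather than ANN.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


ROWS = [
    {"id": "tv_1", "category": "electronics", "vector": [0.9, 0.1]},
    {"id": "sofa_1", "category": "furniture", "vector": [1.0, 0.0]},
    {"id": "sofa_2", "category": "furniture", "vector": [0.99, 0.05]},
    {"id": "phone_1", "category": "electronics", "vector": [0.2, 0.9]},
]


def top_k(rows: list[dict], query: list[float], k: int) -> list[dict]:
    return sorted(rows, key=lambda r: cosine(query, r["vector"]), reverse=True)[:k]


def pre_filter(rows: list[dict], query: list[float], category: str, k: int) -> list[str]:
    """Restrict to matching rows first, then rank by similarity."""
    candidates = [r for r in rows if r["category"] == category]
    return [r["id"] for r in top_k(candidates, query, k)]


def post_filter(rows: list[dict], query: list[float], category: str, k: int) -> list[str]:
    """Rank everything first, then drop rows that fail the filter."""
    return [r["id"] for r in top_k(rows, query, k) if r["category"] == category]


query = [1.0, 0.0]
pre = pre_filter(ROWS, query, "electronics", k=2)
post = post_filter(ROWS, query, "electronics", k=2)
print(pre)   # ['tv_1', 'phone_1']
print(post)  # []: both top-2 hits were furniture
```

Post-filtering can return fewer than k results, or none, when the nearest neighbours fail the filter. That is why it helps to have the metadata attached at ingestion, so the database can apply the filter as part of the search.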