← Glossary / DuckDB

What is DuckDB?

DuckDB is an in-process SQL OLAP database management system designed for fast analytical queries on local or remote data. Unlike traditional data warehouses that require dedicated clusters and complex ingestion pipelines, DuckDB runs embedded within your scraping worker or data pipeline script. It executes vectorized queries directly against Parquet, CSV, or JSON files in S3 or local disk, making it the standard engine for transforming and validating scraped datasets before delivery.

OLAPIn-ProcessVectorizedParquetData Engineering
// 02 — definitions

Analytics,
embedded.

The mechanics of running a columnar data warehouse inside a Python or Node.js process without network overhead.

Ask a DataFlirt engineer →

TL;DR

DuckDB is to analytics what SQLite is to transactional data. It runs inside your scraper or ETL script, executing complex SQL aggregations and schema validations over millions of scraped records in milliseconds. It eliminates the need to load raw data into Snowflake or BigQuery just to clean it.

01Definition & architecture

DuckDB is an embedded analytical database. It runs entirely within the host process (like a Python script or Node.js app) and has no server to install, configure, or maintain. It uses a columnar storage model and a vectorized query execution engine, meaning it processes data in large batches rather than row-by-row. This architecture maximizes CPU cache efficiency and makes it exceptionally fast for aggregations, joins, and complex analytical queries.

02How it works in practice

You install it via a package manager (e.g., pip install duckdb) and import it into your script. You can create an in-memory database or connect to a local file. DuckDB can execute SQL queries directly against Pandas DataFrames, Polars DataFrames, Apache Arrow tables, or raw files (CSV, JSON, Parquet) on disk or in S3. Because it runs in the same process as your application, transferring data between the database and your code incurs zero serialization or network overhead.

03Why scrapers use it

Scraping pipelines generate massive amounts of raw, semi-structured data. Before delivering this data, it must be cleaned, deduplicated, and validated. Traditionally, this required loading the data into a heavy ETL tool or a cloud data warehouse. DuckDB allows engineers to perform these transformations locally at the edge. A scraper can dump millions of JSON lines, and a 10-line Python script using DuckDB can instantly convert them into a clean, partitioned Parquet dataset.

04How DataFlirt handles it

We use DuckDB extensively in our delivery infrastructure. When a scraping job completes, our delivery workers spin up an in-memory DuckDB instance. We use it to run SQL-based schema assertions against the raw output, ensuring no required fields are missing and all data types match the client contract. We then use DuckDB to deduplicate the records and export the final dataset directly to the client's S3 bucket as optimized Parquet files. This keeps our delivery latency under 90 seconds.

05Did you know?

DuckDB can query multiple remote files simultaneously using glob patterns. A query like SELECT count(*) FROM 's3://bucket/data/**/*.parquet' will automatically discover all matching files, read their footers in parallel, and execute the aggregation without downloading the actual data payloads. This makes it a powerful, zero-infrastructure query engine for massive data lakes.

// 03 — performance model

Why is it
so fast?

DuckDB's speed comes from vectorized execution and columnar data locality. Here is how we model transformation latency in our delivery pipelines.

Vectorized execution cost = C = N / 1024 × Tbatch
Processes data in batches of 1024 vectors, minimizing CPU cache misses. DuckDB Architecture
Remote query latency = L = RTT + (Bytesread / Bandwidth)
Uses HTTP range requests to read only required Parquet footers and columns. httpfs extension model
DataFlirt validation time = Tval = Rows × 0.04 ms
Average time to run schema assertions on scraped output via DuckDB. Internal SLO
// 04 — pipeline execution

Validating 2M records
in 1.4 seconds.

A trace from a DataFlirt delivery worker using DuckDB to deduplicate, validate, and convert raw JSON lines into partitioned Parquet files.

DuckDB v0.10JSON to ParquetZero-copy
edge.dataflirt.io — live
CAPTURED
// init
duckdb.connect: memory

// query execution
sql: CREATE TABLE raw AS SELECT * FROM read_json_auto('s3://bucket/raw/*.json')
rows_read: 2,145,092

// validation
sql: SELECT count(*) FROM raw WHERE price IS NULL
null_prices: 0 // ok

// deduplication
sql: CREATE TABLE clean AS SELECT DISTINCT ON (sku) * FROM raw
duplicates_removed: 14,201

// export
sql: COPY clean TO 's3://bucket/clean/' (FORMAT PARQUET, PARTITION_BY (category))
bytes_written: 142.4 MB
execution_time: 1.42s // ok
// 05 — bottleneck analysis

Where the time
actually goes.

DuckDB is rarely CPU-bound. In scraping and data delivery contexts, performance is dictated by I/O and memory constraints. Here is the latency breakdown for a typical transformation job.

SAMPLE SIZE ·  ·  ·  ·    12,000 jobs
AVG ROWS ·  ·  ·  ·  ·    2.4M
UPDATED ·  ·  ·  ·  ·  ·  2026-05-19
01

S3 network I/O

latency bound · Downloading Parquet chunks or JSON files
02

JSON parsing

CPU bound · Inferring schema from raw scraper output
03

Memory spilling

disk bound · Spilling to disk when aggregations exceed RAM
04

Parquet compression

CPU bound · Encoding output files
05

Vectorized computation

cache bound · Actual CPU execution time
// 06 — delivery architecture

Compute at the edge,

validating data before it ever hits a warehouse.

Traditional pipelines load raw scraped data into a central warehouse before running dbt models to clean it. This is expensive and slow. DataFlirt embeds DuckDB directly into the delivery workers. When a scraping job finishes, the worker spins up an in-memory DuckDB instance, runs SQL assertions against the raw JSON, deduplicates records, and writes partitioned Parquet files directly to the client's S3 bucket. The data arrives clean, typed, and ready to query.

Delivery worker state

Live snapshot of a DuckDB transformation job on a scraping worker.

worker.id dw-eu-west-04
duckdb.version 0.10.2
memory.limit 16GB
memory.used 4.2GB
threads 8
spill_to_disk false
job.status exporting_parquet

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

About embedded analytics, DuckDB vs SQLite, remote querying, and how DataFlirt uses it for pipeline delivery.

Ask us directly →
What is the difference between DuckDB and SQLite? +
SQLite is a row-oriented database designed for transactional (OLTP) workloads — inserting or updating single records quickly. DuckDB is a column-oriented database designed for analytical (OLAP) workloads — scanning millions of rows to compute aggregates. If you are storing scraper state, use SQLite. If you are analyzing scraped data, use DuckDB.
Can multiple processes write to a DuckDB file at the same time? +
No. DuckDB is designed as a single-writer, multiple-reader system. If your scraping architecture involves hundreds of concurrent workers, they should not try to write to the same DuckDB file. Instead, have workers write raw JSON or Parquet files to S3, and use a single DuckDB process to query those files collectively.
Is it legal to store scraped data in DuckDB? +
The storage engine you use has no bearing on the legality of the scraping activity. Compliance depends on what data you collect, how you collect it, and what you do with it. Data minimization principles apply regardless of whether the data sits in DuckDB, Postgres, or a CSV file.
How does DataFlirt use DuckDB? +
We use it as the transformation engine in our delivery layer. Instead of maintaining a massive Spark cluster to clean scraped data, we embed DuckDB in our delivery workers. It reads the raw JSON output from the scrapers, runs schema validation and deduplication queries, and exports the final dataset as partitioned Parquet files to the client's bucket.
Can DuckDB query data directly from S3 without downloading it? +
Yes. Using the httpfs extension, DuckDB can query remote Parquet files using HTTP range requests. It reads the file footer to understand the schema and byte offsets, then downloads only the specific columns required for your query. This makes it incredibly efficient for exploring massive data lakes.
What happens if my dataset is larger than available RAM? +
DuckDB supports out-of-core execution. When a query requires more memory than is available, it automatically spills intermediate results to disk. Performance will degrade due to disk I/O, but the query will complete successfully. You can control the memory limit using the memory_limit configuration pragma.
$ dataflirt scope --new-project --target=duckdb READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h