← Glossary / ClickHouse

What is ClickHouse?

ClickHouse is an open-source, column-oriented OLAP database designed for lightning-fast analytical queries over massive datasets. In scraping infrastructure, it serves as the ultimate sink for high-volume raw data and telemetry, capable of ingesting millions of records per second while compressing them heavily. It's the engine that lets you run aggregations across billions of scraped price points or proxy logs without waiting for a batch job to finish.

OLAPColumnar StorageMergeTreeData IngestionAnalytics
// 02 — definitions

Billions of rows,
sub-second queries.

Why row-based databases choke on scraping telemetry, and how columnar storage flips the performance math for analytical workloads.

Ask a DataFlirt engineer →

TL;DR

ClickHouse stores data by columns rather than rows, meaning queries that aggregate specific fields (like average price or proxy failure rate) only read the necessary data from disk. It uses the MergeTree engine family to handle massive, continuous inserts, making it the standard choice for storing raw scraping output and pipeline observability metrics.

01Definition & structure

ClickHouse is an open-source, column-oriented database management system built for Online Analytical Processing (OLAP). Unlike traditional row-based databases (like MySQL or PostgreSQL) that store all fields of a record together, ClickHouse stores each column in a separate file.

This architecture means that if you run a query calculating the average price of a product across a billion rows, ClickHouse only reads the price column from disk. It completely ignores the heavy description or raw_html columns, resulting in massive I/O savings and sub-second query times.

02The MergeTree engine family

The core of ClickHouse is the MergeTree table engine. When you insert a batch of data, ClickHouse immediately writes it to disk as a new, immutable "part" sorted by the primary key. In the background, ClickHouse continuously merges these smaller parts into larger ones to optimize read performance.

This append-only, merge-in-the-background design is what allows ClickHouse to ingest millions of rows per second without locking tables or degrading read performance.

03Ingestion patterns for scraping

Because every insert creates a new file part, ClickHouse is highly sensitive to insert frequency. If your scrapers insert records one by one, you will quickly hit a Too many parts exception as the background merge process falls behind.

The correct pattern is to buffer scraped records in memory, a message queue (like Kafka), or an intermediate layer, and flush them to ClickHouse in micro-batches of 10,000 to 100,000 rows every few seconds.

04How DataFlirt handles it

We use ClickHouse as the backbone of our observability and raw data storage. Every proxy request, TLS fingerprint, and extraction event generated by our fleet is buffered via Kafka and ingested into ClickHouse using the Kafka table engine. This allows us to maintain real-time dashboards of pipeline health and anti-bot block rates across billions of daily requests, without putting any analytical load on our operational databases.

05Did you know: Vectorized execution

ClickHouse doesn't just store data in columns; it processes it in columns. Using vectorized query execution, ClickHouse loads blocks of column data directly into the CPU's L1/L2 cache and uses SIMD (Single Instruction, Multiple Data) instructions to execute operations on arrays of data simultaneously. This makes aggregations and filtering operations orders of magnitude faster than row-by-row processing.

// 03 — the performance math

Why columnar
storage wins.

ClickHouse's speed comes from avoiding unnecessary disk I/O and maximizing CPU cache efficiency through vectorized execution. Here is how we model ingestion and query costs.

I/O Reduction = Disk_Read = Σ queried_cols_bytes / compression_ratio
Queries only load the columns specified in the SELECT clause, ignoring the rest of the wide schema. Columnar Architecture
Ingestion Throughput = Ringest = batch_size / (network_latency + disk_write_time)
ClickHouse requires large batches (10k-100k rows) to minimize the overhead of creating new data parts. MergeTree Mechanics
DataFlirt Storage Efficiency = E = raw_json_size / lz4_compressed_size
Typical E > 6.0 for scraped text data, massively reducing EBS volume costs. Internal SLO
// 04 — ingestion trace

Writing 100k records
in a single block.

ClickHouse hates single-row inserts. Here is a trace of a DataFlirt pipeline flushing a micro-batch of scraped product records into a MergeTree table via the native protocol.

Native ProtocolLZ4 CompressionAsync Insert
edge.dataflirt.io — live
CAPTURED
// batch preparation
buffer.size: 100,000 // records
buffer.bytes_uncompressed: 145.2 MB

// native client insert
query: "INSERT INTO raw_products FORMAT Native"
compression: "LZ4"
bytes_transferred: 22.1 MB // 6.5x compression ratio

// server-side processing
part.created: "all_1452_1452_0"
rows_written: 100,000
insert_duration: 42ms // 2.38M rows/sec

// background merge (system.parts)
merge.task: "all_1400_1452_1"
status: merged 53 parts into 1
disk_io: optimized
// 05 — optimization levers

Where the speed
actually comes from.

ClickHouse is fast by default, but tuning these architectural features dictates whether your cluster handles terabytes or petabytes efficiently.

INGEST RATE ·  ·  ·  ·    2M+ rows/sec/node
COMPRESSION ·  ·  ·  ·    LZ4 / ZSTD
ENGINE ·  ·  ·  ·  ·  ·   ReplicatedMergeTree
01

Columnar Storage

I/O bound · Only reads columns requested in the query
02

Data Compression

CPU vs I/O · Similar data types compress heavily, saving disk reads
03

Vectorized Execution

CPU cache · Processes arrays of columns rather than single values
04

Sparse Primary Index

Memory · Fits in RAM, points to blocks of rows, not individual rows
05

Background Merges

Disk I/O · Asynchronously compacts data parts to speed up reads
// 06 — our stack

Write heavy,

read instantly.

DataFlirt uses ClickHouse as the primary sink for both raw scraped payloads and pipeline telemetry. Every HTTP request, proxy rotation, and extracted field is flushed to ClickHouse in micro-batches. This allows our clients to query their raw data instantly, while we use the same cluster to monitor proxy pool health across billions of requests without degrading ingestion performance. We rely heavily on ReplacingMergeTree to handle record deduplication asynchronously, keeping the write path completely lock-free.

system.query_log

Live metrics from a DataFlirt telemetry node.

cluster.role telemetry-sink-01
active_parts 1,204healthy
insert_rate 450k rows/s
query.p99 142msfast
compression.lz4 18% of raw size
mutations.queue 0 pending
zookeeper.status connected

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

About ClickHouse architecture, ingestion patterns, schema design, and how DataFlirt leverages it for scraping infrastructure.

Ask us directly →
Should I use ClickHouse or PostgreSQL for my scraping pipeline? +
Use PostgreSQL for pipeline state, job queues, and configuration (OLTP). Use ClickHouse for the actual scraped data, logs, and telemetry (OLAP). PostgreSQL chokes when you try to run aggregations over 500 million rows or insert 100k rows per second. ClickHouse handles that effortlessly, but it is terrible at single-row updates and transactional consistency.
Why do my inserts keep failing with 'Too many parts' errors? +
You are doing single-row or small-batch inserts. ClickHouse's MergeTree engine creates a new file (part) on disk for every insert. If you insert 1 row at a time, you create thousands of files per second, overwhelming the background merge process. Always buffer your scraped data and insert in batches of 10,000 to 100,000 rows.
How do I update or delete scraped records in ClickHouse? +
Standard UPDATE and DELETE commands (mutations) are heavy, asynchronous operations in ClickHouse. Instead of mutating, use the ReplacingMergeTree or CollapsingMergeTree engines. You insert a new version of the row with a higher timestamp, and ClickHouse automatically deduplicates and removes the old version during background merges.
How does the sparse index work compared to a B-Tree? +
A B-Tree index (like in Postgres) maps every single row to a disk location. ClickHouse's sparse index maps ranges of rows (granules, default 8192 rows) to disk locations. Because the data is sorted by the primary key, the sparse index is tiny, fits entirely in RAM, and allows ClickHouse to skip reading massive chunks of irrelevant data.
What compression codec should I use for scraped text data? +
LZ4 is the default and provides the best balance of compression speed and read performance. If you are storing massive amounts of raw HTML or JSON that you query infrequently, use ZSTD. ZSTD compresses much tighter (saving disk space) but uses more CPU during ingestion and querying.
How does DataFlirt deliver data from ClickHouse to clients? +
We don't usually expose ClickHouse directly to clients. We use it as our internal raw data lake. When a client pipeline finishes, we run an asynchronous ClickHouse query to export the finalized, deduplicated dataset directly to an S3 bucket in Parquet or JSONL format, which the client then ingests into their own Snowflake or BigQuery environment.
$ dataflirt scope --new-project --target=clickhouse READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h