
What is Apache ORC?

Apache ORC (Optimized Row Columnar) is a highly efficient, strongly typed columnar storage format designed for massive analytical workloads. For data engineering teams ingesting terabytes of scraped records, ORC dramatically reduces storage costs and query latency by organizing data into compressed stripes and embedding lightweight indexes that allow query engines to skip irrelevant data entirely.

Data Engineering · Columnar Storage · Big Data · Hadoop Ecosystem · Compression
// 02 — definitions

Built for
the query.

Why delivering scraped data as CSV is a liability at scale, and how ORC restructures records to make analytical queries substantially faster.


TL;DR

ORC stores data by column rather than by row, breaking files into large chunks called stripes (64 MB by default in current ORC writers and configurable; older Hive releases defaulted to 256 MB). It keeps min/max statistics per column and, where enabled, Bloom filters, so engines like Trino, Athena, and Spark can skip large parts of a file, up to 90% for selective queries on well-sorted data. It's the format of choice when storage efficiency and read performance matter more than human readability.

01 Definition & structure
Apache ORC is a self-describing, type-aware columnar file format. An ORC file consists of multiple stripes followed by a file footer (containing the schema, stripe locations, and file-level statistics) and a short postscript that records the compression settings. Each stripe (64 MB by default, configurable) contains:
  • Index Data: Min/max values for each column per row group (10,000 rows by default), plus optional Bloom filters for the columns where they are enabled.
  • Row Data: The actual data, stored by column, heavily compressed.
  • Stripe Footer: Directory of stream locations within the stripe.
Because the schema is embedded in the file footer, ORC files are self-contained and strongly typed.
02 Row vs Columnar storage
In a row-based format (CSV, JSON), data is stored sequentially: Row1(ColA, ColB, ColC), then Row2(ColA, ColB, ColC). To read just ColB, the disk must read the entire file. In a columnar format like ORC, data is stored by column: all of ColA, then all of ColB. A query engine can seek directly to the ColB block and read only what it needs, drastically reducing disk I/O and memory usage.
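The difference in work can be illustrated with a toy model. This is a minimal sketch, not how ORC is implemented: the records, field names, and helper functions are hypothetical, and "values touched" stands in for disk I/O.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# The records and field names below are hypothetical sample data.
rows = [
    {"product_id": 1, "price": 499.0, "category": "shoes"},
    {"product_id": 2, "price": 1299.0, "category": "phones"},
    {"product_id": 3, "price": 89.0, "category": "shoes"},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def values_touched_row_layout(records):
    """A row layout walks every field of every row, whatever column is wanted."""
    return sum(len(record) for record in records)


def values_touched_column_layout(columns, wanted):
    """A column layout walks only the requested column."""
    return len(columns[wanted])


columns = to_columns(rows)
avg_price = sum(columns["price"]) / len(columns["price"])

print(values_touched_row_layout(rows))                 # 9
print(values_touched_column_layout(columns, "price"))  # 3
print(avg_price)                                       # 629.0
```

With three columns the saving is two thirds; the gap widens as tables get wider.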
03 Predicate Pushdown
The true power of ORC is predicate pushdown. Because ORC records minimum and maximum values for every column, both per stripe and per row group inside each stripe, a query like SELECT * FROM products WHERE price > 1000 lets the query engine consult the statistics first. If a stripe's max price is 500, the engine skips that stripe entirely without decompressing it. On data that is sorted or clustered on the filtered column, this turns full table scans into surgical reads.
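The skipping logic can be sketched in a few lines. This is an illustrative model under simplifying assumptions: the Stripe class and scan_price_greater_than function are hypothetical, each stripe holds a single price column, and real engines apply the same test at row-group level as well.

```python
from dataclasses import dataclass, field


@dataclass
class Stripe:
    """Hypothetical stand-in for a stripe holding one price column."""
    prices: list
    min_price: float = field(init=False)
    max_price: float = field(init=False)

    def __post_init__(self):
        # Statistics are computed once at write time and stored with the data.
        self.min_price = min(self.prices)
        self.max_price = max(self.prices)


def scan_price_greater_than(stripes, threshold):
    """Return matching prices and the number of stripes actually read."""
    matches, stripes_read = [], 0
    for stripe in stripes:
        if stripe.max_price <= threshold:
            continue  # no value in this stripe can satisfy price > threshold
        stripes_read += 1
        matches.extend(p for p in stripe.prices if p > threshold)
    return matches, stripes_read


stripes = [
    Stripe([120.0, 450.0, 500.0]),
    Stripe([800.0, 1500.0, 2400.0]),
    Stripe([49.0, 999.0, 1000.0]),
]
matches, stripes_read = scan_price_greater_than(stripes, 1000)
print(matches)       # [1500.0, 2400.0]
print(stripes_read)  # 1
```

Only one of the three stripes is read. If every stripe covered the full price range, nothing could be skipped, which is why sort order matters.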
04 How DataFlirt handles it
We treat ORC as a first-class delivery format for enterprise pipelines. Instead of dumping raw JSON and forcing clients to run an ETL job, our extraction workers validate data against a strict schema and write directly to Zstandard-compressed ORC files. When we deliver to your S3 bucket, you can immediately point AWS Athena at the prefix and query billions of scraped records with sub-second latency.
05 The compression advantage
Because ORC stores data by column, all values in a block are of the same type. This allows for type-specific encoding before general-purpose compression is applied. A column of boolean flags or repeated category strings can be run-length encoded or dictionary-encoded, shrinking gigabytes of raw text into megabytes of disk space. This is why scraped catalogs in ORC are often 10% to 20% of the size of their JSON equivalents.
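Both encodings are easy to sketch. The functions below are simplified illustrations, not ORC's actual encoders (ORC's run-length and dictionary encodings are more elaborate), and the sample values are hypothetical.

```python
def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


def dictionary_encode(values):
    """Replace each value with a small integer code into a dictionary."""
    dictionary = {}
    codes = []
    for value in values:
        # A new value gets the next free code; a seen value reuses its code.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return list(dictionary), codes


in_stock = [True] * 5 + [False] * 2 + [True] * 3
categories = ["shoes", "shoes", "phones", "shoes", "phones"]

print(run_length_encode(in_stock))
# [(True, 5), (False, 2), (True, 3)]
print(dictionary_encode(categories))
# (['shoes', 'phones'], [0, 0, 1, 0, 1])
```

Ten booleans become three pairs, and each category string is stored once. A general-purpose codec such as Zstandard then compresses the already compact output.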
// 03 — the storage math

Why ORC beats
row-based formats.

Columnar formats exploit data homogeneity. A column of scraped prices compresses far better than a row containing a mix of text, dates, and floats. This translates directly to lower S3 costs and faster Athena queries.

Compression Ratio: C = Size_{raw_json} / Size_{orc}
Typically 5x to 10x for scraped e-commerce catalogs due to repeated values. (Source: DataFlirt delivery benchmarks)
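I/O Cost (Columnar): I/O ≈ (Cols_{queried} / Cols_{total}) × Size_{orc}
You only pay to read the columns you actually query; row formats read 100%. This is an approximation that treats all columns as roughly equal in compressed size. (Source: columnar architecture principle)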
Stripe Size: S = 64 MB
Default stripe size in current ORC writers (orc.stripe.size), sized for large sequential reads; it is configurable, and older Hive releases defaulted to 256 MB. (Source: Apache ORC documentation)
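As a worked example, the first two formulas can be applied to the 8.4 GB JSON and 1.1 GB ORC sizes from the delivery job shown later on this page. The 2-of-20-column query is a hypothetical, and the I/O figure assumes equally sized columns.

```python
def compression_ratio(size_raw_json_gb, size_orc_gb):
    """C = Size_raw_json / Size_orc."""
    return size_raw_json_gb / size_orc_gb


def columnar_io_gb(cols_queried, cols_total, size_orc_gb):
    """I/O ~ (Cols_queried / Cols_total) * Size_orc, assuming equal-sized columns."""
    return cols_queried / cols_total * size_orc_gb


# Sizes taken from the delivery job metrics on this page.
ratio = compression_ratio(8.4, 1.1)
reduction_pct = (1 - 1.1 / 8.4) * 100

# Hypothetical query touching 2 of 20 columns.
io_gb = columnar_io_gb(2, 20, 1.1)

print(f"compression ratio: {ratio:.2f}x")       # compression ratio: 7.64x
print(f"size reduction: {reduction_pct:.1f}%")  # size reduction: 86.9%
print(f"data read: {io_gb:.2f} GB")             # data read: 0.11 GB
```

Against the raw JSON, that query reads about 0.11 GB instead of 8.4 GB.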
// 04 — file inspection

Inside an ORC
delivery file.

Inspecting the metadata of a 10M-record product catalog delivery using the ORC tools CLI. Notice the file size, the compression codec, and the embedded column statistics.

orc-tools · ZSTD compression · Trino ready
$ orc-tools meta s3://df-client-042/catalog_20260519.orc
Processing data file catalog_20260519.orc [length: 142.4 MB]
Structure for catalog_20260519.orc
File Version: 0.12 with ORC_135
Rows: 10,000,000
Compression: ZSTD
Compression size: 262,144

// Stripe statistics
Stripes: 4
Stripe 1: rows: 2,500,000 data: 35.6 MB

// Column statistics (Predicate Pushdown enablers)
Column 1: product_id count: 10000000 hasNull: false
Column 2: price_inr count: 9982104 hasNull: true
min: 49.0 max: 145000.0 sum: 849201450.0
Column 3: category count: 10000000 hasNull: false
String length: 142048122

Status: File verified
// 05 — performance drivers

Where the speed
comes from.

The architectural decisions within the ORC format that reduce query latency and cloud egress costs for large-scale scraped datasets.

AVG COMPRESSION: 82% vs JSON
QUERY SPEEDUP: 10x–100x
UPDATED: 2026-05-19
01 Columnar layout
I/O reduction · Engines only read the specific columns referenced by the query.

02 Predicate pushdown
Data skipping · Min/max stats allow engines to skip entire stripes and row groups when the WHERE clause cannot match.

03 Type-aware encoding
Compression · Run-length encoding for integers, dictionary encoding for repeated strings.

04 Bloom filters
Fast lookups · Probabilistic data structures to quickly test whether a specific value might exist in a row group; a negative answer is definitive.

05 Stripe architecture
Parallelism · Large, independent chunks allow distributed engines to process a single file in parallel.
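The Bloom filter idea can be sketched as follows. This is an illustrative implementation, not ORC's: the BloomFilter class, its sizing defaults, and the SHA-256 based hashing are assumptions chosen for readability, and the product IDs are hypothetical.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bitset

    def _positions(self, value):
        # Derive num_hashes bit positions by salting the value with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


bloom = BloomFilter()
for product_id in ["SKU-1001", "SKU-1002", "SKU-1003"]:
    bloom.add(product_id)

print(bloom.might_contain("SKU-1002"))  # True
```

An engine evaluating an equality filter such as product_id = 'SKU-9999' can skip the row group whenever might_contain returns False, and only needs to read the data when it returns True.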
// 06 — delivery architecture

Stop parsing JSON,

start querying data.

When clients ingest millions of scraped records daily, JSON and CSV become operational bottlenecks. Parsing text is CPU-intensive, and reading full rows to aggregate a single price column wastes I/O. DataFlirt's delivery pipeline natively writes to Apache ORC, applying Zstandard compression and strict schema validation before the file ever hits your S3 bucket. You get data that is immediately ready for Athena, Trino, or Snowflake, with zero ETL required on your end.

Delivery Job: ORC Export

Metrics from a daily catalog export to a client's data lake.

job.id export-orc-091
records.total 14,204,812
size.raw_json 8.4 GB
size.orc_zstd 1.1 GB
compression 86.9% reduction
schema.validation strict · passed
delivery.status s3://client-lake/...

// 07 — FAQ

Common
questions.

Common questions about using Apache ORC for scraped data delivery, its differences from Parquet, and integration with modern data stacks.

What is the difference between Apache ORC and Apache Parquet?
Both are columnar storage formats. Historically, ORC came out of the Hive project and was optimized for heavy compression and built-in indexes, while Parquet was started by Twitter and Cloudera, modeled on the nested-data encoding from Google's Dremel paper, and later became Spark's default format. Today, both are excellent. If your stack is heavily Trino/Presto/Hive based, ORC often has a slight edge. If you are entirely Spark-based, Parquet is usually the default. DataFlirt supports both natively.
Can I read ORC files in Python using Pandas?
Yes. You can read ORC files into Pandas DataFrames with pd.read_orc('file.orc'), which uses pyarrow under the hood; pass columns=[...] to load only the fields you need. However, ORC is designed for distributed query engines. If you are pulling a 10 GB ORC file into a single Pandas process on a laptop, you are missing the point of the format's parallelization capabilities.
Why shouldn't I just use Gzip-compressed CSVs?
Gzipped CSVs are row-based, and a gzip stream cannot be split for parallel processing: a reader has to decompress it from the beginning. If you want to calculate the average price of a product, a query engine reading a CSV must read the entire file, parse every row, and extract the price. With ORC, the engine reads only the compressed price column; on a wide table that can mean skipping around 95% of the file's I/O.
How does DataFlirt handle schema evolution in ORC deliveries?
ORC requires a strict schema. If a target website adds a new field, we update the versioned schema contract. ORC supports schema evolution (adding columns to the end of the schema). Downstream tables in Athena or Trino will simply read nulls for the new column in older files, and the actual values in newer files, without breaking the pipeline.
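The effect on readers can be sketched with a toy projection. The schemas, rows, and read_with_schema helper are hypothetical, and the sketch matches columns by name; real engines can be configured to match by name or by position.

```python
# Hypothetical schema versions: v2 appends a "brand" column to v1.
SCHEMA_V1 = ["product_id", "price"]
SCHEMA_V2 = ["product_id", "price", "brand"]


def read_with_schema(file_schema, file_rows, reader_schema):
    """Project rows written under file_schema onto reader_schema.

    Columns the file does not contain come back as None (SQL NULL).
    """
    index = {name: i for i, name in enumerate(file_schema)}
    return [
        {name: (row[index[name]] if name in index else None)
         for name in reader_schema}
        for row in file_rows
    ]


old_file = [(1, 499.0)]           # written before the column was added
new_file = [(2, 1299.0, "Acme")]  # written after

table = (
    read_with_schema(SCHEMA_V1, old_file, SCHEMA_V2)
    + read_with_schema(SCHEMA_V2, new_file, SCHEMA_V2)
)
for row in table:
    print(row)
# {'product_id': 1, 'price': 499.0, 'brand': None}
# {'product_id': 2, 'price': 1299.0, 'brand': 'Acme'}
```

Old and new files are queried through the same table definition, so no backfill is needed when a column is appended.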
What is a 'stripe' in an ORC file?
An ORC file is divided into chunks called stripes, 64 MB by default and configurable to larger sizes. Each stripe contains index data, row data, and a stripe footer. Because stripes are independent of one another, distributed engines like Spark or Trino can assign different workers to different stripes within the same file, enabling massive read parallelism.
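The pattern looks roughly like this. It is a sketch only: the stripes are plain Python lists of hypothetical prices, and a thread pool stands in for the workers of a distributed engine.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical price column, already split into four independent stripes.
stripes = [
    [120.0, 450.0],
    [800.0, 1500.0],
    [49.0, 999.0],
    [2400.0, 60.0],
]


def partial_sum(stripe):
    """Each worker reduces its own stripe without seeing the others."""
    return sum(stripe), len(stripe)


# One task per stripe; the pool stands in for engine workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, stripes))

# The coordinator merges the small partial results.
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
average = total / count

print(average)  # 797.25
```

Each worker returns only a sum and a count, so the merge step is cheap however large the stripes are.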
Does ORC support nested data structures like JSON?
Yes. ORC supports complex types including arrays, maps, and structs. If we scrape a product page with a list of 15 variant objects, we can store that natively in ORC as an array of structs. You get the structural flexibility of JSON combined with the performance of columnar storage.
$ dataflirt scope --new-project --target=apache-orc READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h