
What is Apache Avro?

Apache Avro is a row-based data serialization system that relies on JSON for defining data types and protocols, but serializes the actual data into a compact binary format. In scraping pipelines, it acts as the strict schema contract between the extraction workers and downstream message queues like Kafka. If a scraper emits a string where an integer belongs, the Avro serializer rejects it at the edge, preventing poisoned records from silently corrupting your data warehouse.

Serialization · Schema Evolution · Binary Format · Kafka · Data Contracts
// 02 — definitions

Strict contracts,
binary payloads.

The mechanics of separating schema definition from data payloads to achieve high-throughput, strongly-typed data ingestion across distributed scraping fleets.

TL;DR

Avro is a common default serialization format for high-volume streaming data, particularly in Kafka-based pipelines. Unlike JSON, which embeds field names in every record, Avro stores the schema once and encodes the data as raw binary. For flat records this typically cuts payload size by 60–75%, and it enforces strict type validation before a scraped record ever hits your Kafka topics or Snowflake tables.

01 Definition & structure
Apache Avro is a data serialization system developed within the Hadoop project. It uses JSON to define data types and protocols, but serializes the actual data into a compact, untagged binary format. An Avro schema defines the fields, their types (e.g., string, int, boolean), and whether they are optional (via unions with null). Because the schema is known to both the writer and the reader, the binary payload doesn't need to include field names or type tags, which keeps each record small on high-volume data streams.
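To make the structure concrete, here is a minimal sketch of what such a schema could look like for a scraped product record. The record name, namespace, and field list are illustrative assumptions, not a schema taken from a real pipeline. The helper only uses the standard library to show how an optional field appears as a union with null.

```python
import json

# Hypothetical schema for a scraped product record (names are illustrative).
PRODUCT_SCHEMA_JSON = """
{
  "type": "record",
  "name": "Product",
  "namespace": "example.scraping",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "price", "type": "double"},
    {"name": "in_stock", "type": "boolean"},
    {"name": "discount_percentage", "type": ["null", "double"], "default": null}
  ]
}
"""


def optional_fields(schema: dict) -> list:
    """Return the names of fields whose type is a union that includes "null"."""
    return [
        field["name"]
        for field in schema["fields"]
        if isinstance(field["type"], list) and "null" in field["type"]
    ]


schema = json.loads(PRODUCT_SCHEMA_JSON)
print(optional_fields(schema))
```

Because the schema is plain JSON, any language with a JSON parser can inspect it without generated classes.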
02 How it works in practice
In a scraping context, an extraction worker parses HTML and produces a raw dictionary of values. Before sending this data over the network, the worker validates the dictionary against an Avro schema. If the data matches, it is serialized into binary bytes and published to a message broker like Kafka. The downstream consumer (e.g., a Snowflake ingestion connector) reads the bytes, fetches the corresponding schema from a registry, and deserializes the binary back into structured rows.
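The validation step can be sketched in a few lines. This is a simplified stand-in for what an Avro library does before encoding, covering primitive types and unions only; the function names are assumptions made for illustration, and a production worker would rely on an Avro library's own validation rather than this hand-rolled check.

```python
def _is_int(value, bits):
    # bool is a subclass of int in Python, so exclude it explicitly.
    limit = 2 ** (bits - 1)
    return isinstance(value, int) and not isinstance(value, bool) and -limit <= value < limit


def _is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


# Subset of Avro primitive types handled by this sketch.
CHECKS = {
    "null": lambda v: v is None,
    "boolean": lambda v: isinstance(v, bool),
    "int": lambda v: _is_int(v, 32),
    "long": lambda v: _is_int(v, 64),
    "float": _is_number,
    "double": _is_number,
    "string": lambda v: isinstance(v, str),
}


def matches(value, avro_type) -> bool:
    """True if value fits the type; a list is a union, so any branch may match."""
    if isinstance(avro_type, list):
        return any(matches(value, branch) for branch in avro_type)
    return CHECKS[avro_type](value)


def validate(record: dict, schema: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field in schema["fields"]:
        name = field["name"]
        if name not in record:
            if "default" not in field:
                errors.append(f"{name}: missing, no default")
            continue
        if not matches(record[name], field["type"]):
            errors.append(
                f"{name}: expected {field['type']}, got {type(record[name]).__name__}"
            )
    return errors


SCHEMA = {
    "type": "record",
    "name": "Product",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "price", "type": "double"},
        {"name": "in_stock", "type": "boolean"},
    ],
}

raw = {"id": "SKU-992", "price": 49.99, "in_stock": "yes"}
print(validate(raw, SCHEMA))
```

Only records that come back with an empty error list would move on to binary encoding and publishing.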
03 Schema evolution and drift
Web scraping is inherently unstable; target sites add, remove, or change fields constantly. Avro supports schema evolution, allowing the schema to change over time while maintaining compatibility. If a target site adds a new "discount_percentage" field, the Avro schema can be updated to include it as a union with null and a default value of null. Old consumers reading new data will simply ignore the new field, and new consumers reading older records fall back to the default, preventing pipeline crashes while the downstream analytics teams update their queries.
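The resolution rule behind this can be sketched as a projection of the decoded record onto the reader's schema. This is a simplified illustration that works on already-decoded dictionaries and ignores type promotion and aliases; the function name and the two schema versions are assumptions for the example.

```python
def resolve(decoded: dict, reader_schema: dict) -> dict:
    """Project a record decoded with the writer's schema onto the reader's schema.

    Fields the reader does not know are dropped; fields the writer did not
    send are filled from the reader's default, or rejected if there is none.
    """
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in decoded:
            resolved[name] = decoded[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved


SCHEMA_V1 = {"fields": [
    {"name": "id", "type": "string"},
    {"name": "price", "type": "double"},
]}
SCHEMA_V2 = {"fields": [
    {"name": "id", "type": "string"},
    {"name": "price", "type": "double"},
    {"name": "discount_percentage", "type": ["null", "double"], "default": None},
]}

new_record = {"id": "SKU-992", "price": 49.99, "discount_percentage": 10.0}
old_record = {"id": "SKU-100", "price": 5.0}

# Old consumer, new data: the extra field is ignored.
print(resolve(new_record, SCHEMA_V1))
# New consumer, old data: the missing field takes its default.
print(resolve(old_record, SCHEMA_V2))
```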
04 How DataFlirt handles it
We use Avro as the strict data contract between our extraction fleet and our delivery infrastructure. Every pipeline is bound to a versioned schema in our Confluent Schema Registry. If a selector breaks and extracts a malformed string instead of a price float, the Avro serializer throws an exception at the edge worker. The record is routed to a dead-letter queue for human review, ensuring that our clients' data lakes are never polluted with silent type-coercion errors.
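The quarantine pattern described here can be sketched as follows. This is not DataFlirt's implementation: the type table, exception class, and in-memory lists standing in for a Kafka topic and a dead-letter queue are all assumptions chosen to keep the example self-contained.

```python
# Hypothetical expected Python types for one pipeline's record.
EXPECTED_TYPES = {"id": str, "price": float, "in_stock": bool}


class SchemaViolation(ValueError):
    """Raised when a record does not match the expected types."""


def check_record(record: dict) -> None:
    for name, expected in EXPECTED_TYPES.items():
        value = record.get(name)
        # Exact type match on purpose: no silent coercion of "49.99" to 49.99.
        if type(value) is not expected:
            raise SchemaViolation(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )


def publish_or_quarantine(record: dict, publish, quarantine) -> bool:
    """Publish a valid record; route an invalid one to the dead-letter sink."""
    try:
        check_record(record)
    except SchemaViolation as exc:
        quarantine({"record": record, "error": str(exc)})
        return False
    publish(record)
    return True


topic, dead_letters = [], []  # stand-ins for a Kafka topic and a DLQ
good = {"id": "SKU-992", "price": 49.99, "in_stock": True}
bad = {"id": "SKU-993", "price": "Call for price", "in_stock": True}

publish_or_quarantine(good, topic.append, dead_letters.append)
publish_or_quarantine(bad, topic.append, dead_letters.append)
print(len(topic), dead_letters)
```

Keeping the original record next to the error message in the dead-letter entry is a design choice that makes human review easier.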
05 Did you know?
Unlike Protocol Buffers (Protobuf) or Thrift, Avro does not require you to generate code (like Java or Python classes) before you can use it. Because the schema is defined in plain JSON and is always present (either in the file header or via a registry), Avro libraries can dynamically process data at runtime. This makes it well suited to dynamic scripting languages and rapidly changing scraping schemas.
// 03 — the payload math

Why Avro beats
JSON at scale.

The storage and bandwidth savings come from stripping repetitive keys. DataFlirt's ingestion layer uses Avro to compactly encode millions of scraped records per minute before routing them to client sinks.

Avro payload size: S_avro = binary_values + schema_header
Schema is written once per file, or managed externally via a registry. (Source: Apache Avro Specification)
JSON payload size: S_json = Σ (key_length + value_length + 4)
Keys and structural characters (quotes, colons) are repeated for every single record. (Source: DataFlirt bandwidth model)
Compression ratio, expressed as a fractional size reduction: C = 1 − (S_avro / S_json)
Typically yields a 60–75% reduction for flat scraped e-commerce catalogs. (Source: internal benchmark)
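A short worked example of the three formulas, as a sketch. The function names are assumptions, and taking value_length to be the length of the JSON text of each value is one reading of the model. The sample inputs reuse figures quoted elsewhere on this page: the 68-byte and 19-byte payload sizes from the captured trace, and the 3.8x ratio from the producer log.

```python
import json


def json_size_model(record: dict) -> int:
    """S_json for one record: key_length + value_length + 4 per field."""
    return sum(len(key) + len(json.dumps(value)) + 4 for key, value in record.items())


def avro_size_model(binary_value_bytes: int, schema_header_bytes: int = 0) -> int:
    """S_avro: encoded values plus the schema header (0 when a registry holds it)."""
    return binary_value_bytes + schema_header_bytes


def reduction(s_avro: float, s_json: float) -> float:
    """C = 1 - (S_avro / S_json)."""
    return 1 - s_avro / s_json


record = {"id": "SKU-992", "price": 49.99, "in_stock": True}
print(json_size_model(record))

# Sizes reported in the captured trace on this page.
print(round(reduction(avro_size_model(19), 68), 2))

# A ratio such as 3.8x maps to a reduction of 1 - 1/3.8.
print(round(reduction(1, 3.8), 3))
```

The last line shows how a "times smaller" ratio and a percentage reduction describe the same quantity: 3.8x corresponds to roughly a 74% reduction, which sits inside the 60–75% range quoted above.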
// 04 — extraction to avro

Validating a scraped
record at the edge.

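A worker extracts a product record, validates it against a schema fetched from the schema registry, and serializes it to binary before publishing to Kafka. Avro itself does not coerce types, so the "yes" -> true conversion in step 3 should be read as a normalization the worker applies before encoding.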

Confluent Schema Registry · Kafka Producer · Binary Serialization
edge.dataflirt.io — live
CAPTURED
// 1. fetch schema v4
schema.registry: "GET /subjects/product-catalog/versions/4"
schema.fields: ["id: string", "price: double", "in_stock: boolean"]

// 2. raw extraction output
extract.raw: {"id": "SKU-992", "price": 49.99, "in_stock": "yes"}

// 3. avro validation
type.check: field "in_stock" expected boolean, got string
type.coerce: "yes" -> true // ok
validation.status: passed

// 4. serialization
avro.encode: {"id": "SKU-992", "price": 49.99, "in_stock": true}
payload.json_size: 68 bytes
payload.avro_size: 19 bytes // 72% reduction

// 5. publish
kafka.topic: "raw.products.v1"
kafka.publish: ok offset=491022
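The serialization step in the trace above can be reproduced by hand for this small record. The sketch below follows the Avro binary encoding rules for the three primitive types involved (length-prefixed UTF-8 strings with zigzag varint lengths, little-endian IEEE 754 doubles, single-byte booleans); the function names are assumptions, and a real worker would use an Avro library. Byte counts from this bare encoding need not match the captured figures, which may measure a different framing.

```python
import struct


def encode_long(n: int) -> bytes:
    """Avro int/long: zigzag encoding followed by a base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_double(x: float) -> bytes:
    return struct.pack("<d", x)


def encode_boolean(b: bool) -> bytes:
    return b"\x01" if b else b"\x00"


ENCODERS = {
    "string": encode_string,
    "double": encode_double,
    "boolean": encode_boolean,
    "long": encode_long,
}

# Field order comes from the schema; no names or type tags are written.
FIELDS = [("id", "string"), ("price", "double"), ("in_stock", "boolean")]


def encode_record(record: dict) -> bytes:
    return b"".join(ENCODERS[avro_type](record[name]) for name, avro_type in FIELDS)


payload = encode_record({"id": "SKU-992", "price": 49.99, "in_stock": True})
print(len(payload), payload.hex())
```

The payload contains only values: one length byte and seven characters for the id, eight bytes for the price, and one byte for the flag.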
// 05 — schema evolution

How schemas drift
in scraping pipelines.

Target websites change without warning. Avro handles this via schema evolution rules—allowing pipelines to adapt to new fields without breaking downstream consumers.

COMPATIBILITY: BACKWARD_TRANSITIVE
REGISTRY: Confluent
UPDATED: 2026-05-19
01 Adding a field with default
Backward compatible · Old consumers ignore it; new consumers read it and fall back to the default on older records.

02 Removing an optional field
Forward compatible · Safe if downstream doesn't strictly require it.

03 Changing field type
Promotable · Avro handles numeric widening (int to long) automatically. Narrowing (long to int) is not allowed.

04 Renaming a field
Requires aliases · Breaks unless explicitly mapped in the schema definition.

05 Adding a required field
Not backward compatible · New consumers cannot decode older records that lack the field, so a registry enforcing BACKWARD_TRANSITIVE rejects the change. Requires a major version bump.
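The rules above can be approximated with a small backward-compatibility check, sketched here under simplifying assumptions: only flat records with primitive field types are handled, unions and nested types are conservatively treated as incompatible unless identical, and the function names are invented for the example. A schema registry performs a far more complete check.

```python
# Promotions allowed by Avro schema resolution (writer type -> reader types).
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def readable(writer_type, reader_type) -> bool:
    if writer_type == reader_type:
        return True
    if isinstance(writer_type, str) and isinstance(reader_type, str):
        return reader_type in PROMOTIONS.get(writer_type, set())
    return False  # unions and nested types are out of scope for this sketch


def backward_problems(old: dict, new: dict) -> list:
    """Problems a reader using `new` would hit on data written with `old`."""
    old_fields = {f["name"]: f for f in old["fields"]}
    problems = []
    for field in new["fields"]:
        names = [field["name"], *field.get("aliases", [])]
        match = next((old_fields[n] for n in names if n in old_fields), None)
        if match is None:
            if "default" not in field:
                problems.append(f"{field['name']}: added without a default")
        elif not readable(match["type"], field["type"]):
            problems.append(
                f"{field['name']}: {match['type']} cannot be read as {field['type']}"
            )
    return problems


V1 = {"fields": [{"name": "id", "type": "string"}, {"name": "stock", "type": "int"}]}

with_default = {"fields": V1["fields"] + [{"name": "note", "type": "string", "default": ""}]}
required = {"fields": V1["fields"] + [{"name": "currency", "type": "string"}]}
widened = {"fields": [{"name": "id", "type": "string"}, {"name": "stock", "type": "long"}]}
renamed = {"fields": [{"name": "id", "type": "string"},
                      {"name": "quantity", "type": "int", "aliases": ["stock"]}]}

for candidate in (with_default, required, widened, renamed):
    print(backward_problems(V1, candidate))
```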
// 06 — our ingestion layer

Typed at the edge,

binary on the wire.

At DataFlirt, we don't let raw JSON float through our infrastructure. Every extraction worker is bound to a central schema registry. When a target site changes and a price field suddenly returns a string instead of a float, the Avro serializer catches it immediately. The record is quarantined at the edge, preventing poisoned data from entering the Kafka stream and corrupting the client's data lake.

avro-producer.log

Live metrics from a single extraction worker serializing to Avro.

schema.subject catalog-IN-v3
records.processed 14,200/sec
validation.failures 12 records
compression.ratio 3.8x
serialization.time 0.4ms/record
kafka.delivery acknowledged

// 07 — FAQ

Common
questions.

About Avro serialization, schema registries, Parquet comparisons, and how DataFlirt enforces data contracts at scale.

Why use Avro instead of JSON for scraped data?
JSON is schemaless and verbose. It repeats field names in every record, wasting bandwidth. Avro stores the schema once and encodes data in binary, typically reducing payload size by 60–75% for flat records. More importantly, Avro enforces strict typing: if a scraper extracts a string where an integer is expected, Avro blocks it before it pollutes your database.
How does Avro compare to Parquet?
Avro is row-based and optimized for write-heavy streaming workloads (like Kafka ingestion). Parquet is column-based and optimized for read-heavy analytical queries (like Athena or BigQuery). In a modern pipeline, you stream scraped data into Kafka using Avro, and then compact it into Parquet for long-term storage in the data lake.
What happens when a target website changes its structure?
This is where Avro's schema evolution shines. If a site adds a new attribute, you update the Avro schema with a default value. Downstream consumers using the old schema will simply ignore the new field, while new consumers can read it. The pipeline doesn't break.
Do I need a Schema Registry to use Avro?
Technically no: you can embed the schema in the header of every Avro file. But for streaming data via Kafka, a Schema Registry is essential. It stores the schemas centrally, allowing producers to send a 4-byte schema ID (preceded by a single magic byte in Confluent's wire format) instead of the full schema with every message.
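As a sketch of what that framing looks like on the wire: Confluent's wire format places a zero magic byte and a 4-byte big-endian schema ID in front of the Avro-encoded body. The function names below are assumptions for illustration; in practice the Confluent serializers add and strip this prefix for you.

```python
import struct

MAGIC_BYTE = 0  # first byte of a Confluent-framed message


def frame(schema_id: int, avro_body: bytes) -> bytes:
    """Prefix an Avro-encoded body with the magic byte and schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_body


def unframe(message: bytes):
    """Split a framed message into (schema_id, avro_body)."""
    if len(message) < 5:
        raise ValueError("message shorter than the 5-byte prefix")
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id, message[5:]


body = b"\x0eSKU-992"  # an Avro-encoded string, used here as a sample body
message = frame(4, body)
print(message[:5].hex(), unframe(message))
```

A consumer reads the ID, looks the schema up in the registry (usually with a local cache), and only then decodes the body.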
How does DataFlirt handle Avro schema versioning?
We treat schema changes as deployment events. When our monitors detect a target site change, we patch the extraction logic and register a new schema version with BACKWARD compatibility. Clients are notified, but their existing ingestion pipelines continue to consume the data without interruption.
Can I inspect an Avro file manually?
Not with a standard text editor, since it's binary. You need tools like avro-tools (a Java JAR) or Python's fastavro library to deserialize the file back into JSON for debugging. DataFlirt provides a UI dashboard that automatically deserializes and samples the live Avro stream for our clients.
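For a quick look without installing anything, the header of an Avro object container file can be read with the standard library alone. The sketch below parses the "Obj" magic bytes, the metadata map (which carries the schema under the key avro.schema), and the 16-byte sync marker; it does not decode the data blocks, for which avro-tools or fastavro remain the practical choice. The function names and the tiny in-memory header are assumptions for illustration.

```python
import io
import json

MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def read_long(stream) -> int:
    """Decode one zigzag varint."""
    shift, result = 0, 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("truncated varint")
        result |= (chunk[0] & 0x7F) << shift
        if not chunk[0] & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def read_bytes(stream) -> bytes:
    return stream.read(read_long(stream))


def read_header(stream) -> dict:
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:
            break
        if count < 0:  # negative count is followed by the block size in bytes
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return {
        "schema": json.loads(meta["avro.schema"]),
        "codec": meta.get("avro.codec", b"null").decode("utf-8"),
        "sync": stream.read(16),
    }


# Tiny hand-built header: one metadata entry holding the schema.
schema_json = b'{"type": "string"}'
demo = io.BytesIO(
    MAGIC
    + b"\x02"                                       # map block with 1 entry
    + b"\x16" + b"avro.schema"                      # key, length 11
    + bytes([2 * len(schema_json)]) + schema_json   # value, length-prefixed
    + b"\x00"                                       # end of map
    + bytes(16)                                     # sync marker
)
header = read_header(demo)
print(header["schema"], header["codec"])
```

To try it on a real file, pass a handle opened in binary mode, for example `read_header(open("products.avro", "rb"))`, where the filename is a placeholder.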

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h