← Glossary / JSONB Column

What is JSONB Column?

JSONB Column is a binary JSON data type native to PostgreSQL that allows scraping pipelines to store semi-structured payloads without defining a rigid schema upfront. Unlike standard JSON text fields, JSONB parses the data during insertion, enabling fast, indexed querying of nested keys and arrays. For data engineering teams, it's the ultimate schema-on-read buffer: dump raw API responses or scraped DOM objects immediately, and extract the typed fields downstream without losing the original payload context.

PostgreSQLSchema-on-readSemi-structured DataGIN IndexingRaw Data Zone
// 02 — definitions

Schema on
read.

The mechanics of storing raw scraped payloads in a queryable binary format before transforming them into structured relational tables.

Ask a DataFlirt engineer →

TL;DR

JSONB stores JSON data in a decomposed binary format. While it adds a slight overhead to insert times compared to plain text, it allows you to build GIN indexes on nested keys. In scraping architectures, it acts as an immutable raw layer, ensuring that if your downstream extraction logic breaks, the original payload is still queryable and intact.

01Definition & structure
JSONB (JSON Binary) is a native PostgreSQL data type used to store JSON data in a decomposed, binary format. Unlike the standard json type, which stores exact text, JSONB parses the payload upon insertion. It removes whitespace, eliminates duplicate keys (keeping the last value), and sorts keys for efficient retrieval. This binary structure allows the database engine to traverse nested objects and arrays without reparsing the text on every query.
02How it works in practice
In a scraping pipeline, the fetcher retrieves a JSON response from a hidden API or extracts a data dictionary from a page's JavaScript. Instead of mapping every possible field to a database column, the entire object is inserted into a single JSONB column. Downstream data models then use PostgreSQL's JSON path operators (like ->> to get text or #> to extract paths) to pull specific fields into structured, typed views for analytics.
03Indexing for performance
The true power of JSONB lies in its indexing capabilities. You can apply a GIN (Generalized Inverted Index) to a JSONB column, allowing the database to instantly find rows where the JSON document contains a specific key-value pair or array element. For highly specific queries, you can create functional B-Tree indexes on exact paths, making queries against a 100-million-row JSONB table execute in milliseconds.
04How DataFlirt handles it
We utilize JSONB extensively in our Bronze (raw) layer. Every successful scrape writes a record containing standard metadata columns (URL, timestamp, target ID) alongside a JSONB column containing the exact, unadulterated payload. We do not update JSONB rows; they are append-only. This guarantees that if a target alters their API schema, the data is still captured, and we can adjust our Silver layer extraction logic retroactively without data loss.
05The TOAST storage caveat
PostgreSQL pages are 8KB. If a scraped JSONB payload exceeds ~2KB, it is compressed and moved to out-of-line TOAST (The Oversized-Attribute Storage Technique) tables. While this keeps the main table lean, querying deeply nested keys inside TOASTed JSONB requires the database to fetch and decompress the entire blob just to read a single integer. If you frequently query a specific key in a large payload, it must be extracted to a standard column.
// 03 — the storage model

How JSONB impacts
your database.

JSONB trades write speed and storage space for read performance. The math below dictates how DataFlirt provisions PostgreSQL storage for high-volume raw data zones.

Storage overhead = SjsonbStext × 1.15
Binary decomposition and alignment padding increase the footprint over raw text. PostgreSQL Internals
Query latency (Indexed) = L = O(log N)
Using a GIN index on a JSONB path turns a full table scan into a rapid bitmap lookup. B-Tree / GIN characteristics
Extraction yield = Y = extracted_columns / jsonb_keys
A low yield means you are storing massive JSONB blobs but only querying a fraction of the data. DataFlirt pipeline metrics
// 04 — query execution

Querying nested data
at scale.

An EXPLAIN ANALYZE trace of a PostgreSQL query searching for a specific product variant inside a 40GB table of raw scraped JSONB payloads.

PostgreSQL 16GIN IndexJSONB Path
edge.dataflirt.io — live
CAPTURED
// query: find all products where variants array contains 'blue'
EXPLAIN ANALYZE SELECT payload->>'price' FROM raw_scrapes
WHERE payload @> '{"variants": ["blue"]}';

// execution plan
Bitmap Heap Scan on raw_scrapes (cost=42.15..412.30 rows=1402)
Recheck Cond: (payload @> '{"variants": ["blue"]}'::jsonb)
Heap Blocks: exact=845
-> Bitmap Index Scan on idx_payload_gin (cost=0.00..41.80)
Index Cond: (payload @> '{"variants": ["blue"]}'::jsonb)

// performance metrics
Planning Time: 0.184 ms
Execution Time: 4.211 ms
Rows Returned: 1,402
// A full table scan without GIN would take ~4,500 ms
// 05 — performance bottlenecks

Where JSONB queries
bog down.

JSONB is powerful, but treating a relational database like a document store has limits. These are the most common reasons JSONB queries degrade in scraping pipelines.

TABLES PROFILED ·  ·  ·   850+ raw zones
AVG PAYLOAD ·  ·  ·  ·    14 KB
UPDATED ·  ·  ·  ·  ·  ·  2026-05-19
01

Missing GIN indexes

Full table scans · Querying nested keys without an index forces sequential reads.
02

TOAST table bloat

I/O bottleneck · Large JSONB payloads (>2KB) are stored out-of-line, doubling read I/O.
03

High update frequency

Write amplification · Updating one key rewrites the entire JSONB document.
04

Deep array traversal

CPU intensive · Unnesting complex arrays at query time spikes CPU usage.
05

Type coercion overhead

Cast failures · Extracting text and casting to numeric repeatedly in views.
// 06 — our architecture

Schema-on-read,

structured on delivery.

DataFlirt uses JSONB as the immutable raw layer for all scraping pipelines. When a target site changes its DOM structure, downstream extraction logic might fail, but the raw JSONB payload is already safely stored. This architecture allows us to patch the selector and backfill the structured columns instantly without needing to re-scrape the target. We index only the metadata keys (timestamp, target ID, status) and leave the deep payload unindexed until transformation.

raw_scrape_buffer stats

Live metrics from a PostgreSQL JSONB buffer table handling e-commerce data.

table.name raw_catalog_in
column.type JSONB
avg_row_size 18.4 KB
toast_storage 412 GBcompressed
index.gin idx_metadata_pathactive
insert_rate 4,200 rows/sec
extraction_status continuous ELT

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

Common questions about JSON vs JSONB, indexing strategies, and how to structure raw scraping data in PostgreSQL.

Ask us directly →
What is the difference between JSON and JSONB in PostgreSQL? +
The json data type stores an exact copy of the input text, including whitespace and duplicate keys. It is fast to insert but slow to query because it must be reparsed on every read. jsonb stores data in a decomposed binary format. It is slightly slower to insert, strips whitespace, and sorts keys, but it is vastly faster to query and supports GIN indexing. Always use JSONB for scraping payloads.
When should I extract fields to standard columns vs keeping them in JSONB? +
If you query, filter, or join on a field frequently, extract it to a standard relational column (e.g., price NUMERIC). Keep data in JSONB if it is highly variable, sparsely populated, or only needed for downstream auditing. A hybrid approach—standard columns for core metadata and a JSONB column for the raw payload—is the industry standard.
Can I index specific keys inside a JSONB column? +
Yes. You can create a B-Tree index on a specific extracted path (e.g., CREATE INDEX idx_price ON table ((payload->>'price'))) or use a GIN index on the entire JSONB column to support containment queries (@>). Targeted B-Tree indexes are smaller and faster if you only query a few specific keys.
Does storing large scraped HTML in JSONB make sense? +
No. JSONB is for structured or semi-structured data (like API JSON responses or parsed dictionaries). If you are storing raw HTML strings, use a standard TEXT column. Wrapping a 2MB HTML string inside a JSONB object just adds parsing overhead with zero query benefits.
How does DataFlirt handle schema drift with JSONB? +
We treat the JSONB column as an append-only raw ledger. When a target API adds new fields or changes a structure, the JSONB insert succeeds automatically. Our downstream dbt models (which extract the JSONB into structured tables) will flag the schema drift, allowing our engineers to update the extraction logic and replay the raw JSONB data without losing any historical records.
Why do my JSONB updates cause database bloat? +
PostgreSQL uses MVCC (Multiversion Concurrency Control). Updating a single nested key inside a 50KB JSONB document forces Postgres to write a completely new 50KB row and mark the old one as dead. If you frequently update specific keys, extract them into standard columns to avoid massive write amplification and TOAST bloat.
$ dataflirt scope --new-project --target=jsonb-column READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h