← Glossary / NDJSON

What is NDJSON?

NDJSON (Newline Delimited JSON) is a data format where each line of a file is a valid, standalone JSON object. By removing the root array brackets and inter-record commas of standard JSON, NDJSON allows scraping pipelines to stream, append, and parse gigabyte-scale datasets line-by-line. It eliminates the out-of-memory crashes associated with monolithic JSON parsing while preserving the ability to handle deeply nested data structures that break flat CSVs.

Data DeliveryStreamingJSON LinesBig DataETL

// 02 — definitions

Line by
line.

Why standard JSON breaks at scale, and how appending single objects per line fixes the memory bottleneck in data delivery.

Ask a DataFlirt engineer →

TL;DR

NDJSON formats data so that every line is an independent JSON object. Unlike a standard JSON array, you don't need to load the entire file into memory to parse it. It is the default delivery format for high-volume scraping pipelines because it supports infinite appending and native streaming ingestion into modern data warehouses like BigQuery and Snowflake.

01Definition & structure

NDJSON stands for Newline Delimited JSON. It is a text-based data format where each line is a valid JSON value (typically an object). Unlike standard JSON, an NDJSON file does not start with a [, does not end with a ], and does not have commas separating the objects.

The structure is strictly: {object1}\n{object2}\n{object3}\n. This seemingly minor structural change fundamentally alters how computers process the file, shifting it from a monolithic block of data to a streamable sequence of independent records.

02Why standard JSON fails at scale

To parse a standard JSON array, a parser must read the entire string into memory, validate the syntax, and construct an Abstract Syntax Tree (AST). In Node.js or Python, the memory overhead of this AST is typically 3x to 5x the size of the file on disk. A 2GB JSON file will easily consume 8GB of RAM and crash a standard worker node with an Out Of Memory (OOM) error.

Because NDJSON objects are separated by newlines, the parser only needs to hold a single line in memory at a time. Memory usage becomes O(1) — constant — regardless of whether the file is 10 megabytes or 10 terabytes.

03NDJSON vs CSV

CSV is the traditional format for streaming large datasets, but it is strictly two-dimensional. Web scraped data is rarely flat. A single e-commerce product might have an array of image URLs, a nested object of technical specifications, and an array of review objects.

To put this in a CSV, you must either serialize the arrays into messy strings (e.g., "url1|url2|url3") or duplicate rows. NDJSON solves this by allowing full JSON nesting within each line, providing the structural richness of JSON with the streamability of CSV.

04How DataFlirt handles it

We default to NDJSON for all high-volume data deliveries. Our extraction workers yield records asynchronously; the moment a record is validated against the schema, it is serialized to a JSON string, appended with a newline, and flushed to the network buffer.

This means we can stream data directly to a client's S3 bucket or Snowflake instance while the crawl is still running. If a pipeline is interrupted, the NDJSON file up to that point is perfectly valid and usable — unlike a standard JSON file, which would be corrupted by a missing closing bracket.

05Did you know?

You can use standard Unix text processing tools on NDJSON files. Because each record is exactly one line, you can use wc -l data.ndjson to count the exact number of records, head -n 5 data.ndjson to sample the first five records, or grep '"stock":true' data.ndjson to filter records — all without writing a single line of code or loading a JSON parser.

// 03 — the memory model

The cost of
parsing JSON.

Standard JSON requires the entire file to be read into RAM to build the Abstract Syntax Tree (AST). NDJSON bounds memory usage to the size of the largest single record, enabling infinite scale.

Standard JSON memory footprint = M = file_size × 3.5

V8 engine AST overhead. A 1GB JSON file requires ~3.5GB of RAM to parse. Node.js memory profiling

NDJSON memory footprint = M = max(record_size) × 3.5

Memory is constant O(1). A 100GB NDJSON file parses in < 5MB of RAM. Stream processing model

DataFlirt stream throughput = T = network_bandwidth / avg_record_size

Delivery speed is bounded only by network I/O, never by CPU parsing limits. DataFlirt delivery SLO

// 04 — stream ingestion

Piping 10M records
without memory bloat.

A live trace of an extraction worker attempting to buffer a standard JSON array, crashing, and then successfully streaming the same dataset as NDJSON directly to S3.

O(1) memoryS3 multipartSnowpipe

edge.dataflirt.io — live

CAPTURED

// attempt 1: standard json array buffer
worker.memory: 840 MB
worker.memory: 1.4 GB
worker.status: FATAL ERROR: Ineffective mark-compacts near heap limit
worker.status: Allocation failed - JavaScript heap out of memory

// attempt 2: ndjson stream
stream.write: {"id":"p_123","price":49.99,"variants":["red","blue"]}
stream.write: {"id":"p_124","price":12.50,"variants":["black"]}
stream.write: {"id":"p_125","price":89.00,"variants":[]}

// stream metrics (10M records later)
stream.bytes_written: 4.2 GB
worker.memory_peak: 42.1 MB
s3.multipart_upload: completed
snowflake.copy_into: success · 10,000,000 rows loaded

// 05 — format comparison

Why NDJSON wins
the delivery layer.

Comparing data delivery formats across DataFlirt's enterprise pipelines. NDJSON dominates because it balances the schema flexibility of JSON with the streamability of CSV.

DELIVERY SHARE · · · 78% of pipelines

AVG FILE SIZE · · · · 4.2 GB

UPDATED · · · · · · 2026-05-19

01

Streamability (O(1) memory)

critical · Parse line-by-line without loading the whole file

02

Nested data support

high · Handles arrays and sub-objects that break flat CSVs

03

Append-only writes

high · No need to rewrite the file to add a new record

04

Warehouse native ingestion

medium · Direct COPY INTO support in BigQuery and Snowflake

05

Human readability

low · Easy to inspect with standard Unix tools (tail, grep)

// 06 — delivery architecture

Never buffer,

always stream.

DataFlirt's delivery layer writes NDJSON directly to cloud storage using multipart uploads. As our extraction workers yield records, they are serialized and flushed to the network buffer immediately. We never hold the dataset in memory. This architecture allows us to deliver 50-gigabyte catalog dumps with the exact same micro-instance footprint as a 10-megabyte sample. If your pipeline crashes on large datasets, you have a buffering problem, not a scraping problem.

NDJSON Delivery Stream

Live metrics from a continuous product catalog extraction writing to an S3 sink.

sink.destination s3://df-client-042/catalog/

format ndjsongzip

records.flushed 2,841,000

buffer.memory 16 MB

schema.drift 0 records

malformed_lines 0

throughput 14.2 MB/s

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

Common questions about NDJSON, JSON Lines, parsing large files, and how DataFlirt handles data serialization at scale.

Ask us directly →

What is the difference between NDJSON, JSONL, and LDJSON? +

Nothing but the acronym. NDJSON (Newline Delimited JSON), JSONL (JSON Lines), and LDJSON (Line-Delimited JSON) all refer to the exact same format: a text file where every line is a valid JSON object, separated by a newline character (\n). DataFlirt uses the .ndjson extension by default, but they are entirely interchangeable.

Why not just use CSV for large datasets? +

CSV is flat. If your scraped data includes nested structures — like a product with an array of image URLs, or a business listing with multiple nested review objects — CSV forces you to either stringify the arrays, drop data, or create complex relational tables. NDJSON gives you the streaming benefits of CSV while preserving the nested hierarchy of JSON.

How do I read an NDJSON file in Python? +

Do not use json.load(). Instead, open the file and iterate over it line by line, parsing each line individually with json.loads(line). This keeps your memory footprint near zero regardless of how large the file is. Libraries like Pandas and PySpark also have native functions (e.g., read_json(lines=True)) to handle this automatically.

Does DataFlirt support standard JSON delivery? +

Yes, for small datasets. However, for any pipeline expected to yield over 100MB of data per run, we strongly recommend (and sometimes enforce) NDJSON or Parquet. Delivering a 5GB standard JSON array is a disservice to the data engineering team that has to ingest it.

Can NDJSON handle schema drift? +

Yes, exceptionally well. Because each line is an independent JSON object, a missing field or a new field on line 50,000 doesn't break the parser for the rest of the file. This makes NDJSON highly resilient to the minor schema variations common in web scraping, allowing downstream ETL tools to handle the drift gracefully.

How do modern data warehouses handle NDJSON? +

Natively and efficiently. Snowflake, BigQuery, and Redshift all support direct ingestion of NDJSON files from cloud storage. In BigQuery, you simply set the source format to JSON; it automatically infers newline delimitation and loads the nested structures into RECORD and REPEATED column types.

$ dataflirt scope --new-project --target=ndjson READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

Start a pipeline → View pricing

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h