
What is JSON Lines Format?

JSON Lines format (often called NDJSON) is a text-based data delivery standard where each line is a valid, standalone JSON object. For high-volume scraping pipelines, it replaces monolithic JSON arrays by allowing records to be streamed, appended, and parsed incrementally without loading the entire dataset into memory. If your pipeline crashes mid-write, a JSON array is corrupted; a JSON Lines file simply ends at the last valid record.

Data Delivery · Streaming · NDJSON · ETL · Memory Efficiency
// 02 — definitions

Stream records,
not arrays.

Why monolithic JSON files break at scale, and how newline-delimited objects solve the memory bottleneck in data engineering.


TL;DR

JSON Lines (JSONL) separates individual JSON objects with a newline character (\n). Unlike standard JSON arrays, it requires no opening or closing brackets and no commas between records. This allows scraping pipelines to write data continuously and downstream consumers to process multi-gigabyte files one record at a time.

01 Definition & structure

JSON Lines is a text format where every line is a valid JSON value. It is defined by three strict rules:

  • UTF-8 Encoding: The file must be valid UTF-8.
  • Newline Delimited: Each JSON value is terminated by a newline character (\n). Windows-style line endings (\r\n) are tolerated, because parsers ignore whitespace around a JSON value, but \n is preferred.
  • No Inline Newlines: A record cannot span multiple lines, so pretty-printed JSON is not allowed. Any newline inside a string value must be escaped as \n.
There is no root array ([ ]) and no commas separating the objects.
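The rules above can be sketched with Python's standard json module. This is a minimal illustration, not a reference implementation: the function name write_jsonl and the sample record are assumptions made for the example. json.dumps escapes control characters inside strings, so a multi-paragraph description stays on one physical line.

```python
import io
import json


def write_jsonl(records, stream):
    """Write one JSON value per line to a text stream; return the count written."""
    count = 0
    for record in records:
        # json.dumps escapes raw newlines inside strings as \n, so the
        # serialized record never spans more than one physical line.
        line = json.dumps(record, ensure_ascii=False, separators=(",", ":"))
        stream.write(line + "\n")  # the newline is the only record delimiter
        count += 1
    return count


# Illustrative record with a multi-line description (hypothetical data).
buffer = io.StringIO()
write_jsonl(
    [{"id": "sku-101", "description": "First paragraph.\nSecond paragraph."}],
    buffer,
)
```

When writing to a real file instead of an in-memory buffer, opening it with encoding="utf-8" keeps the output within the first rule.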
02 Why standard JSON fails at scale
Standard JSON requires the entire document to be syntactically valid before it can be parsed. With a conventional (non-streaming) parser, a 10GB JSON array must be loaded into memory and parsed into a DOM tree, which often inflates the memory footprint to 3x to 5x the file size, before any record can be processed. If the file is missing a single closing bracket at the very end, the parse fails for the entire 10GB file. JSON Lines eliminates this by making each record independently valid.
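The difference is easy to reproduce on a small scale. The sketch below cuts the same three records off mid-write in both formats; the helper name parse_complete_lines and the sample data are assumptions for illustration.

```python
import json


def parse_complete_lines(text):
    """Return every record that parses; stop at the first line that does not."""
    records = []
    for line in text.split("\n"):
        if not line.strip():
            continue  # skip blank lines, including the one after a final \n
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            break  # a truncated final record; everything before it is intact
    return records


# The same three records, cut off mid-write (hypothetical data).
truncated_array = '[{"id": 1}, {"id": 2}, {"id"'
truncated_jsonl = '{"id": 1}\n{"id": 2}\n{"id"'

try:
    json.loads(truncated_array)
    array_parsed = True
except json.JSONDecodeError:
    array_parsed = False  # the whole document is rejected

recovered = parse_complete_lines(truncated_jsonl)  # the two complete records
```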
03 Streaming and parallel processing
Because JSONL uses a simple newline delimiter, it is trivial to split the file for parallel processing. A data engineering tool like Apache Spark or a simple bash script using split can divide a 100-million-record JSONL file into 100 chunks of 1 million records without needing to understand the JSON syntax. This makes it the ideal intermediate format for ETL pipelines.
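A minimal sketch of line-based splitting, assuming an uncompressed file: the chunker below never calls a JSON parser, it only counts lines. The name split_jsonl and the chunk size are illustrative choices.

```python
import io
from itertools import islice


def split_jsonl(lines, chunk_size):
    """Yield lists of at most chunk_size raw lines, without parsing any JSON."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Five records split into chunks of two (hypothetical data).
source = io.StringIO('{"id":1}\n{"id":2}\n{"id":3}\n{"id":4}\n{"id":5}\n')
chunks = list(split_jsonl(source, 2))
```

Because each chunk is a run of complete lines, every chunk is itself a valid JSON Lines file and can be handed to a separate worker.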
04 How DataFlirt handles it
We use JSON Lines as the default delivery format for all high-volume scraping pipelines. Our extraction workers serialize records and stream them directly to S3 using multipart uploads. We enforce strict newline escaping at the extraction layer, ensuring that scraped text (like multi-paragraph product descriptions) never breaks the line delimiter. Files are automatically rotated at 5GB boundaries to optimize downstream ingestion into data warehouses like Snowflake or BigQuery.
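File rotation at a size boundary can be sketched as follows. This is not DataFlirt's implementation: the class name RotatingJsonlWriter, the file naming scheme, and the use of local files in place of S3 multipart uploads are all assumptions made to keep the example self-contained. A 5GB limit would correspond to max_bytes=5 * 1024**3.

```python
import json
import os
import tempfile


class RotatingJsonlWriter:
    """Write records to numbered .jsonl files, rotating at a byte limit."""

    def __init__(self, directory, prefix, max_bytes):
        self.directory = directory
        self.prefix = prefix
        self.max_bytes = max_bytes
        self.paths = []
        self._file = None
        self._written = 0

    def _open_next(self):
        if self._file is not None:
            self._file.close()
        name = f"{self.prefix}_{len(self.paths):05d}.jsonl"
        path = os.path.join(self.directory, name)
        self._file = open(path, "wb")
        self.paths.append(path)
        self._written = 0

    def write(self, record):
        data = (json.dumps(record, ensure_ascii=False) + "\n").encode("utf-8")
        # Rotate before a write that would cross the limit, so a record is
        # never split across two files.
        too_big = self._written and self._written + len(data) > self.max_bytes
        if self._file is None or too_big:
            self._open_next()
        self._file.write(data)
        self._written += len(data)

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None


# Illustrative run with a tiny limit so rotation is visible (hypothetical data).
with tempfile.TemporaryDirectory() as workdir:
    writer = RotatingJsonlWriter(workdir, "catalog", max_bytes=40)
    for i in range(5):
        writer.write({"id": f"sku-{i}"})
    writer.close()
    file_count = len(writer.paths)
```

Rotating on record boundaries means every output file starts and ends with a complete line, so each one can be ingested independently.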
05 JSONL vs NDJSON vs JSON-LD
JSONL and NDJSON are identical formats; the community simply uses both names. JSON-LD (Linked Data), however, is completely different. JSON-LD is a method of encoding linked data using JSON (often embedded in HTML <script> tags for SEO). Do not confuse JSON-LD (a linked-data encoding) with JSONL (a file delivery format).
// 03 — the memory model

Why JSONL
scales infinitely.

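The fundamental advantage of JSON Lines is that parsing memory is constant with respect to file size: a streaming parser holds only the current record plus a read buffer, so memory is bounded by the largest record, not by the dataset. DataFlirt defaults to JSONL for any delivery exceeding 50,000 records to protect client ingestion pipelines.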

Standard JSON parse memory: M ≈ 3 × file_size
Loading a 2GB JSON array typically requires 6GB+ of RAM to build the DOM tree. (Source: V8 / Node.js heap profiling)
JSON Lines parse memory: M ≈ max_record_size + buffer
Memory footprint remains flat. A 100GB file parses in ~50MB of RAM. (Source: streaming parser architecture)
DataFlirt delivery chunking: File = min(5 GB, 10^6 records)
We rotate JSONL files at 5GB uncompressed to optimize downstream S3/Snowflake ingestion. (Source: DataFlirt delivery SLO)
// 04 — the delivery stream

Appending records
in real time.

A live tail of a DataFlirt extraction worker writing scraped product records directly to an S3 bucket via a multipart JSONL stream. No brackets, no commas, just newlines.

S3 multipart · gzip compressed · O(1) memory
edge.dataflirt.io — live
CAPTURED
// initializing delivery stream
stream.target: "s3://df-client-089/catalog_20260519.jsonl.gz"
stream.compression: "gzip (level 6)"

// raw worker output (uncompressed view)
{"id":"sku-101","price":45.99,"stock":true,"tags":["new","sale"]} \n
{"id":"sku-102","price":12.50,"stock":false,"tags":[]} \n
{"id":"sku-103","price":89.00,"stock":true,"tags":["premium"]} \n

// worker crash simulation
worker.status: SIGKILL (OOM in browser context)
stream.flush: success

// file integrity check
file.valid_json_array: false // would be broken if standard JSON
file.valid_json_lines: true // 3 records safely persisted
pipeline.recovery: resuming from sku-104
// 05 — format constraints

Where JSONL
pipelines break.

JSON Lines is robust, but strict. A single unescaped newline inside a string value corrupts the format. Here is what causes JSONL parsing errors across downstream consumer pipelines.

Pipelines monitored: 300+ active
Delivery volume: 14TB / day
Updated: 2026-05-19
01 Unescaped newlines in strings: scraped text containing a raw \n breaks the line delimiter.
02 Trailing commas: accidentally formatting the file like a JSON array.
03 Encoding mismatch: non-UTF-8 characters corrupting the line boundary.
04 Schema drift per line: line 500 has a different schema than line 1.
05 Missing trailing newline: EOF reached without a final \n delimiter.
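Most of these failure modes can be detected before a file reaches a consumer. The checker below is a minimal sketch, assuming the whole payload fits in memory for the check; the name check_jsonl and the problem labels are illustrative. It covers invalid lines (which is how an unescaped newline shows up), trailing commas, non-UTF-8 bytes, and a missing final newline. Schema drift is not checked, since it requires knowing the expected schema.

```python
import json


def check_jsonl(data):
    """Return a list of (line_number, problem) tuples for raw JSONL bytes."""
    problems = []
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Without a valid decoding, line boundaries cannot be trusted.
        return [(None, f"not valid UTF-8 at byte {exc.start}")]

    if text and not text.endswith("\n"):
        problems.append((None, "missing trailing newline"))

    for number, line in enumerate(text.split("\n"), start=1):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are skipped, not reported
        if stripped.endswith(","):
            problems.append((number, "trailing comma"))
            continue
        try:
            json.loads(stripped)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
    return problems


# A raw newline inside a string splits one record into two broken lines.
broken = check_jsonl(b'{"id":1,"text":"line one\nline two"}\n')
```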
// 06 — our delivery stack

Never buffer,

stream straight to the bucket.

DataFlirt writes JSON Lines directly to cloud storage using multipart uploads. As our extraction workers parse HTML, each valid record is serialized to JSON, terminated with a newline, and flushed to the network. We never buffer the full dataset in memory. If a worker crashes, the file isn't corrupted: it simply ends at the last complete record. Combined with gzip, this allows us to deliver 100GB+ datasets with a flat memory footprint on our end and on your ingestion pipeline.

delivery.stream.status

Live metrics from a high-volume JSONL delivery job.

job.id: export-b2b-IN-042
format: application/jsonl (ndjson)
compression: gzip (ratio: 8.4x)
records.written: 4,192,000
memory.usage: 42 MB (flat)
newline.escaping: strict (enforced)
output.destination: s3://df-client-042/raw/

// 07 — FAQ

Common
questions.

About JSON Lines, NDJSON, streaming architectures, and how DataFlirt formats high-volume data deliveries.

What is the difference between JSON Lines and NDJSON?
Nothing. They are two names for the exact same format. JSON Lines (JSONL) and Newline Delimited JSON (NDJSON) both dictate that each line is a valid JSON value, separated by a newline character (\n). DataFlirt uses the .jsonl extension by default, but the contents are identical.
Why not just use CSV for flat data?
CSV is terrible at handling nested data, arrays, and multiline text. If a scraped product has an array of 5 image URLs and a dictionary of technical specifications, flattening that into CSV requires complex escaping and arbitrary column limits. JSON Lines gives you the streaming benefits of CSV with the rich, nested data types of JSON.
How do I parse a JSON Lines file in Python?
Do not use json.load(). Instead, open the file and iterate over it line by line: for line in file: record = json.loads(line). This ensures you only ever hold one record in memory at a time, allowing you to process files that are significantly larger than your available RAM.
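Spelled out as a runnable sketch, the loop looks like this. The function name iter_jsonl, the file name, and the sample records are assumptions for illustration; the generator yields one record at a time, so only the current line is held in memory.

```python
import json
import os
import tempfile


def iter_jsonl(path):
    """Yield one parsed record at a time from a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # tolerate blank lines
                yield json.loads(line)


# Illustrative usage with a throwaway file (hypothetical data).
with tempfile.TemporaryDirectory() as workdir:
    sample_path = os.path.join(workdir, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write('{"id":"sku-101","price":45.99}\n')
        handle.write('{"id":"sku-102","price":12.5}\n')
    ids = [record["id"] for record in iter_jsonl(sample_path)]
```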
Can JSON Lines files be compressed?
Yes, and they compress exceptionally well. Because JSONL repeats schema keys on every line (e.g., "price", "title"), gzip or zstd compression algorithms achieve massive deduplication. DataFlirt routinely sees 8x to 12x compression ratios on JSONL deliveries. You can stream-decompress and stream-parse simultaneously.
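Stream-decompressing and stream-parsing together can be sketched with Python's standard gzip module. The function name iter_gzipped_jsonl, the file name, and the sample records are assumptions for illustration.

```python
import gzip
import json
import os
import tempfile


def iter_gzipped_jsonl(path):
    """Yield records from a .jsonl.gz file, decompressing as it reads."""
    # Text mode ("rt") decompresses on the fly; the full file is never
    # inflated into memory or onto disk.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Illustrative round trip with a throwaway archive (hypothetical data).
with tempfile.TemporaryDirectory() as workdir:
    archive = os.path.join(workdir, "catalog.jsonl.gz")
    with gzip.open(archive, "wt", encoding="utf-8") as handle:
        for i in range(3):
            handle.write(json.dumps({"id": f"sku-{i}", "price": 10 + i}) + "\n")
    total = sum(record["price"] for record in iter_gzipped_jsonl(archive))
```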
What happens if scraped text contains a newline?
It must be escaped. A literal newline character in the data will break the JSONL format, as the parser will assume it's the end of the record. DataFlirt's extraction layer automatically escapes all internal newlines (converting them to \n within the JSON string) before writing the record to the stream.
Does DataFlirt support standard JSON arrays?
Yes, for small datasets. If your pipeline produces fewer than 50,000 records, we can deliver a standard JSON array. However, for enterprise pipelines delivering millions of records daily, we strongly enforce JSONL or Parquet to ensure your ingestion infrastructure doesn't buckle under memory pressure.