
What is JSON Lines Format?


JSON Lines format (often called NDJSON) is a text-based data delivery standard where each line is a valid, standalone JSON object. For high-volume scraping pipelines, it replaces monolithic JSON arrays by allowing records to be streamed, appended, and parsed incrementally without loading the entire dataset into memory. If your pipeline crashes mid-write, a JSON array is corrupted; a JSON Lines file simply ends at the last valid record.

Data Delivery · Streaming · NDJSON · ETL · Memory Efficiency

// 02 — definitions

Stream records, not arrays.

Why monolithic JSON files break at scale, and how newline-delimited objects solve the memory bottleneck in data engineering.


TL;DR

JSON Lines (JSONL) separates individual JSON objects with a newline character (\n). Unlike standard JSON arrays, it requires no opening or closing brackets and no commas between records. This allows scraping pipelines to write data continuously and downstream consumers to process multi-gigabyte files one record at a time.

01 Definition & structure

JSON Lines is a text format where every line is a valid JSON value. It is defined by three strict rules:

UTF-8 Encoding: The file must be valid UTF-8.
Newline Delimited: Each JSON object is separated by a newline character (\n). Windows-style CRLF line endings (\r\n) are permitted but discouraged.
No Inline Newlines: The JSON objects themselves cannot contain unescaped newline characters. Any newlines within the data must be escaped as \n.

There is no root array ([ ]) and no commas separating the objects.
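The three rules can be sketched with Python's standard json module. This is a minimal illustration, not DataFlirt's writer: the `to_jsonl` helper and the sample records are assumptions made for the example.

```python
import json

# Hypothetical sample records; the first contains a raw newline in a string.
records = [
    {"id": "sku-101", "description": "Line one\nLine two"},
    {"id": "sku-102", "description": "Café table, 2 m"},
]

def to_jsonl(items):
    """Serialize each item onto its own line.

    json.dumps escapes the raw newline inside strings as the two
    characters backslash + n, so a record never spans two lines.
    """
    return "".join(json.dumps(item, ensure_ascii=False) + "\n" for item in items)

payload = to_jsonl(records)          # no "[" or "]" and no commas between records
encoded = payload.encode("utf-8")    # the bytes written to disk are UTF-8

# Splitting on the delimiter gives back one standalone JSON value per line.
lines = payload.split("\n")[:-1]
restored = [json.loads(line) for line in lines]
```

After the round trip, `restored[0]["description"]` contains the original newline again; it was only escaped while on disk.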

02 Why standard JSON fails at scale

Standard JSON requires the entire document to be syntactically valid before it can be parsed. With a typical non-streaming parser, a 10GB JSON array must be loaded into memory, parsed into a DOM tree (which often inflates the memory footprint by 3x to 5x), and only then processed. If the file is missing a single closing bracket at the very end, parsing the entire 10GB file fails. JSON Lines eliminates this by making each record independently valid.
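A small sketch of that failure mode, using only the standard library. The truncation of five characters stands in for a crash mid-write; `parse_array` and `parse_jsonl` are illustrative helper names.

```python
import json

records = [{"id": i} for i in range(3)]

array_doc = json.dumps(records)
jsonl_doc = "".join(json.dumps(r) + "\n" for r in records)

# Simulate a writer that died before finishing the last record.
broken_array = array_doc[:-5]
broken_jsonl = jsonl_doc[:-5]

def parse_array(text):
    """A standard parser either gets the whole document or nothing."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return []

def parse_jsonl(text):
    """Keep every line that parses; stop at the first damaged one."""
    good = []
    for line in text.split("\n"):
        if not line.strip():
            continue
        try:
            good.append(json.loads(line))
        except json.JSONDecodeError:
            break
    return good

from_array = parse_array(broken_array)
from_jsonl = parse_jsonl(broken_jsonl)
```

With this input the array yields no records at all, while the JSONL variant still yields the two records that were fully written.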

03 Streaming and parallel processing

Because JSONL uses a simple newline delimiter, it is trivial to split the file for parallel processing. A data engineering tool like Apache Spark or a simple bash script using split can divide a 100-million-record JSONL file into 100 chunks of 1 million records without needing to understand the JSON syntax. This makes it the ideal intermediate format for ETL pipelines.
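The same idea in a few lines of Python, as a rough sketch: `split_lines` is a hypothetical helper that groups raw lines into chunks without ever calling a JSON parser, and the in-memory sample stands in for a real file handle.

```python
import io
from itertools import islice

def split_lines(source, lines_per_chunk):
    """Yield lists of raw lines from an open text stream.

    Record boundaries are found purely by the newline delimiter,
    so no JSON parsing is needed to split the work.
    """
    while True:
        chunk = list(islice(source, lines_per_chunk))
        if not chunk:
            return
        yield chunk

# Ten records in memory; a real job would pass an open file instead.
sample = io.StringIO("".join('{"id": %d}\n' % i for i in range(10)))
chunks = list(split_lines(sample, 4))
chunk_sizes = [len(chunk) for chunk in chunks]
```

Each chunk is itself a valid JSONL payload, so it can be handed to a separate worker as-is.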

04 How DataFlirt handles it

We use JSON Lines as the default delivery format for all high-volume scraping pipelines. Our extraction workers serialize records and stream them directly to S3 using multipart uploads. We enforce strict newline escaping at the extraction layer, ensuring that scraped text (like multi-paragraph product descriptions) never breaks the line delimiter. Files are automatically rotated at 5GB boundaries to optimize downstream ingestion into data warehouses like Snowflake or BigQuery.

05 JSONL vs NDJSON vs JSON-LD

JSONL and NDJSON are, for practical purposes, the same format; the community simply uses both names. JSON-LD (Linked Data), however, is completely different. JSON-LD is a method of encoding linked data using JSON (often embedded in HTML <script> tags for SEO). Do not confuse JSON-LD (a linked-data serialization) with JSONL (a file delivery format).

// 03 — the memory model

Why JSONL scales infinitely.

The fundamental advantage of JSON Lines is that parsing memory is O(1) with respect to file size: it is bounded by the largest single record, not by the dataset. DataFlirt defaults to JSONL for any delivery exceeding 50,000 records to protect client ingestion pipelines.

Standard JSON parse memory: M ≈ 3 × file_size

Loading a 2GB JSON array typically requires 6GB+ of RAM to build the DOM tree. Source: V8 / Node.js heap profiling.

JSON Lines parse memory: M ≈ max_record_size + buffer

Memory footprint remains flat. A 100GB file parses in ~50MB of RAM. Source: streaming parser architecture.

DataFlirt delivery chunking: File = min(5 GB, 10⁶ records)

We rotate JSONL files at 5GB uncompressed to optimize downstream S3/Snowflake ingestion. Source: DataFlirt delivery SLO.
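The chunking rule can be sketched as a generator that tags each serialized line with a file index and moves to the next index when either limit is hit. This is an assumption-heavy illustration, not DataFlirt's rotation code: `rotate`, `max_bytes`, and `max_records` are hypothetical names, and the sketch only decides where boundaries fall without writing any files.

```python
import json

def rotate(records, max_bytes, max_records):
    """Yield (file_index, line) pairs for a stream of records.

    A new file index starts when adding the next line would exceed
    max_bytes, or when the current file already holds max_records lines.
    Only counters are kept in memory, never the records themselves.
    """
    index, size, count = 0, 0, 0
    for record in records:
        line = json.dumps(record) + "\n"
        line_bytes = len(line.encode("utf-8"))
        if count and (size + line_bytes > max_bytes or count >= max_records):
            index, size, count = index + 1, 0, 0
        yield index, line
        size += line_bytes
        count += 1

# Tiny limits so the behaviour is visible; production limits would be far larger.
by_count = [i for i, _ in rotate(({"id": n} for n in range(7)), max_bytes=10_000, max_records=3)]
by_size = [i for i, _ in rotate(({"id": n} for n in range(7)), max_bytes=25, max_records=100)]
```

Because rotation always happens between lines, every output file is a complete JSONL file on its own.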

// 04 — the delivery stream

Appending records in real time.

A live tail of a DataFlirt extraction worker writing scraped product records directly to an S3 bucket via a multipart JSONL stream. No brackets, no commas, just newlines.

S3 multipart · gzip compressed · O(1) memory

edge.dataflirt.io — live

CAPTURED

// initializing delivery stream
stream.target: "s3://df-client-089/catalog_20260519.jsonl.gz"
stream.compression: "gzip (level 6)"

// raw worker output (uncompressed view)
{"id":"sku-101","price":45.99,"stock":true,"tags":["new","sale"]} \n
{"id":"sku-102","price":12.50,"stock":false,"tags":[]} \n
{"id":"sku-103","price":89.00,"stock":true,"tags":["premium"]} \n

// worker crash simulation
worker.status: SIGKILL (OOM in browser context)
stream.flush: success

// file integrity check
file.valid_json_array: false // would be broken if standard JSON
file.valid_json_lines: true // 3 records safely persisted
pipeline.recovery: resuming from sku-104
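One way a resume step like the one in this capture could work is sketched below. It is an assumption, not the worker's actual recovery logic: the hypothetical `recover` helper treats only newline-terminated lines as complete and reports the byte offset after the last good one, so a partially written tail can be discarded before appending.

```python
import json

def recover(data: bytes):
    """Return (records, safe_length) for a possibly truncated JSONL payload.

    safe_length is the number of leading bytes that form complete,
    parseable lines; anything after it is a partial record.
    """
    records, safe_length = [], 0
    pieces = data.split(b"\n")
    # The final piece is whatever followed the last newline: empty if the
    # file ended cleanly, otherwise a partially written record.
    for piece in pieces[:-1]:
        try:
            records.append(json.loads(piece))
        except ValueError:  # covers invalid JSON and invalid UTF-8
            break
        safe_length += len(piece) + 1
    return records, safe_length

# Two complete records followed by a record cut off mid-write.
data = b'{"id":"sku-101"}\n{"id":"sku-102"}\n{"id":"sku-1'
records, safe_length = recover(data)
clean = data[:safe_length]
last_id = records[-1]["id"]
```

Truncating the file to `safe_length` bytes leaves it ending on a newline, and `last_id` tells the pipeline which record to continue after.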

// 05 — format constraints

Where JSONL pipelines break.

JSON Lines is robust, but strict. A single unescaped newline inside a string value corrupts the format. Here is what causes JSONL parsing errors across downstream consumer pipelines.

Pipelines monitored: 300+ active

Delivery volume: 14TB / day

Updated: 2026-05-19

Unescaped newlines in strings: scraped text containing raw \n breaks the line delimiter.

Trailing commas: accidentally formatting like a JSON array.

Encoding mismatch: non-UTF-8 characters corrupting the line boundary.

Schema drift per line: line 500 has a different schema than line 1.

Missing trailing newline: EOF reached without a final \n delimiter.
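A consumer can check for these failure modes before ingestion. The sketch below is one possible pre-flight check under stated assumptions: `check_jsonl` and its messages are hypothetical, and "schema drift" is approximated by comparing each object's keys with the first record's keys.

```python
import json

def check_jsonl(data: bytes):
    """Return (line_number, problem) tuples; line_number is None for file-level problems."""
    problems = []
    first_keys = None
    if data and not data.endswith(b"\n"):
        problems.append((None, "missing trailing newline"))
    for number, raw in enumerate(data.split(b"\n"), start=1):
        if not raw.strip():
            continue  # blank lines and the empty piece after the final newline
        try:
            text = raw.decode("utf-8").strip()
        except UnicodeDecodeError:
            problems.append((number, "not valid UTF-8"))
            continue
        if text.endswith(","):
            problems.append((number, "trailing comma"))
            continue
        try:
            record = json.loads(text)
        except json.JSONDecodeError:
            # A raw newline inside a string shows up as two broken half-lines.
            problems.append((number, "invalid JSON"))
            continue
        if isinstance(record, dict):
            if first_keys is None:
                first_keys = set(record)
            elif set(record) != first_keys:
                problems.append((number, "keys differ from first record"))
    return problems

# One deliberately broken line per failure mode.
sample = (
    b'{"id": 1, "t": "a"}\n'
    b'{"id": 2, "t": "b"},\n'
    b'{"id": 3, "t": "line one\nline two"}\n'
    b'{"id": 4}\n'
    b'\xff\n'
    b'{"id": 5, "t": "e"}'
)
report = check_jsonl(sample)
```

Note how the unescaped newline is reported twice, once for each half of the split record.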

// 06 — our delivery stack

Never buffer, stream straight to the bucket.

DataFlirt writes JSON Lines directly to cloud storage using multipart uploads. As our extraction workers parse HTML, valid records are serialized to JSON, appended with a newline, and flushed to the network. We never buffer the full dataset in memory. If a worker crashes, the file isn't corrupted: the last valid line is simply the last valid record. Combined with gzip, this allows us to deliver 100GB+ datasets with a small, flat memory footprint on our end and in your ingestion pipeline.
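The write path described above can be approximated with the standard library. This is a sketch under assumptions, not the production uploader: `stream_records` is a hypothetical helper, and an in-memory `io.BytesIO` stands in for the multipart upload stream, which is not shown.

```python
import gzip
import io
import json

def stream_records(records, sink):
    """Write records one at a time as gzip-compressed JSONL into a binary sink.

    Records are consumed from an iterator, so the full dataset is never
    held in memory; only the current line and gzip's internal buffer are.
    """
    written = 0
    with gzip.GzipFile(fileobj=sink, mode="wb", compresslevel=6) as gz:
        for record in records:
            gz.write((json.dumps(record) + "\n").encode("utf-8"))
            written += 1
    return written

sink = io.BytesIO()  # assumption: stand-in for a network upload stream
source = ({"id": f"sku-{n}", "price": n * 1.5} for n in range(101, 104))
count = stream_records(source, sink)

# Decompress to confirm the payload is plain newline-delimited JSON.
restored = gzip.decompress(sink.getvalue()).decode("utf-8")
first = json.loads(restored.split("\n")[0])
```

Passing a generator rather than a list is the important design choice here: the writer pulls one record at a time from the extraction side.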

delivery.stream.status

Live metrics from a high-volume JSONL delivery job.

job.id: export-b2b-IN-042

format: application/jsonl (ndjson)

compression: gzip (ratio: 8.4x)

records.written: 4,192,000

memory.usage: 42 MB (flat)

newline.escaping: strict (enforced)

output.destination: s3://df-client-042/raw/


// 07 — FAQ

Common questions.

About JSON Lines, NDJSON, streaming architectures, and how DataFlirt formats high-volume data deliveries.


What is the difference between JSON Lines and NDJSON?

In practice, nothing. They are two names for what is effectively the same format. JSON Lines (JSONL) and Newline Delimited JSON (NDJSON) both dictate that each line is a valid JSON value, separated by a newline character (\n). DataFlirt uses the .jsonl extension by default, but the contents are identical.

Why not just use CSV for flat data?

CSV is terrible at handling nested data, arrays, and multiline text. If a scraped product has an array of 5 image URLs and a dictionary of technical specifications, flattening that into CSV requires complex escaping and arbitrary column limits. JSON Lines gives you the streaming benefits of CSV with the rich, nested data types of JSON.
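To make the contrast concrete, here is a small sketch with a hypothetical product record. JSON-encoding nested fields into CSV cells is just one common workaround, chosen for the example.

```python
import csv
import io
import json

# Hypothetical scraped record with an array and a nested dictionary.
record = {
    "id": "sku-101",
    "images": ["a.jpg", "b.jpg"],
    "specs": {"weight_kg": 1.2, "color": "black"},
}

# JSON Lines: the nested structure fits on one line as-is.
jsonl_line = json.dumps(record)

# CSV has no native nesting, so nested fields are stuffed into cells as
# JSON strings, which the csv module then has to quote and escape.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "images", "specs"])
writer.writerow([record["id"], json.dumps(record["images"]), json.dumps(record["specs"])])
csv_text = buffer.getvalue()

# Reading the CSV back yields strings that need a second decoding pass.
rows = list(csv.reader(io.StringIO(csv_text)))
images_cell = rows[1][1]
```

The JSONL line decodes straight back to the original record, while the CSV cell is still a string that has to be parsed again.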

How do I parse a JSON Lines file in Python?

Do not call json.load() on the file: it expects a single JSON document and reads the whole file at once. Instead, open the file and iterate over it line by line: for line in file: record = json.loads(line). This ensures you only ever hold one record in memory at a time, allowing you to process files that are significantly larger than your available RAM.
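A fuller version of that loop, as a minimal sketch. The `iter_jsonl` helper name and the temporary sample file are assumptions made so the example is self-contained.

```python
import json
import tempfile
from pathlib import Path

def iter_jsonl(path):
    """Yield one parsed record per non-empty line.

    The file object is iterated lazily, so only the current line
    is held in memory at any time.
    """
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)

# Write a two-record sample file, then aggregate over it record by record.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text(
        '{"id": "sku-101", "price": 45.99}\n{"id": "sku-102", "price": 12.5}\n',
        encoding="utf-8",
    )
    ids = [record["id"] for record in iter_jsonl(sample)]
    total = sum(record["price"] for record in iter_jsonl(sample))
```

Because `iter_jsonl` is a generator, aggregations like the `sum` above never materialize the full list of records.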

Can JSON Lines files be compressed?

Yes, and they compress exceptionally well. Because JSONL repeats schema keys on every line (e.g., "price", "title"), gzip or zstd compression algorithms achieve massive deduplication. DataFlirt routinely sees 8x to 12x compression ratios on JSONL deliveries. You can stream-decompress and stream-parse simultaneously.
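Stream-decompressing and stream-parsing together can be sketched with Python's gzip module in text mode. The `iter_gzipped_jsonl` helper and the synthetic data are assumptions for illustration; the ratio computed here depends entirely on that synthetic data and says nothing about real deliveries.

```python
import gzip
import io
import json

def iter_gzipped_jsonl(fileobj):
    """Decompress and parse incrementally, one line at a time."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)

# Synthetic payload: the same keys repeat on every line, which is what
# makes JSONL compress well.
raw = "".join(
    json.dumps({"title": f"item {n}", "price": n}) + "\n" for n in range(1000)
).encode("utf-8")
compressed = gzip.compress(raw)
ratio = len(raw) / len(compressed)

prices = [record["price"] for record in iter_gzipped_jsonl(io.BytesIO(compressed))]
```

In a real pipeline the `io.BytesIO` object would be replaced by an open file or a download stream; the parsing loop stays the same.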

What happens if scraped text contains a newline?

It must be escaped. A literal newline character in the data will break the JSONL format, as the parser will assume it's the end of the record. DataFlirt's extraction layer automatically escapes all internal newlines (converting them to \n within the JSON string) before writing the record to the stream.

Does DataFlirt support standard JSON arrays?

Yes, for small datasets. If your pipeline produces fewer than 50,000 records, we can deliver a standard JSON array. However, for enterprise pipelines delivering millions of records daily, we enforce JSONL or Parquet to ensure your ingestion infrastructure doesn't buckle under memory pressure.


hello@dataflirt.com · Bengaluru · IST · typical reply < 4h


Related glossary terms

NDJSON

S3 Data Delivery

Data Serialization

CSV Storage