
What is JSON-LD Extraction?

JSON-LD extraction is the process of parsing structured metadata embedded directly in a webpage's HTML, bypassing the need for brittle CSS selectors. Because publishers inject this data to feed Google's rich snippets, it typically contains clean, typed records for products, prices, and reviews. For a scraping pipeline, it is the highest-signal, lowest-maintenance extraction path available - until the publisher ships malformed JSON and breaks the standard parser.

Structured Data · Schema.org · Parsing · SEO Metadata · ETL

// 02 — definitions

Bypass the DOM.

Why scrape visual elements when the publisher has already handed you a clean, typed JSON object in the page source?


TL;DR

JSON-LD (JavaScript Object Notation for Linked Data) is the modern standard for SEO metadata. Extracting it involves finding the application/ld+json script tags, parsing the payload, and mapping the Schema.org vocabulary to your pipeline's output schema. It is vastly more stable than visual scraping, but requires robust error handling for syntax violations and stale cache states.

01 Definition & structure

JSON-LD (JavaScript Object Notation for Linked Data) is a method of encoding linked data using JSON. In the context of web scraping, JSON-LD extraction means reading the <script type="application/ld+json"> blocks embedded in a webpage's HTML. These blocks contain structured, machine-readable representations of the page's content, usually expressed in the Schema.org vocabulary. Because the data already arrives as a JSON object, fields can be read by key, which removes the need to write and maintain complex CSS or XPath selectors for each one.
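To make the first step concrete, here is a minimal sketch of locating those script blocks using only Python's standard library. The class name LdJsonCollector, the helper find_ld_json_blocks, and the sample HTML are illustrative assumptions, not DataFlirt's production code; a real pipeline would likely use whichever HTML parser it already depends on.

```python
from html.parser import HTMLParser


class LdJsonCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # raw payloads, one string per script block
        self._buffer = None  # a list only while inside a matching script tag

    def handle_starttag(self, tag, attrs):
        # Attribute values can be None (bare attributes), hence the `or ""`.
        type_attr = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and type_attr == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer).strip())
            self._buffer = None


def find_ld_json_blocks(html):
    """Return the unparsed JSON-LD payloads found in an HTML document."""
    collector = LdJsonCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks


# Hypothetical page: one JSON-LD block and one unrelated script.
sample_html = """
<html><head>
<script type="application/ld+json">{"@type": "Product", "name": "Industrial Servo Motor 5kW"}</script>
<script>var unrelated = 1;</script>
</head><body></body></html>
"""
blocks = find_ld_json_blocks(sample_html)
```

The function deliberately returns raw strings rather than parsed objects, so that parsing (and any repair of broken payloads) stays a separate, observable step.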

02 The Schema.org vocabulary

JSON-LD relies on a shared vocabulary to give meaning to its keys. The dominant standard is Schema.org, backed by major search engines. A typical e-commerce page will feature a @type: "Product" object, containing nested Offer objects for pricing and availability, and AggregateRating objects for reviews. By targeting these specific @type declarations, a single extraction script can pull product data from thousands of entirely different websites without custom selectors.
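As an illustration of targeting @type declarations, the sketch below walks a parsed payload and pulls the fields of every Product node. The function names and the output field names are assumptions chosen for this example; the Schema.org keys (offers, price, priceCurrency, availability, aggregateRating, ratingValue) are the standard ones. It also assumes nodes may be nested under @graph and that @type may be a string or a list.

```python
import json


def iter_nodes(payload):
    """Yield every dict in a parsed JSON-LD payload, descending into lists and nested values."""
    if isinstance(payload, dict):
        yield payload
        for value in payload.values():
            yield from iter_nodes(value)
    elif isinstance(payload, list):
        for item in payload:
            yield from iter_nodes(item)


def has_type(node, wanted):
    """@type may be a single string or a list of strings."""
    declared = node.get("@type")
    if isinstance(declared, str):
        return declared == wanted
    if isinstance(declared, list):
        return wanted in declared
    return False


def extract_products(payload):
    """Map every Product node to a flat record (output keys are hypothetical)."""
    records = []
    for node in iter_nodes(payload):
        if not has_type(node, "Product"):
            continue
        offers = node.get("offers")
        if isinstance(offers, list):  # some publishers emit a list of offers
            offers = offers[0] if offers else {}
        if not isinstance(offers, dict):
            offers = {}
        rating = node.get("aggregateRating")
        if not isinstance(rating, dict):
            rating = {}
        records.append({
            "name": node.get("name"),
            "sku": node.get("sku"),
            "price": offers.get("price"),
            "currency": offers.get("priceCurrency"),
            "availability": offers.get("availability"),
            "rating": rating.get("ratingValue"),
        })
    return records


# Hypothetical payload using the @graph container.
payload = json.loads("""
{"@context": "https://schema.org", "@graph": [
  {"@type": "BreadcrumbList", "itemListElement": []},
  {"@type": "Product", "name": "Industrial Servo Motor 5kW", "sku": "SM-5000-IND",
   "offers": {"@type": "Offer", "price": "1245.00", "priceCurrency": "USD",
              "availability": "https://schema.org/InStock"}}
]}
""")
products = extract_products(payload)
```

Nothing in the code refers to a specific site, which is what lets one script cover many publishers.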

03 The malformed JSON problem

The biggest operational hurdle with JSON-LD is syntax errors. Many CMS platforms and SEO plugins generate JSON-LD by concatenating strings rather than using proper JSON serialization libraries. This frequently results in trailing commas, unescaped double quotes inside product descriptions, or raw newline characters. A standard JSON.parse() call throws a SyntaxError on these payloads, and an unhandled exception costs you the record. Production pipelines must implement tolerant parsing or regex-based pre-cleaning to salvage the data.
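A minimal sketch of the regex-based pre-cleaning route, written in Python (where json.loads plays the role of JSON.parse). The function name parse_tolerant and the repair labels are assumptions for illustration. It covers only two of the failure classes named above, trailing commas and raw control characters inside strings; unescaped double quotes are ambiguous to repair and are deliberately left out.

```python
import json
import re

# A comma followed only by whitespace and a closing bracket or brace.
TRAILING_COMMA = re.compile(r",(\s*[}\]])")


def parse_tolerant(raw):
    """Try a strict parse first, then progressively more forgiving attempts.

    Returns (data, repair), where repair names the fix applied or is None
    when the payload was already valid. Raises json.JSONDecodeError if
    every attempt fails.
    """
    try:
        return json.loads(raw), None
    except json.JSONDecodeError:
        pass

    # strict=False permits raw control characters (newlines, tabs) inside strings.
    try:
        return json.loads(raw, strict=False), "allow_control_chars"
    except json.JSONDecodeError:
        pass

    # Caveat: this regex is not string-aware, so a literal ",}" or ",]"
    # inside a text value would also be rewritten.
    cleaned = TRAILING_COMMA.sub(r"\1", raw)
    return json.loads(cleaned, strict=False), "strip_trailing_comma"


# Hypothetical broken payload: trailing comma inside the image array.
broken = '{"@type": "Product", "image": ["a.jpg", "b.jpg",], "name": "Motor"}'
data, repair = parse_tolerant(broken)
```

Returning the repair label alongside the data makes it possible to count how often each fix fires, instead of silently papering over publisher bugs.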

04 How DataFlirt handles it

We prioritize JSON-LD extraction for its stability, but we engineer around its flaws. Our extraction workers use a custom AST-based JSON parser that automatically repairs common syntax errors. Once parsed, we map the Schema.org fields to the client's requested schema. Crucially, we configure cross-validation rules for high-value fields like price and stock status, comparing the JSON-LD payload against the rendered DOM to detect caching drift. If the JSON-LD is stale, we fall back to visual extraction automatically.

05 The UI drift trap

Publishers optimize JSON-LD for Googlebot, not for you. Because search engine crawlers don't need up-to-the-second pricing, many sites cache their JSON-LD blocks heavily to improve Time to First Byte (TTFB). Meanwhile, the actual price displayed to the user might be updated dynamically via a separate API call based on inventory levels or user location. Relying solely on JSON-LD for dynamic pricing intelligence is a common architectural mistake that leads to silently inaccurate datasets.
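One way to guard against this trap is to reconcile the two prices before emitting a record. The sketch below is an assumption about how such a check could look, not a description of DataFlirt's rules: reconcile_price, the source labels, and the 0.01 tolerance are all hypothetical, and to_decimal assumes prices that use a comma as a thousands separator and a dot as the decimal mark.

```python
from decimal import Decimal, InvalidOperation


def to_decimal(value):
    """Parse prices such as 1245, "1245.00" or "$1,245.00"; return None if unparseable."""
    if value is None:
        return None
    # Assumption: commas are thousands separators, never decimal marks.
    text = str(value).strip().replace(",", "").lstrip("$€£")
    try:
        return Decimal(text)
    except InvalidOperation:
        return None


def reconcile_price(ld_price, dom_price, tolerance=Decimal("0.01")):
    """Return (price, source), preferring JSON-LD unless it disagrees with the rendered DOM."""
    ld, dom = to_decimal(ld_price), to_decimal(dom_price)
    if ld is None and dom is None:
        return None, "missing"
    if dom is None:
        return ld, "json_ld_unverified"  # nothing to compare against
    if ld is None or abs(ld - dom) > tolerance:
        return dom, "dom_fallback"       # stale or absent JSON-LD price
    return ld, "json_ld"


# Hypothetical values: the JSON-LD price lags behind the rendered price.
price, source = reconcile_price("1245.00", "$1,310.00")
```

Keeping the source label on every record means stale JSON-LD shows up in the dataset as an explicit fallback rather than as a silently wrong number.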

// 03 — extraction metrics

Measuring structured data quality.

JSON-LD is a gift, but it is often a flawed one. DataFlirt tracks these three metrics on every pipeline that relies on structured data to ensure the payload is both valid and accurate.

Parse Success Rate = valid_json_blocks / total_ld_blocks

Drops below 1.0 when publishers introduce trailing commas or unescaped quotes. (DataFlirt parser telemetry)

Schema Completeness = extracted_fields / expected_schema_fields

Measures how much of the target schema the publisher actually populated. (Pipeline validation layer)

UI-to-LD Drift = |ld_price - dom_price|

Flags when the SEO plugin cache lags behind the live visual price. (Cross-validation checks)
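The three formulas translate directly into code. This is a small sketch under stated assumptions: the function names mirror the metric names, a field counts as populated when it is not None, an empty string, or an empty list, and the sample record (including the unpopulated brand field) is hypothetical.

```python
def parse_success_rate(valid_json_blocks, total_ld_blocks):
    """valid_json_blocks / total_ld_blocks; None when there were no blocks to parse."""
    if total_ld_blocks == 0:
        return None
    return valid_json_blocks / total_ld_blocks


def schema_completeness(record, expected_schema_fields):
    """Share of expected fields that the publisher actually populated."""
    if not expected_schema_fields:
        return None
    extracted = [f for f in expected_schema_fields
                 if record.get(f) not in (None, "", [])]
    return len(extracted) / len(expected_schema_fields)


def ui_to_ld_drift(ld_price, dom_price):
    """Absolute difference between the JSON-LD price and the rendered price."""
    return abs(ld_price - dom_price)


# Hypothetical record with one expected field left unpopulated.
record = {"name": "Industrial Servo Motor 5kW", "sku": "SM-5000-IND",
          "price": 1245.00, "currency": "USD",
          "availability": "InStock", "brand": None}
expected = ["name", "sku", "price", "currency", "availability", "brand"]

rate = parse_success_rate(valid_json_blocks=1, total_ld_blocks=2)
completeness = schema_completeness(record, expected)
drift = ui_to_ld_drift(ld_price=1245.00, dom_price=1245.00)
```

Note that the parse success rate here counts blocks that parsed without repair, which is why a single trailing comma is enough to pull it below 1.0.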

// 04 — parser trace

Extracting and repairing a product record.

A live trace of a DataFlirt extraction worker pulling a Schema.org/Product record. The publisher's SEO plugin generated invalid JSON, triggering our automated syntax repair before mapping.

AST repair · Schema.org · cross-validation

edge.dataflirt.io — live

CAPTURED

// locate structured data
dom.query: "script[type='application/ld+json']"
blocks_found: 2

// parse block 1 (BreadcrumbList)
parse.status: success
schema.type: "BreadcrumbList" // ignored

// parse block 2 (Product)
parse.status: SyntaxError: Unexpected token ] in JSON at position 412
repair.strategy: "strip_trailing_comma"
parse.retry: success

// extract and validate
record.name: "Industrial Servo Motor 5kW"
record.sku: "SM-5000-IND"
record.price: 1245.00
record.currency: "USD"
record.availability: "InStock"

// cross-validation
dom.price_check: 1245.00 // match
pipeline.status: record yielded

// 05 — failure modes

Why structured data pipelines break.

JSON-LD is theoretically perfect, but practically messy. These are the most common reasons a JSON-LD extraction job fails or returns bad data, ranked by frequency across our fleet.

PIPELINES · 180+ active

LD BLOCKS · 45M/day

UPDATED · 2026-05-19

Malformed JSON syntax

42% of errors · Trailing commas, unescaped quotes, raw newlines

Stale cache drift

28% of errors · LD price differs from DOM price due to caching

Missing mandatory fields

18% of errors · Publisher omitted SKU, brand, or price

Multiple conflicting blocks

8% of errors · Two Product schemas with different data

Vocabulary misuse

4% of errors · Using string for price instead of Offer object

// 06 — DataFlirt architecture

Extract, repair, and validate against the visual truth.

DataFlirt treats JSON-LD as a primary extraction target but never trusts it blindly. Our parsers automatically repair common syntax errors like trailing commas and unescaped control characters using AST-based recovery. More importantly, we run cross-validation checks: if the JSON-LD says a product is in stock but the DOM button says 'Sold Out', our pipeline flags the anomaly. Structured data is only useful if it reflects reality.

JSON-LD Extraction Job

Real-time metrics from a product catalog extraction worker.

target.domain industrial-supply-b2b.com

blocks.processed 14,200

syntax.repairs 312 auto-fixed

schema.match Product, Offer

cross_val.failures 14 records

fallback.triggered DOM extraction

yield.success 99.9%


// 07 — FAQ

Common questions.

Common questions about relying on JSON-LD, handling broken syntax, and ensuring data accuracy.


Why prefer JSON-LD over CSS selectors?

CSS selectors break when a site redesigns its UI or changes class names. JSON-LD is injected specifically for machines (search engines) and rarely changes format. It provides a stable, typed data contract that survives visual redesigns, drastically reducing pipeline maintenance costs.

What happens when the JSON-LD is invalid?

Standard JSON parsers throw a syntax error on invalid input, and a pipeline that does not handle it drops the record. This is very common: publishers often manually concatenate strings to build the JSON, resulting in trailing commas or unescaped quotes. DataFlirt uses a tolerant parser that repairs these common syntax violations on the fly before mapping the data.

Can I rely on JSON-LD for real-time pricing?

Not blindly. Many e-commerce platforms cache their SEO metadata (including JSON-LD) aggressively, while the visual price is updated dynamically via JavaScript. If you need spot-pricing accuracy, you must cross-validate the JSON-LD price against the rendered DOM price. If they drift, the DOM is usually the source of truth.

How does DataFlirt handle multiple JSON-LD blocks on one page?

We parse all blocks and filter by the `@type` attribute (e.g., `Product`, `Recipe`, `Article`). If a page contains multiple blocks of the same target type, we merge them using a deterministic priority rule, or extract them as an array of records depending on the client's schema requirements.
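The answer above does not specify the priority rule, so the following is only one possible shape for it. In this sketch, merge_records and the completeness-based ordering are assumptions: records are ordered by an optional key function, and for each field the first non-empty value wins. Python's sort is stable, so ties fall back to document order and the result stays deterministic.

```python
EMPTY = (None, "", [])


def merge_records(records, priority=None):
    """Merge same-type records field by field; the first non-empty value wins.

    `priority` is an optional sort key (lower sorts first). Without it,
    records are taken in document order.
    """
    ordered = sorted(records, key=priority) if priority else list(records)
    merged = {}
    for record in ordered:
        for field, value in record.items():
            if merged.get(field) in EMPTY and value not in EMPTY:
                merged[field] = value
    return merged


def by_completeness(record):
    """Hypothetical priority: records with more populated fields come first."""
    return -sum(1 for value in record.values() if value not in EMPTY)


# Two hypothetical Product blocks from the same page, with conflicting data.
first = {"name": "Motor", "sku": None, "price": "1245.00"}
second = {"name": "Motor (old)", "sku": "SM-5000-IND",
          "price": "1199.00", "brand": "Acme"}

in_document_order = merge_records([first, second])
most_complete_first = merge_records([first, second], priority=by_completeness)
```

When the client schema calls for an array instead, the same input list can simply be returned unmerged.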

Do all sites use Schema.org vocabularies?

The vast majority do, because Google mandates it for rich search results. However, adherence to the strict Schema.org specification varies wildly. You will frequently encounter custom fields, nested objects where strings are expected, or missing mandatory properties. Your extraction logic must be defensive.
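A short sketch of what defensive can mean in practice. The helper names normalize_offer and missing_fields, the output keys, and the choice of required fields are assumptions for illustration. The code accepts an Offer object, a list of offers, or a price placed directly on the Product node, and reports missing properties instead of raising.

```python
def normalize_offer(product):
    """Return price, currency and availability from whichever shape the publisher used."""
    offers = product.get("offers")
    if isinstance(offers, list):
        # Take the first well-formed offer; ignore stray strings or nulls.
        offers = next((o for o in offers if isinstance(o, dict)), {})
    if not isinstance(offers, dict):
        offers = {}

    # Vocabulary misuse: some publishers put price directly on the Product.
    price = offers.get("price", product.get("price"))
    currency = offers.get("priceCurrency", product.get("priceCurrency"))

    availability = offers.get("availability")
    if isinstance(availability, str):
        # "https://schema.org/InStock" -> "InStock"
        availability = availability.rstrip("/").rsplit("/", 1)[-1]

    return {"price": price, "currency": currency, "availability": availability}


def missing_fields(record, required=("name", "sku", "price")):
    """List required fields that are absent or empty (the required set is hypothetical)."""
    return [field for field in required if record.get(field) in (None, "")]


# Hypothetical nodes showing two of the shapes found in the wild.
flat = normalize_offer({"@type": "Product", "price": "1245.00", "priceCurrency": "USD"})
listed = normalize_offer({"offers": [{"price": 1245.0, "priceCurrency": "USD",
                                      "availability": "https://schema.org/InStock"}]})
gaps = missing_fields({"name": "Motor", "price": "1245.00"})
```

Returning the list of gaps, rather than failing on the first one, lets the pipeline decide per client whether an incomplete record is still worth delivering.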

Is extracting JSON-LD legally different from scraping the DOM?

No. The legal framework governing web scraping (like the CFAA in the US or database rights in the EU) applies to the act of accessing and extracting the data, regardless of whether you parse a `