
What is Rating Extraction?

Rating extraction is the process of parsing visual or obfuscated review scores—like CSS-width star ratings, SVG glyphs, or localized text—into normalized, typed numeric values. Because platforms rarely expose clean floats in the DOM, pipelines must translate visual rendering logic back into data. If your extraction layer treats a 4.5-star rating as a string of HTML classes, your downstream analytics will silently break.

Data Delivery · Normalization · E-commerce · Parsing · Schema Validation
// 02 — definitions

Visuals to floats.

How we turn CSS widths, half-star SVGs, and localized review strings into clean, queryable numeric data.


TL;DR

Rating extraction bridges the gap between how a browser renders a score and how a database queries it. It involves parsing inline styles, counting DOM elements, or reading JSON-LD structured data to produce a standardized numeric tuple. Without strict extraction contracts, you end up with a mix of strings, nulls, and un-castable formats.

01 Definition & structure
Rating extraction is the specific pipeline step responsible for identifying, parsing, and normalizing review scores and counts. Because ratings are designed for human visual consumption, they are rarely stored as clean data attributes in the DOM. The extraction layer must translate visual indicators—like the width of a colored star overlay—into a structured numeric format.
02 The visual extraction problem
Most modern platforms use CSS to render fractional ratings. A 4.2-star rating might be rendered as a container of five empty stars with a colored overlay element set to width: 84%. To extract the true value, the scraper must capture the inline style attribute, parse the percentage, divide it by 100, and multiply the result by the scale (5): 84 / 100 × 5 = 4.2. Relying on text alone often fails when platforms omit screen-reader fallbacks.
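The width-to-score step can be sketched in a few lines of Python. This is a minimal illustration, not DataFlirt's worker code: the function name `rating_from_style`, the default 5-point scale, and the decision to return `None` for unparseable input are all assumptions made for the example.

```python
import re
from typing import Optional

# Matches a percentage width inside an inline style string, e.g. "width: 84%".
# The lookbehind keeps properties such as "max-width" from matching.
_WIDTH_RE = re.compile(r"(?<![\w-])width\s*:\s*(\d+(?:\.\d+)?)\s*%", re.IGNORECASE)


def rating_from_style(style: str, scale: float = 5.0) -> Optional[float]:
    """Convert an inline CSS percentage width into a rating on `scale`."""
    match = _WIDTH_RE.search(style)
    if match is None:
        return None  # no percentage width found; the caller decides what to do
    percent = float(match.group(1))
    if not 0.0 <= percent <= 100.0:
        return None  # out-of-range widths are treated as unparseable
    return round(percent / 100.0 * scale, 2)
```

Calling `rating_from_style("width: 84%")` gives 4.2 on the default scale. Rounding to two decimals is a choice made here to absorb floating-point noise, not something the text above prescribes.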
03 Aggregate vs. individual ratings
Extraction logic must distinguish between an aggregate product score and an individual user's review score. Aggregate scores require extracting the total review count and the distribution histogram. Individual reviews require extracting the specific score, the review date, and the user identifier. Mixing these up in the schema destroys downstream analytical models.
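One way to keep the two kinds of rating apart is to give each its own record type, so a field like a review count simply cannot appear on an individual review. The class and field names below are hypothetical, chosen only to mirror the fields listed above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict


@dataclass(frozen=True)
class AggregateRating:
    """Product-level score: average, total count, and per-star histogram."""
    value: float
    scale: float
    review_count: int
    # Maps a star level (1..5) to the number of reviews at that level.
    histogram: Dict[int, int] = field(default_factory=dict)


@dataclass(frozen=True)
class ReviewRating:
    """A single user's score, tied to a review date and a user identifier."""
    value: float
    scale: float
    review_date: date
    user_id: str
```

Because the types are distinct, a downstream consumer can branch on the record type instead of guessing from which fields happen to be populated.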
04 How DataFlirt handles it
We enforce strict type coercion at the edge. Our extraction workers are configured to parse the visual DOM, strip locale-specific formatting (like European decimal commas), normalize the scale, and deliver clean numerics. If a rating cannot be cast to a float, or a review count cannot be cast to an integer, the record fails schema validation and is quarantined. We never deliver stringly-typed numbers.
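As a rough sketch of locale stripping plus quarantine, the snippet below accepts either a dot or a comma as the decimal mark. It assumes the input is a rating value (so a single separator is always a decimal mark, never a thousands separator); the function names and the shape of the returned dictionary are illustrative only.

```python
import re
from typing import Optional


def parse_locale_float(text: str) -> Optional[float]:
    """Parse the first number in `text`, treating ',' or '.' as the decimal mark."""
    match = re.search(r"\d+(?:[.,]\d+)?", text)
    if match is None:
        return None
    return float(match.group(0).replace(",", "."))


def coerce_rating(raw: str, scale: float = 5.0) -> dict:
    """Return a validated numeric record, or a quarantine marker if coercion fails."""
    value = parse_locale_float(raw)
    if value is None or not 0.0 <= value <= scale:
        return {"status": "quarantined", "raw": raw}
    return {"status": "validated", "rating_value": value, "rating_scale": scale}
```

For example, `coerce_rating("4,5 von 5 Sternen")` yields a validated record with `rating_value` 4.5, while `coerce_rating("n/a")` is quarantined rather than passed through as a string.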
05 The JSON-LD fallback trap
Many developers assume they can just grab the AggregateRating object from the page's JSON-LD script tag. While useful, this data is often heavily cached by the target's CDN for SEO purposes, while the visual rating is loaded dynamically via XHR. Relying solely on JSON-LD can result in extracting stale data that doesn't match what a user actually sees on the page.
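To make the comparison concrete, here is a standard-library sketch that pulls `aggregateRating` out of a page's JSON-LD and checks it against a score read from the live DOM. The `ratingValue` and `reviewCount` keys come from schema.org; the helper names, the 0.05 tolerance, and the assumption that `ratingValue` is dot-formatted are choices made for this example.

```python
import json
from html.parser import HTMLParser
from typing import List, Optional


class _JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self.blocks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data


def jsonld_aggregate_rating(html: str) -> Optional[dict]:
    """Return {'value', 'count'} from the first aggregateRating found, else None."""
    parser = _JsonLdCollector()
    parser.feed(html)
    parser.close()
    for block in parser.blocks:
        try:
            doc = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the page
        for item in doc if isinstance(doc, list) else [doc]:
            rating = item.get("aggregateRating") if isinstance(item, dict) else None
            if isinstance(rating, dict) and "ratingValue" in rating:
                count = rating.get("reviewCount")
                return {
                    "value": float(rating["ratingValue"]),
                    "count": int(count) if count is not None else None,
                }
    return None


def jsonld_disagrees(dom_value: float, jsonld: dict, tolerance: float = 0.05) -> bool:
    """True when the metadata score differs from the score read from the DOM."""
    return abs(dom_value - jsonld["value"]) > tolerance
```

A disagreement does not say which side is wrong, only that the two sources should not be merged silently.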
// 03 — the normalization math

Calculating the true score.

When platforms use CSS percentage widths to represent stars, the extraction layer must reverse-engineer the math to output a float on a standard 5-point or 10-point scale.

CSS width to 5-point scale: $R = \dfrac{\text{width}_{\%}}{100} \times 5.0$
Standard translation for inline-style star ratings. · Visual DOM extraction
Review count multiplier: $C = \text{base\_val} \times 10^{\text{suffix\_power}}$
Translating '1.2K' to 1200 or '1.5M' to 1500000. · DataFlirt normalization layer
Extraction confidence: $C_{\text{conf}} = 1.0$ if $\text{DOM\_score} = \text{JSON\_LD\_score}$, otherwise $0.5$
Cross-validating visual rendering against hidden metadata. · DataFlirt schema validation
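The review count multiplier can be sketched as follows. The suffix table (K, M, B), the function name `expand_count`, and the English-style convention that a comma is a thousands separator and a dot is the decimal mark are assumptions for this example; `Decimal` is used so that the base value is not subject to binary floating-point error.

```python
import re
from decimal import Decimal
from typing import Optional

# Power of ten applied for each suffix; an empty suffix leaves the value as-is.
_SUFFIX_POWER = {"": 0, "K": 3, "M": 6, "B": 9}
_COUNT_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([KMB]?)\s*$", re.IGNORECASE)


def expand_count(text: str) -> Optional[int]:
    """Expand '1.2K', '1.5M' or '1,200' into an integer; None if unparseable."""
    match = _COUNT_RE.match(text.replace(",", ""))
    if match is None:
        return None
    base_val = Decimal(match.group(1))
    suffix_power = _SUFFIX_POWER[match.group(2).upper()]
    return int(base_val * (Decimal(10) ** suffix_power))
```

Note that a suffixed count is already rounded by the site, so the expanded integer is only as precise as the displayed string. Inputs with trailing words, such as '12,401 ratings', return `None` in this sketch and would need the label stripped first.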
// 04 — extraction trace

Parsing stars from inline CSS.

A live trace of our extraction worker parsing an e-commerce product rating where the score is encoded in the inline width of a star background-image element.

CSS parsing · type coercion · schema validation
edge.dataflirt.io — live
CAPTURED
// input DOM node
node.html: "<i class='star-rating' style='width: 90%'><span class='blind'>4.5 out of 5</span></i>"

// strategy 1: text extraction
extract.text: "4.5 out of 5"
parse.regex: match [4.5, 5]

// strategy 2: visual width extraction
extract.css_width: "90%"
parse.math: 90 / 100 * 5 = 4.5

// validation
cross_check: pass // text and visual match
output.rating_value: 4.5
output.rating_scale: 5.0
schema.status: validated
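The two strategies in the trace, followed by the cross-check, might look like this in Python. It is a simplified reconstruction using only the standard library: the class and function names, the 0.05 tolerance, and the choice to quarantine when either strategy fails are assumptions, not a description of the actual worker.

```python
import re
from html.parser import HTMLParser


class _StarNode(HTMLParser):
    """Collects the first inline style and all text content of a rating node."""

    def __init__(self) -> None:
        super().__init__()
        self.style = ""
        self.text = ""

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style")
        if style and not self.style:
            self.style = style

    def handle_data(self, data):
        self.text += data


def extract_rating(node_html: str, tolerance: float = 0.05) -> dict:
    node = _StarNode()
    node.feed(node_html)
    node.close()

    # Strategy 1: screen-reader text such as "4.5 out of 5".
    text_match = re.search(r"(\d+(?:\.\d+)?)\s*out of\s*(\d+(?:\.\d+)?)", node.text)
    # Strategy 2: inline percentage width such as "width: 90%".
    width_match = re.search(r"width\s*:\s*(\d+(?:\.\d+)?)\s*%", node.style)
    if text_match is None or width_match is None:
        return {"status": "quarantined"}

    text_value, scale = float(text_match.group(1)), float(text_match.group(2))
    css_value = float(width_match.group(1)) / 100.0 * scale

    # Cross-check: both strategies must agree within the tolerance.
    if abs(text_value - css_value) > tolerance:
        return {"status": "quarantined"}
    return {"rating_value": text_value, "rating_scale": scale, "status": "validated"}
```

Fed the input node from the trace, this returns a validated record with `rating_value` 4.5 and `rating_scale` 5.0.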
// 05 — failure modes

Why rating extraction breaks.

Extracting numbers seems simple until you hit production scale. These are the most common reasons rating fields fail schema validation across our e-commerce pipelines.

Pipelines monitored: 180+ active
Schema checks: per record
Updated: 2026-05-19
01 Visual-only rendering: no text fallback, requires CSS or SVG parsing.
02 Localized decimal separators: comma vs. dot breaks float coercion.
03 Stale JSON-LD mismatch: metadata lags behind the live DOM score.
04 Review count string formats: '1.2K' vs. '1,200' parsing errors.
05 Missing scale context: 4/5 vs. 4/10 ambiguity.
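The scale ambiguity in the last item is the easiest to illustrate: a bare 4 is uninterpretable until the scale travels with it. The helper below is a hypothetical sketch that rescales to a 5-point range and refuses to guess when the scale is missing.

```python
from typing import Optional


def to_five_point(value: float, scale: Optional[float]) -> Optional[float]:
    """Rescale `value` from its native scale to a 5-point scale.

    Returns None when the scale is unknown or the value is out of range,
    so the caller can quarantine the record instead of guessing.
    """
    if scale is None or scale <= 0 or not 0 <= value <= scale:
        return None
    return round(value / scale * 5.0, 2)
```

With this, a 4 out of 5 stays 4.0 while a 4 out of 10 becomes 2.0, and a 4 with no known scale yields `None`.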
// 06 — DataFlirt's type system

Strict types, no stringly-typed analytics.

A rating is a float. A review count is an integer. If your pipeline delivers them as strings like '4.5 out of 5 stars' or '12,401 ratings', you are pushing extraction debt onto your data engineering team. DataFlirt's delivery layer enforces strict type coercion at the edge. We parse the visual DOM, strip the locale-specific formatting, normalize the scale, and deliver clean, query-ready numerics directly to your warehouse.

rating.schema.json

Standardized rating object output from a DataFlirt e-commerce pipeline.

rating.value 4.5
rating.scale 5.0
rating.count 12401
rating.is_aggregate true
validation.confidence 1.0
source.method css_width_calc
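A type check over the rating fields of such an object could be sketched as below. The flat dictionary layout, the key names without the `rating.` prefix, and the error messages are assumptions for illustration; a production pipeline would more likely use a schema library.

```python
from typing import List


def validate_rating_record(record: dict) -> List[str]:
    """Return a list of type or range errors; an empty list means the record passes."""
    errors: List[str] = []
    value = record.get("value")
    scale = record.get("scale")
    count = record.get("count")

    if not isinstance(value, float):
        errors.append("value must be a float")
    if not isinstance(scale, float):
        errors.append("scale must be a float")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(count, bool) or not isinstance(count, int) or count < 0:
        errors.append("count must be a non-negative integer")
    if not isinstance(record.get("is_aggregate"), bool):
        errors.append("is_aggregate must be a boolean")
    # The range check only makes sense once both numbers have the right type.
    if isinstance(value, float) and isinstance(scale, float) and not 0.0 <= value <= scale:
        errors.append("value must lie between 0 and scale")
    return errors
```

A record carrying '4.5 out of 5 stars' or '12,401 ratings' as strings fails the first and third checks, which is the point at which it would be quarantined rather than delivered.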


// 07 — FAQ

Common questions.

About visual parsing, type coercion, JSON-LD reliability, and how DataFlirt guarantees numeric accuracy at scale.

Why not just scrape the JSON-LD structured data?
JSON-LD is convenient but frequently stale. Many e-commerce platforms cache their HTML head and JSON-LD aggressively, while the visual DOM rating is injected dynamically via an API call. If you rely solely on JSON-LD, your extracted ratings will lag behind the actual page state. We use it as a cross-validation signal, not the primary source of truth.
How do you handle '1.2K' or '1.5M' review counts?
We normalize them to exact integers using locale-aware multipliers at extraction time. '1.2K' becomes 1200. Storing these as strings breaks downstream sorting and aggregation. Type coercion is a mandatory step in our schema validation layer.
What if a site uses custom SVG icons for stars?
We parse the SVG structure. If the site uses five separate SVG nodes, we count the fully filled nodes and calculate the percentage fill of the partial node. It requires a custom selector configuration, but the output remains a standard float.
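SVG markup varies widely between sites, so any example has to assume a structure. The sketch below assumes a hypothetical layout in which each star is an `<svg>` with a `viewBox`, and the filled portion is a `<rect class="fill">` whose `width` is expressed in viewBox units. It also assumes the fragment is well-formed XML without a namespace, which real pages often are not.

```python
import xml.etree.ElementTree as ET


def rating_from_svg_stars(markup: str) -> float:
    """Sum the fill fraction of every star in a well-formed XML fragment."""
    root = ET.fromstring(markup)
    total = 0.0
    for star in root.iter("svg"):
        # viewBox is "min-x min-y width height"; index 2 is the width.
        view_width = float(star.get("viewBox").split()[2])
        fill = star.find(".//rect[@class='fill']")
        fill_width = float(fill.get("width")) if fill is not None else 0.0
        # Clamp each star to the range 0..1 before adding it to the total.
        total += min(max(fill_width / view_width, 0.0), 1.0)
    return round(total, 2)
```

Four stars with a full-width fill and one with a half-width fill sum to 4.5. Sites that express the partial fill through a gradient stop or a clip path would need a different selector, which is why this step is configured per site.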
How does DataFlirt ensure rating accuracy?
Through cross-validation. We extract the score from the visual CSS, the hidden screen-reader text, and the JSON-LD. If they disagree, the record is flagged with a lower confidence score and quarantined for review. We never silently guess.
Can you extract the rating distribution histogram?
Yes. For aggregate product pages, we extract the percentage or absolute count of 5-star, 4-star, down to 1-star reviews, delivering them as a nested array of integers. This is critical for sentiment analysis, as a 4.0 average with high variance means something very different than a unanimous 4.0.
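The variance point can be checked with a short calculation over the histogram. This sketch assumes absolute counts ordered from 1-star to 5-star (reverse the list first if the feed orders them 5-star down to 1-star); the function name is illustrative.

```python
from typing import List, Tuple


def histogram_stats(counts: List[int]) -> Tuple[float, float]:
    """Return (mean, variance) for review counts ordered from 1-star to 5-star."""
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram contains no reviews")
    mean = sum(stars * n for stars, n in enumerate(counts, start=1)) / total
    variance = sum(n * (stars - mean) ** 2 for stars, n in enumerate(counts, start=1)) / total
    return mean, variance
```

A unanimous histogram `[0, 0, 0, 100, 0]` and a polarized one `[25, 0, 0, 0, 75]` both average 4.0, but the first has a variance of 0.0 and the second 3.0.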
What happens when a site changes its rating UI?
Our schema validation catches the type failure instantly. If a CSS class change causes the rating to extract as null or a string, the record fails validation, is quarantined, and alerts our engineers. The client's data contract is protected from schema drift.
hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h