
What is Unit Standardization?

Unit standardization is the process of converting fragmented, source-specific measurements—like "lbs", "oz", "kg", or "pieces"—into a single, mathematically consistent baseline before data delivery. In scraping pipelines, raw text extraction is only half the battle. If you're building pricing intelligence or inventory models, a dataset with mixed units is functionally useless until normalized.

Data Delivery · Normalization · ETL · Schema Validation · Pricing Intelligence
// 02 — definitions

Apples to apples.

Why raw scraped text isn't enough, and how measurement normalization prevents downstream analytical failures.


TL;DR

Unit standardization intercepts raw strings like "₹400 / 500g" and "₹800 / kg", splitting them into numeric values and converting them to a common base (e.g., price per kg). It's a critical transformation layer that turns scraped text into queryable, aggregatable data.

01 Definition & structure
Unit standardization is the transformation step that converts diverse measurements into a single, agreed-upon baseline. A raw scrape might yield "500g", "0.5 kg", "1.1 lbs", and "17.6 oz". A standardized pipeline parses the numeric scalar and the unit string, applies a mathematical conversion factor, and outputs a consistent weight_kg of about 0.5 across all records (1.1 lbs and 17.6 oz each come to roughly 0.499 kg).
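The parse-then-convert step described above can be sketched in a few lines of Python. This is a minimal illustration, not DataFlirt's transform code: the names `KG_PER_UNIT`, `MEASUREMENT_RE`, and `to_weight_kg` are hypothetical, and the pattern only covers the simple "number followed by unit" shape.

```python
import re

# Conversion factors to kilograms (international avoirdupois definitions).
KG_PER_UNIT = {
    "kg": 1.0,
    "g": 0.001,
    "lb": 0.45359237,
    "lbs": 0.45359237,
    "oz": 0.028349523125,
}

# A number (integer or decimal), optional whitespace, then a run of letters.
MEASUREMENT_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*$")


def to_weight_kg(raw: str) -> float:
    """Parse a string such as '500g' or '1.1 lbs' and return kilograms."""
    match = MEASUREMENT_RE.match(raw)
    if match is None:
        raise ValueError(f"cannot parse measurement: {raw!r}")
    scalar = float(match.group(1))
    unit = match.group(2).lower()
    if unit not in KG_PER_UNIT:
        raise ValueError(f"unknown unit: {unit!r}")
    return scalar * KG_PER_UNIT[unit]


# The four raw strings from the definition above.
for raw in ["500g", "0.5 kg", "1.1 lbs", "17.6 oz"]:
    print(raw, "->", round(to_weight_kg(raw), 3), "kg")
```

Unknown or unparseable inputs raise instead of returning a guess, which leaves the decision to quarantine with the caller.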
02 The parsing challenge
Extracting units is notoriously difficult because humans write them inconsistently. A scraper must handle abbreviations ("in", "inch", "\""), spacing variations ("500g" vs "500 g"), typographical errors ("0z"), and composite strings ("2x4x8"). Simple regex often fails here; robust pipelines require dedicated NLP or tokenization libraries mapped to a comprehensive alias dictionary.
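One way to picture the alias dictionary is a suffix match against known spellings, longest first. The sketch below is an assumption-laden illustration: the `UNIT_ALIASES` table is a tiny hypothetical sample, and treating "0z" as a misspelling of "oz" is a policy choice a real pipeline would need to validate.

```python
import re

# Hypothetical alias table: every known spelling maps to one canonical name.
UNIT_ALIASES = {
    "in": "inch", "inch": "inch", "inches": "inch", '"': "inch",
    "g": "gram", "gram": "gram", "grams": "gram",
    "oz": "ounce", "0z": "ounce", "ounce": "ounce", "ounces": "ounce",
}

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def split_measurement(raw: str):
    """Return (scalar, canonical_unit), or None when nothing matches."""
    text = raw.strip().lower()
    # Try longer aliases first so 'inches' is preferred over a shorter suffix.
    for alias in sorted(UNIT_ALIASES, key=len, reverse=True):
        if text.endswith(alias):
            number = text[: -len(alias)].strip()
            # Accept the alias only if what remains is a clean number.
            if NUMBER_RE.fullmatch(number):
                return float(number), UNIT_ALIASES[alias]
    return None


print(split_measurement("500g"))
print(split_measurement("500 g"))
print(split_measurement('12"'))
print(split_measurement("12 0z"))
print(split_measurement("2x4x8"))  # composite strings need their own parser
```

Matching on the suffix rather than scanning left to right avoids reading "120z" as the number 120 followed by an unknown unit "z".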
03 Locale and context dependency
Units are heavily dependent on geography and industry. A "gallon" in the US (3.785 liters) is different from an Imperial gallon in the UK (4.546 liters). A "ton" could be a short ton (2,000 lbs), a long ton (2,240 lbs), or a metric tonne (1,000 kg). Standardization logic must incorporate the source website's locale to apply the correct conversion factor.
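A locale-aware lookup might key the conversion factor on both the unit and the source locale. The sketch below assumes locale tags such as "en-US" and "en-GB" and a hypothetical `conversion_factor` helper; it refuses to answer for a locale-sensitive unit when the locale is not in the table.

```python
# Factors keyed by (unit, locale): gallons in liters, tons in kilograms.
LOCALE_FACTORS = {
    ("gallon", "en-US"): 3.785411784,  # US liquid gallon
    ("gallon", "en-GB"): 4.54609,      # Imperial gallon
    ("ton", "en-US"): 907.18474,       # short ton (2,000 lb)
    ("ton", "en-GB"): 1016.0469088,    # long ton (2,240 lb)
}

# Units whose meaning does not depend on the source locale.
GLOBAL_FACTORS = {
    "tonne": 1000.0,  # metric tonne, in kilograms
    "liter": 1.0,
}


def conversion_factor(unit: str, locale: str) -> float:
    """Return the factor for a unit, using the locale where it matters."""
    if (unit, locale) in LOCALE_FACTORS:
        return LOCALE_FACTORS[(unit, locale)]
    if unit in GLOBAL_FACTORS:
        return GLOBAL_FACTORS[unit]
    raise LookupError(f"no factor for {unit!r} in locale {locale!r}")


print(conversion_factor("gallon", "en-US"))
print(conversion_factor("gallon", "en-GB"))
print(conversion_factor("tonne", "de-DE"))
```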
04 How DataFlirt handles it
We enforce unit standardization at the delivery layer via strict schema contracts. When a pipeline is built, the client defines the target base units. Our extraction workers pull the raw strings, and a dedicated transform microservice handles the coercion. If a unit is missing or ambiguous, we use DOM context (like category breadcrumbs) to infer it. If it remains unresolvable, the record is quarantined to prevent poisoning the dataset.
05 The silent failure of implicit units
The most dangerous extraction failure isn't a crash—it's a bad assumption. If a B2B site lists a steel beam's weight as "150" without units, a naive scraper might assume kilograms. If the site actually meant pounds, every downstream price-per-kg calculation will be off by a factor of 2.2. This is why explicit bounds checking and anomaly detection are mandatory for standardized fields.
// 03 — the conversion model

How do we normalize scale?

Standardization requires separating the scalar from the unit, applying a conversion factor, and recalculating associated metrics like price. Here is the logic DataFlirt applies during the delivery transform.

Base conversion: $V_{\text{base}} = V_{\text{raw}} \times C_{\text{factor}}$
Maps '500g' to '0.5kg' using a deterministic lookup table. Source: DataFlirt Transform Layer
Price per base unit: $P_{\text{base}} = P_{\text{raw}} / (V_{\text{raw}} \times C_{\text{factor}})$
The core metric for competitive pricing intelligence. Source: Standard retail analytics
Standardization yield: $Y = \text{records\_normalized} / \text{total\_records\_with\_units}$
DataFlirt SLO requires Y > 0.99 for production pipelines. Source: Internal Quality Metric
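The price and yield formulas translate directly into code. The following is a minimal sketch with hypothetical function names; the inputs come from the TL;DR example ("₹400 / 500g" and "₹800 / kg") and assume the scalar, unit factor, and price have already been parsed.

```python
def price_per_base_unit(price_raw: float, value_raw: float, factor: float) -> float:
    """P_base = P_raw / (V_raw * C_factor)."""
    value_base = value_raw * factor  # V_base = V_raw * C_factor
    if value_base <= 0:
        raise ValueError("base quantity must be positive")
    return price_raw / value_base


def standardization_yield(records_normalized: int, total_records_with_units: int) -> float:
    """Y = records_normalized / total_records_with_units."""
    if total_records_with_units <= 0:
        raise ValueError("no records with units")
    return records_normalized / total_records_with_units


# Both listings resolve to the same price per kilogram once normalized.
print(price_per_base_unit(400, 500, 0.001))  # 400 rupees for 500 g
print(price_per_base_unit(800, 1, 1.0))      # 800 rupees for 1 kg

# Yield check against the stated 0.99 threshold.
print(standardization_yield(990, 1000) > 0.99)
```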
// 04 — pipeline transform trace

Raw strings to normalized floats.

A live trace of a DataFlirt extraction worker parsing a B2B grocery catalog. Notice how implicit units and mixed formats are coerced into a strict schema.

regex parsing · unit coercion · schema validation
edge.dataflirt.io — live
CAPTURED
// input record 1
raw.title: "Organic Almonds - 500g"
raw.price: "₹450"

// input record 2
raw.title: "Premium Walnuts 2.2 lbs"
raw.price: "₹1,800"

// transform worker
extract.unit[1]: "g" -> coerce: "kg" (factor: 0.001)
extract.unit[2]: "lbs" -> coerce: "kg" (factor: 0.453592)

// output record 1
norm.weight_kg: 0.500
norm.price_per_kg: 900.00

// output record 2
norm.weight_kg: 0.998
norm.price_per_kg: 1803.78

status: schema valid -> written to S3
// 05 — edge cases

Where unit parsing breaks down.

Ranked by frequency of occurrence in DataFlirt's e-commerce and manufacturing pipelines. The hardest problems aren't math—they're missing context.

PIPELINES MONITORED: 180+ active
UNIT ALIASES: 400+ mapped
UPDATED: 2026-05-19
01 Implicit units (context missing): e.g., 'Size: 42'. Is it cm, inches, or EU shoe size?
02 Packaging multipliers (nested math): 'Case of 12 x 500ml' requires multi-step extraction.
03 Typographical errors (regex failure): OCR or manual entry errors like '0z' instead of 'oz'.
04 Multi-dimensional units (format drift): '2x4x8' without specifying inches or cm.
05 Locale ambiguity (regional drift): US fluid ounces vs Imperial fluid ounces.
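The multi-dimensional case can be illustrated with a small parser that splits on "x" and refuses to convert when no unit is stated or supplied. This is a hedged sketch: `parse_dimensions`, `MM_PER_UNIT`, and the `default_unit` parameter are hypothetical names, and millimeters are chosen as the base only for illustration.

```python
import re

MM_PER_UNIT = {"in": 25.4, "cm": 10.0, "mm": 1.0}

# Two or more numbers separated by 'x', optionally followed by a unit.
DIMS_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?(?:\s*x\s*\d+(?:\.\d+)?)+)\s*([a-z]+)?\s*$",
    re.IGNORECASE,
)


def parse_dimensions(raw: str, default_unit=None):
    """Return each dimension in millimeters, or None if unresolvable."""
    match = DIMS_RE.match(raw)
    if match is None:
        return None
    unit = match.group(2) or default_unit
    if unit is None:
        return None  # implicit unit: leave it to the caller to quarantine
    unit = unit.lower()
    if unit not in MM_PER_UNIT:
        return None
    parts = re.split(r"\s*x\s*", match.group(1), flags=re.IGNORECASE)
    return tuple(float(part) * MM_PER_UNIT[unit] for part in parts)


print(parse_dimensions("2x4x8 in"))
print(parse_dimensions("2x4x8"))                      # no unit, no default
print(parse_dimensions("2x4x8", default_unit="cm"))   # unit from context
```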
// 06 — our architecture

Extract the string, deliver the math.

DataFlirt treats unit standardization as a first-class citizen in the delivery layer. We don't just regex for 'kg' and hope for the best. Our extraction schemas define the expected physical dimension (mass, volume, length), and our transform workers map over 400 locale-specific unit aliases to a single client-defined baseline. If a unit cannot be deterministically resolved, the record is flagged for quarantine—because a bad conversion is infinitely worse than a null value.

Transform Job Status

Live metrics from a unit standardization worker processing a global hardware catalog.

job.id: transform-units-099
dimension: length
target_base: millimeters (mm)
aliases_mapped: 14 (in, ", inch, cm, m)
records.processed: 45,210
records.quarantined: 12
yield.rate: 99.97%


// 07 — FAQ

Common questions.

Common questions about measurement extraction, locale handling, and how DataFlirt ensures mathematical consistency in delivered datasets.

Why not just deliver the raw string and let the data warehouse handle it?
Because extraction context is lost once the data leaves the pipeline. If a scraper sees "Size: 12" under a "Footwear (UK)" category header, it can infer the unit. If you just deliver "12" to the warehouse, the analytics team has no way to know if that's US, UK, or EU sizing. Standardization must happen at the edge where context is richest.
How do you handle products sold by volume vs weight?
We maintain separate dimensional schemas. You cannot deterministically convert milliliters to grams without knowing the specific gravity of the product. If a client requests "price per kg" but the target site lists "price per liter", we deliver the normalized volume metric and flag the dimensional mismatch, rather than guessing a conversion.
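Keeping dimensions separate can be sketched as a lookup from unit to physical dimension, with a flag raised when the source dimension differs from the requested one. The table contents, the `Normalized` record, and the `normalize` function are hypothetical illustrations of the idea, not the actual schema.

```python
from dataclasses import dataclass

# Each unit belongs to exactly one physical dimension.
UNIT_DIMENSION = {"kg": "mass", "g": "mass", "l": "volume", "ml": "volume"}
TO_BASE = {"kg": 1.0, "g": 0.001, "l": 1.0, "ml": 0.001}
BASE_UNIT = {"mass": "kg", "volume": "l"}


@dataclass
class Normalized:
    value: float
    unit: str
    dimension_mismatch: bool


def normalize(value: float, unit: str, requested_dimension: str) -> Normalized:
    """Normalize within the source dimension; never convert across dimensions."""
    dimension = UNIT_DIMENSION[unit]
    return Normalized(
        value=value * TO_BASE[unit],
        unit=BASE_UNIT[dimension],
        dimension_mismatch=(dimension != requested_dimension),
    )


# Client asked for mass, but the site lists a volume.
print(normalize(750, "ml", requested_dimension="mass"))
```

The volume is still normalized to liters; only the cross-dimension conversion is withheld.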
What happens when a site uses implicit units (e.g., just a number)?
We use hierarchical fallback logic. First, we check the product title. Next, the category breadcrumbs. Finally, the site-wide locale defaults. If the unit is still ambiguous, the record is quarantined. We never guess units on production pipelines.
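The fallback order (title, then breadcrumbs, then locale default, then quarantine) might look like the sketch below. The unit pattern, the `resolve_unit` function, and the returned source labels are assumptions made for illustration; a production version would rely on the full alias dictionary rather than a four-unit regex.

```python
import re
from typing import List, Optional, Tuple

# A unit token that is not glued to a preceding letter, e.g. '500g' or 'per kg'.
UNIT_RE = re.compile(r"(?<![a-z])(kg|g|lbs?|oz)\b", re.IGNORECASE)


def find_unit(text: str) -> Optional[str]:
    match = UNIT_RE.search(text)
    return match.group(1).lower() if match else None


def resolve_unit(
    title: str, breadcrumbs: List[str], locale_default: Optional[str]
) -> Tuple[Optional[str], str]:
    """Return (unit, source); source is 'quarantine' when nothing resolves."""
    # 1. Product title, 2. category breadcrumbs.
    for source, text in (("title", title), ("breadcrumbs", " > ".join(breadcrumbs))):
        unit = find_unit(text)
        if unit is not None:
            return unit, source
    # 3. Site-wide locale default, if one is configured.
    if locale_default is not None:
        return locale_default, "locale_default"
    # 4. Still ambiguous: do not guess.
    return None, "quarantine"


print(resolve_unit("Organic Almonds - 500g", [], None))
print(resolve_unit("Organic Almonds", ["Grocery", "Nuts (per kg)"], "lb"))
print(resolve_unit("Widget", ["Hardware"], None))
```

Returning the source alongside the unit keeps a record of how each value was inferred, which helps when auditing quarantined or suspicious records.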
How does DataFlirt handle packaging multipliers like '6-pack of 12oz cans'?
Our extraction layer parses these as composite fields: pack_size: 6, unit_size: 12, unit_type: oz. The transform worker then calculates the total volume (72 oz) and converts it to the requested base unit (e.g., about 2.13 liters, assuming US fluid ounces) before calculating the normalized price.
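A composite parse of this kind could be sketched as follows. The regex, the `parse_pack` and `total_liters` helpers, and the assumption that "oz" means US fluid ounces are all illustrative; only the field names pack_size, unit_size, and unit_type come from the answer above.

```python
import re

ML_PER_US_FL_OZ = 29.5735295625  # assumes US fluid ounces, not Imperial

# Matches strings like '6-pack of 12oz cans' or '4 pack of 330 ml'.
PACK_RE = re.compile(
    r"(\d+)\s*-?\s*pack of\s*(\d+(?:\.\d+)?)\s*(oz|ml)", re.IGNORECASE
)


def parse_pack(raw: str):
    """Split a pack description into composite fields, or return None."""
    match = PACK_RE.search(raw)
    if match is None:
        return None
    return {
        "pack_size": int(match.group(1)),
        "unit_size": float(match.group(2)),
        "unit_type": match.group(3).lower(),
    }


def total_liters(pack: dict) -> float:
    """Total volume of the whole pack, in liters."""
    ml_per_unit = {"oz": ML_PER_US_FL_OZ, "ml": 1.0}[pack["unit_type"]]
    return pack["pack_size"] * pack["unit_size"] * ml_per_unit / 1000.0


pack = parse_pack("6-pack of 12oz cans")
print(pack)
print(round(total_liters(pack), 2), "liters")
```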
Can I define my own base units for delivery?
Yes. During the pipeline scoping phase, you define the target schema. If your internal systems expect all weights in pounds and all lengths in inches, we configure the transform workers to output exactly that, regardless of what the source websites use.
How do you prevent catastrophic conversion errors?
Through strict bounds checking. If a conversion results in a "price per kg" that is 100x higher or lower than the category median, the record is flagged for review. This catches edge cases where a site lists "1000" meaning grams, but the scraper interpreted it as kilograms.
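The median-based check can be sketched with the standard library. The function name `flag_outliers` and the sample prices are hypothetical; the 100x threshold is the one stated above.

```python
from statistics import median
from typing import List


def flag_outliers(prices_per_kg: List[float], max_ratio: float = 100.0) -> List[int]:
    """Return indexes of prices at least max_ratio above or below the median."""
    mid = median(prices_per_kg)
    if mid <= 0:
        raise ValueError("category median must be positive")
    flagged = []
    for index, price in enumerate(prices_per_kg):
        ratio = price / mid
        if ratio >= max_ratio or ratio <= 1.0 / max_ratio:
            flagged.append(index)
    return flagged


# The last record treated '1000' grams as kilograms, so its price per kg
# is 1000x too low compared with its neighbours.
category = [900.0, 950.0, 1000.0, 0.9]
print(flag_outliers(category))
```

Note that the median here includes the suspect record itself; with only a handful of records per category, a more robust baseline (for example, a historical median) may be preferable.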
$ dataflirt scope --new-project --target=unit-standardization READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h