What is Currency Normalization?

Currency normalization is the extraction-layer process of converting raw, localized price strings (complete with ambiguous symbols, varying decimal separators, and trailing text) into a standardized numeric format and a discrete ISO 4217 currency code. In global e-commerce scraping, failing to normalize currency at the edge means downstream analytics pipelines will fail, or silently produce wrong totals, when they attempt to sum a column containing both "US$1,200.50" and "1.200,50 €".

Data Delivery · ETL · Price Scraping · ISO 4217 · Type Coercion
// 02 — definitions

Strings in,
numbers out.

Why storing scraped prices as raw text is a ticking time bomb for your data warehouse, and how to fix it before delivery.

TL;DR

Currency normalization splits a raw string like "₹72,400/MT" into three typed fields: a numeric value (72400), an ISO currency code ("INR"), and a unit ("MT"). It prevents type coercion failures and ensures global datasets can be queried mathematically without downstream regex gymnastics.
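
As a rough illustration of that split, here is a minimal Python sketch. The symbol table, the regex, and the `split_price` helper are all hypothetical, and the sketch assumes the input is already known to use comma grouping; it is not DataFlirt's parser.

```python
import re

# Hypothetical symbol table for this sketch; real symbols are often ambiguous.
SYMBOL_TO_ISO = {"₹": "INR"}

# A symbol, then comma-grouped digits, then an optional "/UNIT" suffix.
PRICE_PATTERN = re.compile(
    r"(?P<symbol>[^\d\s]+)\s*(?P<number>\d[\d,]*(?:\.\d+)?)(?:\s*/\s*(?P<unit>\w+))?"
)

def split_price(raw: str) -> dict:
    """Split a raw price string into value, ISO currency code, and unit."""
    match = PRICE_PATTERN.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"unrecognized price format: {raw!r}")
    symbol = match.group("symbol")
    if symbol not in SYMBOL_TO_ISO:
        raise ValueError(f"unknown currency symbol: {symbol!r}")
    # Dropping commas is only safe because this sketch assumes comma grouping.
    value = float(match.group("number").replace(",", ""))
    return {"value": value, "currency": SYMBOL_TO_ISO[symbol], "unit": match.group("unit")}

print(split_price("₹72,400/MT"))
# {'value': 72400.0, 'currency': 'INR', 'unit': 'MT'}
```

Anything the pattern does not recognize raises an error instead of returning a partial result, so an unexpected format is visible rather than silently mis-parsed.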

01 Definition & structure
Currency normalization is the process of taking a raw, human-readable price string from a webpage and transforming it into a machine-readable format. A normalized price record typically consists of three distinct fields:
  • price_numeric — A strict float or integer (e.g., 1249.99).
  • currency_code — A standard ISO 4217 string (e.g., "USD", "EUR", "JPY").
  • price_raw — The original unmutated string, kept for auditing and debugging.
Without this separation, downstream databases treat the column as text, making basic operations like `AVG(price)` impossible.
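
One way to picture the three-field record is a small typed container. The sketch below is an assumption about shape only: the `NormalizedPrice` class and its checks are hypothetical, and the currency check confirms the code looks like ISO 4217 (three uppercase ASCII letters), not that it is a registered code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class NormalizedPrice:
    price_numeric: Optional[float]  # None when the page shows no numeric price
    currency_code: Optional[str]    # ISO 4217 alphabetic code, e.g. "USD"
    price_raw: str                  # original string, kept unmodified for audits

    def __post_init__(self) -> None:
        code = self.currency_code
        # Shape check only; this does not look the code up in the ISO registry.
        if code is not None and not (
            len(code) == 3 and code.isascii() and code.isalpha() and code.isupper()
        ):
            raise ValueError(f"not an ISO 4217-shaped code: {code!r}")
        if self.price_numeric is not None and not isinstance(self.price_numeric, (int, float)):
            raise TypeError("price_numeric must be a number or None")

record = NormalizedPrice(1249.99, "USD", "US$ 1,249.99")
```

Because the numeric field is typed at construction, a string such as "1,249.99" cannot slip into the column that `AVG(price)` will later read.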
02 The localization trap
The most dangerous failure mode in price scraping is separator inversion. In the US and UK, "1,200.50" means one thousand two hundred units and fifty hundredths. In Germany or Brazil, the same value is written "1.200,50". If your scraper uses a naive regex to strip commas and then casts the result, the European string becomes "1.20050" and parses as 1.2005, roughly a thousand times too small. Normalization requires knowing the locale of the target page before casting the type.
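
The trap is easy to reproduce. The following Python sketch contrasts the naive comma strip with a locale-keyed parse; the `SEPARATORS` table and both function names are hypothetical, and a production pipeline would more likely lean on a maintained locale library than a hand-written table.

```python
def naive_parse(raw: str) -> float:
    """The anti-pattern: strip commas, then cast."""
    return float(raw.replace(",", ""))

# Hypothetical locale table: (thousands separator, decimal separator).
SEPARATORS = {
    "en-US": (",", "."),
    "en-GB": (",", "."),
    "de-DE": (".", ","),
    "pt-BR": (".", ","),
}

def locale_parse(raw: str, locale: str) -> float:
    """Remove the locale's grouping separator, then map its decimal mark to '.'."""
    thousands, decimal = SEPARATORS[locale]
    return float(raw.replace(thousands, "").replace(decimal, "."))

print(naive_parse("1.200,50"))            # 1.2005
print(locale_parse("1.200,50", "de-DE"))  # 1200.5
print(locale_parse("1,200.50", "en-US"))  # 1200.5
```

Note that the naive version raises no error on the German string. It returns a perfectly valid float, which is exactly why the mistake survives until someone checks the totals.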
03 Resolving ambiguous symbols
The "$" symbol is used by over 20 countries, including the US, Canada, Australia, and Mexico. The "kr" abbreviation covers the Swedish krona (SEK), the Norwegian krone (NOK), and the Danish krone (DKK). A robust normalization pipeline never relies on the symbol alone. It cross-references the symbol with the domain's TLD, the HTML lang attribute, and any available JSON-LD structured data to confidently assign the correct ISO 4217 code.
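
A minimal sketch of that cross-referencing, in Python, might look like the following. The lookup tables are deliberately tiny and hypothetical, as is the `resolve_currency` helper; the only outside assumption is that JSON-LD offers expose a currency through schema.org's `priceCurrency` property.

```python
from typing import Optional

# Hypothetical lookup tables for this sketch; coverage is deliberately tiny.
SYMBOL_CANDIDATES = {"$": {"USD", "CAD", "AUD", "MXN"}, "kr": {"SEK", "NOK", "DKK"}}
TLD_CURRENCY = {".com.au": "AUD", ".ca": "CAD", ".mx": "MXN",
                ".se": "SEK", ".no": "NOK", ".dk": "DKK"}
LANG_CURRENCY = {"en-US": "USD", "en-CA": "CAD", "en-AU": "AUD",
                 "sv-SE": "SEK", "nb-NO": "NOK", "da-DK": "DKK"}

def resolve_currency(symbol: str, hostname: str,
                     html_lang: Optional[str] = None,
                     jsonld_currency: Optional[str] = None) -> Optional[str]:
    """Return an ISO 4217 code, or None when the context is missing or conflicting."""
    candidates = SYMBOL_CANDIDATES.get(symbol)
    if candidates is None:
        return None
    # An explicit priceCurrency in JSON-LD wins if it is plausible for the symbol.
    if jsonld_currency in candidates:
        return jsonld_currency
    votes = set()
    for suffix, code in TLD_CURRENCY.items():
        if hostname.endswith(suffix):
            votes.add(code)
    if html_lang in LANG_CURRENCY:
        votes.add(LANG_CURRENCY[html_lang])
    votes &= candidates
    # Exactly one surviving candidate means the signals agree; anything else is unresolved.
    return votes.pop() if len(votes) == 1 else None

print(resolve_currency("$", "shop.example.com.au", html_lang="en-AU"))  # AUD
print(resolve_currency("$", "shop.example.com"))                         # None
```

Returning None on missing or conflicting signals, instead of defaulting to USD, leaves the decision to a review step rather than baking a guess into the dataset.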
04 How DataFlirt handles it
We enforce strict schema validation at the edge. When our workers extract a price, it is immediately passed through a locale-aware parsing library. If the string contains unexpected characters or the locale cannot be confidently determined, the record is quarantined. We never silently drop data or guess the currency. Our clients receive clean, typed JSON or Parquet files where every price column holds either a valid float or an explicit null.
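
The quarantine idea can be sketched in a few lines of Python. This is an illustration under assumptions, not DataFlirt's validator: the record layout reuses the field names from the definition above, and `validation_errors` and `route` are hypothetical helpers for records that are expected to carry a numeric price.

```python
import math

def validation_errors(record: dict) -> list:
    """List every reason a record should not be delivered as-is."""
    errors = []
    price = record.get("price_numeric")
    if (not isinstance(price, (int, float)) or isinstance(price, bool)
            or not math.isfinite(price)):
        errors.append("price_numeric is not a finite number")
    code = record.get("currency_code")
    if not (isinstance(code, str) and len(code) == 3
            and code.isascii() and code.isalpha() and code.isupper()):
        errors.append("currency_code is not ISO 4217-shaped")
    return errors

def route(records):
    """Split records into deliverable ones and quarantined ones; nothing is dropped."""
    accepted, quarantined = [], []
    for record in records:
        errors = validation_errors(record)
        if errors:
            quarantined.append({**record, "errors": errors})
        else:
            accepted.append(record)
    return accepted, quarantined

accepted, quarantined = route([
    {"price_numeric": 1249.99, "currency_code": "USD", "price_raw": "US$ 1,249.99"},
    {"price_numeric": None, "currency_code": "$", "price_raw": "$ 1.249,99?"},
])
```

Every input record ends up in exactly one of the two lists, and quarantined records keep their raw string plus the reasons they failed.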
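05 The silent failure of type coercion
Many amateur scraping scripts strip the currency symbol and then apply basic type coercion (like JavaScript's Number()) to the remaining DOM text. If a site updates its pricing format to include a trailing unit, changing "$100" to "$100/mo", the coercion starts returning NaN. JavaScript's parseFloat() fails differently: it returns NaN for any string that starts with a currency symbol, and it stops at the first character it cannot parse, so "1,200" silently becomes 1. If the pipeline isn't monitoring null rates, a format change like this will silently wipe out the pricing data for the entire target until a downstream consumer notices the empty dashboards weeks later.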
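
A null-rate guard is cheap to add. The Python sketch below pairs a strict cast with a threshold check; `strict_cast`, `assert_null_rate`, and the 0.1% default are assumptions for illustration (the default simply mirrors a 99.9% success target).

```python
import re
from typing import Iterable, Optional

PLAIN_NUMBER = re.compile(r"\d+(?:\.\d+)?")

def strict_cast(text: str) -> Optional[float]:
    """Cast only strings that are entirely a plain number; anything else becomes None."""
    text = text.strip()
    return float(text) if PLAIN_NUMBER.fullmatch(text) else None

def assert_null_rate(values: Iterable[Optional[float]], max_null_rate: float = 0.001) -> float:
    """Raise as soon as the share of failed casts exceeds the allowed rate."""
    values = list(values)
    if not values:
        raise ValueError("no price values to check")
    rate = sum(v is None for v in values) / len(values)
    if rate > max_null_rate:
        raise RuntimeError(f"price null rate {rate:.1%} exceeds {max_null_rate:.1%}")
    return rate

before = [strict_cast(t) for t in ["100", "250.50"]]        # site's old format
after = [strict_cast(t) for t in ["100/mo", "250.50/mo"]]   # after the format change

assert_null_rate(before)  # passes
# assert_null_rate(after) would raise RuntimeError: every cast failed
```

Run on each batch before delivery, a check like this turns a weeks-long silent outage into an alert on the first crawl after the site changes.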
// 03 — the parsing model

How accurate is
the conversion?

DataFlirt's extraction engine evaluates price strings against locale-aware heuristics before casting to float. Here is how we measure normalization health across our global retail pipelines.

Coercion success rate = 1 − (failed_casts / total_price_strings)
A drop below 99.9% usually indicates a site-side formatting change. (DataFlirt extraction SLO)

Ambiguity resolution = resolved_symbols / total_ambiguous_symbols
Mapping '$' to USD, CAD, or AUD based on page locale metadata. (DataFlirt schema validation)

Magnitude error risk: P(error) = f(separator_inversion, locale_mismatch)
Parsing '1.000' as 1 instead of 1000 ruins downstream aggregations. (Data Engineering heuristics)
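
The first two ratios translate directly into code. In this Python sketch the function names follow the formulas above, while the counts and the `SLO` constant are made-up illustration values, not DataFlirt measurements.

```python
def coercion_success_rate(failed_casts: int, total_price_strings: int) -> float:
    """1 - (failed_casts / total_price_strings)."""
    if total_price_strings <= 0:
        raise ValueError("total_price_strings must be positive")
    return 1 - failed_casts / total_price_strings

def ambiguity_resolution(resolved_symbols: int, total_ambiguous_symbols: int) -> float:
    """resolved_symbols / total_ambiguous_symbols."""
    if total_ambiguous_symbols == 0:
        return 1.0  # nothing was ambiguous, so nothing is unresolved
    return resolved_symbols / total_ambiguous_symbols

SLO = 0.999  # the 99.9% threshold mentioned above

# Hypothetical counts for one crawl batch.
rate = coercion_success_rate(failed_casts=3, total_price_strings=2000)
below_slo = rate < SLO  # True here: 3 failures in 2000 is about 99.85%
```

Tracking the rate per target site, instead of globally, makes a single site's format change stand out rather than being averaged away.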
// 04 — extraction pipeline trace

Raw DOM strings to
typed JSON records.

A live trace of DataFlirt's extraction layer processing international price strings. Notice how locale context dictates the decimal separator logic.

regex parsing · locale detection · ISO 4217
edge.dataflirt.io — live
CAPTURED
// raw extraction from DOM
dom.price_1: "US$ 1,249.99"
dom.price_2: "1.249,99 €"
dom.price_3: "Rp 14.500.000"
dom.price_4: "Call for price"

// normalization pipeline
parse(price_1): success val: 1249.99, cur: "USD"
parse(price_2): success val: 1249.99, cur: "EUR" // EU locale detected
parse(price_3): success val: 14500000, cur: "IDR"

// edge case handling
parse(price_4): non-numeric val: null, cur: null, raw: "Call for price"

// schema validation
schema.type_check: passed
output.write: "s3://df-client-042/prices/2026-05-19/"
// 05 — failure modes

Where currency
parsing breaks.

The most common reasons a raw price string fails to cast to a clean numeric value and ISO code across our global e-commerce pipelines. Separator inversion is the most dangerous because it fails silently.

PIPELINES MONITORED: 180+ retail
DAILY RECORDS: 45M+
UPDATED: 2026-05-19
01 Separator inversion: silent failure · EU comma vs US decimal (1.000 vs 1,000)
02 Ambiguous symbols: context loss · $ without locale could be USD, CAD, AUD
03 Embedded units/text: type error · Strings like 'per kg' or 'excl. VAT'
04 Invisible unicode: parse failure · Non-breaking spaces breaking regex
05 Dynamic geo-pricing: data drift · IP location overriding target locale
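
The invisible unicode failure in the list above is worth a concrete look, because the raw string and the cleaned string are indistinguishable on screen. This Python sketch uses the standard `unicodedata` module; the `clean_price_text` helper and the set of characters it strips are assumptions for illustration.

```python
import re
import unicodedata

# Zero-width characters and the byte-order mark; they render as nothing.
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def clean_price_text(raw: str) -> str:
    """Make whitespace visible and uniform before any regex runs."""
    # NFKC folds compatibility characters, e.g. no-break space (U+00A0) to a plain space.
    text = unicodedata.normalize("NFKC", raw)
    text = "".join(ch for ch in text if ch not in INVISIBLE)
    # Collapse any remaining runs of whitespace to single spaces.
    return " ".join(text.split())

FR_PRICE = re.compile(r"\d{1,3}(?: \d{3})*,\d{2} €")  # expects ordinary spaces

raw = "1\u00a0249,99\u00a0€"  # looks like "1 249,99 €" in a browser
print(FR_PRICE.fullmatch(raw) is not None)                    # False
print(FR_PRICE.fullmatch(clean_price_text(raw)) is not None)  # True
```

Keeping the untouched raw string alongside the cleaned one preserves the evidence when a match unexpectedly fails.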
// 06 — extraction architecture

Parse at the edge,
never in the warehouse.

DataFlirt normalizes currency at the extraction layer, before the record ever hits your S3 bucket. We map ambiguous symbols to ISO 4217 codes using the target domain's top-level domain (TLD) and explicit page metadata (like JSON-LD or hreflang tags). If a price string violates the expected format, it is quarantined for human review rather than silently coerced into a null or, worse, an incorrect magnitude.

normalization.job.status

Live metrics from a global retail pricing pipeline.

pipeline.id: global-retail-monitor
records.processed: 2.4M (nominal)
coercion.success: 99.98% (within SLO)
ambiguous.symbols: resolved via locale
quarantined.records: 412 records
schema.enforcement: strict

// 07 — FAQ

Common
questions.

Common questions about price extraction, locale handling, and how DataFlirt ensures financial data integrity at scale.

Why not just extract the raw string and clean it in dbt?
Because context is lost. At extraction time, we have access to the page's DOM, JSON-LD metadata, hreflang tags, and the proxy's exit node location. If you just send "$1,200" to the warehouse, the analytics team has no way to know if that was scraped from the Canadian or US version of the site. Normalize at the edge where context is richest.
How do you handle the '$' symbol when scraping internationally?
We never guess based on the symbol alone. DataFlirt's extraction engine cross-references the symbol with the domain TLD (e.g., .com.au), the HTML lang attribute, and embedded structured data. If the context is missing, we flag the record for review rather than assuming USD.
What happens when a site uses '1.000' for one thousand vs '1,000'?
This is separator inversion, and it's why naive regex fails. We use locale-aware parsing libraries. If we are scraping a German domain (de-DE), we know the period is a thousands separator and the comma is the decimal. If we scrape a US domain, the reverse applies. Hardcoding a global replace for commas will destroy your data.
Does DataFlirt convert the currency to a base currency (e.g., USD)?
No. We normalize the format, not the value. We extract the exact numeric value shown on the page and pair it with its ISO 4217 code (e.g., 1200.50, "EUR"). Currency conversion requires exchange rates which fluctuate by the second; that transformation belongs in your downstream BI tool, not the scraping pipeline.
How do you handle 'Price on Request' or 'Out of Stock' text in price fields?
We use explicit schema definitions with sentinel values. If the selector returns "Price on Request", the numeric price field is set to an explicit null, and we populate a secondary price_status string field with the raw text. This keeps the numeric column strictly typed for aggregations.
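
In Python terms, the pattern could look roughly like this. The `extract_price_fields` helper and its no-digits heuristic are hypothetical, and the numeric parser is passed in so that locale handling stays outside the sketch.

```python
from typing import Callable, Optional

def extract_price_fields(raw: str, parse: Callable[[str], float]) -> dict:
    """Return a typed numeric field plus a status field for non-numeric text."""
    text = raw.strip()
    # Heuristic for this sketch: text with no digits cannot be a price.
    if not any(ch.isdigit() for ch in text):
        return {"price_numeric": None, "price_status": text}
    return {"price_numeric": parse(text), "price_status": None}

def us_parse(text: str) -> float:
    """Toy parser for US-formatted dollar strings, used only for the demo."""
    return float(text.replace("$", "").replace(",", ""))

print(extract_price_fields("Price on Request", us_parse))
# {'price_numeric': None, 'price_status': 'Price on Request'}
print(extract_price_fields("$1,200", us_parse))
# {'price_numeric': 1200.0, 'price_status': None}
```

Aggregations can then filter on `price_numeric IS NOT NULL`, while the status column still explains why a given row has no price.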
Is scraping pricing data legal?
Generally, yes. Publicly available pricing data is factual information and not subject to copyright in most jurisdictions (including the US and EU). However, aggressive scraping that ignores robots.txt or degrades target server performance can lead to ToS disputes or CFAA claims. We operate strictly within compliant concurrency limits.