
What is Derived Field Computation?

Derived field computation is the step in a scraping pipeline where raw extracted values are transformed, combined, or evaluated to create new structured data points. Instead of just pulling a raw string like "Pack of 12 - $24.00", the pipeline computes the unit quantity, total price, and price-per-unit. It bridges the gap between how a website formats information for human eyes and how a database requires it for analytical queries.

Data Transformation · ETL · Schema Logic · Normalization · Post-Processing
// 02 — definitions

Compute,
don't just copy.

Raw HTML rarely matches your target database schema. Derivation is how you bridge the gap between presentation and utility.


TL;DR

Derived field computation turns implicit page data into explicit record attributes. Whether it's calculating a discount percentage from MSRP and sale price, or inferring out-of-stock status from a greyed-out CSS class, derivations are where business logic lives in the extraction layer.

01 Definition & structure

Derived field computation is the process of generating new data attributes by applying logic to raw extracted fields. While raw extraction pulls exactly what is on the page, derivation interprets it.

A standard derivation pipeline involves:

  • Input validation: Ensuring the raw strings exist and are not empty.
  • Type coercion: Converting strings like "$1,200.50" into floats like 1200.50.
  • Computation: Applying math, regex, or boolean logic.
  • Output assignment: Writing the result to a new, strongly-typed field in the JSON record.
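
The four steps above can be sketched in Python as follows. This is a minimal illustration, not DataFlirt's implementation: the function names `coerce_price` and `derive_unit_price`, the field names, and the assumption that prices use "." as the decimal separator are all made up for the example.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def coerce_price(raw: Optional[str]) -> Optional[Decimal]:
    """Steps 1-2: validate the raw string, then coerce it to a Decimal."""
    if raw is None or not raw.strip():
        return None  # input validation: missing or empty string
    # Keep digits and the decimal point; this drops "$", commas, and spaces.
    # Assumes "." is the decimal separator (would misread "1.200,50").
    cleaned = re.sub(r"[^\d.]", "", raw)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None  # nothing numeric survived the cleanup


def derive_unit_price(record: dict) -> dict:
    """Steps 3-4: compute the value and write it to a new typed field."""
    price = coerce_price(record.get("raw_price"))
    pack_size = record.get("pack_size")
    out = dict(record)
    if price is None or not pack_size or pack_size <= 0:
        out["price_per_unit"] = None  # explicit null, never a default of 0
    else:
        out["price_per_unit"] = float(round(price / pack_size, 2))
    return out
```

For a record such as `{"raw_price": "$24.00", "pack_size": 12}`, the sketch writes `price_per_unit` as `2.0`; with a missing price or a zero pack size it writes `None` instead of raising.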

02 Common computation patterns

Most derivations fall into three categories:

Mathematical: Calculating unit prices, discount percentages, or converting imperial measurements to metric.

Textual/Regex: Extracting a brand name from a long product title, or pulling a specific SKU out of a messy description block.

Logical/Stateful: Inferring boolean states. For example, setting is_promoted: true if a specific "Sponsored" CSS class is present on the parent DOM node.
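
One small function per category may make the distinction concrete. The sketch below is illustrative only: the "SKU: ABC-123" text layout and the lowercase "sponsored" class token are assumptions, and real pages will need their own patterns.

```python
import re
from typing import Optional

LB_TO_KG = 0.45359237  # definition of the international avoirdupois pound


def pounds_to_kg(pounds: float) -> float:
    """Mathematical: convert an imperial weight to metric."""
    return round(pounds * LB_TO_KG, 3)


def extract_sku(description: str) -> Optional[str]:
    """Textual: pull a SKU out of free text (assumed 'SKU: XYZ-1' layout)."""
    match = re.search(r"\bSKU[:\s]+([A-Z0-9-]+)", description)
    return match.group(1) if match else None


def is_promoted(parent_classes: str) -> bool:
    """Logical: infer a boolean from a CSS class on the parent DOM node."""
    return "sponsored" in parent_classes.lower().split()
```

Matching on whole class tokens rather than substrings avoids false positives from unrelated classes such as `unsponsored-item`.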

03 The risk of silent logic failures

Derivations are brittle because they assume the raw data format is stable. If an e-commerce site changes its price format from "$10.00" to "10.00 USD", a naive regex might fail to capture the number, resulting in a null value. If that null is passed into a unit-price calculation, the pipeline might crash or output garbage data.

This is why derivations must be wrapped in strict error handling. A failed computation should flag the record, not silently pass a null downstream.
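
A minimal sketch of that idea, using the format change described above. The regex, the `errors` list, and the `status` values are hypothetical choices for this example rather than a documented DataFlirt schema.

```python
import re

# Naive pattern: only matches prices written like "$10.00".
NAIVE_PRICE = re.compile(r"\$(\d+\.\d{2})")


def safe_unit_price(record: dict) -> dict:
    """Compute unit_price, flagging the record when any input is unusable."""
    out = dict(record, unit_price=None, errors=[])
    match = NAIVE_PRICE.search(record.get("price_string") or "")
    if match is None:
        out["errors"].append("price_string: unrecognized format")
    pack_size = record.get("pack_size")
    if not isinstance(pack_size, int) or pack_size <= 0:
        out["errors"].append("pack_size: missing or non-positive")
    if not out["errors"]:
        out["unit_price"] = round(float(match.group(1)) / pack_size, 2)
    # A failed computation flags the record instead of passing a bare null on.
    out["status"] = "quarantined" if out["errors"] else "ok"
    return out
```

With `"$10.00"` and a pack size of 4 the record comes back with `unit_price` 2.5 and status `ok`. With `"10.00 USD"` the regex misses, the record is marked `quarantined`, and the reason is recorded alongside it.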

04 How DataFlirt handles it

We build derivations directly into our schema contracts. When a client requests a price_per_kg field, we define the exact dependency graph (e.g., requires raw_price and weight_string). Our extraction workers execute these functions in memory immediately after parsing the DOM.

If a target site redesigns and breaks the inputs, our schema validation catches the derivation failure instantly, quarantines the affected records, and alerts our engineering team to patch the logic before the dataset is delivered.
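
To illustrate the dependency-graph idea, here is a small sketch built on Python's standard-library `graphlib` (Python 3.9+). It is not DataFlirt's engine: the `CONTRACT` structure, the intermediate fields `price` and `weight_kg`, and the assumed input formats ("$12.50", "2.5kg") are invented for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical contract: derived field -> (input fields, pure function).
CONTRACT = {
    "price": (["raw_price"], lambda r: float(r["raw_price"].lstrip("$"))),
    "weight_kg": (
        ["weight_string"],
        lambda r: float(r["weight_string"].removesuffix("kg")),
    ),
    "price_per_kg": (
        ["price", "weight_kg"],
        lambda r: round(r["price"] / r["weight_kg"], 2),
    ),
}


def run_contract(raw: dict) -> dict:
    """Evaluate every derived field after the fields it depends on."""
    graph = {field: set(deps) for field, (deps, _) in CONTRACT.items()}
    record = dict(raw)
    for field in TopologicalSorter(graph).static_order():
        if field in CONTRACT:  # raw inputs are graph nodes too; skip them
            _, fn = CONTRACT[field]
            record[field] = fn(record)
    return record
```

Declaring dependencies explicitly means the evaluation order never has to be maintained by hand, and a cycle in the contract raises `graphlib.CycleError` instead of producing a half-computed record.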

05 Extract vs. Derive: When to do what

A common mistake is trying to derive data that already exists cleanly elsewhere on the page. Before writing a complex regex to parse a brand name from a title, check the page's JSON-LD, meta tags, or hidden input fields. Often, the exact structured data you are trying to compute is already sitting in a hidden window.__INITIAL_STATE__ object.

Rule of thumb: Extract if it exists explicitly; derive only if you must infer it.
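
As a sketch of "extract first", the snippet below looks for a brand in a page's JSON-LD using only the standard library. The class and function names are hypothetical, and it assumes a single top-level schema.org `Product` object; real pages may nest products inside `@graph` or arrays.

```python
import json
from html.parser import HTMLParser
from typing import Optional


class JsonLdCollector(HTMLParser):
    """Collect parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed block: skip it and fall back to derivation


def explicit_brand(html: str) -> Optional[str]:
    """Return the brand from a JSON-LD Product block, or None if absent."""
    parser = JsonLdCollector()
    parser.feed(html)
    for block in parser.blocks:
        if isinstance(block, dict) and block.get("@type") == "Product":
            brand = block.get("brand")
            if isinstance(brand, dict):
                return brand.get("name")
            if isinstance(brand, str):
                return brand
    return None
```

Only when `explicit_brand` returns `None` would a title-parsing derivation need to run.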

// 03 — the logic

Common derivation
patterns.

Computations range from simple arithmetic to complex conditional logic. DataFlirt defines these as pure functions in the schema contract, ensuring they execute deterministically on every extracted record.

Unit Price Calculation · Standard e-commerce derivation
P_unit = price_raw / pack_size
Fails silently if pack_size defaults to 0. Requires strict input validation.

Discount Inference · Pricing intelligence pipelines
D_pct = (1 − price_sale / price_msrp) × 100
Requires cross-field validation to ensure MSRP > Sale Price.

Derivation Error Rate · DataFlirt extraction SLO
E_derive = failed_computations / total_records
DataFlirt quarantines records and alerts if E_derive > 0.01 per job.
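
The three formulas translate into pure functions along these lines. This is a hedged sketch: the function names, the choice to return `None` on invalid input, and the rounding to two decimals are assumptions, not part of any published contract.

```python
from typing import Optional


def unit_price(price_raw: float, pack_size: Optional[int]) -> Optional[float]:
    """P_unit = price_raw / pack_size, or None when pack_size is unusable."""
    if not pack_size or pack_size <= 0:
        return None
    return round(price_raw / pack_size, 2)


def discount_pct(price_sale: float, price_msrp: float) -> Optional[float]:
    """D_pct = (1 - price_sale / price_msrp) * 100, guarded by MSRP > sale."""
    if price_msrp <= 0 or price_sale >= price_msrp:
        return None  # cross-field check failed: inputs are contradictory
    return round((1 - price_sale / price_msrp) * 100, 2)


def derivation_error_rate(failed_computations: int, total_records: int) -> float:
    """E_derive = failed_computations / total_records (0.0 for an empty job)."""
    return failed_computations / total_records if total_records else 0.0
```

A job with 3 failed derivations out of 200 records yields a rate of 0.015, which exceeds the 0.01 threshold stated above.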
// 04 — computation trace

From raw strings
to typed attributes.

A live trace of a derivation engine processing a raw product record. The pipeline extracts three raw DOM strings and computes seven typed fields before delivery.

Regex extraction · Math operations · Type coercion
// 1. raw DOM extraction
dom.title: "Makita 18V LXT Lithium-Ion Battery (2-Pack)"
dom.price_string: "Was $149.00, Now $119.00"
dom.button_class: "btn-primary disabled out-of-stock"

// 2. regex & string parsing
derive.brand: "Makita" // matched against known brand dictionary
derive.pack_size: 2 // extracted from "(2-Pack)"

// 3. numeric extraction & math
derive.price_msrp: 149.00
derive.price_sale: 119.00
derive.discount_pct: 20.13 // computed
derive.price_per_unit: 59.50 // computed (119.00 / 2)

// 4. conditional logic
derive.in_stock: false // inferred from "out-of-stock" class

// 5. output validation
schema.status: PASS // all derived fields meet type constraints
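
The trace above can be approximated in a few lines of Python. This is a sketch under assumptions: `KNOWN_BRANDS` is a tiny illustrative dictionary, the regexes only cover the exact "(2-Pack)" and "Was $X, Now $Y" layouts shown, and the `derive` function is hypothetical rather than the engine that produced the trace.

```python
import re

# Illustrative brand dictionary; a production list would be far larger.
KNOWN_BRANDS = {"Makita", "DeWalt", "Bosch"}


def derive(dom: dict) -> dict:
    """Compute the derived fields shown in the trace from raw DOM strings."""
    out = dict.fromkeys(
        ["brand", "pack_size", "price_msrp", "price_sale",
         "discount_pct", "price_per_unit", "in_stock"]
    )
    # Brand: first title token found in the brand dictionary.
    out["brand"] = next(
        (word for word in dom["title"].split() if word in KNOWN_BRANDS), None
    )
    # Pack size: "(2-Pack)" -> 2.
    pack = re.search(r"\((\d+)-Pack\)", dom["title"])
    if pack:
        out["pack_size"] = int(pack.group(1))
    # Prices: "Was $149.00, Now $119.00" -> 149.0 and 119.0.
    prices = re.search(
        r"Was \$(\d+\.\d{2}), Now \$(\d+\.\d{2})", dom["price_string"]
    )
    if prices:
        msrp, sale = float(prices.group(1)), float(prices.group(2))
        out.update(
            price_msrp=msrp,
            price_sale=sale,
            discount_pct=round((1 - sale / msrp) * 100, 2),
        )
        if out["pack_size"]:
            out["price_per_unit"] = round(sale / out["pack_size"], 2)
    # Stock: inferred from the presence of the "out-of-stock" class token.
    out["in_stock"] = "out-of-stock" not in dom["button_class"].split()
    return out
```

Fields whose inputs fail to parse stay `None`, so a downstream validation step can tell a failed derivation apart from a legitimate value.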
// 05 — failure modes

Why derivations
break silently.

Derived fields are highly sensitive to upstream extraction quality. When raw data formats drift, computations fail. The failure modes below are ranked by frequency across DataFlirt's pipeline monitoring.

PIPELINES MONITORED · 300+ active
ERROR THRESHOLD · 1.0% per run
UPDATED · 2026-05-19

01 Type coercion failures
String to Float · Currency symbols or commas break numeric parsing

02 Missing dependent fields
Null pointer · Cannot compute unit price if pack size is missing

03 Regex boundary misses
Pattern drift · Brand name extraction fails when title format changes

04 Division by zero
Math error · Defaulting missing quantities to 0 breaks downstream math

05 Logical contradictions
Validation · Sale price computed as higher than MSRP
// 06 — our architecture

Compute at the edge,

validate at the core.

DataFlirt treats derived fields as first-class schema citizens. We define computations as pure functions that execute immediately after raw DOM extraction, forming a dependency DAG. If a derivation fails — because a price string changed format or a denominator went missing — the record is flagged and quarantined before it ever hits the delivery bucket. Silent nulls are the enemy of data engineering.

derivation_engine.log

Execution state of a derived field DAG during a pipeline run.

record.id prod_8831A
raw_fields.count 14 fields
derived.pack_size 12 · regex_match
derived.price_unit 1.45 · computed
derived.discount NaN · msrp_missing
fallback.strategy set_null · allow_optional
record.status valid · ready for delivery
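
One way to read the log above: a failed derivation on an optional field falls back to null and the record stays valid, while a failure on a required field would quarantine it. The sketch below models that reading; `OPTIONAL_FIELDS`, `apply_fallback`, and the status strings are assumptions for illustration, not the engine's actual interface.

```python
import math

# Assumed: the schema contract marks these derived fields as optional.
OPTIONAL_FIELDS = {"discount"}


def apply_fallback(derived: dict) -> dict:
    """Null out failed optional fields; quarantine on required failures."""
    fields, status = {}, "valid"
    for name, value in derived.items():
        failed = value is None or (isinstance(value, float) and math.isnan(value))
        if not failed:
            fields[name] = value
            continue
        fields[name] = None  # set_null: never deliver NaN
        if name not in OPTIONAL_FIELDS:
            status = "quarantined"
    return {"fields": fields, "status": status}
```

Keeping the optional/required distinction in the contract makes the null an explicit, documented outcome rather than a silent one.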


// 07 — FAQ

Common
questions.

Common questions about where to put business logic, handling missing inputs, and maintaining data integrity.

Why derive fields in the scraper instead of downstream in SQL?
Pushing logic to the scraper ensures that extraction errors are caught immediately. If a site changes its price format and your SQL pipeline breaks three days later, debugging is a nightmare. Deriving fields at extraction time allows you to quarantine bad records at the source and alert the scraping team instantly.
How do you handle missing inputs for a computation?
Explicitly. If a computation requires two fields and one is missing, the derivation function must return a typed null (or a specific error sentinel), not a default value like 0. Defaulting to zero causes division-by-zero errors or, worse, silently corrupts downstream averages.
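
A small numeric example shows why the default matters. The records and function name are invented for illustration: the third record's price failed to parse.

```python
from statistics import mean
from typing import Optional


def unit_price(price: Optional[float], pack_size: Optional[int]) -> Optional[float]:
    """Return None when either input is missing instead of substituting 0."""
    if price is None or not pack_size:
        return None
    return price / pack_size


# (price, pack_size) pairs; the third price is missing.
records = [(24.0, 12), (30.0, 10), (None, 6)]

with_nulls = [unit_price(p, n) for p, n in records]
with_zeros = [(p or 0.0) / n for p, n in records]  # the anti-pattern

# Nulls can be excluded explicitly; zeros are indistinguishable from data.
honest_avg = mean(v for v in with_nulls if v is not None)
corrupted_avg = mean(with_zeros)
```

Here `honest_avg` is 2.5, while `corrupted_avg` drops to roughly 1.67 because the missing price was counted as a real unit price of zero.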
What's the difference between extraction and derivation?
Extraction is pulling a value directly from the source document (e.g., grabbing the text inside an <h1> tag). Derivation is creating a new value that doesn't explicitly exist in the document by parsing, combining, or evaluating the extracted data.
Can derivations be used for data validation?
Yes. Cross-field derivations make excellent validation checks. For example, compute price_sale < price_msrp as a boolean field. If it evaluates to false, either the site has a pricing error or your selectors have swapped the two fields.
How does DataFlirt monitor derivation logic?
We track the success rate of every derivation function per pipeline run. If a specific derived field (like unit_price) suddenly drops from 99% completeness to 40%, it triggers an automated alert indicating that the underlying raw format has likely drifted.
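
A completeness check of this kind might look like the sketch below. The function names and the 10-point `max_drop` threshold are assumptions chosen for the example; the source does not state the actual alerting rule.

```python
def completeness(records: list, field: str) -> float:
    """Share of records in which the derived field is present and non-null."""
    if not records:
        return 0.0
    filled = sum(1 for record in records if record.get(field) is not None)
    return filled / len(records)


def has_drifted(previous: float, current: float, max_drop: float = 0.10) -> bool:
    """True when completeness fell by more than max_drop between runs."""
    return (previous - current) > max_drop
```

Comparing against the previous run rather than a fixed floor lets naturally sparse fields keep their own baseline.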
Should I derive categories from product titles?
Only if the site doesn't provide explicit breadcrumbs or category tags. Regex-based categorization from titles is brittle and requires constant dictionary updates. Always prefer explicit structural data over inferred text data when available.