
What is Record Linkage?

Record linkage is the algorithmic process of identifying and merging records from disparate sources that refer to the same real-world entity, despite missing identifiers, typos, or formatting variations. In scraping pipelines, it is the critical bridge between raw extraction and usable intelligence. Without it, you deliver a fragmented list of duplicates; with it, you deliver a unified master dataset.

Entity Resolution · Fuzzy Matching · Deduplication · Master Data · O(N²)
// 02 — definitions

Same entity,
different strings.

Why joining scraped datasets is never as simple as a SQL JOIN, and how probabilistic matching bridges the gap.

TL;DR

Record linkage identifies and merges records from different sources that represent the same real-world entity. Because scraped data lacks universal primary keys, pipelines rely on blocking strategies and fuzzy matching algorithms to calculate similarity scores and resolve entities without human intervention.

01 Definition & structure

Record linkage (also known as entity resolution or data matching) is the process of identifying records in one or more datasets that refer to the same real-world entity. In web scraping, this is necessary because different websites use different identifiers, naming conventions, and schemas.

A standard linkage pipeline consists of three phases:

  • Normalization: Standardizing text (lowercasing, removing punctuation, expanding abbreviations like "St." to "Street").
  • Blocking: Grouping records into smaller buckets to avoid comparing every record against every other record.
  • Scoring: Applying string distance algorithms (Jaro-Winkler, Levenshtein) and weighted logic to candidate pairs to determine a match probability.
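The normalization phase can be sketched in a few lines of Python. This is a minimal illustration, not DataFlirt's implementation: the `ABBREVIATIONS` table and the `normalize` function are hypothetical names, and a production dictionary would be far larger and domain-specific (for instance, "St." can also mean "Saint").

```python
import string

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "inc": "incorporated"}

# Map every ASCII punctuation character to a space so tokens stay separated.
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, expand known abbreviations."""
    tokens = text.lower().translate(_PUNCT_TABLE).split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


print(normalize("12 Main St."))
print(normalize("Apple Inc.") == normalize("Apple Incorporated"))
```

Normalizing before blocking and scoring means that purely cosmetic differences never reach the more expensive comparison steps.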

02 Deterministic vs. Probabilistic

Deterministic linkage requires exact matches on unique keys. If two records share the same UPC barcode or domain name, they are linked. It is computationally cheap but brittle—if a target site omits the UPC, the match fails.

Probabilistic linkage assigns weights to different fields. A match on "Company Name" might contribute 60% to the score, while a match on "City" contributes 20%. If the total score crosses a defined threshold (e.g., 0.85), the records are merged. This handles typos and missing data gracefully but requires careful tuning to avoid false positives.
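A weighted score of this kind might look like the sketch below. The 60% and 20% weights and the 0.85 threshold come from the paragraph above; the `phone` field, the use of `difflib.SequenceMatcher` as the similarity function, and the choice to renormalize over fields present on both sides are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Field weights: company_name and city follow the text; phone is hypothetical.
WEIGHTS = {"company_name": 0.6, "city": 0.2, "phone": 0.2}
THRESHOLD = 0.85


def field_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; a real pipeline might use Jaro-Winkler."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted average of field similarities over fields present in both records."""
    total = 0.0
    weight_used = 0.0
    for field, weight in WEIGHTS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if not a or not b:
            continue  # a missing value contributes neither evidence nor penalty
        total += weight * field_similarity(a, b)
        weight_used += weight
    return total / weight_used if weight_used else 0.0


def is_match(rec_a: dict, rec_b: dict) -> bool:
    return match_score(rec_a, rec_b) >= THRESHOLD


a = {"company_name": "Acme Corp", "city": "Austin", "phone": "512-555-0100"}
b = {"company_name": "Acme Corp.", "city": "Austin"}  # typo-level variation, no phone
c = {"company_name": "Zenith Ltd", "city": "Boston", "phone": "617-555-0199"}
print(is_match(a, b), is_match(a, c))
```

Renormalizing over the fields that are present is one way to tolerate missing data, but it also means a pair that shares only one populated field can score highly, which is part of why thresholds need tuning.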

03 The blocking imperative

Without blocking, linkage is an O(N²) problem. Comparing two datasets of 1 million records each requires 1 trillion comparisons. Blocking solves this by creating a heuristic key—for example, the first three letters of the last name plus the birth year. The engine only compares records that share this block key. The tradeoff is that if the block key itself contains a typo, the true match will be missed (a false negative).
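The block key described above (first three letters of the last name plus the birth year) can be sketched as follows. The record layout and the function names `block_key` and `candidate_pairs` are hypothetical, and the example deduplicates a single list rather than linking two datasets.

```python
from collections import defaultdict
from itertools import combinations


def block_key(record: dict) -> str:
    """First three letters of the last name plus the birth year."""
    return f"{record['last_name'][:3].lower()}{record['birth_year']}"


def candidate_pairs(records: list) -> list:
    """Return only the pairs of records that share a block key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


records = [
    {"id": 1, "last_name": "Johnson", "birth_year": 1980},
    {"id": 2, "last_name": "Johnsen", "birth_year": 1980},
    {"id": 3, "last_name": "Jonhson", "birth_year": 1980},  # typo inside the key
    {"id": 4, "last_name": "Smith", "birth_year": 1975},
]
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(records)]
print(pairs)
```

Record 3 is never compared with records 1 and 2 because its typo falls inside the key, which is exactly the false negative the paragraph warns about. Running several passes with different keys is a common way to reduce that risk.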

04 How DataFlirt handles it

We treat record linkage as a core component of the delivery layer. Our pipelines don't just dump raw JSON; they resolve entities across targets using a multi-pass engine. We run deterministic rules first (matching on ASINs, domains, or exact phone numbers), followed by LSH (Locality-Sensitive Hashing) blocking, and finally a tuned Fellegi-Sunter probabilistic model. Conflicting fields are resolved using client-defined survivorship rules before the final dataset is written to S3.

05 The false positive tradeoff

In record linkage, you must choose your poison: false positives (merging two different entities) or false negatives (failing to merge duplicates). For financial compliance or healthcare data, false positives are catastrophic, so thresholds are set extremely high (e.g., 0.98). For marketing lead generation, false negatives are worse (you don't want to email the same person twice), so thresholds are lowered (e.g., 0.75). There is no universal "correct" threshold.

// 03 — the math

How similar
is similar enough?

Record linkage relies on string distance metrics and weighted scoring to decide if 'Apple Inc.' and 'Apple Incorporated' are the same entity. DataFlirt uses a composite scoring model tuned per pipeline.

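Jaro-Winkler Similarity: $d_w = d_j + l \cdot p \cdot (1 - d_j)$
$d_j$ is the Jaro similarity, $l$ the length of the common prefix (capped at 4 characters), and $p$ the prefix scaling factor (commonly 0.1). Heavily weights matching prefixes. Ideal for company names and addresses. · Standard string metric
Fellegi-Sunter Weight: $W = \log_2(m / u)$
$m$ = probability that a field agrees when two records are the same entity; $u$ = probability that it agrees when they are different entities. · Probabilistic linkage model
DataFlirt Confidence Score: $\sum (W_{field} \cdot Sim_{field}) \, / \, \sum W_{field}$
Threshold typically set at > 0.88 for automated merging. · Internal linkage engine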
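The three formulas above can be sketched in plain Python. The Jaro-Winkler code follows the standard textbook definition (prefix capped at 4, p = 0.1 by default); the `confidence` function only mirrors the weighted-average formula and is not DataFlirt's engine. All function names are illustrative.

```python
import math


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """d_w = d_j + l * p * (1 - d_j), with l the common-prefix length."""
    dj = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return dj + prefix * p * (1 - dj)


def fs_weight(m: float, u: float) -> float:
    """Fellegi-Sunter agreement weight W = log2(m / u)."""
    return math.log2(m / u)


def confidence(weights: dict, sims: dict) -> float:
    """Weighted average: sum(W_field * Sim_field) / sum(W_field)."""
    return sum(weights[f] * sims[f] for f in weights) / sum(weights.values())


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
print(round(fs_weight(0.9, 0.01), 2))
```

"MARTHA" versus "MARHTA" is the usual textbook pair: one transposition and a three-character shared prefix. The `m` and `u` values passed to `fs_weight` are made-up inputs; in practice they are estimated from labeled or sampled data.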
// 04 — linkage execution

Resolving entities
across three catalogs.

A live trace of a DataFlirt linkage job merging product records from three different e-commerce targets into a single master SKU.

blocking · Jaro-Winkler · entity resolution
// input: candidate block [brand: "Sony", category: "Audio"]
rec_A: "Sony WH-1000XM5 Wireless Noise Canceling Headphones" // Target 1
rec_B: "Sony WH1000XM5/B Over-Ear Headphones Black" // Target 2
rec_C: "WH-1000XM5 Silver by Sony" // Target 3

// feature extraction & normalization
model_num: ["WH-1000XM5", "WH1000XM5", "WH-1000XM5"]
color: [null, "Black", "Silver"]

// pairwise scoring (threshold: 0.85)
score(A, B): 0.92 MATCH // high model number similarity
score(A, C): 0.78 SPLIT // color conflict (null vs Silver)
score(B, C): 0.71 SPLIT // color conflict (Black vs Silver)

// resolution output
entity_1: [rec_A, rec_B] // merged: Sony WH-1000XM5 (Black)
entity_2: [rec_C] // distinct: Sony WH-1000XM5 (Silver)
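One common way to turn pairwise scores like those in the trace into entities is to treat every above-threshold pair as an edge and take connected components with a union-find structure. The sketch below assumes that approach; the trace does not say how the engine actually groups records, and `resolve` is a hypothetical name.

```python
def resolve(records: list, scores: dict, threshold: float = 0.85) -> list:
    """Group records into entities by linking every pair scoring >= threshold."""
    parent = {r: r for r in records}

    def find(x):
        # Walk to the root, compressing the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(b)] = find(a)

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), []).append(r)
    return list(clusters.values())


scores = {
    ("rec_A", "rec_B"): 0.92,
    ("rec_A", "rec_C"): 0.78,
    ("rec_B", "rec_C"): 0.71,
}
entities = resolve(["rec_A", "rec_B", "rec_C"], scores)
print(entities)
```

Connected components merge transitively: if A matches B and B matches C, all three end up in one entity even when A and C score below the threshold, so some pipelines add a post-merge consistency check.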
// 05 — failure modes

Why linkage
jobs fail.

The most common reasons entity resolution pipelines produce false positives (merging distinct entities) or false negatives (failing to merge duplicates).

PIPELINES MONITORED · 140+ active
AVG PRECISION · 98.4%
UPDATED · 2026-05-19
01

Missing discriminators

false positives · Lacking the one attribute (e.g., color) that separates variants
02

Over-aggressive blocking

false negatives · Candidates never compared because block keys didn't match
03

Inconsistent tokenization

false negatives · Failing to normalize '100ml' vs '100 ml' before scoring
04

Cartesian explosion (O(N²))

timeout · Attempting to cross-join 10M records without blocking
05

Conflicting ground truths

merge failure · Two sources have identical names but vastly different prices
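The tokenization failure in item 03 is usually cheap to prevent. The following sketch removes the gap between a number and a unit before scoring; the unit list and the name `normalize_units` are assumptions chosen for illustration, not an exhaustive rule set.

```python
import re

# Hypothetical unit list; extend it per product category.
_UNIT_GAP = re.compile(r"(\d)\s+(ml|l|g|kg|oz)\b")


def normalize_units(text: str) -> str:
    """Lowercase and collapse whitespace between a digit and a known unit."""
    return _UNIT_GAP.sub(r"\1\2", text.lower())


print(normalize_units("Shampoo 100 ml"))
print(normalize_units("Shampoo 100ml"))
```

The trailing `\b` keeps the pattern from firing on longer words, so "100 liters" is left untouched rather than being mangled into "100liters".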
// 06 — our architecture

Link at scale,

without the Cartesian explosion.

Comparing every scraped record against every other record is an O(N²) operation. At 10 million rows, that is roughly 50 trillion unique pairs (N(N − 1) / 2), which is impractical for a daily pipeline. DataFlirt's linkage engine uses aggressive Locality-Sensitive Hashing (LSH) and deterministic blocking to select candidate pairs before applying expensive probabilistic scoring, keeping pipeline latency close to linear in the number of records and compute costs low.

Entity Resolution Job

Live metrics from a B2B company directory linkage run.

input.records        12,450,000
theoretical.pairs    7.75 × 10¹³
blocking.strategy    LSH + Soundex
candidate.pairs      41,200,000
scoring.model        Fellegi-Sunter · threshold 0.88
resolved.entities    8,102,440
job.duration         14m 22s
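The pair counts in the metrics above can be checked with a few lines of arithmetic. The sketch assumes `theoretical.pairs` counts unordered pairs within one dataset, n(n − 1) / 2, which is consistent with the reported figure; "reduction ratio" is a standard name for the share of comparisons that blocking avoids.

```python
def unique_pairs(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


records = 12_450_000
candidates = 41_200_000

theoretical = unique_pairs(records)
reduction_ratio = 1 - candidates / theoretical

print(f"{theoretical:.3e}")        # theoretical pairs in scientific notation
print(f"{reduction_ratio:.7f}")    # share of comparisons avoided by blocking
```

A high reduction ratio only shows that blocking is cheap; it says nothing about how many true matches were dropped, which is why it is normally reported alongside recall.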

// 07 — FAQ

Common
questions.

About entity resolution, blocking strategies, handling conflicting data, and how DataFlirt builds master datasets from fragmented sources.

What is the difference between record linkage and deduplication?
Deduplication usually refers to removing exact or near-exact duplicates within a single dataset. Record linkage is the broader process of connecting records across multiple distinct datasets (e.g., merging a scraped list of restaurants from Yelp with a list from Google Maps) where schemas and formatting differ entirely.
How do you avoid the O(N²) performance trap?
Through blocking. Instead of comparing every record to every other record, we generate a 'block key' (e.g., the first three consonants of a company name plus the zip code). We only run the expensive string similarity algorithms on records that share a block key. This reduces comparisons from trillions to millions.
What happens when two sources have conflicting data for the same entity?
We use survivorship rules. If Source A says a product costs $40 and Source B says $45, the linkage engine doesn't just average them. We define rules per field: e.g., 'trust Source A for price because it updates faster, but trust Source B for the product description because it is more comprehensive.'
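Per-field survivorship rules of this kind can be expressed as an ordered list of trusted sources for each field. The sketch below is an assumption about how such rules might be encoded; the names `SURVIVORSHIP`, `source_a`, `source_b`, and `merge` are hypothetical.

```python
# For each field, sources listed in order of trust (hypothetical configuration).
SURVIVORSHIP = {
    "price": ["source_a", "source_b"],        # Source A updates faster
    "description": ["source_b", "source_a"],  # Source B is more comprehensive
}


def merge(records_by_source: dict, rules: dict) -> dict:
    """Pick each field from the most trusted source that actually has a value."""
    merged = {}
    for field, priority in rules.items():
        for source in priority:
            value = records_by_source.get(source, {}).get(field)
            if value is not None:
                merged[field] = value
                break
    return merged


linked = {
    "source_a": {"price": 40.0, "description": "Headphones"},
    "source_b": {"price": 45.0, "description": "Wireless over-ear headphones"},
}
print(merge(linked, SURVIVORSHIP))
```

Falling back to the next source when the preferred one has no value keeps the merged record as complete as possible without averaging or guessing.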
Deterministic vs. Probabilistic linkage: which is better?
Deterministic linkage relies on exact matches of unique identifiers (UPC, email, domain). It is fast and highly precise when the identifiers themselves are reliable, but yields low coverage on scraped data. Probabilistic linkage uses weighted similarity scores across multiple fields. Production pipelines use both: deterministic first to catch the easy matches, probabilistic second to catch the rest.
How do you measure the quality of a linkage job?
Using Precision (what percentage of linked records are actually the same entity) and Recall (what percentage of true duplicates were successfully linked). We maintain a manually annotated 'golden dataset' of 5,000 records per domain. Every time we tweak the linkage weights, we run a regression test against the golden dataset.
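A regression check against a golden dataset reduces to comparing two sets of linked pairs. The sketch below assumes links are stored as pairs of record IDs and that pair order does not matter; `precision_recall` is an illustrative name, and the sample IDs are made up.

```python
def precision_recall(predicted: list, golden: list) -> tuple:
    """Pairwise precision and recall; (a, b) and (b, a) count as the same link."""
    predicted_set = {frozenset(p) for p in predicted}
    golden_set = {frozenset(p) for p in golden}
    true_positives = len(predicted_set & golden_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(golden_set) if golden_set else 0.0
    return precision, recall


predicted = [("a", "b"), ("c", "d"), ("e", "f")]
golden = [("b", "a"), ("c", "d"), ("g", "h"), ("i", "j")]
precision, recall = precision_recall(predicted, golden)
print(precision, recall)
```

Here two of the three predicted links are correct and two of the four true links were found. Tracking both numbers after every weight change shows whether a tweak traded recall for precision or the reverse.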
Can DataFlirt link scraped data to my internal database?
Yes. We frequently build pipelines where the output is not just raw scraped data, but data that has already been resolved against a client's internal master data. We ingest a hashed version of your primary keys and return scraped records appended with your internal IDs.
hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h