
What is Entity Resolution?

Entity resolution is the algorithmic process of determining whether two or more records refer to the same real-world entity, despite variations in spelling, missing fields, or formatting differences. In scraping pipelines, it is the critical bridge between raw extracted text and a usable master dataset. Without it, a single company scraped from five different directories becomes five distinct database rows, destroying downstream aggregations and inflating data volume with noise.

Data Cleaning · Record Linkage · Fuzzy Matching · Master Data · ETL

// 02 — definitions

Same thing, different strings.

The computational challenge of merging disparate scraped records into a single canonical entity without a shared primary key.


TL;DR

Entity resolution (ER) finds records that refer to the same real-world entity. Done naively, that means comparing every scraped record against every other record, an O(N²) problem. ER pipelines use blocking to group likely matches, then apply string distance and phonetic algorithms to score similarity. It is the difference between a raw web dump and a production-ready dataset.

01 Definition & structure

Entity resolution (also known as record linkage, and sometimes loosely called deduplication) is the process of identifying and merging records that represent the same real-world entity. A complete resolution pipeline consists of three phases:

Blocking — partitioning the dataset to avoid comparing every record against every other record.
Pairwise Scoring — calculating the similarity between two records using string distance metrics (like Levenshtein or Jaro-Winkler) and phonetic algorithms.
Clustering & Survivorship — grouping matched records into a single canonical entity and deciding which attributes "survive" the merge.

02 The O(N²) bottleneck

The fundamental challenge of entity resolution is scale. Comparing 10,000 records pairwise requires ~50 million comparisons; comparing 1 million records requires ~500 billion. Without aggressive blocking strategies, such as Locality-Sensitive Hashing (LSH) or sorted neighborhood methods, entity resolution jobs quickly become computationally impractical, stalling pipelines and running up massive cloud compute bills.
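The arithmetic behind those figures, plus a minimal sketch of the sorted neighborhood idea, is shown here. The function names and the sample company names are illustrative assumptions, not part of any DataFlirt API.

```python
def pair_count(n: int) -> int:
    """Unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def sorted_neighborhood_pairs(records, key, window=3):
    """Sort by a key, then compare each record only with the next (window - 1) records."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, left in enumerate(ordered):
        for right in ordered[i + 1 : i + window]:
            pairs.append((left, right))
    return pairs


print(pair_count(10_000))     # 49995000
print(pair_count(1_000_000))  # 499999500000

# Hypothetical sample data.
names = ["acme corp", "zenith ltd", "acme corporation", "apex inc"]
candidates = sorted_neighborhood_pairs(names, key=lambda s: s, window=2)
print(len(candidates), "of", pair_count(len(names)))  # 3 of 6
```

With a fixed window the candidate count grows roughly linearly with N instead of quadratically, at the cost of missing matches whose sort keys land far apart.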

03 Deterministic vs. Probabilistic

Deterministic matching relies on exact matches of strong identifiers: if two records share the same UPC, ISBN, or tax ID, they are merged. Probabilistic matching is used when strong identifiers are missing. It calculates a weighted confidence score based on fuzzy text matches across multiple fields (e.g., name, address, phone). Production pipelines typically attempt deterministic matching first, falling back to probabilistic models only when necessary.
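A minimal sketch of that deterministic-first flow follows. The field names, the weights, and the use of `difflib.SequenceMatcher` as the fuzzy scorer are all assumptions made for illustration; a production scorer would more likely use Jaro-Winkler or a trained model.

```python
from difflib import SequenceMatcher

STRONG_IDS = ("tax_id", "upc", "isbn")
# Hypothetical field weights for the fuzzy fallback.
WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}


def text_sim(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve_pair(a: dict, b: dict, threshold: float = 0.88) -> tuple[bool, str]:
    # Deterministic pass: a strong identifier present on both sides decides.
    for field in STRONG_IDS:
        if a.get(field) and b.get(field):
            return a[field] == b[field], f"deterministic:{field}"
    # Probabilistic fallback: weighted average over fields both records have.
    shared = [f for f in WEIGHTS if a.get(f) and b.get(f)]
    if not shared:
        return False, "no-evidence"
    total = sum(WEIGHTS[f] for f in shared)
    score = sum(WEIGHTS[f] * text_sim(a[f], b[f]) for f in shared) / total
    return score > threshold, f"probabilistic:{score:.2f}"


print(resolve_pair({"tax_id": "12-345", "name": "Acme"},
                   {"tax_id": "12-345", "name": "ACME Corporation"}))
print(resolve_pair({"name": "Acme Corp"}, {"name": "Zenith Ltd"}))
```

Treating differing strong identifiers as a definite non-match is a design choice in this sketch; some pipelines instead route such pairs to review, since identifiers can be mistyped at the source.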

04 How DataFlirt handles it

We treat entity resolution as an ingestion-layer primitive, not a downstream analytics task. As records flow from our scraping fleet into the delivery sink, they pass through a graph-based resolution engine. We assign persistent, cross-pipeline entity_ids to matched clusters. If a target site updates a company's address, we don't create a new row; we update the canonical entity and emit a change-data-capture (CDC) event to the client.

05 The false positive trap

The most destructive failure mode in entity resolution is over-merging (false positives). If an algorithm aggressively merges "Springfield High School" in Illinois with "Springfield High School" in Ohio, the resulting canonical record is a corrupted hybrid of both. It is always safer to under-merge (leaving duplicates in the dataset) than to over-merge. Thresholds should be tuned strictly for precision, leaving recall as a secondary concern.
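One way to tune for precision first is to pick the lowest threshold that still meets a precision floor on a labelled sample, which keeps as much recall as the floor allows. The sketch below assumes a small hand-labelled set of scored pairs; the scores and labels are invented for illustration.

```python
def precision_recall(scored, threshold):
    """scored: list of (similarity_score, is_true_match) pairs."""
    tp = sum(1 for s, y in scored if s >= threshold and y)
    fp = sum(1 for s, y in scored if s >= threshold and not y)
    fn = sum(1 for s, y in scored if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def pick_threshold(scored, min_precision=0.99):
    """Lowest candidate threshold whose precision meets the floor.

    Recall can only fall as the threshold rises, so the lowest
    qualifying threshold is also the one with the best recall.
    """
    for t in sorted({s for s, _ in scored}):
        p, r = precision_recall(scored, t)
        if p >= min_precision:
            return t, p, r
    return None


# Hypothetical labelled sample: (score, is_true_match).
labelled = [(0.95, True), (0.93, True), (0.90, True), (0.86, False),
            (0.80, True), (0.65, False), (0.42, False)]
print(pick_threshold(labelled))  # (0.9, 1.0, 0.75)
```

Here the chosen threshold gives up one true match (the pair scored 0.80) to avoid merging the false pair scored 0.86.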

// 03 — the math

How similar is similar enough?

Entity resolution relies on distance metrics to quantify string similarity, combined with blocking strategies to make the math computationally viable at scale.

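Jaro-Winkler Similarity = d_w = d_j + (l · 0.1 · (1 − d_j))

Here d_j is the Jaro similarity, l is the length of the shared prefix (capped at 4 characters), and 0.1 is the standard prefix scaling factor. Favours strings that match from the beginning. Standard for company and human names. Winkler, 1990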
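For readers who want to see the formula in motion, here is a small pure-Python sketch of the Jaro and Jaro-Winkler similarity. It is a textbook-style implementation written for this page, not DataFlirt's production scorer, and it is checked against the widely quoted MARTHA/MARHTA example.

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order, halved.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    d_j = jaro(s1, s2)
    prefix = 0  # the "l" term in the formula above
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return d_j + prefix * p * (1 - d_j)


print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The prefix bonus only ever raises the score, which is why names that agree on their first few characters rank higher than names that agree only at the end.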

Blocking Reduction Ratio = 1 − (comparisons_made / (N · (N−1) / 2))

Measures the efficiency of the blocking phase. A ratio of 0.99 means 99% of all possible pairwise comparisons were skipped. DataFlirt pipeline metrics
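As a worked example, the ratio can be computed directly from the block key assigned to each record. The state codes below are made-up sample keys.

```python
from collections import Counter


def reduction_ratio(block_keys) -> float:
    """1 - (comparisons made within blocks / all possible pairs)."""
    n = len(block_keys)
    total = n * (n - 1) // 2
    if total == 0:
        return 0.0
    made = sum(size * (size - 1) // 2 for size in Counter(block_keys).values())
    return 1 - made / total


# Hypothetical block keys for six records.
keys = ["NY", "NY", "NY", "CA", "CA", "TX"]
ratio = reduction_ratio(keys)
print(f"{ratio:.3f}")  # 0.733 (4 comparisons made out of 15 possible)
```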

DataFlirt Confidence Score = C = (w₁·name_sim) + (w₂·addr_sim) − conflict_penalty

Weighted multi-field similarity. C > 0.88 triggers an automatic merge. Internal ER engine
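A minimal sketch of how such a weighted score could be evaluated follows. The weights and the penalty value are placeholders chosen for illustration (the real ones are not given here), so the numbers will not reproduce scores from the production engine.

```python
MERGE_THRESHOLD = 0.88  # from the text: C > 0.88 triggers an automatic merge


def confidence(name_sim: float, addr_sim: float,
               w_name: float = 0.5, w_addr: float = 0.5,  # hypothetical weights
               conflict_penalty: float = 0.0) -> float:
    return w_name * name_sim + w_addr * addr_sim - conflict_penalty


def decide(score: float) -> str:
    return "MERGE" if score > MERGE_THRESHOLD else "KEEP_SEPARATE"


strong = confidence(name_sim=0.92, addr_sim=0.95)
weak = confidence(name_sim=0.88, addr_sim=0.12, conflict_penalty=0.1)
print(round(strong, 3), decide(strong))  # 0.935 MERGE
print(round(weak, 3), decide(weak))      # 0.4 KEEP_SEPARATE
```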

// 04 — resolution trace

Merging records across three sources.

A live trace of our resolution engine evaluating three scraped company profiles. Without a shared DUNS or tax ID, the engine relies on fuzzy field matching and graph clustering.

Jaro-Winkler · Geo-Blocking · Graph Merge


// input records
rec_A: { name: "Acme Corp", addr: "123 Main St, NY", phone: "555-0100" }
rec_B: { name: "Acme Corporation", addr: "123 Main Street, New York", phone: null }
rec_C: { name: "Acme Corp LLC", addr: "456 Broad St, NY", phone: "555-0100" }

// blocking phase (key: state_ny)
block_size: 3 records -> 3 pairwise comparisons required

// pairwise scoring
compare(A, B): name=0.92, addr=0.95, phone=null -> score: 0.93
compare(A, C): name=0.88, addr=0.12, phone=1.00 -> score: 0.65
compare(B, C): name=0.85, addr=0.10, phone=null -> score: 0.42

// clustering & decision (threshold: 0.88)
edge(A, B): MATCH
edge(A, C): CONFLICT // address mismatch penalty applied

// output
entity_id: "ent_8f92a1b" -> [rec_A, rec_B]
entity_id: "ent_4c33d9e" -> [rec_C]
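The clustering step in the trace can be approximated with a union-find over the pairs that clear the threshold. This is a simplified sketch using the scores shown above; the real engine is described as graph-based, and its internals are not documented here.

```python
def cluster(record_ids, match_edges):
    """Group records connected by match edges using union-find."""
    parent = {r: r for r in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in match_edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), []).append(r)
    return sorted(groups.values())


# Pairwise scores taken from the trace above.
scores = {("rec_A", "rec_B"): 0.93,
          ("rec_A", "rec_C"): 0.65,
          ("rec_B", "rec_C"): 0.42}
edges = [pair for pair, score in scores.items() if score > 0.88]
clusters = cluster(["rec_A", "rec_B", "rec_C"], edges)
print(clusters)  # [['rec_A', 'rec_B'], ['rec_C']]
```

Plain union-find merges transitively, so one bad edge can chain two large clusters together; that is one reason the merge threshold is kept strict.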

// 05 — resolution complexity

Why matching breaks at scale.

The primary drivers of entity resolution failure and computational bloat across DataFlirt's B2B and e-commerce pipelines.

Entities tracked: 300M+ active
Avg resolution: 15ms per record
Updated: 2026-05-19

01 Missing or null identifiers: forces fuzzy fallback. Lack of UPC/DUNS requires text-based heuristics.

02 The N² comparison bottleneck: compute scaling. Without aggressive blocking, compute scales quadratically.

03 Semantic equivalence: string distance fails. 'IBM' vs 'Intl Business Machines' look entirely different.

04 Conflicting attribute updates: survivorship logic. Same entity, but one source has stale or incorrect data.

05 Over-merging false positives: data destruction. Aggressive fuzzy matching collapsing distinct entities.

// 06 — DataFlirt's resolution engine

Resolve at ingestion, not at query time.

DataFlirt runs entity resolution inline as data flows into the lakehouse. We use a multi-stage pipeline: deterministic matching on strong identifiers (like UPCs or tax IDs), followed by Locality-Sensitive Hashing (LSH) for blocking, and finally a gradient-boosted model for pairwise scoring. This ensures that when a client queries the dataset, they are hitting a clean, canonical entity table, not a raw web dump that requires a 40-minute dbt run to deduplicate.
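To make the LSH blocking stage concrete, here is a toy MinHash-with-banding sketch using only the standard library. It illustrates the general technique, not DataFlirt's implementation; the shingle size, number of hash seeds, and band count are arbitrary assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, k: int = 3) -> set:
    """Character k-grams of the lowercased text."""
    t = text.lower()
    return {t[i:i + k] for i in range(max(len(t) - k + 1, 1))}


def minhash(tokens: set, num_perm: int = 32) -> list:
    """One minimum hash value per seed; similar sets share many minima."""
    return [
        min(int.from_bytes(
            hashlib.blake2b(f"{seed}:{tok}".encode(), digest_size=8).digest(), "big")
            for tok in tokens)
        for seed in range(num_perm)
    ]


def lsh_candidate_pairs(names: dict, num_perm: int = 32, bands: int = 16) -> set:
    """Records sharing any full band of their signature become candidates."""
    rows = num_perm // bands
    buckets = defaultdict(set)
    for rid, name in names.items():
        sig = minhash(shingles(name), num_perm)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(rid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical records keyed by id.
names = {"a": "Acme Corp", "b": "Acme Corp", "c": "Zenith Holdings"}
candidates = lsh_candidate_pairs(names)
print(candidates)  # {('a', 'b')}
```

Only the candidate pairs go on to the more expensive pairwise scorer. More bands with fewer rows each catch looser matches but let more pairs through.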

Inline Resolution Job

Metrics from a daily B2B directory merge across 14 target sites.

job.id: res-b2b-099

records.input: 4.2M records

blocking.reduction: 99.8% (optimal)

comparisons.made: 18.4M pairs

matches.found: 1.1M edges

false_positive.est: < 0.01% (within SLO)

entities.output: 3.1M canonical entities


// 07 — FAQ

Common questions.

About fuzzy matching, blocking strategies, false positives, and how DataFlirt builds canonical datasets from messy web sources.


What is the difference between entity resolution and deduplication?

Deduplication is the removal of exact duplicate records — usually handled by a simple hash or SQL DISTINCT. Entity resolution is the process of linking records that refer to the same entity but have different string representations (e.g., "Apple Inc." vs "Apple Incorporated"). Deduplication is a subset of entity resolution.

Why can't I just use SQL GROUP BY to merge records?

GROUP BY only groups rows whose values are exactly equal. If one source lists a company as "Acme Corp" and another as "Acme Corp.", a GROUP BY clause will treat them as two distinct entities. Entity resolution uses fuzzy matching and probabilistic scoring to bridge these variations before the data ever reaches your SQL warehouse.
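The behaviour is easy to reproduce with an in-memory SQLite database. The table name and rows below are invented for the demonstration, and the `normalize` helper is a deliberately crude stand-in that only handles punctuation and case.

```python
import re
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT)")
conn.executemany("INSERT INTO companies VALUES (?)",
                 [("Acme Corp",), ("Acme Corp.",), ("Acme Corp",)])

# GROUP BY keeps the trailing-period variant as a separate group.
rows = conn.execute(
    "SELECT name, COUNT(*) FROM companies GROUP BY name ORDER BY name"
).fetchall()
print(rows)  # [('Acme Corp', 2), ('Acme Corp.', 1)]


def normalize(name: str) -> str:
    """Lowercase and drop punctuation; a crude pre-matching step."""
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()


all_names = [r[0] for r in conn.execute("SELECT name FROM companies")]
normalized = Counter(normalize(n) for n in all_names)
print(dict(normalized))  # {'acme corp': 3}
conn.close()
```

Normalization like this fixes the trivial variants, but it cannot reconcile "Acme Corp" with "Acme Corporation", which is where fuzzy scoring takes over.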

How does DataFlirt handle false positives during a merge?

We tune our models for precision over recall. It is vastly preferable to have two records for one company (a false negative) than to merge two distinct companies into one (a false positive). False positives destroy data integrity by blending attributes that don't belong together. If our confidence score does not exceed 0.88, the records remain separate.

What is 'blocking' in the context of entity resolution?

Blocking is a performance optimization. If you have 1 million records, comparing every record to every other record requires roughly 500 billion comparisons (O(N²)). Blocking groups records by a coarse key, like Zip Code or the first three letters of a name, so you only perform pairwise comparisons within each block. With a well-chosen key this can cut the comparison count by 99% or more.
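A minimal sketch of key-based blocking, assuming each record is a dict with hypothetical `id` and `zip` fields:

```python
from collections import defaultdict
from itertools import combinations


def block_pairs(records, key):
    """Yield id pairs only for records that share a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec["id"])
    for ids in blocks.values():
        yield from combinations(ids, 2)


# Hypothetical records.
records = [
    {"id": 1, "name": "Acme Corp", "zip": "10001"},
    {"id": 2, "name": "Acme Corporation", "zip": "10001"},
    {"id": 3, "name": "Zenith Ltd", "zip": "94105"},
    {"id": 4, "name": "Zenith Limited", "zip": "94105"},
]
pairs = list(block_pairs(records, key=lambda r: r["zip"]))
print(pairs)  # [(1, 2), (3, 4)], versus 6 pairs without blocking
```

The trade-off is that two records for the same entity with different keys (say, a mistyped Zip Code) never get compared, so pipelines often run several blocking passes with different keys.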

Can I provide my own master data for you to resolve against?

Yes. Enterprise clients frequently push their CRM or MDM data to our secure enclave. Our pipeline scrapes the target sites, resolves the extracted records against your internal master data, and delivers the payload mapped directly to your internal entity IDs.

How do you handle conflicting data during a merge?

Through survivorship rules based on source-level trust scoring and recency. If we merge a record from a government registry and a local business directory, the government registry's address overwrites the local directory's address, but the local directory's phone number is retained if the registry lacks one.
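The registry-versus-directory example can be sketched as a per-field rule: among the records that actually have a value, take the one from the most trusted source, breaking ties by recency. The trust scores, field names, and dates below are hypothetical.

```python
from datetime import date

# Hypothetical source-level trust scores.
SOURCE_TRUST = {"government_registry": 0.9, "local_directory": 0.5}


def survive(records, fields):
    """Build a canonical record field by field."""
    canonical = {}
    for field in fields:
        candidates = [r for r in records if r.get(field) not in (None, "")]
        if not candidates:
            canonical[field] = None
            continue
        # Highest trust wins; the most recent scrape breaks ties.
        best = max(candidates,
                   key=lambda r: (SOURCE_TRUST.get(r["source"], 0.0), r["scraped_on"]))
        canonical[field] = best[field]
    return canonical


merged = survive(
    [
        {"source": "government_registry", "scraped_on": date(2026, 1, 10),
         "address": "123 Main Street, New York", "phone": None},
        {"source": "local_directory", "scraped_on": date(2026, 3, 2),
         "address": "123 Main St, NY", "phone": "555-0100"},
    ],
    fields=["address", "phone"],
)
print(merged)
# {'address': '123 Main Street, New York', 'phone': '555-0100'}
```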
