What is Data Uniqueness?

Data uniqueness is the measure of how many distinct, non-redundant entities exist within a dataset compared to the total record count. In scraping pipelines, duplicates are inevitable — caused by overlapping category pages, pagination loops, or multiple sellers listing the same SKU. Failing to enforce uniqueness at the extraction layer means downstream consumers will double-count inventory, skew pricing aggregations, and ultimately lose trust in the pipeline's output.

Entity Resolution · Deduplication · Data Quality · Primary Keys · Fuzzy Matching
// 02 — definitions

One entity,
one record.

Why raw scraped data is inherently redundant, and the mechanics of collapsing multiple observations into a single canonical truth.

TL;DR

Data uniqueness ensures that a real-world entity (a product, a company, a job posting) is represented exactly once in your delivered dataset. Getting close to 100% uniqueness usually requires moving beyond simple exact-match deduplication and adding fuzzy entity resolution to catch slight variations in scraped text.

01 Definition & structure

Data uniqueness is a core data quality metric that ensures every real-world entity is represented by exactly one row in a dataset. In web scraping, raw output is almost never unique. A single product might appear on a brand page, a category page, and a promotional banner — resulting in three separate HTML fetches and three extracted records.

Enforcing uniqueness requires defining a primary key. If the target site provides a clean, stable ID (like a SKU or a database ID in a JSON payload), uniqueness is trivial. If it doesn't, you must construct a composite key from stable attributes or rely on fuzzy matching.

02 The source of the noise

Duplicates in scraping pipelines rarely stem from crawler bugs; they are a byproduct of site architecture. Common culprits include:

  • Taxonomy overlaps: A laptop is listed under "Electronics", "Computers", and "Sale". A naive crawler traversing categories will fetch it three times.
  • URL tracking: Sites append session IDs or referrer tags to URLs. To a basic crawler, /item?id=1&s=abc and /item?id=1&s=xyz look like different pages.
  • Pagination shifts: If a new item is added to page 1 while you are crawling page 2, every item shifts down one position: the last item on page 2 moves to the top of page 3, so you scrape it again.
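
The pagination case is easy to reproduce in a few lines. The sketch below is illustrative only: the item IDs, the page size, and the `fetch_page` helper are hypothetical, and the listing is assumed to be ordered newest-first.

```python
def fetch_page(listing, page, size):
    """Return one page of a newest-first listing (hypothetical helper)."""
    start = (page - 1) * size
    return listing[start:start + size]

listing = ["item-A", "item-B", "item-C", "item-D"]  # hypothetical item IDs
scraped = fetch_page(listing, page=1, size=2)       # first page of the crawl

# A new item is published at the top while the crawl is still running.
listing.insert(0, "item-NEW")

scraped += fetch_page(listing, page=2, size=2)      # second page, after the shift

# Any ID seen more than once was scraped twice.
duplicates = sorted({x for x in scraped if scraped.count(x) > 1})
print(scraped)     # item-B appears on both pages
print(duplicates)
```

The same shift also means the crawler never sees `item-NEW`, so pagination drift can cause both duplicates and gaps.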

03 Entity resolution vs deduplication

Deduplication is deterministic. You hash a string (like a URL or a SKU), and if you see that hash again, you drop the record. It is fast, cheap, and can be done in-memory during the crawl.
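
A minimal sketch of that in-memory approach, assuming each record is a dict and that one field (here `sku`) is a trustworthy exact key. The function name and the sample records are illustrative, not part of any DataFlirt API.

```python
import hashlib

def dedupe_exact(records, key_field):
    """Keep the first record for each distinct value of key_field."""
    seen = set()
    unique = []
    for record in records:
        # Hash the key so the seen-set stores fixed-size digests.
        digest = hashlib.sha256(record[key_field].encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate: drop it
        seen.add(digest)
        unique.append(record)
    return unique

records = [
    {"sku": "MAK-18V-LXT", "price": 129.0},
    {"sku": "MAK-18V-LXT", "price": 129.0},
    {"sku": "BOS-12V-PRO", "price": 99.0},  # hypothetical SKU
]
unique = dedupe_exact(records, "sku")
print(len(unique))
```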

Entity resolution is probabilistic. It handles cases where the same entity is represented differently, for example when three retailers name the same TV slightly differently. This typically involves computing string similarity (such as Levenshtein edit distance or Jaccard similarity over token sets), normalizing units, and in harder cases applying machine learning models to decide whether records can be merged with confidence.
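
The two similarity measures named above can be sketched in plain Python. This is a teaching sketch, not a production matcher. The tokenization rule (lowercase alphanumeric words) is an assumption, and a different tokenizer such as character n-grams will give different scores for the same pair of titles.

```python
import re

def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def tokens(text):
    """Lowercase alphanumeric word tokens (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

print(levenshtein("kitten", "sitting"))
print(jaccard("Nike Air Max 90", "Air Max 90 by Nike"))
```

Token-set Jaccard ignores word order, which suits reordered titles. Edit distance is sensitive to order and is better suited to catching typos within a field.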

04 How DataFlirt handles it

We treat uniqueness as a multi-stage pipeline guarantee. First, our crawler normalizes all URLs (stripping tracking parameters and sorting query strings) and checks them against a distributed Redis Bloom filter to prevent redundant network requests. Second, at the extraction layer, we generate a deterministic composite hash for every record. Finally, before delivery, our data engineering layer runs dbt models to enforce structural uniqueness, ensuring the client receives a perfectly clean, primary-keyed dataset.
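
To make the first stage concrete, here is a small in-memory Bloom filter. It is a stand-in for illustration only: it is not Redis, not distributed, and not DataFlirt's implementation, and the bit-array size and hash count are arbitrary assumptions. A Bloom filter never reports a URL it has seen as unseen, but it can occasionally report an unseen URL as seen, so sizing determines how many pages get skipped by mistake.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter sketch; parameters are illustrative assumptions."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
seen.add("https://target.com/item/123")
print("https://target.com/item/123" in seen)  # already crawled
print("https://target.com/item/456" in seen)  # not crawled yet
```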

05 The false positive trap

The most dangerous mistake in data cleaning is over-aggressive deduplication. If you deduplicate a real estate feed based solely on "Address" and "Price", you will accidentally merge two different apartments in the same building that happen to rent for the same amount. A false positive in deduplication results in silent data loss — a failure mode that is much harder to detect and fix than simply having a few duplicate rows in a database.
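
The real estate example can be shown directly. The listings and the `unit` field below are made up for illustration. The point is only that a key which is too coarse silently removes a distinct record.

```python
listings = [
    {"address": "12 Elm Street", "unit": "2A", "price": 1500},
    {"address": "12 Elm Street", "unit": "5C", "price": 1500},  # different flat
    {"address": "12 Elm Street", "unit": "2A", "price": 1500},  # true duplicate
]

def dedupe(rows, key_fields):
    """Keep the first row seen for each combination of key_fields."""
    seen = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        seen.setdefault(key, row)
    return list(seen.values())

too_aggressive = dedupe(listings, ["address", "price"])
safer = dedupe(listings, ["address", "unit", "price"])

print(len(too_aggressive))  # flat 5C has been merged away
print(len(safer))           # only the true duplicate is dropped
```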

// 03 — the metrics

How redundant
is the feed?

Uniqueness isn't just a binary state; it's a ratio that dictates storage costs and analytical validity. DataFlirt tracks these metrics per pipeline run to detect crawler loops and taxonomy overlaps.

Uniqueness Ratio: U = distinct_records / total_records
U = 1.0 means perfect uniqueness. U < 0.5 usually indicates a pagination loop. (Standard data quality metric)

Jaccard Similarity: J(A, B) = |A ∩ B| / |A ∪ B|
Used for fuzzy text matching when exact primary keys are absent. (Set theory)

DataFlirt Merge Confidence: C = w1·SKU + w2·Price + w3·Title_Sim
Records with C > 0.92 are automatically merged into a single canonical row. (Internal entity resolution SLO)
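
As a worked example, the uniqueness ratio and the merge confidence can be computed as below. The weights are hypothetical placeholders chosen only so the arithmetic is easy to follow, because the source does not state the real values. Each input score is assumed to lie between 0 and 1.

```python
MERGE_THRESHOLD = 0.92  # threshold quoted in the text above

def uniqueness_ratio(keys):
    """U = distinct_records / total_records, computed over record keys."""
    keys = list(keys)
    if not keys:
        return 1.0
    return len(set(keys)) / len(keys)

def merge_confidence(sku_match, price_match, title_sim,
                     weights=(0.5, 0.2, 0.3)):  # hypothetical weights
    """Weighted sum of three match scores."""
    w1, w2, w3 = weights
    return w1 * sku_match + w2 * price_match + w3 * title_sim

u = uniqueness_ratio(["a", "b", "a", "c"])
c_same_sku = merge_confidence(1.0, 1.0, 0.88)
c_no_sku = merge_confidence(0.0, 1.0, 0.88)

print(u)
print(c_same_sku > MERGE_THRESHOLD)  # would be merged under these weights
print(c_no_sku > MERGE_THRESHOLD)    # would be kept separate
```

With these assumed weights, a strong title and price match is not enough to cross the threshold unless the SKU also matches. That is one way to bias a merge rule against false positives.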
// 04 — pipeline trace

Collapsing duplicates
in real time.

A live trace of an entity resolution job processing a batch of e-commerce product records. Notice how URL variations and slight title differences are resolved into a single canonical ID.

Redis Bloom Filter · dbt · Entity Resolution
// batch ingestion: 10,000 records
job.id: "dedup-b2b-099"
phase: "exact_match_filter"

// exact match (URL / Hash)
record_1: "https://target.com/item/123?ref=home"
record_2: "https://target.com/item/123?ref=promo"
action: dropped record_2 // URL normalization match

// fuzzy entity resolution
record_3.title: "Makita 18V Drill LXT"
record_4.title: "18V LXT Drill - Makita"
similarity.jaccard: 0.88
similarity.price: 1.0 // exact match
action: merged // canonical ID: MAK-18V-LXT

// job summary
input_records: 10,000
exact_duplicates_dropped: 2,140
fuzzy_entities_merged: 412
output_unique_records: 7,448
// 05 — sources of noise

Where duplicates
actually come from.

Ranked by frequency across DataFlirt's extraction pipelines. Most duplicates aren't crawler errors; they are structural realities of how target websites organize and present their data.

Pipelines monitored: 300+ active
Avg duplicate rate: 18.4% pre-filter
Updated: 2026-05-19

01 Category taxonomy overlap
   structural · An item listed in both 'New Arrivals' and 'Shoes'

02 URL parameter variations
   tracking · Session IDs, UTM tags, or sort parameters altering the URL

03 Pagination state shifts
   temporal · Items shifting pages while the crawler is mid-run

04 Multi-seller listings
   marketplace · Same physical product sold by different vendors

05 Network retries
   infra · Timeouts causing the crawler to fetch the same page twice

// 06 — our architecture

Resolve at the edge,

merge in the warehouse.

DataFlirt splits uniqueness enforcement into two distinct phases. Exact-match deduplication happens in-stream using Redis Bloom filters to drop identical payloads before they consume downstream compute. Fuzzy entity resolution — where 'Apple iPhone 15' and 'iPhone 15 - Apple' are merged — runs as a batch process in the delivery layer using dbt and vector similarity. This ensures we never drop a potentially distinct record prematurely, while still delivering a perfectly clean dataset to the client.

Uniqueness enforcement job

Live metrics from a dual-phase deduplication run on a real estate pipeline.

pipeline.id: real-estate-uk-04
bloom_filter.hits: 14,201 (dropped at edge)
dbt.model: stg_properties_dedup
composite_key: postcode + bedrooms + price
fuzzy_merges: 842 records
false_positive_rate: < 0.01% (within SLO)
final_uniqueness: 1.0

// 07 — FAQ

Common
questions.

About primary keys, fuzzy matching, false positives, and how DataFlirt guarantees clean data delivery.

What is the difference between deduplication and entity resolution?
Deduplication usually refers to exact-match filtering — dropping records where the URL or a specific ID string is identical. Entity resolution is the harder problem of determining if two slightly different records (e.g., "Nike Air Max 90" and "Air Max 90 by Nike") represent the same real-world entity. Deduplication is a computational problem; entity resolution is a data science problem.
Should uniqueness be enforced by the crawler or the database?
Both, but for different reasons. The crawler should use URL deduplication (via Bloom filters or seen-sets) to avoid wasting bandwidth and proxy traffic on pages it has already visited. The database or data warehouse must enforce structural uniqueness on the extracted records, because two different URLs can easily yield the exact same data.
How do you handle items that share a name but are actually different?
This is the false positive trap. If you deduplicate purely on product title, you will merge a 128GB phone and a 256GB phone into one record. We use composite primary keys — hashing the title, price, and key attributes (like size or color) together. A record is only considered a duplicate if the entire composite key matches.
How does DataFlirt handle URL parameters that break uniqueness?
Before a URL enters the crawl queue, it passes through a normalization layer. We strip known tracking parameters (utm_*, session_id, ref) and sort the remaining query parameters alphabetically. This ensures that ?b=2&a=1 and ?a=1&b=2 hash to the exact same value, preventing redundant fetches.
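
A minimal sketch of such a normalization step, using only Python's standard library. The parameter names come from the answer above. Dropping the URL fragment and lowercasing the host are extra assumptions of this sketch, and `normalize_url` is an illustrative name rather than a DataFlirt function.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {"session_id", "ref"}
TRACKING_PREFIXES = ("utm_",)

def normalize_url(url):
    """Strip tracking parameters and sort the remaining query parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS and not key.startswith(TRACKING_PREFIXES)
    ]
    kept.sort()  # alphabetical order, so ?b=2&a=1 equals ?a=1&b=2
    # Assumption: the fragment is dropped and the host is lowercased.
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(kept), ""))

print(normalize_url("https://target.com/item/123?ref=home"))
print(normalize_url("https://example.com/p?b=2&a=1&utm_source=mail"))
```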
What happens if the target site doesn't expose a unique ID?
We generate a synthetic primary key. We concatenate stable, immutable fields — such as the brand, the normalized product name, and the manufacturer part number — and generate an MD5 hash. This synthetic key becomes the anchor for all downstream deduplication and delta-file generation.
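
One way to sketch that synthetic key in Python is shown below. The normalization rules (trim, lowercase, collapse whitespace), the field separator, and the sample values are assumptions made for illustration. The answer above only specifies concatenating stable fields and hashing with MD5.

```python
import hashlib
import re

def normalize(value):
    """Trim, lowercase, and collapse internal whitespace (assumed rules)."""
    return re.sub(r"\s+", " ", value.strip().lower())

def synthetic_key(brand, product_name, part_number):
    """MD5 hex digest over the normalized, delimited fields."""
    fields = [normalize(brand), normalize(product_name), normalize(part_number)]
    # A separator that cannot occur in normalized text keeps ("ab", "c")
    # from colliding with ("a", "bc").
    payload = "\x1f".join(fields)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

key_a = synthetic_key("Makita", "18V  LXT Drill", "MPN-0001")  # placeholder MPN
key_b = synthetic_key(" makita ", "18v lxt drill", "mpn-0001")
key_c = synthetic_key("Makita", "18V LXT Drill", "MPN-0002")

print(key_a == key_b)  # same entity despite formatting differences
print(key_a == key_c)  # different part number, different key
```

MD5 is used here as a fingerprint, not for security, so its known cryptographic weaknesses are not the main concern. What matters more is that every field in the key is genuinely stable between crawls.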
Is it legal to scrape the same data multiple times?
Yes, fetching public data repeatedly is generally lawful, provided you respect rate limits and robots.txt directives. However, from an infrastructure perspective, it is highly inefficient. Fetching the same data twice increases your proxy costs and raises your risk of triggering anti-bot classifiers without adding any business value.