What is Resource Deduplication?

Resource deduplication is the practice of intercepting and caching static assets (fonts, CSS, tracking scripts, and heavy JavaScript bundles) across multiple headless browser sessions to prevent redundant downloads. When traffic is routed through expensive residential proxies, downloading the same 2 MB React bundle on every page view destroys unit economics. Deduplication moves the cache layer from the edge to the scraper, cutting both egress costs and page load latency.

Bandwidth Optimization · Headless Browsers · Request Interception · Proxy Egress

// 02 — definitions

Stop downloading the same bytes.

Why headless scraping pipelines bleed money on static assets, and how request interception stops the leak.

TL;DR

Headless browsers default to isolated contexts, meaning they download the entire page payload from scratch every time. Resource deduplication uses request interception to serve known static assets from a local cache instead of routing them through the proxy. This typically cuts bandwidth consumption by 60–80% on modern single-page applications.

01 Definition & structure

Resource deduplication is a bandwidth optimization technique used in headless browser scraping. It relies on request interception APIs (like Playwright's page.route()) to pause outbound network requests. If the requested URL matches a static asset (like a .js, .css, or .woff2 file) that the scraper has already downloaded, the request is fulfilled from a local copy held in memory or on disk. This prevents the browser from fetching the same heavy files repeatedly through the proxy network.

02 The proxy cost problem

Unlike datacenter IPs, which are often billed at a flat rate, residential and mobile proxies are billed by bandwidth (per GB). A modern single-page application (SPA) can easily weigh 3–5 MB per page load, so scraping 100,000 pages routes 300–500 GB of traffic through the proxy. At typical residential rates of $2–$10 per GB, that is roughly $600 to $5,000. Because about 90% of that payload consists of identical framework bundles and fonts, most of that spend goes to downloading the exact same files over and over again.

03 Implementation mechanics

Implementation requires a middleware layer between the browser and the network. When a request fires, the middleware checks the URL (often stripping cache-busting query parameters) against a local key-value store.

On a miss, the request is allowed to continue through the proxy, and the response body is captured and saved to the store.
On a hit, the request is never sent over the network: the middleware fulfills it directly with the cached bytes as a synthetic HTTP 200 response.
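
The miss/hit flow above can be sketched as a handler for Playwright's Python sync API, registered with page.route("**/*", handler). This is a minimal illustration, not DataFlirt's implementation: the names STATIC_TYPES and make_dedup_handler are hypothetical, the cache is a plain dict keyed by URL, and the sketch assumes the fetched body arrives already decoded, which is why it drops the content-encoding and content-length headers before replaying.

```python
from typing import Callable, Dict

# Resource types treated as cacheable static assets (an assumption for this sketch).
STATIC_TYPES = {"script", "stylesheet", "font"}

# Headers that describe the wire encoding rather than the decoded body we replay.
DROPPED_HEADERS = {"content-encoding", "content-length"}


def make_dedup_handler(cache: Dict[str, dict]) -> Callable:
    """Build a handler suitable for page.route("**/*", handler)."""

    def handler(route) -> None:
        request = route.request
        if request.resource_type not in STATIC_TYPES:
            # Documents, XHR, and other dynamic traffic always go through the proxy.
            route.continue_()
            return

        entry = cache.get(request.url)
        if entry is None:
            # Miss: fetch through the proxy once and capture the response.
            response = route.fetch()
            entry = {
                "status": response.status,
                "headers": {
                    k: v
                    for k, v in response.headers.items()
                    if k.lower() not in DROPPED_HEADERS
                },
                "body": response.body(),
            }
            if response.status == 200:
                # Only successful responses are worth replaying later.
                cache[request.url] = entry

        # Hit, or a miss that was just fetched: answer from the captured bytes.
        route.fulfill(
            status=entry["status"], headers=entry["headers"], body=entry["body"]
        )

    return handler
```

Keeping the handler free of Playwright imports means the same function can be exercised with a stub route object in unit tests before it is attached to a real page.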

04 How DataFlirt handles it

We run a distributed deduplication architecture. Instead of each worker maintaining its own isolated cache, our Playwright nodes connect to a centralized Redis cluster. When Worker A downloads a new React bundle from a target site, it is immediately available to Worker B, C, and D. This global cache ensures that across a fleet of thousands of concurrent browsers, we achieve maximum egress efficiency from the very first minute of a crawl.
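
To make the shared-cache idea concrete, here is a small sketch of the contract such a store might expose. It is an in-memory stand-in, not the Redis-backed system described above: the class name SharedAssetCache, the "asset:" key prefix, and the SHA-256 key scheme are all assumptions made for illustration. The put_if_absent method mirrors first-writer-wins semantics, which Redis offers through the NX option of its SET command.

```python
import hashlib
from typing import Dict, Optional


class SharedAssetCache:
    """In-memory stand-in for a cache shared by many workers (illustrative only)."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    @staticmethod
    def key_for(url: str) -> str:
        # Hash the URL so keys have a fixed length regardless of URL size.
        return "asset:" + hashlib.sha256(url.encode("utf-8")).hexdigest()

    def put_if_absent(self, url: str, body: bytes) -> bool:
        """Store the body only if no worker has stored this URL yet."""
        key = self.key_for(url)
        if key in self._store:
            return False  # another worker already paid for this download
        self._store[key] = body
        return True

    def get(self, url: str) -> Optional[bytes]:
        """Return the cached body, or None on a miss."""
        return self._store.get(self.key_for(url))


# Worker A downloads the bundle once; Worker B then reads it without proxy traffic.
shared = SharedAssetCache()
shared.put_if_absent("https://target.com/assets/app-v4.js", b"bundle-bytes")
body_for_worker_b = shared.get("https://target.com/assets/app-v4.js")
```

First-writer-wins keeps concurrent workers from overwriting each other when several of them miss on the same asset at the same moment.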

05 The anti-bot timing trap

A common mistake when building deduplication systems is serving the cached files too fast. If a browser claims to be a fresh residential user on a 4G connection but downloads a 2 MB JavaScript file in 2 milliseconds, sophisticated anti-bot systems (like Akamai or DataDome) can flag the session as an automated anomaly. Production-grade deduplication must artificially delay the synthetic response to match the expected latency of the proxy connection.
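
One simple way to pick that delay is to model the connection as a round trip plus a transfer time at the advertised bandwidth. The function below is a rough sketch under that assumption; the name simulated_delay_seconds and the example figures (20 Mbps, 80 ms) are illustrative, not measured values, and a real pipeline would likely add jitter on top.

```python
def simulated_delay_seconds(size_bytes: int, bandwidth_mbps: float, rtt_ms: float) -> float:
    """Estimate how long a real download of size_bytes would take.

    Model: one round trip of latency plus size divided by bandwidth.
    """
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth_mbps must be positive")
    transfer_seconds = (size_bytes * 8) / (bandwidth_mbps * 1_000_000)
    return rtt_ms / 1000 + transfer_seconds


# A 2 MB bundle on an assumed 20 Mbps link with 80 ms round-trip time.
delay = simulated_delay_seconds(2_000_000, bandwidth_mbps=20, rtt_ms=80)
```

The handler would wait for this long before fulfilling a cached response, so the asset's load time looks like a proxy fetch rather than a local read.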

// 03 — the economics

How much does redundancy cost?

Residential proxy bandwidth is the single largest variable cost in a headless scraping pipeline. Deduplication directly attacks this multiplier by removing static assets from the proxy billing equation.

Naive egress cost (standard headless execution):
C_naive = Pages × (HTML + Assets) × Proxy_Rate

Without deduplication, you pay the proxy rate for the entire payload on every page load.

Deduplicated egress cost (optimized pipeline model):
C_dedup = (Pages × HTML × Proxy_Rate) + (Unique_Assets × Proxy_Rate)

You only pay for the static assets once. Subsequent loads are served from local memory.

Cache hit rate (DataFlirt performance SLO):
H = Local_Served_Bytes / Total_Requested_Bytes

Our target is H > 0.85 for JS-heavy e-commerce targets.
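
The three formulas translate directly into code. The sketch below uses the same terms (Pages, HTML, Assets, Unique_Assets, Proxy_Rate) with sizes in GB and the rate in dollars per GB. The example inputs are assumptions chosen for illustration: 50 KB of HTML and 3.15 MB of assets per page, 0.5 GB of unique assets across the crawl, and a rate of $5 per GB.

```python
def naive_cost(pages: int, html_gb: float, assets_gb: float, rate: float) -> float:
    """Pages x (HTML + Assets) x Proxy_Rate: every byte goes through the proxy."""
    return pages * (html_gb + assets_gb) * rate


def dedup_cost(pages: int, html_gb: float, unique_assets_gb: float, rate: float) -> float:
    """(Pages x HTML x Proxy_Rate) + (Unique_Assets x Proxy_Rate)."""
    return pages * html_gb * rate + unique_assets_gb * rate


def hit_rate(local_served_bytes: float, total_requested_bytes: float) -> float:
    """H = Local_Served_Bytes / Total_Requested_Bytes."""
    if total_requested_bytes <= 0:
        raise ValueError("total_requested_bytes must be positive")
    return local_served_bytes / total_requested_bytes


# Illustrative crawl: 100,000 pages at an assumed $5 per GB.
before = naive_cost(100_000, html_gb=0.00005, assets_gb=0.00315, rate=5.0)
after = dedup_cost(100_000, html_gb=0.00005, unique_assets_gb=0.5, rate=5.0)
```

With these inputs the naive model comes to about $1,600 and the deduplicated model to about $27.50. The gap depends almost entirely on how small Unique_Assets is relative to Pages × Assets.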

// 04 — playwright interception

Intercepting and serving from cache.

A trace of a Playwright worker intercepting network requests. The HTML and dynamic XHR go through the proxy; the heavy static assets are served from local memory.

Playwright · route.fulfill · memory cache

// page.route('**/*', handler)
request: "https://target.com/product/123"
type: "document" CACHE MISS
action: route.continue() // routed via proxy

request: "https://target.com/assets/app-v4.js"
type: "script" size: 1.8 MB
cache_lookup: HIT hash: 8f9a2b...
action: route.fulfill({ body: cachedBuffer })

request: "https://target.com/fonts/inter.woff2"
type: "font" size: 320 KB
cache_lookup: HIT
action: route.fulfill({ body: cachedBuffer })

// page load complete
total_payload: 2.4 MB
proxy_egress: 142 KB // 94% bandwidth saved

// 05 — payload breakdown

Where the bandwidth actually goes.

Average payload distribution on a modern e-commerce product page. Deduplication targets the top three categories, which account for the vast majority of proxy egress.

SAMPLE SIZE: 100k product pages
AVG PAYLOAD: 3.2 MB per page
UPDATED: 2026-05-19

Images & Media: 1.8 MB avg · Product photos, banners, icons
JavaScript Bundles: 900 KB avg · React/Vue, tracking, anti-bot scripts
Fonts: 300 KB avg · Web fonts (WOFF2)
CSS Stylesheets: 150 KB avg · Compiled styles
HTML & XHR (Data): 50 KB avg · The actual data you want

// 06 — our architecture

Cache globally, execute locally.

DataFlirt implements resource deduplication at the cluster level. When a worker encounters a new static asset, it fetches it through the proxy and writes it to a Redis-backed distributed cache. Subsequent requests for that URL from any worker in the cluster are intercepted and fulfilled from memory. This ensures that even across thousands of isolated browser contexts, we only pay the proxy egress cost for a static asset exactly once.

worker-cache-stats

Live bandwidth metrics for a single scraping worker over a 1-hour window.

worker.id: node-aws-eu-west-04
pages.processed: 14,200
payload.total: 45.4 GB
cache.hits: 382,400 requests
bytes.served_local: 39.8 GB
bytes.proxy_egress: 5.6 GB
egress.savings: 87.6%

// 07 — FAQ

Common questions.

About bandwidth optimization, request interception, anti-bot implications, and how DataFlirt manages caching at scale.

What is the difference between resource deduplication and resource blocking?

Resource blocking aborts the request entirely — the browser never receives the file. Deduplication intercepts the request and serves the file from a local cache. Blocking is cheaper but breaks page rendering and triggers anti-bot sensors that check if scripts loaded. Deduplication gives you the bandwidth savings of blocking while maintaining the execution fidelity of a full page load.

Why not just use the browser's built-in cache?

To prevent session leakage and fingerprint tracking, production scrapers run each page load in an isolated browser context (essentially incognito mode). When the context is destroyed, the cache is destroyed. If you share a context to keep the cache, you share cookies and local storage, which leads to immediate bans. Request interception lets us cache assets globally while keeping contexts isolated.

Does caching static assets affect anti-bot fingerprinting?

It can, if done naively. Some advanced anti-bot scripts measure the load time of specific assets. If a 2MB script loads in 1ms, the sensor knows it was intercepted or cached locally, which is suspicious for a "new" user. We simulate realistic network latency when fulfilling cached requests for known anti-bot domains to bypass these timing checks.

How do you handle cache-busting URLs?

Many sites append random query strings to static assets (e.g., app.js?v=12345) to force cache misses. Our interception layer uses fuzzy matching and regex patterns to strip known cache-busting parameters before checking the Redis cache, ensuring we still get a hit even if the URL mutates slightly.
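
As a rough illustration of the stripping step, the function below normalizes a URL into a cache key by dropping query parameters that look like cache busters. The parameter list CACHE_BUSTERS and the function name cache_key are hypothetical; a real deployment would tune the list per target, because on some sites a parameter like v identifies a genuinely different build of the file.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters assumed to be cache busters (illustrative, not exhaustive).
CACHE_BUSTERS = {"v", "ver", "version", "cb", "_", "t", "ts"}


def cache_key(url: str) -> str:
    """Normalize a URL so cache-busted variants map to the same key."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in CACHE_BUSTERS
    ]
    kept.sort()  # parameter order should not change the key
    # Lowercase scheme and host, keep the path as-is, and drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


key_a = cache_key("https://target.com/assets/app.js?v=12345")
key_b = cache_key("https://target.com/assets/app.js?v=67890")
```

Both key_a and key_b resolve to the same key, so the second request is a hit even though the raw URL changed.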

Is resource deduplication necessary for plain HTTP scraping?

Rarely. If you are using httpx or requests, you are typically only fetching the HTML document or a specific JSON API endpoint anyway. You aren't downloading the CSS, fonts, or images unless you explicitly write code to do so. Deduplication is primarily an optimization for headless browser pipelines.

How much money does this actually save?

Residential proxies typically cost between $2 and $10 per GB. If a product page payload is 3 MB, 1,000 pages transfer about 3 GB ($6–$30). With an 85% cache hit rate, only 15% of those bytes cross the proxy, so that drops to 0.45 GB ($0.90–$4.50). At a scale of millions of pages per month, deduplication can be the difference between a profitable pipeline and a loss-making one.

Related glossary terms

Resource Blocking

Payload Size Optimization

Egress Cost Optimization

Headless Browser