What is DNS Caching?

DNS caching is the temporary storage of domain name resolution records close to the client (in the OS, the HTTP client, or a local resolver) so that repeated lookups do not pay network latency each time. For high-throughput scraping pipelines, relying on the OS default is rarely enough, and managing your own DNS cache becomes a practical requirement. Sending 10,000 uncached lookups per second to public resolvers invites rate limiting, added latency, and eventual pipeline failure.

Network Layer · Latency · TTL · Resolver · Throughput
// 02 — definitions

Skip the lookup.

Why resolving a hostname to an IP address should happen exactly once per TTL window, not once per HTTP request.

TL;DR

DNS caching stores the IP address of a target domain locally for a specified Time-To-Live (TTL). In scraping, managing this cache at the application or worker level prevents DNS timeouts, reduces per-request latency by 20–100ms, and avoids triggering abuse thresholds at upstream resolvers like Cloudflare or Google.

01 Definition & structure

DNS caching is the practice of storing the result of a DNS query (the IP address associated with a domain name) for a set period, known as the Time-To-Live (TTL). When a scraper requests api.target.com, the system checks the cache first. If the record is present and valid, the IP is returned instantly. If not, a full network query is sent to an upstream resolver.

Caching can occur at multiple layers: the browser/HTTP client, the operating system (e.g., systemd-resolved), a local network resolver, or an upstream recursive resolver run by an ISP or a public provider. For scraping, controlling the cache at the application or worker-node level is critical for performance.
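
The check-then-resolve flow described in this section can be sketched at the application layer. This is a minimal illustration, not DataFlirt's implementation: the `TTLDnsCache` class, the `Resolver` signature, and the fixed 300-second TTL in `system_resolver` are assumptions made for the example (the standard `socket.getaddrinfo` call does not expose the record's real TTL).

```python
import socket
import time
from typing import Callable, Dict, Tuple

# Hypothetical resolver signature for this sketch: hostname -> (ip, ttl_seconds).
Resolver = Callable[[str], Tuple[str, float]]


def system_resolver(hostname: str) -> Tuple[str, float]:
    """Resolve through the OS. getaddrinfo does not return the record's TTL,
    so a fixed 300 s is assumed here purely for illustration."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return infos[0][4][0], 300.0


class TTLDnsCache:
    """Return a cached IP while the record is valid, otherwise query upstream."""

    def __init__(
        self,
        resolve: Resolver = system_resolver,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # host -> (ip, expires_at)
        self.upstream_queries = 0

    def lookup(self, hostname: str) -> str:
        now = self._clock()
        entry = self._entries.get(hostname)
        if entry is not None and now < entry[1]:
            return entry[0]  # cache hit: record present and still valid
        ip, ttl = self._resolve(hostname)  # miss or expired: full upstream query
        self.upstream_queries += 1
        self._entries[hostname] = (ip, now + ttl)
        return ip
```

Injecting the resolver and the clock keeps the cache testable without touching the network; a production cache would also need locking and a size bound.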

02 How it works in practice

When a scraping script initiates a request, the HTTP client asks the OS for the IP. If the OS cache has no valid entry, the query goes to the configured upstream resolver, for example 1.1.1.1 or 8.8.8.8. The upstream resolver returns the IP along with a TTL (e.g., 300 seconds). For the next 5 minutes, any request to that domain uses the cached IP, reducing lookup time from ~40ms to ~0.1ms.

Once the 300 seconds pass, the record expires. The next request triggers a new upstream query. In high-throughput systems, thousands of concurrent threads might hit this expiration simultaneously, causing a "thundering herd" of DNS queries unless mitigated by prefetching.
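
Prefetching is one mitigation; a complementary one, not named above, is request coalescing (often called "single flight"), where concurrent lookups for the same hostname share one upstream query. The sketch below shows only that coalescing step. `SingleFlightResolver` and the `resolve` callable are hypothetical names for this example, and no caching is included.

```python
import threading
from typing import Callable, Dict, Optional


class _Call:
    """State shared by every caller waiting on one in-flight lookup."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Optional[str] = None
        self.error: Optional[BaseException] = None


class SingleFlightResolver:
    """Coalesce concurrent lookups for one hostname into a single upstream query."""

    def __init__(self, resolve: Callable[[str], str]) -> None:
        self._resolve = resolve  # hypothetical upstream call: hostname -> ip
        self._lock = threading.Lock()
        self._in_flight: Dict[str, _Call] = {}

    def lookup(self, hostname: str) -> str:
        with self._lock:
            call = self._in_flight.get(hostname)
            leader = call is None
            if leader:
                call = _Call()
                self._in_flight[hostname] = call
        if not leader:
            call.done.wait()  # follower: wait for the leader's answer
        else:
            try:
                call.result = self._resolve(hostname)  # leader: one real query
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._in_flight[hostname]
                call.done.set()
        if call.error is not None:
            raise call.error
        return call.result
```

In practice this would sit in front of a TTL cache, so that the expiry of a popular record produces one upstream query per worker rather than one per thread.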

03 Caching vs. Connection Pooling

DNS caching and connection pooling solve related but distinct problems. Connection pooling (Keep-Alive) keeps the underlying TCP/TLS socket open between requests. If you reuse a socket, you don't need a DNS lookup at all.

However, scrapers frequently rotate proxies, user-agents, or IP addresses to avoid detection. Every time you rotate an IP or open a new session, you must create a new socket, which requires a DNS lookup. This is where DNS caching becomes the primary defense against latency and upstream rate limits.

04 How DataFlirt handles it

We do not rely on the host OS for DNS. Every DataFlirt worker node runs a local instance of CoreDNS. This gives us three advantages:

  • Serve Stale: If a TTL expires, we return the old IP instantly while fetching the new one in the background.
  • Prefetching: Popular target domains are refreshed automatically before their TTL expires.
  • Concurrency: CoreDNS handles millions of queries per second without the socket exhaustion issues common to default Linux resolvers.
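
The serve-stale and prefetch behaviours in the list above are CoreDNS features; the Python sketch below only imitates their logic in-process to make the idea concrete. `ServeStaleCache`, its `prefetch_margin` parameter, and the `resolve` callable are hypothetical names, not CoreDNS or DataFlirt APIs.

```python
import threading
import time
from typing import Callable, Dict, Set, Tuple


class ServeStaleCache:
    """Answer from cache even after expiry and refresh in the background."""

    def __init__(
        self,
        resolve: Callable[[str], Tuple[str, float]],  # hostname -> (ip, ttl_seconds)
        clock: Callable[[], float] = time.monotonic,
        prefetch_margin: float = 0.0,  # refresh this many seconds before expiry
    ) -> None:
        self._resolve = resolve
        self._clock = clock
        self._prefetch_margin = prefetch_margin
        self._lock = threading.Lock()
        self._entries: Dict[str, Tuple[str, float]] = {}  # host -> (ip, expires_at)
        self._refreshing: Set[str] = set()

    def lookup(self, hostname: str) -> str:
        with self._lock:
            entry = self._entries.get(hostname)
            if entry is not None:
                ip, expires_at = entry
                due = self._clock() >= expires_at - self._prefetch_margin
                if due and hostname not in self._refreshing:
                    self._refreshing.add(hostname)
                    threading.Thread(
                        target=self._refresh, args=(hostname,), daemon=True
                    ).start()
                return ip  # fresh or stale, the caller never waits on upstream
        # Cold miss: nothing to serve, so this caller resolves synchronously.
        self._refresh(hostname)
        with self._lock:
            return self._entries[hostname][0]

    def _refresh(self, hostname: str) -> None:
        try:
            ip, ttl = self._resolve(hostname)
            with self._lock:
                self._entries[hostname] = (ip, self._clock() + ttl)
        finally:
            with self._lock:
                self._refreshing.discard(hostname)
```

If a background refresh fails, the old entry stays in place and the next lookup retries. A real resolver would also cap how long a stale record may be served.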

05 Did you know?

Standard HTTP libraries in Python (like requests or httpx) and Node.js (like axios) do not implement application-level DNS caching by default. They delegate entirely to the OS. If you run a script inside a minimal Docker container (which often lacks an OS-level caching daemon), you are likely performing a full network DNS lookup for every new connection your scraper opens, which without connection reuse means every single request.
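
One way to add a process-wide cache without changing the HTTP library is to wrap `socket.getaddrinfo`, which Python clients ultimately rely on for resolution. The sketch below is an illustration under stated assumptions: `cached_getaddrinfo` is a hypothetical helper, the fixed `ttl_seconds` stands in for the real record TTL (which `getaddrinfo` does not expose), and only code that looks up `socket.getaddrinfo` at call time will pick up the wrapper.

```python
import contextlib
import socket
import time


@contextlib.contextmanager
def cached_getaddrinfo(ttl_seconds: float = 60.0):
    """Temporarily replace socket.getaddrinfo with a caching wrapper."""
    original = socket.getaddrinfo
    cache = {}  # (args, sorted kwargs) -> (expires_at, result)

    def wrapper(*args, **kwargs):
        key = (args, tuple(sorted(kwargs.items())))
        now = time.monotonic()
        hit = cache.get(key)
        if hit is not None and now < hit[0]:
            return hit[1]  # served from the in-process cache
        result = original(*args, **kwargs)  # real resolution through the OS
        cache[key] = (now + ttl_seconds, result)
        return result

    socket.getaddrinfo = wrapper
    try:
        yield cache
    finally:
        socket.getaddrinfo = original  # always restore the original function
```

Wrapping the scraping loop in `with cached_getaddrinfo(ttl_seconds=60.0):` applies the cache for the duration of the block. The wrapper is not hardened for heavy multithreaded use and does not bound the cache size.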

// 03 — the latency math

How much time does caching save?

At scale, DNS resolution time dominates the connection phase. DataFlirt's worker nodes run local caching resolvers to keep cache hit rates above 99.9%.

Effective Latency: $L_{\text{eff}} = (1 - H) \times L_{\text{dns}} + L_{\text{http}}$
H is the cache hit rate. A high H makes DNS latency mathematically negligible. (Network Engineering 101)

Upstream Query Volume: $Q = \dfrac{\text{RPS}}{\text{Workers} \times \text{TTL}}$
Queries sent to public resolvers per second. Without caching, Q = RPS. (DataFlirt Infrastructure)

Stale Hit Risk: $P_{\text{stale}} = T_{\text{cache}} - \text{TTL}_{\text{target}}$
Time elapsed past TTL during which the cached IP might be invalid. (Resolver configuration)
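
As a worked example, reading the effective latency formula as (1 - H) × L_dns + L_http, the snippet below plugs in the ~40 ms miss cost and the 99.9% hit rate quoted in this article. The 120 ms HTTP time is an assumed placeholder, not a measured figure.

```python
def effective_latency_ms(hit_rate: float, dns_ms: float, http_ms: float) -> float:
    """L_eff = (1 - H) * L_dns + L_http, all latencies in milliseconds."""
    return (1.0 - hit_rate) * dns_ms + http_ms


# Assumed inputs: 40 ms per uncached lookup, 120 ms for the HTTP exchange itself.
no_cache = effective_latency_ms(hit_rate=0.0, dns_ms=40.0, http_ms=120.0)
with_cache = effective_latency_ms(hit_rate=0.999, dns_ms=40.0, http_ms=120.0)

print(f"no cache:   {no_cache:.2f} ms per request")
print(f"99.9% hits: {with_cache:.2f} ms per request")
```

With these inputs the DNS share of each request drops from 40 ms to roughly 0.04 ms, which is why a high hit rate makes the lookup cost negligible.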
// 04 — resolver trace

A worker node's DNS lifecycle.

Trace of a high-concurrency scraper hitting a target domain. The first request pays the lookup penalty; the next 9,999 do not.

CoreDNS · A Record · TTL 300
edge.dataflirt.io — live
CAPTURED
// req 1: cache miss
dns.lookup: "api.target.com"
resolver.upstream: 1.1.1.1
dns.latency: 42ms
dns.response: 104.18.22.5 ttl: 300s

// req 2–10,000: cache hit
dns.lookup: "api.target.com"
resolver.local: HIT
dns.latency: 0.1ms

// req 10,001: TTL expired (301s later)
dns.lookup: "api.target.com"
resolver.local: STALE // background refresh triggered
dns.latency: 0.1ms // served stale while fetching
// 05 — failure modes

When DNS caching breaks pipelines.

Improper DNS caching causes silent failures that look like network drops. Here is what typically goes wrong when scrapers rely on default OS resolvers.

DNS timeouts: < 0.001%
Cache hit rate: 99.98%
Updated: 2026-05-19

01 Upstream rate limits
NXDOMAIN / Timeout · Public resolvers block high RPS from single IPs

02 Stale IP routing
Connection Refused · Target rotated IPs, cache ignored TTL

03 OS resolver bottlenecks
High latency · Linux systemd-resolved concurrency limits hit

04 IPv6 blackholing
AAAA record drops · Target has broken IPv6, cache prefers it

05 DNS rebinding blocks
0.0.0.0 returned · Anti-bot poisons DNS for known scraper IPs
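
Two of these failure modes (broken IPv6 paths and poisoned answers such as 0.0.0.0) can be screened for before a connection is attempted. The filter below is a minimal sketch using Python's standard `ipaddress` module. `usable_addresses` is a hypothetical helper, and treating every loopback or unspecified answer as unusable is an assumption that fits public scraping targets, not local services.

```python
import ipaddress
from typing import Iterable, List


def usable_addresses(addresses: Iterable[str], allow_ipv6: bool = False) -> List[str]:
    """Drop resolved addresses a scraper should not connect to."""
    usable = []
    for raw in addresses:
        ip = ipaddress.ip_address(raw)
        if ip.is_unspecified or ip.is_loopback:
            continue  # 0.0.0.0 / 127.0.0.1 style answers: likely poisoned
        if ip.version == 6 and not allow_ipv6:
            continue  # skip AAAA answers when the IPv6 path is unreliable
        usable.append(raw)
    return usable
```

An empty result is a signal to retry through a different resolver or proxy instead of caching the bad answer.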
// 06 — architecture

Control the resolver, or the resolver controls your throughput.

Default HTTP clients in Python or Node.js rely on the host OS for DNS resolution. At 50 requests per second, this is fine. At 5,000 requests per second, the OS resolver becomes a severe bottleneck, dropping UDP packets and causing artificial timeouts. DataFlirt bypasses the OS entirely, running dedicated caching resolvers on every worker node. We serve stale records during background refreshes, ensuring that a slow upstream DNS server never blocks an active extraction thread.

Worker DNS Configuration

Local resolver settings for a high-throughput scraping node.

resolver.engine CoreDNS 1.11
cache.hit_rate 99.98%
serve_stale true
prefetch true
upstream.primary 1.1.1.1 (TLS)
upstream.fallback 8.8.8.8
dns.timeouts 0.001%

// 07 — FAQ

Common questions.

Common questions about DNS resolution, TTL management, and avoiding resolver-level blocks during high-volume scraping.

Why am I getting DNS resolution errors only when scraping fast?
You are hitting rate limits at your upstream DNS provider (like your ISP, Google, or Cloudflare). Public resolvers will throttle or drop queries if a single IP sends thousands of lookups per second. You need a local caching resolver to absorb the volume.
Should I hardcode target IPs in my /etc/hosts file?
No. Modern targets use CDNs and load balancers that rotate IPs frequently based on server health and geography. Hardcoding an IP guarantees your pipeline will eventually break with a Connection Refused error when the target decommissions that specific edge node.
How does DataFlirt handle targets with very low TTLs?
Some CDNs set TTLs to 30 seconds for load balancing. We respect the TTL but enable 'serve stale' — if the TTL expires, we immediately return the old IP to the scraper while asynchronously fetching the new IP in the background. This incurs zero latency penalty for the active request.
Does Python's Requests library cache DNS?
No. By default, requests (via urllib3 and the socket module) asks the OS to resolve the hostname for every new connection. If you aren't reusing connections through a Session (Keep-Alive), every request opens a new connection and triggers a new lookup, which becomes a full network query when no OS-level cache is running.
What is DNS over HTTPS (DoH) and should scrapers use it?
DoH encrypts DNS queries, preventing ISPs from seeing which domains you resolve. While good for privacy, it adds TLS handshake overhead to the lookup. For scraping, standard UDP DNS to a trusted upstream is faster, provided you cache the result locally.
Can anti-bot systems use DNS to block scrapers?
Yes. Some enterprise WAFs will return poisoned DNS records (like 127.0.0.1 or 0.0.0.0) if the querying IP is a known datacenter proxy. Using residential proxies that resolve DNS through their local ISP bypasses this specific vector.