
What is robots.txt?

robots.txt is a plain text file at the root of a domain that tells web crawlers which paths they may access and, through the non-standard Crawl-delay directive, how fast they should request them. Compliance is voluntary and rarely enforced by law, but ignoring its directives is one of the fastest ways to trigger network-layer bans and burn your proxy pool. For production data pipelines, it serves as the baseline contract for sustainable, high-volume extraction.

Web Scraping · Compliance · Crawling · RFC 9309 · Rate Limiting
// 02 - definitions

The crawler's
first stop.

The long-standing convention, formalized as RFC 9309 in 2022, that governs automated access to the web, and why respecting it is an operational necessity rather than just a courtesy.


TL;DR

The Robots Exclusion Protocol (RFC 9309) uses User-agent, Allow, and Disallow directives to manage bot traffic; many sites also add the non-standard Crawl-delay directive. While malicious bots ignore the file entirely, enterprise scraping operations parse and cache it to calculate safe concurrency limits and avoid honeypot traps that lead to permanent IP blacklisting.

01 Definition & structure
The robots.txt file is a plain text document served from the root of a host, always at the path /robots.txt. It follows the Robots Exclusion Protocol, telling automated clients which areas of the site they should not crawl. It consists of groups, each made of one or more User-agent lines followed by Allow or Disallow rules, and optionally a non-standard Crawl-delay line to throttle request rates.
02 How it works in practice
A well-behaved crawler requests robots.txt before fetching any other URL on a new host. It parses the file, selects the group that matches its own User-agent token (falling back to the * wildcard group if none matches), and caches those rules. Before each subsequent HTTP request, the crawler's internal router checks the target URL against the cached Allow and Disallow paths. Under RFC 9309 the most specific (longest) matching rule wins, and Allow wins a tie. If the winning rule is a Disallow, the request is dropped locally before it ever hits the network.
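The parse, cache, and check loop described above can be sketched with Python's standard-library urllib.robotparser. This is a minimal illustration, not DataFlirt's router: the rules, the example-bot token, and the example.com URLs are all hypothetical, and the rules are inlined so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Allow: /public/
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text once; the parser object acts as the cached rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Local gate: True only if the cached rules permit this URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
for url in ("https://example.com/public/page", "https://example.com/private/data"):
    verdict = "dispatch" if should_fetch(parser, "example-bot", url) else "drop locally"
    print(url, "->", verdict)
```

One caveat on the design choice: the standard-library parser applies rules in file order rather than by longest match, so a crawler that needs strict RFC 9309 precedence on overlapping Allow and Disallow paths would likely need a different parser.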
03 The Crawl-delay directive
Crawl-delay is a non-standard extension: it is not part of RFC 9309, and Googlebot ignores it. Many sites still publish it to protect their infrastructure, and some crawlers honor it. It specifies the minimum number of seconds a crawler should wait between successive requests to the same server. Ignoring this value is highly visible to server administrators and Web Application Firewalls (WAFs), making it, in our incident logs, the most common trigger for automated IP bans during large-scale extraction jobs.
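As a rough sketch of honoring the directive, the snippet below reads Crawl-delay with the standard-library parser and enforces it with a small throttle. The DelayThrottle class, the example-bot token, and the inlined rules are assumptions made for illustration; the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical rules, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /tmp/
"""


class DelayThrottle:
    """Enforce a minimum gap between successive requests to one server."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.delay = float(delay_seconds)
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next request may start

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        pause = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            pause = self._next_allowed - now
            self._sleep(pause)
        self._next_allowed = now + pause + self.delay
        return pause


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
delay = parser.crawl_delay("example-bot")  # None when no Crawl-delay is set
throttle = DelayThrottle(delay if delay is not None else 0)
throttle.wait()  # the first call returns immediately
print("crawl_delay:", delay)
```

Calling throttle.wait() before every request to the same server keeps successive requests at least the published delay apart.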
04 How DataFlirt handles it
We treat robots.txt as a strict routing contract. Our ingestion layer automatically parses the file and injects the exclusion rules into our distributed task queues. When a target specifies a restrictive Crawl-delay, our scheduler automatically distributes the workload across a wider segment of our residential proxy pool. This ensures that no single IP violates the target's requested rate limit, allowing us to maintain high pipeline throughput without triggering defensive countermeasures.
05 Did you know: Honeypot paths
Security teams frequently use robots.txt offensively. They will add a Disallow: /trap/ directive and then place hidden links to /trap/ in the HTML of their site. Because human users cannot see the link, only an automated crawler that ignores the robots.txt file will ever request that path. Once the WAF sees a request to the honeypot, it instantly and permanently blacklists the offending IP address.
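One way to stay out of such traps is to filter every discovered link against the cached rules before it is queued. The sketch below assumes a hypothetical filter_links helper, a hypothetical example-bot token, and links that all point at the same host as the page they were found on.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical rules mirroring the /trap/ example, inlined as text.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def filter_links(parser, user_agent, page_url, hrefs):
    """Resolve hrefs against the page URL and keep only the permitted ones.

    Assumes every href stays on the same host as page_url; links to other
    hosts would need that host's own robots.txt.
    """
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if parser.can_fetch(user_agent, absolute):
            kept.append(absolute)
    return kept


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
links = filter_links(
    parser,
    "example-bot",
    "https://example.com/",
    ["/products/1", "/trap/hidden", "about"],
)
print(links)
```

Because the hidden /trap/ link never enters the queue, the crawler never requests it, whether or not the link was visible in the rendered page.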
// 03 - the math

Calculating safe
crawl rates.

A robots.txt file doesn't just tell you where to go - it tells you how fast to move. DataFlirt's scheduler translates these text directives into hard mathematical bounds for our distributed workers.

Max Requests Per Second (RPS) = 1 / Crawl-delay
A Crawl-delay of 5 means a maximum of 0.2 RPS per IP. (Standard Crawler Implementation)
Effective Concurrency = Target_RPS × Avg_Latency
How many parallel connections you can hold without violating the delay, with Avg_Latency measured in seconds. At 0.2 RPS and 2 s average latency the result is 0.4, so a single connection per IP is enough. (DataFlirt Scheduler Model)
DataFlirt Compliance Score = 1 - (Disallowed_Hits / Total_Requests)
Maintained at 1.0 across all standard compliance pipelines. (Internal SLO)
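The three formulas translate directly into code. The function names below are illustrative rather than DataFlirt's actual scheduler API, latency is assumed to be in seconds, and the guards for a zero delay or zero requests are additions for safety.

```python
def max_rps(crawl_delay_s: float) -> float:
    """Max Requests Per Second = 1 / Crawl-delay."""
    if crawl_delay_s <= 0:
        raise ValueError("crawl delay must be positive")
    return 1.0 / crawl_delay_s


def effective_concurrency(target_rps: float, avg_latency_s: float) -> float:
    """Effective Concurrency = Target_RPS x Avg_Latency (latency in seconds)."""
    return target_rps * avg_latency_s


def compliance_score(disallowed_hits: int, total_requests: int) -> float:
    """Compliance Score = 1 - (Disallowed_Hits / Total_Requests)."""
    if total_requests == 0:
        return 1.0  # nothing sent, so nothing violated
    return 1.0 - disallowed_hits / total_requests


rps = max_rps(5)  # a Crawl-delay of 5 seconds
print("max RPS per IP:", rps)
print("connections in flight at 2 s latency:", effective_concurrency(rps, 2.0))
print("compliance score:", compliance_score(0, 10_000))
```

A concurrency value below 1 means one connection per IP already saturates the allowed rate; opening more would only queue requests behind the delay.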
// 04 - what the parser sees

Ingesting the
exclusion protocol.

Before a crawl begins, the pipeline fetches the robots.txt file, parses the rules for the specific User-Agent, and configures the routing layer.

RFC 9309 · Parser · Sitemap Discovery
edge.dataflirt.io — live
CAPTURED
// [00:00.01] Fetching robots.txt
GET https://target-ecommerce.com/robots.txt
status: 200 OK

// [00:00.04] Parsing directives
User-agent: *
Disallow: /checkout/
Disallow: /cart/
Disallow: /api/internal/
Crawl-delay: 2

// [00:00.08] Locating sitemaps
Sitemap: https://target-ecommerce.com/sitemap_index.xml
sitemaps_found: 14

// [00:00.12] Configuring scheduler
max_rps: 0.5 // derived from Crawl-delay: 2
blocked_paths_loaded: 3
pipeline.status: READY
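The capture above can be approximated offline. This sketch feeds the same directives to Python's standard-library parser and derives the scheduler values; the scheduler_config function and its dictionary keys are hypothetical, and the Disallow count assumes a file with a single rule group. It does not expand the sitemap index, so it cannot reproduce the sitemaps_found figure.

```python
from urllib.robotparser import RobotFileParser

# The directives from the capture, inlined as text.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Disallow: /cart/
Disallow: /api/internal/
Crawl-delay: 2
Sitemap: https://target-ecommerce.com/sitemap_index.xml
"""


def scheduler_config(robots_txt: str, user_agent: str) -> dict:
    """Derive illustrative scheduler settings from robots.txt text."""
    lines = robots_txt.splitlines()
    parser = RobotFileParser()
    parser.parse(lines)
    delay = parser.crawl_delay(user_agent)
    # Collect non-empty Disallow values; assumes a single rule group.
    blocked = []
    for line in lines:
        key, _, value = line.partition(":")
        if key.strip().lower() == "disallow" and value.strip():
            blocked.append(value.strip())
    return {
        "max_rps": 1 / delay if delay else None,  # None means no published limit
        "blocked_paths": blocked,
        "sitemaps": parser.site_maps() or [],  # site_maps() needs Python 3.8+
    }


config = scheduler_config(ROBOTS_TXT, "example-bot")
print("max_rps:", config["max_rps"])
print("blocked_paths_loaded:", len(config["blocked_paths"]))
print("sitemaps:", config["sitemaps"])
```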
// 05 - failure modes

The cost of
ignoring the rules.

While robots.txt is not a firewall, security teams use it to configure their firewalls. Here is what happens when scrapers ignore the directives, ranked by frequency across our incident logs.

INCIDENTS ANALYZED: 12,400+
PRIMARY CAUSE: Ignored Crawl-delay
UPDATED: 2026-05-19
01 IP / Subnet Blacklisting
Network layer · WAFs automatically ban IPs hitting disallowed paths.

02 Honeypot Traps
Anti-bot · Invisible links placed in disallowed directories.

03 ToS Violation Claims
Legal · Used as evidence of bad faith in legal disputes.

04 Tarpitting
Performance · Server intentionally holds connections open indefinitely.

05 Wasted Compute
Infrastructure · Crawling infinite calendar or search result loops.
// 06 - our architecture

Compliant by default,

aggressive only when explicitly configured.

DataFlirt's ingestion engine fetches and caches the robots.txt file before a single worker is spawned. We map the Disallow paths directly into our routing layer, ensuring no request is ever dispatched to a forbidden endpoint. We use the Crawl-delay to dynamically throttle our distributed proxy pool. Sustainable access is always cheaper than burning IPs and fighting WAFs.

robots-parser.log

Live output from the DataFlirt pre-flight compliance check.

target.domain        global-retailer.com
robots.status        200 OK
rules.parsed         42 directives
crawl_delay          5 seconds
sitemap.auto_queue   enabled
honeypot.avoidance   active
scheduler.state      throttled to 0.2 RPS

// 07 - FAQ

Common
questions.

Common questions about robots.txt compliance, legal implications, and how to handle restrictive directives at scale.

Is it illegal to scrape a site if robots.txt disallows it?
In most jurisdictions, robots.txt is a technical convention, not a binding legal contract. However, a scraper's disregard for robots.txt can be cited as evidence of bad faith or unauthorized access in Terms of Service disputes. Operationally, ignoring it is a fast track to getting your IP blocked.
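What happens if a site doesn't have a robots.txt file?
If the server returns a 404 Not Found for the robots.txt path, standard crawler behavior is to assume that all paths are allowed and that there is no explicit crawl delay. RFC 9309 treats any 4xx response this way, whereas a 5xx response means the crawler must assume everything is disallowed until the file can be fetched again. Even with no robots.txt, you must still respect the server's capacity and monitor for HTTP 429 Too Many Requests responses.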
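A small sketch of how a crawler might turn the robots.txt response status into a policy, following RFC 9309's handling of unavailable (4xx) and unreachable (5xx) files. The RobotsPolicy enum and policy_for_status function are hypothetical names, and redirects are assumed to have been followed before the status is inspected.

```python
from enum import Enum


class RobotsPolicy(Enum):
    PARSE = "parse the body and apply its rules"
    ALLOW_ALL = "no restrictions published"
    DISALLOW_ALL = "treat every path as disallowed"


def policy_for_status(status: int) -> RobotsPolicy:
    """Map the final HTTP status of the robots.txt fetch to a crawl policy."""
    if 200 <= status < 300:
        return RobotsPolicy.PARSE
    if 400 <= status < 500:
        # Unavailable, e.g. 404: the crawler may access any path.
        return RobotsPolicy.ALLOW_ALL
    if 500 <= status < 600:
        # Unreachable: assume complete disallow until a later fetch succeeds.
        return RobotsPolicy.DISALLOW_ALL
    raise ValueError(f"unexpected status {status}; follow redirects first")


for code in (200, 404, 503):
    print(code, "->", policy_for_status(code).name)
```

Some libraries are stricter than this mapping. Python's urllib.robotparser, for example, treats 401 and 403 as disallow-all, which is a conservative choice worth keeping if a site is signaling that access is forbidden.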
How does DataFlirt handle sites with a Crawl-delay of 10 or more?
A Crawl-delay of 10 seconds restricts a single IP to 8,640 requests per day. For large catalogs, we distribute the crawl across a massive residential proxy pool. The delay is respected per-IP, allowing us to achieve high aggregate throughput while remaining strictly compliant with the target's requested per-client rate.
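The arithmetic behind that figure, and a rough estimate of pool size under the per-IP interpretation described above, can be sketched as follows. The function names, the one-million-URL catalog, and the assumption that every IP runs at exactly the permitted rate all day are illustrative.

```python
import math

SECONDS_PER_DAY = 86_400


def requests_per_day(crawl_delay_s: float) -> int:
    """Daily request budget for one IP that waits crawl_delay_s between requests."""
    return int(SECONDS_PER_DAY // crawl_delay_s)


def ips_needed(total_urls: int, crawl_delay_s: float, days: int = 1) -> int:
    """IPs required to cover total_urls within the given number of days."""
    budget_per_ip = requests_per_day(crawl_delay_s) * days
    return math.ceil(total_urls / budget_per_ip)


print("requests per IP per day:", requests_per_day(10))
print("IPs for 1M URLs in 1 day:", ips_needed(1_000_000, 10))
print("IPs for 1M URLs in 7 days:", ips_needed(1_000_000, 10, days=7))
```

Stretching the crawl window is usually the cheaper lever: the same catalog spread over a week needs roughly a seventh of the IPs.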
Can a site serve different robots.txt files based on User-Agent?
Yes. Many sites use dynamic robots.txt files that serve restrictive rules to known scraping tools (like Python-urllib or Scrapy) while serving permissive rules to Googlebot. This is why DataFlirt parses the file using the exact User-Agent and TLS fingerprint that will be used for the actual crawl.
What is a honeypot in the context of robots.txt?
A honeypot is a hidden link on a webpage that points to a directory explicitly disallowed in robots.txt. Human users never see or click the link. If a bot ignores robots.txt and follows the link, the server immediately flags the IP as a malicious scraper and permanently bans it.
Do I need to check robots.txt if I am only using an API?
Yes. robots.txt rules apply to every URL on the host, and many modern web applications disallow API paths in their root robots.txt file. Rate limits for an API are usually published in its documentation or signaled through HTTP 429 responses rather than in robots.txt. Even if you are bypassing the HTML frontend and hitting JSON endpoints directly, checking the exclusion rules prevents accidental WAF triggers.