What is robots.txt?
robots.txt is a plain text file served from the root of a site (for example, https://example.com/robots.txt) that tells web crawlers which paths they may access and, through the non-standard Crawl-delay directive, how quickly they should request them. The file is advisory: it enforces nothing on its own and is not strictly legally binding in most jurisdictions, but ignoring its directives is one of the fastest ways to trigger network-layer bans and burn your proxy pool. For production data pipelines, it serves as the baseline contract for sustainable, high-volume extraction.
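
To make the access and rate rules concrete, here is a minimal sketch of how a crawler might consult them using Python's standard-library urllib.robotparser. The rules, the user agent name "example-bot", and the example.com URLs are all made up for illustration, and the rules are parsed from an inline string so the snippet runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, inlined so the example needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "example-bot"  # hypothetical crawler name

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# Path rules: anything under /private/ is off limits for every user agent.
allowed_public = parser.can_fetch(USER_AGENT, "https://example.com/public/page")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/data")

# Rate rule: Crawl-delay in seconds, or None when no such directive applies.
delay_seconds = parser.crawl_delay(USER_AGENT)

print(allowed_public, allowed_private, delay_seconds)
```

With these sample rules the script prints True False 5. A live crawler would more likely point the parser at the real file with set_url() and read(), then check can_fetch() before each request and sleep for the crawl delay between requests when one is set.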