What is Deduplication Queue?
A deduplication queue is the stateful gatekeeper of a web crawler, responsible for ensuring that a specific URL or payload is only processed once per crawl cycle. It sits between the link discovery phase and the fetch execution phase. Without a robust deduplication layer, crawlers fall into infinite pagination loops, waste proxy bandwidth on redundant header links, and inadvertently launch denial-of-service attacks against target servers.