
What is Sitemap Crawling?

Sitemap crawling is the process of discovering target URLs by parsing a website's XML sitemaps rather than recursively following HTML links. It relies on the publisher voluntarily declaring their canonical pages, last-modified timestamps, and structural hierarchy. For large-scale data pipelines, it is the most efficient path to full catalog coverage. Relying on sitemaps instead of DOM traversal drastically reduces compute costs, but trusting a stale sitemap guarantees your pipeline will miss fresh data.

URL Discovery · XML Parsing · Crawl Budget · Incremental Updates · Data Freshness
// 02 — definitions

The publisher's map.

Why reverse-engineering site navigation is a waste of compute when the target has already published a machine-readable index of every canonical URL.


TL;DR

Sitemap crawling shifts the burden of URL discovery from the scraper to the target server. Instead of rendering JavaScript and clicking pagination buttons to find products, you download compressed XML files that together list millions of URLs and their last-modified dates. It is the foundation of incremental scraping, because it lets pipelines fetch only the pages that have actually changed.

01 Definition & structure
A sitemap is an XML file that lists a website's URLs along with optional metadata about each URL (when it was last updated, how often it changes, and its relative importance). Because the protocol caps each file at 50 MB uncompressed and 50,000 URLs, large sites use a sitemap index: an XML file that points to dozens or thousands of compressed sub-sitemaps. Sitemap crawling is the act of downloading and parsing these files to populate a scraper's URL queue.
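The two file types can be told apart by their root element. The following is a minimal sketch using Python's standard library; the sample document, the example.com URLs, and the function name `classify_sitemap` are illustrative assumptions, not taken from any real site.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sitemap index, used only for illustration.
SAMPLE_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/products_01.xml.gz</loc>
    <lastmod>2026-05-18</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/products_02.xml.gz</loc>
  </sitemap>
</sitemapindex>"""


def classify_sitemap(xml_text):
    """Return ("index", sub-sitemap URLs) or ("urlset", page URLs)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    if root.tag == NS + "sitemapindex":
        kind, entry_tag = "index", NS + "sitemap"
    elif root.tag == NS + "urlset":
        kind, entry_tag = "urlset", NS + "url"
    else:
        raise ValueError(f"not a sitemap document: {root.tag}")
    # Every entry carries its address in a <loc> child.
    locs = [entry.findtext(NS + "loc", default="").strip()
            for entry in root.iter(entry_tag)]
    return kind, [loc for loc in locs if loc]
```

An "index" result means each returned URL is itself a sitemap to download; a "urlset" result means the URLs are pages to queue.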
02 How it works in practice
A discovery worker requests the target's /robots.txt to locate the sitemap index. It fetches the index, extracts the URLs of the sub-sitemaps, and downloads them concurrently. Using a streaming XML parser, it extracts the <loc> and <lastmod> tags. If the pipeline is running incrementally, it compares the <lastmod> date against the last known scrape time, discarding unchanged URLs and pushing the rest to a message queue for the fetch workers.
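The first and last steps of that flow (finding the sitemap in robots.txt, then filtering on <lastmod>) can be sketched without any network code. This is a simplified illustration, not a production worker: the function names are hypothetical, naive timestamps are assumed to be UTC, and URLs without a usable <lastmod> are kept rather than dropped, which is a design assumption made here to avoid missing pages.

```python
from datetime import datetime, timezone


def sitemap_urls_from_robots(robots_txt):
    """Collect the value of every `Sitemap:` line (case-insensitive key)."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def parse_lastmod(value):
    """Parse a W3C-datetime <lastmod>; return None if absent or unparseable."""
    if not value:
        return None
    text = value.strip()
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"  # older Python versions reject a bare "Z"
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError:
        return None
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return parsed


def select_delta(entries, last_scrape):
    """Keep URLs modified after last_scrape, plus URLs with no usable date."""
    delta = []
    for loc, lastmod in entries:
        modified = parse_lastmod(lastmod)
        if modified is None or modified > last_scrape:
            delta.append(loc)
    return delta
```

In a real pipeline the URLs returned by `select_delta` would be pushed to the message queue that feeds the fetch workers.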
03 The lastmod trust issue
The biggest trap in sitemap crawling is blindly trusting the <lastmod> tag. Many e-commerce platforms update this timestamp when a product description changes, but fail to update it when the price or inventory level changes via a separate database table. If your pipeline relies on <lastmod> to trigger a re-scrape, you will silently miss pricing updates.
04 How DataFlirt handles it
We treat sitemaps as hints, not absolute truth. Our discovery engine ingests sitemaps to guarantee broad catalog coverage, but we continuously profile the accuracy of the target's XML generation. If we detect that a site updates prices without bumping the sitemap timestamp, we automatically shift that target to a hybrid model: using the sitemap for discovering new URLs, but relying on time-to-live (TTL) expiration to force re-scrapes of existing URLs.
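The hybrid decision can be pictured as a small predicate. The sketch below is one possible reading of the model described above, not DataFlirt's actual implementation; the function name, the `lastmod_trusted` flag, and the 24-hour default TTL are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def should_refetch(lastmod, last_scraped, now, lastmod_trusted,
                   ttl=timedelta(hours=24)):
    """Decide whether a URL listed in the sitemap needs to be fetched again."""
    if last_scraped is None:
        return True  # never seen before: the sitemap just discovered it
    if lastmod_trusted and lastmod is not None:
        return lastmod > last_scraped  # trusted target: believe the timestamp
    # Untrusted (or missing) lastmod: re-scrape once the TTL has expired.
    return now - last_scraped >= ttl
```

With `lastmod_trusted=False`, a product whose price changed without a timestamp bump is still re-fetched once its TTL lapses.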
05 Did you know?
You can often find hidden or unlinked API endpoints by looking at a site's sitemap index. Many modern single-page applications (SPAs) generate JSON sitemaps or expose the backend API routes used to build the XML sitemap. Parsing these underlying APIs is often cleaner than parsing the XML itself.
// 03 — discovery math

How efficient is sitemap discovery?

Sitemap crawling is orders of magnitude cheaper than recursive DOM crawling. DataFlirt's scheduler calculates the cost-benefit of trusting the sitemap versus forcing a deep crawl.

Discovery Efficiency: E = URLs_found / requests_made
Sitemap E ≈ 50,000. Recursive E < 100. Sitemaps win on network overhead. (Source: standard crawl metrics)
Freshness Delta: ΔT = actual_update_time - sitemap_lastmod
High ΔT means the sitemap is stale and cannot be trusted for incremental runs. (Source: DataFlirt trust scoring)
Sitemap Size Limit: max_bytes = 52,428,800
50 MB uncompressed per file. Larger files must be split into multiple sitemaps listed in a sitemap index. (Source: Sitemaps.org protocol)
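As a worked example of the first two formulas, the sketch below plugs in the figures from the ingestion trace in this entry (2,412,050 URLs from one index request plus 49 sub-sitemap requests). The recursive-crawl numbers and the two timestamps are hypothetical values chosen only to show the arithmetic.

```python
from datetime import datetime, timezone


def discovery_efficiency(urls_found, requests_made):
    """E = URLs_found / requests_made."""
    return urls_found / requests_made


def freshness_delta(actual_update_time, sitemap_lastmod):
    """Delta T = actual_update_time - sitemap_lastmod, as a timedelta."""
    return actual_update_time - sitemap_lastmod


# Sitemap run: 1 index request + 49 sub-sitemap requests.
sitemap_e = discovery_efficiency(2_412_050, 1 + 49)

# Hypothetical recursive crawl: 1,000 page fetches yielding 20,000 new URLs.
recursive_e = discovery_efficiency(20_000, 1_000)

# Hypothetical product: price changed 30 hours after the sitemap's lastmod.
delta_t = freshness_delta(
    datetime(2026, 5, 19, 6, 0, tzinfo=timezone.utc),
    datetime(2026, 5, 18, 0, 0, tzinfo=timezone.utc),
)

print(sitemap_e, recursive_e, delta_t)
```

Real sitemap files are often only partly full, so the observed E usually sits somewhat below the 50,000 ceiling.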
// 04 — sitemap ingestion trace

Parsing 2.4M URLs in 12 seconds.

A live trace of DataFlirt's discovery worker ingesting a compressed sitemap index from a major US retailer, validating dates, and queuing the delta.

XML stream · gzip · delta queue
edge.dataflirt.io — live
CAPTURED
// fetch index
GET /sitemap_index.xml.gz 200 OK
unzip: 49 sub-sitemaps found

// stream parsing sub-sitemaps
parsing: /products_01.xml.gz ... 50,000 URLs
parsing: /products_02.xml.gz ... 50,000 URLs
parsing: /products_03.xml.gz ... malformed XML // recovering via regex

// delta calculation
total_urls_discovered: 2,412,050
filter: <lastmod> > 2026-05-18T00:00:00Z
stale_urls_dropped: 2,398,100
urls_to_fetch: 13,950

// queueing
pushing_to: redis_queue:target_retailer_delta
status: READY
// 05 — failure modes

Why sitemaps lie to you.

Publishers generate sitemaps for Googlebot, not for you. When SEO plugins misconfigure the XML generation, your pipeline inherits the errors. These are the most common sitemap defects we monitor.

SITEMAPS MONITORED: 14,200+
DEFECT RATE: 18.4%
UPDATED: 2026-05-19
01 Stale <lastmod> dates
trust failure · Page updated but XML date unchanged

02 Orphaned URLs
wasted compute · Sitemap links to 404/deleted pages

03 Missing canonical pages
data loss · New products missing from the index

04 Invalid XML encoding
parse error · Unescaped ampersands breaking parsers

05 Exceeding 50MB/50k limits
truncation · File cuts off mid-node
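Several of these defects can be caught with a cheap structural audit before any URL reaches the queue. The sketch below checks a single uncompressed sitemap file against the protocol limits and for well-formedness; the function name and the defect labels are hypothetical, and it does not attempt to detect stale dates or orphaned URLs, which require fetching pages.

```python
import xml.etree.ElementTree as ET

MAX_BYTES = 52_428_800  # 50 MB uncompressed, per the sitemaps.org protocol
MAX_URLS = 50_000
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def audit_sitemap(xml_bytes):
    """Return a list of defect labels for one uncompressed sitemap file."""
    defects = []
    if len(xml_bytes) > MAX_BYTES:
        defects.append("oversized_file")
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError:
        # Raised for unescaped ampersands and for files cut off mid-node.
        defects.append("malformed_xml")
        return defects
    if len(root.findall(NS + "url")) > MAX_URLS:
        defects.append("too_many_urls")
    return defects
```

A file flagged "malformed_xml" might still be salvaged with a lenient fallback, as the regex recovery in the trace above suggests, but it should lower confidence in that target's sitemap.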
// 06 — discovery architecture

Trust the map, but verify the terrain.

DataFlirt uses sitemaps as the primary discovery mechanism for 80% of our e-commerce and news pipelines. But because publishers frequently cache sitemaps for days or fail to update <lastmod> tags, we run a hybrid model. We ingest the sitemap for baseline coverage, but simultaneously run a shallow recursive crawl on high-velocity category pages. If the recursive crawler finds a new product that isn't in the sitemap, we dynamically downgrade the sitemap's trust score and expand the recursive crawl budget.
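The feedback loop between the two crawlers could look roughly like the sketch below. It only illustrates the idea of penalising a sitemap for URLs it failed to list; the function name, the linear penalty of 0.05 per missing URL, and the floor of 0.0 are invented for this example and are not DataFlirt's actual scoring.

```python
def update_trust(trust_score, crawled_urls, sitemap_urls,
                 penalty=0.05, floor=0.0):
    """Lower the trust score for each crawled URL absent from the sitemap.

    Returns the new score and the sorted list of URLs the sitemap missed.
    """
    missing = sorted(set(crawled_urls) - set(sitemap_urls))
    new_score = max(floor, trust_score - penalty * len(missing))
    return new_score, missing


# Hypothetical run: the shallow crawl found two products the sitemap lacks.
score, missed = update_trust(
    0.94,
    crawled_urls=["/p/1", "/p/2", "/p/3"],
    sitemap_urls=["/p/1"],
)
```

A scheduler could then widen the recursive crawl budget whenever the score drops below a chosen threshold.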

sitemap-ingest-worker

Live state of a sitemap discovery job on a high-volume pipeline.

target.domain example-retail.com
discovery.mode hybrid (sitemap + shallow)
sitemap.trust_score 0.94 (reliable)
index.size 1.2 GB uncompressed
urls.extracted 4,102,881
urls.delta 14,200 new/modified
pipeline.status queue populated


// 07 — FAQ

Common questions.

About sitemap protocols, incremental scraping, handling massive XML files, and how DataFlirt ensures discovery completeness.

What is the difference between sitemap crawling and recursive crawling?
Recursive crawling starts at a homepage, extracts all <a href> links, and follows them to build a map of the site. Sitemap crawling downloads a pre-compiled XML list of URLs provided by the site owner. Sitemaps are vastly faster and cheaper, but recursive crawling is necessary when a site lacks a sitemap or fails to keep it updated.
What if the target website doesn't have a sitemap?
You fall back to recursive crawling or API interception. First, check /robots.txt, where the sitemap location is usually declared. If the sitemap is truly absent, you must traverse category and pagination links. This can increase the request volume needed just to discover URLs by a factor of 100 or more.
Can I trust the <lastmod> date for incremental scraping?
Only after verifying it. Many CMS platforms update the <lastmod> tag only when the core article text changes, ignoring price updates, stock changes, or new comments. If your pipeline relies on capturing price changes, a stale <lastmod> will cause you to miss data. We profile each target's sitemap accuracy before relying on it for delta updates.
How does DataFlirt handle massive 100GB sitemap indexes?
We never load them into memory. We use streaming XML parsers (like Python's lxml.etree.iterparse or Go's encoding/xml.Decoder) to process the gzip stream on the fly. URLs are evaluated against the known-state database, and only the deltas are pushed to the fetch queue. A 100GB index can be processed in minutes with minimal RAM.
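The streaming pattern can be sketched with the standard library alone. This example swaps in `xml.etree.ElementTree.iterparse` for the lxml parser named above so that it has no third-party dependency; the function name `stream_sitemap` is hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def stream_sitemap(fileobj):
    """Yield (loc, lastmod) pairs from a gzip-compressed sitemap stream."""
    with gzip.GzipFile(fileobj=fileobj) as xml_stream:
        context = ET.iterparse(xml_stream, events=("start", "end"))
        _event, root = next(context)  # first event is the root's start tag
        for event, elem in context:
            if event == "end" and elem.tag == NS + "url":
                loc = elem.findtext(NS + "loc", default="").strip()
                lastmod = elem.findtext(NS + "lastmod")
                if loc:
                    yield loc, lastmod
                # Detach processed <url> nodes so memory use stays flat.
                root.clear()
```

Because the function is a generator over any binary file-like object, the same code works on an HTTP response body as on a local file, and each yielded pair can be compared against the known-state database immediately.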
Are sitemaps considered public data?
Yes. By definition, a sitemap is published specifically to invite automated agents to discover and index the listed URLs. Accessing a sitemap is the most standard, expected automated behavior on the web. However, you must still respect the crawl rates and directives specified in the accompanying robots.txt file.
Why do I get 404s from URLs listed in the sitemap?
Sitemap generation is often decoupled from the actual database state. A product might be deleted or deactivated, but the sitemap cache isn't purged until a nightly cron job runs. Your scraper must gracefully handle 404s during the fetch phase and treat them as a signal to soft-delete the record in your downstream dataset.
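A soft delete keeps the record but marks when it disappeared. The sketch below uses a plain dictionary as a stand-in for the downstream dataset; the function name and the field names `deleted_at` and `last_seen_at` are hypothetical, and treating 410 Gone like 404 is an assumption added here.

```python
def apply_fetch_result(records, url, status_code, fetched_at):
    """Update the stored record for `url` based on the fetch status code."""
    record = records.setdefault(url, {"deleted_at": None, "last_seen_at": None})
    if status_code in (404, 410):  # 410 handled like 404 (assumption)
        # Soft delete: keep the row, remember when it first went missing.
        if record["deleted_at"] is None:
            record["deleted_at"] = fetched_at
    elif 200 <= status_code < 300:
        # Page is live (again): clear any earlier soft delete.
        record["deleted_at"] = None
        record["last_seen_at"] = fetched_at
    return record
```

Other status codes, such as 429 or 5xx, leave the record untouched here so that a transient failure is never mistaken for a deletion.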