
What is Seed URL?

Seed URLs are the bootstrap entry points loaded into a crawler's URL frontier before any link extraction has run — the first coordinates given to a system that has no prior knowledge of a site's structure. Pick seeds carelessly and you'll spend 30% of your crawl budget on pagination and boilerplate before you reach a single product page; pick them precisely and you can skip entire site sections that contain nothing of value.

Crawler bootstrap · URL frontier · Crawl scope · Infrastructure · Link graph
// 02 — definitions

Where the crawl begins.

Seed URLs are not just starting points — they're a crawl strategy. The set of seeds you choose defines the shape, depth, and efficiency of everything the crawler discovers afterward.


TL;DR

A seed URL is any URL manually provided to initialize the crawl frontier. Seeds determine which corners of a site the crawler reaches first and, for focused crawls with a depth cap, which corners it reaches at all. For large e-commerce targets, seeds are rarely just the homepage — they're typically sitemap index files, category root URLs, and known API endpoints that expose structured product data directly.

01 Definition & structure
A seed URL is any URL explicitly provided to the crawler before it has fetched anything. Seeds are:
  • Static — maintained in config, not discovered at runtime
  • Scoped — typically chosen to cover distinct sections of a target site with minimal overlap
  • Versioned — should be treated as infrastructure: committed, reviewed, and updated when site structure changes
Once the crawler starts fetching, seeds are indistinguishable from other frontier entries; the distinction exists only at initialization time. What makes seeds important is that they determine reachability: a page is discovered only if at least one seed has a link path to it within the depth cap. If no seed does, the crawler never sees that page.
02 How it works in practice
At crawl initialization, seeds are loaded directly into the URL frontier — bypassing the deduplication and priority-scoring stages that incoming discovered URLs go through. They're the first entries in every host's back-queue. Workers start fetching seeds immediately, and the link extractor begins populating the frontier from the first response. On a well-seeded crawl, the frontier grows faster than workers can drain it within the first few minutes — seeds are the spark, not the fuel.
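The seed-loading step can be sketched with a toy frontier. This is a minimal illustration, not our production frontier: the `Frontier` class, its method names, and the FIFO per-host queues are all assumptions made for the example, and priority scoring is left out entirely.

```python
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    """Toy URL frontier (hypothetical): one FIFO back-queue per host."""

    def __init__(self, max_depth):
        self.max_depth = max_depth
        self.queues = {}   # host -> deque of (url, depth)
        self.seen = set()  # URLs already enqueued

    def load_seeds(self, seeds):
        # Seeds enter at depth 0 and skip the dedup check applied to discovered URLs.
        for url in seeds:
            self._enqueue(url, 0)

    def add_discovered(self, url, parent_depth):
        # Discovered links are deduplicated and dropped past the depth cap.
        depth = parent_depth + 1
        if url in self.seen or depth > self.max_depth:
            return False
        self._enqueue(url, depth)
        return True

    def _enqueue(self, url, depth):
        self.seen.add(url)
        host = urlsplit(url).netloc
        self.queues.setdefault(host, deque()).append((url, depth))

    def size(self):
        return sum(len(q) for q in self.queues.values())
```

Because seeds are enqueued before any discovered URL, they sit at the front of each host's queue, which matches the "first entries in every host's back-queue" behavior described above.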
03 Sitemap seeds vs. discovery seeds
There are two philosophically different seed strategies:
  • Sitemap seeding — load the sitemap XML and extract all listed URLs as seeds. Produces a known, bounded frontier immediately. Misses pages the site owner chose not to list. Best for product detail pages.
  • Discovery seeding — seed a small number of structural entry points (homepage, category roots) and let link extraction do the work. Broader coverage but slower to converge and more likely to hit crawl traps. Best for news sites or blogs where content is not enumerated in a sitemap.
Most production e-commerce crawls combine both: sitemap seeds for known products, discovery seeds for new category pages the sitemap doesn't yet list.
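Sitemap seeding comes down to pulling `<loc>` values out of the sitemap XML. The sketch below uses only the Python standard library and assumes the XML has already been fetched; the function name `parse_sitemap` is ours for illustration. It distinguishes a sitemap index (whose entries point at further sitemaps) from a URL set (whose entries are the pages themselves).

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return (kind, locs); kind is 'index' for a sitemap index, else 'urlset'."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag.endswith("sitemapindex") else "urlset"
    # Only <loc> elements in the sitemap namespace are collected.
    locs = [el.text.strip() for el in root.findall(".//sm:loc", SITEMAP_NS) if el.text]
    return kind, locs
```

When `kind` is `"index"`, each returned URL is itself a sitemap to fetch and parse; when it is `"urlset"`, the URLs can go straight into the seed set.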
04 How DataFlirt handles it
Every new target goes through a seed audit before the first crawl run. We check for sitemap availability, probe known API endpoints, map category taxonomy, and verify robots.txt scope. Seed configs are stored in YAML, version-controlled in Git, and linked to each pipeline config. We run a seed health check at the start of every scheduled crawl — any seed returning non-200 for two consecutive runs triggers a Slack alert and pauses that seed pending review. This prevents stale seeds from silently degrading crawl coverage.
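The consecutive-failure rule can be expressed in a few lines. This is a sketch of the logic only, not our alerting code: the function name, the `failures` dictionary, and the `threshold` parameter are assumptions, and the Slack notification is left to the caller.

```python
def update_seed_health(failures, seed, status_code, threshold=2):
    """Record one run's result for a seed.

    `failures` maps seed URL -> count of consecutive non-200 runs.
    Returns True when the seed has reached the threshold and should be paused.
    """
    if status_code == 200:
        failures[seed] = 0  # any healthy run resets the streak
        return False
    failures[seed] = failures.get(seed, 0) + 1
    return failures[seed] >= threshold
```

Resetting the counter on any 200 means a single transient error never pauses a seed; only an unbroken streak does.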
05 The homepage seed mistake
The most common seed mistake in scraping projects is using only the homepage. It feels intuitive — the homepage links to everything, right? In practice, large retail sites have product pages 5–8 hops from the homepage. With a crawl depth cap of 4 (which most pipelines need to avoid crawl traps), homepage-only seeding will miss entire catalog sections. We've seen clients with homepage-only seeds achieve 30–40% product coverage on the same targets where sitemap seeds yield 90%+ — on identical infrastructure.
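The depth-cap effect is easy to reproduce on a toy link graph. The sketch below is illustrative only: the graph is invented, and `reachable` is a plain breadth-first search in which depth counts hops from the nearest seed.

```python
from collections import deque


def reachable(graph, seeds, max_depth):
    """Return the set of pages within max_depth hops of any seed."""
    depth = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # at the cap: do not follow this page's links
        for nxt in graph.get(page, []):
            if nxt not in depth:
                depth[nxt] = depth[page] + 1
                queue.append(nxt)
    return set(depth)


# Hypothetical site: the product sits 5 hops below the homepage.
site = {
    "home": ["category"],
    "category": ["subcategory"],
    "subcategory": ["listing"],
    "listing": ["listing-page-2"],
    "listing-page-2": ["product"],
}
homepage_only = reachable(site, ["home"], max_depth=4)
with_category_seed = reachable(site, ["home", "category"], max_depth=4)
```

With the same depth cap of 4, `homepage_only` stops one hop short of the product, while adding the category root as a seed brings it into range.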
// 03 — the math

How seeds shape crawl coverage.

Seed selection is a coverage optimization problem. These formulas show how seed count, crawl depth, and branching factor combine to determine what fraction of a site you actually reach.

Pages reachable at depth D: $N(D) = S \cdot B^{D}$
S = seed count, B = avg links per page (branching factor), D = max crawl depth. Branching dominates quickly. (Source: standard graph theory)
Coverage gap from seed placement: $G = 1 - R_{\text{seeds}} / R_{\text{total}}$
R_seeds = pages reachable from seeds at depth D; R_total = all pages in the site. G = 0 means full coverage. (Source: focused crawl literature)
Seed efficiency score: $E = \text{unique\_targets\_reached} / (S \cdot B^{D})$
E close to 1.0 means seeds were well-placed. Low E means overlap: seeds covering the same subgraph. (Source: internal DataFlirt metric)
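A quick worked example of the three formulas, with invented numbers. The function names are ours, and the inputs (5 seeds, branching factor 10, depth 4, and the page counts) are assumptions chosen only to make the arithmetic easy to follow. Note that S · B^D counts link paths, so it overstates unique pages whenever different pages link to the same target.

```python
def reach_upper_bound(seed_count, branching, depth):
    """N(D) = S * B^D: link paths of length D starting from the seeds."""
    return seed_count * branching ** depth


def coverage_gap(reached, total):
    """G = 1 - R_seeds / R_total: fraction of the site the crawl misses."""
    return 1 - reached / total


def seed_efficiency(unique_targets, seed_count, branching, depth):
    """E = unique targets reached / (S * B^D)."""
    return unique_targets / reach_upper_bound(seed_count, branching, depth)


n = reach_upper_bound(5, 10, 4)           # 5 * 10**4 paths
g = coverage_gap(300_000, 400_000)        # a quarter of the site missed
e = seed_efficiency(25_000, 5, 10, 4)     # half the bound yields unique targets
```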
// 04 — seed configuration

Bootstrapping a product catalog crawl.

A production seed configuration for a mid-size Indian e-commerce target. The homepage is not the first seed — the sitemap index is.

sitemap.xml · category roots · depth cap: 4
// seed set — loaded at crawler init
seeds: [
  "https://shop.example.in/sitemap_index.xml",    // tier-1: structured discovery
  "https://shop.example.in/sitemap_products.xml", // direct product URL dump
  "https://shop.example.in/categories/",          // category tree root
  "https://shop.example.in/brands/",              // brand index
  "https://shop.example.in/new-arrivals/",        // freshness anchor
]

// crawl scope config
max_depth: 4 // from any seed, not from homepage
allowed_patterns: ["/product/", "/p/", "/item/"]
blocked_patterns: ["/account/", "/cart/", "/checkout/"]

// frontier state after seed load
frontier.size: 5 // seeds only — discovery hasn't run yet
estimated_reach: ~420,000 product URLs at depth ≤ 4
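How the scope patterns might be applied to a candidate URL is sketched below. This is one plausible reading of the config, not a statement of how any particular crawler interprets it: we assume substring matching on the URL path, that blocked patterns take precedence, and that `allowed_patterns` marks target pages rather than every page the crawler may traverse. The function name `in_scope` is hypothetical.

```python
from urllib.parse import urlsplit

ALLOWED = ("/product/", "/p/", "/item/")
BLOCKED = ("/account/", "/cart/", "/checkout/")


def in_scope(url, allowed=ALLOWED, blocked=BLOCKED):
    """True if the URL path matches an allowed pattern and no blocked one."""
    path = urlsplit(url).path
    if any(pattern in path for pattern in blocked):
        return False  # blocked patterns win over allowed ones
    return any(pattern in path for pattern in allowed)
```

Matching on the parsed path rather than the raw URL keeps query strings and hostnames from triggering a pattern by accident.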
// 05 — seed strategies

Five ways to seed a site crawl.

The right seed strategy depends on what you're crawling, what the site exposes, and how quickly you need coverage. For recurring pipelines, you revisit seed selection every time the target's site structure changes — which for large retail sites is 2–4 times per year.

Typical seed count: 5–500 URLs
Sitemap coverage: 60–95% of pages
Depth from homepage: 3–8 hops typical
01 XML sitemap index (best coverage): If available, enumerates nearly all target URLs directly
02 Category / taxonomy roots (structured): Covers product tree systematically; misses one-off pages
03 Homepage only (common mistake): Relies on crawl reaching everything via link discovery
04 Search API endpoints (API-first): Pagination over search results yields structured product data
05 Previously scraped URL set (delta crawl): Re-seeds from last run's known-good URLs; fast but misses new pages
// 06 — our seed pipeline

Sitemaps first, link discovery fills the gaps.

For every new target, we run a site audit before setting seeds: check for sitemap_index.xml, probe known API endpoints, and map the category taxonomy manually. Seeds are committed to a config file and version-controlled — so when a site restructures, we have a diff of what changed and can update seeds without re-auditing from scratch. Seed configs for our recurring pipelines are reviewed quarterly or on schema drift alerts, whichever comes first.

Seed audit — new target onboarding

Checklist output from our automated site audit for a new Indian retail target.

Check              Finding                                Status
sitemap_index      found at /sitemap_index.xml            use as seed
product_sitemap    412,000 URLs listed                    added
robots.txt         Crawl-delay: 3 · no disallow on /p/    compliant
category_roots     34 found via nav parse
js_rendered_nav    React SPA, needs Playwright for nav    flagged
api_endpoints      GraphQL at /api/graphql                probing
seed_count_final   48 seeds                               approved


// 07 — FAQ

Common questions.

About seed URL selection, sitemap parsing, crawl depth, and why the starting URLs matter more than most teams realize.

Can't I just use the homepage as my only seed?
You can, but you'll pay for it in crawl time and coverage gaps. Most large e-commerce sites have 5–8 hops between the homepage and a product detail page. At a 3-second crawl delay per hop, walking one such path serially costs 15–24 seconds in delay alone, and that cost recurs for every branch of the catalog. Seeding directly from sitemap URLs or category roots reduces effective depth to 1–2 hops and can cut crawl time by 60–80% on the same site.
How do I find good seeds for a site I've never crawled before?
Start with /sitemap.xml and /sitemap_index.xml — most sites have them and they're the highest-signal seed source. Then check /robots.txt for Sitemap: directives pointing to additional sitemaps. After that: inspect the main nav for category roots, check for known API patterns (/api/products, /graphql), and look at what Googlebot is hitting via Google Search Console if you have access.
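Pulling `Sitemap:` directives out of robots.txt is a short text-parsing job. The sketch below assumes the robots.txt body has already been fetched as a string; the function name `sitemap_directives` is ours. Python's standard library also offers `urllib.robotparser.RobotFileParser.site_maps()` (Python 3.8+) if you would rather not parse by hand.

```python
def sitemap_directives(robots_txt):
    """Return the URLs listed in 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls
```

The field name is matched case-insensitively, since sites write it as both `Sitemap:` and `sitemap:`.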
What's the difference between seeds and the URL frontier?
Seeds are the manual input that bootstraps the frontier. Once the crawler starts fetching, every link it extracts gets added to the frontier — seeds are a vanishingly small fraction of the total frontier within the first few minutes. The frontier is the live, growing queue; seeds are the static initialization file you maintain per target.
How do seeds affect crawl depth calculations?
Depth is measured from the seed, not from the homepage. If you seed directly from a product sitemap, every product URL is at depth 0 — no hops required. This is why sitemap seeding is so efficient: it collapses the discovery graph entirely for pages the sitemap already lists, leaving link extraction to find only the pages the sitemap missed.
What happens when a target site restructures and seeds go stale?
Stale seeds return 404s or redirect chains. The crawler should log redirect chains and 404 rates per seed — a spike there is the first signal that a site has restructured. On our pipelines, we alert when any seed URL returns non-200 for three consecutive runs, which triggers a seed review before the next scheduled crawl.
Should seeds be different for a full crawl vs. a delta/refresh crawl?
Yes. A full crawl seeds from sitemaps and category roots to maximize coverage. A delta crawl seeds from URLs known to change frequently — new arrivals pages, sale category roots, recently modified sitemap entries — and from your own last-run product URL list. Running a full-crawl seed strategy on a daily refresh pipeline wastes budget re-crawling stable pages.
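One way to pick "recently modified sitemap entries" for a delta crawl is to filter on `<lastmod>`. The sketch below is an assumption-laden illustration: it presumes the sitemap actually populates `<lastmod>` with at least a full `YYYY-MM-DD` date, and it silently skips entries that carry no date or only a partial one. The function name `recently_modified` is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def recently_modified(xml_text, since):
    """Return <loc> values whose <lastmod> date is on or after `since`."""
    fresh = []
    for url in ET.fromstring(xml_text).findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS).strip()
        # Full dates and datetimes both start with YYYY-MM-DD.
        if loc and len(lastmod) >= 10 and date.fromisoformat(lastmod[:10]) >= since:
            fresh.append(loc)
    return fresh
```

Many sites leave `<lastmod>` stale or set it to the generation time of the sitemap, so it is worth spot-checking a target before relying on this filter.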