List Crawling: Why It Matters and Why Most Engineers Get It Wrong
If you are a data engineer building production pipelines in 2026, list crawling is almost certainly your highest-volume workload. Product catalogs, job boards, business directories, financial data tables, SERP pages — the internet’s most commercially valuable data is rendered in list format. The global web scraping software market was valued at approximately USD 1.1 billion in 2024 and is projected to grow at a compound annual rate exceeding 18% through 2030. A substantial share of that economic activity is list crawling: systematic extraction from repeated, paginated, or dynamically loaded structured data.
And yet, most engineers approach list crawling as a solved problem — grab the selector, iterate pages, done. In practice, the failure modes are sophisticated. Pagination caps silently truncate catalog coverage. Infinite scroll implementations switch underlying APIs between deployments with no warning. CSS selectors drift after redesigns and fail without raising exceptions. Anti-bot systems fingerprint list crawlers specifically because their request cadence is more regular than human browsing. LLM-augmented extraction pipelines are now part of the production toolkit but require careful architecture to avoid token waste and latency spikes.
This guide is written for senior engineers and data engineers who already know how to fetch a page. We are going to go deep on architecture, edge cases, production patterns, and the specific failure modes that separate amateur list crawlers from reliable, high-throughput data pipelines.
What List Crawling Actually Means (And What It Doesn’t)
List crawling is the automated traversal and structured data extraction from web pages that present data in repeated, list-like formats. This encompasses product catalog pages, job board listings, business directory entries, search result pages, review feeds, and data tables — any page structure where the same HTML template is repeated N times per page, across M pages.
What list crawling is not: it is not a synonym for general web crawling (which traverses arbitrary link graphs), and it is not simple single-page scraping. The defining characteristics of list crawling as a distinct engineering problem are:
- A URL frontier that respects list pagination boundaries (page numbers, cursor tokens, offset parameters)
- A parser that maps repeating selectors across homogeneous DOM structures
- A deduplication layer to handle overlapping pages on dynamic sites
- A completeness guarantee strategy — ensuring full catalog coverage despite pagination caps and filter-based truncation
Understanding this framing changes your architecture decisions significantly. A general-purpose crawler that stumbles across list pages by link discovery will miss paginated coverage. A list crawler that treats all list pages as equivalent will hit pagination caps at page 50 and silently undercount.
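The deduplication layer is worth making concrete. Below is a minimal in-memory sketch; the `sku` and `url` key fields are illustrative, so substitute whatever identifier is stable on your target:

```python
import hashlib

def item_key(item: dict, key_fields: tuple[str, ...] = ("sku", "url")) -> str:
    """Derive a stable dedup key from the first non-empty key field."""
    for field in key_fields:
        value = item.get(field, "")
        if value:
            return hashlib.sha256(f"{field}:{value}".encode()).hexdigest()
    # Fallback: hash the full item contents (order-normalised)
    return hashlib.sha256(repr(sorted(item.items())).encode()).hexdigest()

class DedupLayer:
    """In-memory seen-set for overlapping list pages.
    Swap the set for Redis SADD when the crawl is distributed."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_new(self, item: dict) -> bool:
        key = item_key(item)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

Keying on a stable business identifier rather than the source page is what lets the same item appear on two overlapping pages without being counted twice.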
DataFlirt Recommended Reading: Before diving into list crawling architecture, ensure you understand the broader free web scraping tool landscape: Best Free Web Scraping Tools in 2026 for Developers
The Five Site Structures You Will Encounter in List Crawling
1. Numbered Paginated Lists
The most common structure. Data is split across pages accessible via a URL parameter (?page=2, ?offset=50, /page/2/) or a “Next” button that resolves to a predictable URL. The challenge is not the happy path — it is the edges: sites that cap visible pages at 20–100 regardless of actual catalog depth, inconsistent parameter names across site sections, and pages that return HTTP 200 with empty content rather than 404 when you exceed bounds.
Identification signal: View Page Source contains all data (no JavaScript required). Pagination controls are anchor tags with href attributes.
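A small helper makes the URL variants above concrete. This is a sketch; the parameter names are illustrative, so confirm them against your target before relying on auto-increment:

```python
from urllib.parse import urlencode

def page_url(base: str, page: int, style: str = "param", per_page: int = 25) -> str:
    """Build the URL for page N under the common numbered-pagination schemes.
    style: "param" (?page=N), "offset" (?offset=M), or "path" (/page/N/)."""
    if style == "param":
        return f"{base}?{urlencode({'page': page})}"
    if style == "offset":
        # Offset-based sites count items, not pages
        return f"{base}?{urlencode({'offset': (page - 1) * per_page})}"
    if style == "path":
        return f"{base.rstrip('/')}/page/{page}/"
    raise ValueError(f"unknown pagination style: {style}")
```

Remember the bounds edge case from above: iterate until a page returns HTTP 200 with zero items, not until a 404.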
2. Cursor and Token-Based Pagination
Increasingly common on modern e-commerce and API-backed sites. Instead of page numbers, the “next page” token is embedded in the current page response — in a <meta> tag, a JSON blob in a script element, or a data-* attribute on the pagination control. Each request must parse the cursor from the current response before issuing the next.
Identification signal: URL does not contain a predictable numeric parameter. Inspecting the response HTML reveals a token or cursor value that changes with each page.
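In practice, cursor discovery reduces to checking a small set of known carriers in priority order. The sketch below assumes a `next-cursor` meta name and a `nextCursor` JSON key, both illustrative; inspect your target's responses to find the real carrier:

```python
import json
import re

def extract_next_cursor(html: str) -> str:
    """Locate the next-page cursor in a list page response body.
    Checks a <meta name="next-cursor"> tag first, then any JSON <script>
    blob carrying a nextCursor key. Returns "" when no cursor is found."""
    m = re.search(
        r'<meta\s+name=["\']next-cursor["\']\s+content=["\']([^"\']+)["\']',
        html,
    )
    if m:
        return m.group(1)
    for blob in re.findall(
        r'<script[^>]*type=["\']application/json["\'][^>]*>(.*?)</script>',
        html,
        re.S,
    ):
        try:
            data = json.loads(blob)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict):
            cursor = data.get("nextCursor") or data.get("pagination", {}).get("cursor", "")
            if cursor:
                return cursor
    return ""  # empty cursor = last page
```

Because each cursor comes from the previous response, this pagination style is inherently sequential: parallelise across categories or facets, not across pages.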
3. Infinite Scroll Lists
Content loads as the user scrolls down. From the engineering perspective, this means the page DOM starts incomplete and extends as scroll events trigger XHR or Fetch API calls to a backend endpoint. The naive approach — headless browser scroll simulation — works but is expensive in compute and time.
Identification signal: Only partial content visible in View Page Source. Network panel in DevTools shows XHR/Fetch requests triggered during scroll.
4. Faceted Catalog Lists with Pagination Caps
The structurally hardest case. The site shows paginated lists of products or listings, but caps visibility at N pages (commonly 20–50). A category with 10,000 products returns only the first 500 (20 pages × 25 results). No error is raised — the data is simply invisible.
Identification signal: The total result count shown on the page (e.g., “1,247 results”) is vastly larger than what pagination allows you to access. The last page of pagination shows far fewer results than total count / results per page.
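This identification signal can be automated into a pre-crawl check, a sketch of which follows: parse the advertised total, compare it with what the cap allows, and quantify the gap before committing to a crawl.

```python
import math

def coverage_gap(total_results: int, per_page: int, max_visible_pages: int) -> dict:
    """Quantify what a pagination cap hides, before committing to a crawl."""
    accessible = min(total_results, per_page * max_visible_pages)
    pages_needed = math.ceil(total_results / per_page) if total_results else 0
    return {
        "accessible_items": accessible,
        "hidden_items": total_results - accessible,
        "capped": pages_needed > max_visible_pages,
    }
```

For the 10,000-product category above, `coverage_gap(10_000, 25, 20)` reports 9,500 hidden items: exactly the case where filter decomposition becomes mandatory.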
5. Static Data Tables
HTML <table> elements with headers and rows, or CSS-styled table-like grids. These require a different parsing approach from card-based lists. Multi-page tables may use server-side pagination with predictable URL patterns or client-side filtering that still relies on a data API.
Identification signal: <table> elements in View Page Source with <thead> and <tbody> structure, or a grid of <div> elements with consistent column patterns.
Virtual Environment Setup and Prerequisites
Before writing any list crawling code, establish a clean Python environment. This is non-negotiable for production work — dependency conflicts between Scrapy, Playwright, and their async runtimes are a frequent source of silent failures.
# Python 3.11+ recommended for Scrapy and Playwright async compatibility
python --version # Confirm 3.11+
# Create isolated environment
python -m venv .listcrawl-env
source .listcrawl-env/bin/activate # Windows: .listcrawl-env\Scripts\activate
# Core list crawling dependencies
# asyncio is part of the standard library: never pip install it (the PyPI
# package of that name is an obsolete backport). httpx needs the [http2]
# extra because later examples pass http2=True.
pip install scrapy scrapy-redis playwright "httpx[http2]" selectolax \
    lxml beautifulsoup4 itemadapter anthropic google-genai
# Install Playwright browser binaries
playwright install chromium firefox
playwright install-deps chromium # Install OS-level dependencies (Linux)
Scrapy’s Twisted reactor and Playwright’s asyncio event loop can clash when both run in the same process. Keep them isolated: never import both frameworks in the same module without an explicit loop isolation strategy (discussed later in the distributed architecture section).
List Crawling Pattern 1: Paginated List Scraping with Scrapy
For high-volume paginated list scraping, Scrapy remains the production-grade default. Its request deduplication, AutoThrottle middleware, and retry handling eliminate the boilerplate that plagues hand-rolled paginated scrapers.
The following spider handles three pagination schema variants in a single implementation:
# spiders/catalog_list_spider.py
import scrapy
from itemadapter import ItemAdapter
from urllib.parse import urlencode, urlparse, parse_qs, urljoin
class CatalogListSpider(scrapy.Spider):
"""
Production paginated list scraping spider.
Handles: numbered page params, path-segment pagination, cursor-based pagination.
Prerequisites: scrapy, scrapy-redis, itemadapter
pip install scrapy scrapy-redis itemadapter
"""
name = "catalog_list"
custom_settings = {
"CONCURRENT_REQUESTS": 32,
"DOWNLOAD_DELAY": 0.75,
"AUTOTHROTTLE_ENABLED": True,
"AUTOTHROTTLE_START_DELAY": 0.5,
"AUTOTHROTTLE_TARGET_CONCURRENCY": 16,
"AUTOTHROTTLE_MAX_DELAY": 10,
"ROBOTSTXT_OBEY": True,
"HTTPCACHE_ENABLED": True, # Critical for paginated list scraping dev/debug
"HTTPCACHE_EXPIRATION_SECS": 3600,
"DUPEFILTER_CLASS": "scrapy.dupefilters.RFPDupeFilter",
"RETRY_TIMES": 3,
"RETRY_HTTP_CODES": [429, 500, 502, 503, 504],
"DEFAULT_REQUEST_HEADERS": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-GB,en;q=0.9",
},
}
# --- Configure your target here ---
start_urls = ["https://example.com/products"]
# Pagination schema: "param" | "path" | "cursor"
PAGINATION_SCHEMA = "param"
PAGE_PARAM = "page" # Only used when PAGINATION_SCHEMA == "param"
ITEMS_SELECTOR = "div.product-card"
NEXT_PAGE_SELECTOR = "a.pagination-next::attr(href)"
CURSOR_SELECTOR = "meta[name='next-cursor']::attr(content)"
def parse(self, response):
"""
Primary handler for list page responses.
Extracts items and discovers next page URL.
"""
items = response.css(self.ITEMS_SELECTOR)
if not items:
self.logger.warning(f"No items found on {response.url} — check selector or detect block")
return
self.logger.info(f"Found {len(items)} items on {response.url}")
for item in items:
yield self._extract_item(item, response)
# Discover and follow next page — handles all 3 pagination schemas
yield from self._follow_pagination(response)
def _extract_item(self, item, response):
"""
Extract structured data from a single list item.
Override in subclasses for domain-specific schemas.
    IMPORTANT: Always pass a default to .get(""). Without one, a missing
    selector returns None and the .strip() calls below raise AttributeError.
"""
return {
"name": item.css("h2.product-title::text, h3.product-title::text").get("").strip(),
"price": item.css(".price::text, [data-price]::text").get("").strip(),
"sku": item.attrib.get("data-sku", item.attrib.get("data-id", "")),
"url": response.urljoin(
item.css("a::attr(href)").get("")
),
"image_url": item.css("img::attr(src), img::attr(data-src)").get(""),
"source_page": response.url,
}
def _follow_pagination(self, response):
"""
Pagination schema dispatcher.
Returns a generator of Scrapy Request objects.
Handles the most common list crawling pagination edge cases:
- Parameter-based pagination with auto-increment
- Path-segment pagination
- Cursor/token-based pagination (next-cursor in meta or JSON)
"""
if self.PAGINATION_SCHEMA in ("param", "path"):
# Try explicit next-link first (most reliable)
next_href = response.css(self.NEXT_PAGE_SELECTOR).get("")
if next_href:
yield response.follow(next_href, callback=self.parse)
return
# Fallback: auto-increment the page parameter
if self.PAGINATION_SCHEMA == "param":
parsed = urlparse(response.url)
params = parse_qs(parsed.query)
current_page = int(params.get(self.PAGE_PARAM, ["1"])[0])
params[self.PAGE_PARAM] = [str(current_page + 1)]
next_url = response.url.split("?")[0] + "?" + urlencode(
{k: v[0] for k, v in params.items()}
)
                # Guard: skip if the URL did not change. Note this does not
                # catch sites that silently serve page 1 content past the cap;
                # for those, also set CLOSESPIDER_PAGECOUNT or compare item
                # fingerprints across pages.
                if next_url != response.url:
                    yield response.follow(next_url, callback=self.parse)
elif self.PAGINATION_SCHEMA == "cursor":
# Cursor from meta tag — common in API-backed catalog lists
cursor = response.css(self.CURSOR_SELECTOR).get("")
if not cursor:
# Also check for cursor in inline JSON script blocks
import json
for script in response.css("script[type='application/json']::text").getall():
try:
data = json.loads(script)
cursor = data.get("nextCursor") or data.get("pagination", {}).get("cursor", "")
if cursor:
break
except (json.JSONDecodeError, AttributeError):
continue
if cursor:
next_url = response.url.split("?")[0] + f"?cursor={cursor}"
yield response.follow(next_url, callback=self.parse)
Pagination Cap Evasion: The Critical Pattern Nobody Talks About
The pagination cap problem is the most under-documented failure mode in paginated list scraping. Here is a concrete implementation of the filter decomposition strategy:
# spiders/faceted_catalog_spider.py
import scrapy
from itertools import product as iter_product
class FacetedCatalogSpider(scrapy.Spider):
"""
Faceted catalog list crawling with pagination cap evasion.
Most platforms cap visible pages at 20–100 regardless of catalog size.
Strategy: decompose catalog using price bands and category filters
so no single filter combination exceeds the pagination cap.
    Example: a catalog of 50,000 products at 25 per page with a 100-page cap
    exposes at most 2,500 items per filter combination, so full coverage needs
    at least 50,000 / 2,500 = 20 combinations. The 8 price bands × 5 categories
    below give 40, leaving headroom for skewed catalog distributions.
"""
name = "faceted_catalog"
BASE_URL = "https://example.com/products"
RESULTS_PER_PAGE = 25
MAX_PAGES_VISIBLE = 100 # Platform's pagination cap
MAX_ITEMS_PER_FILTER = 2500 # RESULTS_PER_PAGE × MAX_PAGES_VISIBLE
# Price band decomposition — tune these to your target catalog distribution
PRICE_BANDS = [
(0, 10), (10, 25), (25, 50), (50, 100),
(100, 200), (200, 500), (500, 1000), (1000, 9999)
]
# Category filter values — extract from the site's facet navigation
CATEGORIES = ["electronics", "clothing", "books", "home", "sports"]
def start_requests(self):
"""
Generate decomposed URL set from all filter combinations.
Each combination should yield fewer than MAX_ITEMS_PER_FILTER results.
"""
for (price_min, price_max), category in iter_product(self.PRICE_BANDS, self.CATEGORIES):
url = (
f"{self.BASE_URL}"
f"?category={category}"
f"&price_min={price_min}"
f"&price_max={price_max}"
f"&sort=newest" # Consistent sort prevents duplicates across runs
)
yield scrapy.Request(
url,
callback=self.parse_list,
meta={
"price_band": (price_min, price_max),
"category": category,
"filter_item_count": 0,
}
)
def parse_list(self, response):
# Check if this filter combination exceeds our cap — if so, log a warning
# You would ideally sub-divide further with additional filters
total_count_el = response.css(".total-results::text").get("0")
try:
total_count = int(total_count_el.replace(",", "").strip())
except ValueError:
total_count = 0
if total_count > self.MAX_ITEMS_PER_FILTER:
self.logger.warning(
f"Filter combination {response.meta['category']} / "
f"{response.meta['price_band']} returns {total_count} items "
f"— exceeds cap of {self.MAX_ITEMS_PER_FILTER}. "
f"Consider adding more filter dimensions."
)
for item in response.css("div.product-card"):
yield {
"name": item.css("h2::text").get("").strip(),
"price": item.css(".price::text").get("").strip(),
"url": response.urljoin(item.css("a::attr(href)").get("")),
"filter_category": response.meta["category"],
"filter_price_band": response.meta["price_band"],
}
next_page = response.css("a.pagination-next::attr(href)").get()
if next_page:
yield response.follow(next_page, callback=self.parse_list, meta=response.meta)
DataFlirt Recommended Reading: Proper proxy rotation is essential for any paginated list scraping operation at scale. Read our Best IP Rotation Strategies for High-Volume Scraping Projects to avoid IP bans during long paginated crawl runs.
List Crawling Pattern 2: Infinite Scroll Crawling — The Right Way and the Expensive Way
Infinite scroll crawling has two fundamentally different approaches, and the choice between them has a 100–1000x performance difference.
The Expensive Way: Browser Scroll Simulation
Headless browser scroll simulation is the approach most tutorials show. It works, but it is computationally expensive: each browser instance consumes 150–400MB RAM, scroll simulation requires active waiting for DOM mutations, and throughput is measured in tens of pages per minute rather than thousands.
Use this approach only when the site cannot be reverse-engineered, or when you need screenshot-level fidelity for visual validation.
# infinite_scroll_playwright.py — browser scroll simulation
# Prerequisites: pip install playwright   (asyncio ships with Python; do not pip install it)
# playwright install chromium
import asyncio
import json
from playwright.async_api import async_playwright
async def crawl_infinite_scroll(url: str, max_scroll_attempts: int = 50) -> list[dict]:
"""
Infinite scroll crawling via browser scroll simulation.
Use this as a fallback when API reverse-engineering is not possible.
CAVEATS:
- High memory per instance (200–400MB)
- Throughput: ~10–30 pages/minute vs ~500+ for direct API approach
- Element staleness: previously found elements may detach after scroll
- Scroll trigger: some sites use percentage-based triggers, not bottom-of-page
"""
results = []
async with async_playwright() as pw:
browser = await pw.chromium.launch(
headless=True,
args=[
"--disable-blink-features=AutomationControlled",
"--no-sandbox",
"--disable-dev-shm-usage", # Essential in containers with limited /dev/shm
]
)
context = await browser.new_context(
viewport={"width": 1366, "height": 768},
user_agent=(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)
)
page = await context.new_page()
# Block image/font/media resources to reduce bandwidth during infinite scroll crawling
await page.route(
"**/*.{png,jpg,jpeg,gif,svg,webp,ico,woff,woff2,mp4,mp3}",
lambda route: route.abort()
)
await page.goto(url, wait_until="domcontentloaded")
await asyncio.sleep(2) # Allow initial render
prev_height = -1
scroll_count = 0
while scroll_count < max_scroll_attempts:
# Scroll to bottom
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
# Wait for network idle (catches XHR triggered by scroll)
try:
await page.wait_for_load_state("networkidle", timeout=4000)
except Exception:
# Timeout is acceptable — means no new requests were triggered
pass
new_height = await page.evaluate("document.body.scrollHeight")
if new_height == prev_height:
# No new content loaded — we've reached the end
break
prev_height = new_height
scroll_count += 1
# Collect all loaded items AFTER scroll completion
# Important: query all elements AFTER all scrolls to avoid stale element handles
raw_items = await page.evaluate("""
() => {
return Array.from(document.querySelectorAll('div.item-card')).map(el => ({
title: el.querySelector('h3')?.innerText?.trim() ?? '',
price: el.querySelector('.price')?.innerText?.trim() ?? '',
id: el.dataset.id ?? '',
url: el.querySelector('a')?.href ?? '',
}));
}
""")
results = raw_items
await browser.close()
return results
if __name__ == "__main__":
items = asyncio.run(crawl_infinite_scroll("https://example.com/feed"))
print(f"Collected {len(items)} items via scroll simulation")
print(json.dumps(items[:3], indent=2))
The Right Way: XHR API Reverse Engineering
The productive pattern for infinite scroll crawling is to identify the underlying data API endpoint that the browser’s scroll events trigger, then replicate those requests directly — no browser process required.
# infinite_scroll_api_reverse.py — direct API approach (preferred)
# Prerequisites: pip install "httpx[http2]"   (the h2 extra is required for http2=True below)
# Requires: manual DevTools Network inspection to identify the endpoint
import asyncio
import httpx
import json
# ---- HOW TO IDENTIFY THE ENDPOINT ----
# 1. Open DevTools → Network → Filter by XHR/Fetch
# 2. Load the infinite scroll page and scroll once
# 3. Find the request that loaded new items — note URL, method, headers, params
# 4. Right-click → Copy as cURL
# 5. Adapt to the httpx client below
API_ENDPOINT = "https://example.com/api/v2/items"
PAGE_SIZE = 24
# Headers extracted from real browser request — copy from DevTools
BROWSER_HEADERS = {
"Accept": "application/json, text/plain, */*",
"Accept-Language": "en-GB,en;q=0.9",
"Content-Type": "application/json",
"X-Requested-With": "XMLHttpRequest", # Many APIs require this
"Referer": "https://example.com/feed", # Critical for APIs that validate referer
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
),
}
async def crawl_infinite_scroll_via_api(
max_items: int = 10000,
concurrency: int = 5,
) -> list[dict]:
"""
Replicate infinite scroll XHR requests directly.
This approach is 100–1000x faster than browser scroll simulation.
The cursor-based pattern handles: offset params, cursor tokens, and page numbers.
CAVEATS:
- API endpoints may require session cookies from an initial browser visit
- Some APIs rotate the pagination token; you must extract it from each response
- Rate limits on the API endpoint are often more aggressive than the HTML tier
"""
all_items = []
semaphore = asyncio.Semaphore(concurrency)
async def fetch_page(client: httpx.AsyncClient, offset: int) -> dict | None:
async with semaphore:
try:
params = {
"offset": offset,
"limit": PAGE_SIZE,
"sort": "popular", # Keep sort consistent across pages
}
resp = await client.get(
API_ENDPOINT,
params=params,
headers=BROWSER_HEADERS,
timeout=15.0,
)
resp.raise_for_status()
return resp.json()
except (httpx.HTTPStatusError, httpx.TimeoutException, json.JSONDecodeError) as e:
print(f"[ERROR] offset={offset}: {e}")
return None
async with httpx.AsyncClient(
http2=True, # Many APIs serve HTTP/2 — use it for connection multiplexing
follow_redirects=True,
) as client:
# First request to determine total count
first_page = await fetch_page(client, 0)
if not first_page:
return []
# Adapt this key path to your target API response schema
total_count = first_page.get("total", first_page.get("count", max_items))
items_this_page = first_page.get("items", first_page.get("results", []))
all_items.extend(items_this_page)
# Calculate remaining offsets
remaining_offsets = list(range(PAGE_SIZE, min(total_count, max_items), PAGE_SIZE))
# Fetch remaining pages concurrently, respecting semaphore
tasks = [fetch_page(client, offset) for offset in remaining_offsets]
pages = await asyncio.gather(*tasks)
for page_data in pages:
if page_data:
all_items.extend(page_data.get("items", page_data.get("results", [])))
return all_items
if __name__ == "__main__":
items = asyncio.run(crawl_infinite_scroll_via_api(max_items=5000))
print(f"Collected {len(items)} items via direct API")
print(json.dumps(items[:2], indent=2))
DataFlirt Recommended Reading: Dynamic JavaScript sites require specific approach decisions. Our Best Approaches to Scraping Dynamic JavaScript Sites Without Getting Blocked covers the full decision tree from rendering requirements to anti-bot bypass.
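One caveat from the direct-API pattern above deserves its own treatment: JSON endpoints often rate-limit more aggressively than the HTML tier, so production fetchers should retry 429s with jittered exponential backoff rather than fixed delays. A minimal sketch, in which the `fetch` callable and `RetryableError` are illustrative stand-ins for your own fetch layer:

```python
import asyncio
import random

class RetryableError(Exception):
    """Raise from your fetch callable on 429 or transient 5xx responses."""

async def fetch_with_backoff(fetch, *args, max_retries: int = 4, base_delay: float = 0.5):
    """Retry an async fetch callable with jittered exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return await fetch(*args)
        except RetryableError:
            if attempt == max_retries:
                raise
            # Full jitter: spreads retries so concurrent workers do not re-sync
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            await asyncio.sleep(delay)
```

The jitter matters as much as the exponent: a fleet of workers retrying on identical schedules re-synchronises into the same burst that triggered the 429 in the first place.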
List Crawling Pattern 3: Table Data Extraction
HTML table extraction seems simple but has well-documented edge cases: headers spread across multiple rows, merged cells with colspan/rowspan, CSS-styled table analogues without <table> elements, and server-side or client-side paginated tables.
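Merged cells are the trickiest of these edge cases: a cell with colspan=3 occupies three logical columns, and zipping such a row against a flat header shifts every later column left. A minimal expansion sketch, assuming the `(text, colspan)` tuples were built by reading each cell's colspan attribute:

```python
def expand_colspans(rows: list[list[tuple[str, int]]]) -> list[list[str]]:
    """Expand (text, colspan) cell tuples into flat per-column lists so rows
    can be zipped with a flat header. Rowspan needs a second pass that carries
    values down into later rows; omitted here for brevity."""
    expanded = []
    for row in rows:
        flat: list[str] = []
        for text, colspan in row:
            # A colspan=3 cell occupies three logical columns; repeat its value
            flat.extend([text] * max(colspan, 1))
        expanded.append(flat)
    return expanded
```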
# table_list_crawler.py — production table extraction
# Prerequisites: pip install selectolax "httpx[http2]"
import asyncio
import httpx
from selectolax.parser import HTMLParser
from typing import Any
# selectolax is 10–30x faster than BeautifulSoup for high-volume parsing.
# It wraps the Modest/Lexbor HTML5 engines (C extensions); its API differs
# from BeautifulSoup, so it replaces it in role rather than as a drop-in.
HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-GB,en;q=0.9",
}
def extract_table(html: str, table_selector: str = "table") -> list[dict[str, Any]]:
"""
Extract structured data from an HTML table.
Handles:
- Multi-row headers (takes last header row as canonical)
- Missing <thead> (infers header from first row with <th> elements)
- Rows with fewer cells than header (fills with empty string)
- selectolax parser for high-throughput table extraction
CAVEATS:
- Does not handle colspan/rowspan merged cells (requires custom traversal)
- For CSS-grid "tables" without <table> tags, adapt the selector and structure
"""
parser = HTMLParser(html)
table = parser.css_first(table_selector)
if not table:
return []
# Detect header rows — prefer <thead> rows, fall back to first row with <th>
headers = []
thead = table.css_first("thead")
if thead:
# Multiple header rows: take the last one (usually the most specific)
header_rows = thead.css("tr")
if header_rows:
last_header_row = header_rows[-1]
headers = [
th.text(strip=True)
for th in last_header_row.css("th, td")
]
if not headers:
# No <thead>: look for first <tr> with <th> elements
for row in table.css("tr"):
cells = row.css("th")
if cells:
headers = [th.text(strip=True) for th in cells]
break
if not headers:
print("[WARN] Could not detect table headers — using column indices")
# Extract data rows from <tbody>, or all rows if no <tbody>
tbody = table.css_first("tbody")
rows_container = tbody if tbody else table
results = []
for row in rows_container.css("tr"):
cells = row.css("td")
if not cells:
continue # Skip header rows within tbody
cell_values = [cell.text(strip=True) for cell in cells]
if headers:
# Pad short rows with empty strings to avoid KeyError downstream
padded = cell_values + [""] * (len(headers) - len(cell_values))
row_data = dict(zip(headers, padded[:len(headers)]))
else:
row_data = {f"col_{i}": v for i, v in enumerate(cell_values)}
results.append(row_data)
return results
async def crawl_paginated_table(
base_url: str,
table_selector: str = "table",
next_page_selector: str = "a.pagination-next::attr(href)",
) -> list[dict]:
"""
Crawl a multi-page table with paginated list scraping.
Extracts structured data from each page and follows pagination.
"""
all_rows = []
current_url = base_url
async with httpx.AsyncClient(
headers=HEADERS,
follow_redirects=True,
http2=True,
) as client:
while current_url:
resp = await client.get(current_url, timeout=15.0)
resp.raise_for_status()
page_rows = extract_table(resp.text, table_selector)
all_rows.extend(page_rows)
print(f"[OK] {current_url} → {len(page_rows)} rows (total: {len(all_rows)})")
# Parse next page link using selectolax
parser = HTMLParser(resp.text)
next_link_node = parser.css_first(next_page_selector.replace("::attr(href)", ""))
            if next_link_node and next_link_node.attributes.get("href"):
                # urljoin handles absolute, root-relative, and page-relative
                # hrefs, and guards against the infinite loop an empty href
                # would cause with naive string concatenation
                from urllib.parse import urljoin
                current_url = urljoin(current_url, next_link_node.attributes["href"])
            else:
                current_url = None  # No next page: crawl complete
return all_rows
if __name__ == "__main__":
rows = asyncio.run(
crawl_paginated_table(
"https://example.com/data-table",
table_selector="table.data-table",
)
)
print(f"Total rows extracted: {len(rows)}")
if rows:
print("Sample:", rows[0])
List Crawling Pattern 4: LLM-Augmented Structured Data Extraction
The structural fragility of CSS selectors in list crawling is a long-term reliability problem. A site redesign that shifts a class name from .price to .price-display silently produces empty fields with no exception raised. At scale, across hundreds of domains, this breakage is continuous.
LLM-augmented structured data extraction resolves this by routing HTML through a language model that understands the semantic content rather than the DOM structure. The trade-off is latency and cost — but for pipelines running unmonitored across weeks, the reliability gain outweighs both.
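In production this trade-off is best managed with a router: extract with CSS selectors first, and escalate to the LLM only when the output looks like selector drift. A sketch, with illustrative field names and threshold:

```python
def needs_llm_fallback(
    items: list[dict],
    required_fields: tuple[str, ...] = ("name", "price"),
    max_empty_ratio: float = 0.2,
) -> bool:
    """Route a page to LLM extraction when CSS selectors look broken.
    Two drift signals: the item selector matched nothing at all, or it
    matched but required fields came back empty too often (e.g. the
    .price class was renamed in a redesign)."""
    if not items:
        return True
    empty = sum(1 for item in items for f in required_fields if not item.get(f))
    return empty / (len(items) * len(required_fields)) > max_empty_ratio
```

This keeps LLM spend proportional to breakage rather than to crawl volume.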
Gemini 3.1 for High-Volume List Extraction (Google GenAI SDK)
# llm_list_extraction_gemini.py
# Prerequisites: pip install google-genai "httpx[http2]" selectolax
# Set GOOGLE_API_KEY environment variable
import asyncio
import json
import os
import httpx
from google import genai
from google.genai import types
# --- CAVEAT ---
# Gemini 3.1 Flash is optimised for structured extraction with large HTML context.
# Always slice HTML before sending — models have context windows but token cost
# scales linearly. Sending 200KB of raw HTML per list page is wasteful.
# The pre-processing step below strips non-content HTML to reduce tokens ~60–80%.
client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
def strip_html_for_llm(html: str, max_chars: int = 40000) -> str:
"""
Pre-process HTML before LLM extraction.
Removes scripts, styles, SVG, and meta — preserving only content HTML.
Reduces token count by 60–80% on typical product list pages.
"""
from selectolax.parser import HTMLParser
parser = HTMLParser(html)
# Remove non-content nodes
for tag in parser.css("script, style, svg, link, meta, noscript, iframe"):
tag.decompose()
# Extract body text as cleaned HTML
body = parser.css_first("body")
if body:
return body.html[:max_chars] if body.html else ""
return html[:max_chars]
async def extract_list_items_gemini(
html: str,
extraction_schema: str,
model: str = "gemini-3.1-flash",
) -> list[dict]:
"""
Extract structured list items from HTML using Gemini 3.1 Flash.
Returns a list of dicts conforming to extraction_schema.
CAVEATS:
- Always validate JSON output — LLMs occasionally return partial JSON
- Set temperature=0.1 for structured extraction (lower = more deterministic)
- Cache results by URL hash to avoid re-extraction on re-crawls
- Gemini 3.1 Flash handles ~100k tokens context — sufficient for most list pages
"""
cleaned_html = strip_html_for_llm(html)
prompt = f"""Extract ALL items from this HTML page as a JSON array.
Each item should have these fields: {extraction_schema}
Rules:
- Return ONLY a valid JSON array, no explanation, no markdown fences
- If a field is missing, use an empty string ""
- Do not invent values — only extract what is explicitly in the HTML
- Include ALL items visible on the page, not just the first few
HTML:
{cleaned_html}"""
try:
        response = await client.aio.models.generate_content(
            model=model,
            contents=prompt,  # plain string; the SDK wraps it into a Content part
            config=types.GenerateContentConfig(
                response_mime_type="application/json",
                temperature=0.1,
                max_output_tokens=4096,
            ),
        )
raw = response.text.strip()
# Strip accidental markdown fences if model ignores mime_type instruction
if raw.startswith("```"):
raw = raw.split("```")[1]
if raw.startswith("json"):
raw = raw[4:]
parsed = json.loads(raw)
return parsed if isinstance(parsed, list) else [parsed]
except (json.JSONDecodeError, AttributeError) as e:
print(f"[ERROR] Gemini extraction failed: {e}")
return []
async def llm_paginated_list_crawl_gemini(
start_url: str,
extraction_schema: str = "name, price, url, sku, availability",
max_pages: int = 20,
) -> list[dict]:
"""
Full list crawling pipeline using Gemini 3.1 Flash for extraction.
Handles pagination automatically via next-link detection in the LLM response.
For very high volume, prefer the hybrid approach:
- CSS selectors for stable fields (price, SKU)
- LLM for unstable or schema-free fields (description, specs, tags)
"""
all_items = []
current_url = start_url
pages_crawled = 0
async with httpx.AsyncClient(
headers={
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
),
"Accept-Language": "en-GB,en;q=0.9",
},
follow_redirects=True,
http2=True,
) as http_client:
while current_url and pages_crawled < max_pages:
resp = await http_client.get(current_url, timeout=20.0)
resp.raise_for_status()
items = await extract_list_items_gemini(resp.text, extraction_schema)
all_items.extend(items)
pages_crawled += 1
print(f"[Page {pages_crawled}] {current_url} → {len(items)} items extracted")
# Detect next page via simple link parsing (not LLM — keep this cheap)
from selectolax.parser import HTMLParser
parser = HTMLParser(resp.text)
next_node = parser.css_first("a[rel='next'], a.pagination-next, li.next a")
            if next_node and next_node.attributes.get("href"):
                # Resolve relative hrefs against the page we just fetched
                from urllib.parse import urljoin
                current_url = urljoin(current_url, next_node.attributes["href"])
            else:
                current_url = None
return all_items
Claude Sonnet 4.6 for Precision Structured Data Extraction
# llm_list_extraction_claude.py — using Anthropic SDK
# Prerequisites: pip install anthropic httpx selectolax
# Set ANTHROPIC_API_KEY environment variable
import anthropic
import asyncio
import json
import os
import httpx
from selectolax.parser import HTMLParser

anthropic_client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def strip_html_for_llm(html: str, max_chars: int = 30000) -> str:
    parser = HTMLParser(html)
    for tag in parser.css("script, style, svg, link, meta, noscript, iframe"):
        tag.decompose()
    body = parser.css_first("body")
    return (body.html[:max_chars] if body and body.html else html[:max_chars])

def extract_list_items_claude(
    html: str,
    schema_description: str,
    model: str = "claude-sonnet-4-6",
) -> list[dict]:
    """
    Extract structured list items using Claude Sonnet 4.6.
    Claude Sonnet 4.6 is preferred for:
    - Complex nested schemas (e.g., product variants, nested specs)
    - Multi-locale HTML where value types need semantic disambiguation
    - Pipelines where schema precision matters more than throughput cost
    Use claude-opus-4-6 for the highest-precision extraction on complex pages.
    Use claude-sonnet-4-6 (default here) for the best cost/precision balance.
    CAVEAT: Claude does not have a native response_mime_type JSON mode in all SDK
    versions. Use explicit JSON-only instructions and validate the output.
    """
    cleaned_html = strip_html_for_llm(html)
    message = anthropic_client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": f"""You are a structured data extraction engine.
Extract ALL list items from the HTML below as a JSON array.
Required schema per item: {schema_description}
Output rules:
- Return ONLY a valid JSON array. No explanation. No markdown. No backticks.
- Missing fields should be empty strings, not null
- Extract every item visible on the page — do not truncate
- Preserve original formatting for price values (include currency symbols)
HTML:
{cleaned_html}""",
            }
        ],
    )
    raw = message.content[0].text.strip()
    # Defensive cleaning — strip any accidental markdown fences
    if raw.startswith("```"):
        parts = raw.split("```")
        raw = parts[1].lstrip("json").strip() if len(parts) > 1 else raw
    try:
        parsed = json.loads(raw)
        return parsed if isinstance(parsed, list) else [parsed]
    except json.JSONDecodeError as e:
        print(f"[ERROR] Claude extraction returned invalid JSON: {e}")
        print(f"[DEBUG] Raw output (first 300 chars): {raw[:300]}")
        return []
async def hybrid_list_crawl(
    start_url: str,
    css_selectors: dict,
    llm_fallback_fields: list[str],
    max_pages: int = 10,
) -> list[dict]:
    """
    Hybrid extraction: CSS selectors for stable fields, Claude for volatile ones.
    This is the recommended production pattern for list crawling:
    - CSS selectors handle price, SKU, URL — fast and zero-cost
    - Claude handles spec tables, feature lists, schema-free attributes — reliable across redesigns
    - Results are merged per item
    css_selectors format: {"field_name": "css_selector::pseudo_element"}
    llm_fallback_fields: list of field names to route through Claude
    """
    from urllib.parse import urljoin
    all_items = []
    current_url = start_url
    async with httpx.AsyncClient(
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"},
        follow_redirects=True,
    ) as client:
        pages = 0
        while current_url and pages < max_pages:
            resp = await client.get(current_url, timeout=20.0)
            resp.raise_for_status()
            parser = HTMLParser(resp.text)
            # Extract stable fields via CSS selectors (fast, zero-cost)
            items_nodes = parser.css("div.product-card, li.listing-item")
            css_extracted = []
            for node in items_nodes:
                item_data = {}
                for field, selector in css_selectors.items():
                    # Strip any pseudo-element (::text, ::attr(...)) before
                    # handing the selector to selectolax
                    clean_sel = selector.split("::")[0]
                    target_node = node.css_first(clean_sel)
                    if target_node:
                        if "::attr(" in selector:
                            attr = selector.split("::attr(")[1].rstrip(")")
                            item_data[field] = target_node.attributes.get(attr, "")
                        else:
                            item_data[field] = target_node.text(strip=True)
                    else:
                        item_data[field] = ""
                css_extracted.append(item_data)
            # Route LLM fallback fields through Claude for the full page HTML.
            # extract_list_items_claude is synchronous, so run it in a worker
            # thread to avoid blocking the event loop.
            if llm_fallback_fields:
                llm_schema = ", ".join(llm_fallback_fields)
                llm_extracted = await asyncio.to_thread(
                    extract_list_items_claude, resp.text, llm_schema
                )
                # Merge: CSS-extracted items + LLM-extracted items by position
                for i, css_item in enumerate(css_extracted):
                    if i < len(llm_extracted):
                        css_item.update({
                            k: v for k, v in llm_extracted[i].items()
                            if k in llm_fallback_fields
                        })
            all_items.extend(css_extracted)
            pages += 1
            # Pagination — urljoin handles relative hrefs safely
            next_node = parser.css_first("a[rel='next'], a.next-page")
            if next_node and next_node.attributes.get("href"):
                current_url = urljoin(current_url, next_node.attributes["href"])
            else:
                current_url = None
    return all_items
DataFlirt Recommended Reading: For a comprehensive comparison of all LLM-powered scraping tools and pipeline integration patterns, see our Best Scraping Tools Powered by LLMs in 2026.
Distributed List Crawling Architecture for Production Scale
Single-process list crawling hits a practical ceiling at approximately 300–600 requests/second for HTTP-only crawling and 10–30 pages/minute for headless browser-based infinite scroll crawling. For crawls exceeding millions of pages, the architecture must scale horizontally.
Scrapy + scrapy-redis: The Distributed List Crawling Standard
# settings.py — distributed list crawling with scrapy-redis
# Prerequisites: pip install scrapy scrapy-redis
# Requires: Redis instance accessible at REDIS_URL
# --- Distributed Queue Settings ---
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = "redis://redis-service:6379" # Kubernetes service or managed Redis endpoint
SCHEDULER_PERSIST = True # Don't flush the queue on restart
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
# --- Throughput Settings ---
CONCURRENT_REQUESTS = 64
DOWNLOAD_DELAY = 0.3
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 32
# --- Pipeline Settings ---
ITEM_PIPELINES = {
"scrapy_redis.pipelines.RedisPipeline": 100, # Optional: buffer to Redis
"myproject.pipelines.PostgresPipeline": 200, # Persist to PostgreSQL
"myproject.pipelines.DeduplicationPipeline": 50, # Hash-based dedup
}
# --- Distributed crawling requires explicit START_URLS management ---
# Push starting URLs to Redis directly rather than using start_urls
# redis-cli -h redis-service LPUSH myspider:start_urls "https://example.com/products?page=1"
# kubernetes/list-crawler-cronjob.yaml
# Horizontal scaling for distributed list crawling
apiVersion: batch/v1
kind: CronJob
metadata:
  name: catalog-list-crawler
spec:
  schedule: "0 */4 * * *"  # Every 4 hours — tune to data freshness requirements
  concurrencyPolicy: Forbid  # Prevent overlapping runs for the same catalog
  jobTemplate:
    spec:
      parallelism: 5  # 5 concurrent Scrapy workers sharing the Redis queue
      completions: 5
      template:
        spec:
          containers:
            - name: scrapy-worker
              image: your-registry/list-crawler:latest
              env:
                - name: REDIS_URL
                  valueFrom:
                    secretKeyRef:
                      name: redis-credentials
                      key: url
                - name: ANTHROPIC_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: llm-api-keys
                      key: anthropic
                - name: GOOGLE_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: llm-api-keys
                      key: google
              command:
                - "scrapy"
                - "crawl"
                - "catalog_list"
                - "-s"
                - "REDIS_URL=$(REDIS_URL)"
              resources:
                requests:
                  memory: "512Mi"
                  cpu: "500m"
                limits:
                  memory: "1Gi"
                  cpu: "1000m"
          restartPolicy: OnFailure
PostgreSQL Output Pipeline with Upsert Semantics
# pipelines.py — production output pipeline for list crawling data
import psycopg2
from psycopg2.extras import execute_values
import hashlib
import json

class PostgresListCrawlPipeline:
    """
    High-throughput PostgreSQL pipeline for list crawling output.
    Uses execute_values for batch inserts (10–50x faster than single row inserts).
    ON CONFLICT DO UPDATE handles re-crawls without duplicates.
    Prerequisites: pip install psycopg2-binary
    """

    BATCH_SIZE = 500  # Flush to DB every 500 items

    def open_spider(self, spider):
        self.conn = psycopg2.connect(
            host=spider.settings.get("PG_HOST", "localhost"),
            dbname=spider.settings.get("PG_DB", "scrapedb"),
            user=spider.settings.get("PG_USER", "postgres"),
            password=spider.settings.get("PG_PASSWORD", ""),
        )
        self.cursor = self.conn.cursor()
        self.batch = []
        self._create_table()

    def _create_table(self):
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS crawled_list_items (
                url_hash CHAR(64) PRIMARY KEY,
                name TEXT,
                price TEXT,
                sku TEXT,
                url TEXT,
                raw_data JSONB,
                first_seen TIMESTAMPTZ DEFAULT NOW(),
                last_seen TIMESTAMPTZ DEFAULT NOW(),
                spider_name TEXT
            );
        """)
        self.conn.commit()

    def process_item(self, item, spider):
        url_hash = hashlib.sha256(item.get("url", "").encode()).hexdigest()
        self.batch.append((
            url_hash,
            item.get("name", ""),
            item.get("price", ""),
            item.get("sku", ""),
            item.get("url", ""),
            json.dumps(dict(item)),
            spider.name,
        ))
        if len(self.batch) >= self.BATCH_SIZE:
            self._flush()
        return item

    def _flush(self):
        if not self.batch:
            return
        execute_values(
            self.cursor,
            """
            INSERT INTO crawled_list_items (url_hash, name, price, sku, url, raw_data, spider_name)
            VALUES %s
            ON CONFLICT (url_hash) DO UPDATE SET
                price = EXCLUDED.price,
                raw_data = EXCLUDED.raw_data,
                last_seen = NOW()
            """,
            self.batch,
        )
        self.conn.commit()
        self.batch.clear()

    def close_spider(self, spider):
        self._flush()
        self.cursor.close()
        self.conn.close()
Anti-Bot Considerations Specific to List Crawling
List crawling generates a traffic signature that anti-bot systems recognise immediately: consistent inter-request intervals, sequential URL patterns, no CSS/image/font resource loading, and a crawl depth that goes no deeper than list pages. This is a bot profile, and most production anti-bot vendors score it aggressively.
The DataFlirt engineering team has identified five list crawling-specific bot signals to mitigate:
1. Regular inter-request cadence. Human browsing within a category list shows variable dwell time per page — 5 to 45 seconds. A crawler hitting pages every 750ms is a textbook bot pattern. Implement Gaussian-distributed delays: time.sleep(random.gauss(mu=2.5, sigma=0.8)) rather than fixed delays.
2. No subresource loading. Real browsers load CSS, images, and fonts. Scrapy by default loads only the HTML document. For targets with browser fingerprinting at the network layer, using scrapy-playwright to render pages as a real browser eliminates this signal.
3. Sequential URL patterns in the access log. Crawling pages 1, 2, 3, 4 in order is an obvious bot pattern. Randomize crawl order by shuffling the URL frontier after discovery.
4. No session persistence. Real users maintain session cookies across pages. Configure Scrapy’s COOKIES_ENABLED = True and allow cookies to persist within a crawl session.
5. Datacenter IP ranges. The most important factor. Residential proxy rotation aligned to the target site’s primary audience geography is the single most effective anti-block measure for list crawling operations.
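Signal #1 above deserves a concrete sketch. A minimal cadence helper, assuming nothing beyond the standard library: Gaussian sampling can return values near zero or even negative, so the clamp floor matters in practice, otherwise an occasional near-zero delay produces exactly the burst pattern you are trying to avoid.

```python
import random

def human_like_delay(mu: float = 2.5, sigma: float = 0.8, floor: float = 0.5) -> float:
    """Sample a Gaussian-distributed inter-request delay in seconds.

    The clamp prevents the occasional near-zero or negative sample from
    turning into an accidental burst of back-to-back requests.
    """
    return max(floor, random.gauss(mu, sigma))

# Usage inside a crawl loop:
#   time.sleep(human_like_delay())
# For signal #3, also shuffle the discovered URL frontier before crawling:
#   random.shuffle(frontier)
```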
DataFlirt Recommended Reading: If your list crawling targets are protected by Cloudflare or similar enterprise-grade bot protection, our Top 5 Cloudflare Bypass Methods and the Tools Behind Them covers the full evasion stack.
Node.js List Crawling with Crawlee: The Full-Stack Alternative
For engineering teams whose data stack is JavaScript-native, Crawlee provides a unified list crawling framework combining an HTTP crawler (Cheerio-backed) and a browser crawler (Playwright-backed) under a single API.
// crawlee_list_crawler.js
// Prerequisites: node >= 18
// npm install crawlee playwright
// npx playwright install chromium
import { CheerioCrawler, PlaywrightCrawler, Dataset } from 'crawlee';

// --- HTTP-tier crawler for paginated list scraping (fast, low-resource) ---
const httpListCrawler = new CheerioCrawler({
  maxConcurrency: 20,
  requestHandlerTimeoutSecs: 30,
  async requestHandler({ request, $, enqueueLinks, log }) {
    log.info(`[HTTP] Crawling list page: ${request.url}`);
    // Extract list items using Cheerio (jQuery-like selector API)
    const items = [];
    $('div.product-card, li.listing-item').each((_, el) => {
      const item = {
        name: $(el).find('h2, h3').first().text().trim(),
        price: $(el).find('.price, [data-price]').first().text().trim(),
        url: new URL($(el).find('a').first().attr('href') || '', request.url).href,
        sku: $(el).attr('data-sku') || $(el).attr('data-id') || '',
      };
      // Only push items with at least a name
      if (item.name) items.push(item);
    });
    await Dataset.pushData(items);
    log.info(`  → Extracted ${items.length} items`);
    // Follow all pagination links automatically
    await enqueueLinks({
      selector: 'a[rel="next"], a.pagination-next, a.next-page',
      label: 'LIST_PAGE',
    });
  },
  failedRequestHandler({ request, log }) {
    log.error(`HTTP crawler failed: ${request.url}`);
  },
});

// --- Browser-tier crawler for infinite scroll crawling (JavaScript-rendered lists) ---
const browserListCrawler = new PlaywrightCrawler({
  maxConcurrency: 5, // Lower concurrency — browser instances are expensive
  requestHandlerTimeoutSecs: 90,
  launchContext: {
    launchOptions: {
      headless: true,
      args: [
        '--disable-blink-features=AutomationControlled',
        '--no-sandbox',
        '--disable-dev-shm-usage',
      ],
    },
  },
  async requestHandler({ request, page, log }) {
    log.info(`[Browser] Crawling infinite scroll page: ${request.url}`);
    // Block images and fonts to reduce bandwidth during infinite scroll crawling.
    // Crawlee has already navigated once; re-navigating below applies the route.
    await page.route('**/*.{png,jpg,jpeg,gif,svg,webp,woff,woff2}', route => route.abort());
    await page.goto(request.url, { waitUntil: 'domcontentloaded' });
    // Infinite scroll simulation
    let prevHeight = -1;
    let scrollAttempts = 0;
    const MAX_SCROLL = 30;
    while (scrollAttempts < MAX_SCROLL) {
      await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
      // Wait for network to settle after scroll triggers XHR
      await page.waitForTimeout(1200 + Math.random() * 800);
      const newHeight = await page.evaluate(() => document.body.scrollHeight);
      if (newHeight === prevHeight) break;
      prevHeight = newHeight;
      scrollAttempts++;
    }
    // Collect all loaded items after scroll completion
    const items = await page.$$eval(
      'div.item-card, div.product-card, li.listing-item',
      (nodes) => nodes.map(el => ({
        name: el.querySelector('h2, h3')?.innerText?.trim() ?? '',
        price: el.querySelector('.price, [data-price]')?.innerText?.trim() ?? '',
        url: el.querySelector('a')?.href ?? '',
        id: el.dataset.id ?? el.dataset.sku ?? '',
      }))
    );
    await Dataset.pushData(items);
    log.info(`  → Extracted ${items.length} items after scroll`);
  },
});

// Run one of the two crawlers depending on target type
async function runListCrawl(urls, useBrowser = false) {
  if (useBrowser) {
    await browserListCrawler.run(urls);
  } else {
    await httpListCrawler.run(urls);
  }
  // Export the default dataset as JSON to the key-value store
  const dataset = await Dataset.open();
  await dataset.exportToJSON('list_crawl_output');
  console.log('List crawling complete. Output key: list_crawl_output');
}

// Entry point — pass true as the second argument for infinite scroll / JS-rendered lists
await runListCrawl(['https://example.com/products'], false);
Monitoring Your List Crawl Pipeline
A list crawling pipeline degrades in four predictable ways: selector drift, pagination cap hit (silent coverage drop), block rate increase, and proxy pool exhaustion. All four are undetectable without instrumentation.
Minimum viable monitoring for production list crawling:
# monitoring.py — Prometheus metrics for list crawling pipelines
from prometheus_client import Counter, Gauge, Histogram

LIST_PAGES_CRAWLED = Counter(
    "list_crawl_pages_total",
    "Total list pages crawled",
    ["spider_name", "domain", "pagination_type"],
)

LIST_ITEMS_EXTRACTED = Counter(
    "list_crawl_items_total",
    "Total list items extracted",
    ["spider_name", "domain"],
)

EMPTY_PAGE_RATE = Gauge(
    "list_crawl_empty_page_rate",
    "Rate of pages returning 0 items (selector drift or block indicator)",
    ["spider_name", "domain"],
)

PAGINATION_DEPTH_REACHED = Histogram(
    "list_crawl_pagination_depth",
    "Pagination depth reached per crawl run",
    ["spider_name", "domain"],
    buckets=[1, 5, 10, 20, 50, 100, 200, 500],
)

# Alert rule (pseudo-Grafana):
# Alert when empty_page_rate > 0.05 for 15 minutes
# This indicates selector drift OR a block — both require immediate intervention
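The pseudo-rule above can be expressed as a native Prometheus alerting rule. A sketch, assuming the metric and label names defined in monitoring.py and a standard Prometheus rule file loaded via `rule_files`:

```yaml
# alerts/list_crawl.rules.yml
groups:
  - name: list-crawl-alerts
    rules:
      - alert: ListCrawlSelectorDriftOrBlock
        expr: list_crawl_empty_page_rate > 0.05
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Empty-page rate above 5% for {{ $labels.spider_name }} on {{ $labels.domain }}"
          description: "Likely selector drift or an active block; inspect recent HTML snapshots for this spider."
```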
DataFlirt Recommended Reading: For a full production monitoring stack covering CAPTCHA rates, proxy pool health, and throughput metrics, see Best Monitoring and Alerting Tools for Production Scraping Pipelines.
Final Verdict: Choosing Your List Crawling Stack in 2026
| Use Case | Recommended Stack | Why |
|---|---|---|
| Large-scale static paginated list scraping | Scrapy + scrapy-redis | Unmatched throughput, distributed queue, retry middleware |
| Paginated catalog with cap evasion required | Scrapy + facet decomposition | Filter-based URL frontier bypasses page caps |
| Infinite scroll (JS-heavy) | XHR reverse engineering + httpx | 100–1000x faster than browser simulation |
| Infinite scroll (cannot reverse-engineer) | Playwright async | Most capable browser automation for scroll simulation |
| HTML table extraction | selectolax + httpx | C-extension parser 10–30x faster than BeautifulSoup |
| Node.js teams, mixed HTTP+browser | Crawlee | Unified framework, dataset API, TypeScript-native |
| Schema-free / LLM-augmented extraction | Scrapy + Gemini 3.1 Flash | Cost-efficient volume extraction with JSON output |
| Precision schema extraction | Scrapy + Claude Sonnet 4.6 / Opus 4.6 | Highest JSON fidelity for complex nested schemas |
| Production scale, >1M pages/day | Scrapy + scrapy-redis + Kubernetes | Horizontal pod autoscaling, crash-resilient queue |
The DataFlirt engineering team’s recommended production pattern: a Scrapy HTTP tier with scrapy-redis for distributed paginated list scraping, a Playwright worker pool behind a Redis message queue for infinite scroll and JavaScript-rendered detail pages, and a Gemini 3.1 Flash or Claude Sonnet 4.6 extraction layer in the item pipeline for schema-resilient structured data extraction. Deploy with Kubernetes CronJobs, monitor with Prometheus + Grafana, and back the URL frontier with Redis sorted sets for priority-weighted freshness management.
Recommended Reading from DataFlirt
Engineering teams building production list crawling infrastructure will find these DataFlirt guides directly relevant to the layers discussed above:
- Best Free Web Scraping Tools in 2026 for Developers — Full comparative analysis of every open-source scraping framework including Scrapy, Playwright, Crawlee, and Camoufox
- Best Approaches to Scraping Dynamic JavaScript Sites Without Getting Blocked — Deep dive on Playwright and infinite scroll crawling patterns for JavaScript-heavy sites
- How to Bypass Google CAPTCHA — Web Scraping Guide — Full evasion stack for SERP list crawling including TLS fingerprint spoofing, stealth browser configuration, and audio CAPTCHA fallback
- 5 Best IP Rotation Strategies for High-Volume Scraping Projects — Essential companion for preventing IP bans during long paginated list crawl runs
- Best Scraping Tools Powered by LLMs in 2026 — Full LLM extraction pipeline comparison for structured data extraction
- Top 5 Cloudflare Bypass Methods and the Tools Behind Them — For list crawling targets protected by enterprise-grade anti-bot systems
- Best Databases for Storing Scraped Data at Scale — Pipeline integration for the output side of your list crawling stack
- Top 10 Open-Source Web Scraping Tools Worth Using in 2026 — Expanded open-source crawler landscape beyond the tools covered in this guide
- Best Proxy Management Tools to Rotate and Manage Proxies at Scale — Proxy rotation architecture for distributed list crawling deployments
- Top 7 Scraping Infrastructure Patterns Used by High-Volume Data Teams — Enterprise-grade pipeline patterns that extend directly from the distributed list crawling architecture described in this guide
- Web Scraping GDPR — Compliance considerations for EU-targeted list crawling operations, particularly relevant for e-commerce and business directory scraping
Frequently Asked Questions
What is list crawling and how is it different from general web scraping?
List crawling is the systematic extraction of structured data from pages that render information in repeated, list-like formats — product catalogs, job boards, directories, and search result pages. General web scraping targets arbitrary page content and traverses diverse link graphs. List crawling specifically requires pagination state management, a URL frontier scoped to list boundaries, and a parser built around repeating DOM structures. The distinction matters architecturally: a general crawler that discovers list pages by link traversal will typically miss deep pagination coverage, while a purpose-built list crawler is designed from the ground up for completeness across all pages of a given list.
What Python tools are best for paginated list scraping in 2026?
For high-volume paginated list scraping, Scrapy with scrapy-redis is the gold standard — its middleware ecosystem, AutoThrottle, and distributed queue support are unmatched at the maturity level required for production. For JavaScript-rendered paginated catalogs, Playwright with asyncio provides the most capable browser automation. For lightweight scraping of static paginated pages, httpx combined with selectolax (a C-extension HTML parser roughly 10–30x faster than BeautifulSoup for high-volume parsing) is the most efficient combination. Always benchmark against your actual target pagination schema before committing to an architecture — the right choice depends heavily on whether the target site requires JavaScript rendering.
How do I handle infinite scroll crawling without a headless browser?
Reverse engineering the site’s underlying XHR or Fetch API endpoints is the most efficient approach. Most infinite scroll implementations load data from a paginated JSON API — the browser scroll event simply triggers new fetch requests to a backend endpoint. Open browser DevTools, filter the Network panel to XHR/Fetch requests, scroll the page once, and identify the data endpoint, its parameters, and required headers. Replicate those requests directly with httpx or curl_cffi. This eliminates the 150–400MB memory overhead per browser instance and increases throughput by 100–1000x compared to scroll simulation.
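The replication step can be sketched as follows. Every specific here, the endpoint path, the `page`/`per_page` parameter names, and the header set, is a hypothetical placeholder: read the real values off the Network panel for your target.

```python
from urllib.parse import urlencode

# Hypothetical JSON endpoint discovered via DevTools — replace with the
# actual path and parameters your target's infinite scroll requests.
API_BASE = "https://example.com/api/v2/listings"

def build_api_request(page: int, per_page: int = 48) -> tuple[str, dict]:
    """Construct the URL and headers for one page of the discovered API."""
    params = {"page": page, "per_page": per_page, "sort": "newest"}
    headers = {
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",  # some endpoints check for this
    }
    return f"{API_BASE}?{urlencode(params)}", headers

async def fetch_all_pages(max_pages: int = 50) -> list[dict]:
    import httpx  # deferred so the request builder stays dependency-free
    items: list[dict] = []
    async with httpx.AsyncClient() as client:
        for page in range(1, max_pages + 1):
            url, headers = build_api_request(page)
            resp = await client.get(url, headers=headers, timeout=15.0)
            resp.raise_for_status()
            batch = resp.json().get("results", [])
            if not batch:  # an empty page means we ran past the end
                break
            items.extend(batch)
    return items
```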
How do I bypass pagination limits when crawling product catalogs?
Most e-commerce platforms cap visible paginated results at 20–100 pages regardless of actual catalog depth. The correct approach is to decompose the catalog using categorical filters so that each combination surfaces fewer results than the pagination cap. Use price range bands, subcategory facets, brand filters, or date ranges to create multiple overlapping URL sets that collectively cover the full catalog. Combine filter-based decomposition with sitemap parsing to discover product-level URLs that bypass the listing layer entirely. scrapy-redis distributed queue management with multiple Kubernetes worker pods is the recommended scaling pattern for catalogs exceeding 1 million items.
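A minimal sketch of the price-band decomposition described above; the `min_price`/`max_price` query parameter names are illustrative and should be read off the target site's own filter links:

```python
def price_band_urls(base_url: str, max_price: int, band_width: int) -> list[str]:
    """Decompose a capped catalog listing into non-overlapping price bands.

    Each band should surface fewer results than the pagination cap; if a
    band still hits the cap, split it again with a narrower band_width.
    """
    urls = []
    for lo in range(0, max_price, band_width):
        hi = lo + band_width
        urls.append(f"{base_url}?min_price={lo}&max_price={hi}")
    return urls

# Example: four bands covering a £0–£1000 catalog
# price_band_urls("https://example.com/products", 1000, 250)
```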
Can LLMs replace CSS selectors for structured data extraction from lists?
Yes, and for pipelines requiring long-term reliability this is increasingly the production recommendation. CSS selectors break silently when a site redesigns — no exception is raised, the field simply becomes empty. An LLM extraction layer using Gemini 3.1 Flash or Claude Sonnet 4.6 degrades gracefully rather than silently failing. The trade-off is latency (2–8 seconds per LLM call) and token cost. The optimal production pattern is a two-tier extraction pipeline: CSS selectors for high-confidence, stable attributes (price, SKU, URL), and LLM extraction for ambiguous or frequently changing fields (spec tables, feature descriptions, schema-free attributes). Cache LLM extraction results by URL hash to avoid redundant API calls on re-crawls.
What is the best architecture for list crawling at scale across millions of pages?
The DataFlirt-recommended pattern is a two-tier architecture: a Scrapy HTTP tier backed by scrapy-redis for catalog-level list crawling (URL discovery, pagination traversal, item-level URL collection), and a Playwright browser worker pool behind a Redis or SQS message queue for JavaScript-rendered detail pages. Deploy both tiers as Kubernetes CronJobs with horizontal pod autoscaling. Use Prometheus and Grafana to monitor per-worker throughput, error rates, empty-page rates (a selector drift and block indicator), and block rates. The URL frontier should be a Redis sorted set with crawl priority scores to ensure freshness-sensitive URLs are re-crawled before stale ones.