List Crawling: Why It Matters and Why Most Engineers Get It Wrong
If you are a data engineer building production pipelines in 2026, list crawling is almost certainly your highest-volume workload. Product catalogs, job boards, business directories, financial data tables, SERP pages — the internet’s most commercially valuable data is rendered in list format. The global web scraping software market was valued at approximately USD 1.1 billion in 2024 and is projected to grow at a compound annual rate exceeding 18% through 2030. A substantial share of that economic activity is list crawling: systematic extraction from repeated, paginated, or dynamically loaded structured data.
And yet, most engineers approach list crawling as a solved problem — grab the selector, iterate pages, done. In practice, the failure modes are sophisticated. Pagination caps silently truncate catalog coverage. Infinite scroll implementations switch underlying APIs between deployments with no warning. CSS selectors drift after redesigns and fail without raising exceptions. Anti-bot systems fingerprint list crawlers specifically because their request cadence is more regular than human browsing. LLM-augmented extraction pipelines are now part of the production toolkit but require careful architecture to avoid token waste and latency spikes.
This guide is written for senior engineers and data engineers who already know how to fetch a page. We are going to go deep on architecture, edge cases, production patterns, and the specific failure modes that separate amateur list crawlers from reliable, high-throughput data pipelines.
What List Crawling Actually Means (And What It Doesn’t)
List crawling is the automated traversal and structured data extraction from web pages that present data in repeated, list-like formats. This encompasses product catalog pages, job board listings, business directory entries, search result pages, review feeds, and data tables — any page structure where the same HTML template is repeated N times per page, across M pages.
What list crawling is not: it is not a synonym for general web crawling (which traverses arbitrary link graphs), and it is not simple single-page scraping. The defining characteristics of list crawling as a distinct engineering problem are:
- A URL frontier that respects list pagination boundaries (page numbers, cursor tokens, offset parameters)
- A parser that maps repeating selectors across homogeneous DOM structures
- A deduplication layer to handle overlapping pages on dynamic sites
- A completeness guarantee strategy — ensuring full catalog coverage despite pagination caps and filter-based truncation
Understanding this framing changes your architecture decisions significantly. A general-purpose crawler that stumbles across list pages by link discovery will miss paginated coverage. A list crawler that treats all list pages as equivalent will hit pagination caps at page 50 and silently undercount.
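The deduplication layer is worth making concrete. Below is a minimal in-memory sketch; the `sku` and `url` key fields are illustrative, so substitute whatever identifier is stable on your target:

```python
import hashlib

def item_key(item: dict, key_fields: tuple[str, ...] = ("sku", "url")) -> str:
    """Derive a stable dedup key from the first non-empty key field."""
    for field in key_fields:
        value = item.get(field, "")
        if value:
            return hashlib.sha256(f"{field}:{value}".encode()).hexdigest()
    # Fallback: hash the full item contents (order-normalised)
    return hashlib.sha256(repr(sorted(item.items())).encode()).hexdigest()

class DedupLayer:
    """In-memory seen-set for overlapping list pages.
    Swap the set for Redis SADD when the crawl is distributed."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_new(self, item: dict) -> bool:
        key = item_key(item)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

Keying on a stable business identifier rather than the source page is what lets the same item appear on two overlapping pages without being counted twice.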
DataFlirt Recommended Reading: Before diving into list crawling architecture, ensure you understand the broader free web scraping tool landscape: Best Free Web Scraping Tools in 2026 for Developers
The Five Site Structures You Will Encounter in List Crawling
1. Numbered Paginated Lists
The most common structure. Data is split across pages accessible via a URL parameter (?page=2, ?offset=50, /page/2/) or a “Next” button that resolves to a predictable URL. The challenge is not the happy path — it is the edges: sites that cap visible pages at 20–100 regardless of actual catalog depth, inconsistent parameter names across site sections, and pages that return HTTP 200 with empty content rather than 404 when you exceed bounds.
Identification signal: View Page Source contains all data (no JavaScript required). Pagination controls are anchor tags with href attributes.
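A small helper makes the URL variants above concrete. This is a sketch; the parameter names are illustrative, so confirm them against your target before relying on auto-increment:

```python
from urllib.parse import urlencode

def page_url(base: str, page: int, style: str = "param", per_page: int = 25) -> str:
    """Build the URL for page N under the common numbered-pagination schemes.
    style: "param" (?page=N), "offset" (?offset=M), or "path" (/page/N/)."""
    if style == "param":
        return f"{base}?{urlencode({'page': page})}"
    if style == "offset":
        # Offset-based sites count items, not pages
        return f"{base}?{urlencode({'offset': (page - 1) * per_page})}"
    if style == "path":
        return f"{base.rstrip('/')}/page/{page}/"
    raise ValueError(f"unknown pagination style: {style}")
```

Remember the bounds edge case from above: iterate until a page returns HTTP 200 with zero items, not until a 404.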
2. Cursor and Token-Based Pagination
Increasingly common on modern e-commerce and API-backed sites. Instead of page numbers, the “next page” token is embedded in the current page response — in a <meta> tag, a JSON blob in a script element, or a data-* attribute on the pagination control. Each request must parse the cursor from the current response before issuing the next.
Identification signal: URL does not contain a predictable numeric parameter. Inspecting the response HTML reveals a token or cursor value that changes with each page.
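In practice, cursor discovery reduces to checking a small set of known carriers in priority order. The sketch below assumes a `next-cursor` meta name and a `nextCursor` JSON key, both illustrative; inspect your target's responses to find the real carrier:

```python
import json
import re

def extract_next_cursor(html: str) -> str:
    """Locate the next-page cursor in a list page response body.
    Checks a <meta name="next-cursor"> tag first, then any JSON <script>
    blob carrying a nextCursor key. Returns "" when no cursor is found."""
    m = re.search(
        r'<meta\s+name=["\']next-cursor["\']\s+content=["\']([^"\']+)["\']',
        html,
    )
    if m:
        return m.group(1)
    for blob in re.findall(
        r'<script[^>]*type=["\']application/json["\'][^>]*>(.*?)</script>',
        html,
        re.S,
    ):
        try:
            data = json.loads(blob)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict):
            cursor = data.get("nextCursor") or data.get("pagination", {}).get("cursor", "")
            if cursor:
                return cursor
    return ""  # empty cursor = last page
```

Because each cursor comes from the previous response, this pagination style is inherently sequential: parallelise across categories or facets, not across pages.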
3. Infinite Scroll Lists
Content loads as the user scrolls down. From the engineering perspective, this means the page DOM starts incomplete and extends as scroll events trigger XHR or Fetch API calls to a backend endpoint. The naive approach — headless browser scroll simulation — works but is expensive in compute and time.
Identification signal: Only partial content visible in View Page Source. Network panel in DevTools shows XHR/Fetch requests triggered during scroll.
4. Faceted Catalog Lists with Pagination Caps
The structurally hardest case. The site shows paginated lists of products or listings, but caps visibility at N pages (commonly 20–50). A category with 10,000 products returns only the first 500 (20 pages × 25 results). No error is raised — the data is simply invisible.
Identification signal: The total result count shown on the page (e.g., “1,247 results”) is vastly larger than what pagination allows you to access. The last page of pagination shows far fewer results than total count / results per page.
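This identification signal can be automated into a pre-crawl check, a sketch of which follows: parse the advertised total, compare it with what the cap allows, and quantify the gap before committing to a crawl.

```python
import math

def coverage_gap(total_results: int, per_page: int, max_visible_pages: int) -> dict:
    """Quantify what a pagination cap hides, before committing to a crawl."""
    accessible = min(total_results, per_page * max_visible_pages)
    pages_needed = math.ceil(total_results / per_page) if total_results else 0
    return {
        "accessible_items": accessible,
        "hidden_items": total_results - accessible,
        "capped": pages_needed > max_visible_pages,
    }
```

For the 10,000-product category above, `coverage_gap(10_000, 25, 20)` reports 9,500 hidden items: exactly the case where filter decomposition becomes mandatory.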
5. Static Data Tables
HTML <table> elements with headers and rows, or CSS-styled table-like grids. These require a different parsing approach from card-based lists. Multi-page tables may use server-side pagination with predictable URL patterns or client-side filtering that still relies on a data API.
Identification signal: <table> elements in View Page Source with <thead> and <tbody> structure, or a grid of <div> elements with consistent column patterns.
Virtual Environment Setup and Prerequisites
Before writing any list crawling code, establish a clean Python environment. This is non-negotiable for production work — dependency conflicts between Scrapy, Playwright, and their async runtimes are a frequent source of silent failures.
# Python 3.11+ recommended for Scrapy and Playwright async compatibility
python --version # Confirm 3.11+
# Create isolated environment
python -m venv .listcrawl-env
source .listcrawl-env/bin/activate # Windows: .listcrawl-env\Scripts\activate
# Core list crawling dependencies
# asyncio is part of the standard library: never pip install it (the PyPI
# package of that name is an obsolete backport). httpx needs the [http2]
# extra because later examples pass http2=True.
pip install scrapy scrapy-redis playwright "httpx[http2]" selectolax \
    lxml beautifulsoup4 itemadapter anthropic google-genai
# Install Playwright browser binaries
playwright install chromium firefox
playwright install-deps chromium # Install OS-level dependencies (Linux)
Scrapy’s Twisted reactor and Playwright’s asyncio event loop can clash when both run in the same process. Keep them isolated: never import both frameworks in the same module without an explicit loop isolation strategy (discussed later in the distributed architecture section).
List Crawling Pattern 1: Paginated List Scraping with Scrapy
For high-volume paginated list scraping, Scrapy remains the production-grade default. Its request deduplication, AutoThrottle middleware, and retry handling eliminate the boilerplate that plagues hand-rolled paginated scrapers.
The following spider handles three pagination schema variants in a single implementation:
# spiders/catalog_list_spider.py
import scrapy
from itemadapter import ItemAdapter
from urllib.parse import urlencode, urlparse, parse_qs, urljoin
class CatalogListSpider(scrapy.Spider):
"""
Production paginated list scraping spider.
Handles: numbered page params, path-segment pagination, cursor-based pagination.
Prerequisites: scrapy, scrapy-redis, itemadapter
pip install scrapy scrapy-redis itemadapter
"""
name = "catalog_list"
custom_settings = {
"CONCURRENT_REQUESTS": 32,
"DOWNLOAD_DELAY": 0.75,
"AUTOTHROTTLE_ENABLED": True,
"AUTOTHROTTLE_START_DELAY": 0.5,
"AUTOTHROTTLE_TARGET_CONCURRENCY": 16,
"AUTOTHROTTLE_MAX_DELAY": 10,
"ROBOTSTXT_OBEY": True,
"HTTPCACHE_ENABLED": True, # Critical for paginated list scraping dev/debug
"HTTPCACHE_EXPIRATION_SECS": 3600,
"DUPEFILTER_CLASS": "scrapy.dupefilters.RFPDupeFilter",
"RETRY_TIMES": 3,
"RETRY_HTTP_CODES": [429, 500, 502, 503, 504],
"DEFAULT_REQUEST_HEADERS": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-GB,en;q=0.9",
},
}
# --- Configure your target here ---
start_urls = ["https://example.com/products"]
# Pagination schema: "param" | "path" | "cursor"
PAGINATION_SCHEMA = "param"
PAGE_PARAM = "page" # Only used when PAGINATION_SCHEMA == "param"
ITEMS_SELECTOR = "div.product-card"
NEXT_PAGE_SELECTOR = "a.pagination-next::attr(href)"
CURSOR_SELECTOR = "meta[name='next-cursor']::attr(content)"
def parse(self, response):
"""
Primary handler for list page responses.
Extracts items and discovers next page URL.
"""
items = response.css(self.ITEMS_SELECTOR)
if not items:
self.logger.warning(f"No items found on {response.url} — check selector or detect block")
return
self.logger.info(f"Found {len(items)} items on {response.url}")
for item in items:
yield self._extract_item(item, response)
# Discover and follow next page — handles all 3 pagination schemas
yield from self._follow_pagination(response)
def _extract_item(self, item, response):
"""
Extract structured data from a single list item.
Override in subclasses for domain-specific schemas.
    IMPORTANT: Always pass a default to .get(""). Without one, a missing
    selector returns None and the .strip() calls below raise AttributeError.
"""
return {
"name": item.css("h2.product-title::text, h3.product-title::text").get("").strip(),
"price": item.css(".price::text, [data-price]::text").get("").strip(),
"sku": item.attrib.get("data-sku", item.attrib.get("data-id", "")),
"url": response.urljoin(
item.css("a::attr(href)").get("")
),
"image_url": item.css("img::attr(src), img::attr(data-src)").get(""),
"source_page": response.url,
}
def _follow_pagination(self, response):
"""
Pagination schema dispatcher.
Returns a generator of Scrapy Request objects.
Handles the most common list crawling pagination edge cases:
- Parameter-based pagination with auto-increment
- Path-segment pagination
- Cursor/token-based pagination (next-cursor in meta or JSON)
"""
if self.PAGINATION_SCHEMA in ("param", "path"):
# Try explicit next-link first (most reliable)
next_href = response.css(self.NEXT_PAGE_SELECTOR).get("")
if next_href:
yield response.follow(next_href, callback=self.parse)
return
# Fallback: auto-increment the page parameter
if self.PAGINATION_SCHEMA == "param":
parsed = urlparse(response.url)
params = parse_qs(parsed.query)
current_page = int(params.get(self.PAGE_PARAM, ["1"])[0])
params[self.PAGE_PARAM] = [str(current_page + 1)]
next_url = response.url.split("?")[0] + "?" + urlencode(
{k: v[0] for k, v in params.items()}
)
                # Guard: skip if the URL did not change. Note this does not
                # catch sites that silently serve page 1 content past the cap;
                # for those, also set CLOSESPIDER_PAGECOUNT or compare item
                # fingerprints across pages.
                if next_url != response.url:
                    yield response.follow(next_url, callback=self.parse)
elif self.PAGINATION_SCHEMA == "cursor":
# Cursor from meta tag — common in API-backed catalog lists
cursor = response.css(self.CURSOR_SELECTOR).get("")
if not cursor:
# Also check for cursor in inline JSON script blocks
import json
for script in response.css("script[type='application/json']::text").getall():
try:
data = json.loads(script)
cursor = data.get("nextCursor") or data.get("pagination", {}).get("cursor", "")
if cursor:
break
except (json.JSONDecodeError, AttributeError):
continue
if cursor:
next_url = response.url.split("?")[0] + f"?cursor={cursor}"
yield response.follow(next_url, callback=self.parse)
Pagination Cap Evasion: The Critical Pattern Nobody Talks About
The pagination cap problem is the most under-documented failure mode in paginated list scraping. Here is a concrete implementation of the filter decomposition strategy:
# spiders/faceted_catalog_spider.py
import scrapy
from itertools import product as iter_product
class FacetedCatalogSpider(scrapy.Spider):
"""
Faceted catalog list crawling with pagination cap evasion.
Most platforms cap visible pages at 20–100 regardless of catalog size.
Strategy: decompose catalog using price bands and category filters
so no single filter combination exceeds the pagination cap.
    Example: a catalog of 50,000 products at 25 per page with a 100-page cap
    exposes at most 2,500 items per filter combination, so full coverage needs
    at least 50,000 / 2,500 = 20 combinations. The 8 price bands × 5 categories
    below give 40, leaving headroom for skewed catalog distributions.
"""
name = "faceted_catalog"
BASE_URL = "https://example.com/products"
RESULTS_PER_PAGE = 25
MAX_PAGES_VISIBLE = 100 # Platform's pagination cap
MAX_ITEMS_PER_FILTER = 2500 # RESULTS_PER_PAGE × MAX_PAGES_VISIBLE
# Price band decomposition — tune these to your target catalog distribution
PRICE_BANDS = [
(0, 10), (10, 25), (25, 50), (50, 100),
(100, 200), (200, 500), (500, 1000), (1000, 9999)
]
# Category filter values — extract from the site's facet navigation
CATEGORIES = ["electronics", "clothing", "books", "home", "sports"]
def start_requests(self):
"""
Generate decomposed URL set from all filter combinations.
Each combination should yield fewer than MAX_ITEMS_PER_FILTER results.
"""
for (price_min, price_max), category in iter_product(self.PRICE_BANDS, self.CATEGORIES):
url = (
f"{self.BASE_URL}"
f"?category={category}"
f"&price_min={price_min}"
f"&price_max={price_max}"
f"&sort=newest" # Consistent sort prevents duplicates across runs
)
yield scrapy.Request(
url,
callback=self.parse_list,
meta={
"price_band": (price_min, price_max),
"category": category,
"filter_item_count": 0,
}
)
def parse_list(self, response):
# Check if this filter combination exceeds our cap — if so, log a warning
# You would ideally sub-divide further with additional filters
total_count_el = response.css(".total-results::text").get("0")
try:
total_count = int(total_count_el.replace(",", "").strip())
except ValueError:
total_count = 0
if total_count > self.MAX_ITEMS_PER_FILTER:
self.logger.warning(
f"Filter combination {response.meta['category']} / "
f"{response.meta['price_band']} returns {total_count} items "
f"— exceeds cap of {self.MAX_ITEMS_PER_FILTER}. "
f"Consider adding more filter dimensions."
)
for item in response.css("div.product-card"):
yield {
"name": item.css("h2::text").get("").strip(),
"price": item.css(".price::text").get("").strip(),
"url": response.urljoin(item.css("a::attr(href)").get("")),
"filter_category": response.meta["category"],
"filter_price_band": response.meta["price_band"],
}
next_page = response.css("a.pagination-next::attr(href)").get()
if next_page:
yield response.follow(next_page, callback=self.parse_list, meta=response.meta)
DataFlirt Recommended Reading: Proper proxy rotation is essential for any paginated list scraping operation at scale. Read our Best IP Rotation Strategies for High-Volume Scraping Projects to avoid IP bans during long paginated crawl runs.
List Crawling Pattern 2: Infinite Scroll Crawling — The Right Way and the Expensive Way
Infinite scroll crawling has two fundamentally different approaches, and the choice between them has a 100–1000x performance difference.
The Expensive Way: Browser Scroll Simulation
Headless browser scroll simulation is the approach most tutorials show. It works, but it is computationally expensive: each browser instance consumes 150–400MB RAM, scroll simulation requires active waiting for DOM mutations, and throughput is measured in tens of pages per minute rather than thousands.
Use this approach only when the site cannot be reverse-engineered, or when you need screenshot-level fidelity for visual validation.
# infinite_scroll_playwright.py — browser scroll simulation
# Prerequisites: pip install playwright   (asyncio ships with Python; do not pip install it)
# playwright install chromium
import asyncio
import json
from playwright.async_api import async_playwright
async def crawl_infinite_scroll(url: str, max_scroll_attempts: int = 50) -> list[dict]:
"""
Infinite scroll crawling via browser scroll simulation.
Use this as a fallback when API reverse-engineering is not possible.
CAVEATS:
- High memory per instance (200–400MB)
- Throughput: ~10–30 pages/minute vs ~500+ for direct API approach
- Element staleness: previously found elements may detach after scroll
- Scroll trigger: some sites use percentage-based triggers, not bottom-of-page
"""
results = []
async with async_playwright() as pw:
browser = await pw.chromium.launch(
headless=True,
args=[
"--disable-blink-features=AutomationControlled",
"--no-sandbox",
"--disable-dev-shm-usage", # Essential in containers with limited /dev/shm
]
)
context = await browser.new_context(
viewport={"width": 1366, "height": 768},
user_agent=(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)
)
page = await context.new_page()
# Block image/font/media resources to reduce bandwidth during infinite scroll crawling
await page.route(
"**/*.{png,jpg,jpeg,gif,svg,webp,ico,woff,woff2,mp4,mp3}",
lambda route: route.abort()
)
await page.goto(url, wait_until="domcontentloaded")
await asyncio.sleep(2) # Allow initial render
prev_height = -1
scroll_count = 0
while scroll_count < max_scroll_attempts:
# Scroll to bottom
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
# Wait for network idle (catches XHR triggered by scroll)
try:
await page.wait_for_load_state("networkidle", timeout=4000)
except Exception:
# Timeout is acceptable — means no new requests were triggered
pass
new_height = await page.evaluate("document.body.scrollHeight")
if new_height == prev_height:
# No new content loaded — we've reached the end
break
prev_height = new_height
scroll_count += 1
# Collect all loaded items AFTER scroll completion
# Important: query all elements AFTER all scrolls to avoid stale element handles
raw_items = await page.evaluate("""
() => {
return Array.from(document.querySelectorAll('div.item-card')).map(el => ({
title: el.querySelector('h3')?.innerText?.trim() ?? '',
price: el.querySelector('.price')?.innerText?.trim() ?? '',
id: el.dataset.id ?? '',
url: el.querySelector('a')?.href ?? '',
}));
}
""")
results = raw_items
await browser.close()
return results
if __name__ == "__main__":
items = asyncio.run(crawl_infinite_scroll("https://example.com/feed"))
print(f"Collected {len(items)} items via scroll simulation")
print(json.dumps(items[:3], indent=2))
The Right Way: XHR API Reverse Engineering
The productive pattern for infinite scroll crawling is to identify the underlying data API endpoint that the browser’s scroll events trigger, then replicate those requests directly — no browser process required.
# infinite_scroll_api_reverse.py — direct API approach (preferred)
# Prerequisites: pip install "httpx[http2]"   (the h2 extra is required for http2=True below)
# Requires: manual DevTools Network inspection to identify the endpoint
import asyncio
import httpx
import json
# ---- HOW TO IDENTIFY THE ENDPOINT ----
# 1. Open DevTools → Network → Filter by XHR/Fetch
# 2. Load the infinite scroll page and scroll once
# 3. Find the request that loaded new items — note URL, method, headers, params
# 4. Right-click → Copy as cURL
# 5. Adapt to the httpx client below
API_ENDPOINT = "https://example.com/api/v2/items"
PAGE_SIZE = 24
# Headers extracted from real browser request — copy from DevTools
BROWSER_HEADERS = {
"Accept": "application/json, text/plain, */*",
"Accept-Language": "en-GB,en;q=0.9",
"Content-Type": "application/json",
"X-Requested-With": "XMLHttpRequest", # Many APIs require this
"Referer": "https://example.com/feed", # Critical for APIs that validate referer
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
),
}
async def crawl_infinite_scroll_via_api(
max_items: int = 10000,
concurrency: int = 5,
) -> list[dict]:
"""
Replicate infinite scroll XHR requests directly.
This approach is 100–1000x faster than browser scroll simulation.
The cursor-based pattern handles: offset params, cursor tokens, and page numbers.
CAVEATS:
- API endpoints may require session cookies from an initial browser visit
- Some APIs rotate the pagination token; you must extract it from each response
- Rate limits on the API endpoint are often more aggressive than the HTML tier
"""
all_items = []
semaphore = asyncio.Semaphore(concurrency)
async def fetch_page(client: httpx.AsyncClient, offset: int) -> dict | None:
async with semaphore:
try:
params = {
"offset": offset,
"limit": PAGE_SIZE,
"sort": "popular", # Keep sort consistent across pages
}
resp = await client.get(
API_ENDPOINT,
params=params,
headers=BROWSER_HEADERS,
timeout=15.0,
)
resp.raise_for_status()
return resp.json()
except (httpx.HTTPStatusError, httpx.TimeoutException, json.JSONDecodeError) as e:
print(f"[ERROR] offset={offset}: {e}")
return None
async with httpx.AsyncClient(
http2=True, # Many APIs serve HTTP/2 — use it for connection multiplexing
follow_redirects=True,
) as client:
# First request to determine total count
first_page = await fetch_page(client, 0)
if not first_page:
return []
# Adapt this key path to your target API response schema
total_count = first_page.get("total", first_page.get("count", max_items))
items_this_page = first_page.get("items", first_page.get("results", []))
all_items.extend(items_this_page)
# Calculate remaining offsets
remaining_offsets = list(range(PAGE_SIZE, min(total_count, max_items), PAGE_SIZE))
# Fetch remaining pages concurrently, respecting semaphore
tasks = [fetch_page(client, offset) for offset in remaining_offsets]
pages = await asyncio.gather(*tasks)
for page_data in pages:
if page_data:
all_items.extend(page_data.get("items", page_data.get("results", [])))
return all_items
if __name__ == "__main__":
items = asyncio.run(crawl_infinite_scroll_via_api(max_items=5000))
print(f"Collected {len(items)} items via direct API")
print(json.dumps(items[:2], indent=2))
DataFlirt Recommended Reading: Dynamic JavaScript sites require specific approach decisions. Our Best Approaches to Scraping Dynamic JavaScript Sites Without Getting Blocked covers the full decision tree from rendering requirements to anti-bot bypass.
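One caveat from the direct-API pattern above deserves its own treatment: JSON endpoints often rate-limit more aggressively than the HTML tier, so production fetchers should retry 429s with jittered exponential backoff rather than fixed delays. A minimal sketch, in which the `fetch` callable and `RetryableError` are illustrative stand-ins for your own fetch layer:

```python
import asyncio
import random

class RetryableError(Exception):
    """Raise from your fetch callable on 429 or transient 5xx responses."""

async def fetch_with_backoff(fetch, *args, max_retries: int = 4, base_delay: float = 0.5):
    """Retry an async fetch callable with jittered exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return await fetch(*args)
        except RetryableError:
            if attempt == max_retries:
                raise
            # Full jitter: spreads retries so concurrent workers do not re-sync
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            await asyncio.sleep(delay)
```

The jitter matters as much as the exponent: a fleet of workers retrying on identical schedules re-synchronises into the same burst that triggered the 429 in the first place.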
List Crawling Pattern 3: Table Data Extraction
HTML table extraction seems simple but has well-documented edge cases: headers spread across multiple rows, merged cells with colspan/rowspan, CSS-styled table analogues without <table> elements, and server-side or client-side paginated tables.
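Merged cells are the trickiest of these edge cases: a cell with colspan=3 occupies three logical columns, and zipping such a row against a flat header shifts every later column left. A minimal expansion sketch, assuming the `(text, colspan)` tuples were built by reading each cell's colspan attribute:

```python
def expand_colspans(rows: list[list[tuple[str, int]]]) -> list[list[str]]:
    """Expand (text, colspan) cell tuples into flat per-column lists so rows
    can be zipped with a flat header. Rowspan needs a second pass that carries
    values down into later rows; omitted here for brevity."""
    expanded = []
    for row in rows:
        flat: list[str] = []
        for text, colspan in row:
            # A colspan=3 cell occupies three logical columns; repeat its value
            flat.extend([text] * max(colspan, 1))
        expanded.append(flat)
    return expanded
```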
# table_list_crawler.py — production table extraction
# Prerequisites: pip install selectolax "httpx[http2]"
import asyncio
import httpx
from selectolax.parser import HTMLParser
from typing import Any
# selectolax is 10–30x faster than BeautifulSoup for high-volume parsing.
# It wraps the Modest/Lexbor HTML5 engines (C extensions); its API differs
# from BeautifulSoup, so it replaces it in role rather than as a drop-in.
HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-GB,en;q=0.9",
}
def extract_table(html: str, table_selector: str = "table") -> list[dict[str, Any]]:
"""
Extract structured data from an HTML table.
Handles:
- Multi-row headers (takes last header row as canonical)
- Missing <thead> (infers header from first row with <th> elements)
- Rows with fewer cells than header (fills with empty string)
- selectolax parser for high-throughput table extraction
CAVEATS:
- Does not handle colspan/rowspan merged cells (requires custom traversal)
- For CSS-grid "tables" without <table> tags, adapt the selector and structure
"""
parser = HTMLParser(html)
table = parser.css_first(table_selector)
if not table:
return []
# Detect header rows — prefer <thead> rows, fall back to first row with <th>
headers = []
thead = table.css_first("thead")
if thead:
# Multiple header rows: take the last one (usually the most specific)
header_rows = thead.css("tr")
if header_rows:
last_header_row = header_rows[-1]
headers = [
th.text(strip=True)
for th in last_header_row.css("th, td")
]
if not headers:
# No <thead>: look for first <tr> with <th> elements
for row in table.css("tr"):
cells = row.css("th")
if cells:
headers = [th.text(strip=True) for th in cells]
break
if not headers:
print("[WARN] Could not detect table headers — using column indices")
# Extract data rows from <tbody>, or all rows if no <tbody>
tbody = table.css_first("tbody")
rows_container = tbody if tbody else table
results = []
for row in rows_container.css("tr"):
cells = row.css("td")
if not cells:
continue # Skip header rows within tbody
cell_values = [cell.text(strip=True) for cell in cells]
if headers:
# Pad short rows with empty strings to avoid KeyError downstream
padded = cell_values + [""] * (len(headers) - len(cell_values))
row_data = dict(zip(headers, padded[:len(headers)]))
else:
row_data = {f"col_{i}": v for i, v in enumerate(cell_values)}
results.append(row_data)
return results
async def crawl_paginated_table(
base_url: str,
table_selector: str = "table",
next_page_selector: str = "a.pagination-next::attr(href)",
) -> list[dict]:
"""
Crawl a multi-page table with paginated list scraping.
Extracts structured data from each page and follows pagination.
"""
all_rows = []
current_url = base_url
async with httpx.AsyncClient(
headers=HEADERS,
follow_redirects=True,
http2=True,
) as client:
while current_url:
resp = await client.get(current_url, timeout=15.0)
resp.raise_for_status()
page_rows = extract_table(resp.text, table_selector)
all_rows.extend(page_rows)
print(f"[OK] {current_url} → {len(page_rows)} rows (total: {len(all_rows)})")
# Parse next page link using selectolax
parser = HTMLParser(resp.text)
next_link_node = parser.css_first(next_page_selector.replace("::attr(href)", ""))
            if next_link_node and next_link_node.attributes.get("href"):
                # urljoin handles absolute, root-relative, and page-relative
                # hrefs, and guards against the infinite loop an empty href
                # would cause with naive string concatenation
                from urllib.parse import urljoin
                current_url = urljoin(current_url, next_link_node.attributes["href"])
            else:
                current_url = None  # No next page: crawl complete
return all_rows
if __name__ == "__main__":
rows = asyncio.run(
crawl_paginated_table(
"https://example.com/data-table",
table_selector="table.data-table",
)
)
print(f"Total rows extracted: {len(rows)}")
if rows:
print("Sample:", rows[0])
List Crawling Pattern 4: LLM-Augmented Structured Data Extraction
The structural fragility of CSS selectors in list crawling is a long-term reliability problem. A site redesign that shifts a class name from .price to .price-display silently produces empty fields with no exception raised. At scale, across hundreds of domains, this breakage is continuous.
LLM-augmented structured data extraction resolves this by routing HTML through a language model that understands the semantic content rather than the DOM structure. The trade-off is latency and cost — but for pipelines running unmonitored across weeks, the reliability gain outweighs both.
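In production this trade-off is best managed with a router: extract with CSS selectors first, and escalate to the LLM only when the output looks like selector drift. A sketch, with illustrative field names and threshold:

```python
def needs_llm_fallback(
    items: list[dict],
    required_fields: tuple[str, ...] = ("name", "price"),
    max_empty_ratio: float = 0.2,
) -> bool:
    """Route a page to LLM extraction when CSS selectors look broken.
    Two drift signals: the item selector matched nothing at all, or it
    matched but required fields came back empty too often (e.g. the
    .price class was renamed in a redesign)."""
    if not items:
        return True
    empty = sum(1 for item in items for f in required_fields if not item.get(f))
    return empty / (len(items) * len(required_fields)) > max_empty_ratio
```

This keeps LLM spend proportional to breakage rather than to crawl volume.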
Gemini 3.1 for High-Volume List Extraction (Google GenAI SDK)
# llm_list_extraction_gemini.py
# Prerequisites: pip install google-genai "httpx[http2]" selectolax
# Set GOOGLE_API_KEY environment variable
import asyncio
import json
import os
import httpx
from google import genai
from google.genai import types
# --- CAVEAT ---
# Gemini 3.1 Flash is optimised for structured extraction with large HTML context.
# Always slice HTML before sending — models have context windows but token cost
# scales linearly. Sending 200KB of raw HTML per list page is wasteful.
# The pre-processing step below strips non-content HTML to reduce tokens ~60–80%.
client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
def strip_html_for_llm(html: str, max_chars: int = 40000) -> str:
"""
Pre-process HTML before LLM extraction.
Removes scripts, styles, SVG, and meta — preserving only content HTML.
Reduces token count by 60–80% on typical product list pages.
"""
from selectolax.parser import HTMLParser
parser = HTMLParser(html)
# Remove non-content nodes
for tag in parser.css("script, style, svg, link, meta, noscript, iframe"):
tag.decompose()
# Extract body text as cleaned HTML
body = parser.css_first("body")
if body:
return body.html[:max_chars] if body.html else ""
return html[:max_chars]
async def extract_list_items_gemini(
html: str,
extraction_schema: str,
model: str = "gemini-3.1-flash",
) -> list[dict]:
"""
Extract structured list items from HTML using Gemini 3.1 Flash.
Returns a list of dicts conforming to extraction_schema.
CAVEATS:
- Always validate JSON output — LLMs occasionally return partial JSON
- Set temperature=0.1 for structured extraction (lower = more deterministic)
- Cache results by URL hash to avoid re-extraction on re-crawls
- Gemini 3.1 Flash handles ~100k tokens context — sufficient for most list pages
"""
cleaned_html = strip_html_for_llm(html)
prompt = f"""Extract ALL items from this HTML page as a JSON array.
Each item should have these fields: {extraction_schema}
Rules:
- Return ONLY a valid JSON array, no explanation, no markdown fences
- If a field is missing, use an empty string ""
- Do not invent values — only extract what is explicitly in the HTML
- Include ALL items visible on the page, not just the first few
HTML:
{cleaned_html}"""
try:
        response = await client.aio.models.generate_content(
            model=model,
            contents=prompt,  # plain string; the SDK wraps it into a Content part
            config=types.GenerateContentConfig(
                response_mime_type="application/json",
                temperature=0.1,
                max_output_tokens=4096,
            ),
        )
raw = response.text.strip()
# Strip accidental markdown fences if model ignores mime_type instruction
if raw.startswith("```"):
raw = raw.split("```")[1]
if raw.startswith("json"):
raw = raw[4:]
parsed = json.loads(raw)
return parsed if isinstance(parsed, list) else [parsed]
except (json.JSONDecodeError, AttributeError) as e:
print(f"[ERROR] Gemini extraction failed: {e}")
return []
async def llm_paginated_list_crawl_gemini(
start_url: str,
extraction_schema: str = "name, price, url, sku, availability",
max_pages: int = 20,
) -> list[dict]:
"""
Full list crawling pipeline using Gemini 3.1 Flash for extraction.
Handles pagination automatically via next-link detection in the LLM response.
For very high volume, prefer the hybrid approach:
- CSS selectors for stable fields (price, SKU)
- LLM for unstable or schema-free fields (description, specs, tags)
"""
all_items = []
current_url = start_url
pages_crawled = 0
async with httpx.AsyncClient(
headers={
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
),
"Accept-Language": "en-GB,en;q=0.9",
},
follow_redirects=True,
http2=True,
) as http_client:
while current_url and pages_crawled < max_pages:
resp = await http_client.get(current_url, timeout=20.0)
resp.raise_for_status()
items = await extract_list_items_gemini(resp.text, extraction_schema)
all_items.extend(items)
pages_crawled += 1
print(f"[Page {pages_crawled}] {current_url} → {len(items)} items extracted")
# Detect next page via simple link parsing (not LLM — keep this cheap)
from selectolax.parser import HTMLParser
parser = HTMLParser(resp.text)
next_node = parser.css_first("a[rel='next'], a.pagination-next, li.next a")
            if next_node and next_node.attributes.get("href"):
                # Resolve relative hrefs against the page we just fetched
                from urllib.parse import urljoin
                current_url = urljoin(current_url, next_node.attributes["href"])
            else:
                current_url = None
return all_items
Claude Sonnet 4.6 for Precision Structured Data Extraction
# llm_list_extraction_claude.py — using Anthropic SDK
# Prerequisites: pip install anthropic httpx selectolax
# Set ANTHROPIC_API_KEY environment variable
import anthropic
import asyncio
import json
import os
import httpx
from selectolax.parser import HTMLParser

anthropic_client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def strip_html_for_llm(html: str, max_chars: int = 30000) -> str:
    parser = HTMLParser(html)
    for tag in parser.css("script, style, svg, link, meta, noscript, iframe"):
        tag.decompose()
    body = parser.css_first("body")
    return (body.html[:max_chars] if body and body.html else html[:max_chars])

def extract_list_items_claude(
    html: str,
    schema_description: str,
    model: str = "claude-sonnet-4-6",
) -> list[dict]:
    """
    Extract structured list items using Claude Sonnet 4.6.
    Claude Sonnet 4.6 is preferred for:
    - Complex nested schemas (e.g., product variants, nested specs)
    - Multi-locale HTML where value types need semantic disambiguation
    - Pipelines where schema precision matters more than throughput cost
    Use claude-opus-4-6 for the highest-precision extraction on complex pages.
    Use claude-sonnet-4-6 (default here) for the best cost/precision balance.
    CAVEAT: Claude does not have a native response_mime_type JSON mode in all SDK
    versions. Use explicit JSON-only instructions and validate the output.
    """
    cleaned_html = strip_html_for_llm(html)
    message = anthropic_client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": f"""You are a structured data extraction engine.
Extract ALL list items from the HTML below as a JSON array.
Required schema per item: {schema_description}
Output rules:
- Return ONLY a valid JSON array. No explanation. No markdown. No backticks.
- Missing fields should be empty strings, not null
- Extract every item visible on the page — do not truncate
- Preserve original formatting for price values (include currency symbols)
HTML:
{cleaned_html}""",
            }
        ],
    )
    raw = message.content[0].text.strip()
    # Defensive cleaning — strip any accidental markdown fences
    if raw.startswith("```"):
        parts = raw.split("```")
        raw = parts[1].lstrip("json").strip() if len(parts) > 1 else raw
    try:
        parsed = json.loads(raw)
        return parsed if isinstance(parsed, list) else [parsed]
    except json.JSONDecodeError as e:
        print(f"[ERROR] Claude extraction returned invalid JSON: {e}")
        print(f"[DEBUG] Raw output (first 300 chars): {raw[:300]}")
        return []
async def hybrid_list_crawl(
    start_url: str,
    css_selectors: dict,
    llm_fallback_fields: list[str],
    max_pages: int = 10,
) -> list[dict]:
    """
    Hybrid extraction: CSS selectors for stable fields, Claude for volatile ones.
    This is the recommended production pattern for list crawling:
    - CSS selectors handle price, SKU, URL — fast and zero-cost
    - Claude handles spec tables, feature lists, schema-free attributes — reliable across redesigns
    - Results are merged per item
    css_selectors format: {"field_name": "css_selector::pseudo_element"}
    llm_fallback_fields: list of field names to route through Claude
    """
    from urllib.parse import urljoin
    all_items = []
    current_url = start_url
    async with httpx.AsyncClient(
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"},
        follow_redirects=True,
    ) as client:
        pages = 0
        while current_url and pages < max_pages:
            resp = await client.get(current_url, timeout=20.0)
            resp.raise_for_status()
            parser = HTMLParser(resp.text)
            # Extract stable fields via CSS selectors (fast, zero-cost)
            items_nodes = parser.css("div.product-card, li.listing-item")
            css_extracted = []
            for node in items_nodes:
                item_data = {}
                for field, selector in css_selectors.items():
                    # Strip any pseudo-element (::text, ::attr(...)) before
                    # handing the selector to selectolax
                    clean_sel = selector.split("::")[0]
                    target_node = node.css_first(clean_sel)
                    if target_node:
                        if "::attr(" in selector:
                            attr = selector.split("::attr(")[1].rstrip(")")
                            item_data[field] = target_node.attributes.get(attr, "")
                        else:
                            item_data[field] = target_node.text(strip=True)
                    else:
                        item_data[field] = ""
                css_extracted.append(item_data)
            # Route LLM fallback fields through Claude for the full page HTML.
            # extract_list_items_claude is synchronous, so run it in a worker
            # thread to avoid blocking the event loop.
            if llm_fallback_fields:
                llm_schema = ", ".join(llm_fallback_fields)
                llm_extracted = await asyncio.to_thread(
                    extract_list_items_claude, resp.text, llm_schema
                )
                # Merge: CSS-extracted items + LLM-extracted items by position
                for i, css_item in enumerate(css_extracted):
                    if i < len(llm_extracted):
                        css_item.update({
                            k: v for k, v in llm_extracted[i].items()
                            if k in llm_fallback_fields
                        })
            all_items.extend(css_extracted)
            pages += 1
            # Pagination — urljoin handles relative hrefs safely
            next_node = parser.css_first("a[rel='next'], a.next-page")
            if next_node and next_node.attributes.get("href"):
                current_url = urljoin(current_url, next_node.attributes["href"])
            else:
                current_url = None
    return all_items
DataFlirt Recommended Reading: For a comprehensive comparison of all LLM-powered scraping tools and pipeline integration patterns, see our Best Scraping Tools Powered by LLMs in 2026.
Distributed List Crawling Architecture for Production Scale
Single-process list crawling hits a practical ceiling at approximately 300–600 requests/second for HTTP-only crawling and 10–30 pages/minute for headless browser-based infinite scroll crawling. For crawls exceeding millions of pages, the architecture must scale horizontally.
Scrapy + scrapy-redis: The Distributed List Crawling Standard
# settings.py — distributed list crawling with scrapy-redis
# Prerequisites: pip install scrapy scrapy-redis
# Requires: Redis instance accessible at REDIS_URL
# --- Distributed Queue Settings ---
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = "redis://redis-service:6379" # Kubernetes service or managed Redis endpoint
SCHEDULER_PERSIST = True # Don't flush the queue on restart
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
# --- Throughput Settings ---
CONCURRENT_REQUESTS = 64
DOWNLOAD_DELAY = 0.3
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 32
# --- Pipeline Settings ---
ITEM_PIPELINES = {
"scrapy_redis.pipelines.RedisPipeline": 100, # Optional: buffer to Redis
"myproject.pipelines.PostgresPipeline": 200, # Persist to PostgreSQL
"myproject.pipelines.DeduplicationPipeline": 50, # Hash-based dedup
}
# --- Distributed crawling requires explicit START_URLS management ---
# Push starting URLs to Redis directly rather than using start_urls
# redis-cli -h redis-service LPUSH myspider:start_urls "https://example.com/products?page=1"
# kubernetes/list-crawler-cronjob.yaml
# Horizontal scaling for distributed list crawling
apiVersion: batch/v1
kind: CronJob
metadata:
  name: catalog-list-crawler
spec:
  schedule: "0 */4 * * *"  # Every 4 hours — tune to data freshness requirements
  concurrencyPolicy: Forbid  # Prevent overlapping runs for the same catalog
  jobTemplate:
    spec:
      parallelism: 5  # 5 concurrent Scrapy workers sharing the Redis queue
      completions: 5
      template:
        spec:
          containers:
            - name: scrapy-worker
              image: your-registry/list-crawler:latest
              env:
                - name: REDIS_URL
                  valueFrom:
                    secretKeyRef:
                      name: redis-credentials
                      key: url
                - name: ANTHROPIC_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: llm-api-keys
                      key: anthropic
                - name: GOOGLE_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: llm-api-keys
                      key: google
              command:
                - "scrapy"
                - "crawl"
                - "catalog_list"
                - "-s"
                - "REDIS_URL=$(REDIS_URL)"
              resources:
                requests:
                  memory: "512Mi"
                  cpu: "500m"
                limits:
                  memory: "1Gi"
                  cpu: "1000m"
          restartPolicy: OnFailure
PostgreSQL Output Pipeline with Upsert Semantics
# pipelines.py — production output pipeline for list crawling data
import psycopg2
from psycopg2.extras import execute_values
import hashlib
import json

class PostgresListCrawlPipeline:
    """
    High-throughput PostgreSQL pipeline for list crawling output.
    Uses execute_values for batch inserts (10–50x faster than single row inserts).
    ON CONFLICT DO UPDATE handles re-crawls without duplicates.
    Prerequisites: pip install psycopg2-binary
    """

    BATCH_SIZE = 500  # Flush to DB every 500 items

    def open_spider(self, spider):
        self.conn = psycopg2.connect(
            host=spider.settings.get("PG_HOST", "localhost"),
            dbname=spider.settings.get("PG_DB", "scrapedb"),
            user=spider.settings.get("PG_USER", "postgres"),
            password=spider.settings.get("PG_PASSWORD", ""),
        )
        self.cursor = self.conn.cursor()
        self.batch = []
        self._create_table()

    def _create_table(self):
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS crawled_list_items (
                url_hash CHAR(64) PRIMARY KEY,
                name TEXT,
                price TEXT,
                sku TEXT,
                url TEXT,
                raw_data JSONB,
                first_seen TIMESTAMPTZ DEFAULT NOW(),
                last_seen TIMESTAMPTZ DEFAULT NOW(),
                spider_name TEXT
            );
        """)
        self.conn.commit()

    def process_item(self, item, spider):
        url_hash = hashlib.sha256(item.get("url", "").encode()).hexdigest()
        self.batch.append((
            url_hash,
            item.get("name", ""),
            item.get("price", ""),
            item.get("sku", ""),
            item.get("url", ""),
            json.dumps(dict(item)),
            spider.name,
        ))
        if len(self.batch) >= self.BATCH_SIZE:
            self._flush()
        return item

    def _flush(self):
        if not self.batch:
            return
        execute_values(
            self.cursor,
            """
            INSERT INTO crawled_list_items (url_hash, name, price, sku, url, raw_data, spider_name)
            VALUES %s
            ON CONFLICT (url_hash) DO UPDATE SET
                price = EXCLUDED.price,
                raw_data = EXCLUDED.raw_data,
                last_seen = NOW()
            """,
            self.batch,
        )
        self.conn.commit()
        self.batch.clear()

    def close_spider(self, spider):
        self._flush()
        self.cursor.close()
        self.conn.close()
Anti-Bot Considerations Specific to List Crawling
List crawling generates a traffic signature that anti-bot systems recognise immediately: consistent inter-request intervals, sequential URL patterns, no CSS/image/font resource loading, and a crawl depth that goes no deeper than list pages. This is a bot profile, and most production anti-bot vendors score it aggressively.
The DataFlirt engineering team has identified five list crawling-specific bot signals to mitigate:
1. Regular inter-request cadence. Human browsing within a category list shows variable dwell time per page — 5 to 45 seconds. A crawler hitting pages every 750ms is a textbook bot pattern. Implement Gaussian-distributed delays: time.sleep(random.gauss(mu=2.5, sigma=0.8)) rather than fixed delays.
2. No subresource loading. Real browsers load CSS, images, and fonts. Scrapy by default loads only the HTML document. For targets with browser fingerprinting at the network layer, using scrapy-playwright to render pages as a real browser eliminates this signal.
3. Sequential URL patterns in the access log. Crawling pages 1, 2, 3, 4 in order is an obvious bot pattern. Randomize crawl order by shuffling the URL frontier after discovery.
4. No session persistence. Real users maintain session cookies across pages. Configure Scrapy’s COOKIES_ENABLED = True and allow cookies to persist within a crawl session.
5. Datacenter IP ranges. The most important factor. Residential proxy rotation aligned to the target site’s primary audience geography is the single most effective anti-block measure for list crawling operations.
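Signal #1 above deserves a concrete sketch. A minimal cadence helper, assuming nothing beyond the standard library: Gaussian sampling can return values near zero or even negative, so the clamp floor matters in practice, otherwise an occasional near-zero delay produces exactly the burst pattern you are trying to avoid.

```python
import random

def human_like_delay(mu: float = 2.5, sigma: float = 0.8, floor: float = 0.5) -> float:
    """Sample a Gaussian-distributed inter-request delay in seconds.

    The clamp prevents the occasional near-zero or negative sample from
    turning into an accidental burst of back-to-back requests.
    """
    return max(floor, random.gauss(mu, sigma))

# Usage inside a crawl loop:
#   time.sleep(human_like_delay())
# For signal #3, also shuffle the discovered URL frontier before crawling:
#   random.shuffle(frontier)
```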
DataFlirt Recommended Reading: If your list crawling targets are protected by Cloudflare or similar enterprise-grade bot protection, our Top 5 Cloudflare Bypass Methods and the Tools Behind Them covers the full evasion stack.
Node.js List Crawling with Crawlee: The Full-Stack Alternative
For engineering teams whose data stack is JavaScript-native, Crawlee provides a unified list crawling framework combining an HTTP crawler (Cheerio-backed) and a browser crawler (Playwright-backed) under a single API.
// crawlee_list_crawler.js
// Prerequisites: node >= 18
// npm install crawlee playwright
// npx playwright install chromium
import { CheerioCrawler, PlaywrightCrawler, Dataset } from 'crawlee';

// --- HTTP-tier crawler for paginated list scraping (fast, low-resource) ---
const httpListCrawler = new CheerioCrawler({
  maxConcurrency: 20,
  requestHandlerTimeoutSecs: 30,
  async requestHandler({ request, $, enqueueLinks, log }) {
    log.info(`[HTTP] Crawling list page: ${request.url}`);
    // Extract list items using Cheerio (jQuery-like selector API)
    const items = [];
    $('div.product-card, li.listing-item').each((_, el) => {
      const item = {
        name: $(el).find('h2, h3').first().text().trim(),
        price: $(el).find('.price, [data-price]').first().text().trim(),
        url: new URL($(el).find('a').first().attr('href') || '', request.url).href,
        sku: $(el).attr('data-sku') || $(el).attr('data-id') || '',
      };
      // Only push items with at least a name
      if (item.name) items.push(item);
    });
    await Dataset.pushData(items);
    log.info(`  → Extracted ${items.length} items`);
    // Follow all pagination links automatically
    await enqueueLinks({
      selector: 'a[rel="next"], a.pagination-next, a.next-page',
      label: 'LIST_PAGE',
    });
  },
  failedRequestHandler({ request, log }) {
    log.error(`HTTP crawler failed: ${request.url}`);
  },
});

// --- Browser-tier crawler for infinite scroll crawling (JavaScript-rendered lists) ---
const browserListCrawler = new PlaywrightCrawler({
  maxConcurrency: 5, // Lower concurrency — browser instances are expensive
  requestHandlerTimeoutSecs: 90,
  launchContext: {
    launchOptions: {
      headless: true,
      args: [
        '--disable-blink-features=AutomationControlled',
        '--no-sandbox',
        '--disable-dev-shm-usage',
      ],
    },
  },
  async requestHandler({ request, page, log }) {
    log.info(`[Browser] Crawling infinite scroll page: ${request.url}`);
    // Block images and fonts to reduce bandwidth during infinite scroll crawling.
    // Crawlee has already navigated once; re-navigating below applies the route.
    await page.route('**/*.{png,jpg,jpeg,gif,svg,webp,woff,woff2}', route => route.abort());
    await page.goto(request.url, { waitUntil: 'domcontentloaded' });
    // Infinite scroll simulation
    let prevHeight = -1;
    let scrollAttempts = 0;
    const MAX_SCROLL = 30;
    while (scrollAttempts < MAX_SCROLL) {
      await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
      // Wait for network to settle after scroll triggers XHR
      await page.waitForTimeout(1200 + Math.random() * 800);
      const newHeight = await page.evaluate(() => document.body.scrollHeight);
      if (newHeight === prevHeight) break;
      prevHeight = newHeight;
      scrollAttempts++;
    }
    // Collect all loaded items after scroll completion
    const items = await page.$$eval(
      'div.item-card, div.product-card, li.listing-item',
      (nodes) => nodes.map(el => ({
        name: el.querySelector('h2, h3')?.innerText?.trim() ?? '',
        price: el.querySelector('.price, [data-price]')?.innerText?.trim() ?? '',
        url: el.querySelector('a')?.href ?? '',
        id: el.dataset.id ?? el.dataset.sku ?? '',
      }))
    );
    await Dataset.pushData(items);
    log.info(`  → Extracted ${items.length} items after scroll`);
  },
});

// Run one of the two crawlers depending on target type
async function runListCrawl(urls, useBrowser = false) {
  if (useBrowser) {
    await browserListCrawler.run(urls);
  } else {
    await httpListCrawler.run(urls);
  }
  // Export the default dataset as JSON to the key-value store
  const dataset = await Dataset.open();
  await dataset.exportToJSON('list_crawl_output');
  console.log('List crawling complete. Output key: list_crawl_output');
}

// Entry point — pass true as the second argument for infinite scroll / JS-rendered lists
await runListCrawl(['https://example.com/products'], false);
Monitoring Your List Crawl Pipeline
A list crawling pipeline degrades in four predictable ways: selector drift, pagination cap hit (silent coverage drop), block rate increase, and proxy pool exhaustion. All four are undetectable without instrumentation.
Minimum viable monitoring for production list crawling:
# monitoring.py — Prometheus metrics for list crawling pipelines
from prometheus_client import Counter, Gauge, Histogram

LIST_PAGES_CRAWLED = Counter(
    "list_crawl_pages_total",
    "Total list pages crawled",
    ["spider_name", "domain", "pagination_type"],
)

LIST_ITEMS_EXTRACTED = Counter(
    "list_crawl_items_total",
    "Total list items extracted",
    ["spider_name", "domain"],
)

EMPTY_PAGE_RATE = Gauge(
    "list_crawl_empty_page_rate",
    "Rate of pages returning 0 items (selector drift or block indicator)",
    ["spider_name", "domain"],
)

PAGINATION_DEPTH_REACHED = Histogram(
    "list_crawl_pagination_depth",
    "Pagination depth reached per crawl run",
    ["spider_name", "domain"],
    buckets=[1, 5, 10, 20, 50, 100, 200, 500],
)

# Alert rule (pseudo-Grafana):
# Alert when empty_page_rate > 0.05 for 15 minutes
# This indicates selector drift OR a block — both require immediate intervention
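The pseudo-rule above can be expressed as a native Prometheus alerting rule. A sketch, assuming the metric and label names defined in monitoring.py and a standard Prometheus rule file loaded via `rule_files`:

```yaml
# alerts/list_crawl.rules.yml
groups:
  - name: list-crawl-alerts
    rules:
      - alert: ListCrawlSelectorDriftOrBlock
        expr: list_crawl_empty_page_rate > 0.05
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Empty-page rate above 5% for {{ $labels.spider_name }} on {{ $labels.domain }}"
          description: "Likely selector drift or an active block; inspect recent HTML snapshots for this spider."
```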
DataFlirt Recommended Reading: For a full production monitoring stack covering CAPTCHA rates, proxy pool health, and throughput metrics, see Best Monitoring and Alerting Tools for Production Scraping Pipelines.
Final Verdict: Choosing Your List Crawling Stack in 2026
| Use Case | Recommended Stack | Why |
|---|---|---|
| Large-scale static paginated list scraping | Scrapy + scrapy-redis | Unmatched throughput, distributed queue, retry middleware |
| Paginated catalog with cap evasion required | Scrapy + facet decomposition | Filter-based URL frontier bypasses page caps |
| Infinite scroll (JS-heavy) | XHR reverse engineering + httpx | 100–1000x faster than browser simulation |
| Infinite scroll (cannot reverse-engineer) | Playwright async | Most capable browser automation for scroll simulation |
| HTML table extraction | selectolax + httpx | C-extension parser 10–30x faster than BeautifulSoup |
| Node.js teams, mixed HTTP+browser | Crawlee | Unified framework, dataset API, TypeScript-native |
| Schema-free / LLM-augmented extraction | Scrapy + Gemini 3.1 Flash | Cost-efficient volume extraction with JSON output |
| Precision schema extraction | Scrapy + Claude Sonnet 4.6 / Opus 4.6 | Highest JSON fidelity for complex nested schemas |
| Production scale, >1M pages/day | Scrapy + scrapy-redis + Kubernetes | Horizontal pod autoscaling, crash-resilient queue |
The DataFlirt engineering team’s recommended production pattern: a Scrapy HTTP tier with scrapy-redis for distributed paginated list scraping, a Playwright worker pool behind a Redis message queue for infinite scroll and JavaScript-rendered detail pages, and a Gemini 3.1 Flash or Claude Sonnet 4.6 extraction layer in the item pipeline for schema-resilient structured data extraction. Deploy with Kubernetes CronJobs, monitor with Prometheus + Grafana, and back the URL frontier with Redis sorted sets for priority-weighted freshness management.
Recommended Reading from DataFlirt
Engineering teams building production list crawling infrastructure will find these DataFlirt guides directly relevant to the layers discussed above:
- Best Free Web Scraping Tools in 2026 for Developers — Full comparative analysis of every open-source scraping framework including Scrapy, Playwright, Crawlee, and Camoufox
- Best Approaches to Scraping Dynamic JavaScript Sites Without Getting Blocked — Deep dive on Playwright and infinite scroll crawling patterns for JavaScript-heavy sites
- How to Bypass Google CAPTCHA — Web Scraping Guide — Full evasion stack for SERP list crawling including TLS fingerprint spoofing, stealth browser configuration, and audio CAPTCHA fallback
- 5 Best IP Rotation Strategies for High-Volume Scraping Projects — Essential companion for preventing IP bans during long paginated list crawl runs
- Best Scraping Tools Powered by LLMs in 2026 — Full LLM extraction pipeline comparison for structured data extraction
- Top 5 Cloudflare Bypass Methods and the Tools Behind Them — For list crawling targets protected by enterprise-grade anti-bot systems
- Best Databases for Storing Scraped Data at Scale — Pipeline integration for the output side of your list crawling stack
- Top 10 Open-Source Web Scraping Tools Worth Using in 2026 — Expanded open-source crawler landscape beyond the tools covered in this guide
- Best Proxy Management Tools to Rotate and Manage Proxies at Scale — Proxy rotation architecture for distributed list crawling deployments
- Top 7 Scraping Infrastructure Patterns Used by High-Volume Data Teams — Enterprise-grade pipeline patterns that extend directly from the distributed list crawling architecture described in this guide
- Web Scraping GDPR — Compliance considerations for EU-targeted list crawling operations, particularly relevant for e-commerce and business directory scraping
Frequently Asked Questions
What is list crawling and how is it different from general web scraping?
List crawling is the systematic extraction of structured data from pages that render information in repeated, list-like formats — product catalogs, job boards, directories, and search result pages. General web scraping targets arbitrary page content and traverses diverse link graphs. List crawling specifically requires pagination state management, a URL frontier scoped to list boundaries, and a parser built around repeating DOM structures. The distinction matters architecturally: a general crawler that discovers list pages by link traversal will typically miss deep pagination coverage, while a purpose-built list crawler is designed from the ground up for completeness across all pages of a given list.
What Python tools are best for paginated list scraping in 2026?
For high-volume paginated list scraping, Scrapy with scrapy-redis is the gold standard — its middleware ecosystem, AutoThrottle, and distributed queue support are unmatched at the maturity level required for production. For JavaScript-rendered paginated catalogs, Playwright with asyncio provides the most capable browser automation. For lightweight scraping of static paginated pages, httpx combined with selectolax (a C-extension HTML parser roughly 10–30x faster than BeautifulSoup for high-volume parsing) is the most efficient combination. Always benchmark against your actual target pagination schema before committing to an architecture — the right choice depends heavily on whether the target site requires JavaScript rendering.
How do I handle infinite scroll crawling without a headless browser?
Reverse engineering the site’s underlying XHR or Fetch API endpoints is the most efficient approach. Most infinite scroll implementations load data from a paginated JSON API — the browser scroll event simply triggers new fetch requests to a backend endpoint. Open browser DevTools, filter the Network panel to XHR/Fetch requests, scroll the page once, and identify the data endpoint, its parameters, and required headers. Replicate those requests directly with httpx or curl_cffi. This eliminates the 150–400MB memory overhead per browser instance and increases throughput by 100–1000x compared to scroll simulation.
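The replication step can be sketched as follows. Every specific here, the endpoint path, the `page`/`per_page` parameter names, and the header set, is a hypothetical placeholder: read the real values off the Network panel for your target.

```python
from urllib.parse import urlencode

# Hypothetical JSON endpoint discovered via DevTools — replace with the
# actual path and parameters your target's infinite scroll requests.
API_BASE = "https://example.com/api/v2/listings"

def build_api_request(page: int, per_page: int = 48) -> tuple[str, dict]:
    """Construct the URL and headers for one page of the discovered API."""
    params = {"page": page, "per_page": per_page, "sort": "newest"}
    headers = {
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",  # some endpoints check for this
    }
    return f"{API_BASE}?{urlencode(params)}", headers

async def fetch_all_pages(max_pages: int = 50) -> list[dict]:
    import httpx  # deferred so the request builder stays dependency-free
    items: list[dict] = []
    async with httpx.AsyncClient() as client:
        for page in range(1, max_pages + 1):
            url, headers = build_api_request(page)
            resp = await client.get(url, headers=headers, timeout=15.0)
            resp.raise_for_status()
            batch = resp.json().get("results", [])
            if not batch:  # an empty page means we ran past the end
                break
            items.extend(batch)
    return items
```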
How do I bypass pagination limits when crawling product catalogs?
Most e-commerce platforms cap visible paginated results at 20–100 pages regardless of actual catalog depth. The correct approach is to decompose the catalog using categorical filters so that each combination surfaces fewer results than the pagination cap. Use price range bands, subcategory facets, brand filters, or date ranges to create multiple overlapping URL sets that collectively cover the full catalog. Combine filter-based decomposition with sitemap parsing to discover product-level URLs that bypass the listing layer entirely. scrapy-redis distributed queue management with multiple Kubernetes worker pods is the recommended scaling pattern for catalogs exceeding 1 million items.
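A minimal sketch of the price-band decomposition described above; the `min_price`/`max_price` query parameter names are illustrative and should be read off the target site's own filter links:

```python
def price_band_urls(base_url: str, max_price: int, band_width: int) -> list[str]:
    """Decompose a capped catalog listing into non-overlapping price bands.

    Each band should surface fewer results than the pagination cap; if a
    band still hits the cap, split it again with a narrower band_width.
    """
    urls = []
    for lo in range(0, max_price, band_width):
        hi = lo + band_width
        urls.append(f"{base_url}?min_price={lo}&max_price={hi}")
    return urls

# Example: four bands covering a £0–£1000 catalog
# price_band_urls("https://example.com/products", 1000, 250)
```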
Can LLMs replace CSS selectors for structured data extraction from lists?
Yes, and for pipelines requiring long-term reliability this is increasingly the production recommendation. CSS selectors break silently when a site redesigns — no exception is raised, the field simply becomes empty. An LLM extraction layer using Gemini 3.1 Flash or Claude Sonnet 4.6 degrades gracefully rather than silently failing. The trade-off is latency (2–8 seconds per LLM call) and token cost. The optimal production pattern is a two-tier extraction pipeline: CSS selectors for high-confidence, stable attributes (price, SKU, URL), and LLM extraction for ambiguous or frequently changing fields (spec tables, feature descriptions, schema-free attributes). Cache LLM extraction results by URL hash to avoid redundant API calls on re-crawls.
What is the best architecture for list crawling at scale across millions of pages?
The DataFlirt-recommended pattern is a two-tier architecture: a Scrapy HTTP tier backed by scrapy-redis for catalog-level list crawling (URL discovery, pagination traversal, item-level URL collection), and a Playwright browser worker pool behind a Redis or SQS message queue for JavaScript-rendered detail pages. Deploy both tiers as Kubernetes CronJobs with horizontal pod autoscaling. Use Prometheus and Grafana to monitor per-worker throughput, error rates, empty-page rates (a selector drift and block indicator), and block rates. The URL frontier should be a Redis sorted set with crawl priority scores to ensure freshness-sensitive URLs are re-crawled before stale ones.