Who This Guide Is For — and Why XPath Scraping Still Matters in 2026
You already know how to scrape. You have written CSS selectors, parsed responses with BeautifulSoup, and run a Scrapy spider or two against cooperative HTML. The problem is that cooperative HTML stopped being the default a long time ago.
This guide is for data engineers and senior Python developers who are regularly confronted with DOM structures that break CSS selectors: tables whose row cells shift position between pages, pricing blocks wrapped in deeply nested conditional markup, product images hidden behind lazy-load scaffolding, video players that inject blob URIs via Media Source Extensions, and API responses in namespaced XML that need structured extraction without a schema. These are the real-world targets that expose the ceiling of CSS-first scraping approaches.
XPath scraping has a ceiling too — but it is considerably higher. XML Path Language 1.0, specified by the W3C in 1999 and still fully supported by every major HTML parser, gives you a complete navigational model over any document tree. You can traverse upward (ancestor axes), sideways (sibling axes), and across text content (string functions as predicate filters). You can express conditions that combine structural position with content matching. None of these capabilities exist in CSS selectors.
The global web scraping market is projected to exceed USD 2.8 billion by 2030, growing at a CAGR above 18%, with an increasing share of that pipeline complexity driven by SPAs, anti-bot obfuscation layers, and nested iframe architectures. Engineers who have mastered advanced XPath expressions navigate these environments efficiently; those who have not spend hours debugging selectors that break every time the target site runs an A/B test.
This is the guide for the former group.
XPath vs CSS Selectors: The Engineering Decision
The choice between XPath and CSS is not a stylistic preference. It is a structural one dictated by your target DOM’s characteristics.
CSS selectors are faster to write, easier to read, and slightly faster to evaluate in benchmarks. They are the right default when: the target structure is stable, you need descendant-only traversal, and your predicates are purely class- or attribute-based. Most public HTML pages fall into this category. Use CSS selectors there.
XPath scraping becomes the correct engineering choice when any of the following conditions apply:
Upward traversal is required. CSS has no equivalent to ancestor:: or parent::. If you need to find a label element, then extract a sibling value cell from the same parent row, XPath is your only option in the standard parser layer.
Text content predicates are needed. //div[contains(text(), 'Price')] has no CSS equivalent. Matching against partial text, normalised whitespace, or string-start patterns is fundamental to XPath expressions and absent from CSS.
Namespace-qualified documents. Any XML response from a REST or SOAP API, any Atom feed, any SVG embedded in HTML — all of these carry XML namespaces that CSS selectors do not handle.
Positional logic relative to document structure. position(), last(), and count() let you express conditions like “the second cell of the last row” or “all but the header row” as single XPath expressions.
Conditional multi-axis extraction. Real-world DOM traversal often requires combining axes: “find the <dt> whose text is ‘SKU’, then retrieve the text of its following <dd> sibling.” This is trivially expressed in XPath scraping and structurally impossible in CSS.
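The dt/dd case from the last bullet can be sketched in a few lines of lxml (the markup here is illustrative):

```python
import lxml.html as html

# Illustrative markup: a definition list of product attributes
doc = html.fromstring("""
<dl class="product-attrs">
  <dt>Brand</dt><dd>Acme</dd>
  <dt>SKU</dt><dd>AC-2041</dd>
  <dt>Weight</dt><dd>1.2 kg</dd>
</dl>
""")

# Anchor on the <dt> by its text, then step to the first following <dd>
sku = doc.xpath(
    "//dt[normalize-space(text())='SKU']/following-sibling::dd[1]/text()"
)
print(sku)  # ['AC-2041']
```

No CSS selector can express the text-matching anchor, and none can step sideways from it.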
Setting Up Your Python XPath Environment
Before writing a single XPath expression, establish an isolated environment. Dependency conflicts between lxml versions are a common, silent source of parser inconsistency across scraping pipelines.
# Python 3.11+ recommended for performance improvements in lxml 5.x
python -m venv .xpath-env
source .xpath-env/bin/activate # Windows: .xpath-env\Scripts\activate
# Core dependencies
pip install lxml parsel scrapy playwright requests httpx
# System dependency for lxml on Debian/Ubuntu (required before pip install)
# sudo apt-get install libxml2-dev libxslt1-dev python3-dev
# Install Playwright browser binaries for JS-rendered DOM work
playwright install chromium firefox
The critical dependency to get right is lxml. Version 5.x ships with improved memory management for large HTML documents and tighter XPath 1.0 compliance for edge-case namespace handling. Verify your installation:
import lxml.etree as etree
import lxml.html as html
from lxml import __version__ as lxml_version
print(f"lxml version: {lxml_version}")
# Expect: 5.1.0 or higher for 2026 production use
The XPath Data Model: Axes, Node Tests, and Predicates
Every XPath expression decomposes into three parts: an axis specifier that defines direction of traversal, a node test that filters by node type or name, and zero or more predicates that add conditional filtering. Understanding this decomposition is the foundation of all advanced XPath scraping work.
ancestor::  div  [@class='product-container']
^^^^^^^^^^  ^^^  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
   axis     node           predicate
            test
The axes available in XPath 1.0 and their directional semantics:
| Axis | Direction | Common Use in XPath Scraping |
|---|---|---|
| child:: | Immediate children | Default axis; child::div equals div |
| descendant:: | All descendants | Deep search regardless of nesting depth |
| descendant-or-self:: | Self + all descendants | The // abbreviation expands to this |
| parent:: | Immediate parent | One level up |
| ancestor:: | All ancestors up to root | Find containing context from inner element |
| ancestor-or-self:: | Self + all ancestors | Useful for “within which section am I?” |
| following-sibling:: | Siblings after current node | Extract value cell after a label cell |
| preceding-sibling:: | Siblings before current node | Context-building from a known anchor |
| following:: | All nodes after current | Cross-parent forward search |
| preceding:: | All nodes before current | Rarely needed; expensive |
| self:: | Current node | Validation predicates |
| attribute:: | Attributes of current node | attribute::href = @href |
The abbreviated syntax most engineers use (//, @, .) maps directly to these axes. Understanding the full form is essential for complex DOM traversal because it lets you compose multi-axis XPath expressions that the abbreviations cannot express.
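The expansion can be verified directly in lxml — a minimal sanity check on a throwaway document:

```python
import lxml.html as html

doc = html.fromstring("<div><section><p>a</p></section><p>b</p></div>")

# '//' is shorthand for /descendant-or-self::node()/ — both forms
# select the same node-set in the same document order
short = doc.xpath("//p")
full = doc.xpath("descendant-or-self::node()/child::p")
assert short == full and len(short) == 2

# Likewise, '@href' abbreviates 'attribute::href' and '.' abbreviates
# 'self::node()' — the long forms are what you compose multi-axis paths from
```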
Ancestor and Parent Axes: Traversing Up the DOM
The single most common scenario where CSS selectors fail and XPath scraping succeeds is the “label-value pair” pattern: you can reliably find the label, but the value you need is a sibling or cousin of that label’s parent. No CSS selector supported by the standard Python parser layer resolves this. XPath expressions do.
Consider a product detail page with this markup:
<div class="spec-table">
<div class="spec-row obfuscated-cx12">
<span class="label">Processor</span>
<span class="value">Apple M4 Pro</span>
</div>
<div class="spec-row obfuscated-cx13">
<span class="label">RAM</span>
<span class="value">24GB</span>
</div>
<div class="spec-row obfuscated-cx14">
<span class="label">Storage</span>
<span class="value">512GB SSD</span>
</div>
</div>
The obfuscated class suffixes (cx12, cx13) are generated dynamically and change with every deployment. A CSS selector targeting .obfuscated-cx12 .value breaks on the next build. The ancestor-anchored XPath expression is resilient:
import lxml.html as html
def extract_spec(tree, label_text: str) -> str:
"""
Finds a spec value by its label text using ancestor-anchored XPath.
Resilient to obfuscated class names on containing rows.
"""
# Find the label span by text content
# Traverse to its parent (the row), then find the sibling value span
result = tree.xpath(
"//span[@class='label'][normalize-space(text())=$label]"
"/parent::div"
"/span[@class='value']/text()",
        label=label_text  # Parameterised XPath — prevents injection; expression stays constant
)
return result[0].strip() if result else ""
doc = html.fromstring("""<div class="spec-table">...(above HTML)...</div>""")
processor = extract_spec(doc, "Processor")
# Returns: "Apple M4 Pro"
The parameterised XPath pattern (label=label_text passed as a keyword argument to .xpath()) is a production best practice that lxml supports natively through XPath variables. It prevents XPath injection through malformed input strings, and because the expression string itself never changes between calls, it can be precompiled once with lxml.etree.XPath and reused — a measurable performance gain on high-frequency extractions.
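The precompiled form looks like this — compile the expression once at module level, then bind a different $label on each call (the markup is a trimmed illustration):

```python
import lxml.etree as etree
import lxml.html as html

# Compiled once; the $label variable is bound per call, so the
# compilation cost is paid a single time for any number of lookups
SPEC_VALUE = etree.XPath(
    "//span[@class='label'][normalize-space(text())=$label]"
    "/parent::div/span[@class='value']/text()"
)

doc = html.fromstring("""
<div class="spec-row"><span class="label">RAM</span>
<span class="value">24GB</span></div>
""")

print(SPEC_VALUE(doc, label="RAM"))  # ['24GB']
```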
Sibling Axes: The Label-Value Pattern at Scale
The following-sibling:: and preceding-sibling:: axes solve the most common real-world DOM traversal problem in XPath scraping: data laid out in definition list or table formats where position is semantic.
import lxml.html as html
SAMPLE_HTML = """
<table class="pricing-matrix">
<thead>
<tr>
<th>Plan</th><th>Monthly</th><th>Annual</th><th>Users</th>
</tr>
</thead>
<tbody>
<tr data-plan="starter">
<td class="plan-name">Starter</td>
<td class="price monthly">$29</td>
<td class="price annual">$290</td>
<td class="users">Up to 5</td>
</tr>
<tr data-plan="growth">
<td class="plan-name">Growth</td>
<td class="price monthly">$99</td>
<td class="price annual">$990</td>
<td class="users">Up to 25</td>
</tr>
</tbody>
</table>
"""
def extract_plan_pricing(tree) -> list[dict]:
"""
Extracts pricing data using sibling-axis XPath expressions.
Demonstrates relative positioning without hardcoded td[N] indices.
"""
plans = []
# Select each row by its data-plan attribute — semantic anchor
for row in tree.xpath("//tr[@data-plan]"):
plan_name_cell = row.xpath("td[@class='plan-name']")[0]
# Use following-sibling to get cells relative to the plan name cell
# This is resilient to column reordering unlike td[2], td[3]
monthly = plan_name_cell.xpath(
"following-sibling::td[contains(@class,'monthly')]/text()"
)
annual = plan_name_cell.xpath(
"following-sibling::td[contains(@class,'annual')]/text()"
)
users = plan_name_cell.xpath(
"following-sibling::td[@class='users']/text()"
)
plans.append({
"plan": plan_name_cell.text_content().strip(),
"monthly": monthly[0] if monthly else "",
"annual": annual[0] if annual else "",
"users": users[0] if users else "",
})
return plans
doc = html.fromstring(SAMPLE_HTML)
print(extract_plan_pricing(doc))
The following-sibling::td[contains(@class,'monthly')] pattern is preferable to td[2] because it survives column additions and reordering — common events in e-commerce product comparison tables that are A/B tested continuously.
Recommended reading: For a deeper treatment of Python scraping tool selection alongside XPath, see DataFlirt’s Best Free Web Scraping Tools for Developers — it covers how lxml, parsel, Scrapy, and Playwright fit together in a production stack.
Advanced Predicates: Combining Conditions and String Functions
Production XPath scraping requires combining multiple predicates and string manipulation functions. The XPath 1.0 string function library is rich enough to handle the majority of text-matching problems encountered in structured data extraction.
import lxml.html as html
from typing import List, Dict
COMPLEX_HTML = """
<div class="product-listing">
<article class="item featured sale" data-id="101" data-stock="12">
<h3 class="title"> Wireless Headphones Pro </h3>
<div class="badge-group">
<span class="badge sale">20% OFF</span>
<span class="badge new">NEW</span>
</div>
<p class="price">$79.99</p>
<p class="original-price">$99.99</p>
</article>
<article class="item" data-id="102" data-stock="0">
<h3 class="title">USB-C Hub 7-Port</h3>
<p class="price">$49.99</p>
</article>
<article class="item sale" data-id="103" data-stock="3">
<h3 class="title">Mechanical Keyboard TKL</h3>
<span class="badge sale">15% OFF</span>
<p class="price">$84.99</p>
<p class="original-price">$99.99</p>
</article>
</div>
"""
def extract_sale_items_in_stock(tree) -> List[Dict]:
"""
Demonstrates compound predicate XPath expressions:
- contains() for partial class name matching (space-separated class tokens)
- normalize-space() for whitespace-tolerant text extraction
- Numeric comparison predicates on data attributes
- Multiple predicates on a single axis step
"""
results = []
# XPath expressions combining three conditions in sequence:
# 1. Article has 'sale' in its class token list
# 2. Article's data-stock is greater than 0 (string-to-number coercion)
# 3. Article contains a .original-price element (confirming discount exists)
sale_items = tree.xpath("""
//article[
contains(concat(' ', normalize-space(@class), ' '), ' sale ')
and number(@data-stock) > 0
and .//p[@class='original-price']
]
""")
for item in sale_items:
# normalize-space() strips leading/trailing whitespace and collapses internals
title = item.xpath("normalize-space(h3[@class='title']/text())")
# Extract the discount badge text — may not exist on all sale items
discount = item.xpath(
".//span[contains(@class,'badge') and contains(@class,'sale')]/text()"
)
price = item.xpath("p[@class='price']/text()")
original = item.xpath("p[@class='original-price']/text()")
stock = item.get("data-stock", "0")
results.append({
"id": item.get("data-id"),
"title": title,
"discount": discount[0] if discount else "",
"price": price[0] if price else "",
"original_price": original[0] if original else "",
"stock": int(stock),
})
return results
doc = html.fromstring(COMPLEX_HTML)
items = extract_sale_items_in_stock(doc)
for item in items:
print(item)
The contains(concat(' ', normalize-space(@class), ' '), ' sale ') idiom is the canonical XPath 1.0 technique for safe class token matching. A naïve contains(@class, 'sale') would falsely match classes like on-sale or wholesale. The concat() approach wraps the class string with spaces so you always match the full token.
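The difference is easy to demonstrate on a contrived class list:

```python
import lxml.html as html

doc = html.fromstring("""
<ul>
  <li class="item sale">A</li>
  <li class="item wholesale">B</li>
  <li class="item on-sale">C</li>
</ul>
""")

# Naive substring match — wrongly picks up 'wholesale' and 'on-sale'
naive = doc.xpath("//li[contains(@class,'sale')]/text()")

# Token-safe match — padding the class list with spaces means only the
# whole token 'sale' can match
safe = doc.xpath(
    "//li[contains(concat(' ', normalize-space(@class), ' '), ' sale ')]"
    "/text()"
)
print(naive)  # ['A', 'B', 'C']
print(safe)   # ['A']
```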
Relative XPath: Anchoring to Stable Structural Nodes
The worst XPath expressions in production scrapers are absolute paths: /html/body/div[3]/div[1]/article[2]/p[4]. These are brittle by design and break every time the page layout changes. The correct approach is always relative XPath scraping anchored to semantically stable nodes.
import lxml.html as html
def build_resilient_extractor(doc, anchor_xpath: str, relative_xpaths: dict) -> list:
"""
Pattern for building resilient extractors:
1. Find a stable structural anchor (semantic ID, unique landmark attribute)
2. Extract all targets relative to that anchor
3. Never hardcode absolute path positions
"""
results = []
for anchor in doc.xpath(anchor_xpath):
record = {}
for field_name, relative_expr in relative_xpaths.items():
values = anchor.xpath(relative_expr)
record[field_name] = values[0].strip() if values else None
results.append(record)
return results
# Example: e-commerce product grid
# Anchor: any article with a data-product-id (stable attribute)
# All fields extracted relative to that anchor
extractor_config = {
"anchor": "//article[@data-product-id]",
"fields": {
"name": "descendant::*[contains(@class,'name') or contains(@class,'title')][1]/text()",
"price": "descendant::*[contains(@class,'price') and not(contains(@class,'original'))][1]/text()",
"image_src": "descendant::img[not(contains(@class,'icon'))]/@src",
"image_data": "descendant::img/@data-src", # lazy-load fallback
"product_id": "@data-product-id",
"rating": "descendant::*[@itemprop='ratingValue']/@content",
}
}
The [not(contains(@class,'original'))] predicate pattern is critical for price extraction — most e-commerce pages render both current and struck-through original prices in adjacent elements, and positional selectors like p.price:first-child are fragile. This XPath expression explicitly excludes nodes whose class names indicate they carry the original price.
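To see the anchor/field pattern in action, the extraction loop can be exercised inline against hypothetical markup (the HTML and the field subset here are illustrative, not from any real site):

```python
import lxml.html as html

doc = html.fromstring("""
<article data-product-id="p-88">
  <h2 class="product-name">Desk Lamp</h2>
  <span class="price">$35.00</span>
  <span class="price original">$49.00</span>
  <img src="/img/lamp.jpg">
</article>
""")

anchor = "//article[@data-product-id]"
fields = {
    "name": "descendant::*[contains(@class,'name') or contains(@class,'title')][1]/text()",
    "price": "descendant::*[contains(@class,'price') and not(contains(@class,'original'))][1]/text()",
    "product_id": "@data-product-id",
}

records = []
for node in doc.xpath(anchor):
    rec = {}
    for field, expr in fields.items():
        vals = node.xpath(expr)
        rec[field] = vals[0].strip() if vals else None
    records.append(rec)

print(records)
# [{'name': 'Desk Lamp', 'price': '$35.00', 'product_id': 'p-88'}]
```

Note that the price field correctly skips the struck-through original price despite both spans carrying the `price` class.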
Iframe XPath Scraping with Playwright
Iframes represent a hard boundary for DOM traversal. An XPath expression evaluated against the top-level document cannot reach nodes inside an iframe — even if the iframe is same-origin. You must switch execution context into the iframe document, then apply XPath expressions there.
import asyncio
from playwright.async_api import async_playwright
import lxml.html as html
async def scrape_iframe_content(url: str) -> list[dict]:
"""
Demonstrates iframe context switching for XPath scraping.
Prerequisites: playwright install chromium
Common use case: pricing tables, embedded review widgets,
payment forms, and third-party ad slots loaded via iframe.
"""
async with async_playwright() as pw:
browser = await pw.chromium.launch(headless=True)
context = await browser.new_context(
user_agent=(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
)
)
page = await context.new_page()
await page.goto(url, wait_until="networkidle")
results = []
# Strategy 1: Target iframe by its src domain (most resilient)
for frame in page.frames:
if "reviews-widget" in frame.url or "embedded-data" in frame.url:
# Wait for iframe content to be interactive
try:
await frame.wait_for_selector("div.review-card", timeout=10_000)
                except Exception:  # timeout — iframe never rendered the cards
                    continue
iframe_html = await frame.content()
# Now use lxml XPath scraping on the extracted iframe document
tree = html.fromstring(iframe_html)
# XPath expressions run against the iframe's own DOM
reviews = tree.xpath("//div[contains(@class,'review-card')]")
for review in reviews:
author = review.xpath(
"descendant::span[@class='reviewer-name']/text()"
)
rating = review.xpath(
"descendant::*[@data-rating]/@data-rating"
)
body = review.xpath(
"normalize-space(descendant::p[@class='review-body'])"
)
results.append({
"author": author[0] if author else "",
"rating": rating[0] if rating else "",
"body": body,
})
# Strategy 2: Using Playwright's frame_locator API
# frame_locator accepts CSS selector targeting the <iframe> element
try:
frame_loc = page.frame_locator("iframe[title='Review Widget']")
# Mix Playwright's locator with XPath expressions
cards = frame_loc.locator("xpath=//div[@class='review-card']")
count = await cards.count()
for i in range(count):
card = cards.nth(i)
author_el = card.locator("xpath=.//span[@class='reviewer-name']")
author_text = await author_el.inner_text()
results.append({"author": author_text.strip()})
        except Exception:
            pass  # frame not found — log and continue
await browser.close()
return results
# asyncio.run(scrape_iframe_content("https://example.com/product/123"))
The frame_locator("iframe[title='Review Widget']") approach is more stable than iterating over page.frames by URL because title attributes on iframes are typically set by the embedding site and are semantically meaningful. If neither URL nor title is reliable, use the iframe’s position-in-DOM as a last resort: page.frame_locator("iframe:nth-of-type(2)").
Video Blob URL Extraction: Network Interception + XPath
Blob URLs (blob:https://...) are object references created in-page with URL.createObjectURL(), typically wrapping a MediaSource object from the browser’s Media Source Extensions API. They are transient, process-local, and non-resolvable over HTTP: you cannot extract a blob URL and play it in another browser tab, let alone fetch it from Python. However, the underlying media manifest URLs — HLS .m3u8 files or DASH .mpd files — that feed the blob stream are fully accessible via network interception.
XPath scraping handles the metadata layer (player attributes, video element data attributes, poster images). Network interception captures the actual stream URLs.
import asyncio
import re
import json
from playwright.async_api import async_playwright, Request
import lxml.html as html
async def extract_video_metadata_and_stream(url: str) -> dict:
"""
Two-layer video extraction strategy:
Layer 1: XPath scraping against DOM for video metadata
Layer 2: Playwright request interception for HLS/DASH manifest URLs
Handles: YouTube embeds, Vimeo, JW Player, Video.js, and custom MSE players.
Prerequisites:
- pip install playwright
- playwright install chromium
"""
intercepted_manifests = []
intercepted_api_calls = []
async def handle_request(request: Request):
req_url = request.url
# Intercept HLS manifests
if re.search(r'\.(m3u8)(\?|$)', req_url):
intercepted_manifests.append({
"type": "hls",
"url": req_url,
"headers": dict(request.headers),
})
# Intercept DASH manifests
elif re.search(r'\.(mpd)(\?|$)', req_url):
intercepted_manifests.append({
"type": "dash",
"url": req_url,
})
# Intercept video API calls that return stream tokens
elif re.search(r'/(video|stream|playback|manifest)/', req_url):
intercepted_api_calls.append(req_url)
async with async_playwright() as pw:
browser = await pw.chromium.launch(headless=True)
context = await browser.new_context()
page = await context.new_page()
# Wire the request interceptor BEFORE navigation
page.on("request", handle_request)
await page.goto(url, wait_until="networkidle", timeout=45_000)
# Allow the player to initialize and start loading the stream
await asyncio.sleep(3)
# XPath scraping layer: extract all video element metadata from DOM
page_html = await page.content()
tree = html.fromstring(page_html)
video_metadata = {}
# XPath expressions for common video player attribute patterns
# Each targets a different player framework's data model
# Standard HTML5 video element
video_src = tree.xpath("//video/@src")
video_poster = tree.xpath("//video/@poster")
# JW Player configuration
jw_data = tree.xpath(
"//div[contains(@class,'jwplayer') or @id='jwplayer']/@data-config"
)
# Video.js player
videojs_src = tree.xpath(
"//video[contains(@class,'video-js')]/@data-setup"
)
# Custom data attributes used by proprietary players
# These vary — XPath's wildcard attribute search is useful here
custom_stream = tree.xpath(
"//*[@data-hls-url or @data-stream-url or @data-video-src]"
"/@*[contains(name(),'url') or contains(name(),'src')]"
)
        # Thumbnail/poster extraction for correlating videos with their posters
poster_images = tree.xpath(
"//video/@poster | //img[contains(@class,'video-thumb')]/@src"
" | //img[contains(@class,'thumbnail')]/@data-src"
)
video_metadata = {
"direct_src": video_src,
"poster": video_poster or poster_images,
"jw_config": jw_data,
"videojs_config": videojs_src,
"custom_stream_attrs": custom_stream,
}
await browser.close()
return {
"page_url": url,
"metadata": video_metadata,
"hls_manifests": intercepted_manifests,
"api_calls": intercepted_api_calls,
}
The XPath expression //*[@data-hls-url or @data-stream-url or @data-video-src]/@*[contains(name(),'url') or contains(name(),'src')] is worth examining. It uses name() — a node-set function that returns the attribute name as a string — to match any attribute whose name contains ‘url’ or ‘src’, regardless of what element carries it. This is an intentionally broad sweep useful when you encounter a proprietary player you have not seen before.
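A minimal, self-contained demonstration of that sweep (the attribute names and URLs are illustrative):

```python
import lxml.html as html

doc = html.fromstring("""
<div data-player="custom"
     data-hls-url="https://cdn.example.com/v/master.m3u8"
     data-poster-src="https://cdn.example.com/v/poster.jpg"></div>
""")

# name() filters on the *attribute name*, not its value — any attribute
# whose name mentions 'url' or 'src' is swept up, whatever element carries it
attrs = doc.xpath(
    "//*[@data-hls-url or @data-stream-url or @data-video-src]"
    "/@*[contains(name(),'url') or contains(name(),'src')]"
)
print(attrs)  # the manifest URL and the poster URL; data-player is skipped
```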
Recommended reading: DataFlirt’s Best Approaches to Scraping Dynamic JavaScript Sites Without Getting Blocked covers the full Playwright configuration stack that pairs with the network interception pattern above.
Image Extraction: srcset, Lazy-Load, and data-src Patterns
Modern image delivery uses multiple attributes simultaneously — src, srcset, data-src, data-lazy-src, and WebP source sets inside <picture> elements. Comprehensive image XPath scraping must handle all variants.
import lxml.html as html
from dataclasses import dataclass
from typing import Optional, List
@dataclass
class ImageRecord:
src: Optional[str]
srcset: Optional[str]
data_src: Optional[str]
alt: Optional[str]
width: Optional[str]
height: Optional[str]
loading: Optional[str]
best_url: Optional[str] # Resolved best-quality URL
def extract_images_comprehensive(tree) -> List[ImageRecord]:
"""
Comprehensive image extraction using XPath expressions that cover:
- Standard src attribute
- Responsive srcset with multiple resolutions
- Lazy-loaded images using data-src, data-lazy, data-original
- Picture element with source sets
- WebP source elements with type fallbacks
XPath scraping strategy: extract all attributes at once per element,
then resolve the best-quality URL in Python post-processing.
"""
images = []
# XPath expression that finds ALL image-bearing elements:
# regular img tags, img inside picture, and noscript-wrapped lazy images
img_elements = tree.xpath("""
//img[
@src or @data-src or @data-lazy or @data-original
or @srcset or @data-srcset
]
| //noscript[.//img]/img
| //picture/img
""")
for img in img_elements:
# Extract every image-related attribute in one pass
src = img.get("src", "").strip()
srcset = img.get("srcset", img.get("data-srcset", "")).strip()
# Lazy-load attribute cascade — different frameworks use different names
data_src = (
img.get("data-src") or
img.get("data-lazy") or
img.get("data-original") or
img.get("data-lazy-src") or
img.get("data-defer-src")
)
# Skip base64 placeholder images (common with lazy-load skeletons)
if src.startswith("data:image") and not data_src:
continue
# Resolve best URL: data-src > srcset (highest res) > src
best_url = data_src or _resolve_highest_srcset(srcset) or src or None
images.append(ImageRecord(
src=src or None,
srcset=srcset or None,
data_src=data_src,
alt=img.get("alt"),
width=img.get("width"),
height=img.get("height"),
loading=img.get("loading"),
best_url=best_url,
))
# Also extract WebP sources from <picture> elements
# The XPath expression traverses to parent <picture> then finds <source>
for source in tree.xpath("//picture/source[@srcset]"):
srcset_val = source.get("srcset", "")
img_type = source.get("type", "")
# Add high-quality WebP sources as supplementary records
if "webp" in img_type and srcset_val:
images.append(ImageRecord(
src=None,
srcset=srcset_val,
data_src=None,
alt=None,
width=source.get("width"),
height=source.get("height"),
loading=None,
best_url=_resolve_highest_srcset(srcset_val),
))
return images
def _resolve_highest_srcset(srcset: str) -> Optional[str]:
"""
Parses a srcset string and returns the URL with the highest
width descriptor (e.g., "img-400.jpg 400w, img-800.jpg 800w" → img-800.jpg).
Falls back to the first URL if no width descriptor is present.
"""
if not srcset:
return None
candidates = []
for entry in srcset.split(","):
parts = entry.strip().split()
if len(parts) == 2:
url, descriptor = parts
try:
width = int(descriptor.rstrip("wx"))
candidates.append((width, url))
except ValueError:
continue
elif len(parts) == 1:
candidates.append((0, parts[0]))
if not candidates:
return None
candidates.sort(key=lambda x: x[0], reverse=True)
return candidates[0][1]
XPath Namespace Handling in Python
XML namespaces are the most common reason engineers abandon XPath scraping for API responses. The symptom: your XPath returns an empty list even though the node clearly exists in the document. The cause: the document uses a default namespace and your XPath expression ignores it.
import lxml.etree as etree
from io import BytesIO
# A realistic namespaced XML response — common from e-commerce APIs,
# real estate data feeds, and government open data portals
NAMESPACED_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
xmlns:commerce="http://schemas.example.com/commerce/v2"
xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
<entry>
<title>Laptop Pro 16 M4</title>
<commerce:price currency="USD">1999.00</commerce:price>
<commerce:stock>42</commerce:stock>
<geo:lat>37.7749</geo:lat>
<geo:long>-122.4194</geo:long>
<link href="https://example.com/product/laptop-pro-16"/>
</entry>
<entry>
<title>Wireless Earbuds X3</title>
<commerce:price currency="USD">149.00</commerce:price>
<commerce:stock>0</commerce:stock>
<geo:lat>40.7128</geo:lat>
<geo:long>-74.0060</geo:long>
<link href="https://example.com/product/earbuds-x3"/>
</entry>
</feed>
"""
def extract_namespaced_feed(xml_bytes: bytes) -> list[dict]:
"""
Two strategies for namespace-aware XPath expressions:
Strategy A: Explicit namespace mapping — correct, performant, maintainable.
Strategy B: local-name() bypass — quick prototype, breaks on name collisions.
"""
tree = etree.parse(BytesIO(xml_bytes))
# Strategy A: Explicit namespace map
# Register prefixes that map to the document's namespace URIs
# The prefix names you choose (atom, commerce, geo) are arbitrary —
# they only need to match the URIs in the document
nsmap = {
"atom": "http://www.w3.org/2005/Atom",
"commerce": "http://schemas.example.com/commerce/v2",
"geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
}
results = []
# lxml XPath with explicit namespace map — the production pattern
entries = tree.xpath("//atom:entry", namespaces=nsmap)
for entry in entries:
title = entry.xpath("atom:title/text()", namespaces=nsmap)
price = entry.xpath("commerce:price/text()", namespaces=nsmap)
currency = entry.xpath("commerce:price/@currency", namespaces=nsmap)
stock = entry.xpath("commerce:stock/text()", namespaces=nsmap)
link = entry.xpath("atom:link/@href", namespaces=nsmap)
lat = entry.xpath("geo:lat/text()", namespaces=nsmap)
lon = entry.xpath("geo:long/text()", namespaces=nsmap)
results.append({
"title": title[0] if title else "",
"price": float(price[0]) if price else None,
"currency": currency[0] if currency else "",
"stock": int(stock[0]) if stock else 0,
"url": link[0] if link else "",
"lat": float(lat[0]) if lat else None,
"lon": float(lon[0]) if lon else None,
})
# Strategy B: local-name() bypass — useful for rapid prototyping
# when namespace URIs are unknown or inconsistent across responses
entries_b = tree.xpath("//*[local-name()='entry']")
prices_b = tree.xpath(
"//*[local-name()='entry']/*[local-name()='price']/text()"
)
return results
print(extract_namespaced_feed(NAMESPACED_XML))
The local-name() strategy deserves a direct warning: in documents that mix multiple namespaces where elements from different namespaces share the same local name, it will silently match the wrong nodes. Always use explicit namespace mappings in production XPath scraping pipelines. Use local-name() only as a diagnostic tool.
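The failure is easy to reproduce with two vocabularies that happen to share a local name (the namespace URIs here are made up for illustration):

```python
import lxml.etree as etree

# Two namespaces both define an element locally named 'price'
xml = b"""<root xmlns:a="urn:vendor-a" xmlns:b="urn:vendor-b">
  <a:price>10.00</a:price>
  <b:price>99.99</b:price>
</root>"""
tree = etree.fromstring(xml)

# local-name() matches BOTH vocabularies — a silent source of wrong data
loose = tree.xpath("//*[local-name()='price']/text()")

# An explicit namespace map pins the match to one vocabulary
strict = tree.xpath("//a:price/text()", namespaces={"a": "urn:vendor-a"})
print(loose)   # ['10.00', '99.99']
print(strict)  # ['10.00']
```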
XPath Scraping in Scrapy with parsel
Scrapy’s parsel library is the production wrapper for XPath scraping in Python’s most widely deployed scraping framework. It exposes lxml XPath with a cleaner API and adds CSS selector support on the same Selector object.
# spiders/advanced_xpath_spider.py
import scrapy
from parsel import Selector
import json
class AdvancedXPathSpider(scrapy.Spider):
name = "xpath_demo"
start_urls = ["https://example.com/catalogue/"]
custom_settings = {
"CONCURRENT_REQUESTS": 32,
"DOWNLOAD_DELAY": 0.8,
"AUTOTHROTTLE_ENABLED": True,
"ROBOTSTXT_OBEY": True,
}
    def parse(self, response):
        # parsel's .xpath() is lxml XPath with a cleaner return API
        # .get() returns first result or None; .getall() returns all as list
        # Handle structured data in JSON-LD (very common in 2026 e-commerce).
        # Parse it once per page — not once per product — since a single
        # Product block cannot describe every item on a listing page.
        structured_data = None
        for ld_block in response.xpath(
            "//script[@type='application/ld+json']/text()"
        ).getall():
            try:
                data = json.loads(ld_block)
                if isinstance(data, dict) and data.get("@type") == "Product":
                    structured_data = data
                    break
            except json.JSONDecodeError:
                continue
        # XPath expressions anchored on stable data-* attributes
        for product in response.xpath("//article[@data-product-id]"):
            # Relative XPath from the article anchor
            name = product.xpath(
                "descendant::*[self::h1 or self::h2 or self::h3]"
                "[not(ancestor::nav)][1]/text()"
            ).get("").strip()
            # Price with disambiguation between current and original
            current_price = product.xpath(
                ".//span[contains(@class,'price')"
                " and not(contains(@class,'original'))"
                " and not(contains(@class,'was'))]/text()"
            ).get("").strip()
            # XPath + structured data hybrid extraction:
            # per-product XPath leads; page-level JSON-LD fills the gaps
            yield {
                "url": response.url,
                "name": name or (
                    structured_data.get("name") if structured_data else ""
                ),
                "price": current_price or (
                    structured_data.get("offers", {}).get("price", "")
                    if structured_data else ""
                ),
                "product_id": product.xpath("@data-product-id").get(""),
            }
        # Pagination — following-sibling axis from the active page item
        # Prefers the rel='next' link over positional selectors
        next_page = response.xpath(
            "//a[@rel='next']/@href"
            " | //li[contains(@class,'active')]"
            "/following-sibling::li[contains(@class,'page-item')][1]"
            "/a/@href"
).get()
if next_page:
yield response.follow(next_page, callback=self.parse)
The JSON-LD hybrid extraction pattern is increasingly important for 2026 XPath scraping. E-commerce and news sites embed application/ld+json blocks carrying machine-readable product, article, and event data per Google’s structured data guidelines. On a product detail page, parsing this block is preferable to DOM traversal — it is authoritative, stable, and schema-documented. On listing pages, where a single Product block cannot describe every item in the grid, treat JSON-LD as a fallback or cross-check and let per-product XPath scraping lead.
Recommended reading: For the full Scrapy middleware and pipeline architecture that this spider runs inside, see DataFlirt’s Best Scraping Tools for Python Developers in 2026.
LLM-Augmented XPath Generation
The most significant productivity acceleration in XPath scraping workflows in 2025–2026 is using large language models to generate initial XPath expressions from raw HTML. Rather than manually inspecting the DOM, you pass a representative HTML snippet with a data extraction goal and receive syntactically correct XPath expressions that you validate and deploy.
The failure mode is well-documented: LLMs generate positional predicates (div[3], tr[1]) that are brittle. Prompt engineering that emphasises resilience mitigates this.
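Prompting alone does not guarantee compliance, so it is worth lint-checking generated expressions before validation. The sketch below is a heuristic regex check (names are my own, not from any library) that flags numeric positional predicates while tolerating [1], which is often a legitimate "first of an already-filtered node set" pattern.

```python
import re

# Matches tag[3]-style positional predicates (including position()=3),
# but deliberately skips [1], which often just picks the first match
# of an already-filtered node set.
POSITIONAL = re.compile(r"\w+\[\s*(?:position\(\)\s*=\s*)?([2-9]|\d{2,})\s*\]")


def flag_positional_predicates(xpath_map: dict) -> dict:
    """Return the subset of generated expressions that rely on positional indices."""
    return {
        field: expr
        for field, expr in xpath_map.items()
        if POSITIONAL.search(expr)
    }
```

Any flagged field is a candidate for regeneration with a stricter prompt, or for manual replacement with an attribute-anchored expression.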
Google Gemini 3.1 via GenAI SDK — API Mode
# Prerequisites: pip install google-genai
# Set env: GOOGLE_API_KEY=your_key
from google import genai
from google.genai import types
import json

client = genai.Client()  # uses GOOGLE_API_KEY env var

XPATH_SYSTEM_PROMPT = """You are an expert in XPath scraping and HTML parsing.
Given an HTML snippet and a list of fields to extract, you generate resilient
XPath expressions that:
1. Avoid positional predicates (div[3], tr[2]) unless absolutely necessary
2. Anchor to stable semantic attributes (data-*, aria-*, id, role)
3. Use contains() for partial class matching
4. Use normalize-space() for text-bearing elements
5. Use the following-sibling or ancestor axis when label-value pairs are present
Return ONLY a JSON object with field names as keys and XPath expressions as values.
No explanation, no markdown fences."""


def generate_xpath_with_gemini(html_snippet: str, fields: list[str]) -> dict:
    """
    Generates XPath expressions using Gemini 3.1 Flash for cost efficiency.
    Use gemini-3.1-pro for more complex DOM structures.
    The response is validated before being used in production.
    """
    prompt = f"""HTML snippet:
{html_snippet[:15000]}
Extract these fields: {', '.join(fields)}
Return a JSON object with each field name as a key and its XPath expression as the value."""
    response = client.models.generate_content(
        model="gemini-3.1-flash",  # Use gemini-3.1-pro for complex DOMs
        contents=prompt,  # the SDK accepts a plain string directly
        config=types.GenerateContentConfig(
            system_instruction=XPATH_SYSTEM_PROMPT,
            response_mime_type="application/json",
            temperature=0.1,  # Low temperature for deterministic structural output
        ),
    )
    try:
        return json.loads(response.text)
    except json.JSONDecodeError:
        # Strip any accidental markdown fences
        clean = response.text.strip()
        clean = clean.removeprefix("```json").removeprefix("```").removesuffix("```")
        return json.loads(clean.strip())
Google Vertex AI — GenAI SDK Mode (Enterprise)
# Prerequisites: pip install google-genai google-auth
# Set env: GOOGLE_CLOUD_PROJECT=your-project-id
# Authenticate: gcloud auth application-default login
from google import genai as vertex_genai
from google.genai import types as vertex_types
import json

# Use the Vertex AI backend via the unified GenAI SDK — no separate
# vertexai.init() call is needed; project and location go on the client
vertex_client = vertex_genai.Client(
    vertexai=True,
    project="your-gcp-project-id",
    location="us-central1",
)


def generate_xpath_vertex(html_snippet: str, fields: list[str]) -> dict:
    """
    Vertex AI mode — identical API surface to the standard GenAI SDK
    but routes through your GCP project for enterprise billing and compliance.
    Useful when scraping pipelines run inside GCP infrastructure.
    """
    prompt = f"""HTML:\n{html_snippet[:15000]}\n\nFields to extract: {', '.join(fields)}\n
Return ONLY a JSON object mapping field names to XPath expressions.
Use ancestor/sibling axes where label-value pairs are present.
Prefer contains() over exact class matching."""
    response = vertex_client.models.generate_content(
        model="gemini-3.1-pro",  # Pro tier for complex DOM structures in enterprise context
        contents=prompt,
        config=vertex_types.GenerateContentConfig(
            system_instruction=XPATH_SYSTEM_PROMPT,
            response_mime_type="application/json",
            temperature=0.1,
        ),
    )
    return json.loads(response.text)
Claude Sonnet and Opus via Anthropic SDK
# Prerequisites: pip install anthropic
# Set env: ANTHROPIC_API_KEY=your_key
import anthropic
import json

anthropic_client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY


def generate_xpath_claude(
    html_snippet: str,
    fields: list[str],
    use_opus: bool = False,
) -> dict:
    """
    Claude-based XPath expression generation.

    Use claude-opus-4-6 for highly complex DOMs with deeply nested
    conditional rendering, shadow DOM hints, or unusual structural patterns.
    Use claude-sonnet-4-6 (default) for standard e-commerce and content
    extraction tasks — significantly faster and more cost-efficient.

    Claude's strength relative to Gemini for this task: stronger reasoning
    about DOM structure semantics, better at proposing ancestor-axis solutions,
    and more reliable at avoiding positional predicates when prompted.
    """
    model = "claude-opus-4-6" if use_opus else "claude-sonnet-4-6"
    message = anthropic_client.messages.create(
        model=model,
        max_tokens=1500,
        system=XPATH_SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": (
                f"HTML snippet:\n{html_snippet[:20000]}\n\n"
                f"Extract these fields using resilient XPath expressions: "
                f"{', '.join(fields)}\n\n"
                "Return ONLY a JSON object. "
                "Use ancestor and sibling axes where applicable. "
                "Avoid positional predicates."
            ),
        }],
    )
    raw_text = message.content[0].text
    try:
        return json.loads(raw_text)
    except json.JSONDecodeError:
        # Claude occasionally wraps JSON in markdown despite instructions
        clean = raw_text.strip()
        clean = clean.removeprefix("```json").removeprefix("```").removesuffix("```")
        return json.loads(clean.strip())
def validate_xpath_batch(html: str, xpath_map: dict) -> dict:
    """
    Validates LLM-generated XPath expressions against a known HTML sample.
    Returns a report of which expressions produce results and which fail silently.
    Run this BEFORE deploying LLM-generated XPath to production.
    """
    import lxml.html as lxml_html

    tree = lxml_html.fromstring(html)
    validation_report = {}
    for field, xpath_expr in xpath_map.items():
        try:
            result = tree.xpath(xpath_expr)
            validation_report[field] = {
                "xpath": xpath_expr,
                "status": "ok" if result else "empty",
                "result_count": len(result),
                "sample": str(result[0])[:100] if result else None,
            }
        except Exception as e:
            validation_report[field] = {
                "xpath": xpath_expr,
                "status": "error",
                "error": str(e),
            }
    return validation_report
The validate_xpath_batch() function is not optional. LLMs generate syntactically valid XPath expressions that return empty results on the actual document more often than they generate syntactically invalid ones. An empty result is a silent failure in a scraping pipeline. Always validate against a ground-truth sample before promotion to production.
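A natural way to operationalise this is a promotion gate that consumes the validate_xpath_batch() report and refuses any map with empty or erroring required fields. The sketch below assumes the report dict shape shown above; promote_if_valid is a hypothetical helper name.

```python
def promote_if_valid(report: dict, required_fields: set[str]) -> tuple[bool, list[str]]:
    """
    Decide whether an LLM-generated XPath map is safe to promote to production.
    `report` is the output of validate_xpath_batch(); promotion requires every
    required field to have status 'ok' (non-empty result, no evaluation error).
    Returns (promote?, sorted list of failing field names).
    """
    failures = [
        field for field in required_fields
        if report.get(field, {}).get("status") != "ok"
    ]
    return (len(failures) == 0, sorted(failures))
```

Wiring this between generation and deployment turns silent empty-result failures into hard, loggable rejections.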
Recommended reading: DataFlirt’s Best Scraping Tools Powered by LLMs in 2026 covers the broader LLM-augmented extraction landscape, including evaluation frameworks for comparing LLM selector generation quality across model families.
Debugging XPath Expressions in Production
Silent failures — empty results from valid XPath expressions — are the most time-consuming debugging scenario in XPath scraping. The following diagnostic toolkit covers the three most common root causes.
import lxml.html as html
import lxml.etree as etree


def diagnose_xpath(html_string: str, failing_xpath: str) -> dict:
    """
    Systematic XPath expression debugger.
    Common failure modes covered:
    1. Whitespace in class attributes causing contains() mismatch
    2. Namespace pollution from embedded SVG or MathML
    3. HTML parser normalisation changing attribute casing
    4. Encoding issues producing entity characters in text nodes
    5. Phantom whitespace-only text nodes
    """
    tree = html.fromstring(html_string)
    diagnostics = {
        "original_xpath": failing_xpath,
        "original_result": tree.xpath(failing_xpath),
    }

    # Diagnostic 1: Namespace pollution
    # If the document contains SVG or MathML, lxml may inject namespace prefixes
    # that interfere with XPath DOM traversal. Guard against comment/PI nodes,
    # whose .tag is not a string.
    all_namespaces = {
        el.tag.split("}")[0].strip("{")
        for el in tree.iter()
        if isinstance(el.tag, str) and "}" in el.tag
    }
    diagnostics["detected_namespaces"] = list(all_namespaces)
    if all_namespaces:
        diagnostics["namespace_note"] = (
            "Document contains non-HTML namespaces. If your XPath targets elements "
            "in these namespaces, you must register them explicitly or use local-name()."
        )

    # Diagnostic 2: Whitespace in class attributes
    # Trailing spaces in class attributes cause contains(@class, 'foo') to mismatch
    # when the class value is ' foo ' or 'foo bar '
    sample_elements = tree.xpath("//*[@class]")[:5]
    diagnostics["class_samples"] = [
        {"tag": el.tag, "class": repr(el.get("class"))}
        for el in sample_elements
    ]

    # Diagnostic 3: Serialise the subtree for visual inspection
    # This reveals what the parser actually ingested vs what you assume
    # Particularly useful when scraping sites with malformed HTML
    if failing_xpath.startswith("//"):
        # Take the first location step as the likely parent context
        parent_tag = failing_xpath.split("/")[2].split("[")[0].split("::")[-1]
        parent_nodes = tree.xpath(f"//{parent_tag}")[:2]
        diagnostics["parent_node_serialised"] = [
            etree.tostring(n, pretty_print=True).decode()[:500]
            for n in parent_nodes
        ]
    return diagnostics


# Interactive debugging helper for parsel/Scrapy development
def xpath_test_harness(html_string: str, xpaths: dict) -> None:
    """
    Quick multi-expression test harness. Pass a dict of name → XPath expression
    and see results for all at once. Useful during spider development.
    """
    tree = html.fromstring(html_string)
    print(f"{'Field':<25} {'Status':<8} {'Sample Result'}")
    print("-" * 80)
    for name, expr in xpaths.items():
        try:
            result = tree.xpath(expr)
            if result:
                sample = str(result[0])[:50].strip()
                print(f"{name:<25} {'OK':<8} {sample}")
            else:
                print(f"{name:<25} {'EMPTY':<8} (no nodes matched)")
        except etree.XPathError as e:
            print(f"{name:<25} {'ERROR':<8} {str(e)[:50]}")
Performance Considerations: lxml vs selectolax vs Alternatives
For pipelines where XPath scraping is the primary parse operation at volume, the parser choice matters. lxml XPath at scale on a single CPU core processes approximately 800–1,200 medium-complexity HTML documents per second. For purely structural extraction without XPath, selectolax’s CSS API is 8–12x faster — but it does not expose XPath.
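Throughput figures like these depend heavily on document size and selector mix, so measure against your own corpus before committing to a parser. The harness below is a minimal sketch: it accepts any parse-and-extract callable (an lxml-based one, a selectolax-based one) and reports documents per second, taking the best of several runs to reduce scheduler noise. The function name is my own.

```python
import time
from typing import Callable


def docs_per_second(parse_fn: Callable[[str], object],
                    documents: list[str],
                    repeats: int = 3) -> float:
    """
    Measure sustained throughput of a parse-and-extract callable over a corpus.
    Takes the fastest of `repeats` full passes to smooth out OS jitter.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in documents:
            parse_fn(doc)
        best = min(best, time.perf_counter() - start)
    return len(documents) / best if best > 0 else float("inf")
```

Run it once with your lxml extraction function and once with a selectolax equivalent over the same saved HTML corpus; the ratio you observe on your documents is the only benchmark that matters.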
The engineering decision tree:
- Need XPath expressions with axis traversal → lxml. No alternative.
- CSS selectors sufficient, need maximum throughput → selectolax.
- Need both XPath and CSS on same object → parsel (Scrapy’s wrapper over lxml).
- Need XPath on JS-rendered content → Playwright to render, lxml to parse the resulting HTML.
- Processing millions of small XML documents → lxml's iterparse() in streaming mode — avoids loading the full document tree into memory.
# Streaming lxml for large XML feeds — avoids OOM on 100MB+ feeds
import lxml.etree as etree


def stream_large_xml_feed(xml_path: str, target_tag: str, nsmap: dict) -> list[dict]:
    """
    iterparse() for memory-efficient XPath scraping of large XML documents.
    Processes one element at a time; clears processed elements to keep memory flat.
    """
    results = []
    # Resolve tag with namespace prefix
    ns_uri = nsmap.get(target_tag.split(":")[0]) if ":" in target_tag else None
    local = target_tag.split(":")[-1]
    qualified_tag = f"{{{ns_uri}}}{local}" if ns_uri else local

    context = etree.iterparse(xml_path, events=("end",), tag=qualified_tag)
    for event, elem in context:
        # Extract data using XPath against this single element
        titles = elem.xpath("atom:title/text()", namespaces=nsmap)
        results.append({
            "title": titles[0] if titles else "",
            "id": elem.get("id", ""),
        })
        # Critical: clear processed elements and their preceding siblings
        # Without this, lxml accumulates the full document in memory
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    return results
Production XPath Scraping: The Complete Pattern Summary
After covering the full range of XPath scraping techniques, here is the synthesised production decision framework DataFlirt’s engineering team uses when building extraction pipelines:
Always anchor to semantic attributes, not positional indices. @data-product-id, @id, @role, @aria-label, and @itemprop survive front-end refactors. div[3] does not.
Use parameterised XPath (keyword arguments to lxml’s .xpath()) for any expression that incorporates user input or variable data. It prevents injection and enables compiled-expression caching.
Combine JSON-LD extraction with XPath scraping as a hybrid pipeline. Structured data blocks are the authoritative source when present. XPath is the reliable fallback.
Validate LLM-generated XPath expressions against a ground-truth sample before promotion. Silent empty-result failures are more dangerous than syntax errors.
Use iterparse() for XML feeds exceeding 10MB. Loading the full document tree for streaming data is a memory management error, not a performance trade-off.
Wrap text comparisons in normalize-space() in all text predicates. HTML source whitespace is inconsistent and parser-dependent. Always normalise before matching.
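On the parameterisation point above: lxml supports XPath variables directly via keyword arguments, e.g. tree.xpath("//*[@data-sku = $sku]", sku=user_value), and that is the preferred route. When you must target an engine without variable support, the fallback is a quote-safe literal builder, since XPath 1.0 has no escape sequences inside string literals. The sketch below uses the standard concat() trick; xpath_string_literal is a hypothetical helper name.

```python
def xpath_string_literal(value: str) -> str:
    """
    Embed an arbitrary Python string as a safe XPath 1.0 string literal.
    XPath 1.0 cannot escape quotes, so values containing both quote
    characters must be assembled with concat().
    """
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    # Split on single quotes and rejoin the pieces, inserting "'" segments
    parts = value.split("'")
    pieces = []
    for i, part in enumerate(parts):
        if i > 0:
            pieces.append('"\'"')  # a double-quoted single quote
        if part:
            pieces.append(f"'{part}'")
    return f"concat({', '.join(pieces)})"
```

The result drops straight into an expression string, e.g. f"//span[text() = {xpath_string_literal(raw)}]", without the injection risk of naive f-string interpolation.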
Recommended reading: To complete your production scraping stack alongside the XPath scraping techniques in this guide, see DataFlirt’s Top 5 Scraping Tools for Extracting Structured Data with CSS and XPath for a comparative tool evaluation, and Best Databases for Storing Scraped Data at Scale for pipeline output architecture.
Further Reading from DataFlirt
Engineering teams extending their XPath scraping infrastructure into production deployments will find these guides directly relevant:
- Best Free Web Scraping Tools for Developers in 2026 — Covers the full open-source ecosystem including Scrapy, Playwright, lxml, and Camoufox integration patterns that sit upstream of your XPath extraction layer
- Best Approaches to Scraping Dynamic JavaScript Sites Without Getting Blocked — Playwright rendering strategies that produce the DOM your XPath expressions operate against
- Top 7 Anti-Fingerprinting Tools Every Scraper Should Know About — Infrastructure-level evasion that determines whether your scraper reaches the DOM in the first place
- 7 Reasons Your Scraper Keeps Getting Blocked and the Tools to Fix Each One — Diagnostic framework for production pipeline failures that surface above the XPath layer
- Best Scraping Tools for Python Developers in 2026 — Broader Python scraping ecosystem context for the lxml, parsel, and Scrapy tools used throughout this guide
- Top 5 Scraping Compliance and Legal Considerations Every Scraper Should Know — Legal framework for operating production scraping pipelines responsibly
Frequently Asked Questions
When should I choose XPath over CSS selectors for web scraping?
Choose XPath scraping over CSS selectors when you need to navigate up the DOM (parent or ancestor axes), when you need text content matching in predicates, when the target document carries XML namespaces, or when positional logic relative to document structure is required. CSS selectors cannot traverse upward and have no equivalent to XPath’s string-function predicates, making XPath the only option for a significant class of complex DOM traversal problems.
What is the best Python library for XPath scraping in 2026?
lxml remains the fastest and most complete XPath 1.0 implementation in Python. For Scrapy pipelines, parsel wraps lxml with a cleaner API. For JavaScript-rendered pages, Playwright handles rendering while lxml or parsel processes the HTML output. Avoid BeautifulSoup for XPath-heavy workloads — its find() API does not natively support XPath expressions.
Can XPath extract data from inside iframes?
XPath cannot cross iframe document boundaries. You must first switch execution context to the iframe using Playwright’s frame_locator() or frames property, then apply XPath expressions against the sub-document. In lxml processing, if you have already extracted the iframe’s HTML as a separate string, standard XPath scraping applies normally.
How do I handle XPath with XML namespaces when scraping?
Pass an explicit namespace mapping dict to lxml’s .xpath() call: tree.xpath('//ns:element', namespaces={'ns': 'http://example.com/ns'}). Alternatively, use local-name() to bypass namespace matching for rapid prototyping, but switch to explicit mappings for production pipelines to avoid cross-namespace collisions.
Is LLM-generated XPath reliable enough for production scraping?
LLM-generated XPath expressions via Gemini 3.1 or Claude Sonnet work well as a bootstrap tool. Always run the validate_xpath_batch() function against a ground-truth sample before deploying. The common failure mode is LLMs generating positional predicates that break on paginated content — counter this with prompt engineering that explicitly prohibits positional selectors.
Does reCAPTCHA v3 have a challenge to solve?
No — but this is relevant context for XPath scraping pipelines targeting Google properties. reCAPTCHA v3 operates invisibly and assigns a risk score. The correct response is fingerprint engineering and IP quality management, not challenge solving. See DataFlirt’s Google CAPTCHA bypass guide for the full evasion stack.