Who This Guide Is For — and Why XPath Scraping Still Matters in 2026
You already know how to scrape. You have written CSS selectors, parsed responses with BeautifulSoup, and run a Scrapy spider or two against cooperative HTML. The problem is that cooperative HTML stopped being the default a long time ago.
This guide is for data engineers and senior Python developers who are regularly confronted with DOM structures that break CSS selectors: tables whose row cells shift position between pages, pricing blocks wrapped in deeply nested conditional markup, product images hidden behind lazy-load scaffolding, video players that inject blob URIs via Media Source Extensions, and API responses in namespaced XML that need structured extraction without a schema. These are the real-world targets that expose the ceiling of CSS-first scraping approaches.
XPath scraping has a ceiling too — but it is considerably higher. XML Path Language 1.0, specified by the W3C in 1999 and still fully supported by every major HTML parser, gives you a complete navigational model over any document tree. You can traverse upward (ancestor axes), sideways (sibling axes), and across text content (string functions as predicate filters). You can express conditions that combine structural position with content matching. None of these capabilities exist in CSS selectors.
The global web scraping market is projected to exceed USD 2.8 billion by 2030, growing at a CAGR above 18%, with an increasing share of that pipeline complexity driven by SPAs, anti-bot obfuscation layers, and nested iframe architectures. Engineers who have mastered advanced XPath expressions navigate these environments efficiently; those who have not spend hours debugging selectors that break every time the target site runs an A/B test.
This is the guide for the former group.
XPath vs CSS Selectors: The Engineering Decision
The choice between XPath and CSS is not a stylistic preference. It is a structural one dictated by your target DOM’s characteristics.
CSS selectors are faster to write, easier to read, and slightly faster to evaluate in benchmarks. They are the right default when: the target structure is stable, you need descendant-only traversal, and your predicates are purely class- or attribute-based. Most public HTML pages fall into this category. Use CSS selectors there.
XPath scraping becomes the correct engineering choice when any of the following conditions apply:
Upward traversal is required. CSS has no equivalent to ancestor:: or parent::. If you need to find a label element, then extract a sibling value cell from the same parent row, XPath is your only option in the standard parser layer.
Text content predicates are needed. //div[contains(text(), 'Price')] has no CSS equivalent. Matching against partial text, normalised whitespace, or string-start patterns is fundamental to XPath expressions and absent from CSS.
Namespace-qualified documents. Any XML response from a REST or SOAP API, any Atom feed, any SVG embedded in HTML — all of these carry XML namespaces that CSS selectors do not handle.
Positional logic relative to document structure. position(), last(), and count() let you express conditions like “the second cell of the last row” or “all but the header row” as single XPath expressions.
Conditional multi-axis extraction. Real-world DOM traversal often requires combining axes: “find the <dt> whose text is ‘SKU’, then retrieve the text of its following <dd> sibling.” This is trivially expressed in XPath scraping and structurally impossible in CSS.
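The dt/dd case from the last bullet can be sketched in a few lines of lxml (the markup here is illustrative):

```python
import lxml.html as html

# Illustrative markup: a definition list of product attributes
doc = html.fromstring("""
<dl class="product-attrs">
  <dt>Brand</dt><dd>Acme</dd>
  <dt>SKU</dt><dd>AC-2041</dd>
  <dt>Weight</dt><dd>1.2 kg</dd>
</dl>
""")

# Anchor on the <dt> by its text, then step to the first following <dd>
sku = doc.xpath(
    "//dt[normalize-space(text())='SKU']/following-sibling::dd[1]/text()"
)
print(sku)  # ['AC-2041']
```

No CSS selector can express the text-matching anchor, and none can step sideways from it.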
Setting Up Your Python XPath Environment
Before writing a single XPath expression, establish an isolated environment. Dependency conflicts between lxml versions are a common, silent source of parser inconsistency across scraping pipelines.
# Python 3.11+ recommended for performance improvements in lxml 5.x
python -m venv .xpath-env
source .xpath-env/bin/activate # Windows: .xpath-env\Scripts\activate
# Core dependencies
pip install lxml parsel scrapy playwright requests httpx
# System dependency for lxml on Debian/Ubuntu (required before pip install)
# sudo apt-get install libxml2-dev libxslt1-dev python3-dev
# Install Playwright browser binaries for JS-rendered DOM work
playwright install chromium firefox
The critical dependency to get right is lxml. Version 5.x ships with improved memory management for large HTML documents and tighter XPath 1.0 compliance for edge-case namespace handling. Verify your installation:
import lxml.etree as etree
import lxml.html as html
from lxml import __version__ as lxml_version
print(f"lxml version: {lxml_version}")
# Expect: 5.1.0 or higher for 2026 production use
The XPath Data Model: Axes, Node Tests, and Predicates
Every XPath expression decomposes into three parts: an axis specifier that defines direction of traversal, a node test that filters by node type or name, and zero or more predicates that add conditional filtering. Understanding this decomposition is the foundation of all advanced XPath scraping work.
ancestor::  div  [@class='product-container']
^^^^^^^^^^  ^^^  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
   axis     node           predicate
            test
The axes available in XPath 1.0 and their directional semantics:
| Axis | Direction | Common Use in XPath Scraping |
|---|---|---|
| child:: | Immediate children | Default axis; child::div equals div |
| descendant:: | All descendants | Deep search regardless of nesting depth |
| descendant-or-self:: | Self + all descendants | The // abbreviation expands to this |
| parent:: | Immediate parent | One level up |
| ancestor:: | All ancestors up to root | Find containing context from inner element |
| ancestor-or-self:: | Self + all ancestors | Useful for “within which section am I?” |
| following-sibling:: | Siblings after current node | Extract value cell after a label cell |
| preceding-sibling:: | Siblings before current node | Context-building from a known anchor |
| following:: | All nodes after current | Cross-parent forward search |
| preceding:: | All nodes before current | Rarely needed; expensive |
| self:: | Current node | Validation predicates |
| attribute:: | Attributes of current node | attribute::href = @href |
The abbreviated syntax most engineers use (//, @, .) maps directly to these axes. Understanding the full form is essential for complex DOM traversal because it lets you compose multi-axis XPath expressions that the abbreviations cannot express.
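The expansion can be verified directly in lxml — a minimal sanity check on a throwaway document:

```python
import lxml.html as html

doc = html.fromstring("<div><section><p>a</p></section><p>b</p></div>")

# '//' is shorthand for /descendant-or-self::node()/ — both forms
# select the same node-set in the same document order
short = doc.xpath("//p")
full = doc.xpath("descendant-or-self::node()/child::p")
assert short == full and len(short) == 2

# Likewise, '@href' abbreviates 'attribute::href' and '.' abbreviates
# 'self::node()' — the long forms are what you compose multi-axis paths from
```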
Ancestor and Parent Axes: Traversing Up the DOM
The single most common scenario where CSS selectors fail and XPath scraping succeeds is the “label-value pair” pattern: you can reliably find the label, but the value you need is a sibling or cousin of that label’s parent. No CSS selector supported by the standard Python parser layer resolves this. XPath expressions do.
Consider a product detail page with this markup:
<div class="spec-table">
<div class="spec-row obfuscated-cx12">
<span class="label">Processor</span>
<span class="value">Apple M4 Pro</span>
</div>
<div class="spec-row obfuscated-cx13">
<span class="label">RAM</span>
<span class="value">24GB</span>
</div>
<div class="spec-row obfuscated-cx14">
<span class="label">Storage</span>
<span class="value">512GB SSD</span>
</div>
</div>
The obfuscated class suffixes (cx12, cx13) are generated dynamically and change with every deployment. A CSS selector targeting .obfuscated-cx12 .value breaks on the next build. The ancestor-anchored XPath expression is resilient:
import lxml.html as html
def extract_spec(tree, label_text: str) -> str:
"""
Finds a spec value by its label text using ancestor-anchored XPath.
Resilient to obfuscated class names on containing rows.
"""
# Find the label span by text content
# Traverse to its parent (the row), then find the sibling value span
result = tree.xpath(
"//span[@class='label'][normalize-space(text())=$label]"
"/parent::div"
"/span[@class='value']/text()",
        label=label_text  # Parameterised XPath — prevents injection; expression stays constant
)
return result[0].strip() if result else ""
doc = html.fromstring("""<div class="spec-table">...(above HTML)...</div>""")
processor = extract_spec(doc, "Processor")
# Returns: "Apple M4 Pro"
The parameterised XPath pattern (label=label_text passed as a keyword argument to .xpath()) is a production best practice that lxml supports natively through XPath variables. It prevents XPath injection through malformed input strings, and because the expression string itself never changes between calls, it can be precompiled once with lxml.etree.XPath and reused — a measurable performance gain on high-frequency extractions.
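The precompiled form looks like this — compile the expression once at module level, then bind a different $label on each call (the markup is a trimmed illustration):

```python
import lxml.etree as etree
import lxml.html as html

# Compiled once; the $label variable is bound per call, so the
# compilation cost is paid a single time for any number of lookups
SPEC_VALUE = etree.XPath(
    "//span[@class='label'][normalize-space(text())=$label]"
    "/parent::div/span[@class='value']/text()"
)

doc = html.fromstring("""
<div class="spec-row"><span class="label">RAM</span>
<span class="value">24GB</span></div>
""")

print(SPEC_VALUE(doc, label="RAM"))  # ['24GB']
```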
Sibling Axes: The Label-Value Pattern at Scale
The following-sibling:: and preceding-sibling:: axes solve the most common real-world DOM traversal problem in XPath scraping: data laid out in definition list or table formats where position is semantic.
import lxml.html as html
SAMPLE_HTML = """
<table class="pricing-matrix">
<thead>
<tr>
<th>Plan</th><th>Monthly</th><th>Annual</th><th>Users</th>
</tr>
</thead>
<tbody>
<tr data-plan="starter">
<td class="plan-name">Starter</td>
<td class="price monthly">$29</td>
<td class="price annual">$290</td>
<td class="users">Up to 5</td>
</tr>
<tr data-plan="growth">
<td class="plan-name">Growth</td>
<td class="price monthly">$99</td>
<td class="price annual">$990</td>
<td class="users">Up to 25</td>
</tr>
</tbody>
</table>
"""
def extract_plan_pricing(tree) -> list[dict]:
"""
Extracts pricing data using sibling-axis XPath expressions.
Demonstrates relative positioning without hardcoded td[N] indices.
"""
plans = []
# Select each row by its data-plan attribute — semantic anchor
for row in tree.xpath("//tr[@data-plan]"):
plan_name_cell = row.xpath("td[@class='plan-name']")[0]
# Use following-sibling to get cells relative to the plan name cell
# This is resilient to column reordering unlike td[2], td[3]
monthly = plan_name_cell.xpath(
"following-sibling::td[contains(@class,'monthly')]/text()"
)
annual = plan_name_cell.xpath(
"following-sibling::td[contains(@class,'annual')]/text()"
)
users = plan_name_cell.xpath(
"following-sibling::td[@class='users']/text()"
)
plans.append({
"plan": plan_name_cell.text_content().strip(),
"monthly": monthly[0] if monthly else "",
"annual": annual[0] if annual else "",
"users": users[0] if users else "",
})
return plans
doc = html.fromstring(SAMPLE_HTML)
print(extract_plan_pricing(doc))
The following-sibling::td[contains(@class,'monthly')] pattern is preferable to td[2] because it survives column additions and reordering — common events in e-commerce product comparison tables that are A/B tested continuously.
Recommended reading: For a deeper treatment of Python scraping tool selection alongside XPath, see DataFlirt’s Best Free Web Scraping Tools for Developers — it covers how lxml, parsel, Scrapy, and Playwright fit together in a production stack.
Advanced Predicates: Combining Conditions and String Functions
Production XPath scraping requires combining multiple predicates and string manipulation functions. The XPath 1.0 string function library is rich enough to handle the majority of text-matching problems encountered in structured data extraction.
import lxml.html as html
from typing import List, Dict
COMPLEX_HTML = """
<div class="product-listing">
<article class="item featured sale" data-id="101" data-stock="12">
<h3 class="title"> Wireless Headphones Pro </h3>
<div class="badge-group">
<span class="badge sale">20% OFF</span>
<span class="badge new">NEW</span>
</div>
<p class="price">$79.99</p>
<p class="original-price">$99.99</p>
</article>
<article class="item" data-id="102" data-stock="0">
<h3 class="title">USB-C Hub 7-Port</h3>
<p class="price">$49.99</p>
</article>
<article class="item sale" data-id="103" data-stock="3">
<h3 class="title">Mechanical Keyboard TKL</h3>
<span class="badge sale">15% OFF</span>
<p class="price">$84.99</p>
<p class="original-price">$99.99</p>
</article>
</div>
"""
def extract_sale_items_in_stock(tree) -> List[Dict]:
"""
Demonstrates compound predicate XPath expressions:
- contains() for partial class name matching (space-separated class tokens)
- normalize-space() for whitespace-tolerant text extraction
- Numeric comparison predicates on data attributes
- Multiple predicates on a single axis step
"""
results = []
# XPath expressions combining three conditions in sequence:
# 1. Article has 'sale' in its class token list
# 2. Article's data-stock is greater than 0 (string-to-number coercion)
# 3. Article contains a .original-price element (confirming discount exists)
sale_items = tree.xpath("""
//article[
contains(concat(' ', normalize-space(@class), ' '), ' sale ')
and number(@data-stock) > 0
and .//p[@class='original-price']
]
""")
for item in sale_items:
# normalize-space() strips leading/trailing whitespace and collapses internals
title = item.xpath("normalize-space(h3[@class='title']/text())")
# Extract the discount badge text — may not exist on all sale items
discount = item.xpath(
".//span[contains(@class,'badge') and contains(@class,'sale')]/text()"
)
price = item.xpath("p[@class='price']/text()")
original = item.xpath("p[@class='original-price']/text()")
stock = item.get("data-stock", "0")
results.append({
"id": item.get("data-id"),
"title": title,
"discount": discount[0] if discount else "",
"price": price[0] if price else "",
"original_price": original[0] if original else "",
"stock": int(stock),
})
return results
doc = html.fromstring(COMPLEX_HTML)
items = extract_sale_items_in_stock(doc)
for item in items:
print(item)
The contains(concat(' ', normalize-space(@class), ' '), ' sale ') idiom is the canonical XPath 1.0 technique for safe class token matching. A naïve contains(@class, 'sale') would falsely match classes like on-sale or wholesale. The concat() approach wraps the class string with spaces so you always match the full token.
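The difference is easy to demonstrate on a contrived class list:

```python
import lxml.html as html

doc = html.fromstring("""
<ul>
  <li class="item sale">A</li>
  <li class="item wholesale">B</li>
  <li class="item on-sale">C</li>
</ul>
""")

# Naive substring match — wrongly picks up 'wholesale' and 'on-sale'
naive = doc.xpath("//li[contains(@class,'sale')]/text()")

# Token-safe match — padding the class list with spaces means only the
# whole token 'sale' can match
safe = doc.xpath(
    "//li[contains(concat(' ', normalize-space(@class), ' '), ' sale ')]"
    "/text()"
)
print(naive)  # ['A', 'B', 'C']
print(safe)   # ['A']
```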
Relative XPath: Anchoring to Stable Structural Nodes
The worst XPath expressions in production scrapers are absolute paths: /html/body/div[3]/div[1]/article[2]/p[4]. These are brittle by design and break every time the page layout changes. The correct approach is always relative XPath scraping anchored to semantically stable nodes.
import lxml.html as html
def build_resilient_extractor(doc, anchor_xpath: str, relative_xpaths: dict) -> list:
"""
Pattern for building resilient extractors:
1. Find a stable structural anchor (semantic ID, unique landmark attribute)
2. Extract all targets relative to that anchor
3. Never hardcode absolute path positions
"""
results = []
for anchor in doc.xpath(anchor_xpath):
record = {}
for field_name, relative_expr in relative_xpaths.items():
values = anchor.xpath(relative_expr)
record[field_name] = values[0].strip() if values else None
results.append(record)
return results
# Example: e-commerce product grid
# Anchor: any article with a data-product-id (stable attribute)
# All fields extracted relative to that anchor
extractor_config = {
"anchor": "//article[@data-product-id]",
"fields": {
"name": "descendant::*[contains(@class,'name') or contains(@class,'title')][1]/text()",
"price": "descendant::*[contains(@class,'price') and not(contains(@class,'original'))][1]/text()",
"image_src": "descendant::img[not(contains(@class,'icon'))]/@src",
"image_data": "descendant::img/@data-src", # lazy-load fallback
"product_id": "@data-product-id",
"rating": "descendant::*[@itemprop='ratingValue']/@content",
}
}
The [not(contains(@class,'original'))] predicate pattern is critical for price extraction — most e-commerce pages render both current and struck-through original prices in adjacent elements, and positional selectors like p.price:first-child are fragile. This XPath expression explicitly excludes nodes whose class names indicate they carry the original price.
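To see the anchor/field pattern in action, the extraction loop can be exercised inline against hypothetical markup (the HTML and the field subset here are illustrative, not from any real site):

```python
import lxml.html as html

doc = html.fromstring("""
<article data-product-id="p-88">
  <h2 class="product-name">Desk Lamp</h2>
  <span class="price">$35.00</span>
  <span class="price original">$49.00</span>
  <img src="/img/lamp.jpg">
</article>
""")

anchor = "//article[@data-product-id]"
fields = {
    "name": "descendant::*[contains(@class,'name') or contains(@class,'title')][1]/text()",
    "price": "descendant::*[contains(@class,'price') and not(contains(@class,'original'))][1]/text()",
    "product_id": "@data-product-id",
}

records = []
for node in doc.xpath(anchor):
    rec = {}
    for field, expr in fields.items():
        vals = node.xpath(expr)
        rec[field] = vals[0].strip() if vals else None
    records.append(rec)

print(records)
# [{'name': 'Desk Lamp', 'price': '$35.00', 'product_id': 'p-88'}]
```

Note that the price field correctly skips the struck-through original price despite both spans carrying the `price` class.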
Iframe XPath Scraping with Playwright
Iframes represent a hard boundary for DOM traversal. An XPath expression evaluated against the top-level document cannot reach nodes inside an iframe — even if the iframe is same-origin. You must switch execution context into the iframe document, then apply XPath expressions there.
import asyncio
from playwright.async_api import async_playwright
import lxml.html as html
async def scrape_iframe_content(url: str) -> list[dict]:
"""
Demonstrates iframe context switching for XPath scraping.
Prerequisites: playwright install chromium
Common use case: pricing tables, embedded review widgets,
payment forms, and third-party ad slots loaded via iframe.
"""
async with async_playwright() as pw:
browser = await pw.chromium.launch(headless=True)
context = await browser.new_context(
user_agent=(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
)
)
page = await context.new_page()
await page.goto(url, wait_until="networkidle")
results = []
# Strategy 1: Target iframe by its src domain (most resilient)
for frame in page.frames:
if "reviews-widget" in frame.url or "embedded-data" in frame.url:
# Wait for iframe content to be interactive
try:
await frame.wait_for_selector("div.review-card", timeout=10_000)
                except Exception:  # timeout — iframe never rendered the cards
                    continue
iframe_html = await frame.content()
# Now use lxml XPath scraping on the extracted iframe document
tree = html.fromstring(iframe_html)
# XPath expressions run against the iframe's own DOM
reviews = tree.xpath("//div[contains(@class,'review-card')]")
for review in reviews:
author = review.xpath(
"descendant::span[@class='reviewer-name']/text()"
)
rating = review.xpath(
"descendant::*[@data-rating]/@data-rating"
)
body = review.xpath(
"normalize-space(descendant::p[@class='review-body'])"
)
results.append({
"author": author[0] if author else "",
"rating": rating[0] if rating else "",
"body": body,
})
# Strategy 2: Using Playwright's frame_locator API
# frame_locator accepts CSS selector targeting the <iframe> element
try:
frame_loc = page.frame_locator("iframe[title='Review Widget']")
# Mix Playwright's locator with XPath expressions
cards = frame_loc.locator("xpath=//div[@class='review-card']")
count = await cards.count()
for i in range(count):
card = cards.nth(i)
author_el = card.locator("xpath=.//span[@class='reviewer-name']")
author_text = await author_el.inner_text()
results.append({"author": author_text.strip()})
        except Exception:
            pass  # frame not found — log and continue
await browser.close()
return results
# asyncio.run(scrape_iframe_content("https://example.com/product/123"))
The frame_locator("iframe[title='Review Widget']") approach is more stable than iterating over page.frames by URL because title attributes on iframes are typically set by the embedding site and are semantically meaningful. If neither URL nor title is reliable, use the iframe’s position-in-DOM as a last resort: page.frame_locator("iframe:nth-of-type(2)").
Video Blob URL Extraction: Network Interception + XPath
Blob URLs (blob:https://...) are object references created in-page with URL.createObjectURL(), typically wrapping a MediaSource object from the browser’s Media Source Extensions API. They are transient, process-local, and non-resolvable over HTTP: you cannot extract a blob URL and play it in another browser tab, let alone fetch it from Python. However, the underlying media manifest URLs — HLS .m3u8 files or DASH .mpd files — that feed the blob stream are fully accessible via network interception.
XPath scraping handles the metadata layer (player attributes, video element data attributes, poster images). Network interception captures the actual stream URLs.
import asyncio
import re
import json
from playwright.async_api import async_playwright, Request
import lxml.html as html
async def extract_video_metadata_and_stream(url: str) -> dict:
"""
Two-layer video extraction strategy:
Layer 1: XPath scraping against DOM for video metadata
Layer 2: Playwright request interception for HLS/DASH manifest URLs
Handles: YouTube embeds, Vimeo, JW Player, Video.js, and custom MSE players.
Prerequisites:
- pip install playwright
- playwright install chromium
"""
intercepted_manifests = []
intercepted_api_calls = []
async def handle_request(request: Request):
req_url = request.url
# Intercept HLS manifests
if re.search(r'\.(m3u8)(\?|$)', req_url):
intercepted_manifests.append({
"type": "hls",
"url": req_url,
"headers": dict(request.headers),
})
# Intercept DASH manifests
elif re.search(r'\.(mpd)(\?|$)', req_url):
intercepted_manifests.append({
"type": "dash",
"url": req_url,
})
# Intercept video API calls that return stream tokens
elif re.search(r'/(video|stream|playback|manifest)/', req_url):
intercepted_api_calls.append(req_url)
async with async_playwright() as pw:
browser = await pw.chromium.launch(headless=True)
context = await browser.new_context()
page = await context.new_page()
# Wire the request interceptor BEFORE navigation
page.on("request", handle_request)
await page.goto(url, wait_until="networkidle", timeout=45_000)
# Allow the player to initialize and start loading the stream
await asyncio.sleep(3)
# XPath scraping layer: extract all video element metadata from DOM
page_html = await page.content()
tree = html.fromstring(page_html)
video_metadata = {}
# XPath expressions for common video player attribute patterns
# Each targets a different player framework's data model
# Standard HTML5 video element
video_src = tree.xpath("//video/@src")
video_poster = tree.xpath("//video/@poster")
# JW Player configuration
jw_data = tree.xpath(
"//div[contains(@class,'jwplayer') or @id='jwplayer']/@data-config"
)
# Video.js player
videojs_src = tree.xpath(
"//video[contains(@class,'video-js')]/@data-setup"
)
# Custom data attributes used by proprietary players
# These vary — XPath's wildcard attribute search is useful here
custom_stream = tree.xpath(
"//*[@data-hls-url or @data-stream-url or @data-video-src]"
"/@*[contains(name(),'url') or contains(name(),'src')]"
)
        # Thumbnail/poster extraction for correlating videos with their posters
poster_images = tree.xpath(
"//video/@poster | //img[contains(@class,'video-thumb')]/@src"
" | //img[contains(@class,'thumbnail')]/@data-src"
)
video_metadata = {
"direct_src": video_src,
"poster": video_poster or poster_images,
"jw_config": jw_data,
"videojs_config": videojs_src,
"custom_stream_attrs": custom_stream,
}
await browser.close()
return {
"page_url": url,
"metadata": video_metadata,
"hls_manifests": intercepted_manifests,
"api_calls": intercepted_api_calls,
}
The XPath expression //*[@data-hls-url or @data-stream-url or @data-video-src]/@*[contains(name(),'url') or contains(name(),'src')] is worth examining. It uses name() — a node-set function that returns the attribute name as a string — to match any attribute whose name contains ‘url’ or ‘src’, regardless of what element carries it. This is an intentionally broad sweep useful when you encounter a proprietary player you have not seen before.
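A minimal, self-contained demonstration of that sweep (the attribute names and URLs are illustrative):

```python
import lxml.html as html

doc = html.fromstring("""
<div data-player="custom"
     data-hls-url="https://cdn.example.com/v/master.m3u8"
     data-poster-src="https://cdn.example.com/v/poster.jpg"></div>
""")

# name() filters on the *attribute name*, not its value — any attribute
# whose name mentions 'url' or 'src' is swept up, whatever element carries it
attrs = doc.xpath(
    "//*[@data-hls-url or @data-stream-url or @data-video-src]"
    "/@*[contains(name(),'url') or contains(name(),'src')]"
)
print(attrs)  # the manifest URL and the poster URL; data-player is skipped
```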
Recommended reading: DataFlirt’s Best Approaches to Scraping Dynamic JavaScript Sites Without Getting Blocked covers the full Playwright configuration stack that pairs with the network interception pattern above.
Image Extraction: srcset, Lazy-Load, and data-src Patterns
Modern image delivery uses multiple attributes simultaneously — src, srcset, data-src, data-lazy-src, and WebP source sets inside <picture> elements. Comprehensive image XPath scraping must handle all variants.
import lxml.html as html
from dataclasses import dataclass
from typing import Optional, List
@dataclass
class ImageRecord:
src: Optional[str]
srcset: Optional[str]
data_src: Optional[str]
alt: Optional[str]
width: Optional[str]
height: Optional[str]
loading: Optional[str]
best_url: Optional[str] # Resolved best-quality URL
def extract_images_comprehensive(tree) -> List[ImageRecord]:
"""
Comprehensive image extraction using XPath expressions that cover:
- Standard src attribute
- Responsive srcset with multiple resolutions
- Lazy-loaded images using data-src, data-lazy, data-original
- Picture element with source sets
- WebP source elements with type fallbacks
XPath scraping strategy: extract all attributes at once per element,
then resolve the best-quality URL in Python post-processing.
"""
images = []
# XPath expression that finds ALL image-bearing elements:
# regular img tags, img inside picture, and noscript-wrapped lazy images
img_elements = tree.xpath("""
//img[
@src or @data-src or @data-lazy or @data-original
or @srcset or @data-srcset
]
| //noscript[.//img]/img
| //picture/img
""")
for img in img_elements:
# Extract every image-related attribute in one pass
src = img.get("src", "").strip()
srcset = img.get("srcset", img.get("data-srcset", "")).strip()
# Lazy-load attribute cascade — different frameworks use different names
data_src = (
img.get("data-src") or
img.get("data-lazy") or
img.get("data-original") or
img.get("data-lazy-src") or
img.get("data-defer-src")
)
# Skip base64 placeholder images (common with lazy-load skeletons)
if src.startswith("data:image") and not data_src:
continue
# Resolve best URL: data-src > srcset (highest res) > src
best_url = data_src or _resolve_highest_srcset(srcset) or src or None
images.append(ImageRecord(
src=src or None,
srcset=srcset or None,
data_src=data_src,
alt=img.get("alt"),
width=img.get("width"),
height=img.get("height"),
loading=img.get("loading"),
best_url=best_url,
))
# Also extract WebP sources from <picture> elements
# The XPath expression traverses to parent <picture> then finds <source>
for source in tree.xpath("//picture/source[@srcset]"):
srcset_val = source.get("srcset", "")
img_type = source.get("type", "")
# Add high-quality WebP sources as supplementary records
if "webp" in img_type and srcset_val:
images.append(ImageRecord(
src=None,
srcset=srcset_val,
data_src=None,
alt=None,
width=source.get("width"),
height=source.get("height"),
loading=None,
best_url=_resolve_highest_srcset(srcset_val),
))
return images
def _resolve_highest_srcset(srcset: str) -> Optional[str]:
"""
Parses a srcset string and returns the URL with the highest
width descriptor (e.g., "img-400.jpg 400w, img-800.jpg 800w" → img-800.jpg).
Falls back to the first URL if no width descriptor is present.
"""
if not srcset:
return None
candidates = []
for entry in srcset.split(","):
parts = entry.strip().split()
if len(parts) == 2:
url, descriptor = parts
try:
width = int(descriptor.rstrip("wx"))
candidates.append((width, url))
except ValueError:
continue
elif len(parts) == 1:
candidates.append((0, parts[0]))
if not candidates:
return None
candidates.sort(key=lambda x: x[0], reverse=True)
return candidates[0][1]
XPath Namespace Handling in Python
XML namespaces are the most common reason engineers abandon XPath scraping for API responses. The symptom: your XPath returns an empty list even though the node clearly exists in the document. The cause: the document uses a default namespace and your XPath expression ignores it.
import lxml.etree as etree
from io import BytesIO
# A realistic namespaced XML response — common from e-commerce APIs,
# real estate data feeds, and government open data portals
NAMESPACED_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
xmlns:commerce="http://schemas.example.com/commerce/v2"
xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
<entry>
<title>Laptop Pro 16 M4</title>
<commerce:price currency="USD">1999.00</commerce:price>
<commerce:stock>42</commerce:stock>
<geo:lat>37.7749</geo:lat>
<geo:long>-122.4194</geo:long>
<link href="https://example.com/product/laptop-pro-16"/>
</entry>
<entry>
<title>Wireless Earbuds X3</title>
<commerce:price currency="USD">149.00</commerce:price>
<commerce:stock>0</commerce:stock>
<geo:lat>40.7128</geo:lat>
<geo:long>-74.0060</geo:long>
<link href="https://example.com/product/earbuds-x3"/>
</entry>
</feed>
"""
def extract_namespaced_feed(xml_bytes: bytes) -> list[dict]:
"""
Two strategies for namespace-aware XPath expressions:
Strategy A: Explicit namespace mapping — correct, performant, maintainable.
Strategy B: local-name() bypass — quick prototype, breaks on name collisions.
"""
tree = etree.parse(BytesIO(xml_bytes))
# Strategy A: Explicit namespace map
# Register prefixes that map to the document's namespace URIs
# The prefix names you choose (atom, commerce, geo) are arbitrary —
# they only need to match the URIs in the document
nsmap = {
"atom": "http://www.w3.org/2005/Atom",
"commerce": "http://schemas.example.com/commerce/v2",
"geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
}
results = []
# lxml XPath with explicit namespace map — the production pattern
entries = tree.xpath("//atom:entry", namespaces=nsmap)
for entry in entries:
title = entry.xpath("atom:title/text()", namespaces=nsmap)
price = entry.xpath("commerce:price/text()", namespaces=nsmap)
currency = entry.xpath("commerce:price/@currency", namespaces=nsmap)
stock = entry.xpath("commerce:stock/text()", namespaces=nsmap)
link = entry.xpath("atom:link/@href", namespaces=nsmap)
lat = entry.xpath("geo:lat/text()", namespaces=nsmap)
lon = entry.xpath("geo:long/text()", namespaces=nsmap)
results.append({
"title": title[0] if title else "",
"price": float(price[0]) if price else None,
"currency": currency[0] if currency else "",
"stock": int(stock[0]) if stock else 0,
"url": link[0] if link else "",
"lat": float(lat[0]) if lat else None,
"lon": float(lon[0]) if lon else None,
})
# Strategy B: local-name() bypass — useful for rapid prototyping
# when namespace URIs are unknown or inconsistent across responses
entries_b = tree.xpath("//*[local-name()='entry']")
prices_b = tree.xpath(
"//*[local-name()='entry']/*[local-name()='price']/text()"
)
return results
print(extract_namespaced_feed(NAMESPACED_XML))
The local-name() strategy deserves a direct warning: in documents that mix multiple namespaces where elements from different namespaces share the same local name, it will silently match the wrong nodes. Always use explicit namespace mappings in production XPath scraping pipelines. Use local-name() only as a diagnostic tool.
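The failure is easy to reproduce with two vocabularies that happen to share a local name (the namespace URIs here are made up for illustration):

```python
import lxml.etree as etree

# Two namespaces both define an element locally named 'price'
xml = b"""<root xmlns:a="urn:vendor-a" xmlns:b="urn:vendor-b">
  <a:price>10.00</a:price>
  <b:price>99.99</b:price>
</root>"""
tree = etree.fromstring(xml)

# local-name() matches BOTH vocabularies — a silent source of wrong data
loose = tree.xpath("//*[local-name()='price']/text()")

# An explicit namespace map pins the match to one vocabulary
strict = tree.xpath("//a:price/text()", namespaces={"a": "urn:vendor-a"})
print(loose)   # ['10.00', '99.99']
print(strict)  # ['10.00']
```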
XPath Scraping in Scrapy with parsel
Scrapy’s parsel library is the production wrapper for XPath scraping in Python’s most widely deployed scraping framework. It exposes lxml XPath with a cleaner API and adds CSS selector support on the same Selector object.
# spiders/advanced_xpath_spider.py
import scrapy
from parsel import Selector
import json
class AdvancedXPathSpider(scrapy.Spider):
name = "xpath_demo"
start_urls = ["https://example.com/catalogue/"]
custom_settings = {
"CONCURRENT_REQUESTS": 32,
"DOWNLOAD_DELAY": 0.8,
"AUTOTHROTTLE_ENABLED": True,
"ROBOTSTXT_OBEY": True,
}
    def parse(self, response):
        # parsel's .xpath() is lxml XPath with a cleaner return API
        # .get() returns first result or None; .getall() returns all as list
        # Handle structured data in JSON-LD (very common in 2026 e-commerce).
        # Parse it once per page — not once per product — since a single
        # Product block cannot describe every item on a listing page.
        structured_data = None
        for ld_block in response.xpath(
            "//script[@type='application/ld+json']/text()"
        ).getall():
            try:
                data = json.loads(ld_block)
                if isinstance(data, dict) and data.get("@type") == "Product":
                    structured_data = data
                    break
            except json.JSONDecodeError:
                continue
        # XPath expressions anchored on stable data-* attributes
        for product in response.xpath("//article[@data-product-id]"):
            # Relative XPath from the article anchor
            name = product.xpath(
                "descendant::*[self::h1 or self::h2 or self::h3]"
                "[not(ancestor::nav)][1]/text()"
            ).get("").strip()
            # Price with disambiguation between current and original
            current_price = product.xpath(
                ".//span[contains(@class,'price')"
                " and not(contains(@class,'original'))"
                " and not(contains(@class,'was'))]/text()"
            ).get("").strip()
            # XPath + structured data hybrid extraction:
            # per-product XPath leads; page-level JSON-LD fills the gaps
            yield {
                "url": response.url,
                "name": name or (
                    structured_data.get("name") if structured_data else ""
                ),
                "price": current_price or (
                    structured_data.get("offers", {}).get("price", "")
                    if structured_data else ""
                ),
                "product_id": product.xpath("@data-product-id").get(""),
            }
        # Pagination — following-sibling axis from the active page item
        # Prefers the rel='next' link over positional selectors
        next_page = response.xpath(
            "//a[@rel='next']/@href"
            " | //li[contains(@class,'active')]"
            "/following-sibling::li[contains(@class,'page-item')][1]"
            "/a/@href"
).get()
if next_page:
yield response.follow(next_page, callback=self.parse)
The JSON-LD hybrid extraction pattern is increasingly important for 2026 XPath scraping. E-commerce and news sites embed application/ld+json blocks carrying machine-readable product, article, and event data per Google’s structured data guidelines. On a product detail page, parsing this block is preferable to DOM traversal — it is authoritative, stable, and schema-documented. On listing pages, where a single Product block cannot describe every item in the grid, treat JSON-LD as a fallback or cross-check and let per-product XPath scraping lead.
Recommended reading: For the full Scrapy middleware and pipeline architecture that this spider runs inside, see DataFlirt’s Best Scraping Tools for Python Developers in 2026.
LLM-Augmented XPath Generation
The most significant productivity acceleration in XPath scraping workflows in 2025–2026 is using large language models to generate initial XPath expressions from raw HTML. Rather than manually inspecting the DOM, you pass a representative HTML snippet with a data extraction goal and receive syntactically correct XPath expressions that you validate and deploy.
The failure mode is well-documented: LLMs generate positional predicates (div[3], tr[1]) that are brittle. Prompt engineering that emphasises resilience mitigates this.
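Prompting alone does not guarantee compliance, so it is worth lint-checking generated expressions before validation. The sketch below is a heuristic regex check (names are my own, not from any library) that flags numeric positional predicates while tolerating [1], which is often a legitimate "first of an already-filtered node set" pattern.

```python
import re

# Matches tag[3]-style positional predicates (including position()=3),
# but deliberately skips [1], which often just picks the first match
# of an already-filtered node set.
POSITIONAL = re.compile(r"\w+\[\s*(?:position\(\)\s*=\s*)?([2-9]|\d{2,})\s*\]")


def flag_positional_predicates(xpath_map: dict) -> dict:
    """Return the subset of generated expressions that rely on positional indices."""
    return {
        field: expr
        for field, expr in xpath_map.items()
        if POSITIONAL.search(expr)
    }
```

Any flagged field is a candidate for regeneration with a stricter prompt, or for manual replacement with an attribute-anchored expression.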
Google Gemini 3.1 via GenAI SDK — API Mode
# Prerequisites: pip install google-genai
# Set env: GOOGLE_API_KEY=your_key
from google import genai
from google.genai import types
import json

client = genai.Client()  # uses GOOGLE_API_KEY env var

XPATH_SYSTEM_PROMPT = """You are an expert in XPath scraping and HTML parsing.
Given an HTML snippet and a list of fields to extract, you generate resilient
XPath expressions that:
1. Avoid positional predicates (div[3], tr[2]) unless absolutely necessary
2. Anchor to stable semantic attributes (data-*, aria-*, id, role)
3. Use contains() for partial class matching
4. Use normalize-space() for text-bearing elements
5. Use the following-sibling or ancestor axis when label-value pairs are present
Return ONLY a JSON object with field names as keys and XPath expressions as values.
No explanation, no markdown fences."""


def generate_xpath_with_gemini(html_snippet: str, fields: list[str]) -> dict:
    """
    Generates XPath expressions using Gemini 3.1 Flash for cost efficiency.
    Use gemini-3.1-pro for more complex DOM structures.
    The response is validated before being used in production.
    """
    prompt = f"""HTML snippet:
{html_snippet[:15000]}
Extract these fields: {', '.join(fields)}
Return a JSON object with each field name as a key and its XPath expression as the value."""
    response = client.models.generate_content(
        model="gemini-3.1-flash",  # Use gemini-3.1-pro for complex DOMs
        contents=prompt,  # the SDK accepts a plain string directly
        config=types.GenerateContentConfig(
            system_instruction=XPATH_SYSTEM_PROMPT,
            response_mime_type="application/json",
            temperature=0.1,  # Low temperature for deterministic structural output
        ),
    )
    try:
        return json.loads(response.text)
    except json.JSONDecodeError:
        # Strip any accidental markdown fences
        clean = response.text.strip()
        clean = clean.removeprefix("```json").removeprefix("```").removesuffix("```")
        return json.loads(clean.strip())
Google Vertex AI — GenAI SDK Mode (Enterprise)
# Prerequisites: pip install google-genai google-auth
# Set env: GOOGLE_CLOUD_PROJECT=your-project-id
# Authenticate: gcloud auth application-default login
from google import genai as vertex_genai
from google.genai import types as vertex_types
import json

# Use the Vertex AI backend via the unified GenAI SDK — no separate
# vertexai.init() call is needed; project and location go on the client
vertex_client = vertex_genai.Client(
    vertexai=True,
    project="your-gcp-project-id",
    location="us-central1",
)


def generate_xpath_vertex(html_snippet: str, fields: list[str]) -> dict:
    """
    Vertex AI mode — identical API surface to the standard GenAI SDK
    but routes through your GCP project for enterprise billing and compliance.
    Useful when scraping pipelines run inside GCP infrastructure.
    """
    prompt = f"""HTML:\n{html_snippet[:15000]}\n\nFields to extract: {', '.join(fields)}\n
Return ONLY a JSON object mapping field names to XPath expressions.
Use ancestor/sibling axes where label-value pairs are present.
Prefer contains() over exact class matching."""
    response = vertex_client.models.generate_content(
        model="gemini-3.1-pro",  # Pro tier for complex DOM structures in enterprise context
        contents=prompt,
        config=vertex_types.GenerateContentConfig(
            system_instruction=XPATH_SYSTEM_PROMPT,
            response_mime_type="application/json",
            temperature=0.1,
        ),
    )
    return json.loads(response.text)
Claude Sonnet and Opus via Anthropic SDK
# Prerequisites: pip install anthropic
# Set env: ANTHROPIC_API_KEY=your_key
import anthropic
import json

anthropic_client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY


def generate_xpath_claude(
    html_snippet: str,
    fields: list[str],
    use_opus: bool = False,
) -> dict:
    """
    Claude-based XPath expression generation.

    Use claude-opus-4-6 for highly complex DOMs with deeply nested
    conditional rendering, shadow DOM hints, or unusual structural patterns.
    Use claude-sonnet-4-6 (default) for standard e-commerce and content
    extraction tasks — significantly faster and more cost-efficient.

    Claude's strength relative to Gemini for this task: stronger reasoning
    about DOM structure semantics, better at proposing ancestor-axis solutions,
    and more reliable at avoiding positional predicates when prompted.
    """
    model = "claude-opus-4-6" if use_opus else "claude-sonnet-4-6"
    message = anthropic_client.messages.create(
        model=model,
        max_tokens=1500,
        system=XPATH_SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": (
                f"HTML snippet:\n{html_snippet[:20000]}\n\n"
                f"Extract these fields using resilient XPath expressions: "
                f"{', '.join(fields)}\n\n"
                "Return ONLY a JSON object. "
                "Use ancestor and sibling axes where applicable. "
                "Avoid positional predicates."
            ),
        }],
    )
    raw_text = message.content[0].text
    try:
        return json.loads(raw_text)
    except json.JSONDecodeError:
        # Claude occasionally wraps JSON in markdown despite instructions
        clean = raw_text.strip()
        clean = clean.removeprefix("```json").removeprefix("```").removesuffix("```")
        return json.loads(clean.strip())
def validate_xpath_batch(html: str, xpath_map: dict) -> dict:
    """
    Validates LLM-generated XPath expressions against a known HTML sample.
    Returns a report of which expressions produce results and which fail silently.
    Run this BEFORE deploying LLM-generated XPath to production.
    """
    import lxml.html as lxml_html

    tree = lxml_html.fromstring(html)
    validation_report = {}
    for field, xpath_expr in xpath_map.items():
        try:
            result = tree.xpath(xpath_expr)
            validation_report[field] = {
                "xpath": xpath_expr,
                "status": "ok" if result else "empty",
                "result_count": len(result),
                "sample": str(result[0])[:100] if result else None,
            }
        except Exception as e:
            validation_report[field] = {
                "xpath": xpath_expr,
                "status": "error",
                "error": str(e),
            }
    return validation_report
The validate_xpath_batch() function is not optional. LLMs generate syntactically valid XPath expressions that return empty results on the actual document more often than they generate syntactically invalid ones. An empty result is a silent failure in a scraping pipeline. Always validate against a ground-truth sample before promotion to production.
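A natural way to operationalise this is a promotion gate that consumes the validate_xpath_batch() report and refuses any map with empty or erroring required fields. The sketch below assumes the report dict shape shown above; promote_if_valid is a hypothetical helper name.

```python
def promote_if_valid(report: dict, required_fields: set[str]) -> tuple[bool, list[str]]:
    """
    Decide whether an LLM-generated XPath map is safe to promote to production.
    `report` is the output of validate_xpath_batch(); promotion requires every
    required field to have status 'ok' (non-empty result, no evaluation error).
    Returns (promote?, sorted list of failing field names).
    """
    failures = [
        field for field in required_fields
        if report.get(field, {}).get("status") != "ok"
    ]
    return (len(failures) == 0, sorted(failures))
```

Wiring this between generation and deployment turns silent empty-result failures into hard, loggable rejections.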
Recommended reading: DataFlirt’s Best Scraping Tools Powered by LLMs in 2026 covers the broader LLM-augmented extraction landscape, including evaluation frameworks for comparing LLM selector generation quality across model families.
Debugging XPath Expressions in Production
Silent failures — empty results from valid XPath expressions — are the most time-consuming debugging scenario in XPath scraping. The following diagnostic toolkit covers the three most common root causes.
import lxml.html as html
import lxml.etree as etree


def diagnose_xpath(html_string: str, failing_xpath: str) -> dict:
    """
    Systematic XPath expression debugger.
    Common failure modes covered:
    1. Whitespace in class attributes causing contains() mismatch
    2. Namespace pollution from embedded SVG or MathML
    3. HTML parser normalisation changing attribute casing
    4. Encoding issues producing entity characters in text nodes
    5. Phantom whitespace-only text nodes
    """
    tree = html.fromstring(html_string)
    diagnostics = {
        "original_xpath": failing_xpath,
        "original_result": tree.xpath(failing_xpath),
    }

    # Diagnostic 1: Namespace pollution
    # If the document contains SVG or MathML, lxml may inject namespace prefixes
    # that interfere with XPath DOM traversal. Guard against comment/PI nodes,
    # whose .tag is not a string.
    all_namespaces = {
        el.tag.split("}")[0].strip("{")
        for el in tree.iter()
        if isinstance(el.tag, str) and "}" in el.tag
    }
    diagnostics["detected_namespaces"] = list(all_namespaces)
    if all_namespaces:
        diagnostics["namespace_note"] = (
            "Document contains non-HTML namespaces. If your XPath targets elements "
            "in these namespaces, you must register them explicitly or use local-name()."
        )

    # Diagnostic 2: Whitespace in class attributes
    # Trailing spaces in class attributes cause contains(@class, 'foo') to mismatch
    # when the class value is ' foo ' or 'foo bar '
    sample_elements = tree.xpath("//*[@class]")[:5]
    diagnostics["class_samples"] = [
        {"tag": el.tag, "class": repr(el.get("class"))}
        for el in sample_elements
    ]

    # Diagnostic 3: Serialise the subtree for visual inspection
    # This reveals what the parser actually ingested vs what you assume
    # Particularly useful when scraping sites with malformed HTML
    if failing_xpath.startswith("//"):
        # Take the first location step as the likely parent context
        parent_tag = failing_xpath.split("/")[2].split("[")[0].split("::")[-1]
        parent_nodes = tree.xpath(f"//{parent_tag}")[:2]
        diagnostics["parent_node_serialised"] = [
            etree.tostring(n, pretty_print=True).decode()[:500]
            for n in parent_nodes
        ]
    return diagnostics


# Interactive debugging helper for parsel/Scrapy development
def xpath_test_harness(html_string: str, xpaths: dict) -> None:
    """
    Quick multi-expression test harness. Pass a dict of name → XPath expression
    and see results for all at once. Useful during spider development.
    """
    tree = html.fromstring(html_string)
    print(f"{'Field':<25} {'Status':<8} {'Sample Result'}")
    print("-" * 80)
    for name, expr in xpaths.items():
        try:
            result = tree.xpath(expr)
            if result:
                sample = str(result[0])[:50].strip()
                print(f"{name:<25} {'OK':<8} {sample}")
            else:
                print(f"{name:<25} {'EMPTY':<8} (no nodes matched)")
        except etree.XPathError as e:
            print(f"{name:<25} {'ERROR':<8} {str(e)[:50]}")
Performance Considerations: lxml vs selectolax vs Alternatives
For pipelines where XPath scraping is the primary parse operation at volume, the parser choice matters. lxml XPath at scale on a single CPU core processes approximately 800–1,200 medium-complexity HTML documents per second. For purely structural extraction without XPath, selectolax’s CSS API is 8–12x faster — but it does not expose XPath.
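Throughput figures like these depend heavily on document size and selector mix, so measure against your own corpus before committing to a parser. The harness below is a minimal sketch: it accepts any parse-and-extract callable (an lxml-based one, a selectolax-based one) and reports documents per second, taking the best of several runs to reduce scheduler noise. The function name is my own.

```python
import time
from typing import Callable


def docs_per_second(parse_fn: Callable[[str], object],
                    documents: list[str],
                    repeats: int = 3) -> float:
    """
    Measure sustained throughput of a parse-and-extract callable over a corpus.
    Takes the fastest of `repeats` full passes to smooth out OS jitter.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in documents:
            parse_fn(doc)
        best = min(best, time.perf_counter() - start)
    return len(documents) / best if best > 0 else float("inf")
```

Run it once with your lxml extraction function and once with a selectolax equivalent over the same saved HTML corpus; the ratio you observe on your documents is the only benchmark that matters.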
The engineering decision tree:
- Need XPath expressions with axis traversal → lxml. No alternative.
- CSS selectors sufficient, need maximum throughput → selectolax.
- Need both XPath and CSS on same object → parsel (Scrapy’s wrapper over lxml).
- Need XPath on JS-rendered content → Playwright to render, lxml to parse the resulting HTML.
- Processing millions of small XML documents → lxml's iterparse() in streaming mode — avoids loading the full document tree into memory.
# Streaming lxml for large XML feeds — avoids OOM on 100MB+ feeds
import lxml.etree as etree


def stream_large_xml_feed(xml_path: str, target_tag: str, nsmap: dict) -> list[dict]:
    """
    iterparse() for memory-efficient XPath scraping of large XML documents.
    Processes one element at a time; clears processed elements to keep memory flat.
    """
    results = []
    # Resolve tag with namespace prefix
    ns_uri = nsmap.get(target_tag.split(":")[0]) if ":" in target_tag else None
    local = target_tag.split(":")[-1]
    qualified_tag = f"{{{ns_uri}}}{local}" if ns_uri else local

    context = etree.iterparse(xml_path, events=("end",), tag=qualified_tag)
    for event, elem in context:
        # Extract data using XPath against this single element
        titles = elem.xpath("atom:title/text()", namespaces=nsmap)
        results.append({
            "title": titles[0] if titles else "",
            "id": elem.get("id", ""),
        })
        # Critical: clear processed elements and their preceding siblings
        # Without this, lxml accumulates the full document in memory
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    return results
Production XPath Scraping: The Complete Pattern Summary
After covering the full range of XPath scraping techniques, here is the synthesised production decision framework DataFlirt’s engineering team uses when building extraction pipelines:
Always anchor to semantic attributes, not positional indices. @data-product-id, @id, @role, @aria-label, and @itemprop survive front-end refactors. div[3] does not.
Use parameterised XPath (keyword arguments to lxml’s .xpath()) for any expression that incorporates user input or variable data. It prevents injection and enables compiled-expression caching.
Combine JSON-LD extraction with XPath scraping as a hybrid pipeline. Structured data blocks are the authoritative source when present. XPath is the reliable fallback.
Validate LLM-generated XPath expressions against a ground-truth sample before promotion. Silent empty-result failures are more dangerous than syntax errors.
Use iterparse() for XML feeds exceeding 10MB. Loading the full document tree for streaming data is a memory management error, not a performance trade-off.
Wrap text comparisons in normalize-space() in all text predicates. HTML source whitespace is inconsistent and parser-dependent. Always normalise before matching.
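On the parameterisation point above: lxml supports XPath variables directly via keyword arguments, e.g. tree.xpath("//*[@data-sku = $sku]", sku=user_value), and that is the preferred route. When you must target an engine without variable support, the fallback is a quote-safe literal builder, since XPath 1.0 has no escape sequences inside string literals. The sketch below uses the standard concat() trick; xpath_string_literal is a hypothetical helper name.

```python
def xpath_string_literal(value: str) -> str:
    """
    Embed an arbitrary Python string as a safe XPath 1.0 string literal.
    XPath 1.0 cannot escape quotes, so values containing both quote
    characters must be assembled with concat().
    """
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    # Split on single quotes and rejoin the pieces, inserting "'" segments
    parts = value.split("'")
    pieces = []
    for i, part in enumerate(parts):
        if i > 0:
            pieces.append('"\'"')  # a double-quoted single quote
        if part:
            pieces.append(f"'{part}'")
    return f"concat({', '.join(pieces)})"
```

The result drops straight into an expression string, e.g. f"//span[text() = {xpath_string_literal(raw)}]", without the injection risk of naive f-string interpolation.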
Recommended reading: To complete your production scraping stack alongside the XPath scraping techniques in this guide, see DataFlirt’s Top 5 Scraping Tools for Extracting Structured Data with CSS and XPath for a comparative tool evaluation, and Best Databases for Storing Scraped Data at Scale for pipeline output architecture.
Further Reading from DataFlirt
Engineering teams extending their XPath scraping infrastructure into production deployments will find these guides directly relevant:
- Best Free Web Scraping Tools for Developers in 2026 — Covers the full open-source ecosystem including Scrapy, Playwright, lxml, and Camoufox integration patterns that sit upstream of your XPath extraction layer
- Best Approaches to Scraping Dynamic JavaScript Sites Without Getting Blocked — Playwright rendering strategies that produce the DOM your XPath expressions operate against
- Top 7 Anti-Fingerprinting Tools Every Scraper Should Know About — Infrastructure-level evasion that determines whether your scraper reaches the DOM in the first place
- 7 Reasons Your Scraper Keeps Getting Blocked and the Tools to Fix Each One — Diagnostic framework for production pipeline failures that surface above the XPath layer
- Best Scraping Tools for Python Developers in 2026 — Broader Python scraping ecosystem context for the lxml, parsel, and Scrapy tools used throughout this guide
- Top 5 Scraping Compliance and Legal Considerations Every Scraper Should Know — Legal framework for operating production scraping pipelines responsibly
Frequently Asked Questions
When should I choose XPath over CSS selectors for web scraping?
Choose XPath scraping over CSS selectors when you need to navigate up the DOM (parent or ancestor axes), when you need text content matching in predicates, when the target document carries XML namespaces, or when positional logic relative to document structure is required. CSS selectors cannot traverse upward and have no equivalent to XPath’s string-function predicates, making XPath the only option for a significant class of complex DOM traversal problems.
What is the best Python library for XPath scraping in 2026?
lxml remains the fastest and most complete XPath 1.0 implementation in Python. For Scrapy pipelines, parsel wraps lxml with a cleaner API. For JavaScript-rendered pages, Playwright handles rendering while lxml or parsel processes the HTML output. Avoid BeautifulSoup for XPath-heavy workloads — its find() API does not natively support XPath expressions.
Can XPath extract data from inside iframes?
XPath cannot cross iframe document boundaries. You must first switch execution context to the iframe using Playwright’s frame_locator() or frames property, then apply XPath expressions against the sub-document. In lxml processing, if you have already extracted the iframe’s HTML as a separate string, standard XPath scraping applies normally.
How do I handle XPath with XML namespaces when scraping?
Pass an explicit namespace mapping dict to lxml’s .xpath() call: tree.xpath('//ns:element', namespaces={'ns': 'http://example.com/ns'}). Alternatively, use local-name() to bypass namespace matching for rapid prototyping, but switch to explicit mappings for production pipelines to avoid cross-namespace collisions.
Is LLM-generated XPath reliable enough for production scraping?
LLM-generated XPath expressions via Gemini 3.1 or Claude Sonnet work well as a bootstrap tool. Always run the validate_xpath_batch() function against a ground-truth sample before deploying. The common failure mode is LLMs generating positional predicates that break on paginated content — counter this with prompt engineering that explicitly prohibits positional selectors.
Does reCAPTCHA v3 have a challenge to solve?
No — but this is relevant context for XPath scraping pipelines targeting Google properties. reCAPTCHA v3 operates invisibly and assigns a risk score. The correct response is fingerprint engineering and IP quality management, not challenge solving. See DataFlirt’s Google CAPTCHA bypass guide for the full evasion stack.