
What are Web Components?

Web components are a suite of browser APIs that allow developers to create custom, reusable HTML tags with encapsulated styling and markup. For scraping pipelines, they represent a significant extraction hurdle: their internal structure is hidden behind a Shadow DOM boundary, so standard CSS selectors and XPath queries run against the main document cannot reach it. If your scraper sees an empty custom tag where the data should be, you have likely hit a web component.

Shadow DOM · Custom Elements · Playwright · Extraction · DOM Traversal
// 02 — definitions

Piercing the
shadow boundary.

Why standard extraction logic fails on modern web apps, and how to traverse encapsulated DOM trees.


TL;DR

Web components encapsulate their internal DOM (the Shadow DOM) to prevent CSS and JS from leaking in or out. This breaks traditional scraping tools like BeautifulSoup or Cheerio that parse static HTML. To extract data from a web component, your pipeline must either use a headless browser with shadow-piercing selectors or execute client-side JavaScript to cross the boundary.

01 Definition & structure
Web components consist of three main technologies: Custom Elements (defining new HTML tags), Shadow DOM (encapsulating markup and style), and HTML Templates (reusable markup blocks). Together, they allow developers to build complex, isolated widgets that don't interfere with the rest of the page.
02 How it works in practice
When a browser encounters a custom element like <data-grid>, it executes the associated JavaScript class. This class attaches a shadow root to the element and populates it with HTML. To a static scraper fetching the raw HTTP response, <data-grid> is just an empty tag. The actual data only exists in the browser's memory after rendering.
03 The Shadow DOM barrier
The shadow boundary is designed specifically to prevent external CSS and JavaScript from interfering with the component. This is great for frontend modularity but terrible for scraping, as document.querySelectorAll stops at the boundary. You cannot select an element inside a shadow root from the main document context without specialized traversal logic.
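The "specialized traversal logic" mentioned above can be sketched as a recursive walk that descends into every open shadow root. This is a minimal illustration and not DataFlirt's implementation: `deepQueryAll` is a hypothetical helper name, and it relies only on the standard `children`, `shadowRoot`, and `matches` members. Closed roots, where `shadowRoot` is `null`, are skipped. Because `matches` tests one element at a time, the selector should be a simple one such as `.price` rather than a descendant chain that spans the boundary.

```javascript
// Hypothetical helper: collect every element matching `selector`,
// descending into open shadow roots as well as light-DOM children.
function deepQueryAll(root, selector, found = []) {
  for (const el of Array.from(root.children || [])) {
    // Element.matches() tests a single element against a CSS selector.
    if (typeof el.matches === "function" && el.matches(selector)) {
      found.push(el);
    }
    // element.shadowRoot is null for closed roots and for plain elements.
    if (el.shadowRoot) {
      deepQueryAll(el.shadowRoot, selector, found);
    }
    // Continue through the element's own (light DOM) children.
    deepQueryAll(el, selector, found);
  }
  return found;
}
```

In a browser context, for example inside a headless browser's evaluate call, `deepQueryAll(document, ".price")` would return the matching nodes from both the light DOM and any open shadow trees.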
04 How DataFlirt handles it
We don't rely on brittle JS evaluation scripts to cross boundaries. Our extraction workers use a flattened virtual DOM approach, allowing our schema engine to select elements inside shadow roots as if they were part of the standard light DOM. This keeps our extraction logic clean and our pipelines resilient to frontend refactors.
05 Did you know?
The >>> shadow-piercing CSS combinator was originally proposed as a web standard but was abandoned because it violated the core principle of encapsulation. Automation tools like Puppeteer and Playwright had to implement their own shadow-piercing selector engines to make end-to-end testing (and scraping) practical.
// 03 — the extraction math

How deep is
the shadow tree?

Traversing shadow boundaries adds computational overhead. DataFlirt's extraction engine calculates the traversal cost to optimize selector execution paths.

Shadow traversal cost: $T_{\text{cost}} = N_{\text{nodes}} \times O(\log D)$
D is the depth of nested shadow roots. Deeply nested components degrade extraction speed. Source: browser rendering engine heuristics.
Playwright pierce selector: "my-element >> css=.price"
The engine crosses the shadow boundary automatically. Slower than native CSS. Source: Playwright documentation.
DataFlirt extraction latency: $L_{\text{ext}} < 12\ \text{ms}$
Median time to extract 50+ fields across 4 nested shadow roots. Source: internal SLO.
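To reason about the D term above, it helps to measure it on a real page. The sketch below walks a DOM-like tree and reports how many open shadow roots it contains and how deeply they nest. `shadowStats` is a hypothetical helper, not part of any library, and it says nothing about the cost model itself.

```javascript
// Hypothetical helper: count open shadow roots and the maximum
// nesting depth D of shadow roots beneath `root`.
function shadowStats(root, depth = 0, stats = { roots: 0, maxDepth: 0 }) {
  for (const el of Array.from(root.children || [])) {
    if (el.shadowRoot) {
      stats.roots += 1;
      stats.maxDepth = Math.max(stats.maxDepth, depth + 1);
      // Everything inside this root sits one shadow level deeper.
      shadowStats(el.shadowRoot, depth + 1, stats);
    }
    // Light-DOM children stay at the current shadow depth.
    shadowStats(el, depth, stats);
  }
  return stats;
}
```

Run against `document` in a browser context, the returned `roots` and `maxDepth` values give a rough sense of how much boundary crossing a selector has to do on that page.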
// 04 — shadow dom traversal

Extracting data
from the dark.

A trace of an extraction worker attempting to read a price inside a custom <product-card> component.

Playwright · Shadow DOM · Node.js
// standard selector attempt
query: "document.querySelector('product-card .price')"
result: null // shadow boundary blocked access

// shadow root inspection
element: "<product-card>"
shadowRoot.mode: "open" // traversable

// piercing selector execution
query: "page.locator('product-card >> .price')"
node.text: "$149.99"
status: extracted
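The trace above maps onto a handful of Playwright calls. The following is a minimal sketch, assuming a Playwright `page` object and the `product-card` and `.price` names from the trace. `readPrice` is a hypothetical helper and the 10-second timeout is an arbitrary choice. Playwright's CSS engine pierces open shadow roots by default, so no special syntax is needed beyond chaining the locators.

```javascript
// Hypothetical helper: wait for the price inside <product-card> to render,
// then return its trimmed text. `page` is assumed to be a Playwright Page.
async function readPrice(page, timeoutMs = 10000) {
  // Chained locators are equivalent to the selector 'product-card >> .price'.
  const price = page.locator("product-card").locator(".price").first();
  // Waiting on the inner node (not the host tag) also covers slow hydration.
  await price.waitFor({ state: "visible", timeout: timeoutMs });
  const text = await price.textContent();
  return text === null ? null : text.trim();
}
```

Waiting on the inner `.price` node rather than on `<product-card>` itself is a deliberate choice: the host tag can exist in the DOM well before its shadow tree has been populated.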
// 05 — extraction failure modes

Why web components
break pipelines.

The most common reasons extraction jobs fail when encountering custom elements, based on DataFlirt's telemetry across modern SPA targets.

SPA targets: 42% of fleet
Shadow roots: avg 14 per page
Updated: 2026-05-19
01 Closed shadow roots
JS evaluation required · Cannot be pierced by standard browser APIs
02 Hydration delays
Timing issue · Component tag exists, but internal DOM is pending JS execution
03 Nested shadow boundaries
Selector complexity · Components inside components require multi-step piercing
04 Dynamic slot injection
DOM mutation · Content moves between light DOM and shadow DOM slots
05 Static HTML parser failure
Tooling mismatch · Using BeautifulSoup on a site that requires a headless browser
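For the first failure mode, one commonly used workaround is to wrap `attachShadow` before any page script runs, so that every root is recorded even when it is created with `mode: "closed"`. The sketch below is an illustration under assumptions: `installShadowHook` and the `__shadowRoots` registry name are hypothetical, and with Playwright the hook would typically be injected as an init script (`page.addInitScript`) so that it runs ahead of the component code.

```javascript
// Hypothetical helper: wrap attachShadow on a prototype so that every
// shadow root (open or closed) is remembered against its host element.
function installShadowHook(proto, registry = new WeakMap()) {
  const original = proto.attachShadow;
  proto.attachShadow = function (init) {
    // Call through unchanged so the component behaves as its author intended.
    const root = original.call(this, init);
    registry.set(this, root);
    return root;
  };
  return registry;
}

// Assumed usage in a browser context:
//   window.__shadowRoots = installShadowHook(Element.prototype);
//   const root = window.__shadowRoots.get(hostElement);
```

Recording the root instead of forcing `mode: "open"` leaves the page's own behavior untouched, which reduces the chance of breaking components that check their own mode.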
// 06 — our extraction engine

Native traversal,

without the JavaScript overhead.

Most scraping teams handle web components by injecting brittle JavaScript evaluation blocks into their headless browsers. This is slow, prone to memory leaks, and breaks when the target site updates its component architecture. DataFlirt's extraction engine parses the DOM tree natively, automatically flattening open shadow roots into a unified, queryable structure. We extract from custom elements exactly as we do from standard HTML, keeping pipeline latency low and selector maintenance centralized.

Shadow DOM Extraction Job

Live metrics from a worker extracting product data from a Salesforce Commerce Cloud storefront heavily utilizing web components.

target.framework Lightning Web Components
shadow_roots.detected 128
pierce_mode native-flatten (active)
extraction.latency 8.4 ms (fast)
closed_roots 0
fields.extracted 42/42 (complete)


// 07 — FAQ

Common
questions.

Common questions about scraping sites built with Web Components, Shadow DOM, and modern frontend frameworks.

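Can I scrape web components with BeautifulSoup or Cheerio?
Usually not. Static HTML parsers only see the light DOM, meaning the custom tags themselves (e.g., <my-widget>). The internal structure (the shadow tree) is normally generated by JavaScript at runtime, so you must use a headless browser or a specialized JS-rendering engine to execute the component logic before extraction. The exception is a site that ships Declarative Shadow DOM, where the shadow tree arrives in the HTML response inside a <template shadowrootmode> element.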
What is the difference between an open and closed shadow root?
An open shadow root can be accessed via JavaScript using element.shadowRoot. A closed shadow root returns null for that property, making it intentionally difficult to inspect from the outside. While closed roots are rare in the wild, they require intercepting the component's creation script to extract data.
How do I write a CSS selector that pierces the shadow DOM?
Standard CSS cannot pierce the shadow DOM. However, modern automation tools provide their own mechanisms. In Playwright, CSS and text locators pierce open shadow roots by default, and the >> combinator chains selectors across them. Puppeteer provides the >>> deep combinator and the older pierce/ query handler. In both cases it is the automation tool, not the browser's CSS engine, that traverses the boundary.
Why is my scraper returning empty text for a web component?
Usually, it's a hydration issue. The custom element tag is in the DOM immediately, but the JavaScript required to build its internal shadow tree hasn't finished executing. You need to wait for a specific element inside the shadow root to appear, rather than just waiting for the custom tag.
How does DataFlirt handle deeply nested web components?
Our extraction layer automatically flattens open shadow roots into a unified virtual DOM during the parsing phase. This allows our clients to write standard, flat extraction schemas without worrying about which fields live inside which shadow boundaries. It abstracts away the frontend framework complexity.
Do web components impact anti-bot detection?
Indirectly, yes. Because web components force you to use a headless browser rather than simple HTTP requests, you inherit the fingerprinting risks associated with browser automation. If your Playwright instance isn't properly cloaked, the anti-bot system will flag you long before the web components even render.