
What is Screen Scraping?

Screen scraping extracts data from the visual render of a page rather than from its underlying markup or API responses. Instead of querying the DOM, it captures what the browser actually draws — using pixel coordinates, OCR, or visual element detection — to pull values that exist only in the rendered output. For scraping engineers, it's the last resort: slower and more brittle than DOM-based extraction, but the only option when markup is obfuscated, content is in canvas or WebGL, or the target is a legacy app with no accessible HTML.

Render · OCR · Visual Extraction · Canvas · Legacy Systems
// 02 — definitions

When the DOM
won't cooperate.

Screen scraping treats the browser as a black box — you don't care what's in the HTML; you care about the pixels that end up on screen. It's the correct tool in specific situations and a liability everywhere else.


TL;DR

Screen scraping captures data from the visual render rather than the DOM. It's used when markup is obfuscated, content lives in canvas or images, or the target is a non-web interface. The tradeoff is cost: screenshot capture, OCR processing, and visual element detection are typically 15–50× slower than DOM queries. Use it when there's no faster path, not as a general approach.

01 Definition & how it differs from DOM scraping

DOM scraping queries the HTML structure: document.querySelector('.price') returns a node. Screen scraping queries the visual output: "the number at pixel coordinates (312, 204) in this screenshot." The data is the same; the access path is entirely different.

DOM scraping works on the parse tree. Screen scraping works on rasterised pixels — often via screenshot capture followed by OCR, or by coordinate-based region extraction. The browser renders the page; you capture what it drew.

02 Extraction methods

Three approaches, each with a different reliability profile:

  • Region OCR — capture a screenshot, crop a region, run OCR. Works for text rendered anywhere, including canvas. Depends on rendering engine consistency.
  • Canvas API interception — hook the drawing API in JavaScript to capture values before they're rendered to pixels. More reliable than OCR but requires script injection.
  • Visual element detection — use computer vision models to locate and classify UI elements. Used for complex legacy apps where neither DOM nor coordinates are reliable.
03 Why targets use canvas to render data

Three reasons, with different implications for extraction:

  • Intentional obfuscation — financial data providers and premium content sites render values in canvas to prevent scraping. They know the DOM approach won't work.
  • Performance — trading terminals and real-time dashboards use canvas because it renders faster than DOM updates at high tick rates. Scraping difficulty is a side effect, not the goal.
  • Legacy architecture — older Java Applet or Flash applications migrated to canvas. The canvas is the app, not a display layer over a DOM.
04 How DataFlirt approaches canvas targets

Our first pass on any canvas target is always network interception — capturing the WebSocket or XHR payloads that populate the canvas, which often contain the raw data in JSON. If the connection is binary-encoded WebSocket, we attempt binary protocol reverse-engineering before falling back to OCR.

When OCR is required, we pin the rendering environment (specific Chrome version, exact viewport) and run confidence scoring on every extracted field. Fields below 0.93 confidence are flagged and manually reviewed before delivery. We don't deliver uncertain data silently.
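The confidence gate described above can be sketched roughly as follows. This is a minimal illustration, not DataFlirt's production code: the `OcrField` type, the `split_by_confidence` helper, and the low-confidence "volume" sample are hypothetical, and only the 0.93 threshold comes from the text.

```python
from dataclasses import dataclass

# Threshold taken from the text; everything else here is illustrative.
CONFIDENCE_THRESHOLD = 0.93


@dataclass(frozen=True)
class OcrField:
    """One OCR-extracted field (hypothetical structure)."""
    name: str
    raw: str
    confidence: float


def split_by_confidence(fields, threshold=CONFIDENCE_THRESHOLD):
    """Return (accepted, flagged); fields below the threshold go to review."""
    accepted, flagged = [], []
    for field in fields:
        if field.confidence < threshold:
            flagged.append(field)
        else:
            accepted.append(field)
    return accepted, flagged


fields = [
    OcrField("price", "₹ 2,847.50", 0.97),
    OcrField("change", "+1.24 (0.044%)", 0.94),
    OcrField("volume", "1,2O4,880", 0.81),  # made-up low-confidence read
]
accepted, flagged = split_by_confidence(fields)
print("deliver:", [f.name for f in accepted])
print("manual review:", [f.name for f in flagged])
```

Keeping flagged fields in a separate list, rather than dropping them, is what makes the "no silent uncertainty" rule enforceable downstream.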

05 Try network interception before screen scraping

Most canvas-rendered data comes from somewhere. The JS that draws it received it via an API call, a WebSocket message, or an embedded JSON blob. In Playwright, page.on('response', ...) intercepts every network response. In Chrome DevTools, the Network tab shows everything. Check those sources before reaching for OCR.
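As a rough sketch of that first check, the snippet below uses Playwright's Python sync API to record every response and keep the ones that carry JSON. The helper names (`looks_like_json`, `capture_json_payloads`) are assumptions for illustration, and the Playwright import sits inside the function so the pure helper can be used without a browser installed.

```python
def looks_like_json(content_type: str) -> bool:
    """True for JSON media types such as application/json or application/ld+json."""
    media_type = content_type.split(";")[0].strip().lower()
    return media_type == "application/json" or media_type.endswith("+json")


def capture_json_payloads(url: str) -> list:
    """Load `url` and collect every JSON response body the page receives.

    Requires the third-party `playwright` package and an installed Chromium.
    """
    from playwright.sync_api import sync_playwright

    responses = []
    payloads = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Record every response object; bodies are read once the page settles.
        page.on("response", responses.append)
        page.goto(url, wait_until="networkidle")

        for response in responses:
            if not looks_like_json(response.headers.get("content-type", "")):
                continue
            try:
                payloads.append({"url": response.url, "data": response.json()})
            except Exception:
                # Body unavailable (e.g. a redirect) or not valid JSON: skip it.
                continue
        browser.close()
    return payloads
```

This only covers HTTP responses. WebSocket traffic needs a separate listener (Playwright exposes a `websocket` page event), and binary frames still have to be decoded by hand.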

In our experience, about 60% of "canvas targets" have interceptable JSON payloads that make OCR unnecessary. The remaining 40% are genuine render-only cases — binary WebSocket streams, server-side rendered canvas images, or fully obfuscated data paths.

// 03 — the cost model

Why screen scraping
is the last resort.

Screen scraping has a fundamentally different cost structure from DOM extraction. These three comparisons quantify why it should be reserved for cases where DOM access genuinely isn't possible — not just inconvenient.

Extraction latency ratio = L_screen / L_DOM, where L_screen = screenshot + OCR + coordinate_map
Typical ratio: 15–50×. Screenshot alone adds 200–800ms per page. Source: DataFlirt benchmark suite.

OCR accuracy: A_OCR = 1 − character_errors / total_characters
95% accuracy sounds good. On a 20-char price string it means one wrong character on average. Source: Tesseract / Google Vision benchmarks.

Coordinate fragility: P(break) ∝ layout_changes + font_rendering_delta + viewport_variance
A 1px font rendering difference moves every coordinate below the affected element. Source: internal stability analysis.
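The arithmetic behind the first two comparisons can be checked with a few lines of Python. This is a sketch under one simplifying assumption: character errors are treated as independent, which real OCR engines do not guarantee. The function names and the latency inputs in the example are illustrative, not benchmark results.

```python
def latency_ratio(screenshot_ms, ocr_ms, coordinate_map_ms, dom_ms):
    """Screen-scraping latency divided by DOM-extraction latency."""
    return (screenshot_ms + ocr_ms + coordinate_map_ms) / dom_ms


def expected_char_errors(char_accuracy, length):
    """Average number of wrong characters in a string of `length` characters."""
    return (1.0 - char_accuracy) * length


def p_at_least_one_error(char_accuracy, length):
    """Chance that a string contains one or more wrong characters,
    assuming each character fails independently."""
    return 1.0 - char_accuracy ** length


# Illustrative latencies only (milliseconds), chosen inside the ranges quoted above.
print(round(latency_ratio(400, 300, 20, 18), 1))

# A 20-character string at 95% and 97% character accuracy.
print(round(expected_char_errors(0.95, 20), 2))
print(round(p_at_least_one_error(0.97, 20), 3))
```

The last figure is the reason per-field validation matters: high character accuracy still leaves a large share of whole strings with at least one bad character.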
// 04 — visual extraction trace

Screenshot to
structured record.

Extraction trace for a financial data terminal that renders values in a canvas element — no DOM nodes, no accessible HTML. Screenshot capture followed by region OCR.

canvas target · Tesseract OCR · region-based extraction
edge.dataflirt.io — live
CAPTURED
// capture
method: "playwright.screenshot"
viewport: 1440x900
wait_for: "networkidle"
capture.ms: 680

// target element: canvas — no DOM access
canvas.node: present but unreadable
dom.textContent: empty

// OCR regions
region.price: [312, 204, 480, 228]
ocr.price.raw: "₹ 2,847.50"
ocr.price.confidence: 0.97
region.change: [312, 232, 480, 252]
ocr.change.raw: "+1.24 (0.044%)"
ocr.change.confidence: 0.94

// output
record.price: 2847.50
record.change_pct: 0.044
extraction.total_ms: 1,240 // vs ~18ms for DOM
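The step from raw OCR strings to the output record in this trace could look something like the sketch below. The two parsers and their regular expressions are assumptions written for the exact string shapes shown in the trace; other currencies or number formats would need their own rules.

```python
import re

# A number with optional sign, thousands separators, and decimals.
_NUMBER = re.compile(r"[-+]?\d[\d,]*(?:\.\d+)?")
# A percentage inside parentheses, e.g. "(0.044%)".
_PERCENT = re.compile(r"\(\s*([-+]?\d+(?:\.\d+)?)\s*%\s*\)")


def parse_price(raw: str) -> float:
    """Extract the first number from an OCR price string, ignoring the currency symbol."""
    match = _NUMBER.search(raw)
    if match is None:
        raise ValueError(f"no number found in {raw!r}")
    return float(match.group().replace(",", ""))


def parse_change_pct(raw: str) -> float:
    """Extract the parenthesised percentage from an OCR change string."""
    match = _PERCENT.search(raw)
    if match is None:
        raise ValueError(f"no percentage found in {raw!r}")
    return float(match.group(1))


record = {
    "price": parse_price("₹ 2,847.50"),
    "change_pct": parse_change_pct("+1.24 (0.044%)"),
}
print(record)
```

Raising on anything that does not parse, instead of returning a default, keeps malformed OCR reads from reaching the delivered dataset unnoticed.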
// 05 — when to use it

Legitimate cases
for screen scraping.

Five scenarios where screen scraping is the correct tool, not a workaround. Outside these cases, DOM extraction or API interception should be used instead.

USE CASES TRACKED: across active pipelines
DOM POSSIBLE: < 5% of these cases
UPDATED: 2026-05-19

01 Canvas / WebGL rendered data
   no DOM nodes · Charts, terminals, game UIs

02 Obfuscated or encoded markup
   DOM inaccessible · Intentionally scrambled class names

03 Legacy desktop / Java apps
   no HTML at all · Internal tools, ERP, trading desks

04 PDF and image-only content
   raster only · Scanned docs, image-embedded tables

05 Anti-bot obfuscated text
   CSS scrambled text · Prices in pseudo-elements or SVG

// 06 — DataFlirt's approach to render extraction

Pixel-level,
only when unavoidable.

Before we use screen scraping on any target, we exhaust three alternatives: DOM extraction, network request interception (capturing the underlying JSON the JS fetches), and HTML-embedded JSON blobs in script tags. Screen scraping is approved for a pipeline only when all three return nothing useful. When we do use it, we pin to specific viewport dimensions and rendering engines to keep coordinate maps stable.

Render extraction decision

Pipeline assessment for a financial data target with canvas-rendered prices.

target.type: canvas-rendered terminal
dom.accessible: no
api.intercepted: no (WebSocket binary)
json.embedded: no
method.approved: screen scraping
viewport.pinned: 1440x900 · Chrome 124
ocr.engine: Google Vision API
accuracy.threshold: > 0.93 per field
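The fallback order in this assessment can be written down as a small decision function. This is only a sketch of the ordering described above: the function name and the method labels are hypothetical, and a real assessment involves judgement that three booleans do not capture.

```python
def choose_extraction_method(dom_accessible: bool,
                             api_interceptable: bool,
                             json_embedded: bool) -> str:
    """Pick the cheapest viable method; screen scraping only when all else fails."""
    if dom_accessible:
        return "dom extraction"
    if api_interceptable:
        return "network interception"
    if json_embedded:
        return "embedded json"
    return "screen scraping"


# The assessment above: no DOM access, binary WebSocket, no embedded JSON.
print(choose_extraction_method(False, False, False))
```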


// 07 — FAQ

Common
questions.

About when screen scraping is appropriate, OCR accuracy tradeoffs, coordinate stability, and how DataFlirt handles canvas and legacy app extraction.

When should I use screen scraping instead of DOM scraping?
When the data you need genuinely doesn't exist in the DOM. Specifically: canvas or WebGL renders, legacy desktop applications, PDFs and scanned images, and targets that deliberately obfuscate markup. If you can get the same data by querying the DOM or intercepting network requests, do that. Screen scraping is 15–50× slower and breaks whenever layout changes.
How accurate is OCR for extracting prices and numbers?
Character-level accuracy is 94–98% with a good engine on clean renders. That sounds high, but a 20-character price string at 97% accuracy has a 46% chance of containing at least one error. For numeric data, always validate the extracted value against expected ranges and formats, and flag anything that doesn't parse as a number.
Can I intercept canvas data without OCR?
Sometimes. If the canvas is drawn by JavaScript, you can hook CanvasRenderingContext2D methods such as fillText before the page's own script runs. This gives you the raw values being rendered without needing OCR. It requires a custom browser extension or a script injected ahead of page load, such as a Playwright init script. More reliable than OCR when it works, but target-specific.
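A minimal sketch of the hook, assuming the target draws its text with the 2D context's `fillText` or `strokeText`: the injected JavaScript wraps both methods and records their arguments, and Playwright's `add_init_script` installs it before any page script runs. The `window.__capturedText` name and the `capture_canvas_text` helper are hypothetical.

```python
# JavaScript injected before the page's own scripts run.
CANVAS_TEXT_HOOK = """
(() => {
  const captured = [];
  window.__capturedText = captured;  // hypothetical global used to read results back
  const proto = CanvasRenderingContext2D.prototype;
  for (const name of ["fillText", "strokeText"]) {
    const original = proto[name];
    proto[name] = function (text, x, y, ...rest) {
      captured.push({ method: name, text: String(text), x, y });
      return original.call(this, text, x, y, ...rest);
    };
  }
})();
"""


def capture_canvas_text(url: str) -> list:
    """Load `url` with the hook installed and return the recorded draw calls.

    Requires the third-party `playwright` package and an installed Chromium.
    """
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.add_init_script(CANVAS_TEXT_HOOK)
        page.goto(url, wait_until="networkidle")
        calls = page.evaluate("window.__capturedText")
        browser.close()
    return calls
```

This will not see text drawn through WebGL, text rasterised on the server, or drawing done in a worker on an offscreen canvas, so an empty result does not prove the data is unreachable.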
How do I keep coordinate maps stable as the page changes?
Pin the viewport size and rendering engine version. Use relative coordinates (percentage of canvas dimensions) rather than absolute pixels where possible. Add confidence score thresholds to OCR output — low confidence is a canary for layout drift. Re-validate coordinate maps on every pipeline deployment, not just on failure.
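The relative-coordinate idea can be sketched as a pair of conversions. The example assumes, purely for illustration, that the canvas fills a 1440x900 viewport and that the layout scales proportionally; `to_relative` and `to_absolute` are hypothetical helper names.

```python
def to_relative(box, canvas_size):
    """Convert an absolute (left, top, right, bottom) pixel box to fractions of the canvas."""
    left, top, right, bottom = box
    width, height = canvas_size
    return (left / width, top / height, right / width, bottom / height)


def to_absolute(rel_box, canvas_size):
    """Convert a fractional box back to whole pixels for a given canvas size."""
    left, top, right, bottom = rel_box
    width, height = canvas_size
    return (round(left * width), round(top * height),
            round(right * width), round(bottom * height))


# Price region from the extraction trace, measured on an assumed 1440x900 canvas.
price_rel = to_relative((312, 204, 480, 228), (1440, 900))

print(to_absolute(price_rel, (1440, 900)))   # same canvas size
print(to_absolute(price_rel, (1920, 1200)))  # proportionally larger canvas
```

If the target reflows instead of scaling, for example by switching to a different breakpoint, fractional boxes drift just like absolute ones, which is why pinning the viewport remains the first line of defence.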
Is screen scraping slower than DOM scraping?
Yes, always. A DOM query takes 1–5ms. A screenshot takes 200–800ms. OCR adds another 300–2000ms depending on the engine and image complexity. For high-volume pipelines, screen scraping is rarely viable at scale unless the data genuinely can't be obtained any other way and the business value justifies the cost per record.
Does DataFlirt support screen scraping pipelines?
Yes, for cases where DOM and API interception both fail. We assess each target before approving screen scraping as the extraction method — it requires explicit sign-off because the ongoing maintenance cost is higher. Canvas extraction and legacy app scraping are our most common screen scraping use cases.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h