
What is OCR (Optical Character Recognition)?

OCR (Optical Character Recognition) is the computational process of converting raster images of text (scanned PDFs, rendered canvas elements, or obfuscated contact details) into machine-readable strings. In a scraping pipeline, it is the last-resort fallback when DOM extraction fails or when targets deliberately render critical fields as images to thwart automated collection. It adds latency and a non-deterministic error rate, so its output needs strict downstream validation.

Computer Vision · Tesseract · Document AI · Image Processing · Anti-Obfuscation
// 02 — definitions

Pixels to payloads.

The mechanics of extracting structured text from unstructured image data when the DOM refuses to cooperate.


TL;DR

OCR bridges the gap between human-readable images and machine-readable text. Modern pipelines use lightweight engines like Tesseract for simple anti-scraping bypasses (e.g., email addresses rendered as PNGs) and heavy Vision-Language Models (VLMs) for complex document parsing. It is computationally expensive, making it a targeted tool rather than a default extraction method.

01 Definition & structure
OCR (Optical Character Recognition) is the process of analyzing a raster image (pixels) to identify and extract text characters. In web scraping, it is used to recover data that is intentionally obfuscated by the target site. The output is typically a string of text accompanied by bounding box coordinates and a confidence score for each character or word.
02 How it works in practice
An OCR pipeline consists of three stages:
  • Pre-processing: The image is converted to grayscale, binarized (black and white), deskewed, and upscaled to improve contrast and edge definition.
  • Inference: An engine like Tesseract (using LSTM networks) or a modern VLM analyzes the pixel patterns and predicts the corresponding characters.
  • Post-processing: The raw text is validated against expected formats (e.g., regex for phone numbers) and corrected using heuristic rules or dictionary lookups.
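The binarization step in the pre-processing stage can be sketched in pure Python. This is a minimal illustration of Otsu thresholding (the method named in the pipeline trace as threshold_otsu), not DataFlirt's implementation; the function names and the tiny sample image are assumptions made for the example, and a production pipeline would more likely call an image library such as OpenCV. Grayscale conversion, deskewing, and upscaling are omitted.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the 0-255 threshold that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of intensities at or below the candidate threshold
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance, up to a constant factor of 1 / total**2.
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(gray: List[List[int]]) -> List[List[int]]:
    """Map every pixel to 0 or 255 using a global Otsu threshold."""
    flat = [p for row in gray for p in row]
    t = otsu_threshold(flat)
    return [[255 if p > t else 0 for p in row] for row in gray]


# Hypothetical 2x3 grayscale patch: dark pixels on the left, bright on the right.
sample = [[10, 12, 200], [11, 205, 210]]
binary = binarize(sample)
```

A single global threshold like this is the reason patterned or watermarked backgrounds cause trouble: when foreground and background intensities overlap, no one cut-off separates them.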
03 Common scraping use cases
OCR is deployed when DOM extraction is impossible. Common scenarios include: extracting email addresses or phone numbers rendered as base64 PNGs to prevent spam harvesting; reading pricing data drawn onto an HTML5 <canvas> element; and parsing historical public records or government filings that are only available as scanned PDF documents.
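For the first scenario, an extraction worker has to recognize that an image source is an inline base64 payload and recover the raw bytes before any OCR can run. The sketch below shows one way to do that with the Python standard library; the function name, the accepted MIME types, and the sample payload are assumptions for illustration.

```python
import base64
import re
from typing import Optional

# Matches inline image data URIs such as data:image/png;base64,....
DATA_URI_RE = re.compile(
    r"^data:image/(?:png|jpeg|gif|webp);base64,(.+)$", re.DOTALL
)

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_inline_image(src: str) -> Optional[bytes]:
    """Return the decoded image bytes, or None if src is not an inline image."""
    match = DATA_URI_RE.match(src.strip())
    if match is None:
        return None
    try:
        # validate=True rejects characters outside the base64 alphabet.
        return base64.b64decode(match.group(1), validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return None


# Hypothetical payload: only the 8-byte PNG signature, not a full image.
payload = base64.b64encode(PNG_SIGNATURE).decode("ascii")
image_bytes = decode_inline_image("data:image/png;base64," + payload)
```

A regular URL returns None, which lets the caller fall through to an ordinary image fetch instead of the inline path.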
04 How DataFlirt handles it
We run a tiered, asynchronous OCR pipeline. When an extraction worker detects an obfuscated image field, it ships the buffer to a dedicated GPU cluster. Simple, regex-bound fields (like phone numbers) hit a highly optimized C++ Tesseract wrapper for sub-100ms response times. Complex layouts or heavily distorted images are routed to a fine-tuned Vision-Language Model. This keeps the main crawl loop fast while ensuring high data recovery rates.
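The tiering decision described above can be pictured as a small routing function. This is a hedged sketch only: the OcrJob fields, the distortion_score measure, the 0.5 cut-off, and the "vlm" label are all hypothetical; only the engine name tesseract_fast comes from the pipeline trace on this page.

```python
from dataclasses import dataclass


@dataclass
class OcrJob:
    field_name: str
    distortion_score: float  # hypothetical: 0.0 = clean, 1.0 = heavily distorted
    complex_layout: bool     # hypothetical: tables, multi-column scans, etc.


def choose_engine(job: OcrJob, distortion_limit: float = 0.5) -> str:
    """Pick the cheapest engine that is likely to read the image correctly."""
    if job.complex_layout or job.distortion_score > distortion_limit:
        return "vlm"  # slow, heavy tier
    return "tesseract_fast"  # lightweight tier for simple fields


phone_job = OcrJob(field_name="phone", distortion_score=0.1, complex_layout=False)
table_job = OcrJob(field_name="price_table", distortion_score=0.1, complex_layout=True)
```

Checking for the expensive cases first keeps the default path on the lightweight engine, which matches the goal of paying the heavy-model cost only when the image requires it.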
05 The latency penalty
OCR is orders of magnitude slower than standard HTML parsing. A typical XPath extraction takes roughly 2 milliseconds. Extracting the same text via a headless browser screenshot and OCR inference can take 400 to 800 milliseconds. Because of this massive compute cost, OCR must be applied surgically to specific fields, never as a blanket approach to page parsing.
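To make the gap concrete, the figures above can be turned into a quick back-of-the-envelope calculation. The 2 ms and 400 to 800 ms values come from the paragraph above; the 10,000-field batch size and the assumption of strictly sequential processing are illustrative only.

```python
XPATH_MS = 2.0                  # per-field XPath extraction, from the text above
OCR_MS_RANGE = (400.0, 800.0)   # per-field screenshot + OCR, from the text above


def slowdown(ocr_ms: float, xpath_ms: float = XPATH_MS) -> float:
    """How many times slower one OCR read is than one XPath read."""
    return ocr_ms / xpath_ms


def sequential_seconds(n_fields: int, per_field_ms: float) -> float:
    """Wall-clock seconds to process n_fields one after another."""
    return n_fields * per_field_ms / 1000.0


low, high = (slowdown(ms) for ms in OCR_MS_RANGE)
xpath_batch = sequential_seconds(10_000, XPATH_MS)        # hypothetical batch
ocr_batch = sequential_seconds(10_000, OCR_MS_RANGE[0])   # best-case OCR
```

Under these assumptions a batch that XPath finishes in 20 seconds takes over an hour with OCR even at the low end of the range, which is why the fallback is reserved for individual fields.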
// 03 — the math

How accurate is the read?

OCR quality is measured by Character Error Rate (CER) and Word Error Rate (WER). DataFlirt monitors both metrics continuously and triggers human-in-the-loop review when confidence drops below the threshold.

Character Error Rate (CER)
CER = (S + D + I) / N
Substitutions (S), deletions (D), and insertions (I) over total characters (N). Lower is better.
Standard Levenshtein distance metric
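As a worked example, CER can be computed from the Levenshtein distance between the expected string and the OCR output, since that distance is the minimum number of substitutions, deletions, and insertions. The sketch below assumes N is the length of the reference string; the function names are illustrative, and the sample strings are taken from the pipeline trace on this page.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character edits turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(
                prev[j] + 1,         # deletion
                cur[j - 1] + 1,      # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance over reference length."""
    if not ref:
        raise ValueError("reference string must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


reference = "+91-98765-43210"
ocr_output = "+91-9876S-43210"
error_rate = cer(reference, ocr_output)  # one substitution over 15 characters
```

One misread character in a 15-character phone number already gives a CER of about 6.7%, which shows how unforgiving short, high-value fields are compared with long passages of text.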
Confidence Threshold
P(correct) = ∏ c_i > 0.92
Product of per-character confidences c_i. Drops sharply on long strings.
Tesseract inference engine
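The effect of string length on this product is easy to see numerically. The sketch below assumes the engine exposes one confidence value per character as a float between 0 and 1; the function names are hypothetical, and the 0.92 threshold is the one quoted above.

```python
import math
from typing import Sequence


def string_confidence(char_confs: Sequence[float]) -> float:
    """Product of per-character confidences."""
    return math.prod(char_confs)


def needs_review(char_confs: Sequence[float], threshold: float = 0.92) -> bool:
    """Flag the read for review unless the product clears the threshold."""
    return string_confidence(char_confs) <= threshold


short_read = [0.99] * 8    # 0.99 ** 8  is roughly 0.923
long_read = [0.99] * 15    # 0.99 ** 15 is roughly 0.860
```

Even at 99% confidence per character, an 8-character string barely clears the threshold and a 15-character string fails it, so longer fields need either higher per-character confidence or a format check to back them up.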
DataFlirt OCR Latency
T_total = T_preprocess + (N_boxes × T_infer)
Pre-processing (deskew, binarize) often takes longer than the actual inference.
Internal pipeline SLO
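The latency model is a simple linear budget, sketched below. The timings in the example are invented to illustrate the case where pre-processing dominates; they are not measurements from DataFlirt's pipeline.

```python
def total_latency_ms(t_preprocess_ms: float, n_boxes: int, t_infer_ms: float) -> float:
    """T_total = T_preprocess + N_boxes * T_infer, all in milliseconds."""
    return t_preprocess_ms + n_boxes * t_infer_ms


# Hypothetical job: 60 ms of pre-processing, 3 text boxes at 15 ms each.
job_ms = total_latency_ms(60, 3, 15)
preprocess_share = 60 / job_ms
```

In this made-up case pre-processing accounts for more than half of the total, so shaving inference time alone would leave most of the latency untouched.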
// 04 — pipeline trace

Extracting an obfuscated phone number.

A trace of an extraction worker encountering a base64-encoded image instead of a text node, triggering the OCR fallback routine.

Tesseract v5 · OpenCV · Regex Validation
// DOM extraction failure
dom.phone: missing
dom.phone.node: <img src="data:image/png;base64,iVBOR...">

// routing to OCR fallback
ocr.engine: "tesseract_fast"
ocr.preprocess: ["grayscale", "threshold_otsu", "scale_2x"]

// inference
ocr.raw_output: "+91-9876S-43210"
ocr.confidence: 0.87

// post-processing & validation
regex.match: false // invalid character 'S'
ocr.correction: "S" -> "5" // heuristic substitution
regex.match: true

// final output
field.phone: "+91-98765-43210"
status: recovered
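The post-processing step in this trace (validate, substitute, re-validate) can be sketched as follows. The phone pattern is inferred from the single example in the trace, and the table of look-alike characters is an assumption; a real pipeline would tune both per field and per target.

```python
import re
from typing import Optional

# Assumed format, inferred from the trace: +91-DDDDD-DDDDD
PHONE_RE = re.compile(r"\+91-\d{5}-\d{5}")

# Letters that OCR engines commonly confuse with digits (illustrative list).
CONFUSABLES = str.maketrans(
    {"S": "5", "O": "0", "o": "0", "I": "1", "l": "1", "B": "8"}
)


def recover_phone(raw: str) -> Optional[str]:
    """Return a validated phone number, or None if it cannot be recovered."""
    if PHONE_RE.fullmatch(raw):
        return raw
    # Heuristic substitution, then validate again before accepting.
    corrected = raw.translate(CONFUSABLES)
    if PHONE_RE.fullmatch(corrected):
        return corrected
    return None


recovered = recover_phone("+91-9876S-43210")
```

The second regex check matters: the substitution is only accepted when it produces a string in the expected format, so an unrecoverable read such as one containing an unrelated letter is rejected rather than silently "fixed".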
// 05 — failure modes

Where OCR breaks down.

Ranked by share of OCR-related extraction failures across DataFlirt's active pipelines. Image quality and layout complexity are the primary drivers of error.

OCR invocations: 1.2M / day
Avg CER: 1.4%
Updated: 2026-05-19
01 Low contrast / background noise
Watermarks or patterned backgrounds defeat binarization.
02 Non-standard fonts
Highly stylized or custom anti-scraping glyphs.
03 Skewed or distorted images
Requires expensive affine transformations to correct.
04 Complex multi-column layouts
Bounding box overlap causes reading order failures.
05 Language / charset mismatch
Missing language packs for specific Unicode ranges.
// 06 — our stack

Targeted inference, only when the DOM goes dark.

DataFlirt treats OCR as an expensive fallback, not a primary extraction method. When a target obfuscates a field, our extraction layer automatically routes the image buffer to a dedicated GPU cluster. We use lightweight models for simple text and heavy Vision-Language Models for complex tables, ensuring we only pay the latency penalty when the data value justifies it.

ocr-worker-node.log

Live metrics from a GPU worker processing obfuscated pricing data.

worker.id gpu-ocr-in-04
engine.active tesseract-v5.3
throughput 142 images/sec
latency.p95 118ms
confidence.avg 0.96
fallback.vlm 12 requests
node.status healthy


// 07 — FAQ

Common questions.

About OCR accuracy, anti-scraping bypasses, latency costs, and how DataFlirt integrates computer vision into high-throughput pipelines.

What is the difference between OCR and Document AI?
OCR is strictly about converting pixels to characters — it gives you raw text and bounding boxes. Document AI (or Information Extraction) goes a step further, using machine learning to understand the semantic structure of the document. OCR tells you "Total: $45.00" is on the page; Document AI maps that string to the invoice_total field in your schema.
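The distinction can be illustrated with a toy example. The sketch below takes raw OCR text and maps one string to the invoice_total field; it uses a hand-written regex as a stand-in for the learned models a real Document AI system would use, and the function name and sample text are assumptions.

```python
import re
from decimal import Decimal
from typing import Dict, Optional

# Stand-in for a learned extractor: one fixed pattern for one field.
TOTAL_RE = re.compile(r"\bTotal:\s*\$(\d+\.\d{2})")


def extract_invoice_total(ocr_text: str) -> Optional[Dict[str, Decimal]]:
    """Map the 'Total: $X' line in raw OCR text to a typed schema field."""
    match = TOTAL_RE.search(ocr_text)
    if match is None:
        return None
    return {"invoice_total": Decimal(match.group(1))}


# Raw OCR output is just text; the schema mapping is the extra step.
raw_text = "Subtotal: $40.00\nTotal: $45.00"
record = extract_invoice_total(raw_text)
```

The OCR stage ends at raw_text; everything after that line is the semantic step, which is where a fixed pattern breaks down and learned models earn their place on varied layouts.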
Why not just use OCR for the whole page instead of parsing HTML?
Latency and accuracy. Parsing a DOM tree with XPath takes roughly 2 milliseconds and is 100% deterministic. Rendering a page in a headless browser, taking a screenshot, and running OCR takes 500–2000 milliseconds and introduces probabilistic errors (e.g., reading a '0' as an 'O'). OCR is a fallback, never a primary strategy.
Is scraping data via OCR legal?
The method of extraction (DOM parsing vs. OCR) generally does not change the legal status of the data collection. If the underlying data is public and factual, extracting it via OCR is typically lawful under precedents like hiQ v. LinkedIn. However, using OCR to bypass technical protection measures (like CAPTCHAs) can introduce CFAA or ToS complications. Consult counsel for specific use cases.
How do you deal with custom fonts designed to break OCR?
Some targets use custom web fonts where the unicode mapping is scrambled (e.g., the letter 'A' renders as '7'). Standard OCR fails here. We handle this by intercepting the font file during the network fetch, generating a reverse-mapping dictionary based on the glyph vectors, and translating the text directly without needing visual OCR at all.
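Once the reverse-mapping dictionary exists, applying it is a plain character translation. The sketch below covers only that final step and uses a hand-written, hypothetical mapping; deriving the mapping from the glyph vectors in the font file is the hard part and is not shown here.

```python
from typing import Dict

# Hypothetical reverse mapping: character stored in the DOM -> character
# the custom font actually draws for it.
GLYPH_MAP: Dict[str, str] = {"A": "7", "B": "3", "C": "1", "D": "9", "E": "0"}


def decode_scrambled(dom_text: str, glyph_map: Dict[str, str]) -> str:
    """Translate scrambled DOM text into what the user sees on screen."""
    table = str.maketrans(glyph_map)
    # Characters missing from the map (such as '-') pass through unchanged.
    return dom_text.translate(table)


visible = decode_scrambled("ABC-DE", GLYPH_MAP)
```

Because this works on the text node directly, the result is deterministic and costs about as much as any other string operation, unlike a screenshot-and-OCR round trip.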
What is the latency impact of adding OCR to a pipeline?
Significant. A standard HTTP fetch and parse pipeline can process 100+ pages per second per core. Adding a lightweight OCR step (like Tesseract) drops that to ~10 pages per second. Routing to a heavy Vision-Language Model drops it to <1 page per second. DataFlirt mitigates this by running OCR asynchronously on dedicated GPU nodes, preventing it from blocking the main crawl loop.
How does DataFlirt handle image-based CAPTCHAs?
We don't solve CAPTCHAs with OCR in real time. Our primary strategy is to maintain high-quality residential IP pools and pristine browser fingerprints so the CAPTCHA is never served in the first place. If a pipeline does encounter a CAPTCHA, the session is burned and rotated. Solving them is an arms race that degrades pipeline reliability.