What is OCR (Optical Character Recognition)?
OCR (Optical Character Recognition) is the computational process of converting raster images of text — scanned PDFs, rendered canvas elements, or obfuscated contact details — into machine-readable string data. In a scraping pipeline, it serves as the ultimate fallback when DOM extraction fails or when targets deliberately render critical fields as images to thwart automated collection. Relying on it introduces latency and non-deterministic error rates that require aggressive downstream validation.