
What is Web Scraping?

Web scraping is the automated extraction of structured data from websites — fetching pages, parsing the response, and mapping unstructured HTML or JSON into rows and fields a downstream system can use. The web is the largest dataset that nobody publishes as a dataset. Scraping is how it becomes one. Done well, it's a reliable data pipeline. Done naively, it's a script that breaks the first time the target changes a class name or deploys Cloudflare.

Infrastructure · Extraction · Parsing · Anti-bot · Pipelines
// 02 — definitions

Turn pages
into rows.

Web scraping is three problems wearing one name: getting the page (access), getting the data out of it (extraction), and doing both reliably at scale (operations). Most failures live in the third.


TL;DR

Web scraping fetches web pages and extracts structured data from them — product prices, listings, reviews, filings — into formats like JSON or CSV. The fetch layer fights anti-bot systems (Cloudflare, Akamai, DataDome), the parse layer fights changing HTML, and the ops layer fights both at once. A production scraper is a monitored pipeline with retries, validation, and alerting — not a script on a cron job.

01 Definition & the three layers

Web scraping is programmatic data extraction from websites. Every scraper, from a 20-line script to a distributed system, has three layers:

  • Fetch — retrieve the page. Plain HTTP client (httpx, requests) or headless browser (Playwright) when JavaScript rendering is required. This layer absorbs all anti-bot friction.
  • Parse — extract fields from the response. CSS selectors or XPath against HTML, key paths against JSON APIs, regex as a last resort.
  • Store — validate against a schema and write records to their destination: files, object storage, a database, or a delivery feed.

The layers fail independently. A fetch failure is visible (4xx, 5xx, challenge page). A parse failure is often silent — the request succeeds, the selector misses, and the field is quietly empty.
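
A minimal sketch of that silent failure, and of a validation step that makes it loud. It uses only Python's standard library; the `ClassTextExtractor` helper, the `REQUIRED` tuple, and the class names in the sample HTML are illustrative assumptions, not DataFlirt's production code.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of the first element carrying a given class.

    Simplification: void tags such as <br> inside the match are not handled.
    """

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0        # > 0 while inside the matched element
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract(html, wanted_class):
    parser = ClassTextExtractor(wanted_class)
    parser.feed(html)
    return "".join(parser.chunks).strip()


REQUIRED = ("title", "price")  # hypothetical schema


def parse_record(html, selectors):
    # A missing element yields "", not an exception: this is the silent part.
    return {field: extract(html, cls) for field, cls in selectors.items()}


def validate(record):
    missing = [f for f in REQUIRED if not record.get(f)]
    if missing:
        raise ValueError(f"empty required fields: {missing}")
    return record


page = '<h1 class="prod-name">Sneaker</h1><div class="prod-sp">1999</div>'
selectors = {"title": "prod-name", "price": "prod-sp"}
record = validate(parse_record(page, selectors))

# A hypothetical redesign renames the price class; parsing still "succeeds".
redesigned = page.replace("prod-sp", "price-now")
stale = parse_record(redesigned, selectors)
```

Here `stale["price"]` is an empty string and nothing has raised; only `validate(stale)` surfaces the problem.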

02 How a scraping job runs in practice

A production run starts with a URL manifest — usually produced by a crawler. For each URL: select a proxy from the pool, send the request with a coherent header and TLS profile, check the response for challenge pages or decoy content, parse the fields, validate the record, write it. Failures go to a retry queue with backoff and a different exit IP.
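
The retry path described above can be sketched roughly as follows. This is an illustration under stated assumptions: `fetch` is a caller-supplied function, `Blocked` is a hypothetical exception it raises on a block or challenge, and the proxy names are placeholders. Backoff here is plain exponential; a real queue would likely add jitter.

```python
import itertools
import time


class Blocked(Exception):
    """Raised by a fetcher when the response is a block or challenge page."""


def fetch_with_retries(url, proxies, fetch, max_attempts=4, base_delay=1.0,
                       sleep=time.sleep):
    """Try `url` through successive proxies, backing off between attempts.

    Returns (body, proxy_used, attempts_made) or raises RuntimeError.
    """
    rotation = itertools.cycle(proxies)
    last_error = None
    for attempt in range(max_attempts):
        proxy = next(rotation)  # move to the next exit IP in the pool
        try:
            return fetch(url, proxy), proxy, attempt + 1
        except Blocked as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts") from last_error


# Stand-in fetcher: the first proxy is blocked, the second gets through.
def fake_fetch(url, proxy):
    if proxy == "proxy-a":
        raise Blocked("403")
    return "<html>ok</html>"


delays = []
body, used_proxy, attempts = fetch_with_retries(
    "https://example.com/p/1", ["proxy-a", "proxy-b"], fake_fetch,
    sleep=delays.append,  # record delays instead of sleeping
)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.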

The run ends with a reconciliation pass: records written vs URLs attempted, yield per field, cost per record. Web scraping at scale is less about the extraction code and more about this accounting — it's how you know the dataset you delivered is actually complete.
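
As a rough sketch, the reconciliation pass reduces to a few ratios. The function name, the report keys, and the sample numbers below are assumptions for illustration.

```python
def reconcile(urls_attempted, records, fields, total_cost):
    """Summarise one run: completeness, yield per field, cost per record."""
    written = len(records)

    def filled(field):
        return sum(1 for r in records if r.get(field) not in (None, ""))

    return {
        "urls_attempted": urls_attempted,
        "records_written": written,
        "completeness": written / urls_attempted if urls_attempted else 0.0,
        "field_yield": {f: (filled(f) / written if written else 0.0) for f in fields},
        "cost_per_record": total_cost / written if written else float("inf"),
    }


# Hypothetical run: 4 URLs attempted, 3 records written, one with no price.
records = [
    {"title": "A", "price": "10"},
    {"title": "B", "price": ""},
    {"title": "C", "price": "12"},
]
report = reconcile(4, records, ["title", "price"], total_cost=0.006)
```

In this example the run looks 75% complete by record count, but only two of the three written records carry a price, which is the gap the accounting is meant to expose.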

03 The access problem: anti-bot systems

Most commercially valuable targets sit behind Cloudflare, Akamai, DataDome, or PerimeterX. These systems classify traffic before serving content, using signals across layers:

  • Network — IP reputation, ASN type (datacenter vs residential), request rate per IP
  • Protocol — TLS fingerprints (JA3/JA4), HTTP/2 frame ordering, header coherence
  • Behaviour — JavaScript challenge results, mouse and timing entropy, navigation patterns

A scraper that gets the parse layer perfect but ignores these signals extracts nothing — the page it parses is a 403 or a challenge interstitial. Access engineering is half the discipline, and it's the half that changes monthly.
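
A minimal sketch of checking a response before handing it to the parser. The markers used here (a `cf-mitigated: challenge` response header and a "Just a moment" page title, both associated with Cloudflare challenge pages) and the choice of status codes are heuristics assumed for illustration; verify them against the actual target, since other vendors signal challenges differently.

```python
def classify_response(status, headers, body):
    """Return 'ok', 'challenge', or 'blocked' for a fetched response."""
    headers = {k.lower(): v for k, v in headers.items()}

    # A challenge can arrive with any status code, so check markers first.
    if headers.get("cf-mitigated", "").lower() == "challenge":
        return "challenge"
    if "<title>just a moment" in body[:2048].lower():
        return "challenge"

    if status in (401, 403, 429) or status >= 500:
        return "blocked"
    return "ok"


samples = [
    classify_response(200, {}, "<html><title>Just a moment...</title></html>"),
    classify_response(403, {"CF-Mitigated": "challenge"}, ""),
    classify_response(403, {}, "Forbidden"),
    classify_response(200, {}, "<html><title>Sneakers</title></html>"),
]
```

Only responses classified as `ok` should reach the parse layer; the others belong in the retry queue.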

04 How DataFlirt runs extraction

We separate access from extraction. The fetch layer is shared infrastructure — proxy pools, fingerprint management, challenge handling — maintained centrally, so an anti-bot escalation on one target gets fixed once, not per-pipeline. Extraction logic is per-target and schema-first: we define the output schema before writing a selector, and every record is validated against it at write time.

Our monitoring watches yield per field, not just request success. fetch.success_rate at 99% means nothing if price is empty on a third of records. When field yield drops, the pipeline alerts and quarantines the affected records — clients get a late dataset over a wrong one, every time.
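
One way to picture field-level monitoring, as a sketch rather than the actual DataFlirt implementation: compute yield per field for a batch, alert on any field below its threshold, and quarantine the records missing that field. The threshold values and the quarantine rule are assumptions.

```python
def field_yield(records, field):
    """Share of records in which `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def check_batch(records, thresholds):
    """Split a batch into (deliverable, quarantined, alerts).

    `thresholds` maps field name -> minimum acceptable yield.
    """
    alerts = {}
    for field, minimum in thresholds.items():
        observed = field_yield(records, field)
        if observed < minimum:
            alerts[field] = observed
    if not alerts:
        return records, [], alerts

    good, quarantined = [], []
    for record in records:
        broken = any(record.get(f) in (None, "") for f in alerts)
        (quarantined if broken else good).append(record)
    return good, quarantined, alerts


# Every request succeeded, yet one record in three has no price.
batch = [
    {"title": "A", "price": "10"},
    {"title": "B", "price": ""},
    {"title": "C", "price": "12"},
]
good, quarantined, alerts = check_batch(batch, {"title": 0.95, "price": 0.95})
```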

05 Misconception: scraping is mostly parsing

The parsing code in a mature web scraping pipeline is typically under 10% of the codebase. The other 90% is access (proxy rotation, fingerprints, challenge handling), reliability (retries, backoff, dead-letter queues), and quality (schema validation, anomaly detection, reconciliation).

This is why "we'll just use BeautifulSoup" projects stall: the tutorial covers the 10%. It's also why LLM-based extraction hasn't eliminated the discipline — a model can replace your selectors, but it can't get you past Cloudflare, and at scale it costs more per page than the proxy that does. The hard part was never reading the HTML. It's getting the HTML, every day, without being noticed.

// 03 — the cost model

What a record
actually costs.

Scraping economics are driven by three quantities: how often requests succeed, what each request costs, and how much of what you fetch is usable. DataFlirt prices and monitors pipelines against all three — per-record cost is the metric that matters, not per-request.

Effective cost per record: C_rec = (request_cost + proxy_cost + render_cost) / success_rate
A 50% success rate doubles your real cost. Blocks are an economics problem, not just an access problem. (Source: DataFlirt pipeline accounting)

Extraction yield: Y = valid_records / pages_fetched
Yield below 0.9 means selectors are stale or the target is serving decoy content. (Source: Internal SLO)

Render decision rule: if field ∈ raw_html → httpx, else → headless_browser
Browsers cost 10–50× more per page than plain HTTP. Render only when the data demands it. (Source: DataFlirt fetch router, 2025)

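
A worked example of the three quantities, written as small Python functions. The formulas follow the definitions above; the unit prices plugged in are made-up numbers chosen only to show the arithmetic.

```python
def cost_per_record(request_cost, proxy_cost, render_cost, success_rate):
    """Effective cost per record: total per-request cost divided by success rate."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return (request_cost + proxy_cost + render_cost) / success_rate


def extraction_yield(valid_records, pages_fetched):
    """Share of fetched pages that produced a valid record."""
    return valid_records / pages_fetched if pages_fetched else 0.0


def choose_fetcher(field_in_raw_html):
    """Render only when the field is absent from the raw HTML."""
    return "httpx" if field_in_raw_html else "headless_browser"


# Hypothetical unit costs in dollars per request, with no rendering.
full = cost_per_record(0.0002, 0.0008, 0.0, success_rate=1.0)
half = cost_per_record(0.0002, 0.0008, 0.0, success_rate=0.5)

y = extraction_yield(valid_records=880, pages_fetched=1000)
needs_attention = y < 0.9
```

With these inputs `half` is twice `full`, and a yield of 0.88 falls below the 0.9 line.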
// 04 — one extraction job

From URL to record,
end to end.

A single product-page extraction on a Cloudflare-protected e-commerce target. Fetch, parse, validate — the three stages every scraping job runs, with the metrics each one reports.

httpx + fallback render · CSS selectors · schema-validated
// fetch
target: "ajio.com/p/sneaker-4410223"
proxy: "residential · IN · jio-asn"
status: 200 // no challenge served
response_time: 412ms
render_needed: false // price present in raw HTML

// parse
selector.title: "h1.prod-name" // hit
selector.price: "div.prod-sp" // hit
selector.mrp: "span.prod-cp" // hit
selector.stock: "div.size-avail" // hit

// validate
schema.check: pass // 14/14 required fields
price.sanity: pass // within 3σ of category median
record.written: "s3://df-prod/ajio/2026-06-11/...parquet"
job.cost: $0.0011 / record
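
The `price.sanity` line above suggests a check along these lines. This is a sketch under assumptions: the source does not say how σ is computed, so the population standard deviation of recent category prices is used here, and the sample prices are invented.

```python
import statistics


def price_is_sane(price, category_prices, sigmas=3.0):
    """True if `price` lies within `sigmas` standard deviations of the median."""
    median = statistics.median(category_prices)
    spread = statistics.pstdev(category_prices)
    return abs(price - median) <= sigmas * spread


# Hypothetical recent prices for the category.
category = [1800, 1900, 2000, 2100, 2200]

plausible = price_is_sane(1999, category)
# A value like 19.99 is typical of a decimal-parsing bug, not a real price.
suspicious = price_is_sane(19.99, category)
```

A check like this catches records that pass schema validation (the field is present and numeric) but carry a value no real listing would have.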
// 05 — why scrapers break

Where extraction
pipelines fail.

Root causes of scraping pipeline failures across production jobs. The pattern is consistent: access failures (blocks, challenges) get all the attention, but silent extraction failures — stale selectors returning empty fields without erroring — do the most damage to dataset quality.

PIPELINES TRACKED · 200+ active
WINDOW · 90d trailing
UPDATED · 2026-05-28

[Chart: % of incidents by root cause]

01 Layout / selector drift · Target redesigns, class renames, A/B tests
02 Anti-bot escalation · New Cloudflare rules, JS challenges, JA4 checks
03 Silent partial extraction · Selectors match but return empty or decoy data
04 Render dependency changes · Field moves from raw HTML to JS-hydrated state
05 Rate limit tightening · Target lowers per-IP thresholds without notice

// 06 — DataFlirt's extraction stack

Scrapers fail,

pipelines recover.

We treat every scraper as a pipeline with explicit contracts at each stage. The fetch layer routes between plain HTTP and Playwright based on where the data actually lives. The parse layer validates every record against a schema before it's written — a selector returning empty strings fails loudly, not silently. The ops layer monitors yield per target and alerts when extraction quality drops, usually hours before a client would notice. Most of our engineering effort goes into the recovery path, not the happy path.

Pipeline health — one target

Live extraction metrics for a continuous e-commerce price feed.

target: nykaa.com
fetch.success_rate: 99.2% · 24h
fetch.mode: httpx 94% · render 6%
parse.yield: 97.8% valid records
schema.version: v14 · 22 fields
selector.drift: 1 field flagged
records.today: 412,990 written
cost.per_record: $0.0009


// 07 — FAQ

Common
questions.

About legality, scraping JavaScript-heavy sites, build vs buy decisions, handling site redesigns, and what it takes to run extraction reliably at scale.

Is web scraping legal?
Scraping publicly accessible data is broadly lawful in the US: in hiQ v. LinkedIn the Ninth Circuit held that accessing public pages likely doesn't violate the CFAA, although hiQ was later found to have breached LinkedIn's user agreement. The picture changes with authentication walls, personal data (GDPR/DPDP apply regardless of how data was collected), and contractual terms you've agreed to. The practical rule: public data, no login, no PII, respect robots.txt, and you're in well-trodden territory. Anything beyond that needs a case-by-case legal read.
What's the difference between web scraping and web crawling?
Web scraping extracts data from pages; crawling discovers which pages exist by following links. A crawler outputs URLs, a scraper outputs records. Production pipelines chain them — crawl the category tree to build a URL manifest, then scrape each product page. They have different failure modes and should be separate, separately monitored components.
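
To make the split concrete, here is a sketch of the crawl half: it turns one category page into a URL manifest and extracts no data. The `/p/` product path prefix and the `shop.example` host are assumptions for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def build_manifest(category_html, base_url, path_prefix="/p/"):
    """Return de-duplicated same-host product URLs in page order."""
    collector = LinkCollector(base_url)
    collector.feed(category_html)
    host = urlparse(base_url).netloc
    seen, manifest = set(), []
    for link in collector.links:
        parts = urlparse(link)
        if parts.netloc == host and parts.path.startswith(path_prefix) and link not in seen:
            seen.add(link)
            manifest.append(link)
    return manifest


category_html = """
<a href="/p/sneaker-1">One</a>
<a href="/p/sneaker-1">One again</a>
<a href="/help">Help</a>
<a href="https://other.example/p/x">Elsewhere</a>
<a href="/p/sneaker-2">Two</a>
"""
manifest = build_manifest(category_html, "https://shop.example/c/sneakers")
```

The manifest is the crawler's entire output; a separate scraper then consumes it one URL at a time.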
How do you scrape sites that render everything with JavaScript?
Two options: run a headless browser (Playwright, Puppeteer) to execute the JS, or intercept the underlying API the JavaScript calls. The second is almost always better — most SPAs hydrate from JSON endpoints you can hit directly, which is 10–50× cheaper than rendering and returns cleaner data than parsed HTML. We render only when a field is provably absent from both the raw HTML and any discoverable API response.
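
When the data comes from a JSON endpoint, extraction becomes a key-path lookup instead of selector matching. A minimal sketch, with an invented payload shape and a hypothetical `get_path` helper:

```python
import json


def get_path(data, path, default=None):
    """Follow a sequence of dict keys and list indexes; return default on a miss."""
    current = data
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Invented example of what a product API response might look like.
payload = json.loads(
    '{"product": {"name": "Sneaker", "offers": [{"price": 1999, "currency": "INR"}]}}'
)

price = get_path(payload, ["product", "offers", 0, "price"])
missing = get_path(payload, ["product", "offers", 1, "price"])
```

A miss returns `None` rather than raising, so the same schema validation that guards HTML extraction still has to run on the resulting record.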
Why do scrapers keep breaking, and can that be prevented?
Because targets change without notice — redesigns, class renames, A/B tests, anti-bot upgrades. It can't be prevented, only detected fast. The fix is schema validation on every record plus yield monitoring per field: when price suddenly returns empty on 40% of pages, the pipeline alerts within minutes instead of delivering a silently broken dataset. Breakage is inevitable; silent breakage is a design choice.
Should I build a web scraping pipeline in-house or buy managed extraction?
Build if extraction is core to your product and you'll staff 1–2 engineers on it permanently — that's the real cost, not the initial script. Buy if you need the data, not the infrastructure. A working scraper takes a week; keeping it working against anti-bot escalation, layout drift, and proxy management is a permanent operational load that most teams underestimate by an order of magnitude.
How does DataFlirt price and deliver web scraping projects?
Per-record or per-month for continuous feeds, scoped after a 20-minute call. We deliver a pilot dataset within a week so you can validate fields and quality before committing. Delivery formats: CSV, JSON, Parquet to S3/GCS, or direct database push. Pipelines run with yield SLOs — if extraction quality drops below threshold, our on-call fixes it before the next scheduled delivery, not after you file a ticket.

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h