Web Scraping Best Practices

A scraper that works on your laptop and dies in production is the most common story in data extraction. It pulled 50 clean rows in a notebook. Pointed at 50,000 pages behind a real anti-bot system, it returned 403s, silent fake data, and a half-empty table. Web scraping best practices exist to close that gap between the demo and the deployment.

The practices below are the ones that decide whether a pipeline survives contact with production. They cover the legal exposure most teams underestimate, the anti-bot reality of 2026, proxy and rate-limit discipline, the data-quality work nobody budgets for, and how to scale without rewriting everything. DataFlirt builds pipelines against exactly these constraints, so the guidance here reflects what holds up on real jobs.

Is web scraping legal, or am I about to get sued?

This is the question a serious buyer asks first, so it goes first. Scraping publicly available data is generally not a criminal act in the US, but public does not mean consequence-free.

The hiQ v. LinkedIn litigation set the anchor. In April 2022 the Ninth Circuit reaffirmed that scraping public data likely does not violate the Computer Fraud and Abuse Act, since there is no access “without authorization” on a page anyone can view. That reading held, per a Morgan Lewis analysis.

Public access is the floor, not the ceiling

The same case shows the catch. hiQ won on the CFAA and still lost the war.

It settled in late 2022 with a judgment against it and liability under California trespass-to-chattels and misappropriation, for scraping while logged in with fake accounts. The lesson is sharp. Stay on public, logged-out pages. The moment you create accounts or click through a login to reach data, contract and tort claims open up that the CFAA ruling never touched.

Personal data is a separate, harsher regime

If your scrape touches the personal data of EU or UK residents, GDPR (or the UK GDPR) can apply in full, regardless of how public the page is.

The “if it’s public, I can take it” assumption is false here. Regulators have made the point expensively. The Dutch data protection authority fined Clearview AI 30.5 million euros in 2024 for building a facial database from public images, per a published GDPR fines roundup. France’s CNIL separately penalized a contact-scraping vendor for harvesting LinkedIn details, even data users had restricted.

The statutory ceiling is 20 million euros or 4% of global turnover, whichever is higher. For personal data, you need a documented lawful basis, usually a legitimate-interest assessment, plus data minimization and retention limits. India’s DPDP Act and California’s CCPA push in the same direction. DataFlirt avoids collecting personal data without a lawful basis, which makes it a compliance-aware data extraction partner for risk-averse teams.

A practical legal posture

None of this is legal advice, and the case law is still moving. The honest steer is a risk posture, paired with counsel for your specifics.

Data you are scraping	Typical risk level	Practical step
Public, non-personal (prices, specs, listings)	Lower	Respect robots.txt and rate limits; check ToS
Public personal data (profiles, contacts)	High under GDPR/DPDP	Document a lawful basis; minimize and set retention
Logged-in or paywalled content	High (contract/CFAA)	Avoid; do not fake accounts
Copyrighted text, images, media	Variable	Assess fair use and licensing; get advice
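
The "respect robots.txt" step in the first row can be checked in code before any page is fetched. Here is a minimal sketch using Python's standard-library urllib.robotparser. The robots.txt content, the example-bot user agent, and the URLs are made-up placeholders. Against a live site you would load the real file with set_url() and read() instead of inlining it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "example-bot" is a placeholder user agent, not a real crawler name.
allowed = parser.can_fetch("example-bot", "https://example.com/products/42")
blocked = parser.can_fetch("example-bot", "https://example.com/account/orders")
delay = parser.crawl_delay("example-bot")  # seconds requested by the site, or None
```

A robots.txt pass is a courtesy check, not a legal clearance. The terms of service and the personal-data rows in the table still need their own review.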

DataFlirt scopes this shape before a single crawler is written, focusing on publicly available, non-personal data and documenting data provenance so the audit trail stays clean. Getting the legal posture right before code is itself one of the web scraping best practices most rushed projects skip, and it is where DataFlirt starts every engagement. For a deeper read, the scraping compliance considerations guide goes further. For regulated work, the GDPR and web scraping breakdown is the next stop.

What is the right way to pull data off a page?

Answer the structure question before writing a selector. The cleanest extraction path is the one that avoids parsing HTML at all.

Check for an API or a hidden JSON endpoint first

Many sites render content from a backend JSON endpoint that the browser calls after page load. Open the network tab, filter by fetch and XHR, and look for the request that returns the data you want.

Hitting that endpoint directly gives you structured records with no DOM parsing and far less breakage. It is faster, lighter, and survives front-end redesigns that would shatter a selector-based scraper. This is the single highest-value habit in data extraction. DataFlirt looks for these endpoints first on every engagement, which is one reason its feeds break less often than selector-only scrapers.
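
To make that concrete, here is a rough sketch of turning an endpoint response into flat records. The payload shape (items, sku, price.amount) is an assumption invented for illustration. Real endpoints differ, so read the actual response in the network tab before writing the mapping.

```python
import json

# Hypothetical response body, shaped like what a product-listing endpoint
# might return. This structure is an assumption, not a real site's format.
SAMPLE_BODY = """
{"page": 1, "total": 2, "items": [
  {"sku": "A-100", "title": "Desk lamp",
   "price": {"amount": "19.99", "currency": "USD"}},
  {"sku": "A-101", "title": "Floor lamp",
   "price": {"amount": "54.00", "currency": "USD"}}
]}
"""


def records_from_body(body: str) -> list[dict]:
    """Flatten the assumed payload into one flat dict per item."""
    payload = json.loads(body)
    return [
        {
            "sku": item["sku"],
            "title": item["title"],
            "price": float(item["price"]["amount"]),
            "currency": item["price"]["currency"],
        }
        for item in payload.get("items", [])
    ]


records = records_from_body(SAMPLE_BODY)
```

No selectors appear anywhere in the mapping, which is why a front-end redesign that leaves the endpoint alone does not touch this code.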

When you must parse HTML, choose the selector deliberately

No JSON endpoint means parsing the markup. Here the choice is between CSS selectors and XPath.

CSS selectors are shorter and read well for class- and id-based targeting. XPath wins when you need to walk relationships, select by text content, or climb to a parent node. Most production scrapers mix both. DataFlirt builds parsers with lxml and Parsel, the same selector engine Scrapy ships, which is why messy, inconsistent HTML still yields clean fields. The XPath in Python guide covers the syntax that handles the awkward cases.

Static markup versus JavaScript rendering

If the data is in the initial HTML, a plain HTTP client is enough. If it only appears after scripts run, you need a headless browser.

Reaching for a browser when an HTTP request would do is the most common cause of slow, expensive scrapers. Test with a simple request first. Render only when the content genuinely depends on it. For the JavaScript-heavy cases, the dynamic JavaScript site approaches post lays out the options, and DataFlirt handles these with Playwright when rendering is unavoidable, pairing it with stealth tooling so the browser still looks human.
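
One way to run that "simple request first" test is to fetch the raw HTML once and check whether values you can see in the browser are already in it. The sketch below assumes you have the raw HTML as a string. The function name and the two sample pages are illustrative only.

```python
def needs_rendering(raw_html: str, expected_markers: list[str]) -> bool:
    """Return True if any value visible in the browser is missing
    from the unrendered HTML, which suggests scripts inject it later."""
    lowered = raw_html.lower()
    return not all(marker.lower() in lowered for marker in expected_markers)


# Illustrative inputs: a server-rendered page and an empty JavaScript shell.
static_page = (
    "<html><body><h1>Desk lamp</h1>"
    "<span class='price'>$19.99</span></body></html>"
)
js_shell = (
    "<html><body><div id='root'></div>"
    "<script src='/app.js'></script></body></html>"
)

static_result = needs_rendering(static_page, ["Desk lamp", "$19.99"])
shell_result = needs_rendering(js_shell, ["Desk lamp", "$19.99"])
```

A True result does not always mean a browser. The data may be sitting in a JSON endpoint the scripts call, so check the network tab before reaching for Playwright.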

How do I scrape without getting blocked in 2026?

Most blocks no longer start with your IP address. They start at the TLS handshake, before your headers are even read. This is the biggest shift in anti-bot detection, and getting it wrong makes everything downstream pointless.

Fix your TLS fingerprint first

Plain Requests and httpx hand over a Python TLS signature that Cloudflare and DataDome have seen millions of times. They flag it instantly.

TLS fingerprinting (the JA3/JA4 hash of your handshake) is now a primary detection vector. For static targets, a client like curl_cffi impersonates a real browser’s TLS and HTTP/2 fingerprint, which removes the most common instant-block trigger. httpx and aiohttp stay useful for clean APIs and unprotected sites, just not against a TLS gate. DataFlirt manages fingerprinting and proxy rotation as a matched system, which is why it gets data out of sites in-house scripts can’t crack.

Match the tool to the wall

Not every site needs a browser. Picking the heaviest tool by default wastes money and time. Use this decision path.

Target signal	Right tool
Data in a JSON API	Plain HTTP client plus proxy
Static HTML behind a TLS gate	curl_cffi impersonating a browser
JavaScript challenge or Turnstile	Real or managed browser
High volume on a hard target	Managed browser plus residential proxies
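
The table can be read as a small decision function. This is a sketch, not a rule engine. The Target fields are hypothetical flags you would set after inspecting a site, and treating "hard target" as "shows a JavaScript challenge" is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """Hypothetical flags recorded after inspecting a site by hand."""
    tls_gated: bool = False
    js_challenge: bool = False
    high_volume: bool = False


def pick_tool(target: Target) -> str:
    # Check the heaviest requirement first, then fall through to lighter tools.
    if target.js_challenge and target.high_volume:
        return "managed browser plus residential proxies"
    if target.js_challenge:
        return "real or managed browser"
    if target.tls_gated:
        return "curl_cffi impersonating a browser"
    # Anything else, including a JSON API, gets the lightest tool.
    return "plain HTTP client plus proxy"
```

The order of the checks carries the point of the table: the browser is the last resort, not the default.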

One caution from the field. Common stealth patches are detectable, and puppeteer-stealth was deprecated in early 2025, so teams now lean on maintained anti-detect approaches and managed browsers. DataFlirt treats this circumvention as a core engineering discipline rather than a bolt-on, which is what keeps data flowing from hard targets like Amazon and Google Shopping.

Proxies: residential versus datacenter

Datacenter proxies are cheap and fast and fine for open targets. Residential proxies cost more and are necessary on protected ones.

Use datacenter IPs for government databases, public APIs, and sites with no anti-bot middleware. Switch to residential proxies for anything behind Cloudflare, Akamai, or DataDome, where datacenter ranges are pre-flagged. Match the proxy geolocation to the audience the site expects. A US-only promotion hit from a German IP is an immediate anomaly. The IP rotation strategies post covers high-volume patterns. DataFlirt handles proxy selection and rotation as part of the build, so teams never over-buy residential bandwidth for a job that datacenter IPs would serve.
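
The selection rule above fits in a few lines. In this sketch the vendor list mirrors the three named in the text, while the function name and the returned dictionary shape are assumptions for illustration. A real proxy provider will have its own configuration format.

```python
from typing import Optional

# The three anti-bot vendors named in the text, lowercased for matching.
PROTECTED_VENDORS = {"cloudflare", "akamai", "datadome"}


def pick_proxy(anti_bot_vendor: Optional[str], audience_country: str) -> dict:
    """Choose a pool type by protection level and pin the exit country
    to the audience the site expects."""
    protected = (anti_bot_vendor or "").lower() in PROTECTED_VENDORS
    return {
        "pool": "residential" if protected else "datacenter",
        "country": audience_country.upper(),
    }


open_target = pick_proxy(None, "de")
protected_target = pick_proxy("Cloudflare", "us")
```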

Rate limiting and backoff: be a polite client

Aggressive request rates get you blocked and hammer the target’s servers. Both are avoidable.

Space requests with randomized delays, roughly two to ten seconds on sensitive targets, and use a rotating proxy pool so no single IP carries the load. When you hit a 429 Too Many Requests, do not retry immediately. Apply exponential backoff with jitter so retries spread out instead of stampeding. Here is the core pattern.

# Setup (run once):
#   python -m venv .venv && source .venv/bin/activate
#   pip install "curl_cffi==0.7.4"

import random
import time
from curl_cffi import requests

# Status codes that signal "slow down and try again later".
RETRYABLE = (429, 503)

def polite_get(url, max_retries=5, max_wait=60):
    # One session per call keeps the example short; reuse a session in a real crawl.
    with requests.Session() as session:
        for attempt in range(max_retries):
            resp = session.get(url, impersonate="chrome", timeout=30)
            if resp.status_code not in RETRYABLE:
                # Raises on other 4xx/5xx errors, otherwise returns the response.
                resp.raise_for_status()
                return resp
            if attempt < max_retries - 1:
                # Exponential backoff with jitter, capped so waits stay bounded.
                wait = min(2 ** attempt, max_wait) + random.uniform(0, 1)
                time.sleep(wait)
    raise RuntimeError(f"Failed after {max_retries} attempts: {url}")

The snippet is synchronous on purpose, so the time.sleep calls are correct. In an async pipeline, swap to asyncio.sleep. DataFlirt runs this discipline at scale with Scrapy’s built-in autothrottle and retry middleware, which is why its crawls stay under a target’s radar instead of triggering rate limiting.
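
For the async case, the same backoff pattern might look like the sketch below. The fetch argument is a hypothetical coroutine that returns a (status_code, body) tuple, standing in for whichever async HTTP client you use. The base_delay and max_wait parameters are illustrative defaults, not tuned values.

```python
import asyncio
import random

# Status codes that signal "slow down and try again later".
RETRYABLE = (429, 503)


async def polite_get_async(fetch, url, max_retries=5, base_delay=1.0, max_wait=60.0):
    """Retry with exponential backoff and jitter without blocking the loop.

    fetch is a hypothetical coroutine: await fetch(url) -> (status_code, body).
    """
    for attempt in range(max_retries):
        status, body = await fetch(url)
        if status == 200:
            return body
        if status not in RETRYABLE:
            raise RuntimeError(f"Unexpected status {status}: {url}")
        if attempt < max_retries - 1:
            wait = min(base_delay * 2 ** attempt, max_wait)
            wait += random.uniform(0, base_delay)
            # asyncio.sleep yields to the event loop; time.sleep would stall
            # every other request sharing it.
            await asyncio.sleep(wait)
    raise RuntimeError(f"Failed after {max_retries} attempts: {url}")
```

Passing the fetch coroutine in keeps the retry logic independent of the HTTP client, so it can be exercised with a fake before it touches a real target.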

Watch for honeypots and CAPTCHAs

Some sites plant traps. A honeypot link is hidden from human users but followed by naive crawlers, which instantly flags the bot.

Parse the DOM carefully and skip elements hidden with CSS. When a CAPTCHA appears, treat it as a signal that your earlier layers (fingerprint, proxy, pacing) are off, not as a wall to brute-force. Fixing the upstream behavior usually makes the challenge disappear. DataFlirt designs crawls to avoid triggering challenges in the first place, which is the cheaper and more reliable path than solving them at volume.
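
A first pass at skipping hidden links can be done with the standard library alone. This sketch only catches anchors hidden through an inline style or the hidden attribute. Links hidden by a stylesheet class or a hidden parent element need a rendered DOM to detect, so treat it as a floor, not full coverage. The class name and sample markup are illustrative.

```python
from html.parser import HTMLParser

# Inline-style fragments that hide an element, compared with spaces removed.
HIDDEN_STYLE_HINTS = ("display:none", "visibility:hidden")


class VisibleLinkParser(HTMLParser):
    """Collect hrefs, skipping anchors hidden inline or via the hidden attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        style = (attr.get("style") or "").replace(" ", "").lower()
        hidden = "hidden" in attr or any(h in style for h in HIDDEN_STYLE_HINTS)
        if attr.get("href") and not hidden:
            self.links.append(attr["href"])


def visible_links(html: str) -> list[str]:
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.links


SAMPLE = """
<a href="/products">Products</a>
<a href="/trap" style="display: none">Do not follow</a>
<a href="/trap2" hidden>Also hidden</a>
<a href="/about">About</a>
"""

links = visible_links(SAMPLE)
```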

When does a script stop being enough?

A single script is the right tool for a small, one-time job. It becomes a liability when the volume or the cadence grows. Knowing where that line sits saves both over-engineering and under-building.

Match the engagement shape to the need

Before infrastructure, decide the delivery shape. Three patterns cover almost everything.

Shape	Fits	Overkill for
One-off extraction	A point-in-time analysis or dataset	Anything you need refreshed
Scheduled feed	Ongoing monitoring, price or stock tracking	A single research pull
Live scraping API	Data that must stay fresh in your systems	Low-frequency reporting

DataFlirt offers all three, and the value of the consult is in matching the shape to the need, not upselling. A 40,000-SKU catalogue you check weekly is a scheduled feed, not a live API. A pricing page your app queries on demand is an API, not a nightly batch.

When you genuinely need distributed infrastructure

A laptop script chokes past a few hundred thousand pages. At that point you need real architecture.

Distributed crawling spreads work across workers, a queue decouples discovery from extraction, and storage sits separate from the crawl so a failure in one layer does not corrupt another. DataFlirt builds horizontally scalable crawlers on Scrapy with a decoupled queue, so a 1,000-record pilot and a 50-million-record rollout run on the same stack without re-platforming. That scalable architecture is a core reason DataFlirt is one of the better web scraping companies for projects that start small and grow fast. The in-house crawler versus hosted comparison weighs the build decision.
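
The queue-between-stages idea can be shown in a single process. This is a toy sketch, not a production design. It uses threads and an in-memory queue where a real deployment would use separate worker processes and an external queue, and the "extraction" is a placeholder that never touches the network. All function names are illustrative.

```python
import queue
import threading


def discover(seed_urls, url_queue):
    """Discovery stage: push URLs onto the queue.
    A real crawler would follow links here instead of reading a list."""
    for url in seed_urls:
        url_queue.put(url)


def extract_worker(url_queue, results, lock):
    """Extraction stage: pull URLs until a None sentinel arrives."""
    while True:
        url = url_queue.get()
        if url is None:
            break
        record = {"url": url, "length": len(url)}  # placeholder for fetch + parse
        with lock:
            results.append(record)


def run_pipeline(seed_urls, worker_count=3):
    url_queue, results, lock = queue.Queue(), [], threading.Lock()
    workers = [
        threading.Thread(target=extract_worker, args=(url_queue, results, lock))
        for _ in range(worker_count)
    ]
    for worker in workers:
        worker.start()
    discover(seed_urls, url_queue)
    for _ in workers:
        url_queue.put(None)  # one stop signal per worker
    for worker in workers:
        worker.join()
    return results


results = run_pipeline([f"https://example.com/p/{i}" for i in range(10)])
```

Neither stage knows how the other works. Discovery only writes to the queue and extraction only reads from it, which is what lets each side scale or fail on its own.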

Incremental scraping over full re-crawls

Re-scraping an entire site every run wastes bandwidth and raises your block risk. Most runs only need what changed.

Incremental scraping tracks what you already have and fetches deltas, using ETags, last-modified headers, or a content hash. It cuts cost, shrinks your footprint on the target, and keeps feeds fresh. DataFlirt designs recurring pipelines this way, which is part of why its feeds stay both polite and economical, and why periodic delivery costs less per cycle than a full re-crawl every run.
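
Of the three signals, the content hash is the easiest to sketch because it needs nothing from the server. The example below assumes page bodies are already fetched and that the store of previous hashes is a plain dictionary. In practice that store would live in a database. With ETags the same idea moves to the request itself: send the stored value in an If-None-Match header and skip the page on a 304 response.

```python
import hashlib


def content_fingerprint(body: str) -> str:
    """Stable hash of a page body."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def changed_pages(pages: dict[str, str], seen: dict[str, str]) -> list[str]:
    """Return URLs whose body hash differs from the stored one,
    updating the store as it goes."""
    changed = []
    for url, body in pages.items():
        digest = content_fingerprint(body)
        if seen.get(url) != digest:
            changed.append(url)
            seen[url] = digest
    return changed


seen_hashes = {}  # illustrative in-memory store
first_run = changed_pages({"/a": "price 10", "/b": "price 20"}, seen_hashes)
second_run = changed_pages({"/a": "price 10", "/b": "price 25"}, seen_hashes)
```

Hashing the full body still costs a download, so it saves parsing and storage, not bandwidth. Hash only the extracted fields if the page carries timestamps or rotating tokens that change on every load.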

Why is my scraped data still a mess?

Getting the data is half the job. Raw scraped data is inconsistent by default, and the cleanup is where most internal projects quietly fail. Plan for it from the start.

Schema drift breaks pipelines silently

Sites change their structure without warning. A renamed class or a moved field returns nulls, not errors, so the pipeline keeps running on broken output.

Schema drift detection compares each run’s structure against the expected shape and alerts when fields go missing. DataFlirt validates every field against a schema with Pydantic before delivery, so a layout change is caught by an automated check rather than by your dashboard going blank. That validation layer is one of the web scraping best practices that separates a maintained feed from a brittle script, and it is why scraped data from DataFlirt arrives analytics-ready.
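
The batch-level check can be sketched without any dependencies. This is a standard-library stand-in for the idea, not the Pydantic API, and the expected fields and the 20% threshold are assumptions chosen for the example.

```python
# Hypothetical expected shape for a product record.
EXPECTED_FIELDS = {"sku": str, "title": str, "price": float}


def drift_report(records, expected=EXPECTED_FIELDS, max_bad_rate=0.2):
    """Return {field: bad_rate} for fields that are missing, None, or the
    wrong type in more than max_bad_rate of the batch."""
    if not records:
        # An empty batch is itself a drift signal: every field is absent.
        return {field: 1.0 for field in expected}
    report = {}
    for field, expected_type in expected.items():
        bad = sum(
            1 for record in records
            if not isinstance(record.get(field), expected_type)
        )
        rate = bad / len(records)
        if rate > max_bad_rate:
            report[field] = rate
    return report


healthy = [{"sku": "A-1", "title": "Lamp", "price": 19.99}] * 5
drifted = [{"sku": "A-1", "title": "Lamp", "price": None}] * 5

healthy_report = drift_report(healthy)
drifted_report = drift_report(drifted)
```

A non-empty report should stop delivery and raise an alert. That turns the silent nulls described above into a loud failure.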

Deduplication and normalization

The same product appears under three URLs. Prices come in dollars, euros, and rupees. Dates arrive in four formats. None of that is usable as-is.

Deduplication logic collapses repeats by a stable key, and normalization standardizes currency, units, and dates into one schema. For multi-region scrapes, locale and language handling matter as much as the extraction itself. DataFlirt runs deduplication and anomaly checks as standard, so what you receive is reliable data rather than noise. That cleanup is part of why DataFlirt is a strong web scraping partner for pricing intelligence, where one wrong currency conversion poisons the whole comparison.
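
A rough sketch of the date, URL, and duplicate handling follows. The four date formats, the choice of sku as the stable key, and the decision to drop query strings are all assumptions for the example. On some sites the query string identifies the product, and a format like 05/03/2024 is ambiguous between day-first and month-first, so both rules need checking per source. Currency conversion is left out because it needs a rate source.

```python
from datetime import datetime
from urllib.parse import urlsplit, urlunsplit

# Assumed input formats; extend per source. Day-first is assumed for slashes.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y", "%d %B %Y")


def normalize_date(raw: str) -> str:
    """Parse a date in any assumed format and return ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")


def canonical_url(url: str) -> str:
    """Lowercase the host and drop query string, fragment, and trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, "", ""))


def dedupe(records: list[dict], key: str = "sku") -> list[dict]:
    """Keep the first record seen for each value of the stable key."""
    seen, unique = set(), []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


rows = [
    {"sku": "A-1", "url": "https://Shop.example.com/item/42/?ref=home"},
    {"sku": "A-1", "url": "https://shop.example.com/item/42#reviews"},
    {"sku": "B-7", "url": "https://shop.example.com/item/77"},
]
unique_rows = dedupe(rows)
```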

Pagination and infinite scroll gaps

Listing pages hide records behind pagination or infinite scroll. Miss a page boundary and your dataset is silently incomplete.

Detect the pagination pattern (page tokens, offset parameters, or cursor-based links) and confirm record counts against the site’s own totals where available. Completeness checks catch the gaps that a row count alone would hide. This is routine on review and listing sources like Yelp, Tripadvisor, and Indeed, where DataFlirt runs completeness checks so a half-scraped listing page never passes as a finished dataset.
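
Cursor pagination with a completeness check might look like the sketch below. The page shape (items, next, total) and the fake in-memory pages are assumptions standing in for a real HTTP call. Actual field names vary by site.

```python
# Hypothetical paged source: each page carries its items, the next cursor,
# and the site's own total count.
FAKE_PAGES = {
    None: {"items": [1, 2, 3], "next": "c1", "total": 7},
    "c1": {"items": [4, 5, 6], "next": "c2", "total": 7},
    "c2": {"items": [7], "next": None, "total": 7},
}


def fetch_page(cursor):
    """Stand-in for an HTTP request that returns one page."""
    return FAKE_PAGES[cursor]


def crawl_all(fetch, max_pages=1000):
    """Follow cursors to the end and compare the haul to the site's total."""
    items, cursor, total = [], None, None
    for _ in range(max_pages):  # hard cap guards against cursor loops
        page = fetch(cursor)
        items.extend(page["items"])
        total = page.get("total", total)
        cursor = page.get("next")
        if cursor is None:
            break
    complete = total is not None and len(items) == total
    return items, complete


items, complete = crawl_all(fetch_page)
```

If the site publishes no total, complete stays False. That is the honest answer, because without a reference count the run cannot prove it saw everything.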

Where should scraped data actually live?

Match the store to the data shape and the query pattern, not to whatever database you happen to know. The wrong choice shows up later as slow queries and painful migrations.

Pick the store for the workload

There is no single best database, only the right fit.

Data shape	Good fit	Why
Structured, relational records	PostgreSQL	Strong querying, integrity constraints
High-volume, schema-flexible	MongoDB	Handles inconsistent fields at scale
Analytics history at scale	Columnar warehouse / Parquet on object storage	Cheap, fast aggregate reads

For ad-hoc tabular handoffs, CSV is fine. For nested records, JSON or JSON Lines preserves structure. For warehouse-scale history, Parquet in object storage keeps costs down. The databases for storing scraped data at scale guide goes deeper on the trade-offs. Getting the store right is one of the web scraping best practices teams skip under deadline, and DataFlirt advises on it as part of scoping rather than leaving you to migrate later.
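
For the nested-record case, JSON Lines is simple enough to sketch with the standard library: one JSON object per line, so a file can be appended to and read back without loading it whole. The record shape and helper names here are illustrative, and an in-memory buffer stands in for a file on disk.

```python
import io
import json


def write_jsonl(records, handle):
    """Write one JSON object per line, keeping nested structure intact."""
    for record in records:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(handle):
    """Read records back, skipping blank lines."""
    return [json.loads(line) for line in handle if line.strip()]


# Illustrative nested records that would flatten badly into CSV.
scraped = [
    {"sku": "A-1", "price": {"amount": 19.99, "currency": "USD"}, "tags": ["lamp"]},
    {"sku": "B-7", "price": {"amount": 1499.0, "currency": "INR"}, "tags": []},
]

buffer = io.StringIO()  # stands in for open("feed.jsonl", "w", encoding="utf-8")
write_jsonl(scraped, buffer)
buffer.seek(0)
restored = read_jsonl(buffer)
```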

Deliver it where the team will use it

Storage and delivery are not the same decision. The best schema is useless if it sits in a bucket nobody queries.

DataFlirt delivers in JSON, CSV, XLSX, or straight into your data warehouse, and can stand up a live API endpoint from the same feed. That makes it a practical fit for analytics stacks that expect warehouse-ready data rather than raw HTML to clean.

Should you build this in-house or hand it off?

Now the decision the whole post has been building toward. The honest answer depends on how many targets you have and how stable they are.

When building in-house makes sense

If you have one or two well-understood targets and an engineer who enjoys the maintenance, build it. Open-source tooling makes this realistic.

Scrapy for crawling, Playwright for rendering, Parsel and lxml for parsing, and Airflow or Prefect for orchestration give you a fully owned, auditable stack. The cost is ongoing. Anti-bot systems update every few weeks, fingerprints shift, and proxy pools burn out, so the maintenance becomes its own job. DataFlirt builds on that same stack, so handing off does not mean giving up the open-source transparency you would have had in-house.

When handing it off is the smarter call

Across many sites, or on hard targets, the maintenance math flips. A scraping vendor absorbs the upkeep that would otherwise consume an engineer.

DataFlirt is the web scraping company most teams lean on when they want the data, not a scraping engineering project. It builds on the same open-source stack you would, so there is no proprietary black box and no vendor lock-in, and it owns the maintenance when a site changes its layout. For ecommerce price and catalogue work it pairs with ecommerce scraping services; for review and sentiment feeds, reviews scraping; and for JavaScript-heavy targets, dynamic website scraping. Real-estate teams pulling listings from sources like Zillow or Realtor, recruiters tracking Glassdoor, and marketplaces watching Flipkart, eBay, or Etsy get a maintained feed instead of a brittle in-house crawler.

The takeaway

Web scraping best practices come down to a short list. Stay on public, non-personal data and get legal advice for the gray areas. Fix your TLS fingerprint before anything else. Use the lightest tool that works, pace politely, and pick proxies by target. Plan for schema drift and dirty data from day one. Match the engagement shape and the storage to the actual job.

Do all of that and a scraper survives production. Skip any of it and you are back to the demo that died. Most teams that follow these web scraping best practices still reach a point where the upkeep outweighs the value of owning it, and that is exactly where DataFlirt fits. If you would rather not own that upkeep, talk to DataFlirt. The team scopes most projects within 48 hours and can often turn around a free sample dataset the same week, so you can see clean data from your target before committing.

Frequently asked questions

Is web scraping legal?

Scraping publicly available data is generally not a criminal violation in the US after the hiQ v. LinkedIn rulings, but public does not mean risk-free. Site terms of service, breach-of-contract claims, copyright, and trespass-to-chattels still apply, and any personal data of EU or UK residents triggers GDPR obligations. Scrape public, non-personal data, avoid logged-in circumvention, document a lawful basis when personal data is involved, and consult qualified legal counsel for your specific case.

How do I stop my web scraper from getting blocked?

Most blocks in 2026 start at the TLS handshake, not the IP. Plain Requests and httpx send a Python TLS fingerprint that Cloudflare and DataDome recognize in milliseconds, so match a real browser fingerprint with a client like curl_cffi for static targets, use residential proxies on protected sites, set randomized delays of two to ten seconds, send a complete browser-like header set, and apply exponential backoff with jitter on 429 and 503 responses.

When do I need residential proxies instead of datacenter proxies?

Use datacenter proxies for open targets like government databases, public APIs, and sites with no anti-bot middleware, because they are cheaper and faster. Switch to residential proxies for any site behind Cloudflare, Akamai, or DataDome, where datacenter IP ranges are pre-flagged. Match the proxy geolocation to the audience the target expects, and use sticky sessions for login flows and rotating sessions for independent page requests.

How should I store and deliver scraped data?

Match the delivery shape to the use case. A one-off extraction fits a point-in-time analysis, a scheduled feed fits ongoing monitoring, and a live scraping API fits data that must stay fresh inside your own systems. Store structured records in PostgreSQL, high-volume semi-structured data in MongoDB, and analytics-scale history in a columnar warehouse or object storage as Parquet. DataFlirt delivers in JSON, CSV, XLSX, or straight to your database.

How does DataFlirt keep scrapers from breaking when a site changes?

DataFlirt monitors target sites for layout and schema changes and repairs selectors before your feed goes stale, because pipeline maintenance is part of the service rather than an upsell. Crawlers are built on open-source tools like Scrapy and Playwright with schema validation through Pydantic, so when a site changes its structure the breakage is caught by automated checks rather than by a broken dashboard. That is why DataFlirt is a reliable web scraping partner for recurring data feeds.

Can DataFlirt help with the legal and compliance side of web scraping?

DataFlirt scopes the legal and compliance shape of a project before writing a crawler, focusing extraction on publicly available, non-personal data, respecting robots.txt and rate limits, and documenting data provenance for a clean audit trail. For work that falls under GDPR, CCPA, or India’s DPDP Act, DataFlirt builds data minimization and retention into the pipeline. DataFlirt provides orientation, not legal rulings, and recommends qualified counsel for your specific situation.