Your scraper hits a public jobs page, pulls names, email addresses, and employer histories at scale, and lands them in a database for outreach. The data was public. The site had no login wall. You respected robots.txt. By almost any intuitive reading, you’ve done nothing wrong.
Then a data protection authority opens an investigation.
This is the gap that trips up data teams, growth engineers, and market intelligence platforms every year. GDPR compliance for web scraping is not about whether data was public. It’s about whether you had a lawful basis for processing it, whether you minimized what you collected, and whether the people whose data you hold can exercise their rights. The rules apply whether you scraped a profile from a gated platform or a register that a government website makes freely available.
This guide covers how GDPR actually regulates web scraping: the lawful bases available, the enforcement cases that define what “compliance” means in practice, and the technical and documentation steps your pipeline needs before you collect a single personal data field at scale.
Why “publicly available” doesn’t exempt you from GDPR
GDPR defines personal data broadly: any information relating to an identified or identifiable natural person. That covers names and email addresses obviously, but also IP addresses, job titles combined with employer names, profile photos, and any combination of fields that together let you single out an individual.
The critical point that catches teams off guard is that a field being publicly accessible does not remove it from GDPR’s scope. A name appearing on a public company register is still personal data. A LinkedIn profile URL is personal data. An aggregate page view count is not.
The Polish data protection authority’s first GDPR fine made this concrete. In March 2019, a firm was fined 220,000 euros for scraping over six million Polish citizens’ business contact details from a public commercial register and using them for commercial services, without meeting the Article 14 notification obligation that applies when you collect personal data from a source other than the data subject themselves. The data was genuinely public. The fine was genuine too, per GDPR.eu’s tracking of the case.
Nearly six years later, the enforcement logic has only hardened. In December 2024, France’s CNIL fined KASPR, a data scraping company, 240,000 euros specifically for scraping LinkedIn contact details from users who had restricted their profile visibility to first- and second-degree connections. The fine confirmed a principle the regulation has always contained but enforcement has now made undeniable: exceeding the reasonable expectations of individuals whose data you process is itself a GDPR violation, even when no login wall was bypassed. The CNIL’s published decision is available on its site.
The only lawful basis that works at scraping scale
Under Article 6 GDPR, you need a lawful basis before processing any personal data. Article 6 lists six bases; for commercial web scraping, three are worth considering in theory. Two of those fail in practice at scale.
Consent (Article 6(1)(a)) requires a clear affirmative act from each individual before you collect their data. You can’t get consent from millions of people before scraping them. The European Data Protection Board’s Opinion 28/2024 on AI training data explicitly confirmed this: consent is “unlikely to serve as a valid legal basis for web scraping due to the large-scale data collection and the difficulty of identifying whose data will be scraped.”
Contractual necessity (Article 6(1)(b)) applies when processing is necessary to perform a contract with the data subject. There’s no contract between a scraper and the people whose data it collects. This basis doesn’t apply.
Legitimate interest (Article 6(1)(f)) is the only viable route, and it is a route, not a shortcut. This basis allows processing when you have a genuine, specific interest that isn’t overridden by the individuals’ rights and freedoms. It requires documented justification before you start, and ongoing accountability after. In June 2025, France’s CNIL confirmed this in published guidance, building on EDPB Opinion 28/2024: scraping is not inherently prohibited under GDPR, but legitimate interest is conditional on strict necessity and proportionality.
The legitimate interest basis under GDPR is widely misunderstood as a fallback you invoke when nothing else fits. In practice, it’s a documented risk assessment that should, and often does, result in a “no-go” decision for projects that don’t survive scrutiny.
Running the Legitimate Interest Assessment before you build
Before any scraping pipeline that touches personal data, you need a Legitimate Interest Assessment (LIA). It’s a three-part test, and all three parts must pass.
| Test | What you must establish |
|---|---|
| Purpose test | A specific, real, documented interest. Not “lead generation” but “building a B2B contact list for outreach to procurement managers at logistics firms in Germany.” |
| Necessity test | Scraping personal data is the least intrusive way to achieve this purpose. You’ve considered alternatives and ruled them out. |
| Balancing test | Your interest is not overridden by the individuals’ reasonable expectations, rights, and freedoms. |
The balancing test is where most projects fail. Key factors include: What did individuals reasonably expect when they published or were listed with this information? What is the potential harm from your use (unwanted contact, profiling, surveillance)? Is the data sensitive in any way under Article 9?
Consider what the KASPR enforcement (the CNIL’s December 2024 fine) means operationally. A LinkedIn scraper that pulls contacts from users who restricted their visibility to first- and second-degree connections fails the balancing test even if the tool respects robots.txt, because those users made an affirmative choice to limit their data’s reach. Scraping past that choice is documented evidence that their expectations were not respected.
For Glassdoor or G2 review scraping, the analysis is different. Review text is deliberately public, and extracting aggregate sentiment or ratings to benchmark product feedback is a purpose most data subjects would regard as unsurprising. That’s a much cleaner legitimate interest argument. For Zoominfo or Europages B2B contact data, the analysis sits in between, depending heavily on data type, scale, and downstream use.
The LIA must be written down before scraping begins. A verbal agreement between a data engineer and a product manager does not count. A real document that could survive a DPA inspection counts.
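One way to make the written LIA enforceable is to store it as a structured record that the pipeline checks before a run starts. The sketch below is illustrative only: the class name, field names, and the `require_lia` gate are hypothetical, and a real LIA is a fuller legal document than a few fields can capture.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class LegitimateInterestAssessment:
    """Hypothetical summary record of a written LIA (field names are illustrative)."""
    project: str
    purpose: str                        # purpose test: the specific documented interest
    alternatives_considered: list[str]  # necessity test: less intrusive options ruled out
    necessity_passed: bool
    balancing_notes: str                # balancing test: expectations, harm, Article 9
    balancing_passed: bool
    approved_fields: frozenset[str]     # the only fields the scraper may extract
    signed_off_by: str
    signed_off_on: Optional[date] = None

    def is_go(self) -> bool:
        """All three tests must pass and the document must be signed off."""
        return (
            bool(self.purpose.strip())
            and self.necessity_passed
            and self.balancing_passed
            and self.signed_off_on is not None
        )


def require_lia(lia: LegitimateInterestAssessment) -> None:
    """Refuse to start a scraping run without a passing, signed LIA."""
    if not lia.is_go():
        raise PermissionError(f"LIA for {lia.project!r} is not approved; do not scrape")


lia = LegitimateInterestAssessment(
    project="logistics-procurement-outreach",
    purpose="B2B contact list for outreach to procurement managers at logistics firms in Germany",
    alternatives_considered=["licensed contact database", "inbound sign-ups only"],
    necessity_passed=True,
    balancing_notes="Business contact details only; no Article 9 data; opt-out honored",
    balancing_passed=True,
    approved_fields=frozenset({"name", "job_title", "employer", "work_email"}),
    signed_off_by="dpo@example.com",
    signed_off_on=date(2025, 1, 15),
)
require_lia(lia)  # raises PermissionError if any test failed or sign-off is missing
```

Calling the gate at the top of every job means a "no-go" assessment stops the pipeline in code, not just on paper.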
What “data minimization” means for your scraper architecture
Data minimization is one of GDPR’s core principles (Article 5(1)(c)): collect only what is necessary for the specific purpose you’ve documented. For scraping pipelines, this is an engineering constraint as much as a legal one.
In practice, minimization means three things.
Field-level filtering at extraction. Don’t pull every field a page exposes. Use precise CSS selectors or XPath to extract only the fields your LIA covers. If your purpose is pricing intelligence on Amazon or eBay listings, you don’t need seller profile data; architect that out of scope before you deploy.
No speculative collection. “We might need it later” is not a documented purpose and will not pass a DPA inspection. The CNIL’s June 2025 guidance explicitly requires that the controller “ensure that the processing or retention of personal data is necessary, including evaluating whether the data can be retained in a form that permits identification.”
Storage limitation with enforced TTLs. Data must not be kept longer than necessary for its purpose. KASPR’s five-year retention period, automatically renewed when someone changed jobs, was ruled disproportionate for contact data used in commercial outreach. Define time-to-live rules before deployment and enforce them in your pipeline, not as a manual cleanup task.
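Field-level filtering can also be enforced in code as a second line of defense behind precise selectors. The following is a minimal sketch, assuming a pricing-intelligence purpose; the field names and the `minimize` helper are hypothetical.

```python
# Fields covered by the documented purpose (hypothetical pricing-intelligence example).
ALLOWED_FIELDS = frozenset({"product_title", "price", "currency", "listing_url"})


def minimize(record: dict, allowed: frozenset = ALLOWED_FIELDS) -> tuple[dict, list[str]]:
    """Keep only allowlisted fields; return the dropped field names for the audit log."""
    kept = {key: value for key, value in record.items() if key in allowed}
    dropped = sorted(key for key in record if key not in allowed)
    return kept, dropped


# A scraped record that accidentally includes seller profile data.
raw = {
    "product_title": "Ceramic mug",
    "price": "14.90",
    "currency": "EUR",
    "listing_url": "https://example.com/item/123",
    "seller_name": "Jane Doe",
    "seller_profile_url": "https://example.com/users/janedoe",
}

clean, dropped = minimize(raw)
# clean holds only the four allowlisted fields; dropped names the two seller fields.
```

An allowlist fails safe: a new field added by the target site is dropped by default instead of being silently collected.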
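Some use cases genuinely need population-level data but not person-level data: Tripadvisor sentiment trends, price distributions on Etsy, or salary band analysis from Indeed job posts. In these cases, data anonymization is the more robust path. Strip direct identifiers before the data lands in storage and aggregate what you can. If individuals can no longer be identified by any means reasonably likely to be used, the dataset falls outside GDPR’s scope. Hashing an identifier does not get you there on its own: a hashed email can still single out a person, so regulators generally treat it as pseudonymized data that remains in scope.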
Data pseudonymization is the middle ground: replace direct identifiers with tokens, retain the ability to re-link for specific legitimate purposes, but treat the data as still within GDPR scope. It reduces risk and is relevant when longitudinal tracking serves a documented purpose.
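As a rough illustration of tokenizing direct identifiers, the sketch below uses a keyed hash (HMAC-SHA256) from Python's standard library. The key value and function name are assumptions for the example; in practice the key would live in a secret store, separate from the data. The output is pseudonymized, not anonymized, so it stays within GDPR scope.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, key: bytes) -> str:
    """Replace a direct identifier with a stable keyed token (still personal data)."""
    normalized = identifier.strip().lower()  # same person -> same token
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Demo key only; a real key should come from a secret store and be rotated.
DEMO_KEY = b"demo-key-load-from-a-secret-store"

token = pseudonymize("Jane.Doe@example.com", DEMO_KEY)
same_person = pseudonymize("  jane.doe@EXAMPLE.com ", DEMO_KEY)
other_key = pseudonymize("Jane.Doe@example.com", b"a-different-key")
# token == same_person, which allows longitudinal linking under one key.
# token != other_key, so the mapping cannot be rebuilt without the key.
```

A keyed hash is used here instead of a plain hash because an unkeyed hash of an email address can be reversed by hashing candidate addresses.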
The right to erasure and why your deletion workflow must be built first
Article 17 GDPR gives individuals the right to request deletion of their personal data when it’s no longer necessary, when they withdraw consent, when they successfully object to processing based on legitimate interest, or when processing is found unlawful. The right to erasure is not theoretical. It’s a data engineering requirement.
If your scraping pipeline stores personal data, you need three things in place before go-live:
- A documented inventory of where personal data is stored (which tables, which fields, which downstream systems received exports)
- A mechanism to search by individual identifier and delete all records relating to a specific person
- A defined response SLA, because GDPR requires responding to Subject Access Requests (SARs) within one month
The KASPR case illustrated two specific failures here. First, KASPR initially told individuals who exercised their right of access that their data had been “collected from publicly available sources.” That answer doesn’t satisfy Article 15’s requirement to provide meaningful information about the source and purpose. Second, when people had restricted their LinkedIn visibility, KASPR had no architectural mechanism to know whose data it shouldn’t have collected in the first place.
The implication for your pipeline is direct: deletion and access workflows are not post-launch features. They need to be designed before the first data lands in production. For large-scale pipelines processing financial data, healthcare information, or B2B contact records from platforms like Xing or Europages, this is non-trivial infrastructure. But a DPA will ask for it.
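The inventory, search, and delete requirements can be sketched with an in-memory SQLite database. Table names, column names, and the inventory mapping below are hypothetical; a real pipeline would also need to cover exports, backups, and downstream systems.

```python
import sqlite3

# Hypothetical inventory: table name -> column that holds the subject identifier.
# Names come from this trusted constant, never from user input.
PERSONAL_DATA_INVENTORY = {
    "contacts": "email",
    "outreach_log": "contact_email",
}


def find_subject_records(conn: sqlite3.Connection, email: str) -> dict[str, list[tuple]]:
    """Collect every stored row for one person (basis for an access response)."""
    return {
        table: conn.execute(
            f"SELECT * FROM {table} WHERE {column} = ?", (email,)
        ).fetchall()
        for table, column in PERSONAL_DATA_INVENTORY.items()
    }


def erase_subject(conn: sqlite3.Connection, email: str) -> dict[str, int]:
    """Delete every row for one person in a single transaction; return counts per table."""
    deleted = {}
    with conn:  # commits on success, rolls back if any delete fails
        for table, column in PERSONAL_DATA_INVENTORY.items():
            cursor = conn.execute(f"DELETE FROM {table} WHERE {column} = ?", (email,))
            deleted[table] = cursor.rowcount
    return deleted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (email TEXT, name TEXT, source_url TEXT)")
conn.execute("CREATE TABLE outreach_log (contact_email TEXT, sent_at TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("a@example.com", "A. Person", "https://example.com/a"),
        ("b@example.com", "B. Person", "https://example.com/b"),
    ],
)
conn.executemany(
    "INSERT INTO outreach_log VALUES (?, ?)",
    [("a@example.com", "2025-01-10"), ("a@example.com", "2025-02-10")],
)

found = find_subject_records(conn, "a@example.com")
deleted = erase_subject(conn, "a@example.com")
remaining = find_subject_records(conn, "a@example.com")
```

Storing the source URL alongside each record, as the `contacts` table does here, is what makes a meaningful answer about the data's origin possible when an access request arrives.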
How robots.txt and crawl-delay affect your compliance posture
robots.txt has historically been a voluntary protocol, a site’s request rather than a legal constraint. GDPR has shifted its practical significance.
France’s CNIL and other EU regulators now treat robots.txt compliance as a relevant factor in the legitimate interest balancing test. Ignoring a Disallow directive signals that you disregarded the site operator’s explicit instructions, which weakens your argument that you respected the reasonable expectations of the individuals listed on that site. The legal standing of robots.txt in GDPR analysis has strengthened since EDPB Opinion 28/2024 listed respecting such protocols as a mitigating measure scraping operators should adopt.
Practically, this means four things:
- Fetch robots.txt once per domain before crawling and honor Disallow directives
- Respect crawl-delay directives; aggressive crawling without delay draws attention from both site operators and regulators
- Apply sensible rate limiting, which functions as a compliance signal as well as a technical safeguard
- Send honest HTTP headers, such as a User-Agent that identifies your crawler, where your scraping is a disclosed commercial activity
None of this makes a non-compliant scraping purpose compliant. But combined with a documented LIA, minimized field extraction, and deletion workflows, it builds a defensible compliance posture that can withstand scrutiny.
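The robots.txt checks above can be sketched with Python's standard `urllib.robotparser`. The user agent string, sample rules, and helper names are assumptions for the example. The rules are parsed from inline text so the snippet runs offline; a live crawler would call `set_url()` and `read()` once per domain instead.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleResearchBot"  # hypothetical, honest crawler identifier

# Sample robots.txt content; in production, fetch it once per domain.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 5
Disallow: /members/
""".splitlines()


def load_rules(lines: list[str]) -> RobotFileParser:
    """Parse robots.txt lines into a reusable rule set."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


def polite_delay(parser: RobotFileParser, default: float = 1.0) -> float:
    """Use the site's crawl-delay if declared, else fall back to a default pause."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


def plan_crawl(parser: RobotFileParser, urls: list[str]) -> list[tuple[str, bool]]:
    """Record an allow/skip decision per URL, suitable for an audit log."""
    return [(url, parser.can_fetch(USER_AGENT, url)) for url in urls]


rules = load_rules(SAMPLE_ROBOTS)
decisions = plan_crawl(
    rules,
    ["https://example.com/jobs", "https://example.com/members/jane"],
)
delay_seconds = polite_delay(rules)
# A crawler would sleep for delay_seconds between requests to allowed URLs only.
```

Keeping the per-URL decisions gives you evidence that Disallow directives were honored if a regulator later asks.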
Sector-specific risk: where the exposure is highest
The risk profile of GDPR non-compliance varies sharply by sector. Not all scraping sits at the same point on that curve.
Healthcare data. Article 9 GDPR designates health-related data as a special category requiring stricter protection. Scraping doctor listings from Doctoralia or patient-reported information from healthcare.gov touches this category if any field can be linked to a health condition. DataFlirt’s healthcare scraping services are designed to work with aggregate and institutional data, such as formulary pricing, appointment availability, and provider directories, not personal patient records.
Financial data. Scraping analyst reports from Bloomberg or filings from Yahoo Finance typically involves no personal data at all. Scraping personal loan applications, user account data, or individual investor histories is a different category entirely. The distinction matters for scoping your LIA.
B2B contact intelligence. This is the highest-risk category for most commercial scraping operations and the one where enforcement is most active. The KASPR case, the Polish DPA’s 220,000 euro fine, and Clearview AI’s coordinated European enforcement all involved large-scale collection of individual contact or identity data. If your pipeline targets LinkedIn, Glassdoor, Gartner peer reviews, or professional directories like Xing for contact-level data at scale, you need a documented LIA, active deletion workflows, and a clear answer to the Article 14 transparency obligation. That obligation requires informing individuals that their data has been collected, even when you didn’t get it from them directly.
Public sector and research data. Scraping Eurostat, data.europa.eu, World Bank, or Reuters news aggregates typically presents minimal personal-data risk. These sources publish aggregated, institutional, or public-interest information. For the journalism and research exceptions under GDPR Articles 85 and 89, additional safeguards apply but the compliance burden is lower.
DataFlirt’s legal data scraping services and company data services are designed around these sector-specific risk profiles, with pipeline architecture adapted to what each data type actually requires.
Cross-border transfers and non-EU scrapers
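One aspect of GDPR that non-European businesses consistently underestimate: under Article 3(2), the regulation applies to processing personal data of people in the EU/EEA when that processing relates to offering them goods or services or monitoring their behavior, regardless of where the processing entity is located.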
If you’re a company based outside the EU and you’re scraping profiles of German employees from Xing, salary data from Glassdoor for EU-based candidates, or business contacts from Europages, GDPR applies to you. Article 27 generally requires controllers in this position to appoint a representative in the EU, and the substantive obligations are identical to those of an EU-based company.
For cross-border data transfers, meaning sending EU personal data to servers outside the EEA, you need a legal mechanism: Standard Contractual Clauses (SCCs), the EU-US Data Privacy Framework if your US processor is enrolled, or another approved transfer basis under Chapter V GDPR. This is not optional infrastructure; it’s a condition of lawful processing.
For technical implementation, this typically means choosing data storage regions that keep EU personal data within the EEA unless a transfer mechanism is documented, and ensuring your scraping-to-database pipeline doesn’t route EU personal data through infrastructure that lacks the appropriate safeguards.
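A storage-region guard is one way to enforce this in the pipeline. The sketch below is an assumption-laden illustration: the region labels, the transfer register, and the reference string are all hypothetical, and whether a given transfer mechanism is valid is a legal question, not something code can decide.

```python
# Illustrative labels for storage regions located inside the EEA.
EEA_REGIONS = frozenset({"eu-central-1", "eu-west-1", "eu-north-1"})

# Hypothetical register: non-EEA region -> documented transfer mechanism reference.
DOCUMENTED_TRANSFERS = {"us-east-1": "SCC-2024-007"}


def check_storage_region(region: str, contains_eu_personal_data: bool) -> str:
    """Allow a write only if the data stays in the EEA or a transfer basis is on file."""
    if not contains_eu_personal_data:
        return "ok: no EU personal data"
    if region in EEA_REGIONS:
        return "ok: stored within the EEA"
    mechanism = DOCUMENTED_TRANSFERS.get(region)
    if mechanism is not None:
        return f"ok: transfer documented under {mechanism}"
    raise ValueError(f"No documented transfer mechanism for region {region!r}")


in_eea = check_storage_region("eu-central-1", contains_eu_personal_data=True)
with_scc = check_storage_region("us-east-1", contains_eu_personal_data=True)
# check_storage_region("ap-south-1", True) would raise ValueError.
```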
Building a defensible GDPR-compliant scraping pipeline
Compliance isn’t a policy document. It’s architecture. Here is what a defensible pipeline needs before it touches personal data at scale.
Before you build:
- Complete and document a Legitimate Interest Assessment. Be specific about purpose, necessity, and the balancing test outcome.
- Identify every personal data field your scraper might encounter and decide which to extract, which to skip, and which to anonymize at point of extraction.
- Design your storage schema with deletion in mind: know exactly which tables hold personal data and how you’d delete all records for a given individual.
- Define retention periods and TTL rules before the first run.
At extraction time:
- Use precise selectors to take only the documented fields: CSS selectors or XPath, not page-level HTML dumps.
- Strip, anonymize, or at minimum pseudonymize (for example, with a keyed hash) direct identifiers at extraction, before they reach your database, for any use case where you don’t need to re-identify individuals.
- Respect robots.txt and crawl-delay directives and log compliance decisions for your audit trail.
- Apply rate limiting as both a technical safeguard and a compliance signal.
After data lands:
- Enforce storage limits and auto-delete raw personal data after your documented retention period.
- Maintain an Article 30 record of processing activities. Every scraping project that handles personal data needs an entry.
- Build a Subject Access Request workflow: search by individual identifier, compile all records, respond within one month.
- Train anyone with database access on what personal data looks like and who can authorize processing decisions.
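Automatic deletion after the retention period can be a scheduled job, not a manual cleanup. Here is a minimal sketch using SQLite; the 180-day period, table name, and `collected_at` column are hypothetical and should come from your documented retention policy.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=180)  # hypothetical; take the real value from your LIA


def purge_expired(conn: sqlite3.Connection, now: datetime,
                  retention: timedelta = RETENTION) -> int:
    """Delete rows collected before the retention cutoff; return how many were removed."""
    # Timestamps are stored as UTC ISO-8601 strings with second precision,
    # so string comparison matches chronological order.
    cutoff = (now - retention).isoformat(timespec="seconds")
    with conn:
        cursor = conn.execute("DELETE FROM contacts WHERE collected_at < ?", (cutoff,))
    return cursor.rowcount


def utc_iso(year: int, month: int, day: int) -> str:
    """Helper to build a UTC timestamp string in the stored format."""
    return datetime(year, month, day, tzinfo=timezone.utc).isoformat(timespec="seconds")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (email TEXT, collected_at TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?)",
    [
        ("old@example.com", utc_iso(2024, 10, 1)),  # older than 180 days
        ("new@example.com", utc_iso(2025, 5, 1)),   # within retention
    ],
)

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
purged = purge_expired(conn, now)
kept = [row[0] for row in conn.execute("SELECT email FROM contacts")]
```

Note that the retention clock in this sketch starts at collection and is never reset by later updates, which avoids the rolling renewal pattern criticized in the KASPR decision.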
This isn’t a new compliance overhead for professional scraping operations. It’s the infrastructure that separates a defensible data pipeline from one that generates enforcement risk. DataFlirt builds these safeguards into every project involving personal data, from the field-selection logic at extraction time to the deletion workflows at the storage layer. Whether you’re acquiring structured data for market intelligence or running ongoing business intelligence feeds, the right architecture from the start is always cheaper than retrofitting after a DPA inquiry.
Where GDPR enforcement stands in 2026
The legal landscape for scraping personal data has clarified considerably since GDPR became applicable in 2018. Several things are now settled.
Scraping publicly accessible personal data is not inherently prohibited by GDPR, but it is regulated. Legitimate interest is the viable basis for most commercial scraping, but only with a documented LIA that survives scrutiny. “Publicly available” is not a defense against enforcement. Regulators have fined companies for scraping public data when notification obligations weren’t met and when users had taken active steps to limit their data’s reach.
The ePrivacy Directive adds another layer for certain cookie and session-based data processing. robots.txt compliance, data minimization, and deletion workflows are now compliance signals, not just engineering preferences. The EU AI Act, being implemented in phases through 2025 and 2026, adds data governance requirements for scraping used to train AI systems, building on the EDPB Opinion 28/2024 framework.
For teams building custom web crawlers at scale or evaluating data collection strategies, the right question is no longer “is our data public?” It’s “can we document why we collected it, what we did with it, and how we’d delete it if asked?”
If you want to understand how GDPR shapes responsible data crawling, or review what the top compliance and legal considerations for scrapers look like across frameworks, DataFlirt’s team builds pipelines that address both the technical and documentation layers. The guidance here is orientation, not legal advice. Consult qualified legal counsel for your specific jurisdiction and use case. But the infrastructure can be right from day one.
Frequently asked questions
Does GDPR apply to publicly available scraped data?
GDPR treats any information that can identify a natural person as personal data: names, email addresses, IP addresses, job titles, even profile photos. The critical rule most scrapers overlook is that data remains personal data regardless of whether it appears on a public webpage. Scraping six million people’s business contacts from a public register still triggers GDPR obligations, as the Polish regulator’s first-ever GDPR fine confirmed.
What lawful basis can you use to scrape personal data under GDPR?
Under Article 6 GDPR, legitimate interest (Article 6(1)(f)) is the most commonly applicable basis for commercial web scraping. Consent is largely unavailable at scraping scale, and contractual necessity rarely applies. Legitimate interest is valid only if you document and pass a three-part Legitimate Interest Assessment (purpose test, necessity test, and balancing test) before you start collecting. The EDPB Opinion 28/2024 and France’s CNIL both confirm this is the viable route, but with strict conditions attached.
How does the Legitimate Interest Assessment work for web scraping?
The Legitimate Interest Assessment (LIA) has three stages. First, the purpose test: identify a specific, documented interest, not a vague goal like “lead generation.” Second, the necessity test: scraping personal data must be the least intrusive way to achieve that purpose. Third, the balancing test: your interest must not be overridden by the individuals’ rights and reasonable expectations. If a LinkedIn user restricted their profile to first- and second-degree connections, KASPR’s 2024 fine shows that scraping them anyway fails this test regardless of “public” framing.
What penalties have regulators actually imposed for GDPR violations in web scraping?
Non-compliance with GDPR can result in fines up to 20 million euros or 4% of global annual turnover, whichever is higher. Real scraping-related enforcement includes the Polish DPA’s first-ever GDPR fine of 220,000 euros against a firm that scraped six million citizens without meeting Article 14 notification obligations, and France’s CNIL fining a data scraping company 240,000 euros in December 2024 for collecting LinkedIn contact details from users who had restricted their visibility.
How do data anonymization and pseudonymization help with GDPR compliance?
Data anonymization removes or transforms personally identifiable fields so they can no longer be linked back to an individual. Pseudonymization replaces direct identifiers with tokens while retaining analytical value. Both reduce GDPR exposure significantly. Anonymized data falls outside GDPR’s scope entirely, while pseudonymized data still requires protection but carries lower risk. For business use cases like price intelligence, sentiment analysis, or market trend monitoring, designing your scraper to extract only aggregate or anonymized fields from the start is the most durable compliance strategy.
Does the right to erasure (right to be forgotten) apply to scraped data?
Yes. GDPR gives data subjects the right to erasure (Article 17), also called the right to be forgotten, when their personal data is no longer necessary for the purpose it was collected, or when they withdraw consent. If you store scraped personal data, you need a documented deletion workflow. In the KASPR case, retaining contact details for five years from each data update, with the period automatically renewed every time someone changed jobs, was ruled disproportionate and in violation of the storage limitation principle.
How does DataFlirt approach GDPR compliance in its scraping projects?
DataFlirt builds scraping pipelines with compliance architecture built in: scraping only the fields a project requires, applying anonymization at extraction time, respecting robots.txt and crawl-delay directives, and designing deletion and access workflows before deployment. For projects that touch personal data in regulated sectors like healthcare, financial data, or B2B contact intelligence, DataFlirt recommends engaging qualified legal counsel alongside the technical build. The right infrastructure and the right legal documentation are both necessary; neither alone is sufficient.

