
What is Sensitive Category Data (GDPR Art. 9)?

Sensitive Category Data (GDPR Art. 9) refers to specific types of personal information, such as racial origin, political opinions, religious beliefs, biometric data, or sexual orientation, that are granted elevated protection under EU law. In the context of web scraping, inadvertently ingesting this data turns a standard public data pipeline into a high-risk compliance liability. If your crawler pulls unstructured text containing Article 9 data without explicit consent or another valid Article 9(2) exemption, you have no lawful basis to hold those records, and the dataset built on them is exposed to enforcement.

GDPR Compliance · Data Governance · PII Filtering · Legal Risk · Entity Resolution
// 02 — definitions

The toxic
payload.

Why scraping unstructured public data often accidentally ingests highly regulated personal information, and how to build pipelines that reject it at the edge.


TL;DR

Article 9 data is a strict subset of personal data that can be processed only under a narrow exemption, such as explicit consent or substantial public interest. For scraping teams, the danger isn't targeting this data. It's accidentally capturing it in free-text fields like forum posts, author bios, or review comments, which creates serious regulatory exposure.

01 Definition & structure
Sensitive Category Data (defined in GDPR Article 9, which calls it "special categories of personal data") is a special class of personal data deemed inherently risky to process. It includes:
  • Racial or ethnic origin
  • Political opinions; religious or philosophical beliefs
  • Trade union membership
  • Genetic data, and biometric data used to uniquely identify a person
  • Health data, sexual orientation, or sex life
Processing this data is generally prohibited unless a specific, narrow exemption applies.
02 How it leaks into public datasets
Scraping pipelines rarely target Article 9 data intentionally. The risk comes from unstructured text. When you scrape product reviews, forum threads, or social media bios, users frequently volunteer sensitive information ("As a Catholic...", "My heart medication..."). If your pipeline blindly ingests this text and stores it alongside a user ID or username, you are now processing Article 9 data.
03 The "Manifestly Made Public" exemption
Article 9(2)(e) allows processing if the data is "manifestly made public by the data subject." However, regulators interpret this strictly. You must prove the user intended to make it public to the world, not just to a specific forum community, and that a third party didn't post it. Because a scraper cannot verify intent at scale, relying on this exemption for bulk data extraction is legally perilous.
04 How DataFlirt handles it
We treat Article 9 data as toxic. Our extraction architecture uses strict schema definitions to avoid free-text fields whenever possible. When unstructured text is required, we route the payload through in-memory Named Entity Recognition (NER) models tuned to detect health, political, and religious entities. Flagged text is redacted before the record is serialized to disk.
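The in-memory redaction step described above can be sketched roughly as follows. This is a minimal illustration and not DataFlirt's implementation: the regular expressions are a hypothetical stand-in for trained NER models, and `ART9_PATTERNS` and `redact_art9` are names invented for this example.

```python
import re

# Hypothetical keyword patterns standing in for a trained NER model.
# A production detector would be statistical; these only illustrate the flow.
ART9_PATTERNS = {
    "HEALTH": re.compile(
        r"\b(type [12] diabetic|diabet\w+|heart medication|chemotherapy)\b",
        re.IGNORECASE,
    ),
    "RELIGION": re.compile(r"\b(catholic|muslim|jewish|hindu|atheist)\b", re.IGNORECASE),
    "POLITICS": re.compile(r"\b(party member|voted for)\b", re.IGNORECASE),
}


def redact_art9(text: str) -> tuple[str, list[str]]:
    """Return the redacted text and the categories that were detected.

    The text is only ever held in memory here; the caller is expected to
    serialize the returned (redacted) string, never the original.
    """
    hits: list[str] = []
    for category, pattern in ART9_PATTERNS.items():
        # subn returns the new string and the number of replacements made.
        text, count = pattern.subn(f"[REDACTED_{category}]", text)
        if count:
            hits.append(category)
    return text, hits


clean, categories = redact_art9("As a Catholic, my heart medication helps")
print(clean)       # As a [REDACTED_RELIGION], my [REDACTED_HEALTH] helps
print(categories)  # ['HEALTH', 'RELIGION']
```

Keyword patterns like these miss paraphrases and misspellings, which is why the text above relies on NER models for primary detection.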
05 The cost of accidental ingestion
Accidentally ingesting Article 9 data doesn't just risk fines (up to €20M or 4% of global annual turnover, whichever is higher). It poisons the dataset. If a regulator determines your dataset contains unlawfully processed sensitive data, it can order erasure of that data, which may extend to the whole dataset and to machine learning models trained on it.
// 03 — risk modeling

Quantifying
compliance risk.

DataFlirt's legal engineering team uses these heuristic models to evaluate the risk profile of new target domains before a single HTTP request is dispatched.

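Exposure Probability = P(E) = text_volume × ugc_ratio
Exposure to accidental Art. 9 ingestion grows in proportion to the volume of User-Generated Content (UGC). The result is a relative score for ranking targets rather than a calibrated probability, since it is not bounded by 1. Source: DataFlirt Risk Assessment Model
NER Filter Confidence = C = 1 − (false_negatives / total_art9_entities)
The reliability of the NLP model used to detect and drop sensitive entities at extraction time, equivalent to the model's recall on Art. 9 entities. Source: Pipeline QA Metrics
GDPR Fine Cap = F = max(€20M, 0.04 × global_revenue)
The maximum statutory penalty for processing Article 9 data without a valid lawful basis, where global_revenue is total worldwide annual turnover for the preceding financial year. Source: GDPR Article 83(5)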
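For readers who want to plug numbers into the three heuristics, here is a direct transcription into Python. The function names and input validation are assumptions added for this sketch; the arithmetic follows the formulas as written.

```python
def exposure_score(text_volume: float, ugc_ratio: float) -> float:
    """P(E) = text_volume x ugc_ratio. A relative score, not bounded by 1."""
    if not 0.0 <= ugc_ratio <= 1.0:
        raise ValueError("ugc_ratio must be a fraction between 0 and 1")
    return text_volume * ugc_ratio


def ner_filter_confidence(false_negatives: int, total_art9_entities: int) -> float:
    """C = 1 - (false_negatives / total_art9_entities), i.e. recall."""
    if total_art9_entities <= 0:
        raise ValueError("need at least one labelled Art. 9 entity to evaluate")
    return 1 - false_negatives / total_art9_entities


def gdpr_fine_cap(global_revenue_eur: float) -> float:
    """F = max(EUR 20M, 4% of global revenue), per Article 83(5)."""
    return max(20_000_000.0, 0.04 * global_revenue_eur)


# Illustrative inputs only, not measurements from any real pipeline.
print(exposure_score(10_000, 0.5))      # 5000.0
print(gdpr_fine_cap(100_000_000))       # 20000000.0 (the EUR 20M floor applies)
```

Note that the fixed €20M floor dominates until revenue exceeds €500M, after which the 4% term takes over.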
// 04 — edge filtration

Dropping toxic records
before the database.

A live trace of DataFlirt's extraction worker processing a public forum scrape. An integrated NER model flags and redacts Article 9 data in real time before it hits the delivery sink.

NER pipeline · regex fallback · quarantine
edge.dataflirt.io — live
CAPTURED
// inbound record
source.url: "https://target-forum.com/thread/8821"
record.id: "rec_992a1b"

// extraction phase
field.author: "jdoe_84"
field.body: "As a type 1 diabetic, I found this clinic..."

// compliance scan (NER model)
scan.pii_standard: clean
scan.art9_health: DETECTED // "type 1 diabetic"
scan.confidence: 0.98

// policy enforcement
action: REDACT_FIELD
field.body.redacted: "As a [REDACTED_HEALTH], I found this clinic..."

// delivery
status: cleared for export
audit.log: "art9_redaction_applied"
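The policy enforcement step in the trace above could look something like the sketch below. The `Finding` type, the two thresholds, and the `art9_quarantined` audit label are assumptions made for illustration; only `REDACT_FIELD` and `art9_redaction_applied` appear in the trace itself.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    category: str      # e.g. "health"
    span: str          # the matched text
    confidence: float  # detector score in [0, 1]


# Hypothetical thresholds; real values would come from pipeline QA.
REDACT_THRESHOLD = 0.90
QUARANTINE_THRESHOLD = 0.50


def enforce(body: str, findings: list[Finding]) -> dict:
    """Apply a redact-or-quarantine policy to one extracted field."""
    record = {"action": "PASS", "body": body, "audit": []}
    for f in findings:
        if f.confidence >= REDACT_THRESHOLD:
            # High confidence: replace the span with a category placeholder.
            placeholder = f"[REDACTED_{f.category.upper()}]"
            record["body"] = record["body"].replace(f.span, placeholder)
            record["action"] = "REDACT_FIELD"
            record["audit"].append("art9_redaction_applied")
        elif f.confidence >= QUARANTINE_THRESHOLD:
            # Uncertain detection: withhold the whole record for review.
            return {"action": "QUARANTINE", "body": None, "audit": ["art9_quarantined"]}
    return record


result = enforce(
    "As a type 1 diabetic, I found this clinic...",
    [Finding("health", "type 1 diabetic", 0.98)],
)
print(result["action"])  # REDACT_FIELD
print(result["body"])    # As a [REDACTED_HEALTH], I found this clinic...
```

Quarantining mid-confidence hits rather than passing them through trades some throughput for a lower chance of leaking a missed entity.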
// 05 — ingestion vectors

Where Article 9
data hides.

Ranked by frequency of accidental ingestion across unstructured scraping pipelines. User-generated content is the primary vector for toxic data.

PIPELINES MONITORED · 140+ UGC targets
REDACTION RATE · 0.8% of records
UPDATED · 2026-05-19
01 Forum & comment threads: High risk · Users oversharing medical or political views
02 Author bios & profiles: High risk · Self-declared sexual orientation or religion
03 Medical & health reviews: Extreme · Explicit health conditions and treatments
04 Image/Video metadata: Moderate · Biometric markers in media files
05 Union membership lists: Niche · Trade union status is strictly protected
// 06 — our approach

Scrape the public web,

leave the personal data behind.

DataFlirt operates on a strict data minimization principle. When a client requests unstructured text from high-risk domains, like medical forums or political subreddits, we deploy specialized Named Entity Recognition models at the extraction layer. If an Article 9 entity is detected, the record is either redacted or dropped entirely before it ever reaches the delivery bucket. We do not persist or deliver toxic data.

Compliance Filter Status

Real-time metrics from an active UGC scraping pipeline.

pipeline.target: public-health-forum
records.processed: 45,200
ner.model: df-art9-health-v4
pii.standard.drops: 142 records
art9.health.redacts: 89 records
compliance.status: passing
audit.trail: active


// 07 — FAQ

Common
questions.

About GDPR Article 9, public data exemptions, and how DataFlirt builds compliant extraction pipelines.

Does GDPR apply if the sensitive data is already public?
Yes. GDPR still applies to public data. Article 9(2)(e) provides an exemption if the data is "manifestly made public by the data subject." However, proving that the user themselves (not a third party) intentionally made it public is operationally impossible at scraping scale. Relying on this exemption for bulk extraction is highly risky.
What exactly counts as Article 9 data?
It includes data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, genetic data, biometric data (for identification), health data, and data concerning a person's sex life or sexual orientation.
Can we just hash or encrypt the sensitive data?
No. Pseudonymization (like hashing) is still considered "processing" under GDPR. If you don't have a lawful basis to process Article 9 data in the first place, you cannot legally ingest it to hash it. The data must be dropped or redacted before it is persistently stored.
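A short sketch of the technical side of this answer: an unsalted hash of a low-entropy value can be matched back to its plaintext with a simple dictionary lookup, so the hashed field still points at the sensitive fact. The `pseudonymize` helper and the candidate list are hypothetical, and this illustrates re-identifiability only, not a legal test.

```python
import hashlib


def pseudonymize(value: str) -> str:
    """Deterministic SHA-256 digest of a normalized string."""
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()


# What a pipeline might store instead of the raw phrase.
stored = pseudonymize("type 1 diabetic")

# An attacker (or auditor) hashes a small list of plausible conditions.
candidates = ["asthma", "type 1 diabetic", "hypertension"]
lookup = {pseudonymize(c): c for c in candidates}

# The original value falls out of a single dictionary lookup.
recovered = lookup.get(stored)
print(recovered)  # type 1 diabetic
```

Salting or keyed hashing raises the cost of this lookup, but the raw value still has to be ingested first, which is the step the answer above rules out.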
How does DataFlirt prevent accidental ingestion?
We use strict schema validation and edge-deployed Named Entity Recognition (NER) models. For structured data, we only extract explicitly requested, non-PII fields. For unstructured text, our NER models scan the payload in memory; if Article 9 entities are detected, the text is redacted before being written to disk.
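The structured-data half of this answer amounts to an allowlist: only fields named in the schema survive extraction, so unexpected free-text fields never enter the record. A minimal sketch, assuming a hypothetical product schema (`ALLOWED_FIELDS`) and helper name (`project_to_schema`):

```python
# Hypothetical allowlist for a product-catalogue scrape.
ALLOWED_FIELDS = {"product_name", "price", "currency", "rating"}


def project_to_schema(raw: dict, allowed: set[str] = ALLOWED_FIELDS) -> dict:
    """Keep only explicitly requested fields; silently discard everything else."""
    return {key: value for key, value in raw.items() if key in allowed}


scraped = {
    "product_name": "Kettle",
    "price": 19.99,
    "reviewer_bio": "As a Catholic...",  # free text that was never requested
}
print(project_to_schema(scraped))  # {'product_name': 'Kettle', 'price': 19.99}
```

An allowlist fails safe: a new field added by the target site is dropped by default instead of being ingested until someone notices it.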
What happens if a pipeline accidentally ingests this data?
If toxic data bypasses the filters, it triggers a compliance incident. The affected records are quarantined, purged from all storage layers (including backups), and the extraction logic or NER model is updated to catch the edge case.
Are B2B contact databases exempt from Article 9?
B2B data is still personal data under GDPR, but it rarely contains Article 9 categories unless it reveals trade union membership, political affiliation, or religious beliefs, directly or by inference (e.g., scraping a directory of political party employees). Standard B2B scraping usually falls under standard PII rules, not Article 9.