
What is DPDP Act (India) and Scraping?

DPDP Act (India) and Scraping refers to the compliance obligations that the Digital Personal Data Protection Act, 2023, imposes on data extraction pipelines operating within India or targeting Indian residents. Unlike the GDPR's flexible legitimate interest clause, the DPDP Act relies heavily on explicit consent and a strict "publicly available data" exemption. For scraping engineers, this means your extraction layer must actively identify and redact personally identifiable information (PII) before it hits the delivery sink, or risk severe non-compliance penalties.

Compliance · PII Redaction · Data Governance · Legal Risk · India
// 02 — definitions

Scraping under
the DPDP Act.

How India's 2023 privacy framework changes the rules of engagement for web scraping, and why pipeline-level PII redaction is no longer optional.


TL;DR

The DPDP Act mandates strict consent for processing personal data of Indian digital citizens, but includes a critical exemption for data made publicly available by the user. For scraping pipelines, the operational challenge is distinguishing between user-published public data and platform-published personal data, requiring automated PII filtering at the extraction layer.

01 Definition & scope
The Digital Personal Data Protection (DPDP) Act, 2023 is India's comprehensive privacy legislation. For data engineering teams, it regulates how the digital personal data of individuals in India can be collected, processed, and stored. In the context of web scraping, it puts the burden on the scraper to show that any extracted PII (names, phone numbers, emails, identifiers) was either collected with explicit consent or falls under the strict exemption for data made publicly available by the user themselves.
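02 The "Publicly Available" exemption
Unlike the GDPR, which allows processing under a general "legitimate interests" basis, the DPDP Act offers scrapers a narrow choice: consent, one of the enumerated "legitimate uses" in Section 7 (none of which covers general commercial scraping), or an exemption. The most relevant exemption for scraping is Section 3(c)(ii), which takes personal data outside the Act when it has been made publicly available by the Data Principal (the user) or by a person legally obliged to publish it. If a freelancer posts their email on their own public portfolio, it is exempt. If a platform exposes a user directory without the users' knowledge, the exemption does not apply and scraping it is likely a violation.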
03 Data minimization in pipelines
The DPDP Act heavily penalises the over-collection and unnecessary retention of personal data. For scraping infrastructure, this means the traditional "scrape everything now, filter later" approach is legally toxic. Pipelines must implement data minimization at the extraction layer: parse the DOM, extract only the required non-personal fields, and discard the raw HTML payload immediately, before anything is written to disk.
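The extract-then-discard pattern described above can be sketched roughly as follows. This is a minimal illustration using only Python's standard-library `html.parser`, not DataFlirt's actual extraction worker. The class names in `ALLOWED_FIELDS` (`biz-name`, `biz-category`, `biz-city`) and the flat markup they assume are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical allow-list: CSS class on the page -> field name in the output record.
# Anything not listed here (phone numbers, owner names, ...) is never captured.
ALLOWED_FIELDS = {
    "biz-name": "business_name",
    "biz-category": "category",
    "biz-city": "city",
}


class AllowListExtractor(HTMLParser):
    """Collects text only from elements whose class is on the allow-list.

    Assumes flat markup: each field's text sits directly inside its tagged element.
    """

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None  # output field currently being captured, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        # Capture only if this element carries an allow-listed class.
        self._current = next(
            (ALLOWED_FIELDS[c] for c in classes if c in ALLOWED_FIELDS), None
        )

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        if self._current and data.strip():
            self.record[self._current] = data.strip()


def extract_minimal(raw_html: str) -> dict:
    """Return only allow-listed fields; the raw HTML is never persisted."""
    parser = AllowListExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.record
```

The design choice here is an allow-list rather than a block-list: a field the pipeline has not explicitly asked for is dropped by default, so a new personal field appearing on the page does not leak into the output.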
04 How DataFlirt handles DPDP compliance
We treat compliance as an engineering problem. Our extraction workers deployed in the ap-south-1 region use inline Named Entity Recognition (NER) to detect and redact Indian PII formats (Aadhaar, PAN, +91 phone numbers) in real time. If a client requests a dataset of business listings, our pipeline automatically nulls any personal contact fields associated with the listing before the JSON record is delivered to the client's S3 bucket.
05 The risk of children's data
The DPDP Act places severe restrictions on processing the data of individuals under 18, requiring verifiable parental consent and prohibiting behavioral tracking. Scraping platforms with a high density of minors (e.g., certain social media or educational forums) carries sharply higher risk, as the "publicly available" exemption does not override the strict protections afforded to children's data under the Act.
// 03 — compliance metrics

Measuring your
exposure risk.

Under the DPDP Act, retaining personal data without consent or a valid exemption creates direct liability. DataFlirt uses these metrics to enforce data minimization across all India-targeted pipelines.

PII Exposure Risk: R = records_scraped × pii_density × retention_days
DPDP penalises unnecessary retention. Keep retention_days at zero for raw HTML. (Data Minimization Principle)

Redaction Confidence: C = 1 − (false_negatives / total_pii_entities)
DataFlirt's NER models maintain C > 0.998 for Indian phone numbers and Aadhaar formats. (DataFlirt Extraction SLO)

DPDP Penalty Cap: P = min(breach_severity, ₹250Cr)
Statutory maximum penalty per instance under the 2023 Act. (DPDP Act, 2023 Schedule)
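As a worked example, the first two metrics can be computed directly from their definitions. The input numbers below are made up for illustration and are not DataFlirt measurements. Treating an empty PII set as full confidence is an assumption of this sketch.

```python
def pii_exposure_risk(records_scraped: int, pii_density: float, retention_days: float) -> float:
    """R = records_scraped * pii_density * retention_days."""
    return records_scraped * pii_density * retention_days


def redaction_confidence(false_negatives: int, total_pii_entities: int) -> float:
    """C = 1 - (false_negatives / total_pii_entities)."""
    if total_pii_entities == 0:
        return 1.0  # assumption: nothing to redact counts as full confidence
    return 1 - false_negatives / total_pii_entities


# Made-up inputs for illustration only.
retained = pii_exposure_risk(records_scraped=100_000, pii_density=0.05, retention_days=30)
zero_retention = pii_exposure_risk(records_scraped=100_000, pii_density=0.05, retention_days=0)
confidence = redaction_confidence(false_negatives=2, total_pii_entities=1_000)

print(f"R with 30-day retention: {retained:,.0f}")
print(f"R with zero retention:   {zero_retention:,.0f}")
print(f"C: {confidence:.3f}")
```

Because R is a product, setting retention_days to zero drives the exposure score to zero regardless of how many records are scraped, which is the arithmetic behind the zero-retention rule for raw HTML.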
// 04 — pipeline execution

Filtering PII at
the extraction edge.

A scraper hitting an Indian business directory. The pipeline is configured to extract corporate data while actively dropping personal mobile numbers to maintain DPDP compliance.

NER filtering · ap-south-1 · PII redaction
edge.dataflirt.io — live
CAPTURED
// pipeline ingestion
source.target: "in-business-directory"
record.raw: { name: "Rahul Sharma", phone: "+91-9876543210", gst: "22AAAAA0000A1Z5" }

// dpdp compliance filter (edge worker)
ner.detect(record.raw.phone): "Personal Mobile Number"
ner.detect(record.raw.gst): "Business Identifier"

// redaction policy applied
policy.action: "DROP_FIELD"
record.clean: { name: "Rahul Sharma", phone: null, gst: "22AAAAA0000A1Z5" }

// delivery
sink.write(record.clean): SUCCESS
memory.flush(record.raw): CLEARED // zero retention
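The trace above can be approximated with a small field-level policy. This is a sketch under stated assumptions: the regular expressions stand in for the NER step and are simplified format checks, not DataFlirt's models, and `classify`, `apply_policy`, and `DROP_LABELS` are hypothetical names.

```python
import re

# Simplified stand-ins for the NER classifiers (assumptions, not production rules).
PHONE_RE = re.compile(r"^\+91[\s-]?[6-9]\d{9}$")  # +91 followed by a 10-digit mobile number
GSTIN_RE = re.compile(r"^\d{2}[A-Z]{5}\d{4}[A-Z][1-9A-Z]Z[0-9A-Z]$")  # 15-character GSTIN shape

# Labels whose fields the policy nulls out (the DROP_FIELD action in the trace).
DROP_LABELS = {"Personal Mobile Number"}


def classify(value: str) -> str:
    """Label a field value by format."""
    if PHONE_RE.match(value):
        return "Personal Mobile Number"
    if GSTIN_RE.match(value):
        return "Business Identifier"
    return "Unclassified"


def apply_policy(raw: dict) -> dict:
    """Return a copy of the record with dropped fields set to None."""
    clean = {}
    for field, value in raw.items():
        label = classify(value) if isinstance(value, str) else "Unclassified"
        clean[field] = None if label in DROP_LABELS else value
    return clean


raw = {"name": "Rahul Sharma", "phone": "+91-9876543210", "gst": "22AAAAA0000A1Z5"}
clean = apply_policy(raw)
del raw  # drop the reference to the unredacted record before delivery
print(clean)
```

Setting the field to `None` rather than deleting the key keeps the record schema stable for downstream consumers, which matches the `phone: null` shape in the trace.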
// 05 — compliance failures

Where DPDP compliance
fails in scraping.

Ranked by frequency of occurrence in unmanaged scraping pipelines targeting Indian data sources. Accidental ingestion is the primary driver of regulatory risk.

PIPELINES AUDITED: 120+ IN-targeted
PRIMARY RISK: Phone numbers
UPDATED: 2026-05-19
01 Accidental PII ingestion
High risk · Scraping personal emails/phones alongside business data

02 Raw HTML retention
Medium risk · Storing unparsed DOMs containing user personal data

03 Erasure request failure
Process risk · Inability to delete specific records upon user request

04 Children's data scraping
Severe risk · Strictly prohibited without verifiable parental consent

05 Cross-border transfer limits
Legal risk · Routing data through government-restricted territories
// 06 — our architecture

Redact at the edge,
never store what you don't have consent to hold.

Under the DPDP Act, holding personal data without consent or a valid exemption is a massive liability. DataFlirt handles this by pushing PII redaction to the extraction edge. Our workers run lightweight Named Entity Recognition (NER) models locally to strip Indian phone numbers, email addresses, and national IDs from the DOM before the record is ever serialized. The raw HTML is discarded immediately. If it never hits the database, it can't be breached.

dpdp-redaction-worker

Live telemetry from an extraction worker processing Indian public records.

worker.region ap-south-1 (Mumbai)
pii.detection active
redaction.regex IN_PHONE, AADHAAR, PAN
records.processed 14,200
fields.redacted 842 fields
raw_html.retention 0 seconds
compliance.status DPDP Aligned
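The three pattern labels in the telemetry (IN_PHONE, AADHAAR, PAN) could be backed by free-text rules along these lines. This is an illustrative sketch, not DataFlirt's rule set: the patterns check format only (no Aadhaar checksum validation, for instance) and will miss unusual spellings.

```python
import re

# Simplified, illustrative patterns; format checks only, not exhaustive validators.
PII_PATTERNS = {
    # +91 prefix, optional separator, 10-digit mobile number starting 6-9
    "IN_PHONE": re.compile(r"(?<!\d)\+91[\s-]?[6-9]\d{9}(?!\d)"),
    # 12 digits, optionally grouped 4-4-4, first digit 2-9
    "AADHAAR": re.compile(r"(?<!\d)[2-9]\d{3}[\s-]?\d{4}[\s-]?\d{4}(?!\d)"),
    # 5 letters, 4 digits, 1 letter
    "PAN": re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),
}


def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace each match with its label and count redactions per pattern."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts


sample = "Call +91-9876543210, Aadhaar 2345 6789 0123, PAN ABCDE1234F, GSTIN 22AAAAA0000A1Z5"
redacted, counts = redact(sample)
print(redacted)
print(counts)
```

The word boundaries on the PAN pattern keep it from firing on the PAN-shaped substring embedded in a GSTIN, so the business identifier in the sample passes through untouched. The per-label counts are the kind of figure that could feed a `fields.redacted` counter.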

// 07 — FAQ

Common
questions.

Common questions regarding web scraping legality, data minimization, and operational compliance under India's DPDP Act.

Is web scraping illegal under the DPDP Act?
No. Web scraping itself is a technical process and is not inherently illegal. However, scraping personal data of Indian residents requires explicit consent unless it falls under a specific exemption, such as data made publicly available by the user themselves.
What counts as 'publicly available data' under the Act?
The DPDP Act exempts personal data that is made publicly available by the Data Principal (the user) themselves, or by another person who is legally obliged to publish it. If a user publishes their own phone number on a public forum, it is generally exempt. If a platform publishes a user's data without their explicit intent to make it public, scraping it carries high regulatory risk.
How does the DPDP Act differ from the GDPR for scraping?
The most critical difference is the absence of a "legitimate interest" basis in the DPDP Act. Under GDPR, companies often scrape public data claiming legitimate business interest. Under DPDP, you must rely entirely on the "publicly available data" exemption or obtain explicit consent. DPDP is also significantly stricter regarding the processing of children's data.
How does DataFlirt ensure DPDP compliance for its clients?
We enforce strict data minimization. Our extraction pipelines are configured to drop PII at the edge using NER models before data is serialized. We do not retain raw HTML dumps, and we process India-targeted workloads within the ap-south-1 region to simplify cross-border compliance and data residency requirements.
Do we need to process scraped data locally in India?
The DPDP Act allows cross-border transfers of personal data to any country, except those explicitly restricted by the Central Government. However, processing data locally reduces regulatory friction and simplifies compliance audits, which is why DataFlirt defaults to Mumbai-based infrastructure for Indian targets.
What happens if a user requests data erasure?
Even if data was scraped under the public exemption, users may still exercise their rights. DataFlirt supports dynamic blocklists at the extraction layer. If a user opts out, their identifiers are added to the pipeline's redaction filter, ensuring they are automatically dropped from all future crawl iterations.
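A dynamic blocklist of this kind might look roughly like the sketch below. Storing keyed hashes instead of raw identifiers is an assumption of this example, chosen so the blocklist itself does not become a store of personal data. `OptOutBlocklist`, `filter_records`, and the `email`/`phone` field names are hypothetical.

```python
import hashlib
import hmac


class OptOutBlocklist:
    """Holds HMAC-SHA256 digests of opted-out identifiers, never the identifiers themselves."""

    def __init__(self, secret: bytes):
        self._secret = secret
        self._digests = set()

    @staticmethod
    def _normalise(identifier: str) -> str:
        # Lower-case and strip all whitespace so trivial variants hash identically.
        return "".join(identifier.lower().split())

    def _digest(self, identifier: str) -> str:
        message = self._normalise(identifier).encode("utf-8")
        return hmac.new(self._secret, message, hashlib.sha256).hexdigest()

    def add(self, identifier: str) -> None:
        self._digests.add(self._digest(identifier))

    def blocks(self, identifier: str) -> bool:
        return self._digest(identifier) in self._digests


def filter_records(records, blocklist, id_fields=("email", "phone")):
    """Drop any record carrying an identifier that has been opted out."""
    kept = []
    for rec in records:
        identifiers = [rec[f] for f in id_fields if rec.get(f)]
        if any(blocklist.blocks(i) for i in identifiers):
            continue
        kept.append(rec)
    return kept


blocklist = OptOutBlocklist(secret=b"example-key-rotate-in-production")
blocklist.add("Opt.Out@Example.com ")

crawl = [
    {"email": "opt.out@example.com", "city": "Pune"},
    {"email": "keep@example.com", "city": "Delhi"},
    {"city": "Mumbai"},
]
print(filter_records(crawl, blocklist))
```

Running the filter on every crawl iteration, rather than deleting once from a stored dataset, is what keeps an opted-out user from reappearing when the source page is scraped again.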
hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h