
What is CCPA Compliance?

CCPA Compliance in the context of web scraping means ensuring your data extraction pipelines do not unlawfully harvest, store, or sell the personal information of California residents without providing notice and opt-out mechanisms. While public data is often assumed to be exempt, the line between public business data and personal information is easily blurred when scraping directories, social profiles, or review sites. Failing to filter PII at the edge turns a simple data pipeline into a massive regulatory liability.

Compliance · PII Filtering · Data Governance · California Privacy · Edge Processing
// 02 — definitions

Scraping within
the law.

The California Consumer Privacy Act fundamentally changes how data pipelines must handle personal information, even when that information is sitting in plain text on a public website.


TL;DR

CCPA grants California consumers rights over their personal data, including the right to know, delete, and opt out of its sale. For scraping operations, compliance requires strict data minimization, edge-level PII redaction, and robust audit trails. If you scrape a directory and accidentally ingest the personal emails of California residents, that data falls within the regulation's scope whenever your business meets CCPA's applicability thresholds.

01 Definition & structure
CCPA Compliance refers to adhering to the California Consumer Privacy Act when operating data extraction pipelines. The law grants California residents the right to know what personal data is collected, the right to delete it, and the right to opt out of its sale. For scraping, this means you must have mechanisms to identify personal data during extraction, honor suppression lists, and ensure you are not unlawfully selling scraped identities.
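02 The "Publicly Available" exemption
A common misconception is that if data is public on the internet, it is exempt from CCPA. That is not a safe assumption. The "publicly available" exemption originally covered only data lawfully made available from federal, state, or local government records. The CPRA amendments extended it to information a business has a reasonable basis to believe the consumer lawfully made available to the general public, but that test is fact-specific and does not cover data the consumer restricted to a specific audience. Scraping a public social media profile or a company team page can therefore still constitute the collection of personal information under the statute.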
03 Edge filtering vs. database scrubbing
Compliance is easiest when personal data never enters your systems. Edge filtering involves inspecting and redacting records at the extraction worker level, before the data is serialized and sent to your data lake. Database scrubbing—storing everything and running a cleanup job later—creates a window of liability where you are actively storing unconsented personal data, complicating audit trails and deletion requests.
04 How DataFlirt handles it
We build data minimization directly into the extraction schema. Our workers run lightweight NER (Named Entity Recognition) and regex classifiers in memory. If a pipeline is scoped for corporate data but encounters a personal email or residential address, the field is redacted to [REDACTED_CCPA] before the record is yielded. We do not store raw HTML dumps for our clients, eliminating the risk of latent PII sitting in cold storage.
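A minimal sketch of what field-level redaction of this kind could look like, covering only the regex half of the approach (no NER). The domain list, function names, and the "consumer mail domain" heuristic are illustrative assumptions, not DataFlirt's actual classifier.

```python
import re

REDACTED = "[REDACTED_CCPA]"

# Hypothetical list of consumer mail domains; a real deployment would maintain its own.
PERSONAL_EMAIL_DOMAINS = {"gmail.com", "yahoo.com", "outlook.com", "hotmail.com", "personal.com"}

# Matches a whole-string email address and captures the domain part.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@([A-Za-z0-9.-]+\.[A-Za-z]{2,})$")


def looks_like_personal_email(value: str) -> bool:
    """Return True when the value is an email address on a consumer mail domain."""
    match = EMAIL_RE.match(value.strip())
    return bool(match) and match.group(1).lower() in PERSONAL_EMAIL_DOMAINS


def redact_record(record: dict) -> tuple[dict, list[str]]:
    """Return a redacted copy of the record plus the names of the redacted fields."""
    clean, redacted_fields = {}, []
    for field, value in record.items():
        if isinstance(value, str) and looks_like_personal_email(value):
            clean[field] = REDACTED
            redacted_fields.append(field)
        else:
            clean[field] = value
    return clean, redacted_fields
```

The function returns a new dict instead of mutating its input, so the unredacted value can be released as soon as the caller drops its reference to the original record.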
05 Did you know?
Under CCPA, personal information is not just names and emails. It includes IP addresses, browsing history, and unique device identifiers. If your scraping infrastructure logs the IP addresses of the target servers or proxies, and those can be linked to a household (e.g., via residential proxies), those logs themselves may be subject to CCPA data governance rules.
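One way to reduce the exposure in infrastructure logs is to truncate IP addresses before a log line is written. The sketch below uses Python's standard `ipaddress` module; the /24 and /48 prefix lengths and the function names are assumptions chosen for illustration, and truncation lowers linkability without guaranteeing that a log falls outside CCPA.

```python
import ipaddress
import re

# Loose IPv4 candidate pattern; validity is checked by ipaddress below.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def mask_ip(raw: str) -> str:
    """Zero the host portion of an address (IPv4 -> /24, IPv6 -> /48)."""
    addr = ipaddress.ip_address(raw)
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network.network_address)


def scrub_log_line(line: str) -> str:
    """Replace every valid IPv4 address in a log line with its masked form."""
    def _mask(match: re.Match) -> str:
        try:
            return mask_ip(match.group(0))
        except ValueError:
            # Not a valid address (e.g. 999.1.1.1); leave the text untouched.
            return match.group(0)
    return IPV4_RE.sub(_mask, line)
```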
// 03 — the risk model

Quantifying
compliance risk.

Regulatory risk scales linearly with the volume of unredacted personal data you retain. DataFlirt models compliance risk at the pipeline level to enforce strict data minimization before records hit your warehouse.

PII Exposure Risk: R = records_scraped × pii_density × retention_days
Risk drops to zero if retention_days is zero (edge filtering). Source: DataFlirt compliance framework
CCPA Statutory Damages (upper bound): D ≤ affected_consumers × $750
Statutory damages for data breaches under CCPA range from $100 to $750 per consumer per incident, or actual damages if greater. Source: Cal. Civ. Code § 1798.150
DataFlirt Redaction Rate: E = redacted_fields / total_extracted_fields
Tracked per job to monitor unexpected PII spikes in target schemas. Source: Internal SLO
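The three formulas can be sketched as plain functions. The function names are illustrative assumptions; the $750 figure is the statutory per-consumer maximum, so the damages function returns a ceiling, not a prediction.

```python
def pii_exposure_risk(records_scraped: int, pii_density: float, retention_days: float) -> float:
    """R = records_scraped x pii_density x retention_days (record-days of retained PII)."""
    return records_scraped * pii_density * retention_days


def statutory_damages_ceiling(affected_consumers: int, per_consumer: float = 750.0) -> float:
    """Upper bound on per-incident statutory damages at the per-consumer maximum."""
    return affected_consumers * per_consumer


def redaction_rate(redacted_fields: int, total_extracted_fields: int) -> float:
    """E = redacted_fields / total_extracted_fields; defined as 0.0 for an empty job."""
    if total_extracted_fields == 0:
        return 0.0
    return redacted_fields / total_extracted_fields
```

With hypothetical inputs, a job of 100,000 records at a PII density of 0.01 retained for 30 days scores 30,000 record-days, and the same job with zero retention scores 0. A breach affecting 1,000 consumers has a ceiling of $750,000.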
// 04 — edge redaction trace

Filtering PII before
it hits the disk.

A live trace of a DataFlirt extraction worker processing a B2B directory. The pipeline is configured for strict CCPA compliance, automatically redacting personal emails while retaining business data.

Edge Filtering · Regex + NER · Audit Logged
edge.dataflirt.io — live
CAPTURED
// pipeline init: b2b_directory_v3
target: "california_business_registry"
ccpa_filter: enabled (strict mode)

// record extraction: row 402
field.business_name: "Acme Logistics LLC"
field.public_phone: "555-0199"
field.owner_email: "j.doe@personal.com" // FLAG: PII detected

// redaction engine
action: redact_field
reason: "matches personal email heuristic"
field.owner_email: "[REDACTED_CCPA]"

// compliance audit log
record_id: "rec_88392a"
pii_status: clean
delivery_sink: "s3://df-client-prod/clean_data/"
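The audit step at the end of the trace could be sketched as a JSON line that names the redacted fields without ever carrying their original values. The field names mirror the trace above; the function itself, the `pass_through` action, and the timestamp field are assumptions for illustration.

```python
import json
from datetime import datetime, timezone


def audit_entry(record_id, redacted_fields, reason, sink, now=None):
    """Build one JSON audit line; it records which fields were redacted, never their values."""
    now = now or datetime.now(timezone.utc)
    entry = {
        "record_id": record_id,
        "timestamp": now.isoformat(),
        "action": "redact_field" if redacted_fields else "pass_through",
        "redacted_fields": sorted(redacted_fields),
        "reason": reason if redacted_fields else None,
        # Assumes the entry is written only after redaction has completed.
        "pii_status": "clean",
        "delivery_sink": sink,
    }
    return json.dumps(entry, sort_keys=True)
```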
// 05 — compliance failures

Where pipelines
violate CCPA.

The most common ways scraping operations inadvertently trigger CCPA liability. Retaining raw HTML dumps is the leading cause of accidental PII ingestion.

PIPELINES AUDITED: 150+ enterprise
PRIMARY RISK: Raw HTML storage
UPDATED: 2026-05-19
01 Retaining raw HTML dumps
High risk · Stores unparsed PII indefinitely

02 Scraping mixed-use directories
High risk · B2B mixed with sole proprietorships

03 Fingerprinting without notice
Medium risk · IPs and device IDs are personal info

04 Ignoring opt-out requests
High risk · Failure to maintain a suppression list

05 Missing vendor agreements
Legal risk · No DPA with scraping infrastructure provider
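For the opt-out failure mode, a suppression list can be kept as keyed hashes so the list itself does not become another store of raw identifiers. This is a sketch under stated assumptions: the class, the `email` key, and the use of HMAC-SHA256 are illustrative choices, and a keyed hash is pseudonymous data, not anonymous data.

```python
import hashlib
import hmac


class SuppressionList:
    """Stores HMAC-SHA256 fingerprints of opted-out emails instead of the emails themselves."""

    def __init__(self, secret_key: bytes):
        self._key = secret_key
        self._fingerprints = set()

    def _fingerprint(self, email: str) -> str:
        normalized = email.strip().lower()
        return hmac.new(self._key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()

    def add(self, email: str) -> None:
        self._fingerprints.add(self._fingerprint(email))

    def is_suppressed(self, email: str) -> bool:
        return self._fingerprint(email) in self._fingerprints


def filter_suppressed(records, suppression, key="email"):
    """Drop every record whose email appears on the suppression list."""
    return [r for r in records if not (r.get(key) and suppression.is_suppressed(r[key]))]
```

Running `filter_suppressed` on each batch before delivery means an opt-out takes effect on the next run without a separate cleanup job.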
// 06 — our architecture

Filter at the edge,

never in the warehouse.

If personal data hits your S3 bucket, you are already processing it. DataFlirt's compliance architecture pushes PII detection to the extraction worker. We use deterministic patterns and lightweight NER models to identify emails, residential addresses, and phone numbers in flight. If a pipeline isn't scoped for personal data, that data is dropped from memory before the record is serialized. You cannot leak what you never stored.
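The "drop before serialize" idea can be illustrated with an allow-list applied to each record ahead of serialization. `PipelineScope`, `apply_scope`, and `serialize` are hypothetical names for this sketch, not DataFlirt's API.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineScope:
    """Allow-list of fields a pipeline is scoped to deliver."""
    allowed_fields: frozenset


def apply_scope(record: dict, scope: PipelineScope) -> tuple[dict, list[str]]:
    """Keep only in-scope fields; return the kept record and the names of dropped fields."""
    kept = {k: v for k, v in record.items() if k in scope.allowed_fields}
    dropped = sorted(k for k in record if k not in scope.allowed_fields)
    return kept, dropped


def serialize(record: dict) -> str:
    """Serialize an already-scoped record; out-of-scope values never reach this step."""
    return json.dumps(record, sort_keys=True)
```

An allow-list fails closed: a field the target site adds later is dropped by default until someone explicitly scopes it in.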

Compliance enforcement job

Live status of a CCPA-compliant extraction run on a public directory.

pipeline.id: b2b-leads-ca
ccpa_mode: active-redact
records.processed: 142,050
pii.detected: 1,204 records
pii.redacted: 1,204 records
raw_html.retention: 0 days
audit_log: written to cold storage


// 07 — FAQ

Common
questions.

Common questions about CCPA applicability, public data exemptions, and how DataFlirt keeps enterprise pipelines compliant.

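Does CCPA apply to publicly available web data?
It depends. CCPA exempts "publicly available" information, which covers data lawfully made available from federal, state, or local government records and, since the CPRA amendments, information a business has a reasonable basis to believe the consumer lawfully made available to the general public. Data that appears on a public website is not automatically exempt if you scrape and store it; whether it qualifies depends on how it became public and on whether the consumer restricted its audience.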
Are B2B contacts exempt from CCPA?
No. The temporary B2B exemption in the CCPA expired on January 1, 2023. California residents acting in a professional capacity, including employees, contractors, and business owners, are now fully covered as consumers under the law. Scraping B2B directories requires the same compliance rigor as B2C data.
How does DataFlirt prevent accidental PII ingestion?
We enforce schema contracts at the extraction layer. If a field is defined as a company name but our regex/NER models detect a personal email address, the worker redacts the field and flags the record. The raw HTML is discarded immediately after parsing, ensuring no residual PII is stored.
Do I need a 'Do Not Sell' link if I just scrape data?
If you sell the scraped personal data, or share it for cross-context behavioral advertising, yes. If you act as a service provider processing data strictly for internal analytics, you may not need the link, but you still must provide notice at collection. Consult your legal counsel for your specific use case.
Can we keep raw HTML for debugging broken selectors?
Retaining raw HTML is a massive compliance risk because it often contains unparsed personal data. DataFlirt handles this by keeping raw HTML in an ephemeral, encrypted cache with a strict 24-hour TTL. It is used solely for automated selector repair and is never delivered to the client.
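A time-boxed cache of this kind can be sketched in a few lines. This in-memory version illustrates only the TTL behavior; encryption is omitted, and the class name, method names, and injectable clock are assumptions for illustration.

```python
import time


class EphemeralCache:
    """In-memory cache whose entries become unreadable once their TTL elapses."""

    def __init__(self, ttl_seconds: float = 24 * 60 * 60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = {}  # key -> (expires_at, value)

    def put(self, key, value) -> None:
        self._items[key] = (self._clock() + self._ttl, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries are deleted on access and never returned.
            del self._items[key]
            return None
        return value

    def purge_expired(self) -> int:
        """Delete all expired entries and return how many were removed."""
        now = self._clock()
        expired = [k for k, (expires_at, _) in self._items.items() if now >= expires_at]
        for k in expired:
            del self._items[k]
        return len(expired)
```

Lazy expiry on `get` alone would leave unread pages in memory indefinitely, so `purge_expired` is meant to run on a schedule as well.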
Are IP addresses considered personal information under CCPA?
Yes. CCPA explicitly lists IP addresses and unique device identifiers as personal information if they can be linked to a particular consumer or household. This means the logs generated by your scraping infrastructure must also be managed under your CCPA compliance program.