
What is CCPA and Data Collection?

CCPA and Data Collection covers how the California Consumer Privacy Act governs the way businesses scrape, store, and sell the personal information of California residents. Unlike GDPR, which requires a lawful basis such as opt-in consent before processing, CCPA operates on an opt-out model. For scraping pipelines, this means you can often collect public data legally, but you must maintain strict provenance tracking, honor deletion requests, and ensure your downstream data delivery doesn't violate sale restrictions once a consumer exercises their rights.

Compliance · PII Filtering · Opt-Out · Data Provenance · Legal
// 02 — definitions

Scraping under
California law.

The operational reality of building data pipelines that touch personal information without triggering statutory damages.


TL;DR

CCPA applies to businesses meeting specific revenue or data-volume thresholds. If your scraper pulls names, emails, IP addresses, or behavioral data of Californians, you are collecting PII. The engineering challenge isn't just stopping the scrape — it's building a deletion and opt-out propagation system that reaches all the way to your client's S3 buckets.

01 Definition & scope
The California Consumer Privacy Act (CCPA) regulates how businesses collect, use, and sell the personal information of California residents. In the context of web scraping, if your pipeline extracts names, emails, phone numbers, or even persistent identifiers like IP addresses associated with Californians, and your business meets the applicability thresholds, you are subject to the law. The core requirement is transparency and control: consumers must be able to know what you scraped, request its deletion, and opt out of its sale.
02 The Opt-Out mechanism
Because CCPA does not require prior consent, scrapers can generally ingest public PII without asking first. However, businesses that sell it must provide a clear "Do Not Sell My Personal Information" link (titled "Do Not Sell or Share My Personal Information" since the CPRA amendments). When a consumer exercises this right, the business must stop selling that consumer's data as soon as feasibly possible; the implementing regulations cap the delay at 15 business days. For data brokers and scraping agencies, this requires a robust identity resolution system to match an incoming opt-out request against millions of unstructured scraped records.
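A minimal sketch of the identity resolution step, assuming each scraped record is a dict with a `record_id` and optional `email` and `phone` fields (hypothetical field names, not DataFlirt's actual schema). It normalizes identifiers, indexes the corpus once, and then matches an opt-out request on any shared identifier. Production systems usually add fuzzy name and address matching, which this sketch leaves out.

```python
import re
from collections import defaultdict


def normalize_email(value: str) -> str:
    """Lowercase and trim; deliberately conservative (no provider-specific rules)."""
    return value.strip().lower()


def normalize_phone(value: str) -> str:
    """Keep digits only and drop a leading US country code."""
    digits = re.sub(r"\D", "", value)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


def build_index(records):
    """Map each normalized identifier to the ids of the records containing it."""
    index = defaultdict(set)
    for rec in records:
        if rec.get("email"):
            index[("email", normalize_email(rec["email"]))].add(rec["record_id"])
        if rec.get("phone"):
            index[("phone", normalize_phone(rec["phone"]))].add(rec["record_id"])
    return index


def match_opt_out(index, request):
    """Return the ids of every record sharing an identifier with the request."""
    matched = set()
    if request.get("email"):
        matched |= index.get(("email", normalize_email(request["email"])), set())
    if request.get("phone"):
        matched |= index.get(("phone", normalize_phone(request["phone"])), set())
    return matched


# Illustrative data only.
records = [
    {"record_id": "r1", "email": "Jane.Doe@Example.com", "phone": "+1 (415) 555-0100"},
    {"record_id": "r2", "email": "other@example.com", "phone": "415.555.0100"},
    {"record_id": "r3", "email": "someone@example.org"},
]
index = build_index(records)
matched = match_opt_out(index, {"email": " jane.doe@example.com ", "phone": "4155550100"})
print(sorted(matched))
```

Here `r1` matches on both identifiers and `r2` on the phone number alone, which is why matching on any shared identifier tends to catch records that a single-key lookup would miss.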
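03 Publicly available data exemption
A common misconception is that "if it's on the public internet, CCPA doesn't apply." That is not a safe assumption. As originally enacted, CCPA's exemption for "publicly available" information covered only data lawfully made available from federal, state, or local government records (e.g., property tax databases, professional licenses). The CPRA amendments, in force since 2023, extended it to information a business has a reasonable basis to believe the consumer lawfully made available to the general public, or that comes from widely distributed media. Whether a given scraped record meets that test is fact-specific, so treat a scraped LinkedIn profile or company team page as protected PII unless you can document that it qualifies.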
04 How DataFlirt handles it
We build compliance directly into the extraction layer. Unless a client specifically contracts for PII and signs a Data Processing Agreement, our pipelines use regex and NER models to automatically mask emails, phone numbers, and names before the data is serialized. For clients who do ingest PII, we attach a unique provenance hash to every record. If an opt-out occurs, we use that hash to purge the record from our caches and trigger a deletion webhook on the client's end.
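The regex half of that masking step might look like the sketch below. The patterns and the `[EMAIL]` / `[PHONE]` placeholders are illustrative assumptions, not DataFlirt's production rules, and the phone pattern only targets common US formats. Names are left untouched here because detecting them reliably needs an NER model rather than a regular expression.

```python
import re

# Illustrative patterns; real pipelines tune these per source and locale.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def mask_pii(text: str) -> str:
    """Replace emails and US-style phone numbers with fixed placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


sample = "Contact Jane at jane.doe@example.com or (415) 555-0100."
masked = mask_pii(sample)
print(masked)  # Contact Jane at [EMAIL] or [PHONE].
```

Running the masking before serialization, as described above, means the raw values never reach disk in the first place.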
05 Statutory damages and breach risk
The most dangerous aspect of CCPA for scraping operations is the private right of action for data breaches. If unencrypted scraped PII is exposed due to a failure to maintain reasonable security procedures, consumers can sue for statutory damages of $100 to $750 per consumer per incident, or actual damages if greater. A misconfigured S3 bucket containing 100,000 scraped California profiles therefore carries a potential statutory liability of $10 million to $75 million, even without proof of actual harm.
// 03 — compliance math

Quantifying
CCPA exposure.

CCPA risk scales linearly with the volume of unmanaged PII. DataFlirt uses these models to determine when a pipeline requires mandatory PII quarantine layers.

PII Density: D = records_with_pii / total_records
High density requires strict CCPA controls and provenance tracking. (Source: DataFlirt compliance model)
Statutory Risk (upper bound): R = california_records × $750
Maximum per-incident statutory damages under CCPA for data breaches; the statutory floor is $100 per consumer. (Source: California Civil Code § 1798.150)
Deletion Propagation Latency: T = t_receive + t_purge_db + t_client_sync
Must stay under 45 days to meet the response deadline for consumer requests. (Source: CCPA SLA requirements)
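The three quantities above translate directly into code. The sketch below is a plain restatement of the formulas; the input numbers are illustrative (the 100,000-record figure echoes the S3 bucket example earlier, the rest are assumptions), and the function names are hypothetical.

```python
from datetime import timedelta


def pii_density(records_with_pii: int, total_records: int) -> float:
    """D = records_with_pii / total_records (0.0 for an empty dataset)."""
    return records_with_pii / total_records if total_records else 0.0


def statutory_risk_range(california_records: int, low: int = 100, high: int = 750):
    """Return the (floor, ceiling) of per-incident statutory damages in dollars."""
    return california_records * low, california_records * high


def propagation_latency(t_receive, t_purge_db, t_client_sync):
    """T = t_receive + t_purge_db + t_client_sync."""
    return t_receive + t_purge_db + t_client_sync


density = pii_density(25_000, 100_000)
risk_low, risk_high = statutory_risk_range(100_000)
latency = propagation_latency(
    timedelta(days=1), timedelta(days=2), timedelta(days=5)
)
within_sla = latency < timedelta(days=45)

print(density, risk_low, risk_high, latency.days, within_sla)
```

Tracking the floor alongside the ceiling gives a range rather than a single worst-case number, which is usually more useful when prioritizing which pipelines to quarantine first.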
// 04 — pipeline execution

Filtering PII
at the edge.

A live trace of an extraction worker processing a directory scrape, identifying California residents, and applying CCPA quarantine rules before delivery.

NER classification · PII quarantine · CCPA opt-out
edge.dataflirt.io — live
CAPTURED
// record ingestion
source: "public_directory_CA"
record_id: "usr_88392a"

// PII classification
field.name: extracted
field.email: extracted
field.location: "San Francisco, CA" // CCPA jurisdiction match

// compliance check
opt_out_registry.check: true // consumer is on the opt-out registry
pii_quarantine.status: active

// transform & mask
action: masking applied to field.name, field.email

// delivery routing
route: "s3://df-client-042/ccpa-compliant/"
status: 200 OK // record written with CCPA metadata tag
// 05 — compliance failures

Where pipelines
violate CCPA.

Ranked by frequency of occurrence in enterprise scraping operations. The most common failures aren't malicious — they are architectural oversights in data lifecycle management.

PIPELINES AUDITED · 150+ enterprise
PRIMARY RISK · Orphaned PII
UPDATED · 2026-05-19
01 Ignoring Opt-Outs · high risk · Failure to propagate 'Do Not Sell' requests downstream
02 Incomplete Deletion · high risk · Leaving PII in logs, caches, or dead-letter queues
03 Misclassifying PII · med risk · Failing to recognize IP addresses or device IDs as personal info
04 Broken Provenance · med risk · Inability to prove where and when a specific record was scraped
05 Unsecured Storage · critical · Breach of unencrypted scraped PII triggering statutory damages
// 06 — our architecture

Scrape everything,

quarantine the toxic assets.

DataFlirt treats PII as radioactive material. Our extraction layer uses Named Entity Recognition (NER) to flag personal data in real time. If a pipeline isn't explicitly contracted to deliver PII, those fields are masked before they ever hit disk. For pipelines that do require PII, we attach a cryptographic provenance tag to every record, linking it to the exact scrape event. When a CCPA deletion request arrives, our orchestration layer purges the record from our lakes and automatically fires a webhook to the client's infrastructure to ensure downstream compliance.
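One way a provenance tag of this kind could be derived is sketched below, assuming the tag is a SHA-256 digest over the scrape event metadata plus a hash of the record content. The field names (`source_url`, `scraped_at`, `job_id`) and the `provenance_tag` helper are hypothetical; the text above does not specify how DataFlirt actually builds the tag.

```python
import hashlib
import json


def _canonical(obj: dict) -> bytes:
    """Serialize deterministically so key order never changes the digest."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def provenance_tag(record: dict, source_url: str, scraped_at: str, job_id: str) -> str:
    """Bind a record's content to the scrape event that produced it."""
    envelope = {
        "source_url": source_url,
        "scraped_at": scraped_at,
        "job_id": job_id,
        "content_sha256": hashlib.sha256(_canonical(record)).hexdigest(),
    }
    return hashlib.sha256(_canonical(envelope)).hexdigest()


# Illustrative values only.
tag = provenance_tag(
    {"name": "Jane Doe", "email": "jane.doe@example.com"},
    source_url="https://example.com/team",
    scraped_at="2026-05-19T00:00:00Z",
    job_id="job-1",
)
print(tag)
```

Because the tag depends on both the content and the scrape event, two scrapes of the same profile on different days produce different tags, which is what lets a deletion job target specific copies.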

CCPA Deletion Job

Live execution of a consumer deletion request across DataFlirt infrastructure.

job.id: purge-ccpa-req-992
target.entity: hash(email)
records.found: 14 records
storage.raw_zone: purged
storage.client_s3: webhook_fired
logs.scrubbed: verified
sla.remaining: 42 days
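A deletion job like the one traced above can be sketched as a loop over every store that might hold the consumer's records, followed by a check against the 45-day response window. The stores here are in-memory lists standing in for real storage zones, and `run_deletion_job`, `entity_hash`, and the store names are hypothetical.

```python
from datetime import date, timedelta

RESPONSE_WINDOW = timedelta(days=45)


def run_deletion_job(entity_hash, stores, received_on, today):
    """Purge matching records from every store and report the remaining SLA."""
    removed = {}
    for name, store in stores.items():
        before = len(store)
        # Rewrite the list in place so callers holding a reference see the purge.
        store[:] = [r for r in store if r["entity_hash"] != entity_hash]
        removed[name] = before - len(store)
    deadline = received_on + RESPONSE_WINDOW
    return {
        "removed": removed,
        "records_found": sum(removed.values()),
        "sla_remaining_days": (deadline - today).days,
    }


# Illustrative data only.
stores = {
    "raw_zone": [{"entity_hash": "abc"}, {"entity_hash": "abc"}, {"entity_hash": "xyz"}],
    "cache": [{"entity_hash": "abc"}],
}
report = run_deletion_job(
    "abc", stores, received_on=date(2026, 5, 16), today=date(2026, 5, 19)
)
print(report)
```

Reporting per-store counts makes it easier to spot the "incomplete deletion" failure mode, where a log or cache is silently skipped.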


// 07 — FAQ

Common
questions.

About CCPA applicability, public data exemptions, opt-out propagation, and how DataFlirt manages compliance at scale.

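Does CCPA apply to publicly available data?
Often, yes. CCPA exempts "publicly available" information, which originally meant only data lawfully made available from federal, state, or local government records. The CPRA amendments extended the definition to information a business has a reasonable basis to believe the consumer lawfully made available to the general public, or that comes from widely distributed media. Whether a scraped social media profile or directory listing meets that test is fact-specific, so general web scraping should NOT be assumed exempt.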
Is scraping an IP address considered collecting PII under CCPA?
Yes. CCPA defines personal information broadly to include any data that identifies, relates to, or could reasonably be linked with a particular consumer or household. This explicitly includes IP addresses, device identifiers, and browsing history.
Do I need consent before scraping California residents?
No. Unlike the GDPR's opt-in model, CCPA is an opt-out framework. You do not need prior consent to scrape personal information, but you must provide a mechanism for consumers to opt out of the sale of their data and request deletion.
How does DataFlirt handle 'Do Not Sell' requests for scraped data?
We maintain a centralized cryptographic suppression list. Any entity on this list is automatically dropped or masked at the extraction layer across all 300+ active pipelines. The data never reaches the delivery sink.
What happens if a client buys scraped data and the consumer later opts out?
Our data contracts require clients to expose a deletion webhook endpoint. When we receive a verified CCPA deletion request, we purge our systems and push the event to the client. The client is legally obligated to purge the record from their downstream warehouses.
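A deletion webhook is only useful if the client can trust that the event really came from the provider. The sketch below shows one common approach, an HMAC-SHA256 signature over the request body with a shared secret. The event fields and function names are hypothetical, and the text above does not say which signing scheme DataFlirt uses.

```python
import hashlib
import hmac
import json


def sign_deletion_event(secret: bytes, event: dict):
    """Provider side: serialize the event and sign the exact bytes sent."""
    body = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body, signature


def verify_deletion_event(secret: bytes, body: bytes, signature: str) -> bool:
    """Client side: recompute the signature and compare in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


# Illustrative values only.
secret = b"shared-secret-from-contract"
event = {"type": "ccpa.deletion", "provenance_tag": "0f3a", "request_id": "req-1"}
body, signature = sign_deletion_event(secret, event)

is_valid = verify_deletion_event(secret, body, signature)
is_tampered_valid = verify_deletion_event(secret, body + b" ", signature)
print(is_valid, is_tampered_valid)  # True False
```

The client would run the verification before purging anything, so a forged request cannot be used to delete arbitrary records from its warehouse.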
Can we just block California IPs to avoid CCPA compliance?
No. A California resident traveling in Nevada is still protected by CCPA. You must filter based on the data subject's actual residency (inferred from the scraped data itself), not just the IP address of the site you are scraping or the proxy you are using.
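A rough sketch of residency inference from the scraped fields themselves, assuming records carry a free-text `location` and an optional `postal_code` (hypothetical field names). The ZIP range used is an approximation of California's allocation and the location check is a simple heuristic, so treat a positive result as a reason to apply CCPA controls rather than as proof of residency. Note that no proxy or site IP appears anywhere in the decision.

```python
# Approximate California ZIP range; an assumption for illustration.
CA_ZIP_MIN, CA_ZIP_MAX = 90001, 96162


def likely_california_resident(record: dict) -> bool:
    """Infer residency from the data subject's own fields, never from an IP."""
    location = (record.get("location") or "").strip()
    if location:
        # "San Francisco, CA" -> "CA"; "California" -> "CALIFORNIA"
        last_token = location.split(",")[-1].strip().upper()
        if last_token in {"CA", "CALIFORNIA"}:
            return True
    postal = (record.get("postal_code") or "").strip()[:5]
    if len(postal) == 5 and postal.isdigit():
        return CA_ZIP_MIN <= int(postal) <= CA_ZIP_MAX
    return False


# Illustrative records only.
sf = likely_california_resident({"location": "San Francisco, CA"})
reno = likely_california_resident({"location": "Reno, NV", "proxy_ip": "203.0.113.7"})
by_zip = likely_california_resident({"postal_code": "94105-1234"})
print(sf, reno, by_zip)  # True False True
```

When the signals are missing or ambiguous, the conservative choice is to treat the record as in scope.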
hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h