SYSTEM: all green · source: epa.gov · queue: 18,402 facilities · p99 latency: 314ms · dataflirt.com · scraper/epa-gov
RUN | 31 active pipelines | epa.gov live

Environmental data,
at warehouse scale.

We extract facility emissions, enforcement records, chemical registries, and air quality metrics from EPA.gov databases. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.

Facilities tracked: 843K
Emissions records: 1.2M /month
Compliance updates: 45K /week
Active pipelines: 31
Uptime: 99.94%

Data Dictionary

Every field we extract from epa.gov

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for ECHO Compliance objects from epa.gov. All fields typed and schema-versioned.

facility_id, registry_id, facility_name, address, city, state, zip_code, inspection_count, violation_status, quarters_with_violation, penalty_amount, last_inspection_date, enforcement_actions
echo_compliance
● 200 OK
{
  "facility_id": "110000345678",
  "registry_id": "110000345678",
  "facility_name": "INDUSTRIAL MANUFACTURING CORP",
  "violation_status": "Significant Violation",
  "penalty_amount": 25000.0,
  "last_inspection_date": "2023-11-14",
  "state": "OH"
}

Complete list of extractable fields for TRI Emissions objects from epa.gov. All fields typed and schema-versioned.

tri_facility_id, chemical_name, cas_number, release_year, air_emissions_lbs, water_discharges_lbs, land_releases_lbs, total_releases, parent_company, primary_naics, latitude, longitude
tri_emissions
● 200 OK
{
  "tri_facility_id": "44114NDSTR12345",
  "chemical_name": "TOLUENE",
  "cas_number": "108-88-3",
  "release_year": 2022,
  "air_emissions_lbs": 1450.5,
  "total_releases": 1600.5,
  "primary_naics": "325199"
}

Complete list of extractable fields for Superfund Sites objects from epa.gov. All fields typed and schema-versioned.

site_id, epa_id, site_name, npl_status, listing_date, human_health_exposure, groundwater_migration, operable_units, cleanup_status, project_manager, region
superfund_sites
● 200 OK
{
  "site_id": "0501234",
  "epa_id": "OHD980509657",
  "site_name": "RIVER VALLEY LANDFILL",
  "npl_status": "Currently on the Final NPL",
  "listing_date": "1983-09-08",
  "cleanup_status": "Construction Complete",
  "region": "05"
}

Complete list of extractable fields for Air Quality (AQI) objects from epa.gov. All fields typed and schema-versioned.

monitor_id, cbsa_name, parameter_name, aqi_value, aqi_category, observation_time, latitude, longitude, pollutant_standard, state_code, county_code, site_number
air_quality (aqi)
● 200 OK
{
  "monitor_id": "39-035-0038",
  "cbsa_name": "Cleveland-Elyria, OH",
  "parameter_name": "Ozone",
  "aqi_value": 45,
  "aqi_category": "Good",
  "observation_time": "2023-08-12T14:00:00Z",
  "pollutant_standard": "Ozone 8-hour 2015"
}

Complete list of extractable fields for TSCA Registry objects from epa.gov. All fields typed and schema-versioned.

substance_name, cas_rn, epa_registry_id, active_status, regulatory_flags, hazard_classification, generic_name, import_export_volume, manufacturer_count, approval_date
tsca_registry
● 200 OK
{
  "substance_name": "Benzene",
  "cas_rn": "71-43-2",
  "epa_registry_id": "110000000000",
  "active_status": "Active",
  "regulatory_flags": ["SNUR", "TRI"],
  "hazard_classification": "Carcinogen",
  "manufacturer_count": 42
}

Capabilities

Complete coverage of EPA public databases

Our EPA scraper handles fragmented portals: ECHO, TRI, Envirofacts, and AQS. We normalise legacy schemas, parse nested PDFs, and map facility IDs across disparate federal systems.

ECHO Compliance Extraction

Extract facility inspection histories, violation statuses, enforcement actions, and penalty amounts across all environmental statutes.

TRI Chemical Releases

Track toxic chemical releases to air, water, and land by facility, parent company, and NAICS code.

Superfund & NPL Tracking

Monitor site cleanup progress, human health exposure indicators, and operable unit statuses for CERCLA sites.

Air Quality Sensor Time-Series

Capture hourly AQI readings, pollutant concentrations, and meteorological data from AirNow and AQS monitoring stations.

TSCA Chemical Inventory

Scrape substance registries, active/inactive statuses, and regulatory flags for chemical manufacturing compliance.

PDF & Permit Parsing

Extract structured text and tables from scanned enforcement documents and environmental permits using OCR.

Geospatial Standardisation

Normalise coordinate projections and polygon boundaries for facilities, water bodies, and non-attainment areas.

Cross-Database ID Mapping

Link records across TRI, RCRA, and CAA databases using the EPA Facility Registry Service (FRS) ID.

Greenhouse Gas Reporting

Extract facility-level CO2e emissions data from the GHGRP database for ESG and carbon accounting.

Scheduled + Streaming Modes

Run one-off historical exports or configure continuous pipelines for daily compliance updates.

// engagement pipeline

From facility list to warehouse record

Brief in. Clean data out.

Define Scope
Day 0

Provide facility IDs, state codes, NAICS categories, or chemical names. We design the extraction schema together.

Pipeline Build
Days 2–4

We configure crawlers to handle EPA legacy APIs, rate limits, and PDF parsing infrastructure.

Validation & QA
Days 4–6

Schema validation, FRS ID integrity checks, and null-rate monitoring before full launch.

Delivery
ongoing

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
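
The null-rate monitoring named in the Validation & QA step can be pictured roughly as follows. This is a minimal sketch, not DataFlirt's actual QA code: the function names, the 25% threshold, and every facility ID other than the first are assumptions made for illustration.

```python
from typing import Any, Iterable, Mapping


def null_rates(records: Iterable[Mapping[str, Any]], fields: list[str]) -> dict[str, float]:
    """Return, per field, the fraction of records where it is missing or None."""
    rows = list(records)
    if not rows:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for row in rows if row.get(field) is None) / len(rows)
        for field in fields
    }


def fields_over_threshold(rates: Mapping[str, float], threshold: float) -> list[str]:
    """List fields whose null rate exceeds a (hypothetical) launch threshold."""
    return sorted(field for field, rate in rates.items() if rate > threshold)


# Illustrative records only; a real check would run over a full pilot extract.
sample = [
    {"facility_id": "110000345678", "penalty_amount": 25000.0},
    {"facility_id": "110000345679", "penalty_amount": None},
    {"facility_id": "110000345680"},
    {"facility_id": "110000345681", "penalty_amount": 0.0},
]
rates = null_rates(sample, ["facility_id", "penalty_amount"])
flagged = fields_over_threshold(rates, threshold=0.25)
```

Note that a legitimate zero (`0.0`) is not counted as null; only a missing key or an explicit `None` is.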

Under the hood

Handling legacy government infrastructure

EPA.gov relies on fragmented legacy databases, heavy PDF payloads, and inconsistent API schemas. Here is how we maintain pipeline stability.

pipeline-monitor · epa.gov · live ● active
// fingerprinting
Identity rotation
TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0
// pagination
Page coverage
48,291 pages queued (running)
// observability
Pipeline health
Uptime: 99.9%
p99 latency: 142ms
Null rate: 0.3%
Alerts: 2

Legacy API normalisation
Abstracting Envirofacts inconsistencies

EPA APIs often return mixed types, undocumented nulls, and nested XML structures. We enforce strict data typing and normalise responses into clean, relational JSON before delivery.
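
As a rough illustration of what strict typing over mixed-type responses can look like, the sketch below coerces three of the ECHO fields listed earlier. The placeholder tokens, the accepted date formats, and the helper names are assumptions for the example, not a description of the production pipeline.

```python
from datetime import datetime
from typing import Any, Optional

# Hypothetical placeholder strings treated as "no value".
NULL_TOKENS = {"", "N/A", "NA", "NULL", "NONE", "-"}


def clean(value: Any) -> Optional[str]:
    """Strip whitespace and collapse placeholder strings to None."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.upper() in NULL_TOKENS else text


def to_float(value: Any) -> Optional[float]:
    """Parse amounts such as '$25,000.00' or 25000 into a float."""
    text = clean(value)
    if text is None:
        return None
    return float(text.replace("$", "").replace(",", ""))


def to_iso_date(value: Any) -> Optional[str]:
    """Accept ISO or US-style dates and return an ISO 8601 date string."""
    text = clean(value)
    if text is None:
        return None
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")


def normalise(raw: dict[str, Any]) -> dict[str, Any]:
    """Map one raw response row onto typed output fields."""
    return {
        "facility_id": clean(raw.get("facility_id")),
        "penalty_amount": to_float(raw.get("penalty_amount")),
        "last_inspection_date": to_iso_date(raw.get("last_inspection_date")),
    }


record = normalise(
    {
        "facility_id": " 110000345678 ",
        "penalty_amount": "$25,000.00",
        "last_inspection_date": "11/14/2023",
    }
)
```

Raising on an unrecognised date, rather than passing the string through, is a deliberate choice in this sketch: it surfaces a new upstream format instead of letting it reach the warehouse silently.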

ID resolution
Cross-referencing Facility Registry Service

A single factory might have different IDs in the air, water, and hazardous waste databases. We map all subsystem IDs back to the master FRS registry ID to provide a unified view of a facility.
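
The mapping described above amounts to a crosswalk lookup. The sketch below assumes a crosswalk table of (program, program ID, FRS registry ID) rows has already been obtained. The pairing of the sample TRI ID with the sample registry ID, and the RCRA ID itself, are invented for illustration.

```python
from collections import defaultdict
from typing import Any, Optional

# Hypothetical crosswalk rows: (program, program-specific ID, FRS registry ID).
CROSSWALK = [
    ("TRI", "44114NDSTR12345", "110000345678"),
    ("RCRA", "OHD000000001", "110000345678"),  # made-up RCRA ID
]


def build_index(rows: list[tuple[str, str, str]]) -> dict[tuple[str, str], str]:
    """Index the crosswalk by (program, program ID) for constant-time lookup."""
    return {(program, program_id): registry_id for program, program_id, registry_id in rows}


def unify(records: list[dict[str, Any]],
          index: dict[tuple[str, str], str]) -> dict[Optional[str], list[dict[str, Any]]]:
    """Group program records by FRS registry ID; unmatched records go under None."""
    grouped: dict[Optional[str], list[dict[str, Any]]] = defaultdict(list)
    for rec in records:
        registry_id = index.get((rec["program"], rec["program_id"]))
        grouped[registry_id].append(rec)
    return dict(grouped)


records = [
    {"program": "TRI", "program_id": "44114NDSTR12345", "total_releases": 1600.5},
    {"program": "RCRA", "program_id": "OHD000000001"},
    {"program": "TRI", "program_id": "UNKNOWN"},
]
unified = unify(records, build_index(CROSSWALK))
```

Keeping unmatched records under a `None` key, instead of dropping them, makes gaps in the crosswalk visible for follow-up.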

Document extraction
PDF permit extraction via OCR

Many enforcement actions and permits are only available as scanned PDFs. We route these payloads through our Tesseract OCR cluster to extract violation dates, penalty amounts, and regulatory text.
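
Once OCR has produced plain text, pulling out dates and penalty amounts is a pattern-matching task. The sketch below covers only that post-OCR step, and the Tesseract stage is not shown. The regular expressions and the sample sentence are assumptions, and real documents would need broader patterns.

```python
import re
from datetime import datetime

# Dollar amounts such as "$25,000.00" or "$25000".
PENALTY_RE = re.compile(r"\$\s?(\d[\d,]*(?:\.\d{2})?)")
# Long-form dates such as "November 14, 2023".
DATE_RE = re.compile(r"\b([A-Z][a-z]+ \d{1,2}, \d{4})\b")


def extract_fields(ocr_text: str) -> dict[str, list]:
    """Return penalty amounts (as floats) and dates (as ISO strings) found in text."""
    penalties = [float(m.replace(",", "")) for m in PENALTY_RE.findall(ocr_text)]
    dates = []
    for raw in DATE_RE.findall(ocr_text):
        try:
            dates.append(datetime.strptime(raw, "%B %d, %Y").date().isoformat())
        except ValueError:
            continue  # capitalised word was not a month name
    return {"penalty_amounts": penalties, "dates": dates}


# Made-up sentence standing in for OCR output.
sample_text = (
    "On November 14, 2023 an inspection identified violations. "
    "Respondent shall pay a civil penalty of $25,000.00 within 30 days."
)
fields = extract_fields(sample_text)
```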

Traffic management
Rate limiting and timeout handling

Government servers frequently drop connections under load. Our distributed crawlers respect `robots.txt` crawl delays, implement exponential backoff, and use proxy rotation to avoid IP bans during large historical backfills.
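
Exponential backoff is straightforward to sketch. The version below doubles an upper bound on each retry and sleeps for a random fraction of it (often called full jitter). The retry count, base delay, and cap are illustrative defaults, and `fetch` stands in for whatever request function a crawler uses.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Upper bound for each retry wait: base * 2**attempt, capped at `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def fetch_with_backoff(
    fetch: Callable[[], T],
    retries: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); on a dropped connection, wait a jittered delay and retry."""
    for bound in backoff_delays(retries, base, cap):
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            # Random wait in [0, bound] so parallel workers do not retry in lockstep.
            sleep(random.uniform(0, bound))
    return fetch()  # final attempt; any error propagates to the caller
```

Passing `sleep` as a parameter keeps the function testable without real waiting. With the defaults, the wait bounds are 1, 2, 4, 8, and 16 seconds.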

Schema stability
Schema drift detection

Public endpoints change without notice. We hash API response schemas and trigger alerts on structural drift, allowing our engineers to patch extraction logic before corrupted data reaches your warehouse.
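
One way to hash a response schema, shown here as a sketch under assumptions, is to reduce a payload to its key names and value types, serialise that shape canonically, and hash the result. The function names are hypothetical, and the sample payload reuses two AQI fields from the data dictionary.

```python
import hashlib
import json
from typing import Any


def schema_of(value: Any) -> Any:
    """Reduce a JSON value to its structure: key names and type names only."""
    if isinstance(value, dict):
        return {key: schema_of(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        # Represent a list by the distinct shapes of its elements.
        shapes = {json.dumps(schema_of(item), sort_keys=True) for item in value}
        return ["list", sorted(shapes)]
    return type(value).__name__


def schema_hash(payload: Any) -> str:
    """SHA-256 of the canonical serialisation of the payload's structure."""
    canonical = json.dumps(schema_of(payload), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


baseline = schema_hash({"aqi_value": 45, "aqi_category": "Good"})
same_shape = schema_hash({"aqi_category": "Moderate", "aqi_value": 61})
drifted = schema_hash({"aqi_value": "45", "aqi_category": "Good"})  # int became str
```

Here `baseline` and `same_shape` match because only values differ, while `drifted` does not. In this simple form a field that is sometimes null also changes the hash, so a production check would likely need to treat nullable fields more leniently.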

Applications

Who uses EPA data and how

Teams across industries use epa.gov data to build competitive products and smarter operations.

01
ESG & Sustainability Reporting

Financial institutions and corporate sustainability teams aggregate TRI and GHGRP data to audit Scope 1 and Scope 2 emissions.

02
Real Estate Due Diligence

Commercial real estate firms query Superfund and brownfield registries to assess environmental liabilities before acquisition.

03
Supply Chain Risk Monitoring

Procurement teams track supplier facilities in the ECHO database to flag environmental violations and compliance risks.

04
Environmental Justice Analysis

Non-profits and academic researchers overlay demographic data with facility emissions footprints to study disproportionate health impacts.

05
Quantitative Hedge Funds

Alternative data teams ingest enforcement penalties and chemical production volumes to model regulatory risk for publicly traded manufacturers.

06
Industrial Market Research

Consultancies analyse TSCA chemical registries and production volumes to forecast shifts in industrial chemical supply chains.

Why DataFlirt

"EPA data is public but heavily fragmented across legacy portals. Normalising facility records across ECHO, TRI, and TSCA requires dedicated infrastructure."

Government endpoints frequently suffer from undocumented schema changes, aggressive rate limiting, and multi-gigabyte static file dumps. DataFlirt abstracts this unreliability. We normalise FRS IDs across all EPA subsystems, parse compliance PDFs into structured text, and deliver clean, relational data directly to your warehouse.

Technical Spec

EPA.gov scraper technical capabilities

Everything supported by our epa.gov scraper — rendered SPA elements, auth walls, rate-limit evasion and beyond.

Capability | Description | Status
FRS ID mapping | Automatic resolution of program IDs to master Facility Registry Service IDs | Supported
PDF permit OCR parsing | Text extraction from scanned enforcement documents and administrative orders | Supported
Geospatial coordinate normalisation | Conversion of legacy projections to standard WGS84 format | Supported
Historical TRI data extraction | Full backfill of toxic release inventory data spanning decades | Supported
Daily AirNow AQI polling | Time-series extraction of hourly monitor readings | Supported
ECHO enforcement action tracking | Continuous monitoring for new violations and penalty assessments | Supported
Change detection (diffs) | Hash-based diff logic to only emit records with changed fields | Supported
Webhook delivery | HTTP POST per record for immediate downstream alerting | Supported
Confidential Business Information | Trade secrets and CBI redacted from TSCA and TRI public filings | Partial
Pre-decisional enforcement documents | Internal EPA memos and unfinalised settlement negotiations | Partial

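
The hash-based change detection listed above can be sketched as follows: hash a canonical serialisation of each record and emit it only when the hash differs from the one last seen for that key. The in-memory `seen` dictionary stands in for a persistent store, and the 30000.0 penalty value is invented to show a change.

```python
import hashlib
import json
from typing import Any, Iterable, Iterator


def record_hash(record: dict[str, Any]) -> str:
    """SHA-256 of the record serialised with sorted keys, so key order is irrelevant."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_records(
    records: Iterable[dict[str, Any]],
    seen: dict[str, str],
    key: str = "facility_id",
) -> Iterator[dict[str, Any]]:
    """Yield records that are new or whose content changed; update `seen` in place."""
    for record in records:
        digest = record_hash(record)
        if seen.get(record[key]) != digest:
            seen[record[key]] = digest
            yield record


seen: dict[str, str] = {}
first = list(changed_records([{"facility_id": "110000345678", "penalty_amount": 25000.0}], seen))
second = list(changed_records([{"penalty_amount": 25000.0, "facility_id": "110000345678"}], seen))
third = list(changed_records([{"facility_id": "110000345678", "penalty_amount": 30000.0}], seen))
```

The first run emits the record because it is new, the second emits nothing even though the key order differs, and the third emits the record again because a field changed.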
Infrastructure

Infrastructure powering the EPA pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus, Tesseract OCR, PostGIS
Legacy Endpoint Normalisation

We map SOAP XML, fragmented CSV dumps, and legacy JSON endpoints into a unified relational schema using Python data validation pipelines.

PDF Extraction Pipeline

Documents are routed to a Kubernetes cluster running Tesseract OCR and spatial layout analysis to extract tables from unstructured permits.

Cloud-Native Orchestration

Pipelines run on AWS Lambda and ECS. Airflow handles scheduling, dependency management, and SLA alerting. All state stored in managed Postgres.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON: Newline-delimited or nested array format
CSV: Flat file with typed columns
XLS: Excel compatible format for analyst teams
Parquet: Columnar format for BigQuery, Snowflake, Athena
AWS S3: Direct bucket delivery with partition logic, compatible with any data lake
Webhook: HTTP POST per record for real-time alerting
API: RESTful endpoints to query extracted datasets
BigQuery: Streamed directly into your dataset
Snowflake: Stage and COPY INTO workflow
PostgreSQL: Direct database upsert with PostGIS support
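
To make two of the options above concrete, here is a small sketch of newline-delimited JSON serialisation and a partitioned object key. The Hive-style `field=value` path layout and the `part-0000.jsonl` file name are assumptions, since the actual partition scheme is agreed per engagement.

```python
import json
from typing import Any, Iterable


def to_ndjson(records: Iterable[dict[str, Any]]) -> str:
    """Serialise records as newline-delimited JSON: one object per line."""
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)


def partition_key(dataset: str, record: dict[str, Any], partition_field: str) -> str:
    """Build a hypothetical object key partitioned by one field of the record."""
    return f"{dataset}/{partition_field}={record[partition_field]}/part-0000.jsonl"


rows = [
    {"tri_facility_id": "44114NDSTR12345", "release_year": 2022, "total_releases": 1600.5},
]
body = to_ndjson(rows)
key = partition_key("tri_emissions", rows[0], "release_year")
```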
// faq

Common questions.

About epa.gov scraping, legality, and pipeline operations.

Is scraping EPA.gov legal?

Yes. Environmental data published on EPA.gov is in the public domain and funded by taxpayers. We strictly extract publicly available records and do not attempt to bypass authentication for internal government systems or access Confidential Business Information (CBI).

How do you handle facility matching across different EPA databases?

We use the EPA Facility Registry Service (FRS). This master index assigns a unique FRS ID to each physical site. We map program-specific IDs (like TRI or RCRA IDs) back to the FRS ID to provide a unified, deduplicated view of a facility.

Can you extract data from scanned PDF permits?

Yes. We run an OCR pipeline using Tesseract to convert scanned enforcement documents and permits into searchable text, allowing us to extract violation dates, penalty amounts, and specific regulatory clauses.

How fresh is the AQI data?

Air quality data from AirNow is polled hourly. We can configure streaming pipelines to deliver these updates to your webhook or database with sub-5-minute latency after the EPA publishes them.

Do you support historical Superfund data?

Yes. We can perform full historical backfills of Superfund site records (SEMS, formerly CERCLIS) and the NPL, capturing listing dates, cleanup milestones, and deletion records spanning decades.

What is the minimum viable engagement?

Our minimum engagement typically starts with a defined facility list (e.g., 5,000 specific factories) or a complete historical dump of a single dataset like TRI. We price based on data volume, update frequency, and schema complexity.

Can you integrate state-level environmental data alongside federal EPA data?

Yes. Many enforcement actions are handled by state agencies (e.g., TCEQ in Texas, CalEPA in California) before reaching federal databases. We can build custom pipelines to scrape state-level portals and normalise the data into the federal schema.

Can I request a sample dataset before committing?

Absolutely. We provide a sample run of up to 500 facility records or a subset of historical emissions data during the scoping process. This allows your engineering team to validate schema fit and data completeness.

$ dataflirt scope --new-project --source=epa.gov ready

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a daily feed of enforcement actions or a historical dump of TRI emissions across 50 states, we scope, build, and operate the pipeline. Tell us what you need.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h