SYSTEM all green · source healthcare.gov · queue 11,402 zip codes · p99 latency 318ms · dataflirt.com · scraper/healthcare-gov
RUN · 14 active pipelines · healthcare.gov live

ACA plan data,
at warehouse scale.

We extract health insurance plans, premium rates, out-of-pocket limits, metal tiers, and drug formularies from healthcare.gov. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.

Plans extracted: 41,294 /run
Premium variations: 1.2M /month
Formulary drugs: 384K /run
Active pipelines: 14
Uptime: 99.94%

Data Dictionary

Every field we extract from healthcare.gov

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Plan Overview objects from healthcare.gov. All fields typed and schema-versioned.

plan_id, issuer_name, plan_name, metal_tier, plan_type, rating_area, state, network_url, formulary_url, summary_url
plan_overview
● 200 OK
{
  "plan_id": "12345TX0010001",
  "issuer_name": "Blue Cross Blue Shield",
  "plan_name": "Blue Advantage Bronze HMO 205",
  "metal_tier": "Bronze",
  "plan_type": "HMO",
  "state": "TX"
}
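A minimal sketch of how a plan_overview record like the sample above could be loaded into a typed structure. The `PlanOverview` class and the `split_plan_id` helper are illustrative names, not part of any delivered SDK, and the ID split assumes the 14-character layout of 5-digit issuer ID, 2-letter state, 3-digit product, and 4-digit component.

```python
import json
import re
from dataclasses import dataclass

# Assumed layout: 5-digit issuer ID, 2-letter state, 3-digit product, 4-digit component.
PLAN_ID_RE = re.compile(r"^(\d{5})([A-Z]{2})(\d{3})(\d{4})$")


@dataclass(frozen=True)
class PlanOverview:
    # Only the six fields shown in the sample record; a full record has more.
    plan_id: str
    issuer_name: str
    plan_name: str
    metal_tier: str
    plan_type: str
    state: str


def split_plan_id(plan_id: str) -> dict:
    """Split a 14-character plan ID into its assumed components."""
    match = PLAN_ID_RE.match(plan_id)
    if match is None:
        raise ValueError(f"unexpected plan_id format: {plan_id!r}")
    issuer, state, product, component = match.groups()
    return {"issuer_id": issuer, "state": state, "product": product, "component": component}


def parse_plan(raw: str) -> PlanOverview:
    """Parse one JSON record and check that the ID and the state field agree."""
    plan = PlanOverview(**json.loads(raw))
    if split_plan_id(plan.plan_id)["state"] != plan.state:
        raise ValueError("state in plan_id does not match the state field")
    return plan


sample = """{
  "plan_id": "12345TX0010001",
  "issuer_name": "Blue Cross Blue Shield",
  "plan_name": "Blue Advantage Bronze HMO 205",
  "metal_tier": "Bronze",
  "plan_type": "HMO",
  "state": "TX"
}"""
plan = parse_plan(sample)
print(plan.metal_tier, split_plan_id(plan.plan_id))
```

Cross-checking the state embedded in the ID against the `state` field is a cheap guard against records being attributed to the wrong market.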

Complete list of extractable fields for Pricing & Premiums objects from healthcare.gov. All fields typed and schema-versioned.

plan_id, base_premium, age_21_premium, age_40_premium, age_60_premium, child_premium, ehb_percent, tobacco_surcharge, rating_area, zip_codes
pricing_premiums
● 200 OK
{
  "plan_id": "12345TX0010001",
  "base_premium": 345.5,
  "age_40_premium": 412.75,
  "ehb_percent": 98.5,
  "tobacco_surcharge": 50.0,
  "rating_area": "Rating Area 3"
}

Complete list of extractable fields for Cost Sharing & Deductibles objects from healthcare.gov. All fields typed and schema-versioned.

plan_id, medical_deductible_individual, medical_deductible_family, drug_deductible_individual, drug_deductible_family, oop_max_individual, oop_max_family, primary_care_copay, specialist_copay, er_copay
cost_sharing_deductibles
● 200 OK
{
  "plan_id": "12345TX0010001",
  "medical_deductible_individual": 7500.0,
  "oop_max_individual": 9100.0,
  "primary_care_copay": 40.0,
  "specialist_copay": 80.0,
  "er_copay": 500.0
}

Complete list of extractable fields for Drug Formularies objects from healthcare.gov. All fields typed and schema-versioned.

rx_cui, ndc_code, drug_name, tier_level, prior_authorisation, step_therapy, quantity_limit, plan_id, issuer_id, coverage_status
drug_formularies
● 200 OK
{
  "rx_cui": "855332",
  "drug_name": "Atorvastatin 20mg",
  "tier_level": "Tier 1",
  "prior_authorisation": false,
  "step_therapy": false,
  "coverage_status": "Covered"
}

Complete list of extractable fields for Provider Networks objects from healthcare.gov. All fields typed and schema-versioned.

network_id, provider_type, facility_name, npi, specialty, accepting_new_patients, telehealth_offered, address, city, state, zip_code
provider_networks
● 200 OK
{
  "network_id": "NW-88392",
  "provider_type": "Facility",
  "facility_name": "Methodist Hospital",
  "npi": "1932485721",
  "specialty": "Acute Care Hospital",
  "accepting_new_patients": true
}

Capabilities

Everything you need from the federal exchange

Our healthcare.gov scraper targets the underlying API endpoints powering the plan comparison tool, extracting clean data across thousands of rating areas without fragile DOM parsing.

Full ACA Plan Extraction

Extract metal tiers, plan types, issuer details, and plan IDs across all rating areas and states on the federal exchange.

Premium Rate Aggregation

Capture base premiums, age-curve pricing, tobacco surcharges, and child rates per geographic rating area.

Cost-Sharing & Copay Data

Extract individual and family deductibles, out-of-pocket maximums, and specific copays for primary care, ER, and specialists.

Drug Formulary Mapping

Map NDC codes and RxNorm identifiers to plan tiers, capturing step therapy and prior authorisation requirements.

Provider Network Parsing

Extract in-network hospitals, specialists, and primary care physicians linked to specific plan network IDs.

Rating Area Resolution

Resolve county-level and zip-code-level plan availability across the 30+ states using the federal exchange.

Subsidy & Tax Credit Logic

Extract advance premium tax credit (APTC) baseline data and cost-sharing reduction (CSR) plan variations.

Quality Rating Capture

Extract CMS star ratings, member experience scores, and clinical quality metrics for each health plan.

Document URL Extraction

Capture direct links to Summary of Benefits and Coverage (SBC), plan brochures, and network directories.

Change Detection & Diffing

Track premium adjustments, network exits, and formulary tier changes across open enrollment periods.
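As an illustration of the change detection described above, here is a minimal sketch that diffs two snapshots keyed by `plan_id`. The `diff_snapshots` function and the second and third plan IDs are made up for the example; the production diffing logic is not described on this page.

```python
def diff_snapshots(old, new, key="plan_id"):
    """Compare two lists of records and report added, removed, and changed rows."""
    old_by_key = {row[key]: row for row in old}
    new_by_key = {row[key]: row for row in new}

    added = sorted(new_by_key.keys() - old_by_key.keys())
    removed = sorted(old_by_key.keys() - new_by_key.keys())

    changed = {}
    for k in sorted(old_by_key.keys() & new_by_key.keys()):
        before, after = old_by_key[k], new_by_key[k]
        # Record (old, new) pairs only for fields whose values differ.
        fields = {
            f: (before.get(f), after.get(f))
            for f in sorted(before.keys() | after.keys())
            if before.get(f) != after.get(f)
        }
        if fields:
            changed[k] = fields
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots from two consecutive runs.
previous = [
    {"plan_id": "12345TX0010001", "base_premium": 345.5},
    {"plan_id": "12345TX0010002", "base_premium": 410.0},
]
current = [
    {"plan_id": "12345TX0010001", "base_premium": 361.25},
    {"plan_id": "12345TX0010003", "base_premium": 298.0},
]
report = diff_snapshots(previous, current)
print(report)
```

The same function works for formulary or network records if `key` is set to the relevant identifier.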

// engagement pipeline

From zip code list to warehouse record

Brief in. Clean data out.

Define Scope
Day 0

Provide target states, rating areas, or specific issuers. We design the extraction schema together.

Pipeline Build
Days 2–4

We configure Scrapy / Playwright crawlers, handle zip code session states, and bypass rate limits for healthcare.gov.

Validation & QA
Days 4–6

Schema validation, null-rate checks, and premium-outlier detection before full launch.

Delivery
ongoing

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
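The null-rate and premium-outlier checks named in the validation step can be sketched as follows. This is an assumption about how such checks might look, not the pipeline's actual QA code; the modified z-score (median and median absolute deviation) and the 3.5 threshold are choices made for the example.

```python
from statistics import median


def null_rate(records, field):
    """Share of records where the field is missing or None."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field) is None)
    return missing / len(records)


def premium_outliers(records, field="base_premium", threshold=3.5):
    """Flag records whose modified z-score (median/MAD based) exceeds the threshold."""
    values = [r[field] for r in records if r.get(field) is not None]
    if len(values) < 3:
        return []
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # all values (nearly) identical; nothing to flag
    return [
        r for r in records
        if r.get(field) is not None
        and abs(0.6745 * (r[field] - med) / mad) > threshold
    ]


# Hypothetical batch: one missing premium and one value with a slipped decimal point.
batch = [
    {"plan_id": "A", "base_premium": 345.5},
    {"plan_id": "B", "base_premium": 352.0},
    {"plan_id": "C", "base_premium": 360.25},
    {"plan_id": "D", "base_premium": 349.0},
    {"plan_id": "E", "base_premium": 3455.0},
    {"plan_id": "F", "base_premium": None},
]
print(round(null_rate(batch, "base_premium"), 3))
print([r["plan_id"] for r in premium_outliers(batch)])
```

A median-based score is used here because a single extreme premium would otherwise inflate the mean and standard deviation and hide itself.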

Under the hood

How our pipeline handles federal exchange complexities

Healthcare.gov relies on strict rate limits and complex session states. Here is how we extract data reliably at scale.

pipeline-monitor · healthcare.gov · live ● active
// fingerprinting
Identity rotation
TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0
// pagination
Page coverage
48,291 pages queued · running
// observability
Pipeline health
99.9% uptime · 142ms p99 latency · 0.3% null rate · 2 alerts
Session state management
Zip code and county tokenisation

Healthcare.gov requires setting a geographic context before rendering plan data. Our crawlers maintain isolated cookie sessions per geographic area, preventing cross-contamination of premium rates.
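One way to picture per-area session isolation is a pool that hands out a separate cookie jar for each geographic context. The sketch below uses only the Python standard library; `GeoSessionPool` is a hypothetical name, and the real crawlers may manage sessions quite differently.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, OpenerDirector, build_opener


class GeoSessionPool:
    """Keep one cookie jar (and URL opener) per (zip code, county) context."""

    def __init__(self):
        self._sessions: dict[tuple[str, str], tuple[CookieJar, OpenerDirector]] = {}

    def _ensure(self, zip_code: str, county_fips: str):
        key = (zip_code, county_fips)
        if key not in self._sessions:
            jar = CookieJar()
            # Each opener stores cookies only in its own jar.
            self._sessions[key] = (jar, build_opener(HTTPCookieProcessor(jar)))
        return self._sessions[key]

    def jar(self, zip_code: str, county_fips: str) -> CookieJar:
        return self._ensure(zip_code, county_fips)[0]

    def opener(self, zip_code: str, county_fips: str) -> OpenerDirector:
        return self._ensure(zip_code, county_fips)[1]

    def __len__(self):
        return len(self._sessions)


pool = GeoSessionPool()
houston = pool.jar("77002", "48201")  # example ZIP / county FIPS pair
dallas = pool.jar("75201", "48113")
print(houston is dallas, len(pool))
```

Because cookies set while browsing one area never reach the jar of another, a premium fetched under one context cannot be silently priced for a different one.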

API payload interception
Direct JSON extraction from backend endpoints

Rather than scraping the DOM, we intercept the undocumented XHR requests powering the plan comparison tool. This yields cleaner, well-structured JSON payloads with precise age-curve pricing data.

Formulary PDF parsing
Converting unstructured drug lists to tabular data

Many issuers still publish drug formularies as complex PDFs. We pipeline these documents through OCR and NLP layers to extract NDC codes, tier levels, and restriction flags into structured database records.
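After OCR, pulling NDC codes out of formulary text is largely a pattern-matching problem. The sketch below assumes hyphenated NDCs in the 4-4-2, 5-3-2, or 5-4-1 segment layouts and zero-pads them to the 11-digit 5-4-2 form; the function names and the sample lines (with made-up codes) are illustrative only.

```python
import re

# Hyphenated NDCs: labeler-product-package, with 10 digits in total
# (4-4-2, 5-3-2, 5-4-1) or 11 digits (5-4-2).
NDC_RE = re.compile(r"\b(\d{4,5})-(\d{3,4})-(\d{1,2})\b")


def to_ndc11(labeler: str, product: str, package: str) -> str:
    """Zero-pad each segment to the 5-4-2 layout and drop the hyphens."""
    return labeler.zfill(5) + product.zfill(4) + package.zfill(2)


def extract_ndcs(text: str) -> list[str]:
    """Return 11-digit NDCs found in a block of OCR'd formulary text."""
    found = []
    for labeler, product, package in NDC_RE.findall(text):
        # Skip hyphenated digit runs that are too short to be an NDC.
        if len(labeler) + len(product) + len(package) in (10, 11):
            found.append(to_ndc11(labeler, product, package))
    return found


# Made-up OCR output; the codes do not refer to real products.
ocr_text = """
DRUG A 10mg tablet   1234-5678-90   Tier 1
DRUG B 20mg capsule  12345-678-90   Tier 2  PA
DRUG C 5mg/mL        12345-6789-0   Tier 3  ST
Ref 1234-567-8 is not an NDC
"""
print(extract_ndcs(ocr_text))
```

Normalising to a single 11-digit form makes the codes joinable against other datasets regardless of how each issuer printed them.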

Rate limiting & WAF
Bypassing Akamai and federal firewall rules

The federal exchange employs strict rate limiting and Akamai bot protection. We distribute requests across a vast pool of US-based residential proxies, pacing requests to mimic standard user navigation patterns.
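Request pacing can be illustrated with a small client-side throttle that enforces a randomised minimum gap between calls. `Pacer` and its default delays are assumptions for the example; the actual pacing parameters are not published here.

```python
import random
import time


class Pacer:
    """Enforce a jittered minimum gap between consecutive requests."""

    def __init__(self, base_delay=2.0, jitter=1.0,
                 clock=time.monotonic, sleep=time.sleep, seed=None):
        self.base_delay = base_delay   # fixed part of the gap, in seconds
        self.jitter = jitter           # upper bound of the random extra gap
        self._clock = clock            # injectable so the class can be tested
        self._sleep = sleep
        self._rng = random.Random(seed)
        self._next_allowed = None      # no request sent yet

    def wait(self) -> float:
        """Sleep until the next slot opens; return the seconds slept."""
        slept = 0.0
        if self._next_allowed is not None:
            slept = max(self._next_allowed - self._clock(), 0.0)
            if slept > 0:
                self._sleep(slept)
        gap = self.base_delay + self._rng.uniform(0, self.jitter)
        self._next_allowed = self._clock() + gap
        return slept


# Tiny delays so the demo finishes quickly; real values would be far larger.
pacer = Pacer(base_delay=0.01, jitter=0.01, seed=7)
waits = [pacer.wait() for _ in range(3)]
print(waits)
```

Calling `pacer.wait()` before each request keeps the request rate bounded no matter how fast the rest of the crawler runs.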

Data normalisation
Standardising issuer-specific terminology

Different insurers use varied terminology for copays and tier levels. Our pipeline applies regular expressions and mapping dictionaries to normalise these fields into a unified, queryable schema.
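A minimal sketch of the regex-plus-dictionary approach, assuming copays arrive as free text and tier labels vary by issuer. The `TIER_MAP` entries and function names are hypothetical; a real mapping would be built from the labels issuers actually use.

```python
import re

# Hypothetical mapping from issuer tier labels to one unified label.
TIER_MAP = {
    "tier 1": "Tier 1",
    "tier one": "Tier 1",
    "t1": "Tier 1",
    "tier 2": "Tier 2",
    "tier two": "Tier 2",
    "t2": "Tier 2",
}

# A dollar sign followed by an amount, e.g. "$40" or "$1,500.00".
DOLLAR_RE = re.compile(r"\$\s*(\d[\d,]*(?:\.\d+)?)")


def normalise_copay(text):
    """Return the copay in dollars as a float, or None if no dollar amount is stated."""
    cleaned = text.strip().lower()
    if cleaned in {"no charge", "free"}:
        return 0.0
    match = DOLLAR_RE.search(cleaned)
    if match is None:
        return None  # e.g. percentage coinsurance, handled elsewhere
    return float(match.group(1).replace(",", ""))


def normalise_tier(label):
    """Map an issuer-specific tier label to the unified label, or None if unknown."""
    key = re.sub(r"\s+", " ", label.strip().lower())
    return TIER_MAP.get(key)


print(normalise_copay("$40 Copay after deductible"))
print(normalise_copay("20% coinsurance"))
print(normalise_tier("Tier  One"))
```

Returning `None` for unrecognised input, rather than guessing, lets unmapped labels surface in null-rate checks instead of being silently misclassified.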

Applications

Who uses healthcare.gov data

Teams across industries use healthcare.gov data to build competitive products and smarter operations.

01
Market Intelligence & Competitive Analysis

Health insurers monitor competitor premiums, network sizing, and metal tier positioning across overlapping rating areas.

02
Broker & Agency Tooling

Health insurance brokerages power their proprietary quoting and comparison engines using our normalised plan datasets.

03
Pharma Market Access

Pharmaceutical companies track formulary tier placement and utilisation management restrictions for their drug portfolios.

04
Actuarial Modelling

Actuarial teams ingest historical premium and deductible data to model risk and price future plan offerings.

05
Provider Network Optimisation

Healthcare systems analyse network adequacy and competitor overlap to negotiate better reimbursement rates with payers.

06
Policy & Academic Research

Health policy researchers track ACA market stability, subsidy impacts, and out-of-pocket cost trends over time.

Why DataFlirt

"Healthcare.gov contains the definitive dataset of US individual health insurance markets, but extracting it across 30,000 zip codes requires significant infrastructure."

Most teams underestimate the complexity of federal exchange data. Reliable healthcare.gov scraping requires managing thousands of geographic sessions, parsing undocumented APIs, and normalising disparate issuer formats. DataFlirt absorbs that complexity so your engineers can focus on actuarial analysis, not pipeline maintenance.

Technical Spec

Healthcare.gov scraper technical capabilities

What our healthcare.gov scraper supports, from rendered SPA elements to rate-limit handling, and where coverage stops short.

Plan pricing & age curves
Extract base rates and age-adjusted premiums per rating area
Supported
Deductible & OOP max
Capture individual and family cost-sharing limits
Supported
Drug formulary tiers
Map RxNorm/NDC codes to coverage tiers and restrictions
Supported
CMS Star Ratings
Extract clinical quality and member experience scores
Supported
Geographic rating areas
Resolve plan availability down to the county and zip code level
Supported
SBC document links
Capture URLs for Summary of Benefits and Coverage documents
Supported
State-based exchanges
Data from Covered California, NY State of Health, etc.
Partial
Member enrollment status
PII/PHI regarding actual user enrollments and eligibility
Not supported
Medicaid eligibility API
Access to federal data services hub for income verification
Not supported
Infrastructure

Infrastructure powering the pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy · Playwright · Python 3.12 · Redis · PostgreSQL · Apache Airflow · AWS Lambda · S3 · CloudWatch · 2Captcha · CapSolver · Residential Proxies · Docker · Kubernetes · Grafana · Prometheus
XHR Interception & API Replay

Instead of fragile DOM scraping, we monitor network traffic and replay undocumented API calls to the healthcare.gov backend, extracting clean JSON payloads directly.

Distributed Session Management

We maintain thousands of concurrent geographic sessions using Redis, ensuring premium data is accurately tied to the correct rating area without cross-contamination.

US-Residential Proxy Pool

To navigate federal firewalls and Akamai bot protection, we route traffic exclusively through US-based ISP proxies, rotating IPs dynamically based on response latency and block rates.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON
Newline-delimited or nested - schema versioned per run
CSV
Flat file with typed columns - Excel/Sheets compatible
Parquet
Columnar format for BigQuery, Snowflake, Athena
S3
Direct bucket delivery - compatible with any data lake
Webhook
HTTP POST per record for real-time downstream processing
XLS
Excel format for business and actuarial teams
API
REST endpoint for querying the extracted plan database
PostgreSQL
Direct upsert into your relational database schema
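To make two of the formats above concrete, the sketch below serialises the same records as newline-delimited JSON and as CSV with a fixed column order, using only the standard library. The function names are illustrative and say nothing about how the delivery layer is actually built.

```python
import csv
import io
import json


def to_ndjson(records):
    """Serialise records as newline-delimited JSON, one object per line."""
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in records)


def to_csv(records, fieldnames):
    """Serialise records as CSV with a fixed column order; missing fields stay empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


rows = [
    {"plan_id": "12345TX0010001", "base_premium": 345.5, "rating_area": "Rating Area 3"},
]
print(to_ndjson(rows), end="")
print(to_csv(rows, ["plan_id", "base_premium", "rating_area"]), end="")
```

Fixing the column order explicitly keeps CSV files comparable across runs even when a record lacks an optional field.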
// faq

Common questions.

About healthcare.gov scraping, legality, and pipeline operations.

Is scraping healthcare.gov legal?

Scraping publicly available plan and pricing data from healthcare.gov is generally permissible. DataFlirt targets only public, non-authenticated market data. We do not extract PII/PHI, circumvent authentication walls, or attempt to access the federal data services hub.

How do you handle the geographic variations in plan data?

Healthcare.gov relies on rating areas determined by zip code and county. Our crawlers systematically iterate through a master list of US zip codes, setting the appropriate session state to extract localised premium and network data.

Can you extract data from state-based exchanges?

This specific pipeline targets the federal exchange serving over 30 states. State-based exchanges require separate custom pipelines due to entirely different underlying architectures and schemas.

How do you extract drug formulary data?

Where available, we extract structured JSON from the formulary search endpoints. If issuers only provide PDF formularies, we utilise an OCR and NLP pipeline to convert the documents into structured tabular data mapping NDC codes to tiers.

How fresh is the data?

During the Open Enrollment Period, we can configure daily or weekly runs to capture plan updates and corrections. Outside of OEP, monthly runs are typical for capturing mid-year network changes.

Do you capture cost-sharing reduction plan variations?

Yes. We extract the standard plan designs as well as the 73%, 87%, and 94% actuarial value CSR variations, detailing the reduced deductibles and copays for eligible individuals.
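For readers joining CSR variations back to their base plans, here is a hedged sketch. It assumes the variant is carried as a two-digit suffix on the plan ID (`-04`, `-05`, `-06` for the 73%, 87%, and 94% variations, `-01` for the standard on-exchange design), which is our understanding of the CMS convention; whether delivered records use this exact form is not stated above, and `describe_variant` is a made-up helper.

```python
# Assumed variant suffixes; treat the mapping as illustrative.
CSR_VARIANTS = {
    "01": ("Standard on-exchange", None),
    "04": ("CSR 73% AV", 0.73),
    "05": ("CSR 87% AV", 0.87),
    "06": ("CSR 94% AV", 0.94),
}


def describe_variant(plan_variant_id: str) -> dict:
    """Split 'PLANID-NN' into the base plan ID and a description of the variant."""
    plan_id, sep, suffix = plan_variant_id.partition("-")
    if not sep or suffix not in CSR_VARIANTS:
        raise ValueError(f"unrecognised plan variant: {plan_variant_id!r}")
    label, actuarial_value = CSR_VARIANTS[suffix]
    return {
        "plan_id": plan_id,
        "variant": suffix,
        "label": label,
        "actuarial_value": actuarial_value,
    }


print(describe_variant("12345TX0010001-05"))
```

Keeping the base `plan_id` separate from the variant lets the reduced deductibles and copays be compared side by side with the standard design.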

$ dataflirt scope --new-project --source=healthcare.gov ready

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off export of all ACA plans or a continuous feed of formulary changes, we scope, build, and operate the pipeline. Tell us what you need.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h