We extract health insurance plans, premium rates, out-of-pocket limits, metal tiers, and drug formularies from healthcare.gov. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.
Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.
Complete list of extractable fields for Plan Overview objects from healthcare.gov. All fields typed and schema-versioned.
"plan_id": "12345TX0010001", "issuer_name": "Blue Cross Blue Shield", "plan_name": "Blue Advantage Bronze HMO 205", "metal_tier": "Bronze", "plan_type": "HMO", "state": "TX"
| # | plan_id | issuer_name | plan_name | metal_tier | plan_type | rating_area |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Pricing & Premiums objects from healthcare.gov. All fields typed and schema-versioned.
"plan_id": "12345TX0010001", "base_premium": 345.5, "age_40_premium": 412.75, "ehb_percent": 98.5, "tobacco_surcharge": 50.0, "rating_area": "Rating Area 3"
| # | plan_id | base_premium | age_21_premium | age_40_premium | age_60_premium | child_premium |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Cost Sharing & Deductibles objects from healthcare.gov. All fields typed and schema-versioned.
"plan_id": "12345TX0010001", "medical_deductible_individual": 7500.0, "oop_max_individual": 9100.0, "primary_care_copay": 40.0, "specialist_copay": 80.0, "er_copay": 500.0
| # | plan_id | medical_deductible_individual | medical_deductible_family | drug_deductible_individual | drug_deductible_family | oop_max_individual |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Drug Formularies objects from healthcare.gov. All fields typed and schema-versioned.
"rx_cui": "855332", "drug_name": "Atorvastatin 20mg", "tier_level": "Tier 1", "prior_authorisation": false, "step_therapy": false, "coverage_status": "Covered"
| # | rx_cui | ndc_code | drug_name | tier_level | prior_authorisation | step_therapy |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Provider Networks objects from healthcare.gov. All fields typed and schema-versioned.
"network_id": "NW-88392", "provider_type": "Facility", "facility_name": "Methodist Hospital", "npi": "1932485721", "specialty": "Acute Care Hospital", "accepting_new_patients": true
| # | network_id | provider_type | facility_name | npi | specialty | accepting_new_patients |
|---|---|---|---|---|---|---|
Our healthcare.gov scraper targets the underlying API endpoints powering the plan comparison tool, extracting clean data across hundreds of rating areas without fragile DOM parsing.
Extract metal tiers, plan types, issuer details, and plan IDs across all rating areas and states on the federal exchange.
Capture base premiums, age-curve pricing, tobacco surcharges, and child rates per geographic rating area.
Extract individual and family deductibles, out-of-pocket maximums, and specific copays for primary care, ER, and specialists.
Map NDC codes and RxNorm identifiers to plan tiers, capturing step therapy and prior authorisation requirements.
Extract in-network hospitals, specialists, and primary care physicians linked to specific plan network IDs.
Resolve county-level and zip-code-level plan availability across the 30+ states using the federal exchange.
Extract advance premium tax credit (APTC) baseline data and cost-sharing reduction (CSR) plan variations.
Extract CMS star ratings, member experience scores, and clinical quality metrics for each health plan.
Capture direct links to Summary of Benefits and Coverage (SBC), plan brochures, and network directories.
Track premium adjustments, network exits, and formulary tier changes across open enrollment periods.
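Tracking premium adjustments between runs comes down to comparing two snapshots keyed by `plan_id`. The following is a minimal sketch of that comparison, not the production implementation: `PremiumChange`, `diff_premiums`, and the sample premiums are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PremiumChange:
    plan_id: str
    old_premium: float
    new_premium: float

    @property
    def pct_change(self) -> float:
        # Assumes old_premium is greater than zero.
        return (self.new_premium - self.old_premium) / self.old_premium * 100.0


def diff_premiums(previous: dict, current: dict):
    """Compare two {plan_id: base_premium} snapshots.

    Returns (changes, added_plan_ids, removed_plan_ids).
    """
    changes = [
        PremiumChange(pid, previous[pid], current[pid])
        for pid in sorted(previous.keys() & current.keys())
        if previous[pid] != current[pid]
    ]
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    return changes, added, removed


# Illustrative snapshots; plan IDs and premiums are made up for the example.
prev_snapshot = {"12345TX0010001": 345.50, "12345TX0010002": 410.00}
curr_snapshot = {"12345TX0010001": 362.78, "12345TX0010003": 298.10}

changes, added, removed = diff_premiums(prev_snapshot, curr_snapshot)
```

Plans that appear only in the earlier snapshot end up in `removed`, which is one way a market exit could surface. The same pattern would apply to formulary tiers keyed by `rx_cui`.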
Brief in. Clean data out.
Provide target states, rating areas, or specific issuers. We design the extraction schema together.
We configure Scrapy / Playwright crawlers, handle zip code session states, and bypass rate limits for healthcare.gov.
Schema validation, null-rate checks, and premium-outlier detection before full launch.
JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
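The QA step in this process (schema validation, null-rate checks, premium-outlier detection) could look roughly like the sketch below. The field list, the median/MAD outlier rule, and the 3.5 threshold are assumptions made for illustration; the actual checks are agreed during scoping.

```python
import statistics

# Hypothetical required fields and their expected Python types.
REQUIRED_FIELDS = {
    "plan_id": str,
    "base_premium": (int, float),
    "rating_area": str,
}


def schema_errors(record: dict) -> list:
    """Return names of non-null fields whose value has the wrong type."""
    return [
        name
        for name, expected in REQUIRED_FIELDS.items()
        if record.get(name) is not None and not isinstance(record[name], expected)
    ]


def null_rate(records: list, field: str) -> float:
    """Share of records where the field is missing or null."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field) is None)
    return missing / len(records)


def premium_outliers(records: list, field: str = "base_premium",
                     threshold: float = 3.5) -> list:
    """Flag records whose modified z-score (median/MAD based) exceeds threshold."""
    values = [r[field] for r in records if isinstance(r.get(field), (int, float))]
    if len(values) < 3:
        return []
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []
    return [
        r for r in records
        if isinstance(r.get(field), (int, float))
        and 0.6745 * abs(r[field] - median) / mad > threshold
    ]


# Illustrative records; plan "D" simulates a misplaced decimal point.
records = [
    {"plan_id": "A", "base_premium": 345.5, "rating_area": "Rating Area 3"},
    {"plan_id": "B", "base_premium": 360.0, "rating_area": "Rating Area 3"},
    {"plan_id": "C", "base_premium": 352.25, "rating_area": "Rating Area 3"},
    {"plan_id": "D", "base_premium": 3525.0, "rating_area": "Rating Area 3"},
    {"plan_id": "E", "base_premium": None, "rating_area": "Rating Area 3"},
]
outliers = premium_outliers(records)
```

A median-based rule is used here because a single bad premium would inflate a mean and standard deviation enough to hide itself in a small sample.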
Healthcare.gov enforces strict rate limits and depends on complex session state. Here is how we extract data reliably at scale.
Healthcare.gov requires setting a geographic context before rendering plan data. Our crawlers maintain isolated cookie sessions per geographic area, preventing cross-contamination of premium rates.
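As an illustration of per-area cookie isolation, the sketch below keeps one cookie jar and one opener per geographic key using only the Python standard library. `AreaSessionPool` and the key format are hypothetical, and no request is sent; a production crawler would more likely do this inside Scrapy or Playwright.

```python
import urllib.request
from http.cookiejar import CookieJar


class AreaSessionPool:
    """Keeps one cookie jar (and one opener) per geographic key."""

    def __init__(self):
        self._jars = {}
        self._openers = {}

    def opener_for(self, area_key: str) -> urllib.request.OpenerDirector:
        # Create the jar and opener lazily, then reuse them for that area.
        if area_key not in self._openers:
            jar = CookieJar()
            self._jars[area_key] = jar
            self._openers[area_key] = urllib.request.build_opener(
                urllib.request.HTTPCookieProcessor(jar)
            )
        return self._openers[area_key]

    def jar_for(self, area_key: str) -> CookieJar:
        self.opener_for(area_key)
        return self._jars[area_key]


pool = AreaSessionPool()
tx3 = pool.opener_for("TX|Rating Area 3")   # hypothetical key format
tx4 = pool.opener_for("TX|Rating Area 4")
```

Because each opener owns its own jar, a cookie set while browsing one area is never sent with a request made for another, which is the cross-contamination the paragraph above describes.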
Rather than scraping the DOM, we intercept the undocumented XHR requests powering the plan comparison tool. This yields cleaner, well-structured JSON payloads with precise age-curve pricing data.
Many issuers still publish drug formularies as complex PDFs. We run these documents through OCR and NLP layers to extract NDC codes, tier levels, and restriction flags into structured database records.
The federal exchange employs strict rate limiting and Akamai bot protection. We distribute requests across a vast pool of US-based residential proxies, pacing requests to mimic standard user navigation patterns.
Different insurers use varied terminology for copays and tier levels. Our pipeline applies regular expressions and mapping dictionaries to normalise these fields into a unified, queryable schema.
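The normalisation approach described above (regular expressions plus mapping dictionaries) might be sketched as follows. The alias table and the copay rules are illustrative assumptions, not the production dictionaries, which are far larger and issuer-specific.

```python
import re

# Hypothetical alias table: issuer-specific labels mapped to one canonical tier.
TIER_ALIASES = {
    "tier 1": "Tier 1",
    "t1": "Tier 1",
    "tier one": "Tier 1",
    "tier 2": "Tier 2",
    "t2": "Tier 2",
    "tier two": "Tier 2",
}

# A dollar amount such as "$40", "$1,500" or "$500.00".
AMOUNT_RE = re.compile(r"\$\s*([\d,]+(?:\.\d+)?)")
NO_CHARGE_RE = re.compile(r"no\s+charge", re.IGNORECASE)


def normalise_tier(label: str):
    """Map a raw tier label to its canonical form, or None if unknown."""
    key = re.sub(r"\s+", " ", label.strip().lower())
    return TIER_ALIASES.get(key)


def normalise_copay(text):
    """Extract a copay in dollars from free text, or None if no amount is given."""
    if text is None:
        return None
    if NO_CHARGE_RE.search(text):
        return 0.0
    match = AMOUNT_RE.search(text)
    if match is None:
        return None  # e.g. coinsurance expressed as a percentage
    return float(match.group(1).replace(",", ""))
```

For example, `normalise_copay("$40 Copay after deductible")` yields `40.0`, while a percentage such as `"20% coinsurance"` yields `None` so it can be routed to a separate coinsurance field instead of being misread as a dollar copay.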
Health insurers monitor competitor premiums, network sizing, and metal tier positioning across overlapping rating areas.
Health insurance brokerages power their proprietary quoting and comparison engines using our normalised plan datasets.
Pharmaceutical companies track formulary tier placement and utilisation management restrictions for their drug portfolios.
Actuarial teams ingest historical premium and deductible data to model risk and price future plan offerings.
Healthcare systems analyse network adequacy and competitor overlap to negotiate better reimbursement rates with payers.
Health policy researchers track ACA market stability, subsidy impacts, and out-of-pocket cost trends over time.
"Healthcare.gov contains the definitive dataset of US individual health insurance markets, but extracting it across 30,000 zip codes requires significant infrastructure."
Most teams underestimate the complexity of federal exchange data. Reliable healthcare.gov scraping requires managing thousands of geographic sessions, parsing undocumented APIs, and normalising disparate issuer formats. DataFlirt absorbs that complexity so your engineers can focus on actuarial analysis, not pipeline maintenance.
Everything supported by our healthcare.gov scraper — rendered SPA elements, auth walls, rate-limit evasion and beyond.
Open-source tooling on proven cloud infra — no vendor lock-in, full observability.
Instead of fragile DOM scraping, we monitor network traffic and replay undocumented API calls to the healthcare.gov backend, extracting clean JSON payloads directly.
We maintain thousands of concurrent geographic sessions using Redis, ensuring premium data is accurately tied to the correct rating area without cross-contamination.
To navigate federal firewalls and Akamai bot protection, we route traffic exclusively through US-based ISP proxies, rotating IPs dynamically based on response latency and block rates.
Data delivered to where your team already works — no new tooling required.
About healthcare.gov scraping, legality, and pipeline operations.
Scraping publicly available plan and pricing data from healthcare.gov is generally permissible. DataFlirt targets only public, non-authenticated market data. We do not extract PII/PHI, circumvent authentication walls, or attempt to access the federal data services hub.
Healthcare.gov relies on rating areas determined by zip code and county. Our crawlers systematically iterate through a master list of US zip codes, setting the appropriate session state to extract localised premium and network data.
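One way to picture the iteration is to group the master ZIP list by state and county first, so each geographic context is set up once. The sketch below assumes a hypothetical `ZipRow` layout for that master list and uses a handful of sample rows; the real list and the exact session parameters are not shown here.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class ZipRow(NamedTuple):
    zip_code: str
    county_fips: str
    state: str


def geographic_contexts(rows: Iterable[ZipRow], states: set) -> dict:
    """Group ZIP codes by (state, county_fips), keeping only the target states."""
    grouped = defaultdict(list)
    for row in rows:
        if row.state in states:
            grouped[(row.state, row.county_fips)].append(row.zip_code)
    return {key: sorted(zips) for key, zips in sorted(grouped.items())}


# Sample rows for illustration only.
master_list = [
    ZipRow("77003", "48201", "TX"),
    ZipRow("77002", "48201", "TX"),
    ZipRow("75201", "48113", "TX"),
    ZipRow("90001", "06037", "CA"),  # filtered out: not in the target states
]

contexts = geographic_contexts(master_list, states={"TX"})
for (state, county_fips), zip_codes in contexts.items():
    # A crawler would set the session state for this context here.
    pass
```

Filtering by state up front keeps states that run their own exchanges out of the run, since those need separate pipelines.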
This specific pipeline targets the federal exchange serving over 30 states. State-based exchanges require separate custom pipelines due to entirely different underlying architectures and schemas.
Where available, we extract structured JSON from the formulary search endpoints. If issuers only provide PDF formularies, we utilise an OCR and NLP pipeline to convert the documents into structured tabular data mapping NDC codes to tiers.
During the Open Enrollment Period, we can configure daily or weekly runs to capture plan updates and corrections. Outside of OEP, monthly runs are typical for capturing mid-year network changes.
Yes. We extract the standard plan designs as well as the 73%, 87%, and 94% actuarial value CSR variations, detailing the reduced deductibles and copays for eligible individuals.
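As a sketch of how the CSR variations can be told apart in a dataset, the example below assumes each record carries a plan ID with a two-digit variant suffix (for example `12345TX0010001-05`) and that suffixes `04`, `05`, and `06` denote the 73%, 87%, and 94% variations. That suffix mapping is an assumption to verify against current CMS documentation, and the function names are hypothetical.

```python
import re

# Assumed mapping from variant suffix to CSR actuarial value (percent).
# Verify against current CMS documentation before relying on it.
CSR_ACTUARIAL_VALUE = {"04": 73, "05": 87, "06": 94}

# 14-character plan ID (5 digits, 2-letter state, 7 digits) plus a 2-digit variant.
VARIANT_ID_RE = re.compile(r"^(\d{5}[A-Z]{2}\d{7})-(\d{2})$")


def split_variant(plan_variant_id: str):
    """Split '12345TX0010001-04' into ('12345TX0010001', '04')."""
    match = VARIANT_ID_RE.match(plan_variant_id)
    if match is None:
        raise ValueError(f"Unrecognised plan variant ID: {plan_variant_id!r}")
    return match.group(1), match.group(2)


def csr_actuarial_value(plan_variant_id: str):
    """Return 73, 87 or 94 for a CSR variation, or None for any other variant."""
    _, variant = split_variant(plan_variant_id)
    return CSR_ACTUARIAL_VALUE.get(variant)
```

Keeping the base plan ID and the variant as separate columns lets the reduced deductibles and copays of each variation be joined back to the standard plan design.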
20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off export of all ACA plans or a continuous feed of formulary changes, we scope, build, and operate the pipeline. Tell us what you need.