
FDA regulatory data,
structured for compliance.

We extract enforcement reports, drug approvals, device clearances, and adverse events from fda.gov databases. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake.

Recalls tracked: 43,912 /month
Warning letters: 1,204 /year
510(k) clearances: 8,491 /quarter
Active pipelines: 64
Uptime: 99.94%

Data Dictionary

Every field we extract from fda.gov

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Warning Letters objects from fda.gov. All fields typed and schema-versioned.

letter_id, issue_date, company_name, subject, issuing_office, violative_products, inspections_dates, letter_url, extracted_text
warning_letters
● 200 OK
"letter_id": "WL-641928",
"issue_date": "2026-03-14",
"company_name": "PharmaCorp Synthetics Ltd",
"issuing_office": "Center for Drug Evaluation and Research",
"violative_products": "Unapproved New Drugs",
"letter_url": "https://www.fda.gov/inspections/warning-letters/641928"
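As a rough illustration of what "typed" could mean for the sample above, the sketch below loads the JSON into a Python dataclass. The `WarningLetter` class and `parse_warning_letter` helper are hypothetical names chosen for this example, and the field types are assumptions; only the field names come from the list above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class WarningLetter:
    # Field names follow the list above; the types are assumptions.
    letter_id: str
    issue_date: date
    company_name: str
    issuing_office: str
    violative_products: str
    letter_url: str
    subject: Optional[str] = None
    inspections_dates: Optional[str] = None
    extracted_text: Optional[str] = None


def parse_warning_letter(raw: dict) -> WarningLetter:
    """Build a typed record from a raw JSON object, parsing the ISO date."""
    fields = dict(raw)
    fields["issue_date"] = date.fromisoformat(raw["issue_date"])
    # Unknown keys raise TypeError here, which doubles as a basic schema check.
    return WarningLetter(**fields)


sample = {
    "letter_id": "WL-641928",
    "issue_date": "2026-03-14",
    "company_name": "PharmaCorp Synthetics Ltd",
    "issuing_office": "Center for Drug Evaluation and Research",
    "violative_products": "Unapproved New Drugs",
    "letter_url": "https://www.fda.gov/inspections/warning-letters/641928",
}
letter = parse_warning_letter(sample)
```

Fields missing from the sample, such as `extracted_text`, default to `None` in this sketch.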

Complete list of extractable fields for Enforcement Reports objects from fda.gov. All fields typed and schema-versioned.

recall_number, product_description, code_info, recalling_firm, reason_for_recall, recall_class, status, distribution_pattern, date_initiated
enforcement_reports
● 200 OK
"recall_number": "D-1425-2026",
"product_description": "Aspirin 81mg Tablets, 100 count bottle",
"recalling_firm": "HealthMeds LLC",
"reason_for_recall": "Presence of foreign particulate matter",
"recall_class": "Class II",
"status": "Ongoing",
"date_initiated": "2026-02-28"

Complete list of extractable fields for 510(k) Clearances objects from fda.gov. All fields typed and schema-versioned.

k_number, device_name, applicant, contact, regulation_number, classification_product_code, date_received, decision_date, decision
510(k)_clearances
● 200 OK
"k_number": "K240182",
"device_name": "Advanced Cardiac Monitor",
"applicant": "CardioTech Devices Inc.",
"classification_product_code": "MHX",
"decision_date": "2026-01-15",
"decision": "Substantially Equivalent (SESE)"

Complete list of extractable fields for Orange Book objects from fda.gov. All fields typed and schema-versioned.

appl_no, product_no, form, dosage, product_mkt_status, te_code, reference_drug, active_ingredient, patent_expire_date
orange_book
● 200 OK
"appl_no": "NDA214589",
"product_no": "001",
"form": "TABLET;ORAL",
"dosage": "50MG",
"product_mkt_status": "Prescription",
"te_code": "AB",
"active_ingredient": "METOPROLOL SUCCINATE"

Complete list of extractable fields for Inspections objects from fda.gov. All fields typed and schema-versioned.

fei_number, legal_name, city, state, country, inspection_end_date, classification, project_area
inspections
● 200 OK
"fei_number": "3014582910",
"legal_name": "BioManufacturing Partners",
"city": "Boston",
"country": "United States",
"inspection_end_date": "2026-04-02",
"classification": "Voluntary Action Indicated (VAI)"
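The `classification` value in the sample combines a label and a code. A minimal sketch of reducing it to the three-letter code might look like the following; the `classification_code` helper is a hypothetical name, while the NAI, VAI, and OAI expansions are the standard FDA inspection classifications.

```python
import re

CLASSIFICATIONS = {
    "NAI": "No Action Indicated",
    "VAI": "Voluntary Action Indicated",
    "OAI": "Official Action Indicated",
}
_CODE_SUFFIX = re.compile(r"\((NAI|VAI|OAI)\)\s*$")


def classification_code(label: str) -> str:
    """Return NAI, VAI, or OAI for a label such as 'Voluntary Action Indicated (VAI)'."""
    match = _CODE_SUFFIX.search(label)
    if match:
        return match.group(1)
    wanted = label.strip().lower()
    for code, name in CLASSIFICATIONS.items():
        if wanted in (code.lower(), name.lower()):
            return code
    raise ValueError(f"unrecognised classification: {label!r}")


code = classification_code("Voluntary Action Indicated (VAI)")
```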

Capabilities

Regulatory intelligence without the manual overhead

Our fda.gov scraper navigates legacy government architectures, parses unstructured PDFs, and normalises company names across fragmented databases to deliver clean compliance signals.

Warning Letters Extraction

Capture metadata and full text from FDA warning letters. We parse HTML pages and run OCR on legacy PDF documents.

Enforcement Reports

Track Class I, II, and III recalls across drugs, devices, and food. Filter by recalling firm, product code, or distribution pattern.

510(k) Clearances and PMA Approvals

Extract medical device clearance records, applicant details, product codes, and decision dates from the premarket notification database.

Orange Book Tracking

Monitor approved drug products, therapeutic equivalence codes, and patent expiration dates to anticipate generic market entry.

MAUDE and FAERS

Extract adverse event reports for medical devices and drugs to monitor post-market safety signals.

Inspection Classifications

Track facility inspections (OAI, VAI, NAI) globally to monitor supply chain compliance and manufacturing risks.

ASP.NET Form Handling

Navigate legacy FDA search forms automatically. We manage ViewState tokens and session cookies to extract deep paginated results.

Scheduled Diffing

Run daily pipelines that only push new or modified records, reducing database bloat and highlighting immediate compliance risks.

Cross-Database Normalisation

We map FEI numbers, applicant names, and product codes across disconnected FDA datasets to create unified entity profiles.

OpenFDA API Augmentation

Combine direct web scraping of the latest portal updates with historical bulk data from the OpenFDA API for complete coverage.
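For the bulk side of that combination, a query against the public openFDA enforcement endpoint could be assembled roughly as follows. This is a sketch under assumptions: the `enforcement_url` helper is hypothetical, the `classification` and `report_date` search fields and the `limit` parameter should be checked against the current openFDA documentation, and no request is actually sent here.

```python
from urllib.parse import urlencode

OPENFDA_BASE = "https://api.fda.gov"


def enforcement_url(category: str, recall_class: str, start: str, end: str,
                    limit: int = 100) -> str:
    """Build an openFDA enforcement query URL.

    category is assumed to be one of 'drug', 'device', or 'food';
    start and end are YYYYMMDD strings.
    """
    search = f'classification:"{recall_class}" AND report_date:[{start} TO {end}]'
    query = urlencode({"search": search, "limit": limit})
    return f"{OPENFDA_BASE}/{category}/enforcement.json?{query}"


url = enforcement_url("drug", "Class I", "20260101", "20260331")
```

Records fetched this way would still need de-duplicating against scraped records, for example on `recall_number`.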

// engagement pipeline

From government portal to data warehouse

Brief in. Clean data out.

Define Scope
Day 0

Select the FDA databases, specific product codes, or company names you need to monitor.

Pipeline Build
Days 2–4

We configure crawlers to handle FDA search forms, PDF parsing, and pagination limits.

Validation & QA
Days 4–6

Schema checks ensure entity names, dates, and classification codes meet your normalisation standards.
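A schema check of this kind might be sketched as below for enforcement records. The `validate_recall` function, the required-field list, and the error messages are illustrative assumptions, not the production rules.

```python
from datetime import date

RECALL_CLASSES = {"Class I", "Class II", "Class III"}
REQUIRED = ("recall_number", "recalling_firm", "recall_class", "date_initiated")


def validate_recall(record: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passes."""
    errors = [f"missing {name}" for name in REQUIRED if not record.get(name)]
    recall_class = record.get("recall_class")
    if recall_class and recall_class not in RECALL_CLASSES:
        errors.append(f"unknown recall_class: {recall_class}")
    initiated = record.get("date_initiated")
    if initiated:
        try:
            date.fromisoformat(initiated)
        except ValueError:
            errors.append(f"date_initiated is not ISO 8601: {initiated}")
    return errors


good = {
    "recall_number": "D-1425-2026",
    "recalling_firm": "HealthMeds LLC",
    "recall_class": "Class II",
    "date_initiated": "2026-02-28",
}
bad = {
    "recall_number": "D-1425-2026",
    "recall_class": "Class 2",
    "date_initiated": "02/28/2026",
}
good_errors = validate_recall(good)
bad_errors = validate_recall(bad)
```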

Delivery
ongoing

JSON, CSV, or Parquet files pushed to your S3 bucket or data warehouse on a daily or weekly cadence.

Under the hood

How we handle FDA database complexities

Government sites are notoriously difficult to scrape reliably. Here is how we bypass legacy architecture bottlenecks.

pipeline-monitor · fda.gov · live · active
// fingerprinting
Identity rotation: TLS fingerprint randomised, user-agent rotated, IP pool residential, challenges blocked 0
// pagination
Page coverage: 48,291 pages queued, running
// observability
Pipeline health: 99.9% uptime, 142ms p99 latency, 0.3% null rate, 2 alerts

Legacy Architecture
ASP.NET ViewState management

Many FDA databases rely on legacy ASP.NET web forms. Our crawlers read the __VIEWSTATE and __EVENTVALIDATION tokens from each response and submit them with the next request, so search queries and deep pagination continue to work without session timeouts.
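The token handling can be sketched with the standard library alone. The example below is not the production middleware: `HiddenFieldParser`, `build_postback`, the sample form, and the `gridResults` control name are all hypothetical. It collects every hidden input from a page and builds the form body for a paging postback.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(html: str, event_target: str, event_argument: str = "") -> str:
    """Echo the page's hidden tokens back, plus the control that triggers the postback."""
    parser = HiddenFieldParser()
    parser.feed(html)
    payload = dict(parser.fields)
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = event_argument
    return urlencode(payload)


# Hypothetical page fragment; real token values are much longer.
SAMPLE_PAGE = """
<form method="post" action="./Search.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTA4/abc=" />
  <input type="hidden" name="__EVENTVALIDATION" value="/wEWAgK+xyz" />
  <input type="text" name="txtQuery" value="" />
</form>
"""

body = build_postback(SAMPLE_PAGE, "gridResults", "Page$2")
```

The resulting `body` would be sent as an `application/x-www-form-urlencoded` POST using the same session cookies as the request that returned the page.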

Unstructured Data
PDF parsing and OCR pipelines

Older warning letters and inspection reports exist only as scanned PDFs. We route these documents through a PyPDF2 and Tesseract OCR pipeline to extract actionable text and metadata into structured JSON.
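One way to decide when OCR is needed is to try the PDF's embedded text layer first and fall back only when it is nearly empty. The sketch below shows that routing logic only. The two extractor callables are stand-ins for whatever wraps PyPDF2 and Tesseract in a real pipeline, and the 50-characters-per-page threshold is an assumed value.

```python
from typing import Callable

MIN_CHARS_PER_PAGE = 50  # assumed threshold for "this page has a usable text layer"


def extract_text(
    pdf_bytes: bytes,
    page_count: int,
    read_text_layer: Callable[[bytes], str],
    run_ocr: Callable[[bytes], str],
    min_chars_per_page: int = MIN_CHARS_PER_PAGE,
) -> dict:
    """Prefer the embedded text layer; fall back to OCR for scanned documents."""
    text = read_text_layer(pdf_bytes).strip()
    if len(text) >= min_chars_per_page * page_count:
        return {"method": "text_layer", "extracted_text": text}
    return {"method": "ocr", "extracted_text": run_ocr(pdf_bytes).strip()}


# Stand-in extractors so the example runs without any PDF or OCR library.
def fake_text_layer(pdf_bytes: bytes) -> str:
    return ""  # a scanned PDF typically has no text layer


def fake_ocr(pdf_bytes: bytes) -> str:
    return "WARNING LETTER\nDear Sir or Madam: ...\n"


result = extract_text(b"%PDF-1.4 ...", page_count=2,
                      read_text_layer=fake_text_layer, run_ocr=fake_ocr)
```

Recording the `method` alongside the text makes it easier to audit OCR-derived fields separately, since they tend to be noisier.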

Server Reliability
Strict rate limiting and timeout handling

FDA servers frequently drop connections under load. We implement strict concurrency limits, exponential backoff, and automatic retry queues to ensure complete data extraction without triggering firewall blocks.
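A minimal sketch of the retry behaviour described here, assuming a caller-supplied `fetch` function that raises `ConnectionError` or `TimeoutError` on a dropped connection. The function names, base delay, cap, and jitter fraction are illustrative choices, and concurrency limiting is left out for brevity.

```python
import random
import time
from typing import Callable


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before retry number `attempt` (0-based): base * 2^attempt, capped."""
    return min(cap, base * (2 ** attempt))


def fetch_with_retry(
    fetch: Callable[[str], str],
    url: str,
    retries: int = 4,
    jitter: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), retrying dropped connections with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except (ConnectionError, TimeoutError):
            if attempt == retries:
                raise  # out of retries: surface the error to the retry queue
            delay = backoff_delay(attempt)
            sleep(delay + random.uniform(0, delay * jitter))


# Demo: a fetcher that fails twice, then succeeds. Sleeps are recorded, not slept.
calls = {"n": 0}


def flaky_fetch(url: str) -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connection dropped")
    return f"<html>{url}</html>"


waits: list[float] = []
page = fetch_with_retry(flaky_fetch, "https://example.com/search",
                        jitter=0.0, sleep=waits.append)
```

Injecting `sleep` keeps the function testable; in production the default `time.sleep` applies and the jitter spreads retries from parallel workers apart.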

Data Fragmentation
Entity resolution across datasets

A company might be listed under different name variations in the Orange Book versus the Inspections database. We extract standard identifiers like FEI numbers to help you link records across disparate FDA systems.
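As a hedged illustration of this kind of linking, the sketch below groups records by FEI number where one is present and otherwise falls back to a crudely normalised name. The record shape, the suffix list, and the helper names are assumptions for this example; name matching of this sort can over-merge distinct firms and would need review in practice.

```python
import re
from collections import defaultdict

_SUFFIXES = re.compile(r"\b(inc|llc|ltd|corp|co|gmbh)\b\.?", re.IGNORECASE)


def normalise_name(name: str) -> str:
    """Lowercase, drop common legal suffixes and punctuation, collapse whitespace."""
    cleaned = _SUFFIXES.sub(" ", name.lower())
    cleaned = re.sub(r"[^a-z0-9]+", " ", cleaned)
    return " ".join(cleaned.split())


def link_records(records: list[dict]) -> dict[str, list[str]]:
    """Group record sources under an FEI key, or a name key when no FEI is known."""
    name_to_fei: dict[str, str] = {}
    for rec in records:
        if rec.get("fei_number"):
            name_to_fei.setdefault(normalise_name(rec["name"]), rec["fei_number"])

    groups: dict[str, list[str]] = defaultdict(list)
    for rec in records:
        name_key = normalise_name(rec["name"])
        fei = rec.get("fei_number") or name_to_fei.get(name_key)
        groups[f"fei:{fei}" if fei else f"name:{name_key}"].append(rec["source"])
    return dict(groups)


records = [
    {"source": "inspections", "name": "BioManufacturing Partners", "fei_number": "3014582910"},
    {"source": "orange_book", "name": "BIOMANUFACTURING PARTNERS, INC."},
    {"source": "warning_letters", "name": "PharmaCorp Synthetics Ltd"},
]
linked = link_records(records)
```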

Volume Constraints
Bypassing 10k record display limits

FDA search interfaces often cap results at 10,000 records. We automatically partition search queries by date ranges or product codes to extract the complete historical dataset.
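Date-range partitioning can be sketched as a recursive bisection: if a window reports more results than the cap, split it in half and check each half again. The `count_results` callable stands in for a real "how many hits does this query return" request, and the uniform 40-records-per-day figure is made up for the demo.

```python
from datetime import date, timedelta
from typing import Callable

Window = tuple[date, date]


def partition_by_date(
    start: date,
    end: date,
    count_results: Callable[[date, date], int],
    limit: int = 10_000,
) -> list[Window]:
    """Split [start, end] (inclusive) into windows that each fit under the limit.

    A single day that still exceeds the limit is returned as-is; it would need
    a second partition key such as product code.
    """
    if start == end or count_results(start, end) <= limit:
        return [(start, end)]
    mid = start + (end - start) // 2
    return (
        partition_by_date(start, mid, count_results, limit)
        + partition_by_date(mid + timedelta(days=1), end, count_results, limit)
    )


# Demo with a fake counter: 40 records per day, so a full year is 14,600 records.
def fake_count(start: date, end: date) -> int:
    return ((end - start).days + 1) * 40


windows = partition_by_date(date(2025, 1, 1), date(2025, 12, 31), fake_count)
```

Each resulting window can then be queried and paginated independently, and the windows do not overlap, so no de-duplication is needed between them.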

Applications

Who uses FDA data and why

Teams across industries use fda.gov data to build competitive products and smarter operations.

01
Pharma Competitor Intelligence

Track competitor drug approvals, clinical hold notices, and Orange Book patent expirations to inform market entry strategies.

02
Supply Chain Risk Management

Monitor contract manufacturing organisations (CMOs) for warning letters and OAI inspection classifications to prevent supply disruptions.

03
MedTech Market Research

Analyse 510(k) clearances to identify trending device categories, predicate devices, and emerging competitors.

04
Hedge Fund Due Diligence

Quant funds ingest daily FDA enforcement actions and approval decisions to trade on regulatory events affecting public healthcare equities.

05
Legal and Compliance Monitoring

Law firms track warning letters and MAUDE adverse events to identify mass tort litigation opportunities.

06
Pharmacovigilance

Safety teams aggregate FAERS data to detect early signals of adverse drug reactions across patient populations.

Why DataFlirt

"Regulatory intelligence requires absolute precision. Missing a Class I recall or a warning letter is a catastrophic failure for compliance teams."

Extracting data from fda.gov means navigating legacy ASP.NET architectures, unstructured PDF warning letters, and fragmented databases. DataFlirt handles the ViewState tokens, OCR pipelines, and daily diffs so your compliance system always has the latest regulatory signals.

Technical Spec

FDA scraper technical capabilities

Everything supported by our fda.gov scraper, from ASP.NET form submission and PDF OCR to daily diffs and webhook alerts.

Capability | Description | Status
Warning letter PDF parsing | Extracts text from scanned documents using Tesseract OCR | Supported
ASP.NET form submission | Handles ViewState and EventValidation tokens for deep database queries | Supported
Daily enforcement diffs | Delivers only new or updated recall records since the last run | Supported
MAUDE pagination | Iterates through thousands of adverse event reports automatically | Supported
Orange Book tracking | Captures patent expiration dates and exclusivity codes | Supported
Facility inspection history | Extracts global facility inspection classifications (OAI, VAI, NAI) | Supported
510(k) summary extraction | Downloads and parses device clearance summary documents | Supported
Webhook alerts | HTTP POST notifications for immediate Class I recall events | Supported
Non-public trade secret formulations | Confidential drug formulation data is not publicly accessible | Not supported
Pre-market submission drafts | Internal FDA review documents prior to public clearance are not publicly accessible | Not supported
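To make the webhook row concrete, here is a rough sketch of filtering Class I recalls and preparing an HTTP POST for each one. The event name `recall.class_i`, the payload shape, and the example URL are hypothetical; the request object is built but not sent.

```python
import json
from urllib.request import Request


def class_i_recalls(records: list[dict]) -> list[dict]:
    """Keep only the records that should trigger an immediate alert."""
    return [r for r in records if r.get("recall_class") == "Class I"]


def build_webhook_request(url: str, record: dict) -> Request:
    """Prepare (but do not send) a JSON POST describing one recall."""
    body = json.dumps({"event": "recall.class_i", "record": record}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


recalls = [
    {"recall_number": "D-1425-2026", "recall_class": "Class II"},
    {"recall_number": "D-1502-2026", "recall_class": "Class I"},
]
requests_to_send = [
    build_webhook_request("https://example.com/hooks/fda", r)
    for r in class_i_recalls(recalls)
]
```

Passing each prepared request to `urllib.request.urlopen` would deliver it; a real sender would also want a timeout and retries.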
Infrastructure

Infrastructure powering the FDA pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus, Tesseract OCR, PyPDF2

Legacy Form Automation

Custom Scrapy middlewares maintain ASP.NET session state, handling complex multi-step search forms without timing out.

PDF and OCR Pipeline

Documents are downloaded to ephemeral storage, processed via PyPDF2 and Tesseract OCR, and converted into structured JSON fields.

Cloud-Native Orchestration

Pipelines run on AWS ECS with Airflow scheduling. Strict rate limiting prevents IP bans while ensuring complete daily data syncs.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON: Nested structures for complex adverse event reports
CSV: Flat files for easy import into Excel or compliance software
XLS: Formatted spreadsheets for analyst review
Parquet: Columnar format for fast querying in BigQuery or Athena
AWS S3: Direct bucket delivery on a daily or weekly schedule, compatible with any data lake
Webhook: HTTP POST alerts for critical enforcement actions
API: Query your extracted data via our managed REST endpoints
PostgreSQL: Direct inserts into your internal compliance database
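The difference between the nested JSON and flat CSV outputs can be illustrated with a small flattening routine. This is a sketch only: the dotted column naming, the "; " list separator, and the sample adverse event fields are assumptions made for the example.

```python
import csv
import io


def flatten(record: dict, prefix: str = "") -> dict:
    """Turn nested dicts into dotted column names and join lists into one cell."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, f"{name}."))
        elif isinstance(value, list):
            flat[name] = "; ".join(str(v) for v in value)
        else:
            flat[name] = value
    return flat


def to_csv(records: list[dict]) -> str:
    """Serialise nested records as CSV with a sorted union of all columns."""
    rows = [flatten(r) for r in records]
    fieldnames = sorted({column for row in rows for column in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical adverse event record, for illustration only.
events = [
    {
        "report_id": "MDR-1",
        "device": {"brand_name": "X", "product_code": "MHX"},
        "event_types": ["Injury", "Malfunction"],
    }
]
csv_text = to_csv(events)
```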
// faq

Common questions.

About fda.gov scraping, legality, and pipeline operations.

Is it legal to scrape fda.gov?

Generally, yes. Unless otherwise noted, content published on fda.gov is US federal government material in the public domain. We only extract publicly available regulatory records and do not attempt to access gated or confidential submission systems.

Why scrape fda.gov instead of using the OpenFDA API?

While the OpenFDA API is useful, it can lag behind the portal and does not cover every dataset (for example, recent warning letters or some inspection details). Direct scraping picks up new records on the first pipeline run after they are published on the portal.

How do you handle unstructured warning letters?

We extract the HTML text where available. For older letters provided only as scanned PDFs, we download the file and run it through an OCR pipeline to extract the text into a searchable JSON field.

Can you track updates to existing FDA records?

Yes. Our pipelines use hash-based diffing. If an ongoing Class II recall is upgraded to Class I, or an inspection status changes, we capture the modification and deliver it in the daily update.
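Hash-based diffing can be sketched in a few lines: fingerprint each record's canonical JSON, then compare against the fingerprints stored from the previous run. The function names and the choice of SHA-256 over sorted-key JSON are assumptions for this example, not a description of the production implementation.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Stable hash of a record: key order and whitespace do not affect it."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_snapshot(previous: dict[str, str], current: list[dict],
                  key: str = "recall_number") -> tuple[list[dict], list[dict]]:
    """Return (new, modified) records given last run's {key: fingerprint} map."""
    new, modified = [], []
    for record in current:
        old_hash = previous.get(record[key])
        if old_hash is None:
            new.append(record)
        elif old_hash != fingerprint(record):
            modified.append(record)
    return new, modified


yesterday = {"recall_number": "D-1425-2026", "recall_class": "Class II", "status": "Ongoing"}
previous = {yesterday["recall_number"]: fingerprint(yesterday)}

upgraded = {"recall_number": "D-1425-2026", "recall_class": "Class I", "status": "Ongoing"}
fresh = {"recall_number": "D-1502-2026", "recall_class": "Class III", "status": "Ongoing"}
new, modified = diff_snapshot(previous, [upgraded, fresh])
```

Here the upgrade from Class II to Class I changes the fingerprint, so the record lands in `modified`, while the previously unseen recall lands in `new`.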

How frequently can the data be updated?

For most enforcement and clearance databases, daily extraction is standard. We can configure specific targeted queries to run hourly if you require immediate alerting for specific companies.

Do you normalise company names across datasets?

We extract all available identifiers, such as FEI numbers or applicant IDs. While we provide the raw extracted company name exactly as it appears on the FDA site, we can build custom mapping logic to help you link records across different databases.

What happens when the FDA updates their website?

Government sites change layouts occasionally. Our managed service includes 24/7 pipeline monitoring. If a DOM change breaks an extraction rule, our engineers update the selectors and backfill any missed data to meet our SLA.

$ dataflirt scope --new-project --source=fda.gov ready

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Stop manually checking government portals. Tell us which FDA databases and entities you need to monitor, and we will build the pipeline.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h