
NIH research data,
at warehouse scale.

We extract grant funding records, clinical trial registries, PubMed abstracts, and NCBI genomic datasets from NIH. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.

Grants tracked: 1.2M/month
Trials extracted: 487K total
Publications: 36.4M in corpus
Active pipelines: 42
Uptime: 99.99%

Data Dictionary

Every field we extract from nih.gov

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Clinical Trials objects from nih.gov. All fields typed and schema-versioned.

nct_id, brief_title, official_title, conditions, interventions, phase, study_type, enrollment, sponsor, collaborators, primary_completion_date, status, location_countries
clinical_trials
● 200 OK
{
  "nct_id": "NCT04839210",
  "brief_title": "Efficacy of Drug X in Asthma",
  "phase": "Phase 3",
  "status": "Recruiting",
  "enrollment": 450,
  "conditions": ["Asthma", "Respiratory Disease"],
  "sponsor": "National Heart, Lung, and Blood Institute (NHLBI)"
}

Complete list of extractable fields for Grants (RePORTER) objects from nih.gov. All fields typed and schema-versioned.

project_num, project_title, contact_pi, organization, fiscal_year, total_cost, agency_ic, award_notice_date, abstract_text, public_health_relevance
grants_(reporter)
● 200 OK
{
  "project_num": "1R01CA251138-01A1",
  "project_title": "Targeting metabolic vulnerabilities in cancer",
  "contact_pi": "SMITH, JOHN",
  "organization": "University of California, San Francisco",
  "total_cost": 450000,
  "agency_ic": "National Cancer Institute"
}

Complete list of extractable fields for Publications (PubMed) objects from nih.gov. All fields typed and schema-versioned.

pmid, pmcid, title, abstract, authors, journal, publication_date, doi, mesh_terms, citation_count, funding_references
publications_(pubmed)
● 200 OK
{
  "pmid": "34981023",
  "title": "Genomic epidemiology of SARS-CoV-2",
  "journal": "Nature",
  "publication_date": "2022-01-05",
  "doi": "10.1038/s41586-021-04215-w",
  "citation_count": 342,
  "authors": ["Doe J", "Smith A"]
}

Complete list of extractable fields for Compounds (PubChem) objects from nih.gov. All fields typed and schema-versioned.

cid, iupac_name, molecular_formula, molecular_weight, canonical_smiles, isomeric_smiles, inchi, inchi_key, synonyms, safety_summary, pharmacological_actions
compounds_(pubchem)
● 200 OK
{
  "cid": 2244,
  "iupac_name": "2-acetyloxybenzoic acid",
  "molecular_formula": "C9H8O4",
  "molecular_weight": 180.16,
  "canonical_smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",
  "synonyms": ["Aspirin", "Acetylsalicylic acid"]
}

Complete list of extractable fields for Funding Opportunities objects from nih.gov. All fields typed and schema-versioned.

opportunity_id, opportunity_title, opportunity_number, agency_code, open_date, close_date, award_ceiling, award_floor, eligible_applicants, cfda_numbers
funding_opportunities
● 200 OK
{
  "opportunity_number": "PA-20-185",
  "opportunity_title": "NIH Research Project Grant (Parent R01)",
  "agency_code": "NIH-OD",
  "open_date": "2020-05-05",
  "close_date": "2023-05-08",
  "award_ceiling": 0
}

Capabilities

Extract the complete biomedical landscape

Our NIH pipelines navigate complex E-utilities rate limits, deeply nested XML schemas, and disconnected data silos to deliver unified research intelligence.

Clinical Trials Extraction

Capture NCT records, trial phases, sponsor details, primary outcomes, and enrollment numbers from ClinicalTrials.gov.

NIH RePORTER Grants

Track award amounts, principal investigators, institutional affiliations, and project abstracts across all fiscal years.

PubMed & PMC Mining

Extract abstracts, metadata, full-text open access articles, and MeSH terms from the world's largest biomedical literature database.

NCBI E-utilities Optimisation

Automated management of API keys, rate limits, and exponential backoff to extract massive datasets reliably.

PubChem & GenBank Data

Extract chemical structures, SMILES strings, molecular weights, and genomic sequence metadata.

Investigator Mapping

Link principal investigators across grants, publications, and clinical trials to build comprehensive expert profiles.

Historical Data Backfills

Extract decades of research data via baseline downloads, followed by incremental daily or weekly updates.

Change Detection

Track trial status updates, protocol amendments, and grant funding modifications with hash-based diffing.

XML to Relational Parsing

Flatten deeply nested NCBI XML responses into structured, queryable tables for your data warehouse.

Schema Normalisation

Standardise date formats, currency values, and MeSH term hierarchies across disparate NIH databases.
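
As an illustration of the date side of schema normalisation, here is a minimal Python sketch that maps a few date spellings to ISO 8601 while recording how precise the source value was. The list of input formats and the `normalise_date` helper are assumptions made for this example, not a description of the production normaliser.

```python
from datetime import datetime
from typing import Optional

# Candidate input formats, most specific first. The list is illustrative;
# a real pipeline would extend it per source database.
_FORMATS = [
    ("%Y-%m-%d", "day"),     # 2022-01-05
    ("%Y %b %d", "day"),     # 2022 Jan 5
    ("%B %d, %Y", "day"),    # January 5, 2022
    ("%Y-%m", "month"),      # 2022-01
    ("%Y %b", "month"),      # 2022 Jan
    ("%B %Y", "month"),      # January 2022
    ("%Y", "year"),          # 2022
]

_OUTPUT = {"day": "%Y-%m-%d", "month": "%Y-%m", "year": "%Y"}


def normalise_date(raw: str) -> Optional[tuple[str, str]]:
    """Return (iso_string, precision), or None if no known format matches.

    Partial dates are kept partial instead of being padded with a fake
    day or month, so downstream queries can tell the difference.
    Month names are parsed with the process locale (English by default).
    """
    text = " ".join(raw.split())  # collapse stray whitespace
    for fmt, precision in _FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        return parsed.strftime(_OUTPUT[precision]), precision
    return None
```

Returning the precision alongside the value is one possible design choice: it avoids silently turning "January 2022" into "2022-01-01".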

// engagement pipeline

From NIH databases to your warehouse

Brief in. Clean data out.

Define Scope
Day 0

Specify the NIH databases, search parameters, or specific record IDs (NCTs, PMIDs). We design the extraction schema together.

Pipeline Build
Days 2–4

We configure E-utilities polling, web crawlers, XML parsers, and rate-limit handling for NIH infrastructure.

Validation & QA
Days 4–6

Schema validation, cross-referencing PMIDs to RePORTER records, and checking for truncated abstracts before launch.

Delivery
ongoing

JSON, CSV, or Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.

Under the hood

How our NIH pipeline handles the hard parts

NIH houses massive public datasets, but accessing them at scale requires navigating strict API quotas and complex data structures.

pipeline-monitor · nih.gov · live ● active
// fingerprinting
Identity rotation
TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0
// pagination
Page coverage
48,291 pages queued (running)
// observability
Pipeline health
Uptime: 99.9%
p99 latency: 142ms
Null rate: 0.3%
Alerts: 2
Rate Limiting
Managing strict NCBI API quotas

NCBI enforces strict rate limits on its E-utilities APIs: 3 requests per second without an API key and 10 requests per second with one. We manage API key rotation, exponential backoff, and distributed polling to extract massive datasets without triggering IP bans.
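
To make the idea concrete, here is a minimal Python sketch of client-side request spacing combined with exponential backoff and jitter. The class and function names (`MinIntervalLimiter`, `RetryableError`, `call_with_backoff`) are hypothetical and exist only for this example; the clock and sleep functions are injectable so the logic can be exercised without real waiting or network calls.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class MinIntervalLimiter:
    """Spaces calls so that at most `rate_per_sec` of them start per second."""

    def __init__(self, rate_per_sec: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._interval = 1.0 / rate_per_sec
        self._clock = clock
        self._sleep = sleep
        self._next_slot = 0.0  # earliest time the next call may start

    def wait(self) -> None:
        now = self._clock()
        if now < self._next_slot:
            self._sleep(self._next_slot - now)
            now = self._next_slot
        self._next_slot = now + self._interval


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. HTTP 429)."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Upper bound of the delay before retry `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def call_with_backoff(fn: Callable[[], T], limiter: MinIntervalLimiter,
                      max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    for attempt in range(max_attempts):
        limiter.wait()  # never exceed the configured request rate
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential bound.
            sleep(random.uniform(0.0, backoff_delay(attempt)))
    raise ValueError("max_attempts must be at least 1")
```

A limiter built with `MinIntervalLimiter(10)` would match the keyed quota and `MinIntervalLimiter(3)` the unkeyed one; `fn` would wrap whatever HTTP client is in use.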

Data Parsing
Flattening deeply nested NCBI schemas

Databases like PubMed and ClinicalTrials.gov return highly nested XML. We build deterministic parsers that flatten hierarchical data, like MeSH term trees and nested author affiliations, into queryable relational schemas.
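
The flattening step can be sketched as follows. This example uses Python's standard-library `xml.etree.ElementTree` so that it runs without dependencies (lxml exposes a largely compatible API), and the embedded XML is a made-up, heavily trimmed record that follows PubMed's element names. The three output tables and their column names are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, trimmed sample using PubMed element names.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>34981023</PMID>
      <Article>
        <ArticleTitle>Genomic epidemiology of SARS-CoV-2</ArticleTitle>
        <AuthorList>
          <Author>
            <LastName>Doe</LastName><Initials>J</Initials>
            <AffiliationInfo><Affiliation>Example University</Affiliation></AffiliationInfo>
          </Author>
          <Author><LastName>Smith</LastName><Initials>A</Initials></Author>
        </AuthorList>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>SARS-CoV-2</DescriptorName></MeshHeading>
        <MeshHeading><DescriptorName>Genomics</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def flatten(xml_text: str) -> dict[str, list[dict]]:
    """Turn nested article XML into three flat tables keyed by PMID."""
    root = ET.fromstring(xml_text)
    articles, authors, mesh_terms = [], [], []
    for art in root.iter("PubmedArticle"):
        cit = art.find("MedlineCitation")
        pmid = cit.findtext("PMID")
        # findtext returns only leading text; titles with inline markup
        # would need "".join(element.itertext()) instead.
        articles.append({"pmid": pmid,
                         "title": cit.findtext("Article/ArticleTitle")})
        for pos, au in enumerate(cit.iterfind("Article/AuthorList/Author"), 1):
            authors.append({
                "pmid": pmid,
                "position": pos,  # preserves author order
                "last_name": au.findtext("LastName"),
                "initials": au.findtext("Initials"),
                # None when the record carries no affiliation
                "affiliation": au.findtext("AffiliationInfo/Affiliation"),
            })
        for mh in cit.iterfind("MeshHeadingList/MeshHeading"):
            mesh_terms.append({"pmid": pmid,
                               "descriptor": mh.findtext("DescriptorName")})
    return {"articles": articles, "authors": authors, "mesh_terms": mesh_terms}
```

Each child table repeats the `pmid` as a foreign key, so authors and MeSH terms can be joined back to the article row in a warehouse.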

Pagination
Bypassing 10,000-record limits

Many NIH search interfaces cap pagination at 10,000 results. Our pipelines automatically slice queries by date range or sub-category so that each slice stays under the cap and large result sets are retrieved in full.
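
One way to sketch the date-range slicing is recursive bisection: keep halving a window until its result count fits under the cap. In this Python example the `count` callable is a stand-in for whatever "how many results does this window return?" query the target interface offers; it and the `slice_date_range` name are assumptions for illustration.

```python
from datetime import date, timedelta
from typing import Callable, Iterator


def slice_date_range(start: date, end: date,
                     count: Callable[[date, date], int],
                     cap: int = 10_000) -> Iterator[tuple[date, date]]:
    """Yield inclusive (start, end) windows whose result count fits the cap.

    A single-day window is yielded even if it still exceeds the cap; such
    a window would need a second slicing dimension (e.g. sub-category).
    """
    if start == end or count(start, end) <= cap:
        yield (start, end)
        return
    mid = start + (end - start) // 2
    yield from slice_date_range(start, mid, count, cap)
    yield from slice_date_range(mid + timedelta(days=1), end, count, cap)
```

The windows come out contiguous and non-overlapping, so paging through each one in turn covers the full range exactly once.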

Data Linkage
Cross-referencing across NIH silos

Connecting a RePORTER grant to its resulting PubMed publications and ClinicalTrials records requires careful identifier matching. We extract and normalise cross-references (PMIDs, NCT IDs) to build unified graphs.
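
A minimal Python sketch of the identifier matching, assuming grant references are reduced to a core project number (activity code, institute code, serial number) before comparison and that NCT IDs follow the `NCT` plus eight digits pattern. The regular expressions and helper names are illustrative; the field names `project_num`, `pmid`, and `funding_references` come from the data dictionary above.

```python
import re
from typing import Optional

# e.g. "1R01CA251138-01A1": type 1, activity R01, institute CA,
# serial 251138, support year 01, suffix A1.
PROJECT_RE = re.compile(
    r"^(?P<type>\d)?(?P<activity>[A-Z][A-Z0-9]{2})(?P<ic>[A-Z]{2})"
    r"(?P<serial>\d{6})(?:-(?P<year>\d{2})(?P<suffix>[A-Z0-9]*))?$"
)
NCT_RE = re.compile(r"\bNCT\d{8}\b")


def core_project_num(raw: str) -> Optional[str]:
    """Reduce a grant reference to its core project number, or None."""
    cleaned = re.sub(r"\s+", "", raw.upper())
    m = PROJECT_RE.match(cleaned)
    if m is None:
        return None
    return m["activity"] + m["ic"] + m["serial"]


def find_nct_ids(text: str) -> list[str]:
    """Return the distinct NCT IDs mentioned in free text, sorted."""
    return sorted(set(NCT_RE.findall(text)))


def link_grants_to_publications(grants: list[dict],
                                publications: list[dict]) -> list[tuple[str, str]]:
    """Return (project_num, pmid) edges where the core numbers match."""
    by_core = {}
    for grant in grants:
        core = core_project_num(grant["project_num"])
        if core is not None:
            by_core[core] = grant["project_num"]
    edges = []
    for pub in publications:
        for ref in pub.get("funding_references", []):
            core = core_project_num(ref)
            if core in by_core:
                edges.append((by_core[core], pub["pmid"]))
    return edges
```

References that do not parse are skipped here; in practice they would more likely be routed to a review queue than dropped.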

Change Detection
Tracking trial and grant amendments

Clinical trials change status frequently. We hash each record's state and emit a diff only when trial protocols, enrollment numbers, or grant funding amounts change, so downstream jobs process only updated records.
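
Hash-based change detection can be sketched in a few lines of Python. This example assumes each record is a dictionary, that only a chosen set of tracked fields should trigger a diff, and that the previous hashes are kept in a simple mapping; `record_hash` and `detect_changes` are hypothetical names.

```python
import hashlib
import json


def record_hash(record: dict, tracked_fields: list[str]) -> str:
    """SHA-256 over a canonical JSON rendering of the tracked fields."""
    subset = {field: record.get(field) for field in tracked_fields}
    canonical = json.dumps(subset, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_changes(previous_hashes: dict[str, str], records: list[dict],
                   key: str, tracked_fields: list[str]):
    """Return (changed_records, updated_hashes).

    A record counts as changed when it is new or when the hash of its
    tracked fields differs from the stored one.
    """
    changed = []
    updated = dict(previous_hashes)  # leave the caller's mapping untouched
    for record in records:
        digest = record_hash(record, tracked_fields)
        if previous_hashes.get(record[key]) != digest:
            changed.append(record)
            updated[record[key]] = digest
    return changed, updated
```

Hashing only the tracked fields means volatile values such as fetch timestamps do not produce spurious diffs.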

Applications

Who uses NIH data and how

Teams across industries use nih.gov data to build competitive products and smarter operations.

01
Biotech & Pharma Intelligence

Track competitor clinical trials, pipeline developments, and primary completion dates to inform R&D strategy.

02
Academic Research & Meta-Analysis

Aggregate millions of PubMed abstracts and MeSH terms to perform systematic reviews and identify literature gaps.

03
Grant & Funding Analysis

Analyse NIH RePORTER data to spot funding trends, identify top-funded institutions, and forecast emerging research areas.

04
Key Opinion Leader (KOL) Mapping

Link principal investigators to grants, publications, and clinical trials to identify top experts in specific therapeutic areas.

05
Drug Discovery & Repurposing

Mine PubChem compounds and GenBank sequences to train machine learning models for drug discovery.

06
Healthcare Investment Due Diligence

Evaluate biotech startups by auditing their NIH grant history, clinical trial progress, and publication impact factors.

Why DataFlirt

"NIH houses the world's most critical biomedical knowledge, but its fragmented databases and nested XML schemas make large-scale analysis a massive engineering challenge."

Extracting intelligence from NIH requires navigating strict E-utilities rate limits, flattening deeply nested XML trees, and linking identifiers across disparate silos like PubMed, RePORTER, and ClinicalTrials.gov. DataFlirt manages this pipeline complexity so your data science teams can focus on biomedical discovery rather than infrastructure maintenance.

Technical Spec

NIH scraper technical capabilities

Everything our nih.gov scraper supports, from E-utilities integration and rate-limit handling to identifier mapping and change detection.

Capability | Details | Status
PubMed abstract extraction | Full extraction of PMIDs, authors, MeSH terms, and abstracts | Supported
ClinicalTrials.gov history | Extracting the full history of changes for a specific NCT ID | Supported
NIH RePORTER financials | Total funding amounts, fiscal years, and IC allocations | Supported
NCBI E-utilities integration | Automated rate-limit handling and API key rotation | Supported
PubChem compound structures | SMILES, InChI, and molecular property extraction | Supported
Identifier mapping | Linking PMIDs to Grant IDs and NCT IDs natively | Supported
Incremental change detection | Hash-based diffs for updated trials or new publications | Supported
dbGaP Controlled-Access Data | Genomic datasets requiring NIH Data Access Committee (DAC) approval | Partial
Patient-level trial data | Individual patient PII or restricted raw trial results | Partial

Infrastructure

Infrastructure powering the NIH pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus, lxml, BeautifulSoup

NCBI E-utilities Optimisation

Our pipelines integrate directly with NCBI APIs, managing API keys, rate limits, and batch queries to maximise throughput without violating NIH usage policies.
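
Batching can be sketched as grouping record IDs into comma-separated EFetch requests. The endpoint and the `db`, `id`, `retmode`, and `api_key` parameters below are the public E-utilities ones; the batch size of 200 and the helper names are assumptions for this example, and no request is actually sent.

```python
from typing import Iterator, Optional
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def chunked(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def efetch_urls(pmids: list[str], batch_size: int = 200,
                api_key: Optional[str] = None) -> list[str]:
    """Build one EFetch URL per batch of PMIDs."""
    urls = []
    for batch in chunked(pmids, batch_size):
        params = {"db": "pubmed", "id": ",".join(batch), "retmode": "xml"}
        if api_key:
            params["api_key"] = api_key  # raises the per-second quota
        urls.append(f"{EFETCH}?{urlencode(params)}")
    return urls
```

With 450 PMIDs this yields three URLs instead of 450 single-record requests, which is where most of the throughput gain under a fixed per-second quota comes from.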

XML Flattening Engine

We use high-performance C-based XML parsers (lxml) to process gigabytes of nested PubMed and ClinicalTrials data, transforming hierarchical trees into flat tables.

Cloud-Native Orchestration

Pipelines run on AWS ECS with Airflow scheduling. We handle the orchestration of massive historical backfills alongside daily incremental updates with SLA-backed reliability.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON: Nested or flat schemas
CSV: Relational tables for BI tools
Parquet: Columnar storage for data lakes
AWS S3: Direct delivery to your bucket, compatible with any data lake
Webhook: HTTP POST for real-time trial updates
API: RESTful endpoints for querying extracted data
BigQuery: Direct streaming into GCP
Snowflake: Stage and COPY INTO workflows
XLS: Excel-compatible for small cohorts

// faq

Common questions.

About nih.gov scraping, legality, and pipeline operations.

Ask us directly →
Is scraping NIH data legal?

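Largely, yes. Most data on nih.gov, including PubMed, RePORTER, and ClinicalTrials.gov, is publicly available and funded by US taxpayers, although some content, such as publisher-supplied abstracts in PubMed, may remain under copyright. We strictly adhere to NCBI's E-utilities usage guidelines and robots.txt to ensure compliant extraction.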

How do you handle NCBI rate limits?

We use registered API keys, distributed polling, and exponential backoff to respect NCBI's limits (3 requests per second without an API key, 10 with one) while maintaining high overall extraction throughput.

Can you link grants to publications?

Yes. We extract the funding references from PubMed and cross-reference them with NIH RePORTER project numbers to build a relational map between funding and research outputs.

Do you extract full-text articles from PubMed Central (PMC)?

We extract full-text XML for open-access articles available in the PMC Open Access Subset. Articles behind publisher paywalls are limited to metadata and abstract extraction.

How often is the data updated?

We configure pipelines to match your needs. Clinical trials and new publications can be tracked daily, while grant funding data is typically updated on a weekly or monthly cadence.

Can you handle massive historical backfills?

Absolutely. We routinely process full baseline downloads of PubMed (tens of millions of records) and ClinicalTrials.gov, followed by daily incremental updates to capture new and modified records.

Do you provide dbGaP genomic data?

We extract public metadata from dbGaP. However, controlled-access genomic data requires explicit approval from the NIH Data Access Committee (DAC) and cannot be scraped or bypassed.

$ dataflirt scope --new-project --source=nih.gov ready

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a full baseline export of PubMed or a daily feed of ClinicalTrials.gov updates, we build and manage the infrastructure. Tell us your target datasets.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h