
NFL historical data, at warehouse scale.

We extract player statistics, game logs, play-by-play sequences, draft history, and advanced metrics from Pro-Football-Reference. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.

Player profiles: 28,491 total
Game logs: 18,942 per season
Play-by-play events: 42,109 per week
Active pipelines: 14
Uptime: 99.98%

Data Dictionary

Every field we extract from pro-football-reference.com

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Player Profiles objects from pro-football-reference.com. All fields typed and schema-versioned.

Fields: player_id, name, position, height, weight, dob, college, draft_pick, active_status, career_av

Sample record (player_profiles, 200 OK):
{
  "player_id": "MahoPa00",
  "name": "Patrick Mahomes",
  "position": "QB",
  "height": "6-2",
  "weight": 225,
  "college": "Texas Tech",
  "career_av": 112,
  "active_status": true
}

Complete list of extractable fields for Game Logs objects from pro-football-reference.com. All fields typed and schema-versioned.

Fields: game_id, player_id, date, team, opponent, result, passing_yds, rushing_yds, receiving_yds, touchdowns

Sample record (game_logs, 200 OK):
{
  "game_id": "202402110kan",
  "player_id": "MahoPa00",
  "date": "2024-02-11",
  "team": "KAN",
  "opponent": "SFO",
  "result": "W 25-22",
  "passing_yds": 333,
  "touchdowns": 2
}

Complete list of extractable fields for Play-by-Play objects from pro-football-reference.com. All fields typed and schema-versioned.

Fields: play_id, game_id, quarter, time_remaining, down, distance, field_position, play_type, description, epa

Sample record (play-by-play, 200 OK):
{
  "play_id": "202402110kan_142",
  "game_id": "202402110kan",
  "quarter": 4,
  "time_remaining": "00:03",
  "down": 1,
  "distance": "Goal",
  "play_type": "Pass",
  "epa": 3.42
}

Complete list of extractable fields for Team Stats objects from pro-football-reference.com. All fields typed and schema-versioned.

Fields: team_id, season, wins, losses, ties, points_for, points_against, srs, osrs, dsrs

Sample record (team_stats, 200 OK):
{
  "team_id": "KAN",
  "season": 2023,
  "wins": 11,
  "losses": 6,
  "ties": 0,
  "points_for": 371,
  "points_against": 294,
  "srs": 4.8
}

Complete list of extractable fields for Draft History objects from pro-football-reference.com. All fields typed and schema-versioned.

Fields: draft_year, round, pick, player_id, team_id, position, college, av, games_played, pass_yds

Sample record (draft_history, 200 OK):
{
  "draft_year": 2017,
  "round": 1,
  "pick": 10,
  "player_id": "MahoPa00",
  "team_id": "KAN",
  "position": "QB",
  "college": "Texas Tech",
  "games_played": 96
}
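
Some dictionary fields pack several values into one string; the game_logs field "result" ("W 25-22") is an example. The sketch below shows one way a consumer might load a game log record into a typed structure and split that field. The GameLog class, RESULT_RE pattern, and game_log_from_record helper are hypothetical names for illustration, and the code assumes the first number in "result" is the player's team's score.

```python
import datetime
import re
from dataclasses import dataclass
from typing import Optional

# Assumed shape of the "result" field: outcome letter, team score, opponent score.
RESULT_RE = re.compile(r"^(?P<outcome>[WLT]) (?P<team>\d+)-(?P<opp>\d+)$")


@dataclass(frozen=True)
class GameLog:
    """Hypothetical typed view of one game_logs record."""
    game_id: str
    player_id: str
    date: datetime.date
    team: str
    opponent: str
    outcome: str            # "W", "L" or "T"
    team_score: int
    opponent_score: int
    passing_yds: Optional[int] = None
    touchdowns: Optional[int] = None


def game_log_from_record(record):
    """Build a GameLog from a raw dictionary, splitting the result string."""
    match = RESULT_RE.match(record["result"])
    if match is None:
        raise ValueError(f"unrecognised result: {record['result']!r}")
    return GameLog(
        game_id=record["game_id"],
        player_id=record["player_id"],
        date=datetime.date.fromisoformat(record["date"]),
        team=record["team"],
        opponent=record["opponent"],
        outcome=match.group("outcome"),
        team_score=int(match.group("team")),
        opponent_score=int(match.group("opp")),
        passing_yds=record.get("passing_yds"),
        touchdowns=record.get("touchdowns"),
    )
```

Raising on an unrecognised result string, rather than guessing, keeps malformed rows visible during QA.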

Capabilities

Structured NFL data without the copy-paste

Pro-Football-Reference contains the definitive history of the NFL, but querying it programmatically requires handling strict rate limits, hidden DOM nodes, and complex multi-header tables. We manage the extraction layer.

Full Player Statistics

Extract passing, rushing, receiving, and defensive metrics across regular season and playoffs. Normalised across eras.

Play-by-Play Parsing

Convert raw text logs into structured event sequences. Includes EPA, win probability added, and down-and-distance context.

Advanced Metrics

Capture Approximate Value (AV), ANY/A, true completion percentage, and defensive pressure rates.

Draft & Combine Records

Historical draft classes mapped to combine measurements (40-yard dash, vertical, broad jump) and career outcomes.

Coaching & Front Office

Extract coaching tree records, coordinator histories, and executive tenures.

Injury Reports & Snap Counts

Weekly injury designations and positional snap percentage breakdowns per game.

Rate Limit Management

Sports Reference enforces strict 20-request-per-minute limits. We distribute load across residential IPs to maintain throughput.

Complex Table Normalisation

Resolve multi-tier headers, hidden columns, and dynamically injected JavaScript tables into flat, typed records.

Historical Backfilling

Run one-off backfills for decades of NFL history, followed by delta updates every Tuesday morning.

// engagement pipeline

From URL list to warehouse record

Brief in. Clean data out.

Define Scope (day 0)

Provide seasons, teams, or specific statistic tables required. We design the extraction schema together.

Pipeline Build (days 2–4)

We configure Scrapy crawlers, proxy rotation, request pacing, and table normalisation logic for Pro-Football-Reference.

Validation & QA (days 4–6)

Schema validation, null-rate checks, and data type enforcement before full launch.

Delivery (ongoing)

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
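
As an illustration of the Validation & QA step above, here is a minimal sketch of null-rate checks and type enforcement over a batch of records. The function names, the schema format (field name to Python type), and the per-field thresholds are assumptions made for this example, not a description of the production checks.

```python
def null_rates(records, fields):
    """Fraction of records in which each field is missing or None."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for r in records if r.get(field) is None) / total
        for field in fields
    }


def validate_batch(records, schema, max_null_rate):
    """Return a list of problems; an empty list means the batch passes.

    `schema` maps field name to expected Python type.
    `max_null_rate` maps field name to the highest tolerated null fraction
    (fields not listed tolerate no nulls).
    """
    problems = []
    # Type enforcement: every non-null value must match the declared type.
    for index, record in enumerate(records):
        for field, expected_type in schema.items():
            value = record.get(field)
            if value is not None and not isinstance(value, expected_type):
                problems.append(
                    f"record {index}: {field} is {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    # Null-rate check: compare each field's null fraction with its threshold.
    for field, rate in null_rates(records, schema).items():
        limit = max_null_rate.get(field, 0.0)
        if rate > limit:
            problems.append(f"{field}: null rate {rate:.1%} exceeds {limit:.1%}")
    return problems
```

Returning the full list of problems, instead of stopping at the first, makes it easier to judge whether a failure is one bad row or a parser regression.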

Under the hood

How our pipeline handles Sports Reference constraints

Pro-Football-Reference employs aggressive rate limiting and complex DOM structures. Here is how we maintain reliable extraction.

pipeline-monitor · pro-football-reference.com · live ● active
// fingerprinting
Identity rotation
TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0
// pagination
Page coverage
48,291 pages queued, running
// observability
Pipeline health
Uptime: 99.9%
p99 latency: 142ms
Null rate: 0.3%
Alerts: 2

Rate limiting
Distributed request pacing

Sports Reference bans IPs exceeding 20 requests per minute. We route traffic through rotating residential proxies and pace concurrency to avoid detection while maintaining overall pipeline throughput.
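
To make the pacing idea concrete, here is a minimal sketch of a per-IP sliding-window limiter built around the 20-requests-per-minute figure quoted above. SlidingWindowLimiter and proxies_needed are hypothetical names for this example; a real crawler would more likely rely on its framework's own throttling settings.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window` seconds for one exit IP."""

    def __init__(self, max_requests=20, window=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock      # injectable so the limiter can be tested
        self.sent = deque()     # timestamps of requests still inside the window

    def wait_time(self):
        """Seconds to wait before the next request is allowed (0.0 if none)."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) < self.max_requests:
            return 0.0
        return self.window - (now - self.sent[0])

    def record(self):
        """Register that a request was just sent."""
        self.sent.append(self.clock())


def proxies_needed(target_pages_per_minute, per_ip_limit=20):
    """Smallest number of exit IPs that keeps every IP at or under the limit."""
    return -(-target_pages_per_minute // per_ip_limit)  # ceiling division
```

A caller would sleep for wait_time() seconds, send the request, then call record(). Under these assumptions, a target of 100 pages per minute needs at least five exit IPs.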

Table structure
Multi-header normalisation

Pro-Football-Reference uses complex, multi-tiered HTML tables. We flatten these structures, resolve merged cells, and enforce strict type casting to ensure clean columnar output.
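
A minimal sketch of the flattening step, assuming the table has already been read into a top header row of (label, colspan) pairs, a bottom header row of labels, and rows of raw cell text. The function names and the snake_case naming rule are illustrative choices, not the production logic.

```python
def flatten_headers(top_row, bottom_row):
    """Combine a two-tier header into flat snake_case column names.

    `top_row` holds (label, colspan) pairs, so a merged cell such as
    ("Passing", 2) spans two sub-columns. Blank labels add no prefix.
    """
    prefixes = []
    for label, colspan in top_row:
        prefixes.extend([label] * colspan)
    if len(prefixes) != len(bottom_row):
        raise ValueError("top-row colspans do not cover the bottom row")

    columns = []
    for prefix, name in zip(prefixes, bottom_row):
        parts = [p.strip().lower().replace(" ", "_") for p in (prefix, name) if p.strip()]
        columns.append("_".join(parts))
    return columns


def cast_cell(text):
    """Cast a cell to int or float where possible; empty cells become None."""
    text = text.strip()
    if text == "":
        return None
    for caster in (int, float):
        try:
            return caster(text)
        except ValueError:
            continue
    return text


def rows_to_records(top_row, bottom_row, rows):
    """Turn raw cell text into typed dictionaries keyed by flat column names."""
    columns = flatten_headers(top_row, bottom_row)
    return [dict(zip(columns, map(cast_cell, row))) for row in rows]
```

With a top row of [("", 2), ("Passing", 2), ("Rushing", 1)] and a bottom row of ["Date", "Opp", "Yds", "TD", "Yds"], the two "Yds" columns become passing_yds and rushing_yds, which removes the ambiguity a naive parser would hit.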

Hidden data
Parsing commented DOM nodes

Many advanced metrics and snap count tables are commented out in the HTML and injected via client-side JavaScript. We parse the raw DOM comments directly to extract the hidden nodes without heavy browser overhead.
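
The comment-parsing approach can be sketched with Python's standard-library HTML parser alone. This is an illustration under simple assumptions (well-formed row and cell tags, no nested tables); the class and function names are hypothetical, and a production parser would likely use a fuller HTML library.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the raw text of every HTML comment in a page."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


class TableRows(HTMLParser):
    """Collect cell text from <tr>/<th>/<td> markup, one list per row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def commented_tables(page_html):
    """Return the rows of each HTML comment that contains a <table>."""
    collector = CommentCollector()
    collector.feed(page_html)
    tables = []
    for comment in collector.comments:
        if "<table" not in comment:
            continue  # ordinary comment, not a hidden table
        parser = TableRows()
        parser.feed(comment)
        tables.append(parser.rows)
    return tables
```

Because the commented markup is parsed from the raw response body, no JavaScript has to run for the hidden tables to become available.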

Schema drift
Handling historical missing fields

The statistics tracked in 1985 differ from those tracked in 2023. Our parsers tolerate missing fields, emit explicit nulls instead of failing, and normalise schema drift across decades of NFL history.
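
One way to picture this is a normaliser that maps every raw record onto a fixed target schema. The sketch below is illustrative: the normalise function and the idea of returning unknown keys separately are assumptions for this example, and the schema passed in would be whatever was agreed during scoping.

```python
def normalise(record, schema):
    """Return (clean, unknown) for one raw record.

    `clean` has exactly the schema's keys: missing or blank values become
    None, present values are cast to the declared type. `unknown` holds keys
    that are not in the schema, so drift is surfaced instead of dropped.
    """
    clean, unknown = {}, {}
    for key, caster in schema.items():
        value = record.get(key)
        if value is None or value == "":
            clean[key] = None
        else:
            clean[key] = caster(value)
    for key, value in record.items():
        if key not in schema:
            unknown[key] = value
    return clean, unknown
```

A record from an older season that lacks a newer field comes back with that field set to None, so every row in the output shares the same columns regardless of era.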

Change detection
Efficient weekly deltas

We fetch only active players and recent games; historical data remains cached. Deltas are pushed to your warehouse weekly, following Monday Night Football.
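
A common way to implement this kind of delta is to fingerprint each record and compare against the fingerprints stored from the previous run. The sketch below assumes records are JSON-serialisable dictionaries keyed by game_id; fingerprint and compute_delta are hypothetical helper names, and content hashing is one possible technique, not necessarily the one used in production.

```python
import hashlib
import json


def fingerprint(record):
    """Stable SHA-256 of a record, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compute_delta(cached_fingerprints, fresh_records, key="game_id"):
    """Split fresh records into (new, changed) against cached fingerprints.

    `cached_fingerprints` maps record key to the fingerprint stored after
    the previous run. Unchanged records appear in neither list.
    """
    new, changed = [], []
    for record in fresh_records:
        digest = fingerprint(record)
        previous = cached_fingerprints.get(record[key])
        if previous is None:
            new.append(record)
        elif previous != digest:
            changed.append(record)
    return new, changed
```

Only the new and changed lists need to be pushed to the warehouse; unchanged records never leave the cache.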

Applications

Who uses NFL data — and how

Teams across industries use pro-football-reference.com data to build competitive products and smarter operations.

01
Fantasy Sports Modeling

Data scientists build predictive models for DFS platforms using historical snap counts, target shares, and red-zone usage.

02
Sports Betting Analytics

Quantitative syndicates feed play-by-play data and EPA metrics into algorithms to identify inefficient betting lines.

03
Academic Research

Economists and statisticians analyse draft outcomes, coaching decisions, and player longevity trends.

04
Sports Media & Journalism

Publishers automate historical comparisons and generate data-driven narratives for weekly NFL coverage.

05
Machine Learning Training

ML teams use decades of play-by-play sequences to train outcome prediction models and fourth-down decision engines.

06
App Development

Developers populate independent sports applications with historical player statistics and team records.

Why DataFlirt

"Pro-Football-Reference holds the definitive historical record of the NFL, but querying decades of play-by-play data requires a structured pipeline, not manual exports."

Sports Reference sites employ aggressive rate limiting and complex multi-header table structures designed to break naive parsers. DataFlirt handles the proxy rotation, request pacing, and DOM normalisation so your data science team can focus on building predictive models, not fixing broken scrapers.

Technical Spec

Pro-Football-Reference scraper — technical capabilities

Everything supported by our pro-football-reference.com scraper, from commented-table parsing and header flattening to request pacing and historical backfills.

Capability | Description | Status
Commented DOM parsing | Extracts tables hidden in HTML comments without requiring JavaScript execution | Supported
Multi-header table flattening | Resolves merged cells and nested headers into flat dictionary structures | Supported
Residential proxy rotation | Bypasses Sports Reference 20-request-per-minute IP bans | Supported
Play-by-play standardisation | Parses raw text descriptions into structured event types and yardage | Supported
Weekly delta updates | Incremental fetching of active player stats post-game | Supported
Historical backfills | Full catalogue extraction dating back to 1920 | Supported
Stathead proprietary queries | Custom query generation requiring paid Stathead subscription | Partial
User account saved searches | Extraction of personal saved queries from authenticated accounts | Partial
Infrastructure

Infrastructure powering the NFL data pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy · Playwright · Python 3.12 · Redis · PostgreSQL · Apache Airflow · AWS Lambda · S3 · CloudWatch · 2Captcha · CapSolver · Residential Proxies · Docker · Kubernetes · Grafana · Prometheus · BeautifulSoup4

Scrapy + DOM Parsing

Scrapy handles orchestration and request pacing. Custom middleware parses HTML comments to extract data without the overhead of headless browsers.

Residential Proxy Infrastructure

We maintain pools of residential ISP proxies to distribute request load and strictly adhere to Sports Reference rate limits without triggering blocks.

Cloud-Native Orchestration

Pipelines run on AWS ECS. Airflow handles scheduling, ensuring weekly deltas run reliably after Monday Night Football concludes.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON
Newline-delimited or nested — schema versioned per run
CSV
Flat file with typed columns — Excel/Sheets compatible
XLS
Excel format for business analysts
Parquet
Columnar format for BigQuery, Snowflake, Athena
AWS S3
Direct bucket delivery — compatible with any data lake
Webhook
HTTP POST per record for real-time downstream processing
API
REST endpoint for on-demand data retrieval
PostgreSQL
Direct insertion into your relational database
// faq

Common questions.

About pro-football-reference.com scraping, legality, and pipeline operations.

How do you handle Sports Reference rate limits?

Pro-Football-Reference restricts traffic to 20 requests per minute per IP. We distribute extraction across a large pool of US-based residential proxies and enforce strict concurrency limits in Scrapy to extract data reliably without triggering defensive blocks.

Can you extract data hidden behind Stathead paywalls?

No. We only extract publicly available data from Pro-Football-Reference. We do not bypass authentication walls or extract proprietary data requiring a paid Stathead subscription.

How do you handle the hidden tables in the HTML?

Pro-Football-Reference optimises page load by commenting out secondary tables (such as snap counts and advanced metrics) and injecting them via JavaScript. We parse the raw HTML comments directly to extract the table nodes, which is faster and more reliable than rendering each page in a headless browser such as Playwright.

When is the data updated each week?

For active season pipelines, we run delta updates on Tuesday mornings (UTC) after Monday Night Football concludes, ensuring all statistics and game logs for the week are finalised.

Can you standardise team names across historical eras?

Yes. Our parsers map historical franchise names (e.g., Houston Oilers) to their current franchise identifiers (Tennessee Titans) or maintain historical accuracy based on your schema requirements.
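
The two modes can be sketched as a lookup table plus a switch. The franchise identifiers below ("TEN", "LVR") and the resolve_team helper are hypothetical placeholders for this example; the actual identifiers would follow your schema.

```python
# Hypothetical mapping from every historical name to one franchise identifier.
FRANCHISE_BY_NAME = {
    "Houston Oilers": "TEN",
    "Tennessee Oilers": "TEN",
    "Tennessee Titans": "TEN",
    "Oakland Raiders": "LVR",
    "Los Angeles Raiders": "LVR",
    "Las Vegas Raiders": "LVR",
}

CURRENT_NAME = {
    "TEN": "Tennessee Titans",
    "LVR": "Las Vegas Raiders",
}


def resolve_team(name, mode="current"):
    """Map a team name to a franchise id plus a display name.

    mode="current" rewrites the name to the franchise's present-day name;
    mode="historical" keeps the name as it appeared at the time.
    """
    franchise_id = FRANCHISE_BY_NAME.get(name)
    if franchise_id is None:
        raise KeyError(f"no franchise mapping for {name!r}")
    if mode == "current":
        display = CURRENT_NAME[franchise_id]
    elif mode == "historical":
        display = name
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return {"franchise_id": franchise_id, "team_name": display}
```

Keeping the franchise identifier in both modes means joins across eras still work even when the historical name is preserved for display.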

Do you provide play-by-play data parsing?

Yes. We extract the raw play description text and parse it into structured fields including down, distance, play type, yardage gained, and involved players.
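
As a rough illustration of this parsing step, the sketch below turns two assumed description formats (a pass and a rush) into structured fields with regular expressions. The description wording, the patterns, and the parse_play helper are assumptions made for this example; real play descriptions vary far more, and anything unrecognised is returned untouched so it is never silently lost.

```python
import re

# Assumed pass format: "<passer> pass complete short right to <target> for 12 yards (...)"
PASS_RE = re.compile(
    r"^(?P<passer>.+?) pass (?P<outcome>complete|incomplete)"
    r"(?: (?:short|deep) (?:left|middle|right))?"
    r"(?: (?:to|intended for) (?P<target>.+?))?"
    r"(?: for (?:(?P<yards>-?\d+) yards?|no gain))?"
    r"(?: \(.*\))?$"
)

# Assumed rush format: "<rusher> left guard for 3 yards (...)"
RUSH_RE = re.compile(
    r"^(?P<rusher>.+?) (?:left end|left tackle|left guard|up the middle|"
    r"right guard|right tackle|right end)"
    r" for (?:(?P<yards>-?\d+) yards?|no gain)"
)


def parse_play(description):
    """Parse one play description into a dictionary of structured fields."""
    text = description.strip()
    match = PASS_RE.match(text)
    if match:
        complete = match.group("outcome") == "complete"
        yards = int(match.group("yards")) if match.group("yards") else 0
        return {
            "play_type": "Pass",
            "player": match.group("passer"),
            "target": match.group("target"),
            "complete": complete,
            "yards": yards if complete else 0,
        }
    match = RUSH_RE.match(text)
    if match:
        yards = int(match.group("yards")) if match.group("yards") else 0
        return {
            "play_type": "Rush",
            "player": match.group("rusher"),
            "target": None,
            "complete": None,
            "yards": yards,
        }
    # Fall back to the raw text when no pattern applies.
    return {"play_type": "Unknown", "description": text}
```

Down and distance are not parsed here because, in the data dictionary above, they arrive as separate fields and do not need to be recovered from the description.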

Can I request a sample dataset?

Yes. We provide a sample run of up to 50 player profiles or 10 game logs to validate schema fit and data quality before commencing the full extraction.

$ dataflirt scope --new-project --source=pro-football-reference.com ready

Tell us what to extract. We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a full historical backfill of 100 years of NFL data or weekly delta updates for active players — we scope, build, and operate the pipeline. Tell us what you need.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h