SYSTEM: all green · source: sports-reference.com · queue: 18,392 pages · p99 latency: 215ms · dataflirt.com · scraper/sports-reference-com

RUN - 41 active pipelines - sports-reference.com live

Sports data,
at warehouse scale.

We extract box scores, player profiles, advanced metrics, and season aggregates from the Sports-Reference network. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.


Box scores extracted

14,291 /day

Player profiles

89,412 /run

Season aggregates

3,450 /24h

Active pipelines

41

Uptime

99.94%

Data Dictionary

Every field we extract from sports-reference.com

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Player Profiles objects from sports-reference.com. All fields typed and schema-versioned.

player_id, name, sport, position, height, weight, birth_date, draft_year, debut_date, final_game_date, hall_of_fame, career_earnings, college

{
  "player_id": "jordami01",
  "name": "Michael Jordan",
  "sport": "basketball",
  "position": "Shooting Guard",
  "height": "6-6",
  "weight": 198,
  "birth_date": "1963-02-17",
  "draft_year": 1984
}

Complete list of extractable fields for Box Scores objects from sports-reference.com. All fields typed and schema-versioned.

game_id, date, home_team, away_team, home_score, away_score, attendance, venue, referee, duration, weather, box_score_url

{
  "game_id": "202310240DEN",
  "date": "2023-10-24",
  "home_team": "Denver Nuggets",
  "away_team": "Los Angeles Lakers",
  "home_score": 119,
  "away_score": 107,
  "attendance": 19842,
  "venue": "Ball Arena"
}

Complete list of extractable fields for Advanced Analytics objects from sports-reference.com. All fields typed and schema-versioned.

player_id, season, team, league, games_played, war, per, true_shooting_pct, usage_rate, win_shares, box_plus_minus, vorp

{
  "player_id": "jokicni01",
  "season": "2023-24",
  "team": "DEN",
  "war": 13.2,
  "per": 31.0,
  "true_shooting_pct": 0.65,
  "usage_rate": 29.3,
  "win_shares": 17.0
}

Complete list of extractable fields for Team Seasons objects from sports-reference.com. All fields typed and schema-versioned.

team_id, season, league, wins, losses, ties, points_for, points_against, coach, top_scorer, playoff_result

{
  "team_id": "BOS",
  "season": "2023-24",
  "wins": 64,
  "losses": 18,
  "points_for": 120.6,
  "points_against": 109.2,
  "playoff_result": "Won Finals"
}

Complete list of extractable fields for Play-by-Play objects from sports-reference.com. All fields typed and schema-versioned.

game_id, quarter, time_remaining, team, player, action_type, description, score_home, score_away, win_probability

{
  "game_id": "202310240DEN",
  "quarter": 4,
  "time_remaining": "02:14",
  "team": "DEN",
  "action_type": "Made Shot",
  "description": "Nikola Jokic makes 2-pt jump shot",
  "score_home": 115,
  "score_away": 103
}

Capabilities

Extract the definitive sports archive

Our Sports-Reference scraper covers the whole network, including Baseball-Reference, Basketball-Reference, Pro-Football-Reference, Hockey-Reference, and FBref. We normalise idiosyncratic table structures and work around strict rate limits.

Cross-Network Extraction

Unified pipelines for Baseball, Basketball, Football, Hockey, College Sports, and Soccer (FBref). Normalised schemas across disparate DOM structures.

Historical Box Score Parsing

Extract line scores, individual player stats, and game metadata for every recorded game in league history, adapting to missing fields from older eras.

Advanced Metric Capture

Capture calculated metrics like WAR, PER, xG, and Win Shares directly from the source tables, updated daily as algorithms adjust.

Play-by-Play Event Logs

Sequential event extraction for game flow analysis, including time remaining, action types, and real-time win probability shifts.

Draft Class Histories

Extract complete draft boards by year, including pick number, team, player, college, and subsequent career value metrics.

Franchise Record Aggregation

Track team histories, coaching changes, stadium shifts, and season-by-season performance records across all major leagues.

Commented HTML Parsing

Sports-Reference hides secondary tables inside HTML comments to optimise load times. Our parsers extract and render this hidden DOM data automatically.

Rate Limit Circumvention

The network enforces a strict limit of 20 requests per minute per IP. We distribute load across residential IP pools to run high-volume historical backfills without bans.

Scheduled Daily Syncs

Configure continuous pipelines to append new box scores and update season aggregates every morning during active seasons.

// engagement pipeline

From player list to warehouse record

Brief in. Clean data out.

Define Scope

Day 0

Provide sport categories, season ranges, or specific team IDs. We design the extraction schema together.

Pipeline Build

Days 2–4

We configure Scrapy crawlers, proxy rotation, and specific DOM parsers for the target Sports-Reference sub-site.

Validation & QA

Days 4–6

Schema validation, null-rate checks, and cross-era data normalisation before full launch.

Delivery

ongoing

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
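
As an illustration of the null-rate check in the Validation & QA step, here is a minimal Python sketch. The function names, the 25% threshold, and the second sample record are hypothetical and chosen for the example; they are not taken from the production pipeline.

```python
from collections import Counter
from typing import Any, Iterable, Mapping, Sequence


def null_rates(records: Iterable[Mapping[str, Any]], fields: Sequence[str]) -> dict:
    """Return, per field, the fraction of records where it is absent or None."""
    missing = Counter()
    total = 0
    for record in records:
        total += 1
        for field in fields:
            if record.get(field) is None:
                missing[field] += 1
    if total == 0:
        return {field: 0.0 for field in fields}
    return {field: missing[field] / total for field in fields}


def fields_over_threshold(rates: Mapping[str, float], max_null_rate: float) -> list:
    """List the fields whose null rate exceeds the allowed maximum."""
    return sorted(field for field, rate in rates.items() if rate > max_null_rate)


# First record reuses the Box Scores sample; the second is a made-up sparse record.
sample_batch = [
    {"game_id": "202310240DEN", "attendance": 19842, "venue": "Ball Arena"},
    {"game_id": "example-old-game", "venue": None},
]
rates = null_rates(sample_batch, ["game_id", "attendance", "venue"])
flagged = fields_over_threshold(rates, max_null_rate=0.25)
```

A check like this would typically run per object type before launch, so that a field that is legitimately sparse in older eras can be given its own threshold.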

Under the hood

How our pipeline handles the hard parts

Sports-Reference employs aggressive rate limiting and relies on legacy HTML structures. Here is how we maintain reliable extraction.

// fingerprinting

Identity rotation

TLS fingerprint: randomised

User-agent: rotated

IP pool: residential

Challenges blocked: 0

// pagination

Page coverage

48,291 pages queued running

// observability

Pipeline health

Uptime: 99.9%

p99 latency: 142ms

Null rate: 0.3%

alerts

Rate limiting

Handling 429 Too Many Requests

Sports-Reference aggressively bans IPs that exceed 20 requests per minute. We distribute crawl tasks across thousands of residential IPs, ensuring request rates per IP remain well below threshold during massive historical backfills.
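
One way to keep each IP under the limit is a sliding-window budget per proxy. The sketch below is an assumption about how this could be done, not a description of our production scheduler: the class name, the default budget of 10 requests per 60 seconds (half the stated limit), and the proxy labels are all hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Iterable, Optional


class PerProxyRateLimiter:
    """Sliding-window request budget tracked separately for each proxy."""

    def __init__(self, max_requests: int = 10, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests      # assumed budget, below 20 req/min
        self.window_seconds = window_seconds
        self._clock = clock                   # injectable for testing
        self._sent = defaultdict(deque)       # proxy -> timestamps of recent requests

    def try_acquire(self, proxy: str) -> bool:
        """Record a request for this proxy if it still has budget in the window."""
        now = self._clock()
        sent = self._sent[proxy]
        # Drop timestamps that have aged out of the window.
        while sent and sent[0] <= now - self.window_seconds:
            sent.popleft()
        if len(sent) >= self.max_requests:
            return False
        sent.append(now)
        return True

    def pick_proxy(self, proxies: Iterable[str]) -> Optional[str]:
        """Return the first proxy with spare budget, or None if all are exhausted."""
        for proxy in proxies:
            if self.try_acquire(proxy):
                return proxy
        return None


# Demonstration with a fake clock and hypothetical proxy labels.
fake_now = [0.0]
limiter = PerProxyRateLimiter(max_requests=2, window_seconds=60.0,
                              clock=lambda: fake_now[0])
first = limiter.try_acquire("proxy-a")
second = limiter.try_acquire("proxy-a")
third = limiter.try_acquire("proxy-a")            # budget used up
fallback = limiter.pick_proxy(["proxy-a", "proxy-b"])
fake_now[0] = 61.0                                # window has passed
after_window = limiter.try_acquire("proxy-a")
```

When `pick_proxy` returns None, the caller would wait or requeue the task rather than send the request.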

DOM structure

Extracting tables from HTML comments

To optimise page load times, Sports-Reference injects many of its data tables as commented-out HTML strings. Parsers that skip comments miss this data entirely. Our middleware extracts, unescapes, and parses these comments into standard DOM trees before extraction.
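
The idea can be sketched with nothing but Python's standard library. This is a simplified illustration rather than our middleware: the class and function names are hypothetical, the sample markup (including the `all_advanced` and `advanced` ids) is invented for the example, and all rows found in one comment are returned as a single table.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the bodies of HTML comments that appear to contain a table."""

    def __init__(self) -> None:
        super().__init__()
        self.table_comments = []

    def handle_comment(self, data: str) -> None:
        if "<table" in data.lower():
            self.table_comments.append(data)


class TableRowParser(HTMLParser):
    """Turn <tr>/<th>/<td> markup into a list of rows of cell text."""

    def __init__(self) -> None:
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside <tr>
        self._cell = None   # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_commented_tables(html: str) -> list:
    """Parse every table-bearing comment in the page into rows of cell text."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    tables = []
    for comment in collector.table_comments:
        parser = TableRowParser()
        parser.feed(comment)
        parser.close()
        tables.append(parser.rows)
    return tables


# Invented markup; the cell values reuse the Advanced Analytics sample.
page = """
<div id="all_advanced">
<!--
<table id="advanced">
<tr><th>player_id</th><th>win_shares</th></tr>
<tr><td>jokicni01</td><td>17.0</td></tr>
</table>
-->
</div>
"""
tables = extract_commented_tables(page)
```

The comment body is fed through a second parser pass because the first pass only sees it as opaque comment text.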

Schema normalisation

Handling cross-era data gaps

A box score from 1962 lacks the fields of a box score from 2024. Our schemas gracefully handle missing data points from historical eras, enforcing null values rather than breaking the pipeline or misaligning columns.
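
A minimal sketch of this null-filling step, assuming the Box Scores field list from the data dictionary above. The function name and the truncated input record are hypothetical, and treating empty strings as missing is an assumption made for the example.

```python
from typing import Any, Mapping, Sequence

# Field order follows the Box Scores entry in the data dictionary.
BOX_SCORE_FIELDS = (
    "game_id", "date", "home_team", "away_team", "home_score", "away_score",
    "attendance", "venue", "referee", "duration", "weather", "box_score_url",
)


def normalise_record(raw: Mapping[str, Any],
                     fields: Sequence[str] = BOX_SCORE_FIELDS) -> dict:
    """Project a raw record onto the fixed schema, filling gaps with None."""
    row = {}
    for field in fields:
        value = raw.get(field)
        # Assumption: an empty string in the source table means "not recorded".
        row[field] = None if value == "" else value
    return row


# A deliberately sparse record, standing in for a game from an older era.
sparse = {"game_id": "202310240DEN", "home_score": 119, "away_score": 107,
          "attendance": ""}
row = normalise_record(sparse)
```

Every output row has the same keys in the same order, so columns stay aligned whichever era the game comes from.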

Daily syncs

Overnight box score generation

For active seasons, we schedule pipelines to run at 04:00 EST, capturing the previous night's box scores and updated season aggregates as soon as the site publishes them, delivering fresh data before morning analysis.
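
For illustration, the next 04:00 EST run time can be computed as below. This is a sketch, not our Airflow configuration: the function name is hypothetical, and EST is modelled as a fixed UTC-5 offset, which ignores daylight saving time.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: "EST" is taken literally as a fixed UTC-5 offset.
EST = timezone(timedelta(hours=-5), name="EST")
RUN_AT = time(hour=4)


def next_run(now: datetime) -> datetime:
    """Return the next 04:00 EST strictly after `now` (must be timezone-aware)."""
    local = now.astimezone(EST)
    candidate = datetime.combine(local.date(), RUN_AT, tzinfo=EST)
    if candidate <= local:
        candidate += timedelta(days=1)
    return candidate


# 03:30 UTC is 22:30 EST the previous evening, so the run is the next morning.
late_evening = next_run(datetime(2023, 10, 25, 3, 30, tzinfo=timezone.utc))
# 08:00 UTC is 03:00 EST, so the run is one hour later the same morning.
early_morning = next_run(datetime(2023, 10, 25, 8, 0, tzinfo=timezone.utc))
```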

Monitoring

Schema drift detection

When Sports-Reference adds new advanced metrics or alters table structures, our observability stack detects the schema drift. We alert on missing fields and update selectors before you receive malformed data.
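
A drift check can be as simple as comparing the expected field set with the headers observed on a page. The sketch below assumes the Advanced Analytics field list from the data dictionary; the function name and the `new_metric` column are hypothetical.

```python
from typing import Iterable

# Expected fields, taken from the Advanced Analytics entry in the data dictionary.
ADVANCED_FIELDS = {
    "player_id", "season", "team", "league", "games_played", "war", "per",
    "true_shooting_pct", "usage_rate", "win_shares", "box_plus_minus", "vorp",
}


def detect_drift(expected: Iterable[str], observed: Iterable[str]) -> dict:
    """Report fields that disappeared from, or newly appeared in, the source."""
    expected_set, observed_set = set(expected), set(observed)
    return {
        "missing": sorted(expected_set - observed_set),
        "unexpected": sorted(observed_set - expected_set),
    }


# Simulated crawl result: "vorp" has vanished and an invented column has appeared.
observed_headers = sorted(ADVANCED_FIELDS - {"vorp"}) + ["new_metric"]
report = detect_drift(ADVANCED_FIELDS, observed_headers)
has_drift = bool(report["missing"] or report["unexpected"])
```

A non-empty `missing` list is the case that should block delivery; an `unexpected` column is usually safe to flag for review.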

Applications

Who uses Sports-Reference data

Teams across industries use sports-reference.com data to build competitive products and smarter operations.

Sports Betting Models

Quantitative syndicates ingest historical play-by-play logs and advanced metrics to backtest predictive models and identify line inefficiencies.

Fantasy Sports Analytics

DFS platforms and high-volume players use daily updated box scores and usage rates to project player performance and optimise lineups.

Academic Research

Economists and statisticians analyse decades of salary, draft, and performance data to study labour markets and predictive validity.

Media & Broadcasting

Sports networks use structured historical data to generate automated pre-game notes, on-screen graphics, and historical comparisons.

Player Valuation Models

Front offices and agencies benchmark player performance against historical cohorts to inform contract negotiations and trade evaluations.

Historical Archiving

Data libraries maintain permanent, queryable records of franchise histories and league evolutions independent of third-party interfaces.

Technical Spec

Sports-Reference scraper - technical capabilities

Everything supported by our sports-reference.com scraper: commented-out table parsing, cross-sport normalisation, rate-limit handling, and delivery options.

Cross-sport schema normalisation

Unified output formats across Baseball, Basketball, Football, and Soccer

Supported

Commented-out HTML parsing

Extracts data tables hidden inside HTML comments for load optimisation

Supported

Play-by-play log extraction

Sequential event data with timestamps and score shifts

Supported

Daily active season updates

Scheduled runs to capture overnight box scores and updated aggregates

Supported

Historical franchise archives

Full backfills of team histories dating back to league inception

Supported

Proxy rotation for rate limits

Distributed crawling that keeps each IP under the 20 req/min limit to avoid bans

Supported

Stathead custom queries

Custom database queries require a paid Stathead subscription and auth

Partial

Direct API endpoint access

Sports-Reference offers a paid API; we scrape the public HTML frontend

Partial

Webhook delivery

HTTP POST per record or batch for downstream pipeline ingestion

Supported

Infrastructure

Infrastructure powering the pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus

Scrapy + Playwright Stack

Scrapy handles crawl orchestration, deduplication, and retry logic. Custom middleware unescapes hidden HTML comments before passing the DOM to the parsing layer.

Residential Proxy Infrastructure

We maintain pools of residential ISP proxies to distribute request load. Rotation happens per-request to ensure no single IP trips the strict rate limiting thresholds.

Cloud-Native Orchestration

Pipelines run on AWS Lambda and ECS. Airflow handles scheduling for overnight daily syncs, and SLA alerting ensures data is ready before morning analysis.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON

Newline-delimited or nested - schema versioned per run

CSV

Flat file with typed columns - Excel/Sheets compatible

XLS

Legacy spreadsheet format for direct analyst consumption

Parquet

Columnar format for BigQuery, Snowflake, Athena

AWS S3

Direct bucket delivery - compatible with any data lake

Webhook

HTTP POST per record for real-time downstream processing

API

RESTful endpoints to query your extracted datasets

PostgreSQL

Upsert into your existing schema with conflict resolution

Snowflake

Stage + COPY INTO workflow - incremental or full-replace
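
To make the newline-delimited JSON and batch delivery options concrete, here is a small sketch. The function names, the `schema_version` key, and the `"v1"` value are assumptions for the example; the actual payload layout is agreed per engagement.

```python
import json
from typing import Any, Iterable, Iterator, Mapping


def to_ndjson(records: Iterable[Mapping[str, Any]], schema_version: str) -> str:
    """Serialise records as newline-delimited JSON, one object per line."""
    lines = [
        json.dumps({**record, "schema_version": schema_version}, sort_keys=True)
        for record in records
    ]
    return "\n".join(lines) + ("\n" if lines else "")


def in_batches(items: Iterable[Any], batch_size: int) -> Iterator[list]:
    """Yield consecutive lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Records reuse values from the Play-by-Play and Box Scores samples.
records = [
    {"game_id": "202310240DEN", "quarter": 4, "time_remaining": "02:14"},
    {"game_id": "202310240DEN", "home_score": 119, "away_score": 107},
]
payload = to_ndjson(records, schema_version="v1")
batches = list(in_batches([1, 2, 3, 4, 5], batch_size=2))
```

Each batch could then be written to a file for S3 or sent as the body of one webhook POST.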

// faq

Common questions.

About sports-reference.com scraping, legality, and pipeline operations.


Is scraping Sports-Reference legal?

Scraping publicly available, factual historical data (like box scores and player stats) is generally permissible. DataFlirt targets only public, non-authenticated pages. We do not extract proprietary Stathead subscription data or circumvent authentication walls.

How do you handle their strict rate limits?

Sports-Reference enforces a strict limit of 20 requests per minute per IP. We distribute crawl tasks across a large pool of residential proxies, ensuring individual IPs remain well below the threshold during high-volume backfills.

Why are some tables missing in standard scrapers?

To optimise page load performance, Sports-Reference injects many secondary data tables as commented-out HTML strings. Standard HTML parsers skip comments, so typical scrapers never see these tables. Our custom middleware extracts and parses these comments into standard DOM elements.

Which sports do you support?

We support the entire network: Baseball-Reference, Basketball-Reference, Pro-Football-Reference, Hockey-Reference, FBref (Soccer), and College Sports sub-sites. All use unified data normalisation pipelines.

Can you normalise data across different eras?

Yes. A box score from 1950 lacks the advanced metrics of a game from 2024. Our schemas gracefully handle missing fields from historical eras, enforcing null values rather than breaking column alignment.

How fast are daily updates delivered?

For active seasons, pipelines run overnight following the conclusion of games. Deliveries typically complete by 06:00 EST, providing fresh box scores and season aggregates before morning analysis begins.

Do you extract Stathead data?

No. Stathead is a paid subscription service that provides custom database query tools behind an authentication wall. We extract data exclusively from the public frontend pages of the Sports-Reference network.

Tell us what
to extract.
We do the rest.
