We extract box scores, player profiles, advanced metrics, and season aggregates from the Sports-Reference network. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.
Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.
Complete list of extractable fields for Player Profiles objects from sports-reference.com. All fields typed and schema-versioned.
"player_id": "jordami01", "name": "Michael Jordan", "sport": "basketball", "position": "Shooting Guard", "height": "6-6", "weight": 198, "birth_date": "1963-02-17", "draft_year": 1984
| player_id | name | sport | position | height | weight |
|---|---|---|---|---|---|
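As an illustration of what "typed and schema-versioned" could look like for the sample record above, here is a minimal sketch. The `PlayerProfile` class, the `parse_player_profile` helper, and the `"1.0"` version label are hypothetical names chosen for this example, not the delivered schema, and the weight unit is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

SCHEMA_VERSION = "1.0"  # hypothetical version label, for illustration only


@dataclass(frozen=True)
class PlayerProfile:
    player_id: str
    name: str
    sport: str
    position: Optional[str]
    height: Optional[str]       # feet-inches string as in the sample, e.g. "6-6"
    weight: Optional[int]       # unit is not stated in the sample; assumed pounds
    birth_date: Optional[date]
    draft_year: Optional[int]
    schema_version: str = SCHEMA_VERSION


def parse_player_profile(raw: dict) -> PlayerProfile:
    """Convert a raw record into a typed profile; absent optional fields become None."""
    birth = raw.get("birth_date")
    weight = raw.get("weight")
    draft = raw.get("draft_year")
    return PlayerProfile(
        player_id=raw["player_id"],
        name=raw["name"],
        sport=raw["sport"],
        position=raw.get("position"),
        height=raw.get("height"),
        weight=int(weight) if weight is not None else None,
        birth_date=date.fromisoformat(birth) if birth else None,
        draft_year=int(draft) if draft is not None else None,
    )
```

Only the three identifying fields are treated as required here; everything else is optional so that sparse profiles still parse.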
Complete list of extractable fields for Box Scores objects from sports-reference.com. All fields typed and schema-versioned.
"game_id": "202310240DEN", "date": "2023-10-24", "home_team": "Denver Nuggets", "away_team": "Los Angeles Lakers", "home_score": 119, "away_score": 107, "attendance": 19842, "venue": "Ball Arena"
| game_id | date | home_team | away_team | home_score | away_score |
|---|---|---|---|---|---|
Complete list of extractable fields for Advanced Analytics objects from sports-reference.com. All fields typed and schema-versioned.
"player_id": "jokicni01", "season": "2023-24", "team": "DEN", "war": 13.2, "per": 31.0, "true_shooting_pct": 0.65, "usage_rate": 29.3, "win_shares": 17.0
| player_id | season | team | league | games_played | war |
|---|---|---|---|---|---|
Complete list of extractable fields for Team Seasons objects from sports-reference.com. All fields typed and schema-versioned.
"team_id": "BOS", "season": "2023-24", "wins": 64, "losses": 18, "points_for": 120.6, "points_against": 109.2, "playoff_result": "Won Finals"
| team_id | season | league | wins | losses | ties |
|---|---|---|---|---|---|
Complete list of extractable fields for Play-by-Play objects from sports-reference.com. All fields typed and schema-versioned.
"game_id": "202310240DEN", "quarter": 4, "time_remaining": "02:14", "team": "DEN", "action_type": "Made Shot", "description": "Nikola Jokic makes 2-pt jump shot", "score_home": 115, "score_away": 103
| game_id | quarter | time_remaining | team | player | action_type |
|---|---|---|---|---|---|
Our Sports-Reference scraper handles the entire network: Baseball-Reference, FBref, Basketball-Reference, and Pro-Football-Reference. We normalise idiosyncratic table structures and bypass strict rate limits.
Unified pipelines for Baseball, Basketball, Football, Hockey, College Sports, and Soccer (FBref). Normalised schemas across disparate DOM structures.
Extract line scores, individual player stats, and game metadata for every recorded game in league history, adapting to missing fields from older eras.
Capture calculated metrics like WAR, PER, xG, and Win Shares directly from the source tables, updated daily as algorithms adjust.
Sequential event extraction for game flow analysis, including time remaining, action types, and real-time win probability shifts.
Extract complete draft boards by year, including pick number, team, player, college, and subsequent career value metrics.
Track team histories, coaching changes, stadium shifts, and season-by-season performance records across all major leagues.
Sports-Reference hides secondary tables inside HTML comments to optimise load times. Our parsers extract and render this hidden DOM data automatically.
The network enforces a strict limit of 20 requests per minute. We distribute load across residential IP pools to execute high-volume historical backfills without bans.
Configure continuous pipelines to append new box scores and update season aggregates every morning during active seasons.
Brief in. Clean data out.
Provide sport categories, season ranges, or specific team IDs. We design the extraction schema together.
We configure Scrapy crawlers, proxy rotation, and specific DOM parsers for the target Sports-Reference sub-site.
Schema validation, null-rate checks, and cross-era data normalisation before full launch.
JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
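The null-rate check in the validation step can be sketched roughly as follows. The function names and the 5% threshold are assumptions made for this example, not the production values.

```python
from collections import Counter
from typing import Iterable


def null_rates(records: Iterable[dict], fields: list[str]) -> dict[str, float]:
    """Return, per field, the fraction of records where it is missing or None."""
    nulls = Counter()
    total = 0
    for rec in records:
        total += 1
        for field in fields:
            if rec.get(field) is None:
                nulls[field] += 1
    if total == 0:
        return {field: 0.0 for field in fields}
    return {field: nulls[field] / total for field in fields}


def fields_over_threshold(rates: dict[str, float], max_rate: float = 0.05) -> list[str]:
    """List fields whose null rate exceeds max_rate (hypothetical 5% default)."""
    return sorted(f for f, rate in rates.items() if rate > max_rate)
```

A pilot run would typically compute these rates per season range, since older eras are expected to have higher null rates for fields such as attendance.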
Sports-Reference employs aggressive rate limiting and relies on legacy HTML structures. Here is how we maintain reliable extraction.
Sports-Reference aggressively bans IPs that exceed 20 requests per minute. We distribute crawl tasks across thousands of residential IPs, ensuring the request rate per IP remains well below that threshold during massive historical backfills.
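One way to keep each IP under the limit is a sliding-window counter per IP. This is a minimal sketch, not the production limiter: the `PerIpRateLimiter` name and the budget of 15 requests per 60 seconds (chosen to sit below the 20-per-minute limit) are assumptions.

```python
import time
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Sliding-window limiter: at most max_requests per window, tracked per IP."""

    def __init__(self, max_requests=15, window_seconds=60.0, clock=time.monotonic):
        # 15 is a hypothetical budget kept under the 20 requests/minute limit.
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock
        self._sent = defaultdict(deque)  # ip -> timestamps of recent requests

    def try_acquire(self, ip: str) -> bool:
        """Record a request for this IP if it has budget left; otherwise return False."""
        now = self.clock()
        sent = self._sent[ip]
        # Drop timestamps that have aged out of the window.
        while sent and now - sent[0] >= self.window:
            sent.popleft()
        if len(sent) >= self.max_requests:
            return False
        sent.append(now)
        return True
```

A crawler would call `try_acquire` before dispatching a request through a given IP and fall back to another IP, or wait, when it returns `False`.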
To optimise page load times, Sports-Reference injects many of its secondary data tables as commented-out HTML strings. Selectors run against the parsed page miss this data entirely, because comment contents are not parsed as markup. Our middleware extracts, unescapes, and parses these comments into standard DOM trees before extraction.
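The idea of recovering tables from comments can be shown with the Python standard library alone. This is an illustrative sketch, not the middleware itself: the class and function names are invented for this example, and it only collects cell text row by row.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the bodies of HTML comments that appear to contain a table."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        if "<table" in data:
            self.comments.append(data)


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def rows_from_commented_tables(html: str) -> list:
    """Re-parse table markup found inside comments and return its rows."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    parser = TableRows()
    for fragment in collector.comments:
        parser.feed(fragment)
    parser.close()
    return parser.rows
```

The key step is the second pass: the comment body is fed back into a parser as if it were ordinary markup, after which normal table extraction applies.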
A box score from 1962 lacks the fields of a box score from 2024. Our schemas gracefully handle missing data points from historical eras, enforcing null values rather than breaking the pipeline or misaligning columns.
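A minimal sketch of that null-filling behaviour, assuming a fixed field order taken from the Box Scores sample on this page. The `normalise_record` helper is hypothetical, and treating empty strings as null is an assumption.

```python
# Field order taken from the Box Scores sample record; illustrative only.
BOX_SCORE_FIELDS = (
    "game_id", "date", "home_team", "away_team",
    "home_score", "away_score", "attendance", "venue",
)


def normalise_record(raw: dict, fields=BOX_SCORE_FIELDS) -> dict:
    """Return a record with every expected field, in a fixed order.

    Missing fields and empty strings become None, so older games keep the
    same column layout as modern ones. Keys outside `fields` are dropped.
    """
    out = {}
    for field in fields:
        value = raw.get(field)
        out[field] = None if value == "" else value
    return out
```

Because every output record has the same keys in the same order, writing to CSV or a columnar format cannot shift values into the wrong column when an era lacks a field.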
For active seasons, we schedule pipelines to run at 04:00 EST, capturing the previous night's box scores and updated season aggregates as soon as the site publishes them, delivering fresh data before morning analysis.
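For illustration, the next run time for such a schedule can be computed as below. This sketch takes "EST" literally as a fixed UTC-5 offset, which is an assumption (it ignores daylight saving), and the `next_run` helper is hypothetical; a real deployment would leave this to the scheduler.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: "EST" is treated as a fixed UTC-5 offset, with no daylight saving.
EST = timezone(timedelta(hours=-5), "EST")


def next_run(now: datetime, run_at: time = time(4, 0)) -> datetime:
    """Return the next occurrence of run_at (EST) strictly after `now`.

    `now` must be timezone-aware.
    """
    local = now.astimezone(EST)
    candidate = datetime.combine(local.date(), run_at, tzinfo=EST)
    if candidate <= local:
        candidate += timedelta(days=1)
    return candidate
```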
When Sports-Reference adds new advanced metrics or alters table structures, our observability stack detects the schema drift. We alert on missing fields and update selectors before you receive malformed data.
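At its simplest, drift detection is a set comparison between the columns a parser expects and the columns it actually found. The sketch below assumes that framing; the function names are hypothetical and say nothing about the actual observability stack.

```python
def detect_schema_drift(expected: set, observed: set) -> dict:
    """Compare expected column names with those observed on a freshly parsed table."""
    return {
        "missing": sorted(expected - observed),      # selectors likely need updating
        "unexpected": sorted(observed - expected),   # possibly a newly added metric
    }


def has_drift(report: dict) -> bool:
    """True when any column is missing or unexpected."""
    return bool(report["missing"] or report["unexpected"])
```

For example, if a table gains a new metric column and loses `usage_rate`, the report lists the new name under `unexpected` and `usage_rate` under `missing`, which is the condition that would trigger an alert before delivery.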
Quantitative syndicates ingest historical play-by-play logs and advanced metrics to backtest predictive models and identify line inefficiencies.
DFS platforms and high-volume players use daily updated box scores and usage rates to project player performance and optimise lineups.
Economists and statisticians analyse decades of salary, draft, and performance data to study labour markets and predictive validity.
Sports networks use structured historical data to generate automated pre-game notes, on-screen graphics, and historical comparisons.
Front offices and agencies benchmark player performance against historical cohorts to inform contract negotiations and trade evaluations.
Data libraries maintain permanent, queryable records of franchise histories and league evolutions independent of third-party interfaces.
"Sports-Reference holds the definitive historical record of global athletics, but their HTML structure is notoriously dense and idiosyncratic across different sports."
Extracting data from Baseball-Reference or FBref requires parsing commented-out DOM elements, normalising schemas across decades of rule changes, and strictly managing request rates to avoid outright IP bans. DataFlirt handles the proxy rotation and parsing logic so your data scientists can focus on modelling.
Everything supported by our sports-reference.com scraper: comment-embedded tables, per-IP rate-limit management, cross-era schema normalisation, and beyond.
Open-source tooling on proven cloud infra — no vendor lock-in, full observability.
Scrapy handles crawl orchestration, deduplication, and retry logic. Custom middleware unescapes hidden HTML comments before passing the DOM to the parsing layer.
We maintain pools of residential ISP proxies to distribute request load. Rotation happens per-request to ensure no single IP trips the strict rate limiting thresholds.
Pipelines run on AWS Lambda and ECS. Airflow handles scheduling for overnight daily syncs, and SLA alerting ensures data is ready before morning analysis.
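Per-request rotation can be sketched as a simple round-robin over the pool. The `ProxyRotator` class is a hypothetical helper and the proxy URLs in the usage note are placeholders. The `proxy` key matches what Scrapy's built-in `HttpProxyMiddleware` reads from `request.meta`.

```python
from itertools import cycle


class ProxyRotator:
    """Round-robin rotation: each request gets the next proxy in the pool."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = cycle(proxies)

    def request_meta(self) -> dict:
        """Return a meta dict to attach to the next outgoing request."""
        return {"proxy": next(self._cycle)}
```

In a Scrapy spider this might be used as `scrapy.Request(url, meta=rotator.request_meta())`, with `rotator = ProxyRotator(["http://proxy-1.example:8000", "http://proxy-2.example:8000"])`. A production setup would likely combine rotation with a per-IP request budget rather than rely on round-robin alone.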
Data delivered to where your team already works — no new tooling required.
About sports-reference.com scraping, legality, and pipeline operations.
Scraping publicly available, factual historical data (like box scores and player stats) is generally permissible. DataFlirt targets only public, non-authenticated pages. We do not extract proprietary Stathead subscription data or circumvent authentication walls.
Sports-Reference enforces a strict limit of 20 requests per minute per IP. We distribute crawl tasks across a large pool of residential proxies, ensuring individual IPs remain well below the threshold during high-volume backfills.
To optimise page load performance, Sports-Reference injects many secondary data tables as commented-out HTML strings. Standard HTML parsers treat comment contents as inert text, so ordinary selectors never see these tables. Our custom middleware extracts and parses these comments into standard DOM elements.
We support the entire network: Baseball-Reference, Basketball-Reference, Pro-Football-Reference, Hockey-Reference, FBref (Soccer), and College Sports sub-sites. All use unified data normalisation pipelines.
Yes. A box score from 1950 lacks the advanced metrics of a game from 2024. Our schemas gracefully handle missing fields from historical eras, enforcing null values rather than breaking column alignment.
For active seasons, pipelines run overnight following the conclusion of games. Deliveries typically complete by 06:00 EST, providing fresh box scores and season aggregates before morning analysis begins.
No. Stathead is a paid subscription service that provides custom database query tools behind an authentication wall. We extract data exclusively from the public frontend pages of the Sports-Reference network.
20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a full historical backfill of every MLB box score or a daily sync of NBA advanced metrics, we scope, build, and operate the pipeline. Tell us what you need.