We extract box scores, play-by-play logs, player shooting splits, advanced metrics, and draft history from Basketball-Reference. Delivered as clean JSON, CSV, or Parquet to S3 or BigQuery on your cadence.
Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.
Complete list of extractable fields for Player Profiles objects from basketball-reference.com. All fields typed and schema-versioned.
"player_id": "jamesle01", "name": "LeBron James", "position": "SF", "height": "6-9", "weight": 250, "draft_year": 2003, "career_points": 40474, "career_rebounds": 11185
| # | player_id | name | position | height | weight | birth_date |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Game Box Scores objects from basketball-reference.com. All fields typed and schema-versioned.
"game_id": "202402280LAL", "date": "2024-02-28", "home_team": "LAL", "away_team": "LAC", "player_id": "jamesle01", "minutes_played": 37.2, "points": 34, "assists": 8, "rebounds": 6
| # | game_id | date | home_team | away_team | player_id | minutes_played |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Play-by-Play objects from basketball-reference.com. All fields typed and schema-versioned.
"game_id": "202402280LAL", "quarter": 4, "time_remaining": "11:45", "event_type": "make_3pt", "player_id": "jamesle01", "team_id": "LAL", "description": "LeBron James makes 3-pt jump shot (26 ft)", "home_score": 80, "away_score": 98, "shot_distance": 26
| # | game_id | quarter | time_remaining | event_type | player_id | team_id |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Advanced Stats objects from basketball-reference.com. All fields typed and schema-versioned.
"player_id": "jokicni01", "season": "2023-24", "per": 31.0, "true_shooting_pct": 0.65, "usage_pct": 29.3, "win_shares": 17.0, "box_plus_minus": 13.2, "vorp": 10.6
| # | player_id | season | per | true_shooting_pct | usage_pct | offensive_win_shares |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Draft History objects from basketball-reference.com. All fields typed and schema-versioned.
"draft_year": 2003, "pick_number": 1, "round": 1, "team_id": "CLE", "player_id": "jamesle01", "college": "None", "years_played": 21, "total_games": 1492, "win_shares": 263.6
| # | draft_year | pick_number | round | team_id | player_id | college |
|---|---|---|---|---|---|---|
Our Basketball-Reference scraper navigates complex table structures, uncomments hidden HTML data, and maps player identifiers across decades of historical records.
Extract biographical data, draft information, salary history, and career totals for every player in NBA history.
Capture basic and advanced box scores for every game, including inactive players and DNP reasons.
Parse event-level data including shot distances, substitution patterns, and running scores for every possession.
Extract PER, Win Shares, Box Plus/Minus, and VORP calculated per season or per game.
Gather shooting percentages by distance, quarter, opponent, and days of rest.
Scrape all historical draft picks, trade details, and subsequent career outcomes.
Extract data from WNBA, EuroLeague, and G-League databases using the same normalisation schema.
Pull NCAA stats, tournament history, and recruiting rankings for comprehensive prospect models.
Run automated pipelines every morning to capture the previous night's box scores and updated season averages.
Brief in. Clean data out.
Specify seasons, teams, or specific stat tables required. We map the target schema.
We configure crawlers to handle rate limits and parse commented-out HTML tables.
We verify sum totals, check for missing games, and validate advanced metric formulas.
JSON, CSV, or Parquet pushed to your S3 bucket or Snowflake instance daily.
Sports Reference sites employ strict rate limits and unusual DOM structures. Here is how we maintain stable extraction pipelines.
Sports Reference enforces a strict limit of 20 requests per minute and blocks IPs that exceed it. We manage distributed crawl clusters with residential proxies to parallelise extraction without triggering bans.
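As a rough illustration of how a single crawl node could stay under that ceiling, here is a minimal sliding-window limiter. The 20-per-minute figure comes from the text above; the class name `SlidingWindowLimiter` and its parameters are hypothetical and are not a description of our production code.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls in any `window_seconds` span."""

    def __init__(
        self,
        max_requests: int = 20,          # per-minute ceiling quoted above
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._sent: Deque[float] = deque()  # timestamps of recent requests

    def _evict(self, now: float) -> None:
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()

    def acquire(self) -> float:
        """Block until a request slot is free; return the seconds waited."""
        now = self._clock()
        self._evict(now)
        waited = 0.0
        if len(self._sent) >= self.max_requests:
            # Wait until the oldest request in the window expires.
            waited = self._sent[0] + self.window_seconds - now
            self._sleep(waited)
            now = self._clock()
            self._evict(now)
        self._sent.append(now)
        return waited
```

A node would call `acquire()` before every HTTP request. The injectable `clock` and `sleep` arguments exist only so the behaviour can be tested without real delays.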
To optimise page load times, Basketball-Reference hides secondary data tables inside HTML comments. Standard scrapers miss this data entirely. Our parsers extract and render these comments into queryable DOM objects.
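The idea can be sketched with nothing but Python's standard library: collect every HTML comment that contains a `<table`, then parse the comment body as ordinary markup. The class and function names below are illustrative assumptions, and a production parser would more likely use lxml, but the two-pass technique is the same.

```python
from html.parser import HTMLParser
from typing import List, Optional


class CommentCollector(HTMLParser):
    """First pass: keep the body of every comment that hides a table."""

    def __init__(self) -> None:
        super().__init__()
        self.comments: List[str] = []

    def handle_comment(self, data: str) -> None:
        if "<table" in data:
            self.comments.append(data)


class TableRows(HTMLParser):
    """Second pass: collect the cell text of each <tr> as a list of strings."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs) -> None:
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data: str) -> None:
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag: str) -> None:
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None


def rows_from_commented_tables(html: str) -> List[List[str]]:
    """Return every table row found inside HTML comments on the page."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    rows: List[List[str]] = []
    for comment in collector.comments:
        parser = TableRows()
        parser.feed(comment)
        parser.close()
        rows.extend(parser.rows)
    return rows
```

Calling `rows_from_commented_tables(page_html)` on a fetched page returns header and data rows together, so the first row of each table can be used as column names.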
Players frequently change names or have identical names. We extract and normalise the unique Basketball-Reference player IDs (e.g., 'jamesle01') across all box scores and leaderboards to ensure relational integrity.
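A small sketch of the ID extraction step follows. It assumes player links follow the pattern `/players/<letter>/<player_id>.html` and that IDs are lowercase letters followed by two digits, as in `jamesle01`; both the pattern and the function name are assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed link shape: /players/j/jamesle01.html
PLAYER_HREF = re.compile(r"/players/[a-z]/([a-z]+\d{2})\.html")


def player_id_from_href(href: str) -> Optional[str]:
    """Return the player ID embedded in a player link, or None if absent."""
    match = PLAYER_HREF.search(href)
    return match.group(1) if match else None
```

Keying every row on this ID rather than on the display name means two players who share a name, or one player whose name changes, still join correctly across box scores and leaderboards.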
Instead of re-scraping historical seasons, our pipelines maintain state. We only poll the previous night's box scores and append new rows to your warehouse, drastically reducing compute costs.
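One simple way to keep that state is a persisted set of `game_id` values that have already been delivered. The sketch below assumes a local JSON file as the state store and uses hypothetical function names; a real pipeline might keep the same information in the warehouse itself.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List, Set


def load_seen(path: Path) -> Set[str]:
    """Read the set of already-delivered game IDs (empty on first run)."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def save_seen(path: Path, seen: Set[str]) -> None:
    """Persist the delivered game IDs as a sorted JSON list."""
    path.write_text(json.dumps(sorted(seen)))


def new_rows(rows: Iterable[Dict], seen: Set[str]) -> List[Dict]:
    """Keep only rows whose game has not been delivered yet."""
    return [row for row in rows if row["game_id"] not in seen]


def mark_delivered(rows: Iterable[Dict], seen: Set[str]) -> Set[str]:
    """Return a new set that also contains the game IDs in `rows`."""
    return seen | {row["game_id"] for row in rows}
```

A nightly run would call `load_seen`, filter the scraped rows with `new_rows`, append the result to the warehouse, and only then call `mark_delivered` and `save_seen`, so a failed load is retried the next night.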
Box scores occasionally contain data entry errors. Our QA layer runs sum-checks (e.g., confirming that the individual player points for each team add up to the team's total) and flags anomalies before delivering the payload.
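A sum-check of this kind might look like the sketch below. It assumes each player row carries `game_id`, `team_id`, and `points`, and that team totals are available keyed by `(game_id, team_id)`. The `team_id` column on player rows and the function name are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_point_mismatches(
    player_rows: Iterable[Dict],
    team_totals: Dict[Tuple[str, str], int],
) -> List[Dict]:
    """Compare summed player points with the reported team total per game."""
    summed: Dict[Tuple[str, str], int] = defaultdict(int)
    for row in player_rows:
        summed[(row["game_id"], row["team_id"])] += row["points"]

    mismatches = []
    for (game_id, team_id), expected in team_totals.items():
        actual = summed.get((game_id, team_id), 0)
        if actual != expected:
            mismatches.append(
                {
                    "game_id": game_id,
                    "team_id": team_id,
                    "player_sum": actual,
                    "team_total": expected,
                }
            )
    return mismatches
```

An empty list means every team total reconciles. Any entry returned is a candidate anomaly to flag rather than silently correct.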
Quant syndicates feed play-by-play data and shooting splits into machine learning models to identify pricing inefficiencies in prop markets.
DFS players and season-long fantasy platforms use historical usage rates and pace metrics to project player performance.
NBA and G-League front offices ingest college and international data to build proprietary draft evaluation models.
Journalists and content creators query historical leaderboards and advanced metrics to generate data-driven editorial pieces.
Economists and statisticians use salary and performance data to study contract valuations and labour dynamics.
AI teams use structured play-by-play logs to train predictive text models and automated game recap generators.
"Basketball-Reference holds the definitive history of the NBA, but extracting millions of play-by-play events requires parsing nested, commented-out HTML tables at scale."
Most teams underestimate the complexity of scraping Sports Reference sites. They enforce strict rate limits, embed secondary data tables within HTML comments to optimise load times, and frequently adjust advanced metric formulas. DataFlirt manages the proxy rotation, HTML parsing, and schema validation so your data scientists can focus on building predictive models rather than fixing broken parsers.
Everything supported by our basketball-reference.com scraper — rendered SPA elements, auth walls, rate-limit evasion and beyond.
Open-source tooling on proven cloud infra — no vendor lock-in, full observability.
Scrapy handles crawl orchestration while custom lxml middlewares extract and parse the commented-out HTML tables unique to Sports Reference sites.
We maintain pools of residential ISP proxies to distribute requests safely, ensuring we never trigger the strict rate limit bans.
Pipelines run on AWS ECS. Airflow schedules nightly syncs to capture new box scores immediately after games conclude.
Data delivered to where your team already works — no new tooling required.
About basketball-reference.com scraping, legality, and pipeline operations.
We deploy a distributed architecture using large pools of residential proxies. Each node respects the rate limits, but parallelising across hundreds of IPs allows us to extract historical seasons rapidly without triggering the site's anti-bot protections.
Basketball-Reference embeds secondary tables (like advanced stats and play-by-play) inside HTML comments to speed up initial page rendering. HTML parsers do not treat comment contents as elements, so basic BeautifulSoup scripts that only query the DOM never see these tables. Our parsers explicitly target the commented blocks and parse them as regular markup.
No, we do not provide live in-game data. Basketball-Reference updates its database after games conclude. For live, sub-second latency data, you need a direct API feed from an official sports data provider like Sportradar.
No, we do not extract data from Stathead. Stathead requires a paid subscription and authentication. We only extract publicly available historical data from the main Basketball-Reference domain.
Our scheduled pipelines typically run in the early morning hours (EST) to capture the previous night's completed games, processing and delivering the data to your warehouse by 6:00 AM EST.
Yes, college basketball and WNBA data are supported. Sports Reference publishes them in separate sections of its sites, and we can configure pipelines to extract data from these sources using similar schemas.
20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a complete historical dump of every box score since 1946 or a nightly sync for your betting models, we build and operate the pipeline.