
NBA historical data,
at warehouse scale.

We extract box scores, play-by-play logs, player shooting splits, advanced metrics, and draft history from Basketball-Reference. Delivered as clean JSON, CSV, or Parquet to S3 or BigQuery on your cadence.

Box scores extracted
78,291 /run
Player profiles
5,412 /run
Play-by-play events
14.2M /season
Active pipelines
31
Uptime
99.98%
Data Dictionary

Every field we extract from basketball-reference.com

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Player Profiles objects from basketball-reference.com. All fields typed and schema-versioned.

player_id, name, position, height, weight, birth_date, draft_year, college, career_points, career_rebounds, career_assists

player_profiles
● 200 OK
{
  "player_id": "jamesle01",
  "name": "LeBron James",
  "position": "SF",
  "height": "6-9",
  "weight": 250,
  "draft_year": 2003,
  "career_points": 40474,
  "career_rebounds": 11185
}

Complete list of extractable fields for Game Box Scores objects from basketball-reference.com. All fields typed and schema-versioned.

game_id, date, home_team, away_team, player_id, minutes_played, field_goals, three_pointers, free_throws, rebounds, assists, steals, blocks, turnovers, points

game_box_scores
● 200 OK
{
  "game_id": "202402280LAL",
  "date": "2024-02-28",
  "home_team": "LAL",
  "away_team": "LAC",
  "player_id": "jamesle01",
  "minutes_played": 37.2,
  "points": 34,
  "assists": 8,
  "rebounds": 6
}

Complete list of extractable fields for Play-by-Play objects from basketball-reference.com. All fields typed and schema-versioned.

game_id, quarter, time_remaining, event_type, player_id, team_id, description, home_score, away_score, shot_distance

play-by-play
● 200 OK
{
  "game_id": "202402280LAL",
  "quarter": 4,
  "time_remaining": "11:45",
  "event_type": "make_3pt",
  "player_id": "jamesle01",
  "team_id": "LAL",
  "description": "LeBron James makes 3-pt jump shot (26 ft)",
  "home_score": 80,
  "away_score": 98,
  "shot_distance": 26
}

Complete list of extractable fields for Advanced Stats objects from basketball-reference.com. All fields typed and schema-versioned.

player_id, season, per, true_shooting_pct, usage_pct, offensive_win_shares, defensive_win_shares, win_shares, box_plus_minus, vorp

advanced_stats
● 200 OK
{
  "player_id": "jokicni01",
  "season": "2023-24",
  "per": 31.0,
  "true_shooting_pct": 0.65,
  "usage_pct": 29.3,
  "win_shares": 17.0,
  "box_plus_minus": 13.2,
  "vorp": 10.6
}

Complete list of extractable fields for Draft History objects from basketball-reference.com. All fields typed and schema-versioned.

draft_year, pick_number, round, team_id, player_id, college, years_played, total_games, total_points, total_rebounds, win_shares

draft_history
● 200 OK
{
  "draft_year": 2003,
  "pick_number": 1,
  "round": 1,
  "team_id": "CLE",
  "player_id": "jamesle01",
  "college": "None",
  "years_played": 21,
  "total_games": 1492,
  "win_shares": 263.6
}
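
To make "typed and schema-versioned" concrete, here is a minimal sketch of a type check against the player_profiles sample shown above. The field-to-type map, the version label, and the `type_errors` helper are assumptions for illustration, not DataFlirt's actual schema tooling.

```python
# Hypothetical version label; the page does not publish a real one.
SCHEMA_VERSION = "1.0"

# Types inferred from the player_profiles sample, not from an official schema.
PLAYER_PROFILE_FIELDS = {
    "player_id": str,
    "name": str,
    "position": str,
    "height": str,
    "weight": int,
    "draft_year": int,
    "career_points": int,
    "career_rebounds": int,
}


def type_errors(record, schema):
    """Return a list of human-readable problems; empty when the record conforms."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"{field}: missing")
            continue
        value = record[field]
        # Exact type match, so True/False is not accepted where an int is expected.
        if type(value) is not expected:
            errors.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


sample = {
    "player_id": "jamesle01",
    "name": "LeBron James",
    "position": "SF",
    "height": "6-9",
    "weight": 250,
    "draft_year": 2003,
    "career_points": 40474,
    "career_rebounds": 11185,
}
print(type_errors(sample, PLAYER_PROFILE_FIELDS))
```

A record that fails the check can be held back or flagged before delivery rather than silently coerced.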

Capabilities

Every stat, split, and box score extracted

Our Basketball-Reference scraper navigates complex table structures, uncomments hidden HTML data, and maps player identifiers across decades of historical records.

Full Player Profiles

Extract biographical data, draft information, salary history, and career totals for every player in NBA history.

Game Box Scores

Capture basic and advanced box scores for every game, including inactive players and DNP reasons.

Play-by-Play Logs

Parse event-level data including shot distances, substitution patterns, and running scores for every possession.

Advanced Metrics

Extract PER, Win Shares, Box Plus/Minus, and VORP calculated per season or per game.

Shooting Splits

Gather shooting percentages by distance, quarter, opponent, and days of rest.

Draft History

Scrape all historical draft picks, trade details, and subsequent career outcomes.

WNBA & International

Extract data from WNBA, EuroLeague, and G-League databases using the same normalisation schema.

College Basketball

Pull NCAA stats, tournament history, and recruiting rankings for comprehensive prospect models.

Scheduled Nightly Updates

Run automated pipelines every morning to capture the previous night's box scores and updated season averages.

// engagement pipeline

From URL list to warehouse tables

Brief in. Clean data out.

Define Scope
Day 0

Specify seasons, teams, or specific stat tables required. We map the target schema.

Pipeline Build
Days 2–4

We configure crawlers to handle rate limits and parse commented-out HTML tables.

Validation & QA
Days 4–6

We verify sum totals, check for missing games, and validate advanced metric formulas.

Delivery
ongoing

JSON, CSV, or Parquet pushed to your S3 bucket or Snowflake instance daily.
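
As one illustration of the validation step in the pipeline above, an advanced metric can be recomputed from box-score inputs and compared with the value on the source page. The sketch below uses the standard true shooting formula, TS% = PTS / (2 × (FGA + 0.44 × FTA)). The function names, the tolerance, and the sample numbers are assumptions chosen for the example.

```python
def true_shooting_pct(points, fga, fta):
    """True shooting percentage; None when there are no shot attempts."""
    attempts = 2 * (fga + 0.44 * fta)
    return points / attempts if attempts else None


def matches_published(points, fga, fta, published, tolerance=0.001):
    """Compare a recomputed TS% with a published value rounded to 3 decimals."""
    computed = true_shooting_pct(points, fga, fta)
    if computed is None:
        return published is None
    return abs(computed - published) <= tolerance


# Invented inputs: 30 points on 20 field-goal and 5 free-throw attempts.
print(round(true_shooting_pct(30, 20, 5), 3))
print(matches_published(30, 20, 5, published=0.676))
```

Rows where the recomputed and published values disagree are candidates for manual review, since the discrepancy may come from either a parsing fault or a formula change at the source.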

Under the hood

Overcoming Sports Reference scraping hurdles

Sports Reference sites employ strict rate limits and unusual DOM structures. Here is how we maintain stable extraction pipelines.

pipeline-monitor · basketball-reference.com · live ● active
// fingerprinting
Identity rotation
TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0
// pagination
Page coverage
48,291 pages queued (running)
// observability
Pipeline health
Uptime: 99.9%
p99 latency: 142ms
Null rate: 0.3%
Alerts: 2
Rate limits
Strict adherence to request quotas

Sports Reference enforces a strict 20 requests per minute limit, blocking IPs that exceed this. We manage distributed crawl clusters with residential proxies to parallelise extraction without triggering bans.
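
A minimal sketch of per-IP pacing, assuming the 20 requests per minute figure quoted above. The `MinIntervalLimiter` class is hypothetical and is not a description of DataFlirt's crawler; the clock and sleep functions are injectable only so the behaviour can be checked without waiting.

```python
import time


class MinIntervalLimiter:
    """Spaces out requests from one IP so they never exceed a per-minute quota."""

    def __init__(self, max_per_minute=20, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute  # 3.0 s at 20 req/min
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next request may start

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay
```

In a distributed setup, each outbound IP would own its own limiter and call `wait()` before every request, so the quota holds per IP regardless of how many workers run in parallel.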

DOM structure
Parsing commented-out HTML tables

To optimise page load times, Basketball-Reference hides secondary data tables inside HTML comments. Standard scrapers miss this data entirely. Our parsers extract and render these comments into queryable DOM objects.
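
The idea can be sketched with the Python standard library alone: collect every comment that contains a table, then parse the comment text as ordinary HTML. The sample fragment, its element ids, and the helper names are invented for illustration; the production parsers described elsewhere on this page use lxml.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the raw text of every HTML comment that contains a table."""

    def __init__(self):
        super().__init__()
        self.table_comments = []

    def handle_comment(self, data):
        if "<table" in data:
            self.table_comments.append(data)


class TableRows(HTMLParser):
    """Turns <tr>/<th>/<td> markup into a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def hidden_tables(page_html):
    """Return one list of rows for each table-bearing HTML comment."""
    collector = CommentCollector()
    collector.feed(page_html)
    tables = []
    for comment in collector.table_comments:
        parser = TableRows()
        parser.feed(comment)
        tables.append(parser.rows)
    return tables


# Invented page fragment; real pages are larger and have more columns.
SAMPLE = """
<div id="all_advanced">
<!--
<table id="advanced">
  <tr><th>Season</th><th>PER</th></tr>
  <tr><td>2023-24</td><td>31.0</td></tr>
</table>
-->
</div>
"""
print(hidden_tables(SAMPLE))
```

The sketch treats each comment as a single table; a comment holding several tables would need the rows split per `<table>` element.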

Data linking
Consistent player ID mapping

Display names are unreliable join keys: players occasionally change their names, and different players can share the same name. We extract and normalise the unique Basketball-Reference player IDs (e.g., 'jamesle01') across all box scores and leaderboards to ensure relational integrity.
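
A minimal sketch of keying rows on the ID rather than the display name. It assumes player links follow the shape `/players/<letter>/<player_id>.html` and that IDs are lowercase letters followed by two digits; the helper names and the sample rows (including the `smithjo01` and `smithjo02` IDs) are hypothetical.

```python
import re

# Assumed link shape: /players/<first letter>/<player_id>.html
PLAYER_HREF = re.compile(r"/players/[a-z]/([a-z]+\d{2})\.html")


def player_id_from_href(href):
    """Return the player ID embedded in a player-page link, or None."""
    match = PLAYER_HREF.search(href)
    return match.group(1) if match else None


def index_by_player(rows):
    """Group rows by player ID so identical display names stay separate."""
    grouped = {}
    for row in rows:
        pid = player_id_from_href(row["href"])
        if pid is None:
            continue  # skip rows without a recognisable player link
        grouped.setdefault(pid, []).append(row)
    return grouped


# Invented rows: two different players who share a display name.
rows = [
    {"name": "John Smith", "href": "/players/s/smithjo01.html"},
    {"name": "John Smith", "href": "/players/s/smithjo02.html"},
    {"name": "LeBron James", "href": "/players/j/jamesle01.html"},
]
print(sorted(index_by_player(rows)))
```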

Change detection
Incremental nightly updates

Instead of re-scraping historical seasons, our pipelines maintain state. We only poll the previous night's box scores and append new rows to your warehouse, drastically reducing compute costs.
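
One simple way to maintain such state is a persisted set of game IDs that have already been delivered. The sketch below assumes a local JSON state file and hypothetical helper names; a production pipeline would more likely keep this in a database.

```python
import json
from pathlib import Path


def load_seen(state_path):
    """Read the set of game IDs already delivered; empty on the first run."""
    path = Path(state_path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def select_new_games(candidate_ids, seen):
    """Keep only games not yet delivered, preserving the input order."""
    return [gid for gid in candidate_ids if gid not in seen]


def save_seen(state_path, seen):
    """Persist the delivered game IDs for the next run."""
    Path(state_path).write_text(json.dumps(sorted(seen)))


def run_once(state_path, candidate_ids):
    """One nightly pass: pick new games, then record them as delivered."""
    seen = load_seen(state_path)
    new_ids = select_new_games(candidate_ids, seen)
    # Fetching and appending rows for new_ids would happen here.
    save_seen(state_path, seen | set(new_ids))
    return new_ids
```

Running `run_once` twice with the same candidate list returns the new games on the first call and an empty list on the second, which is what keeps historical seasons from being fetched again.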

Validation
Automated sum-check verification

Box scores occasionally contain data entry errors. Our QA layer runs sum-checks (e.g., ensuring that the sum of individual player points equals the team's total points) and flags anomalies before delivering the payload.
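
A minimal sketch of such a sum-check, using the `team_id` and `points` field names from the data dictionary. The function name and the sample figures are invented for illustration.

```python
def check_team_points(player_rows, team_totals):
    """Return the teams whose summed player points differ from the team total."""
    summed = {}
    for row in player_rows:
        summed[row["team_id"]] = summed.get(row["team_id"], 0) + row["points"]
    return {
        team: {"players": summed.get(team, 0), "team": total}
        for team, total in team_totals.items()
        if summed.get(team, 0) != total
    }


# Invented figures: the LAC player rows add up to 48, not the stated 50.
player_rows = [
    {"team_id": "LAL", "points": 34},
    {"team_id": "LAL", "points": 20},
    {"team_id": "LAC", "points": 30},
    {"team_id": "LAC", "points": 18},
]
team_totals = {"LAL": 54, "LAC": 50}
print(check_team_points(player_rows, team_totals))
```

An empty result means every team total reconciles; any entry in the result identifies a game that should be flagged rather than delivered as-is.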

Applications

Who uses NBA historical data

Teams across industries use basketball-reference.com data to build competitive products and smarter operations.

01
Sports Betting Models

Quant syndicates feed play-by-play data and shooting splits into machine learning models to identify pricing inefficiencies in prop markets.

02
Fantasy Basketball

DFS players and season-long fantasy platforms use historical usage rates and pace metrics to project player performance.

03
Front Office Analytics

NBA and G-League front offices ingest college and international data to build proprietary draft evaluation models.

04
Sports Media

Journalists and content creators query historical leaderboards and advanced metrics to generate data-driven editorial pieces.

05
Academic Research

Economists and statisticians use salary and performance data to study contract valuations and labour dynamics.

06
ML Training Data

AI teams use structured play-by-play logs to train predictive text models and automated game recap generators.

Why DataFlirt

"Basketball-Reference holds the definitive history of the NBA, but extracting millions of play-by-play events requires parsing nested, commented-out HTML tables at scale."

Most teams underestimate the complexity of scraping Sports Reference sites. They enforce strict rate limits, embed secondary data tables within HTML comments to optimise load times, and frequently adjust advanced metric formulas. DataFlirt manages the proxy rotation, HTML parsing, and schema validation so your data scientists can focus on building predictive models rather than fixing broken parsers.

Technical Spec

Basketball-Reference scraper capabilities

Everything our basketball-reference.com scraper supports, from commented-out table extraction to rate-limit handling and incremental syncs.

Rate limit circumvention
Distributed residential proxies to respect 20 req/min limits per IP
Supported
Commented HTML extraction
Custom parsers to extract tables hidden within HTML comments
Supported
Historical box scores
Complete extraction of all regular season and playoff games
Supported
Play-by-play parsing
Sequential event extraction with running scores and timestamps
Supported
Advanced metric formulas
Extraction of PER, WS, BPM directly from the source tables
Supported
Shooting location coordinates
Parsing shot chart data into X/Y coordinates when available
Supported
Change detection (diffs)
Only scrape new box scores added since the last pipeline run
Supported
Webhook delivery
HTTP POST delivery upon completion of nightly syncs
Supported
Stathead subscription data
Custom multi-season queries locked behind the Stathead paywall
Partial
Real-time live game feeds
Live in-game data extraction (site updates post-game)
Partial
Infrastructure

Infrastructure powering the extraction

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy · Playwright · Python 3.12 · Redis · PostgreSQL · Apache Airflow · AWS Lambda · S3 · CloudWatch · 2Captcha · CapSolver · Residential Proxies · Docker · Kubernetes · Grafana · Prometheus · BeautifulSoup
Scrapy + Custom Parsers

Scrapy handles crawl orchestration while custom lxml middlewares extract and parse the commented-out HTML tables unique to Sports Reference sites.

Residential Proxy Infrastructure

We maintain pools of residential ISP proxies to distribute requests safely, ensuring we never trigger the strict rate limit bans.

Cloud-Native Orchestration

Pipelines run on AWS ECS. Airflow schedules nightly syncs to capture new box scores immediately after games conclude.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON
Nested structures ideal for play-by-play event arrays
CSV
Flat files for easy import into pandas or R
XLS
Excel formats for quick manual analysis
Parquet
Columnar format for BigQuery and Snowflake ingestion
AWS S3
Direct bucket delivery on a nightly schedule, compatible with any data lake
Webhook
HTTP POST notifications when new data is ready
API
REST endpoints to query your extracted historical dataset
PostgreSQL
Direct inserts into your relational database schema
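
To show how the nested JSON and flat CSV formats listed above relate, here is a small sketch that flattens a game object holding an event array into one CSV row per event. The nested layout (an `events` key under each game) is an assumption for illustration; field names follow the play-by-play data dictionary.

```python
import csv
import io


def flatten_game(game):
    """Yield one flat row per event, repeating the game-level key on each row."""
    for event in game["events"]:
        yield {"game_id": game["game_id"], **event}


def to_csv(games, fieldnames):
    """Serialise nested games to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for game in games:
        writer.writerows(flatten_game(game))
    return buffer.getvalue()


# Assumed nested shape, populated with values from the play-by-play sample.
games = [
    {
        "game_id": "202402280LAL",
        "events": [
            {
                "quarter": 4,
                "time_remaining": "11:45",
                "event_type": "make_3pt",
                "player_id": "jamesle01",
            }
        ],
    }
]
FIELDS = ["game_id", "quarter", "time_remaining", "event_type", "player_id"]
print(to_csv(games, FIELDS))
```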
// faq

Common questions.

About basketball-reference.com scraping, legality, and pipeline operations.

Ask us directly →
How do you handle the 20 requests per minute limit?

We deploy a distributed architecture using large pools of residential proxies. Each node stays within the per-IP limit, and parallelising across hundreds of IPs lets us extract historical seasons quickly without triggering the site's anti-bot protections.
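
The arithmetic behind that answer can be sketched as a lower-bound estimate. The function name is hypothetical, the IP count of 100 is an assumed example, and the estimate ignores retries, response latency, and scheduling overhead.

```python
import math


def crawl_minutes(pages, ip_count, per_ip_per_minute=20):
    """Lower-bound minutes to fetch `pages` when every IP runs at its quota."""
    return math.ceil(pages / (ip_count * per_ip_per_minute))


# 78,291 is the box-score count per run quoted at the top of this page.
print(crawl_minutes(78_291, ip_count=1))    # a single IP
print(crawl_minutes(78_291, ip_count=100))  # 100 IPs in parallel
```

Under these assumptions a single IP needs roughly 65 hours for one full run, while 100 IPs bring the same workload down to well under an hour.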

Why are some tables missing when I try to scrape it myself?

Basketball-Reference embeds secondary tables (like advanced stats and play-by-play) inside HTML comments to speed up initial page rendering. The markup is present in the HTTP response, but HTML parsers treat it as comment text, so table selectors in basic BeautifulSoup scripts never see it. Our parsers explicitly target these commented blocks and parse them as regular tables.

Can you provide real-time data during games?

No. Basketball-Reference updates its database after games conclude. For live, sub-second latency data, you require a direct API feed from an official sports data provider like Sportradar.

Do you scrape Stathead queries?

No. Stathead requires a paid subscription and authentication. We only extract publicly available historical data from the main Basketball-Reference domain.

How quickly are new box scores available?

Our scheduled pipelines typically run in the early morning hours (EST) to capture the previous night's completed games, processing and delivering the data to your warehouse by 6:00 AM EST.

Can I get college and WNBA data too?

Yes. Sports Reference publishes college basketball and WNBA data in separate sections of its sites. We can configure pipelines to extract data from these sources using similar schemas.

$ dataflirt scope --new-project --source=basketball-reference.com ready

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a complete historical dump of every box score since 1946 or a nightly sync for your betting models, we build and operate the pipeline.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h