
RUN * 31 active pipelines * basketball-reference.com live

NBA historical data,
at warehouse scale.

We extract box scores, play-by-play logs, player shooting splits, advanced metrics, and draft history from Basketball-Reference. Delivered as clean JSON, CSV, or Parquet to S3 or BigQuery on your cadence.

Get data from basketball-reference.com → See how it works

Box scores extracted

78,291 /run

Player profiles

5,412 /run

Play-by-play events

14.2M /season

Active pipelines

31

Uptime

99.98%

◆ NBA Player Stats ◆ Historical Box Scores ◆ Play-by-Play Logs ◆ Advanced Metrics ◆ Shooting Splits ◆ Draft History ◆ WNBA Data ◆ College Basketball Stats ◆ G-League Data ◆ Salary Cap Figures ◆ Trade History ◆ Managed Pipeline ◆ S3 / BigQuery Delivery ◆ Bengaluru HQ

Data Dictionary

Every field we extract from basketball-reference.com

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Player Profiles objects from basketball-reference.com. All fields typed and schema-versioned.

player_id · name · position · height · weight · birth_date · draft_year · college · career_points · career_rebounds · career_assists

"player_id": "jamesle01",
"name": "LeBron James",
"position": "SF",
"height": "6-9",
"weight": 250,
"draft_year": 2003,
"career_points": 40474,
"career_rebounds": 11185
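
The `height` value in this sample is a feet-inches string (`"6-9"`), so numeric comparisons need a conversion step on the consumer side. A minimal sketch, assuming every height follows the `feet-inches` pattern shown above; `height_to_inches` and the `height_in` key are hypothetical names, not part of the delivered schema:

```python
def height_to_inches(height: str) -> int:
    """Convert a feet-inches string such as "6-9" into total inches."""
    feet, inches = height.split("-")
    return int(feet) * 12 + int(inches)


# Record trimmed from the sample above; "height_in" is a hypothetical derived key.
record = {"player_id": "jamesle01", "height": "6-9", "weight": 250}
record["height_in"] = height_to_inches(record["height"])
print(record["height_in"])  # 81
```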


Complete list of extractable fields for Game Box Scores objects from basketball-reference.com. All fields typed and schema-versioned.

game_id · date · home_team · away_team · player_id · minutes_played · field_goals · three_pointers · free_throws · rebounds · assists · steals · blocks · turnovers · points

"game_id": "202402280LAL",
"date": "2024-02-28",
"home_team": "LAL",
"away_team": "LAC",
"player_id": "jamesle01",
"minutes_played": 37.2,
"points": 34,
"assists": 8,
"rebounds": 6
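
The `game_id` in this sample appears to pack the game date and a team code into one string. A minimal sketch of splitting it, assuming the layout is eight date digits, one sequence digit, and a three-letter team code; that layout is inferred from the sample value `202402280LAL` rather than from a documented contract, and `parse_game_id` is a hypothetical helper:

```python
import re
from datetime import date, datetime

# Assumed layout: YYYYMMDD + one sequence digit + three-letter team code.
GAME_ID_RE = re.compile(r"^(\d{8})(\d)([A-Z]{3})$")


def parse_game_id(game_id: str) -> tuple[date, int, str]:
    """Split a game_id into (game date, sequence digit, team code)."""
    match = GAME_ID_RE.match(game_id)
    if match is None:
        raise ValueError(f"unexpected game_id format: {game_id!r}")
    game_date = datetime.strptime(match.group(1), "%Y%m%d").date()
    return game_date, int(match.group(2)), match.group(3)


game_date, sequence, team_code = parse_game_id("202402280LAL")
print(game_date.isoformat(), sequence, team_code)  # 2024-02-28 0 LAL
```

Cross-checking the parsed date against the record's own `date` field is a cheap consistency test before loading.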


Complete list of extractable fields for Play-by-Play objects from basketball-reference.com. All fields typed and schema-versioned.

game_id · quarter · time_remaining · event_type · player_id · team_id · description · home_score · away_score · shot_distance

"game_id": "202402280LAL",
"quarter": 4,
"time_remaining": "11:45",
"event_type": "make_3pt",
"player_id": "jamesle01",
"team_id": "LAL",
"description": "LeBron James makes 3-pt jump shot (26 ft)",
"home_score": 80,
"away_score": 98,
"shot_distance": 26
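
Ordering events across a whole game is easier with a single elapsed-time value than with `quarter` plus `time_remaining`. A minimal sketch of that conversion, assuming NBA period lengths (12-minute quarters, 5-minute overtimes numbered from 5); other leagues would need different constants, and `elapsed_seconds` is a hypothetical helper:

```python
REGULATION_QUARTER_SECONDS = 12 * 60  # assumption: NBA regulation quarter
OVERTIME_SECONDS = 5 * 60             # assumption: NBA overtime period


def elapsed_seconds(quarter: int, time_remaining: str) -> int:
    """Seconds of game time elapsed at an event, from quarter and clock."""
    minutes, seconds = time_remaining.split(":")
    # int(float(...)) also accepts clocks with tenths, such as "11:45.0".
    remaining = int(minutes) * 60 + int(float(seconds))
    if quarter <= 4:
        period_length = REGULATION_QUARTER_SECONDS
        before_period = (quarter - 1) * REGULATION_QUARTER_SECONDS
    else:
        period_length = OVERTIME_SECONDS
        before_period = 4 * REGULATION_QUARTER_SECONDS + (quarter - 5) * OVERTIME_SECONDS
    return before_period + (period_length - remaining)


print(elapsed_seconds(4, "11:45"))  # 2175
```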


Complete list of extractable fields for Advanced Stats objects from basketball-reference.com. All fields typed and schema-versioned.

player_id · season · per · true_shooting_pct · usage_pct · offensive_win_shares · defensive_win_shares · win_shares · box_plus_minus · vorp

"player_id": "jokicni01",
"season": "2023-24",
"per": 31.0,
"true_shooting_pct": 0.65,
"usage_pct": 29.3,
"win_shares": 17.0,
"box_plus_minus": 13.2,
"vorp": 10.6
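
One way to sanity-check `true_shooting_pct` on the consumer side is to recompute it from counting stats. The sketch below uses the widely published formula TS% = PTS / (2 × (FGA + 0.44 × FTA)). The attempt inputs `fga` and `fta` are assumptions, since the field list above does not name them, and the numbers are made-up inputs rather than real player totals:

```python
def true_shooting_pct(points: float, fga: float, fta: float) -> float:
    """TS% = PTS / (2 * (FGA + 0.44 * FTA))."""
    denominator = 2 * (fga + 0.44 * fta)
    if denominator == 0:
        raise ValueError("no shot attempts: TS% is undefined")
    return points / denominator


def matches_reported(reported: float, points: float, fga: float, fta: float,
                     tol: float = 0.0005) -> bool:
    """True when the recomputed TS% agrees with the reported value within tol."""
    return abs(true_shooting_pct(points, fga, fta) - reported) <= tol


# Made-up inputs: 30 points on 20 field goal attempts and 5 free throw attempts.
print(round(true_shooting_pct(30, 20, 5), 3))  # 0.676
print(matches_reported(0.676, 30, 20, 5))      # True
```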


Complete list of extractable fields for Draft History objects from basketball-reference.com. All fields typed and schema-versioned.

draft_year · pick_number · round · team_id · player_id · college · years_played · total_games · total_points · total_rebounds · win_shares

"draft_year": 2003,
"pick_number": 1,
"round": 1,
"team_id": "CLE",
"player_id": "jamesle01",
"college": "None",
"years_played": 21,
"total_games": 1492,
"win_shares": 263.6


Capabilities

Every stat, split, and box score extracted

Our Basketball-Reference scraper navigates complex table structures, uncomments hidden HTML data, and maps player identifiers across decades of historical records.

Full Player Profiles

Extract biographical data, draft information, salary history, and career totals for every player in NBA history.

Game Box Scores

Capture basic and advanced box scores for every game, including inactive players and DNP reasons.

Play-by-Play Logs

Parse event-level data including shot distances, substitution patterns, and running scores for every possession.

Advanced Metrics

Extract PER, Win Shares, Box Plus/Minus, and VORP calculated per season or per game.

Shooting Splits

Gather shooting percentages by distance, quarter, opponent, and days of rest.

Draft History

Scrape all historical draft picks, trade details, and subsequent career outcomes.

WNBA & International

Extract data from WNBA, EuroLeague, and G-League databases using the same normalisation schema.

College Basketball

Pull NCAA stats, tournament history, and recruiting rankings for comprehensive prospect models.

Scheduled Nightly Updates

Run automated pipelines every morning to capture the previous night's box scores and updated season averages.

// engagement pipeline

From URL list to warehouse tables

Brief in. Clean data out.

Define Scope

Day 0

Specify seasons, teams, or specific stat tables required. We map the target schema.

Pipeline Build

Days 2–4

We configure crawlers to handle rate limits and parse commented-out HTML tables.

Validation & QA

Days 4–6

We verify sum totals, check for missing games, and validate advanced metric formulas.

Delivery

ongoing

JSON, CSV, or Parquet pushed to your S3 bucket or Snowflake instance daily.

Under the hood

Overcoming Sports Reference scraping hurdles

Sports Reference sites employ strict rate limits and unusual DOM structures. Here is how we maintain stable extraction pipelines.

// fingerprinting

Identity rotation

TLS fingerprint: randomised

User-agent: rotated

IP pool: residential

Challenges blocked: 0

// pagination

Page coverage

48,291 pages queued, running

// observability

Pipeline health

99.9% uptime

142ms p99 latency

0.3% null rate

alerts

Rate limits

Strict adherence to request quotas

Sports Reference enforces a strict 20 requests per minute limit, blocking IPs that exceed this. We manage distributed crawl clusters with residential proxies to parallelise extraction without triggering bans.
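
The per-IP budget described above can be sketched as a sliding-window limiter. This is an illustration under stated assumptions, not DataFlirt's implementation: the class name `PerIpRateLimiter`, its methods, and the example addresses (taken from a documentation-only IP range) are all hypothetical, and the 20-per-60-seconds default simply mirrors the limit quoted in the text.

```python
import time
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Sliding window: at most `limit` requests per `window` seconds per IP."""

    def __init__(self, limit: int = 20, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._sent = defaultdict(deque)  # IP -> timestamps of recent requests

    def wait_time(self, ip: str) -> float:
        """Seconds `ip` must wait before its next request (0.0 if allowed now)."""
        now = self.clock()
        sent = self._sent[ip]
        # Forget requests that have aged out of the window.
        while sent and now - sent[0] >= self.window:
            sent.popleft()
        if len(sent) < self.limit:
            return 0.0
        return self.window - (now - sent[0])

    def record(self, ip: str) -> None:
        """Note that a request was just sent through `ip`."""
        self._sent[ip].append(self.clock())


limiter = PerIpRateLimiter()
for _ in range(20):
    limiter.record("203.0.113.7")
print(limiter.wait_time("203.0.113.7") > 0)  # True: the 21st request must wait
print(limiter.wait_time("203.0.113.8"))      # 0.0: each IP has its own budget
```

A scheduler would call `wait_time` before dispatching a request and route the work to another IP, or sleep, whenever the result is positive.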

DOM structure

Parsing commented-out HTML tables

To optimise page load times, Basketball-Reference hides secondary data tables inside HTML comments. Standard scrapers miss this data entirely. Our parsers extract and render these comments into queryable DOM objects.
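
A minimal sketch of recovering commented-out tables using only the Python standard library. The HTML snippet is a made-up stand-in for a real page, and `CommentedTableFinder` and `extract_commented_tables` are hypothetical names; a production parser would hand each recovered fragment to a full HTML parser afterwards.

```python
from html.parser import HTMLParser


class CommentedTableFinder(HTMLParser):
    """Collect the bodies of HTML comments that contain a <table> element."""

    def __init__(self):
        super().__init__()
        self.table_fragments = []

    def handle_comment(self, data: str) -> None:
        # `data` is the raw text between <!-- and -->.
        if "<table" in data:
            self.table_fragments.append(data.strip())


def extract_commented_tables(html: str) -> list[str]:
    finder = CommentedTableFinder()
    finder.feed(html)
    finder.close()
    return finder.table_fragments


# Made-up page fragment: one commented-out table and one ordinary comment.
page = """
<div id="all_advanced">
  <!--
  <table id="advanced"><tr><td>31.0</td></tr></table>
  -->
</div>
<!-- plain note, no table -->
"""
fragments = extract_commented_tables(page)
print(len(fragments))  # 1
```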

Data linking

Consistent player ID mapping

Players sometimes change names, and different players can share the same name. We extract and normalise the unique Basketball-Reference player IDs (e.g., 'jamesle01') across all box scores and leaderboards to ensure relational integrity.
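
A minimal sketch of keying rows on the player ID rather than the display name. It assumes player links follow the `/players/<letter>/<id>.html` pattern; the rows below are fabricated placeholders and `player_id_from_href` is a hypothetical helper.

```python
import re
from typing import Optional

# Assumed link pattern: /players/<first letter>/<player_id>.html
PLAYER_HREF_RE = re.compile(r"/players/[a-z]/([^/.]+)\.html")


def player_id_from_href(href: str) -> Optional[str]:
    """Return the player ID embedded in a player link, or None if absent."""
    match = PLAYER_HREF_RE.search(href)
    return match.group(1) if match else None


# Fabricated rows: identical display names, different IDs.
rows = [
    {"name": "John Smith", "href": "/players/s/smithjo01.html"},
    {"name": "John Smith", "href": "/players/s/smithjo02.html"},
]
ids = [player_id_from_href(row["href"]) for row in rows]
print(ids)  # ['smithjo01', 'smithjo02']
```

Joining on the ID keeps the two rows distinct even though a join on `name` would merge them.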

Change detection

Incremental nightly updates

Instead of re-scraping historical seasons, our pipelines maintain state. We poll only the previous night's box scores and append the new rows to your warehouse, so compute scales with one night of games rather than the full archive.
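
A minimal sketch of the state-keeping idea, assuming state is simply a set of already-delivered `game_id` values persisted as JSON. The file name, function names, and the `NEXT_GAME_ID` placeholder are hypothetical; a production pipeline would more likely keep this state in a database.

```python
import json
import tempfile
from pathlib import Path


def load_seen(state_path: Path) -> set[str]:
    """Load previously delivered game IDs; empty set on the first run."""
    if state_path.exists():
        return set(json.loads(state_path.read_text()))
    return set()


def select_new_games(candidates: list[str], seen: set[str]) -> list[str]:
    """Return candidate game IDs not yet delivered, preserving order."""
    return [game_id for game_id in candidates if game_id not in seen]


def save_seen(state_path: Path, seen: set[str]) -> None:
    state_path.write_text(json.dumps(sorted(seen)))


with tempfile.TemporaryDirectory() as tmp:
    state_path = Path(tmp) / "seen_games.json"
    # First run: nothing seen yet, so the candidate is new.
    seen = load_seen(state_path)
    first = select_new_games(["202402280LAL"], seen)
    save_seen(state_path, seen | set(first))
    # Second run: the delivered ID is skipped, only the placeholder is new.
    second = select_new_games(["202402280LAL", "NEXT_GAME_ID"], load_seen(state_path))
    print(first, second)  # ['202402280LAL'] ['NEXT_GAME_ID']
```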

Validation

Automated sum-check verification

Box scores occasionally contain data entry errors. Our QA layer runs sum-checks (e.g., ensuring that individual player points add up to the team's total points) and flags anomalies before delivering the payload.
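
A minimal sketch of one such sum-check. The player rows and team totals are made-up numbers, and `points_sum_check` is a hypothetical name; a real QA layer would run the same comparison for rebounds, assists, and other additive columns.

```python
def points_sum_check(player_rows: list[dict], team_total: int) -> dict:
    """Compare summed player points with the reported team total."""
    player_sum = sum(row["points"] for row in player_rows)
    return {
        "player_sum": player_sum,
        "team_total": team_total,
        "ok": player_sum == team_total,
    }


# Made-up box score rows for one team.
rows = [{"points": 34}, {"points": 20}, {"points": 12}]
print(points_sum_check(rows, team_total=66)["ok"])  # True
print(points_sum_check(rows, team_total=68)["ok"])  # False: flag for review
```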

Applications

Who uses NBA historical data

Teams across industries use basketball-reference.com data to build competitive products and smarter operations.

Sports Betting Models

Quant syndicates feed play-by-play data and shooting splits into machine learning models to identify pricing inefficiencies in prop markets.

Fantasy Basketball

DFS players and season-long fantasy platforms use historical usage rates and pace metrics to project player performance.

Front Office Analytics

NBA and G-League front offices ingest college and international data to build proprietary draft evaluation models.

Sports Media

Journalists and content creators query historical leaderboards and advanced metrics to generate data-driven editorial pieces.

Academic Research

Economists and statisticians use salary and performance data to study contract valuations and labour dynamics.

ML Training Data

AI teams use structured play-by-play logs to train predictive text models and automated game recap generators.

Why DataFlirt

"Basketball-Reference holds the definitive history of the NBA, but extracting millions of play-by-play events requires parsing nested, commented-out HTML tables at scale."

Most teams underestimate the complexity of scraping Sports Reference sites. They enforce strict rate limits, embed secondary data tables within HTML comments to optimise load times, and frequently adjust advanced metric formulas. DataFlirt manages the proxy rotation, HTML parsing, and schema validation so your data scientists can focus on building predictive models rather than fixing broken parsers.

Technical Spec

Basketball-Reference scraper capabilities

Everything supported by our basketball-reference.com scraper — rendered SPA elements, auth walls, rate-limit evasion and beyond.

Rate limit circumvention

Distributed residential proxies to respect 20 req/min limits per IP

Supported

Commented HTML extraction

Custom parsers to extract tables hidden within HTML comments

Supported

Historical box scores

Complete extraction of all regular season and playoff games

Supported

Play-by-play parsing

Sequential event extraction with running scores and timestamps

Supported

Advanced metric formulas

Extraction of PER, WS, BPM directly from the source tables

Supported

Shooting location coordinates

Parsing shot chart data into X/Y coordinates when available

Supported

Change detection (diffs)

Only scrape new box scores added since the last pipeline run

Supported

Webhook delivery

HTTP POST delivery upon completion of nightly syncs

Supported

Stathead subscription data

Custom multi-season queries locked behind the Stathead paywall

Partial

Real-time live game feeds

Live in-game data extraction (site updates post-game)

Partial

Infrastructure

Infrastructure powering the extraction

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy · Playwright · Python 3.12 · Redis · PostgreSQL · Apache Airflow · AWS Lambda · S3 · CloudWatch · 2Captcha · CapSolver · Residential Proxies · Docker · Kubernetes · Grafana · Prometheus · BeautifulSoup

Scrapy + Custom Parsers

Scrapy handles crawl orchestration while custom lxml middlewares extract and parse the commented-out HTML tables unique to Sports Reference sites.

Residential Proxy Infrastructure

We maintain pools of residential ISP proxies to distribute requests safely, ensuring we never trigger the strict rate limit bans.

Cloud-Native Orchestration

Pipelines run on AWS ECS. Airflow schedules nightly syncs to capture new box scores immediately after games conclude.
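
For the Scrapy layer mentioned above, throttling is largely a matter of settings. A minimal sketch of a `settings.py` fragment, assuming one crawl process per exit IP. The setting names are standard Scrapy settings; the values and the safety margin are illustrative assumptions, not DataFlirt's production configuration.

```python
# settings.py (sketch): keep one Scrapy process under 20 requests per minute.
REQUESTS_PER_MINUTE = 20       # limit quoted on this page
SAFETY_MARGIN_SECONDS = 0.5    # assumption: headroom below the limit

# 60 / 20 = 3.0 seconds, plus the margin, between consecutive requests.
DOWNLOAD_DELAY = 60 / REQUESTS_PER_MINUTE + SAFETY_MARGIN_SECONDS

# The default (True) varies the delay between 0.5x and 1.5x of DOWNLOAD_DELAY,
# which could let some requests fire faster than the limit allows.
RANDOMIZE_DOWNLOAD_DELAY = False

# One in-flight request per domain, so the delay is the only pacing factor.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Retry on rate-limit and transient server responses.
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
```

These settings pace a single process. When several proxies share one process, the per-domain delay throttles all of them together, so per-IP budgeting would need separate logic.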

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON

Nested structures ideal for play-by-play event arrays

CSV

Flat files for easy import into pandas or R

XLS

Excel formats for quick manual analysis

Parquet

Columnar format for BigQuery and Snowflake ingestion

AWS S3

Direct bucket delivery on a nightly schedule

Webhook

HTTP POST notifications when new data is ready

API

REST endpoints to query your extracted historical dataset

PostgreSQL

Direct inserts into your relational database schema

Direct bucket delivery — compatible with any data lake
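
The difference between the nested JSON and flat CSV shapes can be sketched with the standard library. The second event row and the `miss_2pt` event type are made-up placeholders, and `to_nested` and `to_csv` are hypothetical helpers rather than the delivery code itself.

```python
import csv
import io
import json
from itertools import groupby
from operator import itemgetter

events = [
    {"game_id": "202402280LAL", "quarter": 4, "time_remaining": "11:45", "event_type": "make_3pt"},
    # Placeholder row, invented for illustration.
    {"game_id": "202402280LAL", "quarter": 4, "time_remaining": "11:20", "event_type": "miss_2pt"},
]


def to_nested(rows: list[dict]) -> list[dict]:
    """One object per game, with its events as a nested array."""
    ordered = sorted(rows, key=itemgetter("game_id"))  # groupby needs sorted input
    return [
        {
            "game_id": game_id,
            "events": [{k: v for k, v in row.items() if k != "game_id"} for row in group],
        }
        for game_id, group in groupby(ordered, key=itemgetter("game_id"))
    ]


def to_csv(rows: list[dict]) -> str:
    """One line per event, with game_id repeated on every row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(json.dumps(to_nested(events), indent=2))
print(to_csv(events))
```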

// faq

Common questions.

About basketball-reference.com scraping, legality, and pipeline operations.

Ask us directly →

How do you handle the 20 requests per minute limit?

We deploy a distributed architecture using large pools of residential proxies. Each node respects the rate limits, but parallelising across hundreds of IPs allows us to extract historical seasons rapidly without violating the site's anti-bot protections.
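
The arithmetic behind this answer can be sketched in a few lines. The 100-IP pool size is an assumption (the text says only "hundreds of IPs"), and the page count reuses the 78,291 box scores per run quoted earlier on this page, assuming one page per box score. `crawl_minutes` is a hypothetical helper.

```python
import math


def crawl_minutes(pages: int, ips: int, per_ip_per_minute: int = 20) -> int:
    """Minimum whole minutes to fetch `pages` when each IP stays within its limit."""
    return math.ceil(pages / (ips * per_ip_per_minute))


print(crawl_minutes(78_291, ips=1))    # 3915 minutes, roughly 65 hours
print(crawl_minutes(78_291, ips=100))  # 40 minutes
```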

Why are some tables missing when I try to scrape it myself?

Basketball-Reference embeds secondary tables (like advanced stats and play-by-play) inside HTML comments to speed up initial page rendering. An HTTP client still downloads those comments, but a basic BeautifulSoup script that searches for table elements will not find them, because commented-out markup is never parsed into elements. Our parsers explicitly target these commented blocks and parse their contents.

Can you provide real-time data during games?

No. Basketball-Reference updates its database after games conclude. For live, sub-second latency data, you require a direct API feed from an official sports data provider like Sportradar.

Do you scrape Stathead queries?

No. Stathead requires a paid subscription and authentication. We only extract publicly available historical data from the main Basketball-Reference domain.

How quickly are new box scores available?

Our scheduled pipelines typically run in the early morning hours (EST) to capture the previous night's completed games, processing and delivering the data to your warehouse by 6:00 AM EST.

Can I get college and WNBA data too?

Yes. Sports Reference operates distinct subdomains for college basketball and the WNBA. We can configure pipelines to extract data from these sources using similar schemas.

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a complete historical dump of every box score since 1946 or a nightly sync for your betting models, we build and operate the pipeline.

Start a basketball-reference.com pipeline → View pricing

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h

