
NHL analytics,
at warehouse scale.

We extract player biographies, advanced metrics, game logs, and play-by-play data from Hockey-Reference. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.

Players extracted: 24,512 per run
Game logs: 68,201 per day
Stat rows: 12.4M per run
Active pipelines: 142
Uptime: 99.96%
Data Dictionary

Every field we extract from hockey-reference.com

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Player Profiles objects from hockey-reference.com. All fields typed and schema-versioned.

Fields: player_id, name, position, shoots, height_inches, weight_lbs, birth_date, birth_country, draft_year, draft_round, draft_pick, hall_of_fame

player_profiles (sample record, 200 OK)
{
  "player_id": "mcdavco01",
  "name": "Connor McDavid",
  "position": "C",
  "shoots": "L",
  "height_inches": 73,
  "weight_lbs": 193,
  "birth_date": "1997-01-13",
  "draft_year": 2015
}

Complete list of extractable fields for Standard Stats objects from hockey-reference.com. All fields typed and schema-versioned.

Fields: player_id, season, team, league, games_played, goals, assists, points, plus_minus, penalty_minutes, power_play_goals, shots_on_goal

standard_stats (sample record, 200 OK)
{
  "player_id": "mcdavco01",
  "season": "2022-23",
  "team": "EDM",
  "games_played": 82,
  "goals": 64,
  "assists": 89,
  "points": 153,
  "plus_minus": 22
}

Complete list of extractable fields for Advanced Metrics objects from hockey-reference.com. All fields typed and schema-versioned.

Fields: player_id, season, team, corsi_for_pct, fenwick_for_pct, pdo, offensive_zone_start_pct, expected_plus_minus, point_shares, time_on_ice_per_game, blocks, hits

advanced_metrics (sample record, 200 OK)
{
  "player_id": "mcdavco01",
  "season": "2022-23",
  "corsi_for_pct": 53.4,
  "fenwick_for_pct": 53.1,
  "pdo": 103.2,
  "offensive_zone_start_pct": 62.1,
  "point_shares": 16.5,
  "time_on_ice_per_game": "22:23"
}

Complete list of extractable fields for Game Logs objects from hockey-reference.com. All fields typed and schema-versioned.

Fields: game_id, date, team, opponent, home_away, result, goals, assists, points, plus_minus, time_on_ice, shifts

game_logs (sample record, 200 OK)
{
  "game_id": "202304130EDM",
  "date": "2023-04-13",
  "team": "EDM",
  "opponent": "SJS",
  "home_away": "Home",
  "goals": 1,
  "assists": 2,
  "time_on_ice": "18:45"
}

Complete list of extractable fields for Box Scores objects from hockey-reference.com. All fields typed and schema-versioned.

Fields: game_id, date, away_team, home_team, away_goals, home_goals, period_scoring, penalties, shots_on_goal, attendance, venue, duration

box_scores (sample record, 200 OK)
{
  "game_id": "202304130EDM",
  "away_team": "SJS",
  "home_team": "EDM",
  "away_goals": 2,
  "home_goals": 5,
  "attendance": 18347,
  "venue": "Rogers Place",
  "shots_on_goal": 35
}

Capabilities

Complete NHL statistical extraction

Hockey-Reference houses the most comprehensive public NHL database. We extract every table, parse commented-out HTML structures, and normalise multi-season data while adhering to strict rate limits.

Player Biographies

Extract physical attributes, draft positions, birthplaces, and career milestones for every player in NHL history.

Standard Statistics

Capture goals, assists, points, penalty minutes, and plus-minus across regular season and playoff campaigns.

Advanced Analytics

Parse Corsi, Fenwick, PDO, zone starts, and point shares from heavily nested statistical tables.

Goalie Metrics

Extract save percentages, goals against averages, shutouts, and quality starts for goaltenders.

Game Logs

Collect per-game performance records, including time on ice and shift counts, for any player across any season.

Box Scores

Scrape team-level game summaries, period-by-period scoring, penalty summaries, and venue attendance.

Draft History

Extract historical NHL entry draft results, including pick numbers, teams, and amateur leagues.

Award Voting

Capture Hart, Norris, Vezina, and Calder trophy voting histories and point distributions.

Daily Updates

Schedule pipelines to fetch the previous night's box scores and updated player statistics every morning.

// engagement pipeline

From player list to warehouse record

Brief in. Clean data out.

Define Scope
Day 0

Provide player IDs, team slugs, or season years. We design the extraction schema together.

Pipeline Build
Days 2–4

We configure Scrapy crawlers, proxy rotation, and rate-limit compliance logic for hockey-reference.com.

Validation & QA
Days 4–6

Schema validation, null-rate checks, and table-parsing verification before full launch.

Delivery
ongoing

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
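To make the null-rate checks in the Validation & QA step concrete, here is a rough sketch of how such a gate could work. It is an illustration under stated assumptions, not our production validator: the function names and the threshold values are hypothetical.

```python
def null_rates(rows: list[dict], fields: list[str]) -> dict[str, float]:
    """Return the share of rows in which each field is missing or None."""
    total = len(rows)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for row in rows if row.get(field) is None) / total
        for field in fields
    }


def failing_fields(rows: list[dict], thresholds: dict[str, float]) -> list[str]:
    """List the fields whose null rate exceeds its allowed threshold."""
    rates = null_rates(rows, list(thresholds))
    return sorted(f for f, limit in thresholds.items() if rates[f] > limit)


# Hypothetical sample: 'goals' must always be present, 'shifts' may be sparse.
sample = [
    {"game_id": "a", "goals": 1, "shifts": 22},
    {"game_id": "b", "goals": None, "shifts": None},
    {"game_id": "c", "goals": 0, "shifts": None},
    {"game_id": "d", "goals": 2, "shifts": 19},
]
print(failing_fields(sample, {"goals": 0.0, "shifts": 0.5}))
```

Per-field thresholds matter here because some columns are expected to be sparse in older seasons while others should never be empty.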

Under the hood

Overcoming Sports Reference rate limits

Sports Reference sites aggressively throttle automated access. Here is how we maintain pipeline stability.

pipeline-monitor · hockey-reference.com · live (active)
// fingerprinting
Identity rotation
TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0
// pagination
Page coverage
48,291 pages queued, running
// observability
Pipeline health
Uptime: 99.9%
p99 latency: 142ms
Null rate: 0.3%
Alerts: 2
Rate limiting
Distributed request timing

Hockey-Reference enforces a strict limit of 20 requests per minute. We distribute requests across residential IP pools and enforce global concurrency locks so that request timing stays under that limit and avoids 429 Too Many Requests errors while maintaining throughput.
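The arithmetic behind request timing is simple: 20 requests per minute means one request every 60 / 20 = 3 seconds. The sketch below shows one way to pace a single worker at that interval. The Pacer class and its injectable clock are hypothetical names chosen for illustration, and a real deployment would share this state across workers.

```python
import time


class Pacer:
    """Spaces requests evenly so a worker stays at or under a per-minute limit."""

    def __init__(self, requests_per_minute: int = 20, clock=time.monotonic):
        self.min_interval = 60.0 / requests_per_minute  # 3.0 s at 20 RPM
        self.clock = clock
        self.next_allowed = None  # earliest time the next request may be sent

    def reserve(self) -> float:
        """Reserve the next send slot and return how long to wait for it."""
        now = self.clock()
        if self.next_allowed is None or now >= self.next_allowed:
            send_at = now
        else:
            send_at = self.next_allowed
        self.next_allowed = send_at + self.min_interval
        return send_at - now
```

A worker would call time.sleep(pacer.reserve()) before each request. Even spacing trades burst speed for predictability, which is usually the safer choice against a hard limit.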

DOM parsing
Extracting commented-out tables

Many advanced statistical tables on Hockey-Reference are hidden inside HTML comments to speed up page loads. Our parsers automatically detect, uncomment, and extract this data without requiring full browser rendering.
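As an illustration of the idea, the sketch below uses only Python's standard-library HTMLParser to find comments that contain a table and then read the rows out of the recovered markup. The class and function names are hypothetical, and this is not our production parser, which runs inside Scrapy middleware.

```python
from html.parser import HTMLParser


class CommentedTableFinder(HTMLParser):
    """Collects the contents of HTML comments that contain a <table>."""

    def __init__(self):
        super().__init__()
        self.table_fragments = []

    def handle_comment(self, data):
        if "<table" in data:
            self.table_fragments.append(data.strip())


class TableRows(HTMLParser):
    """Reads cell text out of a table fragment, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_commented_tables(html: str) -> list[list[list[str]]]:
    """Return the rows of every table found inside an HTML comment."""
    finder = CommentedTableFinder()
    finder.feed(html)
    finder.close()
    tables = []
    for fragment in finder.table_fragments:
        parser = TableRows()
        parser.feed(fragment)
        parser.close()
        tables.append(parser.rows)
    return tables
```

Because the comment body is ordinary HTML once the delimiters are removed, no browser rendering is needed to read it.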

Schema stability
Handling historical data gaps

Stats like time on ice or blocked shots were not recorded in early NHL seasons. Our schema handles nulls gracefully and normalises historical records against modern datasets.
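One way to picture this is a normalisation step that gives every record the same keys and types, with None wherever a season has no data. The sketch below is a simplified illustration; the FIELD_TYPES mapping and the example row are hypothetical and cover only a handful of fields.

```python
# Hypothetical subset of the schema: field name -> type to cast to.
FIELD_TYPES = {
    "player_id": str,
    "season": str,
    "goals": int,
    "blocks": int,
    "time_on_ice_per_game": str,
}


def normalise_row(raw: dict) -> dict:
    """Return a row with every schema field present, typed, or None if missing."""
    row = {}
    for field, cast in FIELD_TYPES.items():
        value = raw.get(field)
        if value is None or str(value).strip() == "":
            row[field] = None  # stat not recorded for this season
        else:
            row[field] = cast(str(value).strip())
    return row


# Made-up early-season row: blocked shots and time on ice are absent.
print(normalise_row({"player_id": "example01", "season": "1917-18", "goals": "12", "blocks": ""}))
```

Emitting explicit nulls keeps old and modern seasons in one table without inventing values for stats that were never tracked.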

Change detection
Incremental nightly updates

For ongoing seasons, we maintain a state file of completed games. Pipelines only scrape new box scores and update cumulative player stats, drastically reducing request volume.
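A state file of this kind can be very small. The sketch below assumes, purely for illustration, that it is a JSON array of completed game IDs on local disk; the file layout and function names are hypothetical.

```python
import json
from pathlib import Path


def load_completed(state_path: Path) -> set[str]:
    """Read the set of already-scraped game IDs, or an empty set on first run."""
    if not state_path.exists():
        return set()
    return set(json.loads(state_path.read_text()))


def pending_games(scheduled_ids: list[str], completed: set[str]) -> list[str]:
    """Keep only the games that have not been scraped yet, in schedule order."""
    return [game_id for game_id in scheduled_ids if game_id not in completed]


def mark_completed(state_path: Path, completed: set[str], new_ids: list[str]) -> set[str]:
    """Persist newly scraped game IDs and return the updated set."""
    updated = completed | set(new_ids)
    state_path.write_text(json.dumps(sorted(updated)))
    return updated
```

On each nightly run, only the IDs returned by pending_games would be requested, so a finished game is fetched once rather than every night.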

Monitoring
Automated anomaly detection

We monitor for missing tables, unexpected schema changes, and IP bans. If a stat column shifts, our alerting stack catches it before bad data reaches your warehouse.
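A column-shift check can be as simple as comparing the header a table is expected to have with the header actually observed. The function below is a minimal sketch of that comparison; its name and return shape are hypothetical, not a description of our alerting stack.

```python
def detect_schema_drift(expected: list[str], observed: list[str]) -> dict[str, object]:
    """Compare an expected table header with the header actually scraped."""
    missing = [col for col in expected if col not in observed]
    unexpected = [col for col in observed if col not in expected]
    # Columns present on both sides should also appear in the same order.
    shared_expected = [col for col in expected if col in observed]
    shared_observed = [col for col in observed if col in expected]
    return {
        "missing": missing,
        "unexpected": unexpected,
        "reordered": shared_expected != shared_observed,
    }


report = detect_schema_drift(
    expected=["player_id", "season", "pdo"],
    observed=["player_id", "pdo", "season", "new_stat"],
)
print(report)
```

Any non-empty list or a True reordered flag would be grounds to hold the batch back for review before loading.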

Applications

Who uses Hockey-Reference data

Teams across industries use hockey-reference.com data to build competitive products and smarter operations.

01
Fantasy Hockey & DFS

Platform providers ingest daily game logs and advanced metrics to power player projections and pricing algorithms.

02
Predictive Modelling

Data scientists train machine learning models on historical Corsi and Fenwick data to predict game outcomes.

03
Sports Betting Analytics

Syndicates track team performance trends, goalie metrics, and schedule fatigue to find edges in the betting market.

04
Media & Broadcast

Sports networks populate on-screen graphics and pre-game research packets with historical franchise records.

05
Academic Research

Economists and statisticians analyse draft history and career longevity for sports economics papers.

06
Team Front Offices

Minor league and international teams benchmark player performance using NHL historical comparables.

Why DataFlirt

"Hockey-Reference holds the definitive historical record of the NHL, but pulling multi-season advanced metrics requires navigating strict rate limits and complex table structures."

Sports Reference properties aggressively throttle automated access. Extracting multi-season game logs or play-by-play data at scale requires distributed request timing, IP rotation, and parsing massive, heavily nested HTML tables. DataFlirt handles the infrastructure so your data science team can focus on building better models.

Technical Spec

Hockey-Reference scraper specifications

Everything supported by our hockey-reference.com scraper, from commented-out table parsing to rate-limit compliance and daily incremental updates.

HTML uncommenting: Automatic parsing of stats tables hidden in HTML comments (Supported)
Rate limit compliance: Strict adherence to 20 RPM limits via distributed IP routing (Supported)
Residential proxy rotation: US-based residential IPs to prevent geographic blocking (Supported)
Historical season support: Extraction spanning back to the inaugural NHL seasons (Supported)
Play-by-play parsing: Event-level extraction for goals, penalties, and goalie pulls (Supported)
Daily incremental updates: Nightly runs to append new box scores and update season totals (Supported)
Stathead premium queries: Custom multi-season conditional searches require a paid Stathead subscription (Partial)
User saved leaderboards: Accessing private user accounts or saved custom leaderboards (Partial)
Infrastructure

Infrastructure powering the pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus
Scrapy with Custom Middleware

Scrapy handles high-concurrency extraction. Custom middleware detects HTML comments containing payload data and parses them into standard DOM elements before extraction.

Distributed Rate Limiting

Redis-backed token buckets enforce global rate limits across distributed worker nodes, ensuring we never exceed Sports Reference's strict request thresholds.
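For readers unfamiliar with token buckets, the sketch below shows the algorithm with in-memory state. It is an assumption-laden illustration: in a distributed setup the token count and timestamp would live in Redis and be updated atomically, and the class name and default values here are hypothetical. The defaults are chosen so that burst plus refill over one minute is 5 + 15 = 20 requests.

```python
import time


class TokenBucket:
    """In-memory token bucket; a shared store would replace these attributes."""

    def __init__(self, capacity: int = 5, refill_per_minute: float = 15,
                 clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_minute / 60.0
        self.clock = clock
        self.tokens = float(capacity)  # start full
        self.updated_at = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False if the caller must wait."""
        now = self.clock()
        elapsed = now - self.updated_at
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Note that a bucket with capacity 20 refilling at 20 per minute could release close to 40 requests in one minute, which is why the burst size and refill rate are set lower in this example.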

Cloud-Native Orchestration

Pipelines run on Kubernetes. Airflow triggers nightly jobs immediately after West Coast games conclude, ensuring fresh data for morning analytics.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON: Newline-delimited or nested arrays
CSV: Flat file with typed columns
Parquet: Columnar format for analytics workloads
S3: Direct delivery to your AWS environment
BigQuery: Streamed directly into your dataset
Webhook: HTTP POST for real-time downstream triggers
Postgres: Upsert into your existing database schema
Snowflake: Stage and COPY INTO workflow
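To show what the two text formats look like for the same records, here is a small sketch using only the Python standard library. The helper names are hypothetical, and the actual delivery layer may differ in details such as key order and quoting.

```python
import csv
import io
import json


def to_ndjson(records: list[dict]) -> str:
    """Serialise records as newline-delimited JSON, one object per line."""
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)


def to_csv(records: list[dict], fields: list[str]) -> str:
    """Serialise records as CSV with a fixed column order; None becomes empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = [{"player_id": "mcdavco01", "season": "2022-23", "goals": 64}]
print(to_ndjson(records), end="")
print(to_csv(records, ["player_id", "season", "goals"]), end="")
```

Newline-delimited JSON keeps types and nesting, while CSV relies on a fixed column order, so the field list should come from the agreed schema.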
// faq

Common questions.

About hockey-reference.com scraping, legality, and pipeline operations.

Is scraping Hockey-Reference legal?

Scraping publicly available statistical data is generally permissible. DataFlirt extracts only public, non-authenticated sports statistics. We do not bypass authentication for Stathead premium features. Clients should review the site's Terms of Service and consult legal counsel.

How do you handle the strict rate limits?

We use distributed request timing, Redis-backed concurrency locks, and residential IP rotation to ensure our extraction stays below the mandated 20 requests per minute per IP threshold.

Can you extract data hidden in HTML comments?

Yes. Hockey-Reference loads many advanced tables as commented HTML to improve render speeds. Our parsers automatically strip the comment tags and extract the underlying table structures.

How frequently can the data be updated?

For active seasons, we configure pipelines to run daily, typically scheduled early morning to capture all box scores and updated statistics from the previous night.

Do you support historical NHL seasons?

Yes. We can extract data going back to the league's inception. Our schema handles missing fields gracefully, as older seasons lack data like time on ice or blocked shots.

Can I request a sample dataset?

Yes. We provide a sample run covering a specific team season or a list of 100 players during the scoping phase so you can validate the schema and data quality.

$ dataflirt scope --new-project --source=hockey-reference.com ready

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need historical franchise records or daily game log updates, we build and operate the pipeline. Tell us what you need.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h