Khan Academy Scraper — Course, Video & Exercise Data Extraction

Data Dictionary

Every field we extract from khanacademy.org

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Courses & Units objects from khanacademy.org. All fields typed and schema-versioned.

course_id, title, subject, domain, description, unit_count, lesson_count, total_duration_mins, standards_alignment, url

{
  "course_id": "cc-eighth-grade-math",
  "title": "8th grade (Eureka Math/EngageNY)",
  "subject": "Math",
  "domain": "Math",
  "unit_count": 7,
  "lesson_count": 124,
  "standards_alignment": "['CCSS.MATH.CONTENT.8.EE.A.1', 'CCSS.MATH.CONTENT.8.EE.A.2']",
  "url": "https://www.khanacademy.org/math/cc-eighth-grade-math"
}

Complete list of extractable fields for Video Lessons objects from khanacademy.org. All fields typed and schema-versioned.

video_id, youtube_id, title, description, duration_seconds, author, transcript_available, unit_id, course_id, publish_date

{
  "video_id": "v-intro-to-derivatives",
  "youtube_id": "rAof9Ld5sOg",
  "title": "Introduction to derivatives",
  "duration_seconds": 542,
  "author": "Sal Khan",
  "transcript_available": true,
  "unit_id": "derivatives-1",
  "course_id": "ap-calculus-ab"
}

Complete list of extractable fields for Transcripts objects from khanacademy.org. All fields typed and schema-versioned.

video_id, language, segment_start_sec, segment_end_sec, text, speaker, auto_generated, word_count

{
  "video_id": "v-intro-to-derivatives",
  "language": "en",
  "segment_start_sec": 14.5,
  "segment_end_sec": 18.2,
  "text": "Let us look at the slope of the tangent line.",
  "auto_generated": false,
  "word_count": 10
}
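To make "typed" concrete, the transcript sample record above could be modelled roughly as follows. This is a minimal sketch: the `TranscriptSegment` class, its validation rule, and the whitespace-based word count are illustrative assumptions, not the delivered schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TranscriptSegment:
    """One timestamped caption segment (hypothetical model of the listed fields)."""

    video_id: str
    language: str
    segment_start_sec: float
    segment_end_sec: float
    text: str
    auto_generated: bool
    speaker: Optional[str] = None  # absent from the sample record, so optional here

    def __post_init__(self) -> None:
        # Reject segments whose timestamps run backwards.
        if self.segment_end_sec < self.segment_start_sec:
            raise ValueError("segment ends before it starts")

    @property
    def word_count(self) -> int:
        # Whitespace tokenisation; an assumption about how word_count is derived.
        return len(self.text.split())


segment = TranscriptSegment(
    video_id="v-intro-to-derivatives",
    language="en",
    segment_start_sec=14.5,
    segment_end_sec=18.2,
    text="Let us look at the slope of the tangent line.",
    auto_generated=False,
)
```

Deriving `word_count` from `text` rather than storing it separately keeps the two fields from drifting apart, at the cost of fixing one tokenisation rule.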

Complete list of extractable fields for Exercises objects from khanacademy.org. All fields typed and schema-versioned.

exercise_id, title, topic, question_count, difficulty_level, tags, standards_mapped, related_video_id, url

{
  "exercise_id": "e-power-rule",
  "title": "Power rule practice",
  "topic": "Derivatives",
  "question_count": 4,
  "difficulty_level": "intermediate",
  "tags": "['calculus', 'power-rule', 'differentiation']",
  "standards_mapped": "['AP.CALC.2.1']",
  "url": "https://www.khanacademy.org/math/ap-calculus-ab/ab-differentiation-1-new/ab-2-5/e/power-rule"
}

Complete list of extractable fields for Articles objects from khanacademy.org. All fields typed and schema-versioned.

article_id, title, author, content_html, word_count, reading_time_mins, unit_id, course_id, last_updated

{
  "article_id": "a-limits-intro",
  "title": "Limits introduction",
  "author": "Khan Academy Mathematics",
  "word_count": 845,
  "reading_time_mins": 5,
  "unit_id": "limits-1",
  "course_id": "ap-calculus-ab",
  "last_updated": "2024-01-15T10:00:00Z"
}

Capabilities

Extract the entire educational graph

Khan Academy structures content in a deep hierarchy. We traverse domains, subjects, courses, units, and lessons to extract raw video metadata, transcripts, and exercise structures.

Deep Taxonomy Mapping

Extract the exact hierarchical relationship from top-level domains down to individual lesson nodes. Maintain relational integrity across the entire curriculum.

Video & Transcript Extraction

Capture YouTube IDs, duration, titles, and full closed-caption transcripts with timestamped segments for NLP processing.

Article Content Parsing

Extract instructional text, inline equations (LaTeX/KaTeX), and embedded image URLs from article nodes.

Standards Alignment

Map courses and exercises to Common Core State Standards (CCSS), Next Generation Science Standards (NGSS), and AP curriculum tags.

Exercise Metadata

Extract practice problem sets, question counts, difficulty markers, and related prerequisite lesson mappings.

Multi-Language Support

Extract localized curriculum data from es.khanacademy.org, pt.khanacademy.org, and other supported regional subdomains.

GraphQL Interception

Bypass HTML parsing by intercepting Relay GraphQL responses directly, yielding clean, structured JSON with no reliance on the DOM.

Incremental Updates

Track changes to course structures, new lesson additions, and updated transcripts with hash-based diffing. Only ingest what changes.

Instructor & Contributor Data

Extract author attribution, voiceover credits, and content contributor metadata attached to specific video lessons and articles.

// engagement pipeline

From curriculum target to structured dataset

Brief in. Clean data out.

Define Scope

Day 0

Specify target domains, subjects, or specific course URLs. We map the required data points and delivery schema.

Pipeline Build

Days 2–4

We configure GraphQL interceptors, Playwright renderers, and proxy rotation to traverse the course hierarchy.

Validation & QA

Days 4–6

Schema validation, null-rate checks on transcripts, and relational integrity testing across unit-lesson mappings.

Delivery

Ongoing

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
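The null-rate check mentioned under Validation & QA can be sketched in a few lines. This is an illustration only: the `null_rate` helper and its definition of "null" (missing, `None`, or empty string) are assumptions, not the pipeline's actual rule.

```python
from typing import Any, Iterable, Mapping


def null_rate(records: Iterable[Mapping[str, Any]], field: str) -> float:
    """Fraction of records where `field` is missing, None, or an empty string."""
    total = 0
    nulls = 0
    for record in records:
        total += 1
        if record.get(field) in (None, ""):
            nulls += 1
    # An empty batch has no nulls by convention, which avoids dividing by zero.
    return nulls / total if total else 0.0


# Hypothetical transcript rows, used only to exercise the helper.
rows = [
    {"video_id": "v-intro-to-derivatives", "text": "Let us look at the slope."},
    {"video_id": "v-intro-to-derivatives", "text": ""},
    {"video_id": "v-intro-to-derivatives", "text": None},
    {"video_id": "v-intro-to-derivatives"},
]
transcript_null_rate = null_rate(rows, "text")
```

A QA gate would compare the result against an agreed per-field threshold and fail the run when it is exceeded.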

Under the hood

Navigating Khan Academy infrastructure

Modern educational platforms use complex single-page architectures. Here is how we extract clean data from React and GraphQL backends.

// fingerprinting

Identity rotation

TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0

// pagination

Page coverage

48,291 pages queued (status: running)

// observability

Pipeline health

Uptime: 99.9%
p99 latency: 142ms
Null rate: 0.3%

GraphQL parsing

Direct API interception over DOM scraping

Khan Academy uses React and Relay, so scraping the DOM is brittle. Our pipelines intercept the responses to the underlying GraphQL queries, capturing clean JSON payloads before they reach the rendering layer.
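As a rough illustration of response interception with Playwright's Python sync API: the sketch below records responses whose URL path matches a GraphQL endpoint and reads their JSON bodies once the page has settled. The `GRAPHQL_PATH_PREFIX` value and both function names are assumptions made for the example; check the real endpoint in the browser's network panel.

```python
from urllib.parse import urlparse

# Assumed path prefix for the site's internal GraphQL endpoint (not confirmed here).
GRAPHQL_PATH_PREFIX = "/api/internal/graphql"


def is_graphql_json(url: str, content_type: str) -> bool:
    """True when a response looks like a JSON payload from the GraphQL endpoint."""
    path = urlparse(url).path
    return path.startswith(GRAPHQL_PATH_PREFIX) and "application/json" in content_type.lower()


def capture_graphql_payloads(page_url: str) -> list:
    """Load a page in headless Chromium and return the GraphQL JSON bodies it fetched."""
    # Imported lazily so the filter above can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    matched = []

    def remember(response) -> None:
        # Keep only responses that pass the URL and content-type filter.
        if is_graphql_json(response.url, response.headers.get("content-type", "")):
            matched.append(response)

    payloads = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("response", remember)
        page.goto(page_url, wait_until="networkidle")
        for response in matched:
            try:
                payloads.append(response.json())
            except Exception:
                # Body unavailable or not valid JSON; skip rather than fail the crawl.
                continue
        browser.close()
    return payloads
```

Filtering on URL and content type before reading any body keeps the handler cheap, and the filter itself can be unit-tested without launching a browser.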

Hierarchy traversal

Maintaining relational integrity

Educational content is useless without context. We maintain foreign keys linking every transcript, exercise, and article back to its parent lesson, unit, course, and subject domain.
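A relational integrity test of this kind amounts to looking for child records whose foreign key matches no parent. A minimal sketch, assuming records are plain dictionaries; the `find_orphans` helper and the sample rows are hypothetical.

```python
from typing import Any, Iterable, List, Mapping


def find_orphans(
    children: Iterable[Mapping[str, Any]],
    parents: Iterable[Mapping[str, Any]],
    foreign_key: str,
    primary_key: str,
) -> List[Mapping[str, Any]]:
    """Return child records whose foreign key has no matching parent primary key."""
    parent_keys = {parent[primary_key] for parent in parents}
    # A child with a missing foreign key counts as an orphan too.
    return [child for child in children if child.get(foreign_key) not in parent_keys]


# Hypothetical rows used only to exercise the check.
units = [{"unit_id": "derivatives-1"}, {"unit_id": "limits-1"}]
videos = [
    {"video_id": "v-intro-to-derivatives", "unit_id": "derivatives-1"},
    {"video_id": "v-unlinked", "unit_id": "integrals-9"},
]
orphans = find_orphans(videos, units, foreign_key="unit_id", primary_key="unit_id")
```

The same helper applies at each level of the hierarchy (lesson to unit, unit to course, course to subject) by changing the key names.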

YouTube integration

Extracting embedded video data

Video content relies on embedded YouTube iframes. We extract the underlying video IDs, query YouTube metadata endpoints, and pull full transcript files directly into your dataset.
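Recovering the video ID from an embed URL is a small parsing step. The sketch below handles the common `/embed/`, `/watch?v=`, and `youtu.be` URL shapes; the function name is hypothetical, and the 11-character ID pattern is the usual convention rather than a documented guarantee.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Conventional shape of a YouTube video ID (an assumption, not a published contract).
_VIDEO_ID = re.compile(r"[A-Za-z0-9_-]{11}")


def youtube_id_from_url(url: str) -> Optional[str]:
    """Extract the video ID from an embed, watch, or short YouTube URL."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    candidate = None
    if host in ("youtube.com", "youtube-nocookie.com"):
        if parsed.path.startswith("/embed/"):
            candidate = parsed.path.split("/")[2]
        elif parsed.path == "/watch":
            candidate = parse_qs(parsed.query).get("v", [None])[0]
    elif host == "youtu.be":
        candidate = parsed.path.lstrip("/")

    if candidate and _VIDEO_ID.fullmatch(candidate):
        return candidate
    return None


embed_id = youtube_id_from_url("https://www.youtube.com/embed/rAof9Ld5sOg?rel=0")
```

Returning `None` for anything that does not match lets the caller record a null `youtube_id` instead of storing a malformed value.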

Math rendering

Preserving KaTeX and LaTeX equations

Math and science articles use complex rendering libraries. We extract the raw KaTeX/LaTeX source strings rather than the rendered markup, ensuring the data remains machine-readable.
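One way to recover source strings from rendered output: KaTeX's MathML output carries the original TeX in an `<annotation encoding="application/x-tex">` element. The sketch below pulls those annotations out with a regular expression. It assumes the captured HTML includes that MathML layer; where the API payload already holds raw TeX, this step is unnecessary. The function name is hypothetical.

```python
import html
import re
from typing import List

# Matches the TeX annotation that KaTeX embeds in its MathML output.
_TEX_ANNOTATION = re.compile(
    r'<annotation\s+encoding="application/x-tex"\s*>(.*?)</annotation>',
    re.DOTALL,
)


def extract_tex(rendered_html: str) -> List[str]:
    """Return the TeX source strings found in KaTeX-rendered HTML, in document order."""
    return [html.unescape(match).strip() for match in _TEX_ANNOTATION.findall(rendered_html)]


# Hand-written fragment in the shape of KaTeX output, for illustration only.
sample = (
    '<span class="katex"><span class="katex-mathml"><math><semantics>'
    "<mrow><mi>f</mi></mrow>"
    '<annotation encoding="application/x-tex">f&#x27;(x) = 2x</annotation>'
    "</semantics></math></span></span>"
)
equations = extract_tex(sample)
```

A regular expression is enough for this one well-defined element; a full HTML parser would be the safer choice if attribute order or quoting varies.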

Change detection

Curriculum version control

Courses update frequently. We hash node content and emit diffs, allowing you to track curriculum evolution without re-ingesting the entire catalogue.
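Hash-based diffing of this sort can be sketched as follows: serialise each node canonically, hash it, and compare against the hashes stored from the previous run. The function names and the choice of SHA-256 over sorted-key JSON are assumptions for illustration, not a description of the production pipeline.

```python
import hashlib
import json
from typing import Dict, List, Tuple


def content_hash(node: dict) -> str:
    """Stable SHA-256 of a node: keys are sorted so field order cannot change the hash."""
    canonical = json.dumps(node, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_snapshots(
    previous_hashes: Dict[str, str],
    current_nodes: Dict[str, dict],
) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Compare stored hashes with freshly crawled nodes.

    Returns the diff (added / changed / removed node IDs) and the new hash map
    to persist for the next run.
    """
    current_hashes = {node_id: content_hash(node) for node_id, node in current_nodes.items()}
    old_ids, new_ids = set(previous_hashes), set(current_hashes)
    diff = {
        "added": sorted(new_ids - old_ids),
        "changed": sorted(
            node_id
            for node_id in old_ids & new_ids
            if previous_hashes[node_id] != current_hashes[node_id]
        ),
        "removed": sorted(old_ids - new_ids),
    }
    return diff, current_hashes
```

Only the hash map needs to be kept between runs, so the previous crawl's full payloads can be discarded once delivered.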

Applications

Who uses Khan Academy data

Teams in machine learning, EdTech, publishing, and academic research use khanacademy.org data for the applications below.

AI Tutor Training Data

Machine learning teams ingest structured transcripts and articles to fine-tune LLMs on high-quality, pedagogically sound educational content.

Curriculum Mapping

EdTech platforms map their own content against Khan Academy taxonomy to identify curriculum gaps and align with Common Core standards.

Multi-language Resource Aggregation

Global education initiatives extract localized content to build supplementary learning portals for non-English speaking regions.

Academic Research

Researchers analyze pedagogical structures, video durations, and exercise sequencing to study optimal digital learning pathways.

Alternative Credentialing

Skills platforms map open-source course content to specific job competencies to build custom learning tracks.

EdTech Content Gap Analysis

Publishers analyze the density of practice questions per topic to identify underserved subject areas for new product development.

Technical Spec

Khan Academy scraper — technical capabilities

Everything supported by our khanacademy.org scraper, from rendered SPA elements and GraphQL payloads to change detection. Data that sits behind authentication is only partially supported.

GraphQL interception

Direct extraction of Relay GraphQL payloads for clean JSON data

Supported

YouTube transcript extraction

Full closed-caption text with timestamp arrays

Supported

Taxonomy tree mapping

Preserves relational links from Domain to Lesson node

Supported

Multi-language support

Target specific localized subdomains (es., pt., fr.)

Supported

Common Core extraction

Capture standards tags attached to courses and exercises

Supported

Article HTML extraction

Raw HTML and KaTeX equation strings

Supported

Change detection (diffs)

Only emit records for new or updated curriculum nodes

Supported

User progress data

Individual student mastery points and completion states

Partial

Teacher dashboard analytics

Classroom performance metrics and assignment tracking

Partial

Infrastructure

Infrastructure powering the pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus

Scrapy + Playwright Stack

Scrapy handles crawl orchestration and deduplication. Playwright handles JavaScript rendering, GraphQL interception, and complex interaction flows.

GraphQL Interception Layer

We bypass brittle HTML parsing by monitoring network traffic and capturing raw JSON responses from internal API endpoints directly.

Cloud-Native Orchestration

Pipelines run on AWS Lambda (burst) and ECS (sustained). Airflow handles scheduling, dependency management, and SLA alerting. All state stored in managed Postgres.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON

Newline-delimited or nested — schema versioned per run

CSV

Flat file with typed columns — Excel/Sheets compatible

XLS

Excel format for non-technical analyst teams

Parquet

Columnar format for BigQuery, Snowflake, Athena

AWS S3

Direct bucket delivery — compatible with any data lake

Webhook

HTTP POST per record for real-time downstream processing

API

Queryable REST endpoints for on-demand data retrieval

PostgreSQL

Upsert into your existing schema with conflict resolution

BigQuery

Streamed directly into your dataset with schema auto-detect

Snowflake

Stage + COPY INTO workflow — incremental or full-replace
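Of the formats above, newline-delimited JSON is the simplest to illustrate: one JSON object per line, each stamped with the schema version of the run. This is a minimal sketch; the `schema_version` field name and the `write_ndjson` helper are assumptions, not the delivered file layout.

```python
import io
import json
from typing import Any, Iterable, Mapping, TextIO


def write_ndjson(records: Iterable[Mapping[str, Any]], stream: TextIO, schema_version: str) -> int:
    """Write one JSON object per line, tagging each with the run's schema version."""
    count = 0
    for record in records:
        row = {"schema_version": schema_version, **record}
        stream.write(json.dumps(row, ensure_ascii=False) + "\n")
        count += 1
    return count


# Write two hypothetical records to an in-memory buffer.
buffer = io.StringIO()
written = write_ndjson(
    [
        {"article_id": "a-limits-intro", "word_count": 845},
        {"exercise_id": "e-power-rule", "question_count": 4},
    ],
    buffer,
    schema_version="1",
)
```

Because each line parses independently, consumers can stream the file and detect a schema change without reading it to the end.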

// faq

Common questions.

About khanacademy.org scraping, legality, and pipeline operations.

Is scraping Khan Academy legal?

Scraping publicly available educational content is generally permissible under fair use for research, analysis, and non-commercial aggregation. DataFlirt targets only public, non-authenticated course, video, and exercise metadata. We do not extract personal student data or circumvent authentication walls. Clients must ensure their downstream use complies with Khan Academy terms of service and copyright law.

How do you handle the React single-page application?

We use Playwright to execute JavaScript and intercept the underlying Relay GraphQL network requests. This allows us to extract clean, structured JSON data directly from the API layer rather than relying on brittle CSS selectors.

Can you extract video transcripts?

Yes. We extract the embedded YouTube video IDs and pull the full closed-caption transcripts, including timestamped segments and speaker attribution where available.

Do you capture Common Core standards?

Yes. When courses, units, or exercises are tagged with Common Core State Standards (CCSS) or Next Generation Science Standards (NGSS), we extract and map those tags in the final dataset.

Can I get non-English content?

Yes. We can target specific localized subdomains (e.g., es.khanacademy.org) to extract curriculum data in Spanish, Portuguese, French, and other supported languages.

How fresh is the data?

Curriculum structures change relatively slowly. We typically configure these pipelines for weekly or monthly full-catalogue refreshes, with hash-based diffing to highlight new or modified courses.

Educational data, structured for scale.

Tell us what to extract. We do the rest.
