We extract course taxonomies, video metadata, transcripts, and exercise structures from Khan Academy. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.
Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.
Complete list of extractable fields for Courses & Units objects from khanacademy.org. All fields typed and schema-versioned.
"course_id": "cc-eighth-grade-math", "title": "8th grade (Eureka Math/EngageNY)", "subject": "Math", "domain": "Math", "unit_count": 7, "lesson_count": 124, "standards_alignment": "['CCSS.MATH.CONTENT.8.EE.A.1', 'CCSS.MATH.CONTENT.8.EE.A.2']", "url": "https://www.khanacademy.org/math/cc-eighth-grade-math"
| # | course_id | title | subject | domain | description | unit_count |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Video Lessons objects from khanacademy.org. All fields typed and schema-versioned.
"video_id": "v-intro-to-derivatives", "youtube_id": "rAof9Ld5sOg", "title": "Introduction to derivatives", "duration_seconds": 542, "author": "Sal Khan", "transcript_available": true, "unit_id": "derivatives-1", "course_id": "ap-calculus-ab"
| # | video_id | youtube_id | title | description | duration_seconds | author |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Transcripts objects from khanacademy.org. All fields typed and schema-versioned.
"video_id": "v-intro-to-derivatives", "language": "en", "segment_start_sec": 14.5, "segment_end_sec": 18.2, "text": "Let us look at the slope of the tangent line.", "auto_generated": false, "word_count": 10
| # | video_id | language | segment_start_sec | segment_end_sec | text | speaker |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Exercises objects from khanacademy.org. All fields typed and schema-versioned.
"exercise_id": "e-power-rule", "title": "Power rule practice", "topic": "Derivatives", "question_count": 4, "difficulty_level": "intermediate", "tags": "['calculus', 'power-rule', 'differentiation']", "standards_mapped": "['AP.CALC.2.1']", "url": "https://www.khanacademy.org/math/ap-calculus-ab/ab-differentiation-1-new/ab-2-5/e/power-rule"
| # | exercise_id | title | topic | question_count | difficulty_level | tags |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Articles objects from khanacademy.org. All fields typed and schema-versioned.
"article_id": "a-limits-intro", "title": "Limits introduction", "author": "Khan Academy Mathematics", "word_count": 845, "reading_time_mins": 5, "unit_id": "limits-1", "course_id": "ap-calculus-ab", "last_updated": "2024-01-15T10:00:00Z"
| # | article_id | title | author | content_html | word_count | reading_time_mins |
|---|---|---|---|---|---|---|
Khan Academy structures content in a deep hierarchy. We traverse domains, subjects, courses, units, and lessons to extract raw video metadata, transcripts, and exercise structures.
Extract the exact hierarchical relationship from top-level domains down to individual lesson nodes. Maintain relational integrity across the entire curriculum.
Capture YouTube IDs, duration, titles, and full closed-caption transcripts with timestamped segments for NLP processing.
Extract instructional text, inline equations (LaTeX/KaTeX), and embedded image URLs from article nodes.
Map courses and exercises to Common Core State Standards (CCSS), Next Generation Science Standards (NGSS), and AP curriculum tags.
Extract practice problem sets, question counts, difficulty markers, and related prerequisite lesson mappings.
Extract localized curriculum data from es.khanacademy.org, pt.khanacademy.org, and other supported regional subdomains.
Bypass HTML parsing by intercepting Relay GraphQL queries directly, ensuring clean, structured JSON extraction with zero DOM reliance.
Track changes to course structures, new lesson additions, and updated transcripts with hash-based diffing. Only ingest what changes.
Extract author attribution, voiceover credits, and content contributor metadata attached to specific video lessons and articles.
Brief in. Clean data out.
Specify target domains, subjects, or specific course URLs. We map the required data points and delivery schema.
We configure GraphQL interceptors, Playwright renderers, and proxy rotation to traverse the course hierarchy.
Schema validation, null-rate checks on transcripts, and relational integrity testing across unit-lesson mappings.
JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
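As an illustration of the validation step described above, here is a minimal sketch of a schema check and a null-rate check on transcript records. The field list mirrors the sample transcript record on this page; the function names and any threshold you apply to the null rate are assumptions for illustration, not a description of the production checks.

```python
from typing import Any

# Hypothetical required fields and types for one transcript segment,
# taken from the sample transcript record shown on this page.
REQUIRED_FIELDS = {
    "video_id": str,
    "language": str,
    "segment_start_sec": (int, float),
    "segment_end_sec": (int, float),
    "text": str,
}


def schema_errors(record: dict[str, Any]) -> list[str]:
    """Return human-readable problems for one record (empty list means valid)."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"wrong type for {field}")
    return errors


def null_rate(records: list[dict[str, Any]], field: str) -> float:
    """Fraction of records whose field is absent, None, or an empty string."""
    if not records:
        return 0.0
    empty = sum(1 for r in records if r.get(field) in (None, ""))
    return empty / len(records)
```

A batch could then be rejected when `schema_errors` returns anything for a record, or when `null_rate(batch, "text")` exceeds a limit agreed during scoping.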
Modern educational platforms use complex single-page architectures. Here is how we extract clean data from React and GraphQL backends.
Khan Academy uses React and Relay. Scraping the DOM is brittle. Our pipelines intercept the underlying GraphQL queries, extracting clean JSON payloads before they hit the rendering engine.
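A minimal sketch of this interception pattern, using Playwright's Python sync API. The `/graphql/` path hint is an assumption and should be confirmed against the requests visible in the browser's network panel; `capture_graphql` and `is_graphql_json` are illustrative names, not our production code.

```python
# Assumption: GraphQL traffic can be recognised by a path fragment like this.
GRAPHQL_PATH_HINT = "/graphql/"


def is_graphql_json(url: str, content_type: str) -> bool:
    """Pure predicate: does this response look like a GraphQL JSON payload?"""
    return GRAPHQL_PATH_HINT in url and "application/json" in content_type.lower()


def capture_graphql(page_url: str) -> list[dict]:
    """Load a page in headless Chromium and collect matching JSON responses."""
    # Imported lazily so the predicate above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    payloads: list[dict] = []

    def on_response(response) -> None:
        content_type = response.headers.get("content-type", "")
        if is_graphql_json(response.url, content_type):
            try:
                payloads.append({"url": response.url, "data": response.json()})
            except Exception:
                # Body unavailable or not valid JSON: skip it, keep crawling.
                pass

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("response", on_response)  # fires for every network response
        page.goto(page_url, wait_until="networkidle")
        browser.close()
    return payloads
```

Keeping the URL filter as a separate pure function means it can be unit-tested without launching a browser.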
Educational content is useless without context. We maintain foreign keys linking every transcript, exercise, and article back to its parent lesson, unit, course, and subject domain.
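One way to picture the foreign-key approach: flatten a nested curriculum tree into rows that each carry a `parent_id`, then check that every parent reference resolves. The nested shape below (`id`, `kind`, `title`, `children`) is a hypothetical simplification, not the actual payload structure.

```python
from typing import Any, Iterator, Optional


def flatten(node: dict[str, Any], parent_id: Optional[str] = None) -> Iterator[dict[str, Any]]:
    """Yield one relational row per node, depth-first, linked to its parent."""
    yield {
        "id": node["id"],
        "kind": node["kind"],
        "title": node["title"],
        "parent_id": parent_id,
    }
    for child in node.get("children", []):
        yield from flatten(child, parent_id=node["id"])


def orphans(rows: list[dict[str, Any]]) -> list[str]:
    """Return ids of rows whose parent_id does not match any row in the set."""
    ids = {r["id"] for r in rows}
    return [r["id"] for r in rows if r["parent_id"] is not None and r["parent_id"] not in ids]


# Hypothetical example tree built from identifiers in the sample records above.
tree = {
    "id": "ap-calculus-ab", "kind": "course", "title": "AP Calculus AB",
    "children": [{
        "id": "derivatives-1", "kind": "unit", "title": "Derivatives",
        "children": [{
            "id": "v-intro-to-derivatives", "kind": "video",
            "title": "Introduction to derivatives",
        }],
    }],
}
rows = list(flatten(tree))
```

An empty result from `orphans(rows)` indicates that every child row can be joined back to its parent.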
Video content relies on embedded YouTube iframes. We extract the underlying video IDs, query YouTube metadata endpoints, and pull full transcript files directly into your dataset.
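The first step, pulling the video ID out of an embed URL, can be sketched as below. The pattern assumes the common `youtube.com/embed/<id>` form and the 11-character ID format YouTube uses in practice; the helper name is illustrative, and the metadata and transcript lookups are not shown.

```python
import re
from typing import Optional

# Assumption: IDs are 11 characters drawn from letters, digits, "-" and "_",
# which matches YouTube IDs in practice but is not a documented guarantee.
_EMBED_RE = re.compile(r"youtube(?:-nocookie)?\.com/embed/([A-Za-z0-9_-]{11})")


def youtube_id_from_embed(src: str) -> Optional[str]:
    """Return the video ID from an iframe src, or None if it is not a YouTube embed."""
    match = _EMBED_RE.search(src)
    return match.group(1) if match else None
```

For example, `youtube_id_from_embed("https://www.youtube.com/embed/rAof9Ld5sOg?rel=0")` yields the `youtube_id` value shown in the sample video record.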
Math and science articles use complex rendering libraries. We extract the raw KaTeX/LaTeX strings rather than the rendered SVG elements, ensuring the data remains machine-readable.
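As a sketch of one way to recover the source strings: KaTeX's MathML output carries the original TeX in an `<annotation encoding="application/x-tex">` element, so it can be read without touching the visual markup. Whether a given article page ships that annotation is an assumption to verify per page; the class and function names here are illustrative.

```python
from html.parser import HTMLParser


class TexAnnotationParser(HTMLParser):
    """Collect the text of <annotation encoding="application/x-tex"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.sources = []     # extracted TeX strings, in document order
        self._buffer = None   # list of text chunks while inside an annotation

    def handle_starttag(self, tag, attrs):
        if tag == "annotation" and dict(attrs).get("encoding") == "application/x-tex":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "annotation" and self._buffer is not None:
            self.sources.append("".join(self._buffer).strip())
            self._buffer = None


def extract_tex(html: str) -> list[str]:
    """Return every TeX source string found in KaTeX-style annotations."""
    parser = TexAnnotationParser()
    parser.feed(html)
    parser.close()
    return parser.sources
```

Using the standard-library parser keeps the sketch dependency-free; a production pipeline might prefer to read the same strings from the JSON payload when they are present there.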
Courses update frequently. We hash node content and emit diffs, allowing you to track curriculum evolution without re-ingesting the entire catalogue.
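A minimal sketch of hash-based diffing, assuming each node is a JSON-serialisable dict with a stable `id` key. SHA-256 over canonical JSON is one reasonable choice, not necessarily the exact scheme used in production.

```python
import hashlib
import json


def node_hash(node: dict) -> str:
    """Hash a node's content; sorted keys make the hash independent of key order."""
    canonical = json.dumps(node, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_snapshots(previous: dict[str, str], current_nodes: list[dict], key: str = "id"):
    """Compare stored hashes with freshly crawled nodes.

    Returns (diff, current), where diff lists added, removed, and changed ids
    and current is the new id-to-hash snapshot to persist for the next run.
    """
    current = {n[key]: node_hash(n) for n in current_nodes}
    diff = {
        "added": sorted(k for k in current if k not in previous),
        "removed": sorted(k for k in previous if k not in current),
        "changed": sorted(k for k in current if k in previous and current[k] != previous[k]),
    }
    return diff, current
```

Only the ids listed under `added` and `changed` need to be re-ingested downstream; `removed` ids can be flagged or soft-deleted.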
Machine learning teams ingest structured transcripts and articles to fine-tune LLMs on high-quality, pedagogically sound educational content.
EdTech platforms map their own content against Khan Academy taxonomy to identify curriculum gaps and align with Common Core standards.
Global education initiatives extract localized content to build supplementary learning portals for non-English speaking regions.
Researchers analyze pedagogical structures, video durations, and exercise sequencing to study optimal digital learning pathways.
Skills platforms map freely available course content to specific job competencies to build custom learning tracks.
Publishers analyze the density of practice questions per topic to identify underserved subject areas for new product development.
"Khan Academy houses the internet's most comprehensive structured curriculum, but navigating its GraphQL endpoints requires a purpose-built extraction layer."
Extracting educational taxonomies at scale means handling deeply nested node structures, React-driven dynamic rendering, and embedded video states. DataFlirt manages the entire pipeline, mapping domains to units to lessons into relational tables so your engineering team can focus on product development rather than DOM parsing.
Everything supported by our khanacademy.org scraper: rendered SPA elements, rate-limit handling, and beyond. We work with public, non-authenticated pages only.
Open-source tooling on proven cloud infra — no vendor lock-in, full observability.
Scrapy handles crawl orchestration and deduplication. Playwright handles JavaScript rendering, GraphQL interception, and complex interaction flows.
We bypass brittle HTML parsing by monitoring network traffic and capturing raw JSON responses from internal API endpoints directly.
Pipelines run on AWS Lambda (burst) and ECS (sustained). Airflow handles scheduling, dependency management, and SLA alerting. All state stored in managed Postgres.
Data delivered to where your team already works — no new tooling required.
About khanacademy.org scraping, legality, and pipeline operations.
Scraping publicly available educational content is generally permissible under fair use for research, analysis, and non-commercial aggregation. DataFlirt targets only public, non-authenticated course, video, and exercise metadata. We do not extract personal student data or circumvent authentication walls. Clients must ensure their downstream use complies with Khan Academy's terms of service and copyright law.
We use Playwright to execute JavaScript and intercept the underlying Relay GraphQL network requests. This allows us to extract clean, structured JSON data directly from the API layer rather than relying on brittle CSS selectors.
Yes. We extract the embedded YouTube video IDs and pull the full closed-caption transcripts, including timestamped segments and speaker attribution where available.
Yes. When courses, units, or exercises are tagged with Common Core State Standards (CCSS) or Next Generation Science Standards (NGSS), we extract and map those tags in the final dataset.
Yes. We can target specific localized subdomains (e.g., es.khanacademy.org) to extract curriculum data in Spanish, Portuguese, French, and other supported languages.
Curriculum structures change relatively slowly. We typically configure these pipelines for weekly or monthly full-catalogue refreshes, with hash-based diffing to highlight new or modified courses.
20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a full dump of the math curriculum or continuous updates on new science units, we scope, build, and operate the pipeline. Tell us what you need.