What is Apache Airflow?

Apache Airflow is an open-source platform used to programmatically author, schedule, and monitor data pipelines. In the context of web scraping, it acts as the orchestration layer that triggers spiders, manages dependencies between extraction and transformation tasks, and handles retries when target sites inevitably time out or block requests. It turns a collection of isolated scraping scripts into a resilient, observable data supply chain.

Orchestration · DAGs · Task Scheduling · Python · Pipeline Observability
// 02 — definitions

Orchestrate
the chaos.

How to manage thousands of concurrent scraping jobs, handle inevitable failures, and ensure downstream data consumers aren't fed partial datasets.

TL;DR

Apache Airflow models scraping pipelines as Directed Acyclic Graphs (DAGs) written in Python. It doesn't run the actual browser or HTTP requests itself; instead, it schedules the workers, tracks task states, and triggers alerts when a target site's schema drift breaks an extraction job.

01 Definition & structure
Apache Airflow is a platform created by the community to programmatically author, schedule and monitor workflows. In a scraping context, it replaces fragile cron jobs and bash scripts with Python-based Directed Acyclic Graphs (DAGs). A typical scraping DAG consists of discrete tasks: fetching URLs, extracting data, validating schemas, and loading to a data warehouse. Airflow ensures these tasks run in the correct order, retries them if they fail, and provides a UI to monitor the entire process.
02 DAGs and Task Dependencies
The core concept in Airflow is the DAG, which defines the relationships between tasks. For example, the transform_data task cannot run until the scrape_website task completes successfully. If the scraper fails due to a proxy ban, the transform task does not run (under the default trigger rule Airflow marks it upstream_failed), preventing corrupted or empty data from polluting your downstream databases. This dependency management is what makes Airflow essential for production data engineering.
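The dependency rule above can be illustrated with a small stand-alone sketch. This is not Airflow's scheduler: it is a toy runner built on the standard library's `graphlib`, and the `run_pipeline` helper, the `load_warehouse` task, and the simulated proxy ban are all assumptions made for illustration. Only the task names `scrape_website` and `transform_data` and the state name `upstream_failed` come from Airflow and the text above.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "scrape_website": set(),
    "transform_data": {"scrape_website"},
    "load_warehouse": {"transform_data"},
}


def run_pipeline(tasks, dependencies):
    """Run callables in dependency order; never run a task whose upstream failed."""
    states = {}
    for name in TopologicalSorter(dependencies).static_order():
        if any(states[dep] != "success" for dep in dependencies[name]):
            # Same label Airflow uses for tasks blocked by a failed upstream.
            states[name] = "upstream_failed"
            continue
        try:
            tasks[name]()
            states[name] = "success"
        except Exception:
            states[name] = "failed"
    return states


def scrape_website():
    raise ConnectionError("proxy banned")  # simulated failure


def transform_data():
    print("transforming")


def load_warehouse():
    print("loading")


states = run_pipeline(
    {
        "scrape_website": scrape_website,
        "transform_data": transform_data,
        "load_warehouse": load_warehouse,
    },
    DEPENDENCIES,
)
print(states)
```

Because the scrape step raises, neither downstream function is ever called, which is the property that keeps partial data out of the warehouse.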
03 Handling Scraping Failures
Web scraping is inherently unstable. Target sites go down, rate limits are triggered, and proxies time out. Airflow handles this through per-task retry settings. You can configure a task to retry 5 times with an exponentially increasing delay between attempts, and a task can react to the type of error encountered, for example by raising AirflowFailException to fail immediately when a retry cannot help. If all retries fail, an on_failure_callback can send a Slack alert or open a PagerDuty incident.
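As a minimal sketch of such a retry policy, the dictionary below uses Airflow's standard task arguments (`retries`, `retry_delay`, `retry_exponential_backoff`, `max_retry_delay`, `on_failure_callback`). The specific delay values and the `alert_on_failure` function are assumptions chosen for illustration, and the callback only formats a message so the snippet runs without Airflow or any credentials.

```python
from datetime import timedelta


def alert_on_failure(context):
    """Hypothetical failure callback; Airflow passes it a context dict."""
    # A real callback would post to Slack or PagerDuty here. This sketch only
    # builds the message text.
    return f"Task failed after all retries: {context.get('task_instance')}"


# Intended to be passed as DAG(default_args=...) or as keyword arguments
# on an individual task. Values are illustrative, not recommendations.
default_args = {
    "retries": 5,
    "retry_delay": timedelta(minutes=1),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(minutes=30),
    "on_failure_callback": alert_on_failure,
}
```

Capping the delay with `max_retry_delay` keeps the later attempts from being pushed out indefinitely as the backoff grows.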
04 How DataFlirt handles it
We use Airflow as the brain of our infrastructure, but we strictly separate orchestration from execution. Our Airflow workers never run HTTP requests or headless browsers directly. Instead, they use the KubernetesPodOperator to launch ephemeral scraping containers in our Kubernetes cluster. This ensures that memory leaks in Playwright or Scrapy don't crash the Airflow scheduler, maintaining high availability across thousands of daily pipeline runs.
05 The "Worker Overload" Misconception
A common mistake is treating Airflow like a distributed computing framework (like Apache Spark). Engineers will write a PythonOperator that downloads 10GB of HTML into the worker's memory, parses it, and writes it to a database. This will quickly cause Out-Of-Memory (OOM) kills and bring down the entire Airflow cluster. Airflow should only pass metadata and trigger external systems; the heavy lifting of parsing and data manipulation must happen elsewhere.
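One way to picture "pass metadata, not payloads" is the sketch below. The function names, the JSONL layout, and the use of a local temporary directory as a stand-in for object storage are all assumptions. In a real pipeline the write would happen in the external scraping system, and only the small reference dictionary would travel between Airflow tasks.

```python
import json
import tempfile
from pathlib import Path


def scrape_to_storage(pages, storage_dir):
    """Write raw pages to storage and return only a small reference."""
    out_path = Path(storage_dir) / "raw_pages.jsonl"
    with out_path.open("w", encoding="utf-8") as fh:
        for url, html in pages:
            fh.write(json.dumps({"url": url, "html": html}) + "\n")
    # Only this small dict would be handed to the next task.
    return {"path": str(out_path), "page_count": len(pages)}


def count_records(reference):
    """Downstream step: stream the file line by line instead of loading it whole."""
    with open(reference["path"], encoding="utf-8") as fh:
        return sum(1 for _ in fh)


with tempfile.TemporaryDirectory() as tmp:
    ref = scrape_to_storage(
        [
            ("https://example.com/a", "<html>a</html>"),
            ("https://example.com/b", "<html>b</html>"),
        ],
        tmp,
    )
    record_count = count_records(ref)

print(ref["page_count"], record_count)
```

The reference stays a few bytes regardless of how much HTML was collected, so the orchestrator's memory use does not grow with the size of the scrape.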
// 03 — pipeline reliability

How resilient
is your DAG?

Airflow's value lies in its ability to handle failure gracefully. These metrics define how we configure task retries and SLAs for DataFlirt's managed pipelines.

Task Success Rate: $S_{\text{task}} = 1 - \dfrac{\text{failed\_tasks}}{\text{total\_tasks}}$
Target > 0.99. Lower values indicate selector rot or proxy pool exhaustion. (DataFlirt Pipeline SLO)
DAG Duration Variance: $\sigma^2 = \dfrac{1}{N}\sum_{i=1}^{N}(t_i - \mu)^2$
High variance in scrape times usually points to target rate-limiting or CAPTCHA tarpits. (Airflow SLA Monitoring)
Retry Backoff Delay: $\text{Delay} = \text{base} \times 2^{\text{attempt}} + \text{jitter}$
Exponential backoff keeps retries from hitting target servers as a thundering herd during recovery. (Standard Airflow Retry Policy)
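The three metrics can be computed directly from the formulas as stated. The sketch below is a plain-Python illustration, not Airflow's internal retry calculation. The function names, the zero-based `attempt` numbering, and the uniform jitter distribution are assumptions.

```python
import random
from statistics import pvariance


def task_success_rate(failed_tasks, total_tasks):
    """S_task = 1 - failed_tasks / total_tasks."""
    if total_tasks <= 0:
        raise ValueError("total_tasks must be positive")
    return 1 - failed_tasks / total_tasks


def duration_variance(durations):
    """Population variance: sum((t_i - mean)^2) / N."""
    return pvariance(durations)


def retry_delay(base, attempt, max_jitter=0.0):
    """Delay = base * 2**attempt + jitter, with jitter drawn from [0, max_jitter].

    attempt is zero-based here, so the first retry waits roughly `base` seconds.
    """
    return base * 2 ** attempt + random.uniform(0, max_jitter)


print(task_success_rate(failed_tasks=5, total_tasks=1000))
print(duration_variance([2, 4, 4, 4, 5, 5, 7, 9]))
print([retry_delay(base=60, attempt=n) for n in range(4)])
```

With a non-zero `max_jitter`, workers that failed at the same moment retry at slightly different times, which is what spreads the load on the target during recovery.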
// 04 — airflow task execution

A scraping DAG
in motion.

Trace of an Airflow worker executing a daily product catalog scrape, handling a proxy timeout, and successfully delivering the payload to S3.

CeleryExecutor · PythonOperator · S3Hook
[2026-05-19 08:00:01] INFO - Executing <Task(PythonOperator): fetch_catalog_urls>
dag_id: "ecommerce_daily_sync"
run_id: "scheduled__2026-05-19T08:00:00+00:00"

[2026-05-19 08:00:05] INFO - Triggering Scrapy spider via API
spider.status: "running"
items_scraped: 1420

[2026-05-19 08:05:12] ERROR - Proxy connection timeout on worker node 4
task.state: up_for_retry
[2026-05-19 08:10:12] INFO - Retrying task (attempt 2 of 3)

[2026-05-19 08:15:44] INFO - Spider completed successfully
items_scraped: 50000
task.state: success

[2026-05-19 08:15:45] INFO - Executing <Task(S3CreateObjectOperator): upload_to_datalake>
s3.key: "raw/ecommerce/2026-05-19/catalog.json"
dag.state: success
// 05 — orchestration bottlenecks

Where Airflow
struggles at scale.

Airflow is an orchestrator, not an execution engine. When scraping pipelines fail at the orchestration layer, it's rarely a bug in Airflow itself, but rather an architectural mismatch between how tasks are defined and how web scraping actually behaves.

Pipelines analyzed: 850+ DAGs
Avg tasks/DAG: 14
Updated: 2026-05-19
01 Database overload (Metadata DB)
Scheduler lag · Too many active DAG runs thrashing the Postgres backend

02 Worker memory exhaustion
OOM kills · Running heavy Pandas transforms inside the Airflow worker

03 Zombie tasks
Stuck running · Scraping processes that hang indefinitely without timing out

04 Sensor deadlocks
Pool starvation · Too many tasks waiting on external APIs, consuming all worker slots

05 DAG parsing timeouts
CPU spikes · Dynamically generating thousands of tasks with complex logic
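The zombie-task item in the list above comes down to a missing time limit. The sketch below shows the idea outside Airflow, using the standard library's `subprocess.run` with a `timeout`. The `run_with_timeout` helper and the sleeping child process that stands in for a hung scraper are assumptions. Inside Airflow, the task-level `execution_timeout` argument plays the equivalent role.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run a scraper process and stop it if it exceeds the time limit."""
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills and waits for the child before re-raising,
        # so the direct child process is not left running.
        return "timed_out"
    return "success" if completed.returncode == 0 else "failed"


# Stand-in for a hung scraper: a child that sleeps far longer than the limit.
hung_scraper = [sys.executable, "-c", "import time; time.sleep(30)"]
status = run_with_timeout(hung_scraper, timeout_seconds=1)
print(status)
```

Note that this only terminates the direct child. A scraper that spawns its own browser processes would need process-group or container-level cleanup, which is one reason to run it in an isolated pod.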
// 06 — our architecture

Decouple orchestration from execution,

let Airflow schedule, let Kubernetes scrape.

A common anti-pattern is running heavy scraping workloads directly inside Airflow workers using the PythonOperator. This leads to memory leaks, dependency conflicts, and worker starvation. At DataFlirt, Airflow acts strictly as the control plane. It uses the KubernetesPodOperator to spin up isolated, ephemeral scraping containers. If a scraper crashes due to a memory leak or a browser context error, the pod dies, Airflow detects the failure, and a fresh pod is scheduled. The orchestrator remains lightweight and highly available.

Airflow DAG Configuration

Standard task definition for a Kubernetes-backed scraping job.

operator: KubernetesPodOperator
image: dataflirt/scraper:v4.2
retries: 3 (exponential_backoff)
execution_timeout: timedelta(hours=2)
trigger_rule: all_success
on_failure_callback: alert_slack_channel

// 07 — FAQ

Common
questions.

About Airflow architecture, task scheduling, common anti-patterns, and how DataFlirt orchestrates enterprise-scale scraping pipelines.

Should I use Airflow to run my Selenium/Playwright scripts directly?
No. Running browser automation directly inside Airflow workers is a recipe for disaster. Browsers leak memory and leave zombie processes. Use Airflow to trigger external services (like an AWS ECS task or a Kubernetes pod) that run the actual browser, keeping your Airflow workers clean and responsive.
How does Airflow compare to cron for scheduling scrapers?
Cron is a time-based job scheduler; Airflow is a dependency-aware orchestrator. If your extraction script fails, cron won't automatically retry it, nor will it prevent the downstream data-loading script from running on empty data. Airflow handles retries, alerts, and ensures downstream tasks only run if upstream dependencies succeed.
Is Airflow suitable for real-time or streaming scraping?
No. Airflow is designed for batch processing. Its scheduler has inherent latency (typically seconds to minutes), making it unsuitable for sub-second, event-driven scraping. For real-time data feeds, use message queues like Kafka or RabbitMQ combined with stream processing frameworks.
How does DataFlirt use Airflow?
We use Airflow as our central control plane to manage thousands of daily scraping DAGs. It handles the complex dependencies between proxy rotation schedules, target site discovery crawls, data extraction jobs, and final schema validation checks before delivering datasets to clients.
What happens if the target website structure changes mid-scrape?
The extraction task will fail its schema validation check. Airflow catches this failure, halts downstream tasks (preventing bad data from entering your warehouse), and triggers an alert to the maintenance team. Once the selector is fixed, you can clear the failed task in the Airflow UI, and the DAG will resume from exactly where it stopped.
Can I dynamically generate scraping tasks based on a database of URLs?
Yes, Airflow supports dynamic DAG generation. However, querying a database at the top level of your DAG file will overload the Airflow scheduler, as it parses DAG files continuously. Instead, use dynamic task mapping (introduced in Airflow 2.3) to expand tasks at runtime based on the output of an upstream task.