
What is a DAG (Directed Acyclic Graph)?

A DAG (Directed Acyclic Graph) is the structural foundation of modern data orchestration. It defines the execution order of tasks so that dependencies flow in one direction and never loop back on themselves. In scraping pipelines, a DAG ensures that extraction only runs after a successful fetch, and delivery only triggers once validation passes. It's the difference between a brittle cron job that fails silently and a resilient pipeline that knows exactly where to resume after an interruption.

Orchestration · Airflow · Dependencies · Pipeline State · ETL

// 02 — definitions

Order out
of chaos.

How complex scraping and transformation workflows are modeled to guarantee execution order and prevent infinite loops.


TL;DR

A DAG is a mathematical graph where edges have a direction and no cycles exist. In data engineering (via tools like Airflow, Dagster, or Prefect), it represents a workflow. Each node is a task (e.g., fetch HTML, parse JSON, write to S3), and each edge is a dependency. If task A fails, any task B that depends on it does not start, which prevents cascading data corruption.

01 Definition & structure

A DAG consists of nodes (tasks) connected by directed edges (dependencies). "Directed" means execution flows in one specific direction (Task A → Task B). "Acyclic" means you can never follow the edges and end up back at a node you've already visited. This mathematical structure is perfect for data pipelines because it guarantees that dependencies are resolved in the correct order and that the workflow will eventually terminate.
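As a minimal sketch of this structure, the snippet below models a small pipeline as a mapping from each task to the tasks it depends on, then asks Python's standard-library `graphlib` for a valid execution order. The task names are hypothetical and chosen only to mirror the examples in this entry.

```python
from graphlib import TopologicalSorter

# Each key is a task (node); its value is the set of upstream tasks (incoming edges).
# Task names are illustrative, not taken from a real pipeline.
dependencies = {
    "fetch_html": set(),
    "parse_json": {"fetch_html"},
    "validate": {"parse_json"},
    "write_to_s3": {"validate"},
}

# static_order() yields every task only after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Because the graph has no cycles, the iteration is finite and every task appears exactly once.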

02 Why cron isn't enough

Cron schedules time; DAGs schedule state. If you use cron to run a fetch script at 1:00 AM and a parse script at 2:00 AM, the parse script will run even if the fetch script failed, resulting in empty or corrupted data. A DAG explicitly links them: the parse task cannot start until the fetch task reports a successful state.
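The difference can be sketched in a few lines. The toy runner below is not how any particular orchestrator is implemented; it only illustrates the idea that a task is gated on the recorded state of its upstream tasks rather than on the clock. The function names and the state labels are assumptions made for the example.

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks, dependencies):
    """Run each task only if all of its upstream tasks succeeded."""
    state = {}
    for name in TopologicalSorter(dependencies).static_order():
        if any(state[up] != "success" for up in dependencies.get(name, ())):
            state[name] = "upstream_failed"  # the task body is never executed
            continue
        try:
            tasks[name]()
            state[name] = "success"
        except Exception:
            state[name] = "failed"
    return state

def fetch():
    raise ConnectionError("502 from target site")  # simulated fetch failure

def parse():
    print("parsing")  # not reached when fetch fails

tasks = {"fetch": fetch, "parse": parse}
dependencies = {"fetch": set(), "parse": {"fetch"}}
result = run_pipeline(tasks, dependencies)
print(result)
```

With cron, `parse` would have run at 2:00 AM regardless; here it is marked as blocked and never touches the missing data.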

03 Idempotency in DAGs

For retries in a DAG to be safe, its nodes should be idempotent: running a task several times must leave the same final result as running it once. If a database insert task fails halfway through, retrying that node shouldn't result in duplicate rows. Idempotency allows orchestrators to safely retry failed nodes without manual cleanup.
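One common way to make a load task idempotent is to key each row on a natural identifier and write with replace semantics. The sketch below uses an in-memory SQLite database; the table, columns, and URLs are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (url TEXT PRIMARY KEY, price REAL)")

def load_rows(rows):
    """Write rows keyed on url, so re-running the task cannot create duplicates."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR REPLACE INTO products (url, price) VALUES (?, ?)",
            rows,
        )

batch = [("https://example.com/a", 9.99), ("https://example.com/b", 19.5)]
load_rows(batch)
load_rows(batch)  # simulated retry of the same task

row_count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(row_count)
```

The second call overwrites the same two rows instead of appending two more, so a retry leaves the table in the same state as a single successful run.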

04 How DataFlirt handles it

We compile all client scraping jobs into dynamic DAGs. A single pipeline might fan out into 10,000 parallel fetch tasks, join into 50 validation tasks, and funnel into a single delivery task. By tracking state at the node level, we can guarantee exactly-once delivery semantics even when target websites are highly unstable.
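The fan-out, join, and funnel shape can be illustrated at small scale with a thread pool. This is only a sketch of the pattern, not DataFlirt's implementation: the `fetch`, `validate`, and `deliver` functions are placeholders, and the fetch step returns canned HTML instead of making network calls.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for a real HTTP fetch.
    return {"url": url, "html": f"<html>{url}</html>"}

def validate(pages):
    # Join step: sees the output of every fetch at once.
    return [page for page in pages if page["html"]]

def deliver(pages):
    # Single sink: returns how many records were delivered.
    return len(pages)

urls = [f"https://example.com/page/{i}" for i in range(20)]

with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(fetch, urls))  # fan-out across workers

valid = validate(pages)
delivered = deliver(valid)
print(delivered)
```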

05 The "Acyclic" constraint

Why are loops forbidden? Because an orchestrator needs to calculate a deterministic execution plan before the first task runs. If Task A triggers Task B, which triggers Task A, the scheduler cannot determine the critical path, allocate workers efficiently, or guarantee that the pipeline will ever finish. Loops belong in your code (while loops); DAGs manage the macro-architecture.
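A scheduler can reject a cyclic graph before running anything. As a rough illustration, the standard-library `graphlib` raises `CycleError` when it prepares a graph that contains a loop. The `validate_graph` helper is a hypothetical name introduced for this example.

```python
from graphlib import CycleError, TopologicalSorter

def validate_graph(dependencies):
    """Return (True, None) for a valid DAG, or (False, cycle) if a loop exists."""
    try:
        TopologicalSorter(dependencies).prepare()
    except CycleError as err:
        # graphlib stores the detected cycle as the second element of args.
        return False, err.args[1]
    return True, None

# Task A depends on Task B, which depends on Task A.
ok, cycle = validate_graph({"task_a": {"task_b"}, "task_b": {"task_a"}})
print(ok, cycle)

ok_linear, _ = validate_graph({"task_a": set(), "task_b": {"task_a"}})
print(ok_linear)
```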

// 03 — graph metrics

Measuring pipeline
complexity.

A DAG's shape dictates how well it scales. DataFlirt monitors these graph metrics to optimize worker allocation and identify bottlenecks in multi-stage extraction pipelines.

Critical Path Length = L_cp = max(Σ T_execution)

The longest chain of dependent tasks by total execution time. It sets the minimum time a pipeline run can take, however many workers are available. (Graph theory)
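A sketch of how this metric can be computed: walk the tasks in topological order and record, for each one, the earliest time it can finish. The durations below are made-up values in seconds, and the task names are borrowed from the trace later in this entry purely for illustration.

```python
from graphlib import TopologicalSorter

def critical_path_length(dependencies, durations):
    """Longest summed duration along any dependency chain."""
    finish = {}
    for task in TopologicalSorter(dependencies).static_order():
        # A task can start only when its slowest upstream task has finished.
        start = max((finish[up] for up in dependencies.get(task, ())), default=0)
        finish[task] = start + durations[task]
    return max(finish.values())

dependencies = {
    "fetch_sitemap": set(),
    "extract_urls": {"fetch_sitemap"},
    "fetch_category_A": {"extract_urls"},
    "fetch_category_B": {"extract_urls"},
    "validate_schema": {"fetch_category_A", "fetch_category_B"},
}
# Hypothetical durations in seconds.
durations = {
    "fetch_sitemap": 5,
    "extract_urls": 2,
    "fetch_category_A": 30,
    "fetch_category_B": 45,
    "validate_schema": 10,
}

l_cp = critical_path_length(dependencies, durations)
print(l_cp)  # 5 + 2 + 45 + 10 = 62
```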

Graph Density = E / (V × (V - 1) / 2)

Ratio of actual edges E to the maximum number a DAG on V nodes can have, V(V - 1)/2. High density means tight coupling; low density leaves more room for parallel execution. (Network analysis)
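As a quick worked example, the helper below applies the formula to a graph with 12 nodes and 15 edges. The function name is an assumption made for the example.

```python
def dag_density(num_nodes, num_edges):
    """Edges present divided by the maximum a DAG with num_nodes nodes can hold."""
    max_edges = num_nodes * (num_nodes - 1) / 2
    return num_edges / max_edges if max_edges else 0.0

density = dag_density(12, 15)  # 15 / 66
print(round(density, 3))
```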

Worker Concurrency Bound = W_max = max(nodes_at_depth_d)

The number of tasks at the widest level of the DAG, which bounds how many workers a run can keep busy at once. (DataFlirt Scheduler SLO)
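One way to compute this, sketched below, is to assign each task a depth (the length of its longest chain of upstream tasks) and count the tasks at each depth. Grouping by depth is a simple approximation; tasks at different depths can sometimes also run concurrently, so treat the result as a planning figure rather than an exact limit. The graph is a hypothetical one shaped like the trace later in this entry.

```python
from collections import Counter
from graphlib import TopologicalSorter

def max_width(dependencies):
    """Largest number of tasks that share the same depth in the graph."""
    depth = {}
    for task in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(task, ())
        depth[task] = 1 + max((depth[up] for up in upstream), default=-1)
    return max(Counter(depth.values()).values())

fetchers = {f"fetch_category_{c}" for c in "ABCD"}
dependencies = {
    "fetch_sitemap": set(),
    "extract_urls": {"fetch_sitemap"},
    **{name: {"extract_urls"} for name in fetchers},
    "validate_schema": set(fetchers),
    "load_to_snowflake": {"validate_schema"},
}

w_max = max_width(dependencies)
print(w_max)
```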

// 04 — workflow execution

Resolving a
dependency graph.

A live trace of an Airflow-style DAG executing a daily e-commerce catalog scrape. The scheduler resolves dependencies before allocating workers.

Airflow · task resolution · XCom


// initialization
[scheduler] parsing dag: ecommerce_daily_v4
[scheduler] graph validated: 12 nodes, 15 edges, 0 cycles

// execution phase 1: discovery
[worker-1] task fetch_sitemap SUCCESS
[scheduler] downstream tasks unblocked: extract_urls
[worker-2] task extract_urls SUCCESS

// execution phase 2: fan-out
[scheduler] fanning out to 4 parallel fetch workers
[worker-3] task fetch_category_A SUCCESS
[worker-4] task fetch_category_B RETRY (1/3) - 429 Too Many Requests
[worker-5] task fetch_category_C SUCCESS
[worker-6] task fetch_category_D SUCCESS
[worker-4] task fetch_category_B SUCCESS

// execution phase 3: join and deliver
[scheduler] join node reached: validate_schema
[worker-1] task validate_schema SUCCESS
[worker-2] task load_to_snowflake SUCCESS
[scheduler] dag run ecommerce_daily_v4 completed in 14m 22s
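The phases in the trace above correspond to successive "waves" of tasks whose dependencies are all satisfied. The sketch below reproduces that resolution order with `graphlib`. It models only the eight tasks named in the trace, not the full 12-node graph, and it is not Airflow's scheduler code.

```python
from graphlib import TopologicalSorter

fetchers = [f"fetch_category_{c}" for c in "ABCD"]
dependencies = {
    "fetch_sitemap": set(),
    "extract_urls": {"fetch_sitemap"},
    **{name: {"extract_urls"} for name in fetchers},
    "validate_schema": set(fetchers),
    "load_to_snowflake": {"validate_schema"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()  # validates the graph; raises CycleError on a loop

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose upstream tasks are all done
    waves.append(ready)
    sorter.done(*ready)                 # mark them finished, unblocking downstream

for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

The third wave contains the four category fetches, which is the fan-out; `validate_schema` appears alone in the fourth wave because it is the join node.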

// 05 — failure modes

Where workflows
break down.

DAGs fail for reasons distinct from the code inside their tasks. These are the most common orchestration-level failures across DataFlirt's managed pipelines.

DAG runs: 1.2M+ monthly

Orchestrator: Airflow / Temporal

Updated: 2026-05-19

Upstream task timeout

cascading delay · Blocks downstream execution indefinitely if no task timeout is set

Sensor deadlock

state lock · Waiting on external state (e.g., S3 file) that never arrives

XCom / State bloat

memory exhaustion · Passing large raw payloads between nodes bloats the metadata store and can destabilize the scheduler

Unhandled partial failures

logic error · Missing trigger rules for skip states halts the pipeline

Cyclic dependency error

parse failure · Accidental loop introduced in dynamic DAG generation

// 06 — orchestration layer

Stateful execution,

because cron doesn't know when a proxy fails.

At DataFlirt, we don't just schedule scripts; we orchestrate state machines. Every extraction job is compiled into a dynamic DAG where fetch, parse, validate, and deliver are isolated nodes. If a target site throws a 502 during the fetch phase, the DAG halts, alerts, and retries only the fetch node. Downstream validation and delivery tasks remain safely pending. This isolation prevents dirty data from ever reaching the delivery sink and saves massive amounts of compute on retries.

DAG Run Status

Live state of a multi-stage extraction pipeline.

dag.id extract_b2b_pricing

run.id req_8f92b1a

state running

nodes.total 24 tasks

nodes.success 18 tasks

nodes.failed 0 tasks

nodes.running 4 tasks

critical_path.eta 2m 14s


// 07 — FAQ

Common
questions.

Common questions about DAGs, task orchestration, Airflow, and pipeline resilience.


Why use a DAG instead of a simple Python script?

A single script fails entirely if one step breaks. A DAG isolates failures. If your script crashes during the database upload, you have to re-scrape everything. A DAG lets you resume exactly at the upload step, saving compute, proxy bandwidth, and time. It turns a monolithic failure into a granular, recoverable state.
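Resuming at the failed step can be sketched with a simple record of completed steps. This is an illustration of the idea only; real orchestrators persist this state in a database. The step functions and the `upload_ok` switch are hypothetical.

```python
def run_from_checkpoint(steps, completed):
    """Run steps in order, skipping any already recorded in `completed`."""
    for name, step in steps:
        if name in completed:
            continue
        step()               # if this raises, the step is not marked complete
        completed.add(name)

calls = []
upload_ok = {"value": False}  # flipped to True to simulate the outage ending

def scrape():
    calls.append("scrape")

def upload():
    calls.append("upload")
    if not upload_ok["value"]:
        raise IOError("database unavailable")

steps = [("scrape", scrape), ("upload", upload)]
completed = set()

try:
    run_from_checkpoint(steps, completed)  # scrape succeeds, upload fails
except IOError:
    pass

upload_ok["value"] = True
run_from_checkpoint(steps, completed)      # resumes at upload; scrape is skipped
print(calls)
```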

What happens if a DAG has a cycle?

It ceases to be a DAG and becomes a directed cyclic graph. DAG-based orchestrators such as Airflow detect the cycle when they parse the graph and refuse to run it. A cycle is a circular wait: task A waits for B, which waits for A, so neither can ever start. The acyclic constraint is what guarantees the pipeline will eventually finish.

Should I pass scraped data between DAG tasks?

No. Passing large payloads (like raw HTML or JSON arrays) through the orchestrator's metadata database (e.g., Airflow XComs) bloats that database and can degrade or crash the scheduler. Tasks should write data to external storage (S3, GCS) and pass only the URI or metadata pointer to the next task.
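The pointer-passing pattern looks roughly like this. A temporary directory stands in for S3 or GCS so the sketch runs anywhere; the task functions and the `run_001` identifier are hypothetical.

```python
import json
import tempfile
from pathlib import Path

STORAGE = Path(tempfile.mkdtemp())  # stand-in for an object store such as S3 or GCS

def fetch_task(run_id):
    """Write the heavy payload to storage and return only its location."""
    payload = {"html": "<html>...</html>" * 1000}
    path = STORAGE / f"{run_id}.json"
    path.write_text(json.dumps(payload))
    return str(path)  # this small string is all the orchestrator has to carry

def parse_task(pointer):
    """Load the payload from storage using the pointer from the upstream task."""
    payload = json.loads(Path(pointer).read_text())
    return len(payload["html"])

pointer = fetch_task("run_001")
size = parse_task(pointer)
print(pointer, size)
```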

How does DataFlirt handle dynamic DAG generation?

We use factory patterns to generate DAGs based on client configuration files. If a client adds 50 new target domains to their pipeline, the orchestrator automatically fans out 50 new parallel fetch-and-extract branches on the next run without manual code changes.
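A factory of this kind can be sketched as a function from configuration to a dependency mapping. This is an illustrative reduction, not DataFlirt's code: the config shape, the `build_dag` name, and the task naming scheme are all assumptions.

```python
from graphlib import TopologicalSorter

def build_dag(config):
    """Return a task -> upstream-tasks mapping with one branch per configured domain."""
    dependencies = {"deliver": set()}
    for domain in config["domains"]:
        fetch, extract = f"fetch_{domain}", f"extract_{domain}"
        dependencies[fetch] = set()
        dependencies[extract] = {fetch}
        dependencies["deliver"].add(extract)  # every branch funnels into delivery
    return dependencies

config = {"domains": ["shop_a", "shop_b", "shop_c"]}  # hypothetical client config
dag = build_dag(config)

# Sorting the generated graph doubles as a check that generation introduced no loop.
order = list(TopologicalSorter(dag).static_order())
print(len(dag), order[-1])
```

Adding a domain to the config adds one fetch-and-extract branch on the next build, with no change to the factory itself.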

What is a sensor task in a DAG?

A sensor is a special node that waits for an external event before completing. In scraping, a sensor might poll an S3 bucket waiting for a proxy rotation to finish, or wait for a third-party API to publish a daily dump before triggering the extraction tasks.
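At its core a sensor is a polling loop with a deadline. The sketch below shows that shape in plain Python; it is not Airflow's sensor API, and the parameter names are illustrative. The timeout matters: without one, a sensor waiting on an event that never happens will block its downstream tasks forever.

```python
import time

def wait_for(condition, timeout_s, poke_interval_s=0.01):
    """Poll `condition` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poke_interval_s)
    return False  # the caller should fail the task rather than wait forever

polls = {"count": 0}

def file_arrived():
    # Simulated external event: the file shows up on the third poll.
    polls["count"] += 1
    return polls["count"] >= 3

arrived = wait_for(file_arrived, timeout_s=1.0)
timed_out = not wait_for(lambda: False, timeout_s=0.05)
print(arrived, timed_out)
```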

How do you handle retries at the DAG level?

Retries are configured per-task. A network fetch task might have 5 retries with exponential backoff to handle transient 502s, while a database load task might have 0 retries to prevent duplicate inserts if the failure was due to a schema mismatch. Granular retries are the primary benefit of DAG-based orchestration.
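Per-task retries with exponential backoff can be sketched as a small wrapper. The helper name, the delay values, and the simulated failures are assumptions for the example; orchestrators expose the same idea as task-level configuration.

```python
import time

def run_with_retries(task, retries, base_delay_s=0.01):
    """Call `task`, retrying up to `retries` times with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: surface the failure to the scheduler
            time.sleep(base_delay_s * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay

attempts = {"count": 0}

def flaky_fetch():
    # Simulated transient failure: succeeds on the third attempt.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("502 Bad Gateway")
    return "ok"

result = run_with_retries(flaky_fetch, retries=5)
print(result, attempts["count"])
```

Calling the same wrapper with `retries=0` gives the load-task behavior described above: the first failure is raised immediately and nothing is re-run.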



Related glossary terms

Apache Airflow

Pipeline Orchestration

ETL Pipeline

Idempotent Scraping