
What is a Webhook?

Webhooks are user-defined HTTP callbacks that push data to your infrastructure the moment an event occurs, rather than forcing you to poll an API for changes. In a scraping pipeline, webhooks are the standard delivery mechanism for asynchronous jobs. When a 100,000-URL crawl finishes or a spot-price monitor detects a drop, the extraction engine POSTs the structured JSON payload directly to your endpoint.

Data Delivery · Event-Driven · HTTP POST · Async Scraping · Payloads
// 02 — definitions

Don't ask.
We'll tell you.

The shift from polling to pushing. Why event-driven data delivery is the only sane way to run asynchronous scraping pipelines at scale.


TL;DR

Webhooks eliminate the overhead of continuous polling by reversing the client-server relationship. Instead of your infrastructure asking "is the scrape done yet?" every 30 seconds, the scraping engine POSTs the final dataset to your server as soon as it's ready. Webhooks are the backbone of modern, decoupled data pipelines.

01 Definition & structure
A webhook is an HTTP callback that lets one system notify another. In a typical API integration the client initiates every request and has to poll for changes; a webhook is event-driven. When a specific event occurs in the source system, such as a scraping job completing, the source system makes an HTTP POST request to a URL registered by the destination system. The payload is typically JSON and contains either the relevant data or a pointer to it.
02 Webhooks vs. Polling
Polling requires your server to repeatedly ask, "Is the data ready?" This wastes compute cycles, consumes unnecessary bandwidth, and introduces latency (if you poll every 60 seconds, your data is on average 30 seconds stale). Webhooks invert this. Your server sits idle until the data is ready, at which point the scraping engine pushes it directly to you. It is the only scalable way to handle asynchronous jobs that might take anywhere from 5 seconds to 5 hours to complete.
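The staleness arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes a client polling at a fixed interval starting one interval after the job is submitted; the function name `polling_cost` is hypothetical.

```python
import math


def polling_cost(job_seconds: float, poll_interval: float) -> tuple[int, float]:
    """Return (empty_polls, staleness_seconds) for one job under fixed-interval polling.

    Assumes polls happen at t = interval, 2 * interval, ... after submission.
    """
    # The first poll at or after completion is the one that finds the data.
    successful_poll = math.ceil(job_seconds / poll_interval)
    # Every poll before it came back empty.
    empty_polls = max(successful_poll - 1, 0)
    # Staleness is how long the finished data waited to be picked up.
    staleness = successful_poll * poll_interval - job_seconds
    return empty_polls, staleness


# A 5-hour job polled every 60 seconds burns 299 empty requests.
empty, _ = polling_cost(5 * 3600, 60)

# Averaged over jobs finishing at each second of a 60-second window,
# the data waits roughly half the polling interval.
mean_staleness = sum(polling_cost(s, 60)[1] for s in range(1, 61)) / 60
```

With a webhook, both numbers collapse: there are no empty requests, and the delay is the delivery latency rather than the polling interval.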
03 Payload signatures and security
Because webhook endpoints must be publicly reachable to receive data, anyone who learns the URL can send forged requests. To guard against this, the sender computes an HMAC (Hash-based Message Authentication Code) over the raw payload body with a shared secret and sends it in a custom header (e.g., X-Signature). The receiver recomputes the HMAC over the bytes it received using its own copy of the secret and compares the two values with a constant-time comparison. If they match, the payload came from a holder of the secret and was not altered in transit.
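A receiver-side check along these lines might look as follows. It is a sketch using Python's standard `hmac` module, assuming HMAC-SHA256 with a hex-encoded digest and an optional `sha256=` prefix on the header value; the exact header format of any given sender should be confirmed against its documentation. The function names are hypothetical.

```python
import hashlib
import hmac


def sign(body: bytes, secret: bytes) -> str:
    """Compute the hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(body: bytes, secret: bytes, header_value: str) -> bool:
    """Return True if the signature header matches the body.

    Assumption: the header carries a hex digest, optionally prefixed
    with "sha256=".
    """
    received = header_value.removeprefix("sha256=")
    expected = sign(body, secret)
    # compare_digest runs in constant time, so it does not leak how many
    # leading characters of the signature were correct.
    return hmac.compare_digest(expected.encode(), received.encode())
```

Verify against the raw bytes of the request, not a re-serialized copy of the parsed JSON: re-serialization can change whitespace or key order and break the match.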
04 How DataFlirt handles delivery
We treat webhook delivery as a first-class infrastructure component. Every outbound webhook is dispatched via a distributed message queue. If your endpoint is unreachable, we apply exponential backoff with jitter to prevent thundering herd problems when your infrastructure recovers. For massive datasets, we don't clog your endpoint with gigabytes of JSON; we upload the data to S3 and webhook you a secure, time-bound presigned URL.
05 The synchronous processing trap
The most common mistake is processing the webhook payload synchronously. If your endpoint receives a POST, parses the JSON, runs database inserts, and only then returns a 200 OK, a slow query or a large payload will eventually push the response past the sender's timeout threshold (usually 5-10 seconds). The sender assumes failure and retries, and you end up with duplicate data. Always accept the payload, push it to an internal queue, and return a 202 Accepted immediately.
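The accept-then-queue pattern can be sketched without any web framework. The example below is illustrative only: it assumes the payload carries a `job_id` field (as in the delivery trace on this page), uses an in-memory `queue.Queue` and `set` as stand-ins for a real broker and a durable deduplication store, and omits signature verification for brevity. `make_handler` is a hypothetical name.

```python
import json
import queue


def make_handler(work_queue: queue.Queue, seen_job_ids: set[str]):
    """Build a handler that maps a raw request body to an HTTP status code."""

    def handle(body: bytes) -> int:
        try:
            event = json.loads(body)
        except ValueError:
            return 400  # not valid JSON
        if not isinstance(event, dict) or not isinstance(event.get("job_id"), str):
            return 400  # missing or malformed job_id
        job_id = event["job_id"]
        if job_id in seen_job_ids:
            # A retry of something already accepted: acknowledge it,
            # but do not enqueue the work a second time.
            return 202
        seen_job_ids.add(job_id)
        work_queue.put_nowait(event)  # heavy processing happens elsewhere
        return 202

    return handle
```

Deduplicating on `job_id` makes the handler idempotent, so a retry caused by a slow acknowledgement does not turn into duplicate rows downstream.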
// 03 — delivery math

The cost of
polling vs pushing.

Polling wastes compute and bandwidth on empty responses. Webhooks reduce network overhead to near zero for idle states. Here is how we calculate delivery efficiency and retry logic.

Polling Overhead: O = requests × empty_responses
Wasted cycles checking for data that isn't ready yet. · Network Efficiency Model
Webhook Efficiency: E = 1 − (failed_deliveries / total_events)
Target is >0.999. Failures are usually receiver-side timeouts. · DataFlirt Delivery SLO
Exponential Backoff: T = base × 2^attempt + jitter
How long we wait before retrying a failed webhook POST. · Standard Retry Protocol
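The backoff formula translates directly into code. The sketch below is one common way to implement it, not DataFlirt's actual scheduler: the `cap` on the exponential term, the uniform jitter distribution, and the default values are all assumptions added for illustration.

```python
import random


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 3600.0,
    max_jitter: float = 1.0,
    rng=random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Implements T = base * 2**attempt + jitter, with the exponential term
    capped so late attempts do not wait unboundedly long (the cap is an
    assumption, not part of the formula above).
    """
    exponential = min(base * 2 ** attempt, cap)
    # Random jitter spreads retries out so many failed deliveries do not
    # all hit a recovering endpoint at the same instant.
    return exponential + rng.uniform(0.0, max_jitter)


# Delay schedule for the first five attempts with a 2-second base.
schedule = [backoff_delay(n, base=2.0) for n in range(5)]
```

Each value in `schedule` falls between `2.0 * 2**n` and `2.0 * 2**n + 1.0`, so the waits roughly double from one attempt to the next.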
// 04 — the delivery trace

A 202 Accepted
from your server.

A live trace of DataFlirt's delivery engine pushing a completed e-commerce pricing dataset to a client's ingestion webhook.

POST /webhook/df-ingest · HMAC-SHA256 · JSON
edge.dataflirt.io — live
CAPTURED
// outbound webhook request
POST https://api.client-infra.com/v1/webhooks/dataflirt
Content-Type: application/json
X-DataFlirt-Signature: sha256=8f9a...2b1c
X-DataFlirt-Event: job.completed

// payload snippet
{
  "job_id": "job_987654321",
  "status": "success",
  "records_extracted": 14205,
  "data_url": "s3://df-client-bucket/exports/job_987654321.json.gz"
}

// client response
HTTP/1.1 202 Accepted
Response-Time: 142ms
delivery.status: CONFIRMED
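On the receiving side, a payload like the one in this trace can be parsed into a typed record before it is queued. The sketch below assumes the four fields shown in the snippet are always present; the class and function names are hypothetical, and a real `job.completed` event may carry more fields than this excerpt shows.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class JobCompleted:
    """Typed view of the payload fields shown in the trace above."""

    job_id: str
    status: str
    records_extracted: int
    data_url: str


def parse_job_completed(body: bytes) -> JobCompleted:
    """Parse a raw request body; raises KeyError or ValueError if malformed."""
    raw = json.loads(body)
    return JobCompleted(
        job_id=str(raw["job_id"]),
        status=str(raw["status"]),
        records_extracted=int(raw["records_extracted"]),
        data_url=str(raw["data_url"]),
    )
```

Failing fast on a missing field keeps malformed events out of the internal queue, where they would be harder to trace back to the delivery that produced them.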
// 05 — failure modes

Why webhooks
fail in production.

Webhooks are fire-and-forget by nature. If your receiving endpoint isn't robust, data gets dropped. These are the most common reasons client endpoints reject our payloads.

DELIVERIES/DAY · 2.4M+
RETRY RATE · 1.2%
UPDATED · 2026-05-19
01 Receiver Timeout
>10s response · Client processes data synchronously instead of queueing

02 502 Bad Gateway / 503 Service Unavailable
infra overload · Client ingestion layer down or scaling too slowly

03 Signature Mismatch
auth failure · Secret rotation mismatch or bad HMAC calculation

04 Payload Size Limits
413 error · WAF or reverse proxy blocking large JSON bodies

05 Rate Limiting
429 error · Client API limits hit during burst delivery
// 06 — delivery architecture

Guaranteed delivery,

even when your ingestion layer goes down.

DataFlirt's webhook engine doesn't just fire a POST and hope for the best. We wrap every delivery in a durable retry queue with exponential backoff and jitter. If your endpoint returns a 503 or times out, the payload is safely parked in that queue and retried for up to 72 hours. Only after the retry window is exhausted does it move to our dead-letter queue, where it waits for manual intervention. Your downtime shouldn't mean lost data.
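The decision a sender makes after each attempt can be expressed as a small pure function. This is a sketch built from the figures quoted on this page (a 2xx response within 10 seconds counts as delivered, retries continue for up to 72 hours); the function name, return values, and argument shapes are hypothetical and do not describe DataFlirt's internal code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RETRY_WINDOW = timedelta(hours=72)   # figure quoted on this page
RESPONSE_TIMEOUT_SECONDS = 10.0      # figure quoted on this page


def next_action(
    status_code: Optional[int],
    elapsed_seconds: float,
    first_attempt_at: datetime,
    now: datetime,
) -> str:
    """Return "confirmed", "retry", or "dead_letter" for one delivery attempt.

    status_code is None when no response arrived (connection error or timeout).
    """
    delivered = (
        status_code is not None
        and 200 <= status_code < 300
        and elapsed_seconds <= RESPONSE_TIMEOUT_SECONDS
    )
    if delivered:
        return "confirmed"
    if now - first_attempt_at >= RETRY_WINDOW:
        # Retry window exhausted: park the event for manual intervention.
        return "dead_letter"
    return "retry"
```

Keeping the decision separate from the HTTP call makes the policy easy to test against fixed timestamps.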

Delivery Job Status

Real-time state of a webhook dispatch for a large dataset.

delivery.id    del_8847291a
endpoint       api.client.com/ingest
attempt        3/10
last_error     HTTP 504 Gateway Timeout
next_retry     in 4m 30s
payload.size   2.4 MB
queue.state    parked · durable

// 07 — FAQ

Common
questions.

About webhook security, payload limits, retry logic, and how to properly ingest asynchronous scraping results.

How do I secure my webhook endpoint from unauthorized requests?
Never rely on obscurity. We sign every webhook payload using an HMAC-SHA256 hash of the request body and a shared secret. Your endpoint must compute the same hash over the raw incoming body and compare it to the X-DataFlirt-Signature header using a constant-time comparison. If they don't match, reject the request with a 401.
What happens if my server is down when the scrape finishes?
We don't drop the data. DataFlirt uses an exponential backoff retry strategy. If your server returns anything other than a 2xx status code, or takes longer than 10 seconds to respond, we queue the event and retry. We will attempt delivery for up to 72 hours before moving it to a dead-letter queue.
Should I process the data synchronously in the webhook handler?
No. This is the most common cause of webhook timeouts. Your webhook handler should do exactly three things: verify the signature, push the payload to a local message queue (like Kafka, RabbitMQ, or Redis), and immediately return a 202 Accepted. Process the data asynchronously on your end.
Is there a limit to the payload size sent via webhook?
Yes. For payloads under 5 MB, we POST the raw JSON directly in the body. For anything larger, we upload the dataset to an S3 bucket and send a webhook containing a pre-signed download URL. Your system then fetches the file at its own pace.
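A consumer therefore has to handle two payload shapes. The sketch below shows one way to branch on them. The `data_url` field name comes from the delivery trace on this page, while `records` is a hypothetical name for inline data, since the inline payload layout is not shown here.

```python
from typing import Any


def resolve_payload(event: dict[str, Any]) -> tuple[str, Any]:
    """Decide how to obtain the dataset for one webhook event.

    Returns ("fetch", url) when the event points at a stored file, or
    ("inline", records) when the data travelled in the request body.
    """
    if "data_url" in event:
        # Large dataset: download later, at the consumer's own pace.
        return "fetch", event["data_url"]
    if "records" in event:  # hypothetical field name for inline data
        return "inline", event["records"]
    raise ValueError("event carries neither a data_url nor inline records")
```

Because pre-signed URLs are time-bound, the download should be scheduled promptly by the queue worker rather than deferred indefinitely.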
How can I test webhooks locally during development?
Use a tunneling service like ngrok, localtunnel, or Cloudflare Tunnel. They provide a public HTTPS URL that forwards traffic to a port on your local machine. You can plug this URL into the DataFlirt dashboard to test your ingestion logic before deploying to production.
Can I use webhooks for real-time price monitoring?
Absolutely. For spot-price monitoring, we configure the pipeline to fire a webhook only when a price change is detected (a diff-based delivery). This means your system only receives a POST when there is actionable data, which keeps ingestion noise close to zero.
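The idea behind diff-based delivery can be illustrated with a comparison of two price snapshots. This is a conceptual sketch, not DataFlirt's pipeline: the function name, the SKU-keyed dictionaries, and the choice to ignore newly seen SKUs are all assumptions.

```python
def price_changes(
    previous: dict[str, float], current: dict[str, float]
) -> list[dict]:
    """Return one change record per SKU whose price differs between snapshots.

    Assumption: SKUs that appear for the first time are not reported as
    changes; only items present in both snapshots are compared.
    """
    changes = []
    for sku, new_price in current.items():
        old_price = previous.get(sku)
        if old_price is not None and old_price != new_price:
            changes.append(
                {"sku": sku, "old_price": old_price, "new_price": new_price}
            )
    return changes


# Only sku-2 changed, so only one event would be worth a webhook.
events = price_changes(
    {"sku-1": 19.99, "sku-2": 5.00},
    {"sku-1": 19.99, "sku-2": 4.50, "sku-3": 12.00},
)
```

An empty result means there is nothing to send, which is how unchanged crawls stay silent.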