
What is Apify SDK?

Apify SDK is an open-source Node.js and Python toolkit designed to build, run, and scale web scrapers and browser automation jobs. It abstracts away the boilerplate of managing request queues, proxy rotation, and browser lifecycles. While it excels at standardizing scraper architecture and accelerating initial development, relying on its default configurations in high-stakes environments often leads to predictable fingerprinting and rapid blocking by modern anti-bot systems.

Node.js · Python · Playwright · Puppeteer · Scraping Framework
// 02 — definitions

Framework over boilerplate.

The standard toolkit for structuring Node.js and Python scrapers, and why abstracting the browser lifecycle is a double-edged sword.


TL;DR

Apify SDK provides a unified API for managing request queues, key-value stores, and proxy rotation across HTTP clients and headless browsers. It's excellent for developer velocity, but its default browser fingerprints are heavily profiled by Cloudflare and DataDome. Production deployments require extensive customisation of the underlying Playwright or Puppeteer instances.

01 Definition & structure
Apify SDK (and its Node.js successor, Crawlee) is a framework that provides a unified API for building web scrapers. Instead of manually writing loops to manage Playwright instances, handle retries, and save data to CSVs, the SDK provides classes like PlaywrightCrawler or CheerioCrawler. It standardizes the boilerplate of web scraping, allowing developers to focus purely on the extraction logic.
02 How it manages state
The SDK relies on three core storage abstractions: the RequestQueue (manages URLs to visit, handles deduplication, and tracks retries), the Dataset (append-only storage for extracted records), and the KeyValueStore (for saving files, state, or configuration). When run locally, these are backed by SQLite and JSON files; when deployed to a cloud platform, they sync with remote APIs.
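The three abstractions can be pictured with a small in-memory model. This is a conceptual sketch in plain Python, not the SDK's own classes or API: `MiniRequestQueue`, `MiniDataset`, and `MiniKeyValueStore` are hypothetical names, the retry limit of 3 is an assumed value, and the URL is a placeholder.

```python
from collections import deque
from typing import Any, Optional


class MiniRequestQueue:
    """Hypothetical queue: deduplicates URLs and tracks retries."""

    def __init__(self, max_retries: int = 3) -> None:
        self.max_retries = max_retries  # assumed limit, for illustration
        self._pending: deque = deque()
        self._seen: set = set()
        self._retries: dict = {}

    def add(self, url: str) -> bool:
        """Enqueue a URL once; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append(url)
        return True

    def fetch_next(self) -> Optional[str]:
        """Pop the next URL to visit, or None when the queue is empty."""
        return self._pending.popleft() if self._pending else None

    def reclaim(self, url: str) -> bool:
        """Re-enqueue a failed URL; return False once retries run out."""
        self._retries[url] = self._retries.get(url, 0) + 1
        if self._retries[url] > self.max_retries:
            return False
        self._pending.append(url)
        return True


class MiniDataset:
    """Hypothetical append-only store for extracted records."""

    def __init__(self) -> None:
        self._rows: list = []

    def push(self, record: dict) -> None:
        self._rows.append(dict(record))  # copy so later edits can't alter history

    def rows(self) -> list:
        return list(self._rows)


class MiniKeyValueStore:
    """Hypothetical named storage for state, files, or configuration."""

    def __init__(self) -> None:
        self._data: dict = {}

    def set_value(self, key: str, value: Any) -> None:
        self._data[key] = value

    def get_value(self, key: str, default: Any = None) -> Any:
        return self._data.get(key, default)


queue = MiniRequestQueue()
first_add = queue.add("https://example.com/category/shoes")
second_add = queue.add("https://example.com/category/shoes")  # duplicate, skipped

dataset = MiniDataset()
url = queue.fetch_next()
dataset.push({"url": url, "title": "placeholder"})

store = MiniKeyValueStore()
store.set_value("last_url", url)
```

The split matters because each store has a different access pattern: the queue is read and written constantly, the dataset only grows, and the key-value store is touched occasionally. A real implementation would persist all three rather than hold them in memory.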
03 The fingerprinting problem
Frameworks aim for ease of use, which means they ship with default configurations. The SDK's built-in fingerprint generation and stealth plugins are widely known to anti-bot vendors. Because thousands of developers use the exact same default stealth profiles, those profiles become signatures of bot traffic. Bypassing modern WAFs requires stripping away these defaults and implementing custom TLS and browser engine patching.
04 Migrating from SDKs to DataFlirt
We frequently migrate clients who have hit the ceiling of what a framework can do. When a target implements strict TLS fingerprinting or behavioral biometrics, a Node.js framework cannot save you. We replace SDK-based codebases with our bare-metal orchestration layer, which handles extraction at the network and engine level without the overhead or predictable signatures of a high-level framework.
05 Did you know?
In 2022, the Node.js version of Apify SDK was rebranded to Crawlee. The move was designed to decouple the open-source framework from the Apify commercial cloud platform, encouraging wider community adoption. However, the Python version retained the Apify SDK naming convention, leading to occasional confusion in multi-language engineering teams.
// 03 — the overhead

What does the abstraction cost?

Frameworks trade CPU and memory for developer velocity. Here is how we model the overhead of running a heavy SDK versus a bare-metal Playwright implementation.

Memory overhead per worker: $M = M_{\text{browser}} + M_{\text{sdk\_state}} + M_{\text{queue\_cache}}$
SDK state management adds ~15-20% memory overhead per container. Source: DataFlirt infrastructure benchmarks.
Queue synchronization latency: $T_{\text{sync}} = N_{\text{requests}} \times T_{\text{sqlite\_write}}$
Local RequestQueue uses SQLite, which bottlenecks on disk I/O above 500 req/s. Source: Crawlee architecture docs.
Fingerprint predictability: $P_{\text{block}} = 1 - (E_{\text{custom}} / E_{\text{default\_stealth}})$
Default stealth plugins have near-zero entropy against modern WAFs. Source: DataFlirt anti-bot research.
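To make the first two models concrete, here is a small worked calculation. Every input is an assumption chosen for illustration: a 1.0 GB browser footprint, the 15-20% overhead read as the sum of SDK state and queue cache relative to that footprint, a per-write cost of 1/500 s derived from the 500 req/s ceiling quoted above, and a queue of 14,200 requests.

```python
def worker_memory_gb(m_browser: float, m_sdk_state: float, m_queue_cache: float) -> float:
    """M = M_browser + M_sdk_state + M_queue_cache, all in GB."""
    return m_browser + m_sdk_state + m_queue_cache


def queue_sync_seconds(n_requests: int, t_write_s: float) -> float:
    """T_sync = N_requests * T_sqlite_write, with the write cost in seconds."""
    return n_requests * t_write_s


M_BROWSER_GB = 1.0  # assumed browser footprint per worker

# Assumed split of the overhead between SDK state and queue cache.
low = worker_memory_gb(M_BROWSER_GB, m_sdk_state=0.10, m_queue_cache=0.05)   # 15% total
high = worker_memory_gb(M_BROWSER_GB, m_sdk_state=0.13, m_queue_cache=0.07)  # 20% total

T_WRITE_S = 1 / 500  # assumed: one write at the 500 req/s ceiling
t_sync = queue_sync_seconds(14_200, T_WRITE_S)

print(f"memory per worker: {low:.2f}-{high:.2f} GB")
print(f"queue sync time: {t_sync:.1f} s")
```

Under these assumptions a worker needs between 1.15 and 1.20 GB, and writing 14,200 requests serially takes about 28 seconds. The model is linear, so doubling the queue doubles the sync time.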
// 04 — execution trace

Bootstrapping a PlaywrightCrawler.

A standard SDK initialization sequence, showing the proxy configuration, queue hydration, and the inevitable fingerprinting flags when hitting a protected target.

Node.js · Crawlee · Playwright
// init
INFO PlaywrightCrawler: Starting the crawler.
INFO SystemInfo: Memory: 1.2GB/4.0GB, CPU: 12%

// state hydration
DEBUG RequestQueue: Hydrating 14,200 requests from sqlite...
DEBUG ProxyConfiguration: Initialized with 50 datacenter IPs.

// execution
INFO PlaywrightCrawler: Opening new browser page...
WARN FingerprintGenerator: Using default stealth profile.
DEBUG Request: GET https://target.com/category/shoes

// waf response
DEBUG Response: 403 Forbidden
WARN PlaywrightCrawler: Request failed, returning to queue.
DEBUG RequestQueue: Reclaiming request (retries: 1/3)

// analysis
ERROR WAF_Flag: navigator.webdriver detected.
ERROR WAF_Flag: Canvas hash matches known puppeteer-stealth.
INFO PlaywrightCrawler: Pausing execution due to high error rate.
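The final line of the trace, pausing on a high error rate, can be modeled with a sliding window over recent responses. The sketch below is a generic illustration in Python rather than the crawler's actual logic: `ErrorRateGate` is a hypothetical name, and the window size and threshold are assumed values.

```python
from collections import deque
from typing import Optional


class ErrorRateGate:
    """Hypothetical sliding-window gate that signals when to pause a crawl."""

    def __init__(self, window: int = 20, threshold: float = 0.5) -> None:
        self.threshold = threshold                     # assumed cut-off
        self._outcomes: deque = deque(maxlen=window)   # True means success

    def record(self, status_code: int) -> None:
        """Store whether the response was a success (status below 400)."""
        self._outcomes.append(status_code < 400)

    @property
    def error_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return self._outcomes.count(False) / len(self._outcomes)

    def should_pause(self) -> bool:
        """Only judge once the window is full, to avoid pausing on one early 403."""
        window_full = len(self._outcomes) == self._outcomes.maxlen
        return window_full and self.error_rate >= self.threshold


gate = ErrorRateGate(window=4, threshold=0.75)
paused_at: Optional[int] = None
for index, status in enumerate([200, 403, 403, 403, 403]):
    gate.record(status)
    if gate.should_pause():
        paused_at = index  # stop issuing requests instead of burning retries
        break
```

Pausing early matters because every retried request against a target that is already blocking spends proxy bandwidth and reinforces the block.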
// 05 — failure modes

Where SDK defaults fall short.

Apify SDK is built for general-purpose scraping. When deployed against tier-1 anti-bot systems, its default abstractions become its biggest liabilities.

Sample size: 1,200+ migrations
Targets: Tier-1 WAF protected
Updated: 2026-05-19
01 Fingerprint staleness (94% block rate): Default stealth plugins are fingerprinted by vendors
02 SQLite queue locking (I/O bottleneck): Disk I/O bottlenecks on high-concurrency local runs
03 Memory bloat (OOM crashes): Stateful dataset caching in long-running jobs
04 Proxy rotation predictability (behavioral flag): Round-robin defaults fail behavioral checks
05 Event loop blocking (latency spike): Heavy DOM parsing on the main Node.js thread
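The proxy rotation item is easiest to see side by side. The sketch below contrasts plain round-robin with pinning one proxy to each logical session. It is an illustration in Python under stated assumptions, not the SDK's rotation code: `StickySessionPool` is a hypothetical name and the proxy addresses are placeholders from a documentation IP range. Whether sticky sessions pass a given behavioral check is not something this sketch can show.

```python
import itertools
import random
from typing import Iterator, List, Optional

# Placeholder addresses (203.0.113.0/24 is reserved for documentation).
PROXIES = [f"http://203.0.113.{i}:8000" for i in range(1, 6)]


def round_robin(proxies: List[str]) -> Iterator[str]:
    """Each request gets the next proxy, so one visitor hops IPs every page."""
    return itertools.cycle(proxies)


class StickySessionPool:
    """Hypothetical pool that keeps one proxy per session until it is retired."""

    def __init__(self, proxies: List[str], seed: Optional[int] = None) -> None:
        self._proxies = list(proxies)
        self._rng = random.Random(seed)
        self._assigned: dict = {}

    def proxy_for(self, session_id: str) -> str:
        """Pick a random proxy on first use, then reuse it for that session."""
        if session_id not in self._assigned:
            self._assigned[session_id] = self._rng.choice(self._proxies)
        return self._assigned[session_id]

    def retire(self, session_id: str) -> None:
        """Drop a blocked session so its next request gets a fresh assignment."""
        self._assigned.pop(session_id, None)


rotation = round_robin(PROXIES)
hop_one, hop_two = next(rotation), next(rotation)  # two pages, two different IPs

pool = StickySessionPool(PROXIES, seed=7)
page_one = pool.proxy_for("session-a")
page_two = pool.proxy_for("session-a")  # same IP as page_one
```

With round-robin, cookies issued to one IP reappear on another a moment later, which is the kind of inconsistency a behavioral check can key on. Pinning keeps the IP and the cookie jar together for the life of the session.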
// 06 — architecture

Frameworks are for building, infrastructure is for scaling.

Many DataFlirt clients start by building their MVP with Apify SDK (or Crawlee). It's the right choice for a V1. But as target complexity increases, the abstraction layer gets in the way. You can't patch the TLS stack from within a Node.js framework, and you can't control the exact timing of Chrome's rendering pipeline when it's wrapped in three layers of promises. We replace SDK-heavy architectures with bare-metal orchestration, dropping memory usage by 40% and eliminating framework-induced fingerprint leaks.

SDK vs Bare-Metal Orchestration

Performance delta after migrating a 2M-page daily crawl off a standard SDK setup.

Metric | SDK | Bare-metal
memory.per_worker | 1.4 GB | 850 MB
tls.ja3_spoofing | impossible | native
queue.throughput | ~400 req/s | 12,000+ req/s
fingerprint.entropy | low (stealth plugin) | high (real hardware)
block_rate.datadome | 14.2% | 0.03%
maintenance.overhead | high (patching internals) | low (API driven)
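The deltas in the comparison can be checked with a few lines of arithmetic. This sketch takes the quoted figures at face value and adds two assumptions of its own: 1.4 GB is treated as 1,400 MB, and the block rates are applied uniformly to all 2M daily pages.

```python
PAGES_PER_DAY = 2_000_000

SDK_MEMORY_MB = 1_400       # 1.4 GB, assumed to mean 1,400 MB
BARE_MEMORY_MB = 850

SDK_THROUGHPUT = 400        # req/s, quoted as "~400"
BARE_THROUGHPUT = 12_000    # req/s, quoted as "12,000+", so a lower bound

SDK_BLOCK_RATE = 0.142      # 14.2%
BARE_BLOCK_RATE = 0.0003    # 0.03%

memory_reduction = 1 - BARE_MEMORY_MB / SDK_MEMORY_MB
throughput_gain = BARE_THROUGHPUT / SDK_THROUGHPUT

# Assumes every page in the crawl is subject to the same block rate.
blocked_sdk = round(PAGES_PER_DAY * SDK_BLOCK_RATE)
blocked_bare = round(PAGES_PER_DAY * BARE_BLOCK_RATE)

print(f"memory reduction: {memory_reduction:.1%}")
print(f"throughput gain: at least {throughput_gain:.0f}x")
print(f"blocked pages per day: {blocked_sdk:,} vs {blocked_bare:,}")
```

The memory reduction works out to roughly 39%, consistent with the 40% figure quoted above, and under the uniform-rate assumption the daily blocked-page count falls from 284,000 to 600.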


// 07 — FAQ

Common questions.

Common questions about Apify SDK, Crawlee, memory management, and how DataFlirt handles migrations from framework-heavy codebases.

What is the difference between Apify SDK and Crawlee?
In 2022, Apify released the crawling core of its Node.js SDK as Crawlee, a standalone open-source library, to separate the tool from its commercial cloud platform. The Python version remains branded as Apify SDK for Python. Both serve the same architectural purpose: abstracting crawler state and browser management.
Can I bypass Cloudflare using just the SDK's stealth mode?
No. The built-in stealth plugins patch basic JavaScript properties like navigator.webdriver, but they do not alter the TLS fingerprint or the underlying browser engine's rendering quirks. A session that looks like stock Chrome in JavaScript but presents a non-matching TLS handshake or rendering profile is an inconsistency that Cloudflare can flag on high-security targets.
How does the RequestQueue handle millions of URLs?
Locally, it uses SQLite, which can bottleneck on disk I/O at high concurrency. In the cloud, it uses Apify's API, which introduces network latency. For massive scale, you need a dedicated distributed queue like Kafka or Redis, which requires bypassing the SDK's default queueing mechanisms.
Why does my SDK scraper consume so much memory over time?
Long-running jobs often suffer from memory bloat due to the SDK caching dataset entries and request states in memory before flushing to disk. This is compounded by Playwright's own memory leaks if browser contexts aren't aggressively recycled and garbage collected.
How does DataFlirt compare to building on Apify SDK?
Apify SDK is a tool for your engineers to build and maintain scrapers. DataFlirt is a managed pipeline. You don't write scraping code with us; you define the schema, and our bare-metal infrastructure handles the extraction, proxying, and anti-bot bypass. We eliminate the engineering overhead entirely.
Is it legal to use open-source scraping SDKs for commercial data?
The legality of scraping depends on what you scrape and how you access it, not the tool you use. Using an SDK doesn't grant immunity from CFAA, GDPR, or copyright claims. Always respect robots.txt and target ToS where applicable, and consult counsel for your specific use case.
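Checking robots.txt before fetching is the one part of that answer that can be automated. The sketch below uses Python's standard `urllib.robotparser`. The robots.txt content, the `my-crawler` user agent, and the example.com URLs are all made up for illustration, and passing this check says nothing about ToS or legal exposure.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, inlined so the example runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Allow: /category/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
USER_AGENT = "my-crawler"  # placeholder user agent

category_allowed = parser.can_fetch(USER_AGENT, "https://example.com/category/shoes")
checkout_allowed = parser.can_fetch(USER_AGENT, "https://example.com/checkout/cart")
delay_seconds = parser.crawl_delay(USER_AGENT)

print(category_allowed, checkout_allowed, delay_seconds)
```

In a live crawler the parser would be filled with `set_url()` and `read()` against the target's own robots.txt, and the crawl delay would feed the request scheduler.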
hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h