
What is Wget?

Wget is a classic, non-interactive command-line utility for downloading files over HTTP, HTTPS, and FTP. It excels at recursive mirroring and at resuming interrupted downloads, but its non-browser TLS fingerprint and lack of JavaScript execution make it easy for modern anti-bot systems to detect. For scraping engineers, it is a reliable tool for pulling static datasets or open directories, and an almost certain block when pointed at protected commercial targets.

CLI · File Transfer · Recursive Crawling · Static Fetch · Legacy

// 02 — definitions

The original mirror tool.

Built for a web of static files and directories, Wget remains the standard for bulk data retrieval where JavaScript and stealth aren't required.


TL;DR

Wget is a GNU command-line tool designed for robust file downloading and recursive website mirroring. It handles retries, respects robots.txt by default, and can resume interrupted transfers. However, its default User-Agent and static TLS handshake make it trivial for Cloudflare or DataDome to flag it as a bot.

01 Definition & structure

Wget (World Wide Web get) is a free software package for retrieving files using HTTP, HTTPS, FTP, and FTPS. Unlike a web browser, it is non-interactive: it can keep running in the background without a user attached, which makes it well suited to automated scripts and cron jobs. Its defining feature is recursive retrieval: it downloads HTML pages, parses the links within them, and fetches the linked documents, effectively mirroring entire websites locally.

02 Recursive mirroring in practice

When invoked with the --mirror flag, Wget turns on recursion, infinite recursion depth, and time-stamping. It fetches the initial URL, parses the HTML for link-bearing tags such as <a href> and <img src>, and adds those URLs to its queue. It respects robots.txt by default, and with --convert-links it rewrites links in the downloaded pages to point at the local copies for offline viewing.
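The mirroring behaviour described above can be sketched as a small Python wrapper that assembles the Wget command line. This is a minimal illustration, not a DataFlirt tool: the function name `build_mirror_command`, the `mirror` destination directory, and the example URL are assumptions. The flags themselves are standard GNU Wget options.

```python
import shlex

def build_mirror_command(url, dest_dir="mirror", wait_seconds=2):
    """Return a wget argument list for a polite recursive mirror (hypothetical helper)."""
    return [
        "wget",
        "--mirror",            # recursion + infinite depth + time-stamping
        "--convert-links",     # rewrite links to point at the local copies
        "--page-requisites",   # also fetch images/CSS needed to render pages
        "--no-parent",         # never ascend above the starting directory
        f"--wait={wait_seconds}",
        f"--directory-prefix={dest_dir}",
        url,
    ]

cmd = build_mirror_command("https://data.example.gov/public/")
print(shlex.join(cmd))  # shell-safe rendering of the command

# To actually run it (requires wget on PATH and network access):
# import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems if the URL contains characters such as `&` or `?`.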

03 The fingerprinting problem

Wget is highly exposed to modern anti-bot systems. It performs its TLS handshake through a standard system library (OpenSSL or GnuTLS), so its JA3 fingerprint matches no mainstream browser and is easy to flag as automated. Its HTTP header order is also fixed, and it cannot execute JavaScript, so any target protected by a JS challenge or a CAPTCHA will stop it at the first request.

04 How DataFlirt uses it

We do not use Wget for scraping HTML from commercial targets. However, it remains a critical utility in our data engineering pipelines for ingestion. When a client provides a secure FTP server or an open S3 bucket containing terabytes of raw CSV or JSON files, we use Wget (or aria2) to pull the data. Its ability to resume broken downloads (--continue) makes it perfect for massive, static payloads.
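A resumable bulk pull of the kind described above might be assembled as follows. This is a sketch under stated assumptions: the helper name `build_resume_command`, the retry and timeout values, and the FTP URL are illustrative and not taken from any DataFlirt pipeline. The flags are standard GNU Wget options.

```python
def build_resume_command(url, dest_dir, tries=20, wait_retry=10, timeout=30):
    """Return a wget argument list for a resumable download (hypothetical helper)."""
    return [
        "wget",
        "--continue",                 # resume a partially downloaded file
        f"--tries={tries}",           # retry count before giving up
        f"--waitretry={wait_retry}",  # back off up to this many seconds between retries
        f"--timeout={timeout}",       # network timeout in seconds
        f"--directory-prefix={dest_dir}",
        url,
    ]

# Example URL is a placeholder; credentials are deliberately not put on the command line.
cmd = build_resume_command("ftp://ftp.example.com/exports/daily.csv.gz", "/data/incoming")
print(" ".join(cmd))

# To actually run it (requires wget on PATH and network access):
# import subprocess; subprocess.run(cmd, check=True)
```

Re-running the same command after a dropped connection picks up from the bytes already on disk instead of starting over, which is what makes this pattern practical for very large files.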

05 Did you know?

Wget was originally written in 1996 by Hrvoje Nikšić. Because it is single-threaded, it can be slow when mirroring sites with thousands of small files, since every request waits on network latency. Wget2, a modern rewrite, supports multi-threading, HTTP/2, and compression, which significantly speeds up recursive downloads while keeping a largely compatible command-line interface.

// 03 — the fetch model

How fast can Wget mirror?

Wget is single-threaded by design. Its throughput is bounded by network latency and the target's time-to-first-byte, making it inefficient for high-concurrency scraping.

Total mirror time: T = N × (RTT + TTFB + Transfer)

Sequential fetching means latency stacks linearly across N files. Source: network fundamentals.
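As a worked example of the formula T = N × (RTT + TTFB + Transfer), the sketch below plugs in illustrative per-file costs. The numbers (10,000 files, 50 ms RTT, 100 ms TTFB, 20 ms transfer) are assumptions chosen for the arithmetic, not measurements.

```python
def mirror_time_seconds(n_files, rtt_ms, ttfb_ms, transfer_ms):
    """T = N x (RTT + TTFB + Transfer), with per-file costs given in milliseconds."""
    return n_files * (rtt_ms + ttfb_ms + transfer_ms) / 1000

# Assumed values: 10,000 small files, 170 ms of sequential cost per file.
t = mirror_time_seconds(n_files=10_000, rtt_ms=50, ttfb_ms=100, transfer_ms=20)
print(f"{t:.0f} s (~{t / 60:.1f} min)")  # prints "1700 s (~28.3 min)"
```

Even with tiny files, the sequential model spends almost half an hour mostly waiting on latency, which is the cost that parallel fetchers avoid.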

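Wget politeness delay: Wait = wait × U(0.5, 1.5)

With --random-wait, the pause between requests varies between 0.5 and 1.5 times the --wait value, which helps avoid basic rate limits. Wget does not apply a robots.txt Crawl-delay on its own, so set --wait to match it manually. Source: GNU Wget manual.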
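Per the GNU Wget manual, --random-wait varies the pause between requests from 0.5 to 1.5 times the --wait value. The sketch below simulates that range in Python. Drawing from a uniform distribution is an assumption made for illustration (the manual only states the bounds), and `random_wait_delay` is a hypothetical helper name.

```python
import random

def random_wait_delay(wait_seconds, rng=random):
    """Sample one delay in [0.5 * wait, 1.5 * wait]; uniform sampling is an assumption."""
    return rng.uniform(0.5 * wait_seconds, 1.5 * wait_seconds)

rng = random.Random(42)  # seeded so the simulation is repeatable
delays = [random_wait_delay(2.0, rng) for _ in range(1000)]

mean = sum(delays) / len(delays)
print(f"min={min(delays):.2f}s max={max(delays):.2f}s mean={mean:.2f}s")
```

With --wait=2, every sampled delay falls between 1 and 3 seconds, so the average pace stays near the configured wait while the exact interval is harder to pattern-match.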

DataFlirt static fetch efficiency: E = Payload / (CPU_cycles × Memory)

Wget uses negligible RAM compared to headless browsers. Source: internal benchmarking.

// 04 — the command line

Mirroring an open data directory.

A standard Wget invocation to recursively download a public dataset, showing its built-in retry and rate-limiting mechanics.

GNU Wget 1.21 · recursive · rate-limited


$ wget --mirror --wait=2 --random-wait https://data.example.gov/public/
// Resolving host...
DNS: data.example.gov -> 192.0.2.42
// Initiating handshake...
TLS: TLSv1.3 / TLS_AES_256_GCM_SHA384
Request: GET /public/ HTTP/1.1
User-Agent: Wget/1.21.2 // Default UA exposed
Response: 200 OK
Length: 4096 (text/html)
// Parsing links for recursion...
Queue: 14 URLs discovered
// Applying wait...
Sleep: 2.4s (randomized)
Fetch: /public/dataset_2026_01.csv.gz
Progress: 100% [===================>] 42.1M 12.4MB/s
Status: Downloaded 15 files, 640M in 1m 12s

// 05 — detection vectors

Why Wget fails on modern targets.

Wget was built for utility, not stealth. Its network signature is static and widely known, making it the easiest tool for a WAF to block.

WAF BLOCK RATE: >98% on commercial targets

JS EXECUTION: None

UPDATED: 2026-05-19

01 User-Agent string

instant block · Default is Wget/1.x, trivially filtered

02 TLS Fingerprint / JA3

network layer · Static OpenSSL/GnuTLS signature

03 HTTP Header Order

protocol layer · Predictable, non-browser sequence

04 Lack of JS execution

runtime layer · Fails all JS challenges silently

05 Sequential requests

behavioral · Single-threaded fetching looks mechanical
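To show how little effort the first vector takes to exploit, here is a minimal sketch of a server-side User-Agent check. It is a hypothetical rule written for illustration, not the logic of any specific WAF, and the names `WGET_UA` and `is_default_wget` are assumptions.

```python
import re

# Matches Wget's default User-Agent shape, e.g. "Wget/1.21.2".
WGET_UA = re.compile(r"^Wget/\d+(\.\d+)*", re.IGNORECASE)

def is_default_wget(user_agent):
    """Return True if the header looks like an unmodified Wget User-Agent."""
    return bool(WGET_UA.match(user_agent or ""))

print(is_default_wget("Wget/1.21.2"))                       # True
print(is_default_wget("Mozilla/5.0 (X11; Linux x86_64)"))   # False
```

A single regular expression is enough for this layer, which is why overriding the User-Agent only removes the cheapest signal and leaves the TLS, header-order, and behavioural vectors untouched.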

// 06 — our stack

Right tool for the job, using Wget where it belongs.

We don't use Wget for scraping HTML. When a target requires stealth, we use our custom TLS-patched HTTP clients or headless fleet. But when a pipeline's final step is pulling a 500 GB daily CSV dump from a client's secure FTP server or an open government S3 bucket, Wget is unbeatable. It handles network drops gracefully, resumes partial downloads, and consumes almost zero CPU. Engineering is about matching the tool's complexity to the target's defenses.

Static payload ingestion

A DataFlirt worker pulling a daily bulk dataset via Wget.

job.type: bulk_ingest

tool: wget --continue

target.auth: basic_auth

payload.size: 142.4 GB

resume.status: active

cpu.overhead: < 1%

waf.interference: none


// 07 — FAQ

Common questions.

About Wget's capabilities, its limitations in modern scraping, and how it compares to other command-line tools.


What is the difference between Wget and cURL?

Wget is built for downloading files and recursive mirroring: it can follow links and download entire directory trees. cURL is a general-purpose transfer tool better suited to API testing, complex HTTP requests, and single-file transfers. cURL does not support recursive downloading natively.

Can I use Wget for web scraping?

For static, unprotected sites, yes. For modern commercial targets behind Cloudflare or DataDome, no. It cannot execute JavaScript, and its TLS fingerprint is instantly flagged. Wget is a downloader, not a scraper.

How do I change Wget's User-Agent?

You can use the --user-agent="Mozilla/5.0..." flag. However, modern anti-bot systems check the TLS fingerprint and HTTP header order, so spoofing the User-Agent alone will not bypass a WAF.

Does Wget respect robots.txt?

Yes, by default. During recursive retrieval it downloads and parses robots.txt before following links on a site. You can disable this with -e robots=off, but doing so on protected targets increases your risk of an IP ban.

How does DataFlirt handle large file downloads?

We use Wget or aria2 for raw, authenticated bulk data transfers where stealth isn't needed (like pulling from a client's FTP). For protected assets, we route the download through our residential proxy network using custom, TLS-patched HTTP clients.

What is Wget2?

Wget2 is the successor to Wget, featuring multi-threading, HTTP/2 support, and faster parsing. It's significantly better for high-throughput mirroring but shares the same fundamental anti-bot limitations as the original.
