
What is Colly (Go)?

Colly is a lightning-fast, elegant web scraping framework written in Go. Designed for high-throughput extraction on the surface web, it uses a callback-based architecture to parse HTML streams with minimal memory overhead. For pipelines that don't require JavaScript rendering, Colly is the gold standard for maximizing requests per second per CPU core, allowing a single node to crawl millions of pages a day without breaking a sweat.

Golang · Concurrency · HTML Parsing · High Throughput · Stateless

// 02 — definitions

Speed by design.

Why Go's concurrency model makes Colly the weapon of choice for massive, stateless catalog crawls.


TL;DR

Colly is a Go-based scraping framework that excels at raw speed and low resource consumption. It relies on Go's goroutines to handle thousands of concurrent requests, making it ideal for surface web targets. However, it cannot execute JavaScript natively, meaning single-page applications (SPAs) require a different toolchain or API-level interception.

01 Definition & structure

Colly is a scraping framework for Go that relies on a Collector struct to manage HTTP requests and a series of callback functions (OnHTML, OnRequest, OnResponse) to process the data stream. Because it leverages Go's lightweight goroutines, it can maintain thousands of concurrent connections with a memory footprint measured in megabytes, not gigabytes.

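02 How it works in practice

You initialize a collector, define rules for which domains to visit, and attach callbacks. When you call Visit(url), Colly fetches the page. For every element in the response that matches an OnHTML CSS selector, the callback executes, allowing you to extract text, attributes, or enqueue new links. By default Visit blocks until the request and its callbacks have finished. With the collector's Async option enabled, requests run on goroutines, Visit returns immediately, and a call to Wait() blocks until the crawl has drained, so no goroutine sits idle waiting on another request's network I/O.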

03 The JavaScript limitation

Colly is strictly an HTTP client and HTML parser (built on goquery, a Go library with jQuery-like syntax). It does not embed a JavaScript engine or a browser rendering engine such as V8 or Blink. If a target site delivers an empty HTML shell and relies on React or Vue to render the DOM client-side, Colly will only see the empty shell. For these targets, you must either intercept the underlying JSON APIs or switch to a headless browser.

04 How DataFlirt handles it

We use Colly extensively for our high-speed discovery layer. When a client needs to monitor a 5-million-page e-commerce site for new products, running Playwright is cost-prohibitive. We deploy Colly workers to rapidly traverse sitemaps and category pages, identifying new URLs and pushing them to Kafka. This hybrid approach gives us Go's speed for discovery and Playwright's rendering power only when strictly necessary.

05 Did you know?

A standard $5/month VPS running a well-optimized Colly script can easily sustain 1,000+ requests per second, bandwidth permitting. The framework is so efficient that the most common failure mode for new users isn't crashing their own server; it's accidentally taking down the target website by overwhelming its database with concurrent requests.

// 03 — performance model

How fast can Colly run?

Colly's throughput is rarely CPU-bound; it is almost entirely constrained by network I/O, proxy latency, and target rate limits. Here is how we model Colly worker capacity at DataFlirt.

Effective RPS: $R_{\text{eff}} = \min\left(N_{\text{goroutines}} / T_{\text{latency}},\ R_{\text{target\_limit}}\right)$

Throughput scales linearly with goroutines until you hit the target's WAF or proxy limits. (Source: queue theory basics)

Memory footprint: $M_{\text{total}} = M_{\text{base}} + N_{\text{concurrent}} \times M_{\text{buffer}}$

Colly typically uses < 50 MB base memory, plus a few KB per active request buffer. (Source: Go runtime profiler)

DataFlirt discovery efficiency: $E = \mathrm{URLs\_discovered} / \mathrm{CPU\_core\_seconds}$

Colly yields ~40x higher efficiency for URL discovery compared to headless browsers. (Source: internal benchmark, 2026)

// 04 — colly execution trace

10,000 pages in 14 seconds.

A live trace of a Colly worker discovering and parsing product URLs from an e-commerce sitemap, utilizing 100 concurrent goroutines.

Go 1.21 · 100 goroutines · surface web

edge.dataflirt.io — live

CAPTURED

// initializing colly collector
colly.NewCollector: Async(true)
colly.Limit: DomainGlob("*target.com*"), Parallelism(100)
proxy.Dialer: "residential_pool_IN"

// execution started
[0.00s] GET https://target.com/sitemap.xml 200 OK
[0.12s] Discovered 10,000 product URLs
[0.15s] Queueing URLs to workers...

// mid-crawl metrics
[5.00s] requests_sent: 3,450 rps: 690
[5.00s] goroutines: 104 alloc_mem: 42 MB
[10.00s] requests_sent: 7,120 rps: 712
[10.00s] status_403: 12 // proxy rotated

// completion
[14.20s] total_requests: 10,000
[14.20s] items_extracted: 9,988
pipeline.status: SUCCESS

// 05 — bottleneck analysis

Where Colly actually blocks.

Because Colly is so CPU-efficient, the bottlenecks in a Go-based pipeline shift largely away from compute and onto the network and the target's infrastructure. The constraints below are ranked from most to least significant.

AVG LATENCY · 180 ms per request

BASE MEMORY · ~35 MB

UPDATED · 2026-05-19

01 Target rate limits / WAF (external constraint): The primary limit on Colly's speed is getting blocked

02 Proxy pool latency (network constraint): Residential proxies add 200-800 ms per request

03 DNS resolution (network constraint): High concurrency requires aggressive DNS caching

04 HTML parsing with goquery (compute constraint): DOM traversal on very large documents

05 Disk/DB I/O (sink constraint): Writing extracted records to the delivery sink

// 06 — our go stack

Discover fast, extract carefully.

At DataFlirt, we decouple URL discovery from data extraction. Colly powers our discovery fleet—sweeping sitemaps, category trees, and pagination links at thousands of requests per second to find new or updated URLs. These URLs are then pushed to a Kafka queue, where heavier, JS-capable workers handle the actual extraction if needed. Using Colly for discovery reduces our compute footprint by 80% compared to running headless browsers for the entire pipeline.

colly_worker_04.log

Live telemetry from a Colly discovery worker on a retail catalog.

worker.id: colly-disc-eu-04

target.domain: retail-giant.co.uk

goroutines.active: 150 (optimal)

rps.current: 420 req/s

memory.alloc: 68 MB (stable)

proxy.pool: datacenter_eu_rot

status.429_rate: 0.01% (below threshold)


// 07 — FAQ

Common questions.

Common questions about using Colly, handling JavaScript, and scaling Go-based scraping pipelines.


How does Colly compare to Python's Scrapy?

Colly is typically faster and uses a fraction of the memory, thanks to Go's compiled nature and goroutines. However, Scrapy has a much larger ecosystem, built-in middleware for retries, a wide range of third-party plugins for anti-bot handling, and better out-of-the-box support for complex pipeline orchestration. Choose Colly for raw speed; choose Scrapy for ecosystem maturity.

Can Colly scrape Single Page Applications (SPAs)?

Not natively. Colly is an HTTP client and HTML parser; it does not have a JavaScript engine. To scrape SPAs with Go, you either need to reverse-engineer the backend API and have Colly fetch the JSON directly, or pair it with a headless browser library like chromedp or playwright-go.

How do you handle proxy rotation in Colly?

Colly has a built-in proxy.RoundRobinProxySwitcher, but for production pipelines, it's often too simplistic. We implement custom http.Transport dialers that integrate with our proxy gateways, allowing us to handle sticky sessions, ASN targeting, and automatic retries on proxy failure without polluting the Colly callback logic.

Is Colly suitable for deep web scraping?

Only if the authentication is straightforward (e.g., passing a static bearer token or a session cookie). If the target requires executing JavaScript challenges, solving CAPTCHAs, or navigating dynamic login flows, Colly on its own will fail. It is best suited for the surface web.

How does DataFlirt scale Colly across multiple nodes?

Colly itself is a single-node framework. To scale it, we run Colly inside Docker containers orchestrated by Kubernetes. We disable Colly's in-memory queue and instead feed URLs to the workers via Apache Kafka. This allows us to scale horizontally to hundreds of nodes while maintaining a centralized deduplication state.

What is the most common mistake developers make with Colly?

Ignoring rate limits. Because Colly makes it so easy to spin up 1,000 concurrent requests with two lines of code, developers can accidentally mount what amounts to a denial-of-service attack on their targets, which often ends in IP bans. Always configure a colly.LimitRule (per-domain parallelism and delay) sized to the target's capacity. Handle robots.txt separately: LimitRule does not read it, and compliance is controlled by the collector's IgnoreRobotsTxt setting.


hello@dataflirt.com · Bengaluru · IST · typical reply < 4h