
Greenhouse ATS data,
at warehouse scale.

We extract job listings, department structures, office locations, and custom application fields from Greenhouse boards. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.

Jobs extracted: 842K/day
Board updates: 14.2K/24h
Departments mapped: 94K/run
Active pipelines: 84
Uptime: 99.98%

Data Dictionary

Every field we extract from greenhouse.io

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Job Postings objects from greenhouse.io. All fields typed and schema-versioned.

req_id, internal_job_id, title, department, location, remote_status, employment_type, posted_date, url, description_html, description_text, salary_min, salary_max, currency
job_postings
● 200 OK
{
  "req_id": "req_84921",
  "title": "Senior Backend Engineer",
  "department": "Engineering",
  "location": "London, UK",
  "remote_status": "Hybrid",
  "employment_type": "Full-time",
  "salary_min": 85000,
  "salary_max": 110000,
  "currency": "GBP"
}

Complete list of extractable fields for Department Data objects from greenhouse.io. All fields typed and schema-versioned.

department_id, department_name, parent_department_id, parent_department_name, board_token, active_job_count, manager_name, cost_center, internal_code
department_data
● 200 OK
{
  "department_id": "dept_402",
  "department_name": "Data Infrastructure",
  "parent_department_id": "dept_101",
  "parent_department_name": "Engineering",
  "board_token": "dataflirt",
  "active_job_count": 14,
  "cost_center": "CC-ENG-04"
}
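
The parent_department_id field is what lets flat department rows be folded back into a hierarchy. The sketch below is one way to do that, not a description of the production pipeline. The function name build_department_tree is hypothetical, and it assumes each row carries the department_id, department_name, and parent_department_id fields listed above, with no cycles in the parent links.

```python
from collections import defaultdict


def build_department_tree(rows):
    """Fold flat department rows into a nested list of root departments.

    Assumes rows use the department_id / parent_department_id fields from
    the data dictionary and that parent links contain no cycles.
    """
    by_id = {row["department_id"]: row for row in rows}
    children = defaultdict(list)
    for row in rows:
        parent = row.get("parent_department_id")
        # A parent that is absent from this batch makes the row a root.
        key = parent if parent in by_id else None
        children[key].append(row["department_id"])

    def node(dept_id):
        row = by_id[dept_id]
        return {
            "department_id": dept_id,
            "department_name": row["department_name"],
            "children": [node(child) for child in sorted(children[dept_id])],
        }

    return [node(root) for root in sorted(children[None])]
```

Treating orphaned rows as roots keeps a partial scrape usable rather than failing when a parent department has no open roles on the board.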

Complete list of extractable fields for Location Data objects from greenhouse.io. All fields typed and schema-versioned.

office_id, office_name, city, state, country, region, is_remote, timezone, address
location_data
● 200 OK
{
  "office_id": "loc_883",
  "office_name": "Bengaluru HQ",
  "city": "Bengaluru",
  "state": "Karnataka",
  "country": "India",
  "region": "APAC",
  "is_remote": false,
  "timezone": "Asia/Kolkata"
}

Complete list of extractable fields for Application Fields objects from greenhouse.io. All fields typed and schema-versioned.

field_id, field_name, field_type, is_required, options_list, job_id, board_token, validation_rules
application_fields
● 200 OK
{
  "field_id": "custom_49201",
  "field_name": "LinkedIn Profile URL",
  "field_type": "url",
  "is_required": true,
  "options_list": "[]",
  "job_id": "req_84921",
  "board_token": "dataflirt"
}

Complete list of extractable fields for Board Metadata objects from greenhouse.io. All fields typed and schema-versioned.

board_token, company_name, logo_url, total_active_jobs, departments_count, locations_count, last_updated, scrape_timestamp
board_metadata
● 200 OK
{
  "board_token": "dataflirt",
  "company_name": "DataFlirt",
  "logo_url": "https://boards.greenhouse.io/dataflirt/logo.png",
  "total_active_jobs": 42,
  "departments_count": 8,
  "locations_count": 3,
  "scrape_timestamp": "2026-08-14T10:22:15Z"
}

Capabilities

Everything you need from Greenhouse ATS

Our Greenhouse scraper targets the underlying API structures and custom domain implementations to extract clean, normalised job and department data across thousands of companies.

Full Job Listing Extraction

Extract title, rich text description, requirements, employment type, and custom metadata fields per requisition.

Board Discovery

Identify hidden Greenhouse boards via footprinting and standardise custom domain implementations back to a unified schema.

Department Mapping

Reconstruct nested department hierarchies to understand organisational structure and hiring focus areas.

Salary Band Extraction

Parse pay transparency data, extracting minimum and maximum salary ranges along with currency codes.

Location & Remote Flags

Standardise location strings and extract explicit remote, hybrid, or on-site designations.

Custom Field Parsing

Extract custom application questions and requirements configured by the employer on a per-job basis.

Historical Tracking

Track requisition lifecycle events. Identify exact dates when roles are opened, updated, or closed.

Multi-Region Support

Handle international Greenhouse instances, local language postings, and region-specific compliance fields.

Scheduled Diffs

Run continuous pipelines that only output new, modified, or deleted roles to reduce downstream processing load.

// engagement pipeline

From board list to warehouse record

Brief in. Clean data out.

Define Scope (day 0)

Provide company names, domains, or Greenhouse board tokens. We design the extraction schema together.

Pipeline Build (days 2–4)

We configure Scrapy / Playwright crawlers, proxy rotation, and session management for greenhouse.io endpoints.

Validation & QA (days 4–6)

Schema validation, null-rate checks, and department mapping verification before full launch.

Delivery (ongoing)

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.

Under the hood

How our Greenhouse pipeline handles the hard parts

Greenhouse implementations vary wildly across companies. Here is how we normalise the data and maintain pipeline stability.

pipeline-monitor · greenhouse.io · live ● active
// fingerprinting
Identity rotation
TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0
// pagination
Page coverage
48,291 pages queued (running)
// observability
Pipeline health
Uptime: 99.9%
p99 latency: 142ms
Null rate: 0.3%
Alerts: 2

API vs DOM extraction
Targeting the Greenhouse embedded API

Whenever possible, we target the Greenhouse embedded API (boards-api.greenhouse.io) for structured JSON. For companies using custom hosted boards or iframe implementations, we fall back to DOM parsing, normalising the output to match the API schema.
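
As a rough illustration of normalising to the API schema, the sketch below maps one job object, shaped like the public Greenhouse Job Board API response, onto the job_postings fields from the data dictionary. The input keys (id, internal_job_id, absolute_url, location.name, departments, content) are assumptions modelled on that public API, and the function name normalise_job and the id-to-req_id mapping are hypothetical. A DOM fallback would need to produce a dict with the same keys.

```python
import html
import re


def normalise_job(job):
    """Map one API-style job object onto the job_postings schema.

    The input keys are assumptions based on the public Job Board API;
    missing keys become None instead of raising.
    """
    departments = job.get("departments") or []
    location = job.get("location") or {}
    # The content field is assumed to arrive HTML-escaped.
    description_html = html.unescape(job.get("content") or "")
    # Crude tag stripping for the text column; a real HTML parser is safer.
    description_text = " ".join(re.sub(r"<[^>]+>", " ", description_html).split())
    return {
        "req_id": str(job["id"]),
        "internal_job_id": job.get("internal_job_id"),
        "title": job.get("title"),
        "department": departments[0]["name"] if departments else None,
        "location": location.get("name"),
        "url": job.get("absolute_url"),
        "description_html": description_html,
        "description_text": description_text,
    }
```

Keeping the mapping in one pure function means the API path and the DOM path can share the same downstream validation.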

Rate limiting
Distributed request timing

Greenhouse aggressively rate-limits requests from a single IP address. We distribute requests across a large pool of datacenter and residential proxies and apply exponential backoff with jitter to stay under rate-limit thresholds.
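
Exponential backoff with jitter can be sketched in a few lines. This is a generic "full jitter" variant, not necessarily the exact policy the pipeline runs. The names backoff_delays and fetch_with_backoff are hypothetical, fetch stands in for any callable that returns a (status, body) pair, and treating HTTP 429 as the only retry signal is an assumption made for brevity.

```python
import random
import time


def backoff_delays(max_retries, base=1.0, cap=60.0, rng=random.random):
    """Yield full-jitter delays: rng() * min(cap, base * 2**attempt)."""
    for attempt in range(max_retries):
        yield rng() * min(cap, base * (2 ** attempt))


def fetch_with_backoff(fetch, max_retries=5, sleep=time.sleep, **backoff_kwargs):
    """Call fetch() again after each 429 until it succeeds or retries run out.

    fetch is a hypothetical callable returning (status_code, body).
    """
    status, body = fetch()
    for delay in backoff_delays(max_retries, **backoff_kwargs):
        if status != 429:
            break
        sleep(delay)
        status, body = fetch()
    return status, body
```

Randomising each delay spreads retries from many workers over time, so they do not all hit the same rate-limit window at once.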

Schema stability
Handling custom board structures

Employers frequently customise their Greenhouse boards with unique JavaScript frameworks and CSS layouts. Our selector strategy uses structural heuristics to identify job blocks, departments, and locations regardless of the visual presentation.

Change detection
Requisition lifecycle tracking

We maintain a hash index of active job IDs per board. Subsequent runs compare the current state against the index, emitting structured diffs for newly opened roles, updated descriptions, and closed requisitions.
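
A minimal sketch of this kind of hash-index diff is shown below. The choice of SHA-256, the set of fingerprinted fields, and the function names fingerprint and diff_board are all assumptions for illustration. The stored index is taken to be a plain dict of req_id to hash.

```python
import hashlib
import json

# Fields whose change counts as an update; this selection is an assumption.
TRACKED_FIELDS = ("title", "department", "location", "description_text")


def fingerprint(job):
    """Return a stable SHA-256 hex digest over the tracked fields of a job."""
    tracked = {key: job.get(key) for key in TRACKED_FIELDS}
    payload = json.dumps(tracked, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_board(previous_index, current_jobs):
    """Compare a stored {req_id: hash} index against the current scrape.

    Returns (changes, new_index); new_index replaces the stored index.
    """
    current_index = {job["req_id"]: fingerprint(job) for job in current_jobs}
    shared = current_index.keys() & previous_index.keys()
    changes = {
        "opened": sorted(current_index.keys() - previous_index.keys()),
        "updated": sorted(r for r in shared if current_index[r] != previous_index[r]),
        "closed": sorted(previous_index.keys() - current_index.keys()),
    }
    return changes, current_index
```

Hashing only selected fields avoids reporting an update when a volatile value such as a scrape timestamp changes.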

Monitoring & alerting
Anomaly detection on board availability

Companies frequently migrate ATS platforms or change board tokens. We alert on 404 errors, sudden drops in job counts, and domain redirects, allowing us to update the target configuration before data gaps occur.
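
The three signals named above can be expressed as a small check run after each board fetch. This is a sketch under assumptions: the function name board_alerts, the alert labels, and the 50% drop threshold are hypothetical, and a domain redirect is approximated by comparing the hostname of the requested URL with that of the final URL.

```python
from urllib.parse import urlparse


def board_alerts(requested_url, final_url, status_code, job_count,
                 previous_count, drop_threshold=0.5):
    """Return alert labels for one board fetch.

    drop_threshold is the fraction of jobs that must disappear between
    runs before a drop is flagged; 0.5 is an illustrative default.
    """
    alerts = []
    if status_code == 404:
        alerts.append("board_not_found")
    if urlparse(requested_url).hostname != urlparse(final_url).hostname:
        alerts.append("domain_redirect")
    if previous_count > 0 and job_count < previous_count * (1 - drop_threshold):
        alerts.append("job_count_drop")
    return alerts
```

For example, a board that previously listed 42 jobs and now redirects to another host with 5 jobs would yield both domain_redirect and job_count_drop.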

Applications

Who uses Greenhouse data

Teams across industries use greenhouse.io data to build competitive products and smarter operations.

01
Competitor Intelligence

Track hiring velocity, department expansion, and strategic focus areas of competitor companies based on open requisitions.

02
Lead Generation

Identify companies using specific technology stacks or expanding certain departments based on job requirements and titles.

03
Labour Market Analytics

Aggregate salary bands, remote work trends, and skill demand across thousands of high-growth companies.

04
Job Aggregation

Feed job boards, recruitment marketplaces, and talent networks with fresh, structured job postings.

05
Investment Research

Private equity and venture capital firms monitor startup growth signals via headcount expansion and executive hiring.

06
HR Tech Integrations

Sync ATS data for external tooling, compensation benchmarking platforms, and diversity analytics software.

Why DataFlirt

"Greenhouse powers the hiring for the world's fastest-growing companies. Tracking their open requisitions is the clearest signal of corporate strategy and financial health."

Most teams underestimate the investment required to track thousands of disparate Greenhouse boards: reliable scraping requires discovering hidden board tokens, normalising custom domain implementations, handling rate limits, and standardising nested department structures. DataFlirt absorbs that complexity so your engineers can focus on the analysis, not the infrastructure.

Technical Spec

Greenhouse scraper technical capabilities

Everything supported by our greenhouse.io scraper — rendered SPA elements, auth walls, rate-limit evasion and beyond.

Capability | Description | Status
Embedded API extraction | Direct extraction from boards-api.greenhouse.io endpoints | Supported
Custom domain board parsing | Extraction from custom hosted careers pages and iframes | Supported
Department hierarchy mapping | Nested parent-child department relationships | Supported
Salary band normalisation | Extraction of min/max numerical values and currency codes | Supported
Remote status classification | Standardised remote, hybrid, or on-site designations | Supported
Historical job tracking | Time-series tracking of job open and close dates | Supported
Daily change detection (diffs) | Emit records only for new, modified, or closed roles | Supported
Internal candidate notes | Interview feedback and recruiter notes gated behind ATS login | Partial
Applicant PII and resumes | Candidate profiles and submitted applications gated behind ATS login | Partial

Infrastructure

Infrastructure powering the Greenhouse pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus

API & DOM Extraction Engine

A hybrid approach targeting the Greenhouse embedded API for structured JSON where possible, falling back to Playwright DOM parsing for custom hosted boards.

Proxy Infrastructure

Datacenter and residential IP rotation to bypass Cloudflare protections and distribute requests across rate limit windows.

Cloud-Native Orchestration

Pipelines run on AWS Lambda and ECS. Airflow handles scheduling, dependency management, and SLA alerting for thousands of concurrent board scrapes.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON: Newline-delimited or nested arrays
CSV: Flat file with typed columns
XLS: Excel-compatible format for manual review
Parquet: Columnar format for data warehouses
AWS S3: Direct bucket delivery, compatible with any data lake
Webhook: HTTP POST per record or batch
API: Queryable REST endpoints for extracted data
BigQuery: Streamed directly into your dataset
Snowflake: Stage and COPY INTO workflow
Postgres: Upsert into your existing schema

// faq

Common questions.

About greenhouse.io scraping, legality, and pipeline operations.

Is scraping Greenhouse public job boards legal?

Scraping publicly available job postings from Greenhouse is generally permissible. DataFlirt targets only public, non-authenticated job and department data. We do not extract personal candidate data, circumvent authentication walls, or access internal ATS systems.

How do you find Greenhouse boards for target companies?

We use domain footprinting, DNS records, and search engine dorking to identify Greenhouse board tokens and custom careers page implementations for your target company list.

Do you support custom domain boards?

Yes. Many companies host Greenhouse boards on custom domains or via iframes. Our pipeline identifies the underlying API calls or parses the custom DOM structure to extract the data into our normalised schema.

How fresh is the data?

Pipelines can be configured for daily, hourly, or near real-time cadences depending on the number of boards tracked and your freshness requirements.

Can you extract salary transparency data?

Yes. We extract minimum and maximum salary bands, currency codes, and equity indicators where provided in the job description or structured metadata fields.

What is the minimum viable engagement?

Our minimum engagement typically starts at tracking 500 target boards with daily delivery. For larger aggregations or custom schema requirements, we price based on volume and frequency.

$ dataflirt scope --new-project --source=greenhouse.io ready

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off dump of target companies or a continuous feed of job market data across 10,000 boards, we scope, build, and operate the pipeline. Tell us what you need.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h