RIBA Scraper — Architecture Practice & Project Data Extraction

Data Dictionary

Every field we extract from riba.org

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Chartered Practices objects from riba.org. All fields typed and schema-versioned.

practice_id, name, url, address, city, postcode, region, phone, email, website, staff_count, specialisms, sectors, awards_won
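A minimal sketch of loading one Chartered Practices record into a typed structure, using the values from the sample record for this object type. It assumes specialisms may arrive as a string-encoded list, as in that sample. PracticeRecord and parse_practice are hypothetical names for illustration, not part of the delivered schema.

```python
import ast
from dataclasses import dataclass
from typing import List


@dataclass
class PracticeRecord:
    """Typed view of a subset of the Chartered Practices fields."""
    practice_id: str
    name: str
    city: str
    postcode: str
    region: str
    staff_count: str  # kept as a bracket label such as "100+"
    specialisms: List[str]


def parse_practice(raw: dict) -> PracticeRecord:
    """Build a typed record, decoding a string-encoded specialisms list."""
    specialisms = raw.get("specialisms", [])
    if isinstance(specialisms, str):
        # literal_eval parses Python literals only; it does not execute code.
        specialisms = ast.literal_eval(specialisms)
    return PracticeRecord(
        practice_id=raw["practice_id"],
        name=raw["name"],
        city=raw["city"],
        postcode=raw["postcode"],
        region=raw["region"],
        staff_count=raw["staff_count"],
        specialisms=list(specialisms),
    )


record = parse_practice({
    "practice_id": "PR-84921",
    "name": "Foster + Partners",
    "city": "London",
    "postcode": "SW11 4AN",
    "staff_count": "100+",
    "specialisms": "['Commercial', 'Masterplanning', 'Transport']",
    "region": "London",
})
print(record.specialisms)
```

Decoding the list once at load time means downstream queries can filter on individual specialisms without string matching.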

"practice_id": "PR-84921",
"name": "Foster + Partners",
"city": "London",
"postcode": "SW11 4AN",
"staff_count": "100+",
"specialisms": "['Commercial', 'Masterplanning', 'Transport']",
"region": "London"

Complete list of extractable fields for Building Case Studies objects from riba.org. All fields typed and schema-versioned.

project_id, title, architect, client, location, completion_date, contract_value, gross_internal_area, awards_won, sustainability_rating, description, image_urls

"project_id": "CS-9921",
"title": "Elizabeth Line",
"architect": "Grimshaw",
"completion_date": "2022-05-24",
"contract_value": 18900000000,
"gross_internal_area": 45000,
"awards_won": "['RIBA Stirling Prize 2024']"

Complete list of extractable fields for RIBA Awards objects from riba.org. All fields typed and schema-versioned.

award_year, award_name, project_name, architect_name, region, building_type, shortlist_status, winner_status, citation_text, judges_comments

"award_year": 2024,
"award_name": "RIBA National Award",
"project_name": "Chowdhury Walk",
"architect_name": "Al-Jawad Pike",
"region": "London",
"building_type": "Residential",
"winner_status": true

Complete list of extractable fields for CPD Providers objects from riba.org. All fields typed and schema-versioned.

provider_id, provider_name, course_title, format, duration, core_curriculum_topic, knowledge_level, contact_email, booking_url, cost

"provider_id": "CPD-442",
"provider_name": "Kingspan Insulation",
"course_title": "Fire Performance of Insulated Panel Systems",
"format": "Webinar",
"duration": "60 mins",
"core_curriculum_topic": "Health, safety and wellbeing",
"knowledge_level": "General Awareness"

Complete list of extractable fields for RIBA Jobs objects from riba.org. All fields typed and schema-versioned.

job_id, job_title, practice_name, location, salary_band, contract_type, remote_policy, posted_date, closing_date, requirements, application_url
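The salary_band field is a display string (the sample record for this object type shows "£32,000 - £36,000"). A minimal sketch of splitting such a band into numeric bounds for querying, assuming amounts are written in pounds with optional thousands separators. parse_salary_band is a hypothetical helper name.

```python
import re
from typing import Optional, Tuple


def parse_salary_band(band: str) -> Optional[Tuple[int, int]]:
    """Return (minimum, maximum) in whole pounds, or None if no amount is found."""
    amounts = [
        int(text.replace(",", ""))
        for text in re.findall(r"£\s*(\d[\d,]*)", band)
    ]
    if not amounts:
        return None
    # A single figure such as "£40,000" yields identical bounds.
    return min(amounts), max(amounts)


print(parse_salary_band("£32,000 - £36,000"))
```

Bands with no figures at all (for example "Competitive") return None rather than a guessed value.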

"job_id": "JB-8831",
"job_title": "Part 2 Architectural Assistant",
"practice_name": "Haworth Tompkins",
"location": "London",
"salary_band": "£32,000 - £36,000",
"contract_type": "Permanent",
"remote_policy": "Hybrid"

Capabilities

Extract the definitive graph of British architecture

Our RIBA scraper captures structured data across the entire institute portal: practice directories, award histories, technical case studies, and recruitment data — parsed, cleaned, and normalised.

Chartered Practice Directory

Extract full contact details, staff counts, specialisms, and regional affiliations for over 4,000 RIBA chartered practices.

Awards & Recognition Tracking

Map Stirling Prize, Royal Gold Medal, and Regional Award winners to specific practices and project case studies.

Project Case Studies

Capture gross internal area (GIA), contract values, sustainability credentials, and client metadata from published projects.

CPD Course Mining

Scrape the RIBA CPD provider network for course topics, delivery formats, and core curriculum alignment.

RIBA Jobs Scraping

Monitor architectural hiring trends, salary bands, and remote working policies across the UK sector.

Geographic Normalisation

Map practices and projects to RIBA regional chapters and standard UK postcode districts.
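As an illustration of the postcode side of this mapping, here is a minimal sketch that derives the postcode district (the outward code) from a full UK postcode. The pattern covers the common postcode shapes only and does not check that the area actually exists. postcode_district is a hypothetical helper name.

```python
import re
from typing import Optional

# Outward code (area + district), then inward code (digit + two letters).
_POSTCODE = re.compile(r"^([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})$")


def postcode_district(postcode: str) -> Optional[str]:
    """Return the outward code, e.g. 'SW11' for 'SW11 4AN', or None if malformed."""
    match = _POSTCODE.match(postcode.strip().upper())
    return match.group(1) if match else None


print(postcode_district("SW11 4AN"))
```

Grouping on the district rather than the full postcode keeps regional aggregates stable when a practice moves within the same area.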

Specialism Filtering

Isolate practices by specific sectors: conservation, passivhaus, masterplanning, or commercial fit-out.

Document Parsing

Extract text and metadata from public RIBA Plan of Work PDFs and technical guidance documents.

Scheduled Directory Syncs

Run monthly diffs to identify newly chartered practices, address changes, or revoked memberships.

// engagement pipeline

From RIBA directory to warehouse record

Brief in. Clean data out.

Define Scope

Day 0

Select target datasets: practice directories, awards, case studies, or jobs. We design the extraction schema.

Pipeline Build

Days 2–4

We configure Scrapy crawlers, handle pagination, and set up document parsing for case study metadata.

Validation & QA

Days 4–6

Schema validation, null-rate checks on contract values, and geographic standardisation before launch.
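A null-rate check of the kind described here can be sketched as follows. The threshold values and the function names null_rate and failing_fields are assumptions for illustration; real thresholds would be agreed per dataset.

```python
from typing import Dict, List, Mapping


def null_rate(records: List[Mapping], field: str) -> float:
    """Share of records where the field is missing, None, or an empty string."""
    if not records:
        return 0.0
    missing = sum(1 for record in records if record.get(field) in (None, ""))
    return missing / len(records)


def failing_fields(records: List[Mapping], thresholds: Dict[str, float]) -> Dict[str, float]:
    """Return {field: observed rate} for fields whose null rate exceeds its threshold."""
    observed = {field: null_rate(records, field) for field in thresholds}
    return {field: rate for field, rate in observed.items() if rate > thresholds[field]}


sample = [
    {"project_id": "A", "contract_value": 100},
    {"project_id": "B", "contract_value": None},
    {"project_id": "C", "contract_value": 250},
    {"project_id": "D"},
]
print(failing_fields(sample, {"contract_value": 0.2, "project_id": 0.0}))
```

A run would be held back from delivery whenever failing_fields returns a non-empty result.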

Delivery

ongoing

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.

Under the hood

How our RIBA pipeline handles the hard parts

Extracting data from professional institutes requires handling legacy DOM structures, inconsistent user-submitted data, and nested document metadata.

// fingerprinting

Identity rotation

TLS fingerprint: randomised

User-agent: rotated

IP pool: residential

Challenges blocked: 0

// pagination

Page coverage

48,291 pages queued, status: running

// observability

Pipeline health

Uptime: 99.9%

p99 latency: 142 ms

Null rate: 0.3%

Anti-bot layer

Residential proxy rotation

We utilise UK-based residential proxies to distribute requests across the directory, preventing IP bans and rate-limiting from standard WAF configurations.

DOM complexity

Handling legacy and modern components

The RIBA website mixes legacy directory structures with modern React-based components. Our Playwright instances execute JavaScript to render dynamic search results and hydration states reliably.

Data cleaning

Normalising user-submitted practice data

Practice profiles are often filled inconsistently by members. We apply post-extraction regex and NLP to standardise addresses, phone formats, and staff count brackets into queryable fields.
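The regex side of this cleaning step might look like the sketch below. The bracket boundaries are assumptions chosen for illustration (only "100+" appears in the sample data), and normalise_uk_phone and staff_bracket are hypothetical names. The phone numbers in the example come from the range Ofcom reserves for drama use.

```python
import re
from typing import Optional

# Assumed upper bounds and labels; anything above the last bound is "100+".
BRACKETS = [(10, "1-10"), (50, "11-50"), (99, "51-99")]


def normalise_uk_phone(raw: str) -> Optional[str]:
    """Convert common UK phone spellings to +44 form, or None if unrecognised."""
    digits = re.sub(r"\D", "", raw)
    if digits.startswith("0044"):
        digits = digits[4:]
    elif digits.startswith("44"):
        digits = digits[2:]
    elif not digits.startswith("0"):
        return None
    national = digits.lstrip("0")  # also drops the "(0)" in "+44 (0)20 ..."
    return f"+44{national}" if len(national) in (9, 10) else None


def staff_bracket(text: str) -> Optional[str]:
    """Map free text such as 'approx. 120 staff' to a bracket label."""
    match = re.search(r"\d+", text.replace(",", ""))
    if match is None or int(match.group()) < 1:
        return None
    count = int(match.group())
    for upper, label in BRACKETS:
        if count <= upper:
            return label
    return "100+"


print(normalise_uk_phone("+44 (0)20 7946 0123"), staff_bracket("approx. 120 staff"))
```

Values that cannot be parsed come back as None so they show up in null-rate checks instead of being silently guessed.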

Document parsing

Extracting PDF metadata

Many technical case studies and CPD materials are hosted as PDFs. We route these through a dedicated document parsing microservice to extract text blocks, tables, and metadata alongside the web scraping run.

Change detection

Tracking directory churn

We maintain a hash index of all chartered practices. Monthly runs only emit diffs, allowing you to easily identify new practices opening or existing practices changing status.
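A hash-based diff of this kind can be sketched as below. It assumes each record carries a practice_id key and that the previous run's index is a plain mapping of practice_id to hash. record_hash and diff_runs are hypothetical names, and SHA-256 over canonical JSON is one reasonable choice, not necessarily the one used in production.

```python
import hashlib
import json
from typing import Dict, List, Tuple


def record_hash(record: dict) -> str:
    """Hash a record's canonical JSON form so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_runs(
    previous_index: Dict[str, str], current_records: List[dict], key: str = "practice_id"
) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Compare the current run to the previous index; return (diff, new index)."""
    current_index = {record[key]: record_hash(record) for record in current_records}
    shared = current_index.keys() & previous_index.keys()
    diff = {
        "added": sorted(current_index.keys() - previous_index.keys()),
        "removed": sorted(previous_index.keys() - current_index.keys()),
        "changed": sorted(k for k in shared if current_index[k] != previous_index[k]),
    }
    return diff, current_index


last_month = [{"practice_id": "PR-1", "city": "Leeds"}, {"practice_id": "PR-2", "city": "Bath"}]
this_month = [{"practice_id": "PR-1", "city": "York"}, {"practice_id": "PR-3", "city": "Hull"}]
previous = {r["practice_id"]: record_hash(r) for r in last_month}
diff, new_index = diff_runs(previous, this_month)
print(diff)
```

Only the identifiers in the diff need to be emitted downstream; the new index is stored for the next run.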

Applications

Who uses RIBA data — and how

Teams across industries use riba.org data to build competitive products and smarter operations.

Lead Generation for Suppliers

Building material manufacturers use the practice directory to target architects based on their specific sector specialisms and regional location.

Market Research

Industry analysts aggregate contract values and gross internal areas from case studies to track construction market health.

Recruitment & Talent Acquisition

Agencies monitor RIBA Jobs to track hiring volume, salary band fluctuations, and demand for specific software skills like Revit or ArchiCAD.

Competitor Intelligence

Architectural practices track peer award wins, completed project metrics, and stated staff counts to benchmark their own market position.

Academic Research

Universities extract sustainability ratings and material choices from award-winning case studies to analyse trends in sustainable design.

CPD Marketing

Training providers analyse the CPD directory to identify gaps in the core curriculum and price their own courses competitively.

Technical Spec

RIBA scraper — technical capabilities

Everything supported by our riba.org scraper: rendered SPA elements, rate-limit handling, document parsing, and more. Content behind authentication walls is marked Partial.

JavaScript rendering

Full Playwright sessions to hydrate dynamic search components

Supported

Residential proxy rotation

UK-based IP pools to respect rate limits and avoid WAF blocks

Supported

PDF metadata extraction

Parsing text and tables from attached case study documents

Supported

Practice directory pagination

Deep traversal of all A-Z and regional directory listings

Supported

Change detection (diffs)

Hash-based diff to identify new or removed practices

Supported

Webhook delivery

HTTP POST per record for real-time job board monitoring

Supported

Image URL extraction

High-resolution asset links from project galleries

Supported

Historical project archives

Extraction of legacy Stirling Prize winners dating back to 1996

Supported

RIBA Members-only CPD videos

Gated video content requiring active RIBA membership credentials

Partial

Private member contact details

Personal emails of individual architects hidden behind login walls

Partial

Infrastructure

Infrastructure powering the RIBA pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus

Scrapy + Playwright Stack

Scrapy handles the broad directory traversal and deduplication, while Playwright executes JavaScript on modern React components to ensure complete data capture.

PDF & Document Parsing Pipeline

Custom Python microservices parse RIBA Plan of Work PDFs and technical guidance documents, extracting structured text arrays alongside the primary HTML scrape.

Cloud-Native Orchestration

Pipelines run on AWS Lambda and ECS. Airflow handles scheduling for monthly directory syncs, dependency management, and SLA alerting.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON

Newline-delimited or nested — schema versioned per run
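A minimal sketch of the newline-delimited variant, assuming the schema version is stamped onto every record. How the version is actually carried (per record, per file, or in a manifest) is not specified here, and the version string and write_ndjson name are hypothetical.

```python
import io
import json
from typing import Iterable, TextIO

SCHEMA_VERSION = "1.0.0"  # hypothetical version label


def write_ndjson(records: Iterable[dict], stream: TextIO, schema_version: str = SCHEMA_VERSION) -> int:
    """Write one JSON object per line and return the number of lines written."""
    count = 0
    for record in records:
        line = json.dumps({"schema_version": schema_version, **record}, ensure_ascii=False)
        stream.write(line + "\n")
        count += 1
    return count


buffer = io.StringIO()
written = write_ndjson([{"job_id": "JB-8831"}, {"job_id": "JB-8832"}], buffer)
print(buffer.getvalue())
```

One object per line lets consumers stream large files without loading the whole run into memory.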

CSV

Flat file with typed columns — Excel/Sheets compatible

XLS

Formatted spreadsheet for non-technical stakeholders

Parquet

Columnar format for BigQuery, Snowflake, Athena

AWS S3

Direct bucket delivery — compatible with any data lake

Webhook

HTTP POST per record for real-time downstream processing
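On the receiving side, each record arrives as a single JSON POST. A minimal sketch of how such a request could be built with the Python standard library; the endpoint URL is a placeholder, and build_webhook_request and deliver are hypothetical names rather than the production client.

```python
import json
import urllib.request


def build_webhook_request(record: dict, endpoint: str) -> urllib.request.Request:
    """Build a JSON POST request carrying exactly one record."""
    body = json.dumps(record, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def deliver(record: dict, endpoint: str, timeout: float = 10.0) -> int:
    """Send the record and return the HTTP status code of the response."""
    request = build_webhook_request(record, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Building the request does not touch the network; deliver() does.
request = build_webhook_request({"job_id": "JB-8831"}, "https://example.com/hooks/riba")
print(request.get_method(), request.full_url)
```

A production sender would normally add retries and a signature header so the receiver can verify the source; both are left out of this sketch.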

API

REST endpoints to query your extracted RIBA datasets

BigQuery

Streamed directly into your dataset with schema auto-detect

Snowflake

Stage + COPY INTO workflow — incremental or full-replace

Postgres

Upsert into your existing schema with conflict resolution
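The upsert pattern can be sketched with INSERT ... ON CONFLICT ... DO UPDATE. To keep the example runnable without a database server, it uses Python's built-in sqlite3 module as a stand-in; SQLite 3.24 or later accepts the same ON CONFLICT clause as PostgreSQL, though a PostgreSQL driver such as psycopg uses %(name)s placeholders instead of :name. The table, columns, and sample rows are hypothetical.

```python
import sqlite3
from typing import Iterable, Mapping

UPSERT_SQL = """
INSERT INTO practices (practice_id, name, city)
VALUES (:practice_id, :name, :city)
ON CONFLICT (practice_id) DO UPDATE SET
    name = excluded.name,
    city = excluded.city
"""


def upsert_practices(conn: sqlite3.Connection, rows: Iterable[Mapping]) -> None:
    """Insert new practices and overwrite existing ones in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT_SQL, list(rows))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE practices (practice_id TEXT PRIMARY KEY, name TEXT, city TEXT)")
upsert_practices(conn, [{"practice_id": "PR-00001", "name": "Example Practice", "city": "Leeds"}])
upsert_practices(conn, [{"practice_id": "PR-00001", "name": "Example Practice", "city": "York"}])
stored = conn.execute("SELECT practice_id, name, city FROM practices").fetchall()
print(stored)
```

Here the conflict rule is "latest run wins". Other rules, such as keeping the existing value when the new one is null, would change only the SET clause.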


// faq

Common questions.

About riba.org scraping, legality, and pipeline operations.

Is scraping RIBA legal?

Scraping publicly available information from the RIBA directory is generally permissible. DataFlirt targets only public, non-authenticated practice data, case studies, and jobs. We do not extract personal data of individual non-practising members or circumvent authentication walls.

How do you handle directory pagination?

Our crawlers traverse the entire directory tree using a combination of A-Z index scraping and regional filters, ensuring no chartered practice is missed during the extraction run.
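The combination of the two traversals can be sketched as a seed generator plus a de-duplicating merge. The sketch assumes every listing row carries a practice_id. The region names, sample rows, and the function names seed_keys and merge_listings are illustrative only.

```python
import string
from itertools import chain
from typing import Dict, Iterable, Iterator, List, Tuple


def seed_keys(regions: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield one listing seed per index letter, then one per region."""
    for letter in string.ascii_uppercase:
        yield ("letter", letter)
    for region in regions:
        yield ("region", region)


def merge_listings(*listings: Iterable[dict]) -> List[dict]:
    """Union rows from several traversals, keeping the first row per practice_id."""
    seen: Dict[str, dict] = {}
    for row in chain.from_iterable(listings):
        seen.setdefault(row["practice_id"], row)
    return list(seen.values())


by_letter = [{"practice_id": "PR-1"}, {"practice_id": "PR-2"}]
by_region = [{"practice_id": "PR-2"}, {"practice_id": "PR-3"}]
merged = merge_listings(by_letter, by_region)
print(len(list(seed_keys(["London", "Wales"]))), [row["practice_id"] for row in merged])
```

Running both traversals is deliberately redundant: a practice that one index omits, for example a name starting with a digit, can still be reached through its region.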

Can you extract project values and GIA from case studies?

Yes. Where published in the case study metadata or text body, we extract contract values, gross internal area (GIA), and completion dates, normalising them into standard numeric formats.
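A minimal sketch of this normalisation, assuming values appear in forms such as "£18.9bn", "£2.5m", or "45,000 m²". The accepted suffixes and the function names parse_contract_value and parse_gia_sqm are assumptions for illustration.

```python
import re
from decimal import Decimal
from typing import Optional

_MULTIPLIERS = {
    "k": 10**3, "thousand": 10**3,
    "m": 10**6, "million": 10**6,
    "bn": 10**9, "billion": 10**9,
}
_MONEY = re.compile(
    r"£\s*(\d[\d,]*(?:\.\d+)?)\s*(bn|billion|m|million|k|thousand)?\b",
    re.IGNORECASE,
)
_AREA = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(?:m²|m2|sqm|sq\.?\s*m)", re.IGNORECASE)


def parse_contract_value(text: str) -> Optional[int]:
    """Return the first pound amount in the text as whole pounds, or None."""
    match = _MONEY.search(text)
    if match is None:
        return None
    # Decimal avoids float rounding on values such as 18.9 billion.
    amount = Decimal(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").lower()
    return int(amount * _MULTIPLIERS.get(suffix, 1))


def parse_gia_sqm(text: str) -> Optional[int]:
    """Return the first area in square metres, truncated to a whole number, or None."""
    match = _AREA.search(text)
    if match is None:
        return None
    return int(Decimal(match.group(1).replace(",", "")))


print(parse_contract_value("£18.9bn"), parse_gia_sqm("GIA 45,000 m²"))
```

Text with no recognisable figure returns None, which keeps unpublished values distinguishable from zero.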

Do you parse RIBA Plan of Work PDFs?

Yes. We operate a secondary document parsing pipeline that can extract text blocks and tables from publicly linked PDFs on the RIBA domain.

How fresh is the jobs data?

For RIBA Jobs, we can configure daily or sub-daily pipelines to ensure you capture new postings immediately and track closing dates accurately.

What is the minimum viable engagement?

Our smallest packages start with a one-off extraction of the chartered practice directory. For continuous monitoring of jobs or case studies, we price based on delivery frequency and schema complexity.

RIBA architecture data,
at warehouse scale.

Tell us what
to extract.
We do the rest.

Data Extraction for Every Industry
