
RIBA architecture data, at warehouse scale.

We extract chartered practice directories, Stirling Prize case studies, CPD course listings, and architect profiles from RIBA. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.

Practices extracted: 4,128 /run
Case studies: 12,492 /total
CPD courses: 3,841 /month
Active pipelines: 32
Uptime: 99.94%

Data Dictionary

Every field we extract from riba.org

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Chartered Practices objects from riba.org. All fields typed and schema-versioned.

practice_id, name, url, address, city, postcode, region, phone, email, website, staff_count, specialisms, sectors, awards_won
chartered_practices
● 200 OK
{
  "practice_id": "PR-84921",
  "name": "Foster + Partners",
  "city": "London",
  "postcode": "SW11 4AN",
  "staff_count": "100+",
  "specialisms": ["Commercial", "Masterplanning", "Transport"],
  "region": "London"
}

Complete list of extractable fields for Building Case Studies objects from riba.org. All fields typed and schema-versioned.

project_id, title, architect, client, location, completion_date, contract_value, gross_internal_area, awards_won, sustainability_rating, description, image_urls
building_case_studies
● 200 OK
{
  "project_id": "CS-9921",
  "title": "Elizabeth Line",
  "architect": "Grimshaw",
  "completion_date": "2022-05-24",
  "contract_value": 18900000000,
  "gross_internal_area": 45000,
  "awards_won": ["RIBA Stirling Prize 2024"]
}

Complete list of extractable fields for RIBA Awards objects from riba.org. All fields typed and schema-versioned.

award_year, award_name, project_name, architect_name, region, building_type, shortlist_status, winner_status, citation_text, judges_comments
riba_awards
● 200 OK
{
  "award_year": 2024,
  "award_name": "RIBA National Award",
  "project_name": "Chowdhury Walk",
  "architect_name": "Al-Jawad Pike",
  "region": "London",
  "building_type": "Residential",
  "winner_status": true
}

Complete list of extractable fields for CPD Providers objects from riba.org. All fields typed and schema-versioned.

provider_id, provider_name, course_title, format, duration, core_curriculum_topic, knowledge_level, contact_email, booking_url, cost
cpd_providers
● 200 OK
{
  "provider_id": "CPD-442",
  "provider_name": "Kingspan Insulation",
  "course_title": "Fire Performance of Insulated Panel Systems",
  "format": "Webinar",
  "duration": "60 mins",
  "core_curriculum_topic": "Health, safety and wellbeing",
  "knowledge_level": "General Awareness"
}

Complete list of extractable fields for RIBA Jobs objects from riba.org. All fields typed and schema-versioned.

job_id, job_title, practice_name, location, salary_band, contract_type, remote_policy, posted_date, closing_date, requirements, application_url
riba_jobs
● 200 OK
{
  "job_id": "JB-8831",
  "job_title": "Part 2 Architectural Assistant",
  "practice_name": "Haworth Tompkins",
  "location": "London",
  "salary_band": "£32,000 - £36,000",
  "contract_type": "Permanent",
  "remote_policy": "Hybrid"
}

Capabilities

Extract the definitive graph of British architecture

Our RIBA scraper captures structured data across the entire institute portal: practice directories, award histories, technical case studies, and recruitment data — parsed, cleaned, and normalised.

Chartered Practice Directory

Extract full contact details, staff counts, specialisms, and regional affiliations for over 4,000 RIBA chartered practices.

Awards & Recognition Tracking

Map Stirling Prize, Royal Gold Medal, and Regional Award winners to specific practices and project case studies.

Project Case Studies

Capture gross internal area (GIA), contract values, sustainability credentials, and client metadata from published projects.

CPD Course Mining

Scrape the RIBA CPD provider network for course topics, delivery formats, and core curriculum alignment.

RIBA Jobs Scraping

Monitor architectural hiring trends, salary bands, and remote working policies across the UK sector.

Geographic Normalisation

Map practices and projects to RIBA regional chapters and standard UK postcode districts.

Specialism Filtering

Isolate practices by specific sectors: conservation, passivhaus, masterplanning, or commercial fit-out.

Document Parsing

Extract text and metadata from public RIBA Plan of Work PDFs and technical guidance documents.

Scheduled Directory Syncs

Run monthly diffs to identify newly chartered practices, address changes, or revoked memberships.
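
As an illustration of the Geographic Normalisation capability listed above, here is a minimal sketch of how a postcode district (the outward code) might be derived from a raw postcode string. The function name and the regular expression are assumptions made for this example, not DataFlirt's production code, and the pattern covers common UK postcode shapes rather than every special case.

```python
import re
from typing import Optional

# Outward code (area letters + district), followed by the 3-character inward code.
# Illustrative pattern only: it does not handle special cases such as GIR 0AA.
_POSTCODE_RE = re.compile(r"^([A-Z]{1,2}[0-9][A-Z0-9]?)([0-9][A-Z]{2})$")


def postcode_district(raw: str) -> Optional[str]:
    """Return the postcode district (outward code), or None if unparseable."""
    cleaned = re.sub(r"\s+", "", raw.upper())  # drop spaces, normalise case
    match = _POSTCODE_RE.match(cleaned)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(postcode_district("SW11 4AN"))
    print(postcode_district("ec1a1bb"))
```

Mapping a district to a RIBA regional chapter would need a lookup table, which is not shown here.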

// engagement pipeline

From RIBA directory to warehouse record

Brief in. Clean data out.

Define Scope · day 0

Select target datasets: practice directories, awards, case studies, or jobs. We design the extraction schema.

Pipeline Build · days 2–4

We configure Scrapy crawlers, handle pagination, and set up document parsing for case study metadata.

Validation & QA · days 4–6

Schema validation, null-rate checks on contract values, and geographic standardisation before launch.

Delivery · ongoing

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
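
To make the null-rate check from the Validation & QA step concrete, here is a minimal sketch. The function names and the 25% threshold are hypothetical, chosen only for illustration; the real thresholds would be agreed per schema.

```python
from typing import Dict, List


def null_rate(records: List[dict], field: str) -> float:
    """Fraction of records where `field` is missing, None, or an empty string."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field) in (None, ""))
    return missing / len(records)


def failing_fields(records: List[dict], thresholds: Dict[str, float]) -> Dict[str, float]:
    """Return {field: observed_rate} for fields whose null rate exceeds its threshold."""
    failures = {}
    for field, limit in thresholds.items():
        rate = null_rate(records, field)
        if rate > limit:
            failures[field] = rate
    return failures


if __name__ == "__main__":
    batch = [
        {"project_id": "CS-1", "contract_value": 1200000},
        {"project_id": "CS-2", "contract_value": None},
        {"project_id": "CS-3"},
        {"project_id": "CS-4", "contract_value": 500000},
    ]
    # Hypothetical rule: at most 25% of contract values may be null.
    print(failing_fields(batch, {"contract_value": 0.25}))
```

A run would typically be held back from delivery when the returned dictionary is non-empty.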

Under the hood

How our RIBA pipeline handles the hard parts

Extracting data from professional institutes requires handling legacy DOM structures, inconsistent user-submitted data, and nested document metadata.

pipeline-monitor · riba.org · live ● active
// fingerprinting
Identity rotation
TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0
// pagination
Page coverage
48,291 pages queued · running
// observability
Pipeline health
Uptime: 99.9%
p99 latency: 142ms
Null rate: 0.3%
Alerts: 2
Anti-bot layer
Residential proxy rotation

We use UK-based residential proxies to distribute requests across the directory, which reduces the risk of IP bans and rate-limiting by standard WAF configurations.

DOM complexity
Handling legacy and modern components

The RIBA website mixes legacy directory structures with modern React-based components. Our Playwright instances execute JavaScript so that dynamic search results are fully rendered and hydrated before extraction.

Data cleaning
Normalising user-submitted practice data

Practice profiles are often filled in inconsistently by members. We apply post-extraction regex and NLP to standardise addresses, phone formats, and staff count brackets into queryable fields.
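
A minimal sketch of the regex side of this cleaning step, assuming free-text phone and staff-count inputs. The function names and the bracket boundaries are hypothetical (only the "100+" label appears in the sample record above), and the NLP stage is not shown.

```python
import re
from typing import Optional

# Hypothetical bracket boundaries: (inclusive upper bound, label).
_BRACKETS = [(10, "1-10"), (50, "11-50"), (99, "51-99")]


def normalise_uk_phone(raw: str) -> Optional[str]:
    """Convert a UK phone number in common formats to +44 form, or return None."""
    digits = re.sub(r"\D", "", raw)
    if digits.startswith("0044"):
        digits = digits[4:]
    elif digits.startswith("44"):
        digits = digits[2:]
    elif digits.startswith("0"):
        digits = digits[1:]
    else:
        return None
    # "+44 (0)20 ..." leaves a stray trunk zero after the country code.
    digits = digits.lstrip("0")
    if not 9 <= len(digits) <= 10:
        return None
    return "+44" + digits


def staff_bracket(raw: str) -> Optional[str]:
    """Map free text such as 'approx. 45 staff' to a staff-count bracket."""
    numbers = [int(n) for n in re.findall(r"\d+", raw.replace(",", ""))]
    if not numbers:
        return None
    count = max(numbers)  # for ranges like "10-15", use the upper figure
    for upper, label in _BRACKETS:
        if count <= upper:
            return label
    return "100+"


if __name__ == "__main__":
    print(normalise_uk_phone("+44 (0)20 7946 0123"))
    print(staff_bracket("approx. 45 staff"))
```

Values that cannot be parsed come back as None rather than a guess, so they surface in null-rate checks instead of silently polluting the column.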

Document parsing
Extracting PDF metadata

Many technical case studies and CPD materials are hosted as PDFs. We route these through a dedicated document parsing microservice to extract text blocks, tables, and metadata alongside the web scraping run.

Change detection
Tracking directory churn

We maintain a hash index of all chartered practices. Monthly runs emit only diffs, so you can easily identify newly opened practices or existing practices whose status has changed.
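
The idea of a hash index with diff-only output can be sketched as follows. This is an illustrative example, not the production implementation: the function names are hypothetical, and SHA-256 over canonical JSON is one reasonable choice of fingerprint among several.

```python
import hashlib
import json
from typing import Dict, List


def record_hash(record: dict) -> str:
    """Stable fingerprint of a record: SHA-256 over canonical (sorted-key) JSON."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_index(records: List[dict]) -> Dict[str, str]:
    """Map practice_id to the record fingerprint for one run."""
    return {r["practice_id"]: record_hash(r) for r in records}


def diff_indexes(previous: Dict[str, str], current: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two runs and list added, removed, and changed practice IDs."""
    prev_ids, curr_ids = set(previous), set(current)
    return {
        "added": sorted(curr_ids - prev_ids),
        "removed": sorted(prev_ids - curr_ids),
        "changed": sorted(
            pid for pid in prev_ids & curr_ids if previous[pid] != current[pid]
        ),
    }


if __name__ == "__main__":
    last_month = build_index([
        {"practice_id": "PR-1", "city": "London"},
        {"practice_id": "PR-2", "city": "Leeds"},
    ])
    this_month = build_index([
        {"practice_id": "PR-1", "city": "Bristol"},
        {"practice_id": "PR-3", "city": "York"},
    ])
    print(diff_indexes(last_month, this_month))
```

Sorting keys before hashing means a change in field order between runs does not register as a change in the record.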

Applications

Who uses RIBA data — and how

Teams across industries use riba.org data to build competitive products and smarter operations.

01
Lead Generation for Suppliers

Building material manufacturers use the practice directory to target architects based on their specific sector specialisms and regional location.

02
Market Research

Industry analysts aggregate contract values and gross internal areas from case studies to track construction market health.

03
Recruitment & Talent Acquisition

Agencies monitor RIBA Jobs to track hiring volume, salary band fluctuations, and demand for specific software skills like Revit or ArchiCAD.

04
Competitor Intelligence

Architectural practices track peer award wins, completed project metrics, and stated staff counts to benchmark their own market position.

05
Academic Research

Universities extract sustainability ratings and material choices from award-winning case studies to analyse trends in sustainable design.

06
CPD Marketing

Training providers analyse the CPD directory to identify gaps in the core curriculum and price their own courses competitively.

Why DataFlirt

"The RIBA directory is the definitive graph of British architectural practice, but extracting structured project data requires traversing thousands of nested case studies."

Most teams underestimate the investment required: reliable RIBA scraping requires handling paginated directories, extracting nested PDF metadata, and standardising inconsistent practice formats. DataFlirt absorbs that complexity so your engineers can focus on the analysis — not the infrastructure.

Technical Spec

RIBA scraper — technical capabilities

Everything supported by our riba.org scraper — rendered SPA elements, auth walls, rate-limit evasion and beyond.

JavaScript rendering · Full Playwright sessions to hydrate dynamic search components · Supported
Residential proxy rotation · UK-based IP pools to respect rate limits and avoid WAF blocks · Supported
PDF metadata extraction · Parsing text and tables from attached case study documents · Supported
Practice directory pagination · Deep traversal of all A-Z and regional directory listings · Supported
Change detection (diffs) · Hash-based diff to identify new or removed practices · Supported
Webhook delivery · HTTP POST per record for real-time job board monitoring · Supported
Image URL extraction · High-resolution asset links from project galleries · Supported
Historical project archives · Extraction of legacy Stirling Prize winners dating back to 1996 · Supported
RIBA Members-only CPD videos · Gated video content requiring active RIBA membership credentials · Partial
Private member contact details · Personal emails of individual architects hidden behind login walls · Partial
Infrastructure

Infrastructure powering the RIBA pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy · Playwright · Python 3.12 · Redis · PostgreSQL · Apache Airflow · AWS Lambda · S3 · CloudWatch · 2Captcha · CapSolver · Residential Proxies · Docker · Kubernetes · Grafana · Prometheus
Scrapy + Playwright Stack

Scrapy handles the broad directory traversal and deduplication, while Playwright executes JavaScript on modern React components to ensure complete data capture.

PDF & Document Parsing Pipeline

Custom Python microservices parse RIBA Plan of Work PDFs and technical guidance documents, extracting structured text arrays alongside the primary HTML scrape.

Cloud-Native Orchestration

Pipelines run on AWS Lambda and ECS. Airflow handles scheduling for monthly directory syncs, dependency management, and SLA alerting.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON: Newline-delimited or nested, schema versioned per run
CSV: Flat file with typed columns, Excel/Sheets compatible
XLS: Formatted spreadsheet for non-technical stakeholders
Parquet: Columnar format for BigQuery, Snowflake, Athena
AWS S3: Direct bucket delivery, compatible with any data lake
Webhook: HTTP POST per record for real-time downstream processing
API: REST endpoints to query your extracted RIBA datasets
BigQuery: Streamed directly into your dataset with schema auto-detect
Snowflake: Stage + COPY INTO workflow, incremental or full-replace
Postgres: Upsert into your existing schema with conflict resolution
// faq

Common questions.

About riba.org scraping, legality, and pipeline operations.

Is scraping RIBA legal?

Scraping publicly available information from the RIBA directory is generally permissible. DataFlirt targets only public, non-authenticated practice data, case studies, and jobs. We do not extract personal data of individual non-practising members or circumvent authentication walls.

How do you handle directory pagination?

Our crawlers traverse the entire directory tree using a combination of A-Z index scraping and regional filters, ensuring no chartered practice is missed during the extraction run.
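
Because the A-Z index and the regional filters overlap, the same practice can surface in both traversals. A minimal sketch of merging them by practice ID follows; the function name and the input shapes are assumptions for illustration, and the crawl itself is not shown.

```python
from typing import Iterable, Iterator, Set


def merge_listings(*sources: Iterable[dict]) -> Iterator[dict]:
    """Yield each practice once, in first-seen order, across several traversals."""
    seen: Set[str] = set()
    for source in sources:
        for listing in source:
            practice_id = listing["practice_id"]
            if practice_id in seen:
                continue  # already captured via another traversal
            seen.add(practice_id)
            yield listing


if __name__ == "__main__":
    az_index = [{"practice_id": "PR-1"}, {"practice_id": "PR-2"}]
    regional = [{"practice_id": "PR-2"}, {"practice_id": "PR-3"}]
    print([p["practice_id"] for p in merge_listings(az_index, regional)])
```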

Can you extract project values and GIA from case studies?

Yes. Where published in the case study metadata or text body, we extract contract values, gross internal area (GIA), and completion dates, normalising them into standard numeric formats.
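
As a rough illustration of that normalisation, the sketch below parses sterling amounts and metric floor areas out of free text. The function names, the accepted suffixes, and the unit spellings are assumptions for this example; real case study text will need a broader set of patterns.

```python
import re
from decimal import Decimal
from typing import Optional

_MULTIPLIERS = {
    "k": 1_000,
    "m": 1_000_000,
    "million": 1_000_000,
    "bn": 1_000_000_000,
    "billion": 1_000_000_000,
}

# "£18.9bn", "£2,500,000", "£5 million"; the trailing \b stops "£5 minimum"
# from being read as five million.
_MONEY_RE = re.compile(
    r"£\s*(\d[\d,]*(?:\.\d+)?)\s*(billion|million|bn|m|k)?\b", re.IGNORECASE
)

# "45,000 m²", "45000m2", "45,000 sqm", "45,000 sq. m"
_AREA_RE = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(?:m²|m2|sq\.?\s?m)(?![a-z])", re.IGNORECASE
)


def parse_contract_value(text: str) -> Optional[int]:
    """Return the first sterling amount in `text` as whole pounds, or None."""
    match = _MONEY_RE.search(text)
    if not match:
        return None
    amount = Decimal(match.group(1).replace(",", ""))  # Decimal avoids float drift
    suffix = (match.group(2) or "").lower()
    return int(amount * _MULTIPLIERS.get(suffix, 1))


def parse_gia(text: str) -> Optional[float]:
    """Return the first floor area in square metres found in `text`, or None."""
    match = _AREA_RE.search(text)
    if not match:
        return None
    return float(match.group(1).replace(",", ""))


if __name__ == "__main__":
    print(parse_contract_value("Contract value: £18.9bn"))
    print(parse_gia("GIA: 45,000 m²"))
```

Text with no recognisable amount returns None, which keeps unparsed values visible to the null-rate checks rather than coercing them to zero.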

Do you parse RIBA Plan of Work PDFs?

Yes. We operate a secondary document parsing pipeline that can extract text blocks and tables from publicly linked PDFs on the RIBA domain.

How fresh is the jobs data?

For RIBA Jobs, we can configure daily or sub-daily pipelines to ensure you capture new postings immediately and track closing dates accurately.

What is the minimum viable engagement?

Our smallest packages start with a one-off extraction of the chartered practice directory. For continuous monitoring of jobs or case studies, we price based on delivery frequency and schema complexity.

$ dataflirt scope --new-project --source=riba.org ready

Tell us what to extract. We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off practice directory export or continuous monitoring of RIBA Jobs and new case studies — we scope, build, and operate the pipeline. Tell us what you need.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h