
Wikitravel data,
structured for scale.

We extract destination guides, POI listings, itineraries, and transport details from Wikitravel. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.

Destinations mapped: 92.4K /run
POIs extracted: 1.8M /run
Revision updates: 14.2K /24h
Active pipelines: 42
Uptime: 99.98%
Data Dictionary

Every field we extract from wikitravel.org

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Destination Guides objects from wikitravel.org. All fields typed and schema-versioned.

destination_id, title, hierarchy, coordinates, introduction, get_in, get_around, climate, languages, respect, stay_safe, url, last_updated
destination_guides
● 200 OK
{
  "destination_id": "WT-3921",
  "title": "Kyoto",
  "hierarchy": "['Asia', 'Japan', 'Kansai', 'Kyoto']",
  "coordinates": "35.0116, 135.7681",
  "introduction": "Kyoto is the former capital of Japan...",
  "last_updated": "2023-10-14T08:12:00Z"
}

Complete list of extractable fields for POI Listings (See & Do) objects from wikitravel.org. All fields typed and schema-versioned.

poi_id, destination_id, category, name, alternate_name, address, directions, phone, email, website, hours, price, coordinates, description
poi_listings (see & do)
● 200 OK
{
  "poi_id": "POI-8492",
  "category": "See",
  "name": "Kinkaku-ji",
  "address": "1 Kinkakujicho, Kita Ward, Kyoto",
  "hours": "09:00-17:00",
  "price": "¥400",
  "coordinates": "35.0393, 135.7292"
}

Complete list of extractable fields for Accommodation (Sleep) objects from wikitravel.org. All fields typed and schema-versioned.

listing_id, destination_id, name, type, address, phone, email, website, check_in, check_out, price_range, description, coordinates
accommodation_(sleep)
● 200 OK
{
  "listing_id": "SLP-1102",
  "type": "Ryokan",
  "name": "Tawaraya",
  "address": "Fuyacho-dori, Nakagyo-ku",
  "check_in": "15:00",
  "price_range": "¥40,000+",
  "coordinates": "35.0111, 135.7649"
}

Complete list of extractable fields for Dining (Eat & Drink) objects from wikitravel.org. All fields typed and schema-versioned.

listing_id, destination_id, category, name, cuisine, address, phone, website, hours, price_range, description, coordinates, alcohol_served
dining_(eat & drink)
● 200 OK
{
  "listing_id": "EAT-9932",
  "category": "Eat",
  "name": "Nishiki Market",
  "cuisine": "Street Food",
  "hours": "09:00-18:00",
  "price_range": "¥500-¥2000",
  "description": "Historic marketplace known as Kyoto's Kitchen."
}

Complete list of extractable fields for Transport & Logistics objects from wikitravel.org. All fields typed and schema-versioned.

destination_id, transit_type, operator, routes, frequency, duration, cost, booking_url, terminal_info, notes
transport_& logistics
● 200 OK
{
  "destination_id": "WT-3921",
  "transit_type": "Train",
  "operator": "JR Central",
  "routes": "['Tokyo to Kyoto']",
  "duration": "2h 15m",
  "cost": "¥13,080"
}

Capabilities

Everything you need from Wikitravel, structured

Our Wikitravel scraper handles the complexities of MediaWiki parsing: normalising vCard templates, extracting geocoordinates, and mapping deep geographical hierarchies.

Wikitext Normalisation

Convert unstructured wiki markup into strict JSON, removing formatting artifacts and emitting clean text.

vCard Template Parsing

Extract specific fields from standard Wikitravel listing templates, handling malformed user inputs gracefully.

Geospatial Extraction

Capture lat/long coordinates for destinations and individual POIs, standardising them into GeoJSON formats.

Hierarchical Mapping

Maintain continent-to-city parent-child relationships using breadcrumb links and category tags.

Multi-Language Scraping

Extract guides across English, French, German, and 18 other language editions from their respective subdomains.

Revision Tracking

Monitor specific pages for edits, capturing diffs and update timestamps from the MediaWiki API.

Media & Image Metadata

Extract image URLs, captions, and attribution data linked within articles.

Section-Specific Targeting

Isolate 'Get in', 'See', 'Do', or 'Sleep' sections without pulling irrelevant surrounding text.
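
As an illustration of section targeting, the sketch below slices one level-2 section out of raw wikitext by its heading. It assumes headings follow the usual MediaWiki `==Heading==` form; the function name `extract_section` is hypothetical and this is not a description of the production parser.

```python
import re
from typing import Optional

# Matches level-2 wikitext headings such as "==See==" or "== Get in ==".
# Level-3 headings ("===By bus===") are deliberately not matched.
HEADING_RE = re.compile(r"^==[ \t]*([^=\n].*?)[ \t]*==[ \t]*$", re.MULTILINE)


def extract_section(wikitext: str, heading: str) -> Optional[str]:
    """Return the body of one level-2 section, or None if it is absent."""
    matches = list(HEADING_RE.finditer(wikitext))
    for i, match in enumerate(matches):
        if match.group(1).casefold() == heading.casefold():
            start = match.end()
            # The section runs until the next level-2 heading or end of text.
            end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
            return wikitext[start:end].strip()
    return None
```

Subsections stay attached to their parent section, so asking for 'Get in' also returns any nested 'By bus' or 'By train' blocks.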

Phrasebook Mining

Extract structured translation tables and pronunciation guides for NLP training corpora.

Continuous Diffing

Only push updates when a destination guide is modified, reducing redundant downstream processing.

// engagement pipeline

From wiki markup to warehouse record

Brief in. Clean data out.

Define Scope (day 0)

Provide target regions, languages, or specific categories. We design the extraction schema together.

Pipeline Build (days 2–4)

We configure Scrapy crawlers, wikitext parsers, and hierarchy mappers for wikitravel.org.

Validation & QA (days 4–6)

Schema validation, null-rate checks, and POI coordinate verification before full launch.
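
A null-rate check and a coordinate sanity check of the kind mentioned above could look roughly like this. It is a minimal sketch: the helper names `null_rates` and `coordinates_in_range` are hypothetical, and treating empty strings as nulls is an assumption.

```python
def null_rates(records, fields):
    """Fraction of records in which each field is missing, None, or empty."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for r in records if r.get(field) in (None, "")) / total
        for field in fields
    }


def coordinates_in_range(lat, lon):
    """True when the pair is a plausible WGS84 latitude/longitude."""
    return -90 <= lat <= 90 and -180 <= lon <= 180
```

A range check like this also catches swapped latitude and longitude whenever the longitude exceeds 90 degrees, as it does for most of Japan.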

Delivery (ongoing)

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.

Under the hood

How our Wikitravel pipeline handles unstructured wikis

MediaWiki sites present unique parsing challenges. Here is how we convert free-text travel guides into strict relational data.

pipeline-monitor · wikitravel.org · live ● active
// fingerprinting
Identity rotation
TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0
// pagination
Page coverage
48,291 pages queued · running
// observability
Pipeline health
Uptime: 99.9%
p99 latency: 142ms
Null rate: 0.3%
Alerts: 2
Unstructured Wikitext Parsing

We use DOM traversal and custom parsers to convert raw wiki markup and nested tables into clean JSON, stripping out irrelevant formatting tags.
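
To give a feel for the cleanup step, here is a deliberately simplified, regex-only sketch that reduces a line of wikitext to plain text. It is not the DOM-based parser described above; `strip_wikitext` is a hypothetical helper that ignores nested templates and tables.

```python
import re


def strip_wikitext(text: str) -> str:
    """Reduce simple inline wiki markup to plain text."""
    # [[Target|label]] -> label, [[Target]] -> Target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]+)\]\]", r"\1", text)
    # [https://example.org label] -> label
    text = re.sub(r"\[https?://\S+\s+([^\]]+)\]", r"\1", text)
    # Remove bold/italic quote runs ('' and ''') but keep single apostrophes.
    text = re.sub(r"'{2,}", "", text)
    # Drop inline HTML tags such as <br />.
    text = re.sub(r"<[^>]+>", "", text)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()
```

Real articles contain nested templates and tables that regular expressions cannot handle reliably, which is why a proper parser is needed in practice.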

vCard Template Extraction

Wikitravel relies on vCard templates for POIs. We handle malformed or incomplete listing templates without breaking the schema, applying fallback heuristics.
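
The sketch below shows one way a tolerant listing parser can keep the output schema stable. It assumes listings arrive as flat `{{see | name=... | address=...}}` style templates. That syntax, the field list, and the phone-number fallback are illustrative assumptions, not the exact rules used in production.

```python
import re

# Fields every output record carries, whether or not the template supplied them.
LISTING_FIELDS = ("name", "address", "phone", "hours", "price", "lat", "long")
# Loose phone pattern used only as a fallback for unlabelled values.
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{6,}\d")


def parse_listing(template: str) -> dict:
    """Parse one flat listing template into a fixed-shape record."""
    body = template.strip()
    if body.startswith("{{"):
        body = body[2:]
    if body.endswith("}}"):  # tolerate a missing closing "}}"
        body = body[:-2]
    parts = [p.strip() for p in body.split("|")]
    record = {"category": parts[0].lower() or None}
    record.update({field: None for field in LISTING_FIELDS})
    loose = []  # values that were not given as key=value
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        key = key.strip().lower()
        if sep and key in LISTING_FIELDS:
            record[key] = value.strip() or None
        elif part:
            loose.append(part)
    # Fallback heuristic: recover a phone number from unlabelled text.
    if record["phone"] is None:
        for text in loose:
            found = PHONE_RE.search(text)
            if found:
                record["phone"] = found.group(0)
                break
    return record
```

Because every record starts from the full field list, a broken template yields nulls rather than a record with missing keys.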

Hierarchy Resolution

We reconstruct the geographical tree from breadcrumb links and category tags, ensuring every city is correctly mapped to its parent region and country.
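
One simple way to represent that tree is a child-to-parent map built from breadcrumb trails. The sketch assumes each breadcrumb is an ordered list from continent down to destination, as in the `hierarchy` field above; `add_breadcrumb` and `ancestors` are hypothetical helpers.

```python
def add_breadcrumb(parents: dict, breadcrumb: list) -> None:
    """Record child -> parent edges for one breadcrumb trail (root first)."""
    for parent, child in zip([None] + breadcrumb[:-1], breadcrumb):
        existing = parents.setdefault(child, parent)
        if existing != parent:
            # The same name appears under two different parents.
            raise ValueError(f"conflicting parents for {child!r}: {existing!r} vs {parent!r}")


def ancestors(parents: dict, node: str) -> list:
    """Return the chain of parents from nearest to root."""
    chain = []
    current = parents.get(node)
    while current is not None:
        chain.append(current)
        current = parents.get(current)
    return chain
```

Keying on bare names is a simplification. Places that share a name would need a stable identifier such as `destination_id` instead.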

Coordinate Normalisation

User-submitted lat/long formats vary wildly. We normalise them into GeoJSON Point geometries for immediate mapping use.
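
A rough sketch of this normalisation, covering decimal degrees, hemisphere letters, and degrees-minutes-seconds input. It assumes latitude is written before longitude in the source text; `to_geojson_point` is a hypothetical helper. GeoJSON stores positions as longitude first, then latitude.

```python
import re

# One coordinate component: degrees, optional minutes and seconds,
# and an optional hemisphere letter.
PART_RE = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*°?\s*"
    r"(?:(\d+(?:\.\d+)?)\s*['′]\s*)?"
    r'(?:(\d+(?:\.\d+)?)\s*["″]\s*)?'
    r"([NSEW])?",
    re.IGNORECASE,
)


def _to_decimal(match) -> float:
    deg, minutes, seconds, hemi = match.groups()
    value = abs(float(deg)) + float(minutes or 0) / 60 + float(seconds or 0) / 3600
    # South and west, or an explicit minus sign, give negative values.
    if deg.startswith("-") or (hemi and hemi.upper() in ("S", "W")):
        value = -value
    return value


def to_geojson_point(raw: str):
    """Return a GeoJSON Point dict, or None if the input cannot be parsed."""
    parts = list(PART_RE.finditer(raw))
    if len(parts) != 2:
        return None
    lat, lon = (_to_decimal(p) for p in parts)  # assumes "lat, lon" input order
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return {"type": "Point", "coordinates": [round(lon, 6), round(lat, 6)]}
```

Returning None for unparseable input lets the record keep a null coordinate instead of a wrong one.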

Revision-based Change Detection

Using MediaWiki revision IDs, we only process updated articles. This saves compute and provides you with a clean stream of actual content changes.
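
In its simplest form, the comparison is a lookup of each page's latest revision ID against the one stored from the previous run. The sketch assumes both are available as title-to-revision-ID mappings; `pages_to_process` is a hypothetical name.

```python
def pages_to_process(last_seen: dict, latest: dict) -> list:
    """Titles whose latest revision ID differs from the stored one.

    Pages that were never seen before are included as well.
    """
    return sorted(
        title for title, rev_id in latest.items() if last_seen.get(title) != rev_id
    )
```

For example, with `{"Kyoto": 1001, "Osaka": 2002}` stored and `{"Kyoto": 1001, "Osaka": 2050, "Nara": 3003}` fetched, only Nara and Osaka are re-parsed.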

Applications

Who uses Wikitravel data and how

Teams across industries use wikitravel.org data to build competitive products and smarter operations.

01
Travel App Content Seeding

Populate new travel applications with baseline destination data, POIs, and introductory text.

02
LLM Context Grounding

Feed structured travel guides into RAG pipelines to ground AI travel assistants in accurate destination data.

03
Geospatial Mapping

Overlay Wikitravel POI coordinates onto custom maps for routing, discovery, and spatial analysis.

04
Itinerary Generation

Use 'See' and 'Do' listings to train automated trip planning algorithms and recommendation engines.

05
Competitor POI Analysis

Compare Wikitravel listings against proprietary databases to identify missing POIs or outdated information.

06
Language Translation Training

Utilise Wikitravel phrasebooks as parallel corpora for NLP models and translation software.

Why DataFlirt

"Wikitravel holds decades of crowdsourced travel intelligence, but its wiki markup makes it notoriously difficult to query programmatically."

Converting crowdsourced travel guides into production-ready databases requires complex wikitext parsing, template normalisation, and geospatial standardisation. DataFlirt handles the extraction and normalisation layers so your engineering team can focus on building travel products.

Technical Spec

Wikitravel scraper technical capabilities

Everything supported by our wikitravel.org scraper, from vCard listing extraction to revision-based change detection.

vCard listing extraction · Parses standard POI templates into structured fields · Supported
Geocoordinate standardisation · Normalises varying lat/long inputs into GeoJSON · Supported
Breadcrumb hierarchy mapping · Links cities to regions, countries, and continents · Supported
Revision history tracking · Captures timestamp and diffs for updated articles · Supported
Multi-language editions · Supports en, fr, de, and other language subdomains · Supported
Section-targeted scraping · Isolates specific headers like 'Get in' or 'Sleep' · Supported
Media URL extraction · Captures image links and attribution metadata · Supported
Change detection (diffs) · Only emit records with changed fields since last run · Supported
User account credentials & private drafts · Access to unpublished or user-gated draft pages · Partial
Admin-level deleted revision history · Access to revisions purged by site administrators · Partial
Infrastructure

Infrastructure powering the Wikitravel pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy · Playwright · Python 3.12 · Redis · PostgreSQL · Apache Airflow · AWS Lambda · S3 · CloudWatch · 2Captcha · CapSolver · Residential Proxies · Docker · Kubernetes · Grafana · Prometheus
MediaWiki Parsing Engine

Custom Scrapy middleware designed specifically to parse and normalise MediaWiki DOM structures and vCard templates into relational data.

Cloud-Native Orchestration

Pipelines run on AWS Lambda and ECS. Airflow handles scheduling, dependency management, and SLA alerting. All state stored in managed Postgres.

Change Data Capture

Hash-based diffing ensures downstream systems only receive updates when a Wikitravel article is actually modified, reducing redundant processing.
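
Hash-based diffing can be sketched as a fingerprint over a canonical serialisation of each record. The choice of SHA-256 and the decision to exclude `last_updated` from the hash are assumptions made for illustration; `record_fingerprint` and `has_changed` are hypothetical names.

```python
import hashlib
import json


def record_fingerprint(record: dict, ignore=("last_updated",)) -> str:
    """Stable SHA-256 hash of a record, independent of key order."""
    payload = {k: v for k, v in record.items() if k not in ignore}
    canonical = json.dumps(
        payload, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_changed(record: dict, previous_hash: str) -> bool:
    """True when the record's content differs from the stored fingerprint."""
    return record_fingerprint(record) != previous_hash
```

Sorting keys before hashing means a record that is merely re-serialised in a different field order does not trigger a downstream update.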

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON · Newline-delimited or nested array structures
CSV · Flat file with typed columns for simple ingestion
XLS · Excel compatible output for business analysts
Parquet · Columnar format for BigQuery, Snowflake, Athena
AWS S3 · Direct bucket delivery on defined schedules, compatible with any data lake
Webhook · HTTP POST per record for immediate downstream processing
API · REST endpoints to query extracted datasets
PostgreSQL · Direct database inserts with conflict resolution
// faq

Common questions.

About wikitravel.org scraping, legality, and pipeline operations.

Is scraping Wikitravel legal?

Wikitravel content is generally available under Creative Commons licenses. Scraping openly licensed content is generally permissible, provided the license terms (such as attribution) are respected. We adhere strictly to rate limits so that we do not degrade the site's performance.

How do you handle malformed listing templates?

User-generated content often breaks standard templates. We use fallback regex patterns and NLP heuristics to extract addresses, phone numbers, and coordinates even when the vCard markup is broken.

Can you extract data in languages other than English?

Yes. We support extraction across all active Wikitravel language subdomains, maintaining the schema structure regardless of the source language.

Do you provide lat/long coordinates?

Yes. We extract and normalise geolocation data for both overarching destinations and individual POI listings.

How frequently can you update the data?

We can configure daily, weekly, or monthly cadences. For high-frequency needs, we monitor the MediaWiki recent changes feed to trigger targeted updates.
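
For reference, polling the recent changes feed through the standard MediaWiki Action API could be sketched as below. The query parameters are the API's documented `list=recentchanges` options; the `api_base` endpoint is a placeholder you would need to confirm for the target wiki, and both function names are hypothetical.

```python
from urllib.parse import urlencode


def recent_changes_url(api_base: str, since_iso: str, limit: int = 50) -> str:
    """Build a recentchanges query for main-namespace edits since a timestamp."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|ids|timestamp",
        # Results are listed newest first by default, so rcend is the older bound.
        "rcend": since_iso,
        "rcnamespace": 0,
        "rclimit": limit,
        "format": "json",
    }
    return f"{api_base}?{urlencode(params)}"


def changed_titles(response: dict) -> set:
    """Collect page titles from a decoded recentchanges JSON response."""
    changes = response.get("query", {}).get("recentchanges", [])
    return {change["title"] for change in changes}
```

The resulting set of titles can then drive targeted re-extraction instead of a full crawl.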

Can you extract images from the articles?

We extract image URLs, captions, and attribution metadata. We do not typically download and host the binary image files, though this can be configured for custom pipelines.

What is the minimum viable engagement?

Our minimum engagement typically starts with a defined target list of regions or cities. Contact us with your specific coverage requirements for a precise quote.

$ dataflirt scope --new-project --source=wikitravel.org ready

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a full global destination dump or continuous updates for specific regions, we scope, build, and operate the pipeline. Tell us what you need.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h