
Global travel data,
parsed and structured.

We extract destination guides, POI listings, geographic coordinates, and regional hierarchies from Wikivoyage. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.

Destinations parsed: 134K /run
POI listings: 1.2M /run
Edits tracked: 45K /24h
Active pipelines: 14
Uptime: 99.99%
Data Dictionary

Every field we extract from wikivoyage.org

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Destinations objects from wikivoyage.org. All fields typed and schema-versioned.

page_id, title, url, continent, country, region, latitude, longitude, understand_text, climate_text, language_spoken, last_updated
destinations
● 200 OK
"page_id": "27845",
"title": "Kyoto",
"country": "Japan",
"region": "Kansai",
"latitude": 35.0116,
"longitude": 135.7681,
"climate_text": "Kyoto has a humid subtropical climate...",
"last_updated": "2023-10-14T08:22:11Z"

Complete list of extractable fields for POI Listings objects from wikivoyage.org. All fields typed and schema-versioned.

listing_id, destination, category, name, alt_name, address, directions, phone, email, website, hours, price, latitude, longitude, description
poi_listings
● 200 OK
"destination": "Kyoto",
"category": "See",
"name": "Kinkaku-ji",
"alt_name": "Golden Pavilion",
"address": "1 Kinkakujicho, Kita Ward",
"hours": "09:00-17:00",
"price": "¥400",
"latitude": 35.0393,
"longitude": 135.7292

Complete list of extractable fields for Regional Hierarchy objects from wikivoyage.org. All fields typed and schema-versioned.

entity_name, entity_type, parent_region, sub_regions, cities, other_destinations, geo_bounding_box, population, area_sq_km
regional_hierarchy
● 200 OK
"entity_name": "Kansai",
"entity_type": "Region",
"parent_region": "Japan",
"sub_regions": "['Kyoto Prefecture', 'Osaka Prefecture', 'Nara Prefecture']",
"cities": "['Kyoto', 'Osaka', 'Nara', 'Kobe']",
"population": 22757897,
"area_sq_km": 33112.62

Complete list of extractable fields for Transport Logistics objects from wikivoyage.org. All fields typed and schema-versioned.

destination, transport_mode, provider_name, route_details, frequency, duration, price_estimate, booking_url, terminal_info, notes
transport_logistics
● 200 OK
"destination": "Kyoto",
"transport_mode": "Train",
"provider_name": "JR Tokaido Shinkansen",
"route_details": "Tokyo to Kyoto",
"duration": "2 hours 15 minutes",
"price_estimate": "¥13,080",
"terminal_info": "Kyoto Station",
"notes": "Covered by Japan Rail Pass"

Complete list of extractable fields for Phrasebooks objects from wikivoyage.org. All fields typed and schema-versioned.

language_name, language_family, regions_spoken, pronunciation_guide, phrase_category, english_phrase, translated_phrase, transliteration, cultural_notes
phrasebooks
● 200 OK
"language_name": "Japanese",
"regions_spoken": "['Japan']",
"phrase_category": "Basics",
"english_phrase": "Thank you",
"translated_phrase": "ありがとう",
"transliteration": "Arigatou",
"cultural_notes": "Use 'Arigatou gozaimasu' for politeness"

Capabilities

Extract structured intelligence from open travel data

Wikivoyage wikitext is notoriously difficult to parse. We handle the VCard templates, nested regional hierarchies, and coordinate extraction so you receive clean, normalised database records.

Hierarchical Mapping

Extract the exact parent-child relationships from continents down to specific districts and neighbourhoods.

VCard POI Extraction

Parse 'See', 'Do', 'Eat', 'Sleep', and 'Buy' listings into distinct fields: name, address, coordinates, hours, and price.

Geodata Normalisation

Convert inline wikitext coordinates and GeoJSON map data into standard latitude and longitude decimal values.

Phrasebook Parsing

Structure language guides into queryable dictionaries mapping English phrases to local scripts and transliterations.

Transport Logistics

Isolate 'Get in' and 'Get around' sections to extract airport details, train routes, and local transit advice.

Multi-Language Linking

Map articles across different language editions of Wikivoyage to build comprehensive global datasets.

Travel Warnings

Extract safety banners and 'Stay safe' sections to monitor regional instability and health advisories.

Itinerary Generation Data

Capture suggested routes, day trips, and multi-day itineraries with ordered waypoints.

Revision Tracking

Monitor recent changes via the MediaWiki API so that only records modified since the last crawl are updated.

// engagement pipeline

From wikitext to warehouse record

Brief in. Clean data out.

Define Scope
Day 0

Specify target regions, languages, and required data fields. We configure the parsing logic for specific VCard templates.

Pipeline Build
Days 2–4

We deploy MediaWiki API consumers and custom wikitext parsers to extract structured fields from unstructured prose.

Validation & QA
Days 4–6

Coordinate boundary checks, null-rate monitoring on addresses, and template fallback validation before full launch.
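
As an illustration of null-rate monitoring, here is a minimal sketch that computes the share of empty values per field and flags fields above a threshold. The function names, the sample records, and the 5% threshold are assumptions made for the example, not production values.

```python
def null_rates(records: list[dict], fields: list[str]) -> dict[str, float]:
    """Return the fraction of records where each field is missing or empty."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for r in records if r.get(field) in (None, "")) / total
        for field in fields
    }


def fields_over_threshold(rates: dict[str, float], threshold: float) -> list[str]:
    """List the fields whose null rate exceeds the allowed threshold."""
    return sorted(field for field, rate in rates.items() if rate > threshold)


# Hypothetical sample batch: one of four listings has no address.
batch = [
    {"name": "Kinkaku-ji", "address": "1 Kinkakujicho, Kita Ward"},
    {"name": "Nijo Castle", "address": "541 Nijojocho"},
    {"name": "Fushimi Inari", "address": ""},
    {"name": "Gion", "address": "Higashiyama Ward"},
]
rates = null_rates(batch, ["name", "address"])
flagged = fields_over_threshold(rates, threshold=0.05)
```

A check like this would typically run per batch before delivery, so a sudden jump in empty addresses surfaces as a parsing regression rather than reaching the warehouse.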

Delivery
ongoing

JSON, CSV, or Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on your required schedule.

Under the hood

How our pipeline handles Wikivoyage parsing

Extracting data from community-edited wikis means dealing with inconsistent formatting and broken templates. Here is how we maintain data quality.

pipeline-monitor · wikivoyage.org · live ● active
// fingerprinting
Identity rotation
TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0
// pagination
Page coverage
48,291 pages queued (running)
// observability
Pipeline health
Uptime: 99.9%
p99 latency: 142ms
Null rate: 0.3%
Alerts: 2
Template parsing
Handling inconsistent VCard usage

Editors frequently break wikitext templates or use deprecated formats. Our parsers use fallback regex patterns and AST traversal to extract listings even when standard tags are missing or malformed.
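
To make the fallback idea concrete, here is a minimal standard-library sketch of a tolerant listing parser. It is not the production parser; an AST library such as mwparserfromhell would normally run first. The function names and the sample wikitext are assumptions made for the example.

```python
import re
from typing import Optional

LISTING_TYPES = {"see", "do", "eat", "sleep", "buy", "listing"}


def split_fields(body: str) -> list[str]:
    """Split a template body on '|' characters that are not nested in {{..}} or [[..]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            if depth == 0 and pair == "}}":
                break  # closing braces of the listing itself
            depth = max(depth - 1, 0)
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))  # also reached when the closing '}}' is missing
    return parts


def parse_listing(wikitext: str) -> Optional[dict[str, str]]:
    """Extract key=value fields from the first listing template, even if unclosed."""
    match = re.search(r"\{\{\s*([A-Za-z]+)\s*\|", wikitext)
    if match is None or match.group(1).lower() not in LISTING_TYPES:
        return None
    record = {"type": match.group(1).lower()}
    for part in split_fields(wikitext[match.end():]):
        key, sep, value = part.partition("=")
        if sep and value.strip():
            record[key.strip().lower()] = value.strip()
    return record


well_formed = "{{see\n| name=Kinkaku-ji | alt=Golden Pavilion | price={{JPY|400}} | lat=35.0393\n}}"
broken = "{{see | name=Kinkaku-ji | phone= | price=¥400"  # closing braces missing
```

Tracking nesting depth keeps a nested template such as the price value intact, and treating end-of-text as an implicit close lets the unclosed listing still yield its fields.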

Coordinate resolution
Normalising spatial data

Coordinates on Wikivoyage appear in infoboxes, inline text, and Wikidata links. We aggregate these sources, convert DMS (degrees, minutes, seconds) to decimal degrees, and validate them against regional bounding boxes.
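
The DMS conversion and bounding-box check can be sketched as follows. This is a simplified illustration: the regular expression covers only the common degree, minute, second plus hemisphere notation, the bounding box for Japan is a rough assumed value, and boxes that cross the antimeridian are not handled.

```python
import re

DMS_PATTERN = re.compile(
    r"""(?P<deg>\d+(?:\.\d+)?)\s*°\s*
        (?:(?P<min>\d+(?:\.\d+)?)\s*[′']\s*)?
        (?:(?P<sec>\d+(?:\.\d+)?)\s*[″"]\s*)?
        (?P<hem>[NSEW])""",
    re.VERBOSE,
)


def dms_to_decimal(text: str) -> float:
    """Convert a string such as 35°0′41.76″N to signed decimal degrees."""
    match = DMS_PATTERN.search(text)
    if match is None:
        raise ValueError(f"no DMS coordinate found in {text!r}")
    value = (
        float(match["deg"])
        + float(match["min"] or 0) / 60
        + float(match["sec"] or 0) / 3600
    )
    # Southern and western hemispheres are negative in decimal notation.
    return -value if match["hem"] in "SW" else value


def in_bounding_box(lat: float, lon: float, box: tuple) -> bool:
    """box is (south, west, north, east) in decimal degrees."""
    south, west, north, east = box
    return south <= lat <= north and west <= lon <= east


JAPAN_BOX = (24.0, 122.0, 46.0, 154.0)  # rough illustrative bounds, not authoritative

lat = dms_to_decimal("35°0′41.76″N")
lon = dms_to_decimal("135°46′5.16″E")
plausible = in_bounding_box(lat, lon, JAPAN_BOX)
```

A sign error on the longitude, a common editing mistake, would place the point outside the box and mark the record for review.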

Hierarchy extraction
Rebuilding the geographical tree

The 'IsPartOf' (is part of) breadcrumb relationships define the travel hierarchy. We traverse these links to build a strict parent-child relational database, resolving circular references and orphaned articles.
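
One way to picture the traversal is a walk up each article's parent chain that sorts it into "reaches a root", "loops", or "dead-ends". The sketch below assumes the parent links have already been extracted into a plain child-to-parent mapping; the function names and the sample data are hypothetical.

```python
from collections import defaultdict


def classify(node: str, parent_of: dict[str, str], roots: set[str]) -> str:
    """Walk up the parent chain and report 'ok', 'cycle', or 'orphan'."""
    seen = {node}
    current = node
    while current not in roots:
        parent = parent_of.get(current)
        if parent is None:
            return "orphan"  # chain ends at an article that is not a known root
        if parent in seen:
            return "cycle"  # chain revisits an article already on the path
        seen.add(parent)
        current = parent
    return "ok"


def build_tree(parent_of: dict[str, str], roots: set[str]):
    """Return (children by parent, rejected articles with the reason)."""
    children = defaultdict(list)
    rejected = {}
    for node, parent in parent_of.items():
        status = classify(node, parent_of, roots)
        if status == "ok":
            children[parent].append(node)
        else:
            rejected[node] = status
    return dict(children), rejected


links = {
    "Kyoto": "Kansai",
    "Osaka": "Kansai",
    "Kansai": "Japan",
    "Japan": "Asia",
    "A": "B",  # A and B point at each other
    "B": "A",
    "Atlantis": "Lost Lands",  # parent article has no parent and is not a root
}
tree, rejected = build_tree(links, roots={"Asia"})
```

Rejected articles are kept with their reason rather than dropped, so they can be queued for manual review or a later re-crawl.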

Differential updates
Scraping via the RecentChanges API

Instead of re-parsing 30,000 articles daily, we monitor the MediaWiki RecentChanges feed. We only fetch and parse articles that have been modified, drastically reducing latency and compute overhead.
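
A hedged sketch of the differential approach: build a RecentChanges query against the MediaWiki Action API, then compare the reported revision IDs with those stored from the last crawl. The helper names and the stored-revision mapping are assumptions for the example. The HTTP request itself is omitted, and the response shown is a hand-written sample in the shape the API returns.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikivoyage.org/w/api.php"


def recent_changes_url(since_iso: str, limit: int = 500) -> str:
    """Build a query listing main-namespace edits back to the last crawl time."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcnamespace": 0,
        "rctype": "edit|new",
        "rcprop": "title|ids|timestamp",
        "rcend": since_iso,  # default direction is newest first, ending here
        "rclimit": limit,
        "format": "json",
        "formatversion": 2,
    }
    return f"{API_URL}?{urlencode(params)}"


def pages_to_refetch(response: dict, last_seen_revid: dict[str, int]) -> list[str]:
    """Return titles whose newest reported revision is newer than the stored one."""
    newest: dict[str, int] = {}
    for change in response.get("query", {}).get("recentchanges", []):
        title, revid = change["title"], change["revid"]
        newest[title] = max(revid, newest.get(title, 0))
    return sorted(t for t, r in newest.items() if r > last_seen_revid.get(t, 0))


url = recent_changes_url("2023-10-14T08:22:11Z")

# Hand-written sample response with made-up revision IDs.
sample = {"query": {"recentchanges": [
    {"title": "Kyoto", "revid": 200, "timestamp": "2023-10-15T01:00:00Z"},
    {"title": "Kyoto", "revid": 190, "timestamp": "2023-10-14T09:00:00Z"},
    {"title": "Osaka", "revid": 150, "timestamp": "2023-10-14T10:00:00Z"},
    {"title": "Nara", "revid": 10, "timestamp": "2023-10-14T11:00:00Z"},
]}}
stale = pages_to_refetch(sample, {"Kyoto": 190, "Osaka": 150})
```

Only the stale titles are fetched and re-parsed. A full implementation would also follow the API's continuation token when more changes exist than one response can hold.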

Wikidata integration
Enriching POIs with external IDs

When Wikivoyage listings include a Wikidata Q-identifier, we resolve it to append external metadata, alternative language names, and official website URLs that might be missing from the local text.
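
As a sketch of the enrichment step, the function below fills a missing website from the Wikidata "official website" property (P856) and collects labels in requested languages. It assumes the entity has already been fetched in Wikidata's standard entity JSON layout. The `labels` output key, the function name, and the sample entity are hypothetical.

```python
def enrich_from_wikidata(listing: dict, entity: dict, languages=("ja",)) -> dict:
    """Return a copy of the listing with gaps filled from a Wikidata entity."""
    enriched = dict(listing)
    if not enriched.get("website"):
        # P856 is the Wikidata property for "official website".
        for claim in entity.get("claims", {}).get("P856", []):
            datavalue = claim.get("mainsnak", {}).get("datavalue")
            if datavalue and datavalue.get("value"):
                enriched["website"] = datavalue["value"]
                break
    labels = entity.get("labels", {})
    enriched["labels"] = {
        lang: labels[lang]["value"] for lang in languages if lang in labels
    }
    return enriched


# Hand-written sample entity; the URL is a placeholder, not real data.
entity = {
    "labels": {"ja": {"language": "ja", "value": "金閣寺"}},
    "claims": {"P856": [{"mainsnak": {
        "snaktype": "value",
        "datavalue": {"value": "https://example.org/", "type": "string"},
    }}]},
}
listing = {"name": "Kinkaku-ji", "website": ""}
enriched = enrich_from_wikidata(listing, entity, languages=("ja", "fr"))
```

Values already present in the Wikivoyage listing are left untouched, so the local text stays authoritative and Wikidata only fills gaps.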

Applications

Who uses Wikivoyage data

Teams across industries use wikivoyage.org data to build competitive products and smarter operations.

01
Travel Aggregator Apps

Mobile applications use the destination hierarchies and POI listings to populate offline maps and city guides.

02
LLM Training & RAG

AI companies ingest the structured text to train travel-specific conversational agents and itinerary planners.

03
Map Data Enrichment

GIS platforms overlay Wikivoyage 'See' and 'Do' listings onto base maps to provide contextual tourist information.

04
Local SEO Seeding

Directory websites use the openly licensed listings to bootstrap local business databases across thousands of cities.

05
Risk Management

Corporate travel platforms monitor 'Stay safe' sections and regional warnings to alert employees of local hazards.

06
Translation Services

Language apps ingest the phrasebook data to build practical, travel-oriented translation dictionaries.

Why DataFlirt

"Wikivoyage contains the world's most comprehensive open travel dataset, but extracting structured POIs from wikitext requires complex parsing logic."

Most teams underestimate the difficulty of parsing MediaWiki templates. Reliable Wikivoyage extraction requires handling inconsistent VCard formats, nested regional hierarchies, and multi-language entity resolution. DataFlirt absorbs that complexity so your engineers can focus on product development.

Technical Spec

Wikivoyage scraper technical capabilities

Everything supported by our wikivoyage.org scraper — rendered SPA elements, auth walls, rate-limit evasion and beyond.

MediaWiki API integration
Native consumption of the Action API for bulk exports and revision tracking
Supported
VCard template parsing
Extraction of structured listings (See, Do, Eat, Sleep, Buy)
Supported
GeoJSON coordinate extraction
Parsing dynamic map data and inline coordinate templates
Supported
Multi-language mapping
Cross-referencing articles via interlanguage links
Supported
Revision history diffs
Change detection based on MediaWiki timestamp and revision IDs
Supported
Image metadata extraction
Capturing Wikimedia Commons URLs and attribution data
Supported
Wikidata Q-ID resolution
Mapping local articles to global Wikidata entities
Supported
User account settings
Private user preferences, watchlists, and email configurations
Partial
Private draft edits
Unpublished user sandboxes and deleted revisions
Partial
Admin IP logs
CheckUser data and private moderation logs
Partial
Infrastructure

Infrastructure powering the Wikivoyage pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus, MediaWiki API, mwparserfromhell
Wikitext Parsing Stack

We utilise specialised AST parsers in Python to safely decompose complex wikitext, extracting nested templates and ignoring formatting markup.

Geodata Infrastructure

PostGIS extensions in our PostgreSQL clusters validate extracted coordinates against known administrative boundaries to filter out anomalous data points.

Cloud-Native Orchestration

Pipelines run on Kubernetes. Airflow handles scheduling, dependency management, and SLA alerting. All state is stored in managed Postgres.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON
Nested structures preserving hierarchical region data
CSV
Flat file with typed columns for POI listings
XLS
Excel compatible format for editorial review
Parquet
Columnar format for BigQuery, Snowflake, Athena
AWS S3
Direct bucket delivery compatible with any data lake
Webhook
HTTP POST per record for real-time downstream processing
API
REST endpoints to query extracted destinations on demand
PostgreSQL
Direct database upserts with PostGIS coordinate types
// faq

Common questions.

About wikivoyage.org scraping, legality, and pipeline operations.

Ask us directly →
What is the licensing for Wikivoyage data?

Wikivoyage text is licensed under the Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0) license; contributions made before the 2023 license upgrade were released under CC BY-SA 3.0. You are free to use, adapt, and distribute the data, provided you attribute the source and share adaptations under the same license. DataFlirt extracts the data; clients are responsible for complying with the license terms in their applications.

How do you handle broken wikitext templates?

Community editing results in frequent syntax errors. We parse with Abstract Syntax Tree (AST) parsers first rather than relying on regex alone. When standard VCard templates fail to parse, our system falls back to regex patterns and heuristic text extraction to recover the POI data.

How frequently can the data be updated?

We can stream updates continuously by monitoring the MediaWiki RecentChanges API, ensuring your database reflects edits within minutes. Alternatively, we can provide full catalogue snapshots on a daily or weekly schedule.

Do you extract data from non-English Wikivoyage sites?

Yes. We support all language editions, including German, French, Italian, and Spanish. We can also provide cross-language mapping, linking the English article for 'Rome' to the Italian 'Roma' to build unified multi-lingual records.
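
Cross-language mapping of this kind can be sketched with the Action API's interlanguage links (`prop=langlinks`). The helpers below are illustrative assumptions: they build the query URL and fold a response into one record per article. The sample response is hand-written in the `formatversion=2` layout, with the source edition assumed to be English.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikivoyage.org/w/api.php"


def langlinks_url(titles: list[str]) -> str:
    """Build a query for the interlanguage links of the given English titles."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": "|".join(titles),
        "lllimit": "max",
        "format": "json",
        "formatversion": 2,
    }
    return f"{API_URL}?{urlencode(params)}"


def unify_langlinks(response: dict, source_lang: str = "en") -> dict:
    """Map each source title to its titles in the other language editions."""
    unified = {}
    for page in response.get("query", {}).get("pages", []):
        titles = {source_lang: page["title"]}
        for link in page.get("langlinks", []):
            titles[link["lang"]] = link["title"]
        unified[page["title"]] = titles
    return unified


url = langlinks_url(["Rome", "Kyoto"])

# Hand-written sample response.
sample = {"query": {"pages": [
    {"title": "Rome", "langlinks": [
        {"lang": "it", "title": "Roma"},
        {"lang": "de", "title": "Rom"},
    ]},
]}}
records = unify_langlinks(sample)
```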

Can you extract the map data (GeoJSON)?

Yes. We parse the <mapframe> tags and external map links to extract GeoJSON geometries, point coordinates, and map marker definitions associated with the articles.
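
As a rough illustration, inline GeoJSON can be pulled out of mapframe tags with the standard library alone. The sketch below skips self-closing tags (which carry no GeoJSON body) and bodies that fail to parse as JSON. The function names and the sample wikitext are assumptions, and a production parser would handle more tag variants.

```python
import json
import re

# Match <mapframe ...>body</mapframe>, skipping self-closing <mapframe ... />.
MAPFRAME = re.compile(
    r"<mapframe\b[^>]*(?<!/)>(.*?)</mapframe>", re.DOTALL | re.IGNORECASE
)


def extract_mapframe_geojson(wikitext: str) -> list:
    """Return the parsed GeoJSON body of every non-empty mapframe tag."""
    results = []
    for body in MAPFRAME.findall(wikitext):
        body = body.strip()
        if not body:
            continue
        try:
            results.append(json.loads(body))
        except json.JSONDecodeError:
            continue  # malformed JSON: leave for a fallback parser or manual review
    return results


def iter_points(node):
    """Yield (lat, lon) for every Point geometry; GeoJSON stores [lon, lat]."""
    if isinstance(node, list):
        for item in node:
            yield from iter_points(item)
    elif isinstance(node, dict):
        kind = node.get("type")
        if kind == "FeatureCollection":
            yield from iter_points(node.get("features", []))
        elif kind == "Feature":
            yield from iter_points(node.get("geometry") or {})
        elif kind == "Point":
            lon, lat = node["coordinates"][:2]
            yield (lat, lon)


sample = (
    'Intro <mapframe latitude="35.01" longitude="135.77" zoom="12" />\n'
    '<mapframe width="400" height="300">'
    '{"type": "Feature", "geometry": {"type": "Point", '
    '"coordinates": [135.7292, 35.0393]}, "properties": {"title": "Kinkaku-ji"}}'
    "</mapframe>"
)
geometries = extract_mapframe_geojson(sample)
points = list(iter_points(geometries))
```

Note the coordinate order: GeoJSON lists longitude first, so the pair is swapped before it is written to the latitude and longitude fields.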

Is it possible to get historical revisions?

Yes. We can target the MediaWiki history endpoints to extract previous versions of an article or track how a specific listing has changed over time.

$ dataflirt scope --new-project --source=wikivoyage.org ready

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off database dump of global POIs or a continuous feed of travel warnings and new listings, tell us what you need.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h