We extract destination guides, POI listings, geographic coordinates, and regional hierarchies from Wikivoyage. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.
Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.
Complete list of extractable fields for Destinations objects from wikivoyage.org. All fields typed and schema-versioned.
"page_id": "27845", "title": "Kyoto", "country": "Japan", "region": "Kansai", "latitude": 35.0116, "longitude": 135.7681, "climate_text": "Kyoto has a humid subtropical climate...", "last_updated": "2023-10-14T08:22:11Z"
| # | page_id | title | url | continent | country | region |
|---|---|---|---|---|---|---|
Complete list of extractable fields for POI Listings objects from wikivoyage.org. All fields typed and schema-versioned.
"destination": "Kyoto", "category": "See", "name": "Kinkaku-ji", "alt_name": "Golden Pavilion", "address": "1 Kinkakujicho, Kita Ward", "hours": "09:00-17:00", "price": "¥400", "latitude": 35.0393, "longitude": 135.7292
| # | listing_id | destination | category | name | alt_name | address |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Regional Hierarchy objects from wikivoyage.org. All fields typed and schema-versioned.
"entity_name": "Kansai", "entity_type": "Region", "parent_region": "Japan", "sub_regions": "['Kyoto Prefecture', 'Osaka Prefecture', 'Nara Prefecture']", "cities": "['Kyoto', 'Osaka', 'Nara', 'Kobe']", "population": 22757897, "area_sq_km": 33112.62
| # | entity_name | entity_type | parent_region | sub_regions | cities | other_destinations |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Transport Logistics objects from wikivoyage.org. All fields typed and schema-versioned.
"destination": "Kyoto", "transport_mode": "Train", "provider_name": "JR Tokaido Shinkansen", "route_details": "Tokyo to Kyoto", "duration": "2 hours 15 minutes", "price_estimate": "¥13,080", "terminal_info": "Kyoto Station", "notes": "Covered by Japan Rail Pass"
| # | destination | transport_mode | provider_name | route_details | frequency | duration |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Phrasebooks objects from wikivoyage.org. All fields typed and schema-versioned.
"language_name": "Japanese", "regions_spoken": "['Japan']", "phrase_category": "Basics", "english_phrase": "Thank you", "translated_phrase": "ありがとう", "transliteration": "Arigatou", "cultural_notes": "Use 'Arigatou gozaimasu' for politeness"
| # | language_name | language_family | regions_spoken | pronunciation_guide | phrase_category | english_phrase |
|---|---|---|---|---|---|---|
Wikivoyage wikitext is notoriously difficult to parse. We handle the VCard templates, nested regional hierarchies, and coordinate extraction so you receive clean, normalised database records.
Extract the exact parent-child relationships from continents down to specific districts and neighbourhoods.
Parse 'See', 'Do', 'Eat', 'Sleep', and 'Buy' listings into distinct fields: name, address, coordinates, hours, and price.
Convert inline wikitext coordinates and GeoJSON map data into standard latitude and longitude decimal values.
Structure language guides into queryable dictionaries mapping English phrases to local scripts and transliterations.
Isolate 'Get in' and 'Get around' sections to extract airport details, train routes, and local transit advice.
Map articles across different language editions of Wikivoyage to build comprehensive global datasets.
Extract safety banners and 'Stay safe' sections to monitor regional instability and health advisories.
Capture suggested routes, day trips, and multi-day itineraries with ordered waypoints.
Monitor recent changes via the MediaWiki API to only update records that have been modified since the last crawl.
Brief in. Clean data out.
Specify target regions, languages, and required data fields. We configure the parsing logic for specific VCard templates.
We deploy MediaWiki API consumers and custom wikitext parsers to extract structured fields from unstructured prose.
Coordinate boundary checks, null-rate monitoring on addresses, and template fallback validation before full launch.
JSON, CSV, or Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on your required schedule.
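The quality checks in the third step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not our production code: the helper names `null_rate` and `coords_in_range` and the sample records are hypothetical, and a real pipeline would compare the null rate against a threshold agreed during scoping.

```python
from typing import Iterable, Mapping, Optional


def null_rate(records: Iterable[Mapping[str, object]], field: str) -> float:
    """Share of records where `field` is missing, None, or blank."""
    total = 0
    empty = 0
    for rec in records:
        total += 1
        value = rec.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            empty += 1
    return empty / total if total else 0.0


def coords_in_range(lat: Optional[float], lon: Optional[float]) -> bool:
    """Global sanity check: latitude in [-90, 90], longitude in [-180, 180]."""
    if lat is None or lon is None:
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


# Hypothetical sample rows, used only to exercise the checks.
sample = [
    {"name": "Kinkaku-ji", "address": "1 Kinkakujicho, Kita Ward",
     "latitude": 35.0393, "longitude": 135.7292},
    {"name": "Unnamed shrine", "address": "",
     "latitude": 35.01, "longitude": 135.77},
    {"name": "Broken row", "address": None,
     "latitude": 350.0, "longitude": 135.0},
    {"name": "Tea house", "address": "Gion",
     "latitude": None, "longitude": None},
]

address_null_rate = null_rate(sample, "address")
bad_coords = [r["name"] for r in sample
              if not coords_in_range(r["latitude"], r["longitude"])]
```

Blank strings count as nulls here because an empty address field in a listing is as unusable downstream as a missing one.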
Extracting data from community-edited wikis means dealing with inconsistent formatting and broken templates. Here is how we maintain data quality.
Editors frequently break wikitext templates or use deprecated formats. Our parsers use fallback regex patterns and AST traversal to extract listings even when standard tags are missing or malformed.
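A simplified sketch of the parse-then-fall-back idea is shown here. It assumes English-style listing templates such as `{{see|name=...|lat=...|long=...}}` and uses a depth-aware splitter in place of a full AST parser. The function names, the `LISTING_TYPES` tuple, and the nested `{{convert|...}}` template in the sample are illustrative assumptions.

```python
import re
from typing import Dict, List

LISTING_TYPES = ("see", "do", "eat", "sleep", "buy", "drink", "listing")
# Fallback: pull a few well-known fields out of malformed template text.
FALLBACK_FIELD = re.compile(
    r"\|\s*(name|address|lat|long|hours|price)\s*=\s*([^|\n}]+)"
)


def split_top_level(body: str) -> List[str]:
    """Split on '|' while ignoring pipes nested inside {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_template(text: str) -> Dict[str, str]:
    """Parse one well-formed '{{type|key=value|...}}' template into a dict."""
    parts = split_top_level(text.strip()[2:-2])
    record = {"category": parts[0].strip().lower()}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if sep and value.strip():
            record[key.strip()] = value.strip()
    return record


def parse_listing(text: str) -> Dict[str, str]:
    """Try the structured parse first, then fall back to field-level regex."""
    stripped = text.strip()
    if stripped.startswith("{{") and stripped.endswith("}}"):
        record = parse_template(stripped)
        if record["category"] in LISTING_TYPES:
            return record
    record = {key: value.strip() for key, value in FALLBACK_FIELD.findall(text)}
    head = re.match(r"\s*\{\{\s*(\w+)", text)
    name = head.group(1).lower() if head else ""
    record["category"] = name if name in LISTING_TYPES else "unknown"
    return record


well_formed = ("{{see|name=Kinkaku-ji|alt=Golden Pavilion|lat=35.0393|long=135.7292"
               "|content=Famous for its {{convert|12|m}} pavilion.}}")
broken = "{{see|name=Nijo Castle|address=541 Nijojocho|lat=35.0142|long=135.7480"

good_record = parse_listing(well_formed)
recovered = parse_listing(broken)
```

The second sample is missing its closing braces, so the structured path is skipped and the regex fallback recovers the name, address, and coordinates.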
Coordinates on Wikivoyage appear in infoboxes, inline text, and Wikidata links. We aggregate these sources, convert DMS (degrees, minutes, seconds) to decimal degrees, and validate them against regional bounding boxes.
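The DMS conversion and bounding-box check can be illustrated with a short sketch. The regular expression covers only the common `35°0′42″N` style, and `JAPAN_BBOX` is a rough, assumed rectangle rather than an authoritative boundary. Real validation would use proper administrative polygons.

```python
import re
from typing import Optional, Tuple

DMS_PATTERN = re.compile(
    r"""(?P<deg>\d+(?:\.\d+)?)\s*°\s*
        (?:(?P<min>\d+(?:\.\d+)?)\s*[′']\s*)?
        (?:(?P<sec>\d+(?:\.\d+)?)\s*[″"]\s*)?
        (?P<hem>[NSEW])""",
    re.VERBOSE,
)

BBox = Tuple[float, float, float, float]  # (min_lat, min_lon, max_lat, max_lon)

# Rough, illustrative bounding box for Japan (assumed values).
JAPAN_BBOX: BBox = (24.0, 122.0, 46.0, 154.0)


def dms_to_decimal(text: str) -> Optional[float]:
    """Convert the first DMS coordinate found in `text` to decimal degrees."""
    match = DMS_PATTERN.search(text)
    if match is None:
        return None
    degrees = float(match.group("deg"))
    minutes = float(match.group("min") or 0)
    seconds = float(match.group("sec") or 0)
    value = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative in decimal notation.
    return -value if match.group("hem") in "SW" else value


def within_bbox(lat: float, lon: float, bbox: BBox) -> bool:
    """True when the point lies inside the (inclusive) bounding box."""
    min_lat, min_lon, max_lat, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


kyoto_lat = dms_to_decimal("35°0′42″N")
kyoto_lon = dms_to_decimal("135°46′6″E")
kyoto_ok = within_bbox(kyoto_lat, kyoto_lon, JAPAN_BBOX)
```

A listing filed under a Japanese destination whose coordinates fall outside the box would be flagged for review rather than silently delivered.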
The 'IsPartOf' template (historically 'isin') defines the travel hierarchy by naming each article's parent. We traverse these links to build a strict parent-child relational database, resolving circular references and orphaned articles.
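The traversal described above amounts to walking each article's parent chain until a known root is reached. The sketch below assumes the parent links have already been extracted into a plain dictionary. The function name `build_paths` and the sample articles, including the deliberately broken ones, are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def build_paths(
    parent_of: Dict[str, Optional[str]], roots: Set[str]
) -> Tuple[Dict[str, List[str]], Set[str], Set[str]]:
    """Resolve each article to a root-to-leaf path; report cycles and orphans."""
    paths: Dict[str, List[str]] = {}
    cyclic: Set[str] = set()
    orphaned: Set[str] = set()

    for article in parent_of:
        chain = [article]
        seen = {article}
        current = article
        while current not in roots:
            parent = parent_of.get(current)
            if parent is None:
                # Chain ends before reaching a root: no usable parent link.
                orphaned.add(article)
                break
            if parent in seen:
                # Parent already visited: the links loop back on themselves.
                cyclic.add(article)
                break
            chain.append(parent)
            seen.add(parent)
            current = parent
        else:
            # Loop finished without a break, so a root was reached.
            paths[article] = list(reversed(chain))
    return paths, cyclic, orphaned


# Simplified sample links; the last three entries are deliberately broken.
parent_of = {
    "Asia": None,
    "Japan": "Asia",
    "Kansai": "Japan",
    "Kyoto": "Kansai",
    "Lost Village": None,
    "Loop Town": "Loop County",
    "Loop County": "Loop Town",
}

paths, cyclic, orphaned = build_paths(parent_of, roots={"Asia"})
```

Articles reported as cyclic or orphaned are kept out of the resolved hierarchy so they can be repaired or reviewed separately.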
Instead of re-parsing 30,000 articles daily, we monitor the MediaWiki RecentChanges feed. We only fetch and parse articles that have been modified, drastically reducing latency and compute overhead.
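As a rough sketch of the incremental approach, the snippet below builds a MediaWiki `list=recentchanges` query and reduces a response to the set of page IDs that need re-parsing. It makes no network call. The helper names and the sample response are illustrative, and continuation handling for large result sets is omitted.

```python
from typing import Dict, Set
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikivoyage.org/w/api.php"


def recent_changes_url(since_iso: str, limit: int = 500) -> str:
    """Build a query for main-namespace edits and new pages since `since_iso`."""
    params = {
        "action": "query",
        "list": "recentchanges",
        # Results are listed newest first by default, so rcend is the older bound.
        "rcend": since_iso,
        "rcnamespace": 0,
        "rcprop": "title|ids|timestamp",
        "rctype": "edit|new",
        "rclimit": limit,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def changed_page_ids(response: Dict) -> Set[int]:
    """Collapse a recentchanges response to the distinct pages that changed."""
    changes = response.get("query", {}).get("recentchanges", [])
    return {change["pageid"] for change in changes}


url = recent_changes_url("2023-10-14T08:22:11Z")

# Trimmed, made-up response in the shape the API returns for rcprop=title|ids.
sample_response = {
    "query": {
        "recentchanges": [
            {"type": "edit", "title": "Kyoto", "pageid": 27845, "revid": 3},
            {"type": "edit", "title": "Kyoto", "pageid": 27845, "revid": 2},
            {"type": "new", "title": "Example", "pageid": 1234, "revid": 1},
        ]
    }
}

to_refresh = changed_page_ids(sample_response)
```

Several edits to the same article collapse to a single page ID, so each changed article is fetched and parsed once per cycle.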
When Wikivoyage listings include a Wikidata Q-identifier, we resolve it to append external metadata, alternative language names, and official website URLs that might be missing from the local text.
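The enrichment step can be pictured as filling gaps in a listing from a Wikidata entity document. The sketch below works on a trimmed, made-up entity in the JSON shape returned by Wikidata's `wbgetentities` module, where P856 is the "official website" property. The function names, the placeholder URL, and the choice of languages are assumptions.

```python
from typing import Dict, Optional, Sequence


def official_website(entity: Dict) -> Optional[str]:
    """Return the first 'official website' (P856) value, if present."""
    for claim in entity.get("claims", {}).get("P856", []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, str):
            return value
    return None


def enrich(listing: Dict, entity: Dict, languages: Sequence[str] = ("ja", "de")) -> Dict:
    """Copy the listing, filling a missing URL and adding alternative names."""
    enriched = dict(listing)
    if not enriched.get("url"):
        site = official_website(entity)
        if site:
            enriched["url"] = site
    labels = entity.get("labels", {})
    enriched["alt_names"] = {
        lang: labels[lang]["value"] for lang in languages if lang in labels
    }
    return enriched


# Made-up entity payload, trimmed to the keys used above.
sample_entity = {
    "labels": {
        "en": {"language": "en", "value": "Kinkaku-ji"},
        "ja": {"language": "ja", "value": "金閣寺"},
    },
    "claims": {
        "P856": [
            {"mainsnak": {"datavalue": {"value": "https://example.org/kinkakuji",
                                        "type": "string"}}}
        ]
    },
}

listing = {"name": "Kinkaku-ji", "url": ""}
enriched = enrich(listing, sample_entity)
```

Values already present in the local wikitext are left untouched; Wikidata only fills fields that are empty.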
Mobile applications use the destination hierarchies and POI listings to populate offline maps and city guides.
AI companies ingest the structured text to train travel-specific conversational agents and itinerary planners.
GIS platforms overlay Wikivoyage 'See' and 'Do' listings onto base maps to provide contextual tourist information.
Directory websites use the open-source listings to bootstrap local business databases across thousands of cities.
Corporate travel platforms monitor 'Stay safe' sections and regional warnings to alert employees of local hazards.
Language apps ingest the phrasebook data to build practical, travel-oriented translation dictionaries.
"Wikivoyage contains the world's most comprehensive open travel dataset, but extracting structured POIs from wikitext requires complex parsing logic."
Most teams underestimate the difficulty of parsing MediaWiki templates. Reliable Wikivoyage extraction requires handling inconsistent VCard formats, nested regional hierarchies, and multi-language entity resolution. DataFlirt absorbs that complexity so your engineers can focus on product development.
Everything supported by our wikivoyage.org scraper: wikitext template parsing, coordinate normalisation, hierarchy resolution, incremental updates, and beyond.
Open-source tooling on proven cloud infra — no vendor lock-in, full observability.
We utilise specialised AST parsers in Python to safely decompose complex wikitext, extracting nested templates and ignoring formatting markup.
PostGIS extensions in our PostgreSQL clusters validate extracted coordinates against known administrative boundaries to filter out anomalous data points.
Pipelines run on Kubernetes. Airflow handles scheduling, dependency management, and SLA alerting. All state is stored in managed Postgres.
Data delivered to where your team already works — no new tooling required.
About wikivoyage.org scraping, legality, and pipeline operations.
Wikivoyage text is licensed under the Creative Commons Attribution-ShareAlike license (CC BY-SA 4.0 for current content; earlier contributions were published under CC BY-SA 3.0). You are free to use, adapt, and distribute the data, provided you attribute the source and share adaptations under the same license. DataFlirt extracts the data; clients are responsible for complying with the license terms in their applications.
Community editing results in frequent syntax errors. We rely primarily on Abstract Syntax Tree (AST) parsers rather than regex alone. When standard VCard templates fail to parse, our system falls back to heuristic text extraction to recover the POI data.
We can stream updates continuously by monitoring the MediaWiki RecentChanges API, ensuring your database reflects edits within minutes. Alternatively, we can provide full catalogue snapshots on a daily or weekly schedule.
Yes. We support all language editions, including German, French, Italian, and Spanish. We can also provide cross-language mapping, linking the English article for 'Rome' to the Italian 'Roma' to build unified multi-lingual records.
Yes. We parse the <mapframe> tags and external map links to extract GeoJSON geometries, point coordinates, and map marker definitions associated with the articles.
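A minimal sketch of the tag extraction, assuming the tag body holds inline GeoJSON. It uses a regular expression for brevity, does not handle self-closing tags or maps inserted through templates, and the helper names and sample wikitext are illustrative.

```python
import json
import re
from typing import Dict, List, Tuple

MAPFRAME = re.compile(r"<mapframe\b([^>]*)>(.*?)</mapframe>",
                      re.DOTALL | re.IGNORECASE)
ATTRIBUTE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def extract_mapframes(wikitext: str) -> List[Dict]:
    """Return each tag's attributes plus its parsed GeoJSON body (or None)."""
    frames = []
    for attrs, body in MAPFRAME.findall(wikitext):
        frame = {"attributes": dict(ATTRIBUTE.findall(attrs)), "geojson": None}
        body = body.strip()
        if body:
            try:
                frame["geojson"] = json.loads(body)
            except json.JSONDecodeError:
                pass  # Malformed JSON: keep the attributes, drop the geometry.
        frames.append(frame)
    return frames


def point_coordinates(geojson) -> List[Tuple[float, float]]:
    """Collect (latitude, longitude) pairs; GeoJSON itself stores [lon, lat]."""
    items = geojson if isinstance(geojson, list) else [geojson]
    points = []
    for item in items:
        if not isinstance(item, dict):
            continue
        if item.get("type") == "FeatureCollection":
            features = item.get("features", [])
        else:
            features = [item]
        for feature in features:
            geometry = feature.get("geometry") or {}
            if geometry.get("type") == "Point":
                lon, lat = geometry["coordinates"][:2]
                points.append((lat, lon))
    return points


sample_wikitext = (
    '<mapframe latitude="35.0116" longitude="135.7681" zoom="12" '
    'width="400" height="300">\n'
    '{"type": "Feature", "geometry": {"type": "Point", '
    '"coordinates": [135.7292, 35.0393]}, "properties": {"title": "Kinkaku-ji"}}\n'
    "</mapframe>"
)

frames = extract_mapframes(sample_wikitext)
markers = point_coordinates(frames[0]["geojson"])
```

The coordinate order is swapped on the way out because GeoJSON stores longitude first, while the delivered records use latitude and longitude fields.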
Yes. We can query the MediaWiki revisions API (page history) to extract previous versions of an article or track how a specific listing has changed over time.
20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off database dump of global POIs or a continuous feed of travel warnings and new listings, tell us what you need.