
Global travel data,
parsed and structured.

We extract destination guides, POI listings, geographic coordinates, and regional hierarchies from Wikivoyage. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.

Get data from wikivoyage.org → See how it works

Destinations parsed

134K /run

POI listings

1.2M /run

Edits tracked

45K /24h

Active pipelines

14

Uptime

99.99%

◆ Destination Hierarchies ◆ POI Listings ◆ Geo Coordinates ◆ Get In / Get Around ◆ Eat / Drink / Sleep ◆ Travel Warnings ◆ Phrasebooks ◆ Itineraries ◆ Multi-Language Support ◆ MediaWiki Parsing ◆ Managed Pipeline ◆ S3 / BigQuery Delivery

Data Dictionary

Every field we extract from wikivoyage.org

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Destinations objects from wikivoyage.org. All fields typed and schema-versioned.

page_id, title, url, continent, country, region, latitude, longitude, understand_text, climate_text, language_spoken, last_updated

"page_id": "27845",
"title": "Kyoto",
"country": "Japan",
"region": "Kansai",
"latitude": 35.0116,
"longitude": 135.7681,
"climate_text": "Kyoto has a humid subtropical climate...",
"last_updated": "2023-10-14T08:22:11Z"


Complete list of extractable fields for POI Listings objects from wikivoyage.org. All fields typed and schema-versioned.

listing_id, destination, category, name, alt_name, address, directions, phone, email, website, hours, price, latitude, longitude, description

"destination": "Kyoto",
"category": "See",
"name": "Kinkaku-ji",
"alt_name": "Golden Pavilion",
"address": "1 Kinkakujicho, Kita Ward",
"hours": "09:00-17:00",
"price": "¥400",
"latitude": 35.0393,
"longitude": 135.7292


Complete list of extractable fields for Regional Hierarchy objects from wikivoyage.org. All fields typed and schema-versioned.

entity_name, entity_type, parent_region, sub_regions, cities, other_destinations, geo_bounding_box, population, area_sq_km

"entity_name": "Kansai",
"entity_type": "Region",
"parent_region": "Japan",
"sub_regions": ["Kyoto Prefecture", "Osaka Prefecture", "Nara Prefecture"],
"cities": ["Kyoto", "Osaka", "Nara", "Kobe"],
"population": 22757897,
"area_sq_km": 33112.62


Complete list of extractable fields for Transport Logistics objects from wikivoyage.org. All fields typed and schema-versioned.

destination, transport_mode, provider_name, route_details, frequency, duration, price_estimate, booking_url, terminal_info, notes

"destination": "Kyoto",
"transport_mode": "Train",
"provider_name": "JR Tokaido Shinkansen",
"route_details": "Tokyo to Kyoto",
"duration": "2 hours 15 minutes",
"price_estimate": "¥13,080",
"terminal_info": "Kyoto Station",
"notes": "Covered by Japan Rail Pass"


Complete list of extractable fields for Phrasebooks objects from wikivoyage.org. All fields typed and schema-versioned.

language_name, language_family, regions_spoken, pronunciation_guide, phrase_category, english_phrase, translated_phrase, transliteration, cultural_notes

"language_name": "Japanese",
"regions_spoken": ["Japan"],
"phrase_category": "Basics",
"english_phrase": "Thank you",
"translated_phrase": "ありがとう",
"transliteration": "Arigatou",
"cultural_notes": "Use 'Arigatou gozaimasu' for politeness"


Capabilities

Extract structured intelligence from open travel data

Wikivoyage wikitext is notoriously difficult to parse. We handle the VCard templates, nested regional hierarchies, and coordinate extraction so you receive clean, normalised database records.

Hierarchical Mapping

Extract the exact parent-child relationships from continents down to specific districts and neighbourhoods.

VCard POI Extraction

Parse 'See', 'Do', 'Eat', 'Sleep', and 'Buy' listings into distinct fields: name, address, coordinates, hours, and price.

Geodata Normalisation

Convert inline wikitext coordinates and GeoJSON map data into standard latitude and longitude decimal values.

Phrasebook Parsing

Structure language guides into queryable dictionaries mapping English phrases to local scripts and transliterations.

Transport Logistics

Isolate 'Get in' and 'Get around' sections to extract airport details, train routes, and local transit advice.

Multi-Language Linking

Map articles across different language editions of Wikivoyage to build comprehensive global datasets.

Travel Warnings

Extract safety banners and 'Stay safe' sections to monitor regional instability and health advisories.

Itinerary Generation Data

Capture suggested routes, day trips, and multi-day itineraries with ordered waypoints.

Revision Tracking

Monitor recent changes via the MediaWiki API to only update records that have been modified since the last crawl.

// engagement pipeline

From wikitext to warehouse record

Brief in. Clean data out.

Define Scope

Day 0

Specify target regions, languages, and required data fields. We configure the parsing logic for specific VCard templates.

Pipeline Build

Days 2–4

We deploy MediaWiki API consumers and custom wikitext parsers to extract structured fields from unstructured prose.

Validation & QA

Days 4–6

Coordinate boundary checks, null-rate monitoring on addresses, and template fallback validation before full launch.

Delivery

ongoing

JSON, CSV, or Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on your required schedule.
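To illustrate the null-rate monitoring and coordinate checks named in the validation step, here is a minimal sketch. The record layout follows the POI sample fields shown earlier; the function names and the 5% threshold are assumptions made for this example, not DataFlirt's actual QA rules.

```python
def null_rate(records, field):
    """Share of records where `field` is missing, None, or an empty string."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field) in (None, ""))
    return missing / len(records)


def valid_coordinates(record):
    """Basic range check: latitude in [-90, 90], longitude in [-180, 180]."""
    lat, lon = record.get("latitude"), record.get("longitude")
    if lat is None or lon is None:
        return False
    return -90 <= lat <= 90 and -180 <= lon <= 180


def qa_report(records, max_null_rate=0.05):  # threshold is an assumed default
    """Summarise address null rate and out-of-range coordinates for a batch."""
    address_nulls = null_rate(records, "address")
    bad_coords = [r.get("name") for r in records if not valid_coordinates(r)]
    return {
        "address_null_rate": address_nulls,
        "bad_coordinates": bad_coords,
        "passed": address_nulls <= max_null_rate and not bad_coords,
    }
```

A range check like this catches swapped latitude and longitude values only when the swap pushes latitude outside ±90, which is why a regional bounding-box check is a useful second layer.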

Under the hood

How our pipeline handles Wikivoyage parsing

Extracting data from community-edited wikis means dealing with inconsistent formatting and broken templates. Here is how we maintain data quality.

// fingerprinting

Identity rotation

TLS fingerprint: randomised

User-agent: rotated

IP pool: residential

Challenges blocked: 0

// pagination

Page coverage

48,291 pages queued running

// observability

Pipeline health

99.9% uptime

142ms p99 latency

0.3% null rate

Template parsing

Handling inconsistent VCard usage

Editors frequently break wikitext templates or use deprecated formats. Our parsers use fallback regex patterns and AST traversal to extract listings even when standard tags are missing or malformed.
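A minimal sketch of the fallback idea, assuming each listing is written as a `{{see | name=... | price=...}}`-style template on a single line. It uses only the standard library as a stand-in for a full AST parser such as mwparserfromhell; the function names and both patterns are illustrative, not DataFlirt's production parser.

```python
import re

LISTING_TYPES = ("see", "do", "eat", "drink", "sleep", "buy", "listing")
_TYPES = "|".join(LISTING_TYPES)

# Strict pattern: a complete template with its closing braces and no nesting.
STRICT = re.compile(r"\{\{\s*(%s)\s*\|([^{}]*)\}\}" % _TYPES, re.IGNORECASE)
# Fallback pattern: the closing "}}" is missing, so read to the end of the line.
FALLBACK = re.compile(r"\{\{\s*(%s)\s*\|([^\n]*)" % _TYPES, re.IGNORECASE)


def parse_params(body):
    """Split 'key=value | key=value' pairs into a dict, skipping empty values."""
    params = {}
    for part in body.split("|"):
        if "=" in part:
            key, value = part.split("=", 1)
            value = value.strip().rstrip("}").strip()
            if value:
                params[key.strip().lower()] = value
    return params


def parse_listing(line):
    """Parse one listing line; try the strict pattern first, then the fallback."""
    match = STRICT.search(line) or FALLBACK.search(line)
    if match is None:
        return None
    record = parse_params(match.group(2))
    record["category"] = match.group(1).lower()
    return record
```

For example, `parse_listing("* {{eat | name=Nishiki Market | hours=09:00-18:00")` still yields a record even though the closing braces are missing. Nested templates and multi-line listings are outside what this sketch handles; that is where an AST parser earns its place.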

Coordinate resolution

Normalising spatial data

Coordinates on Wikivoyage appear in infoboxes, inline text, and Wikidata links. We aggregate these sources, convert DMS (degrees, minutes, seconds) to decimal degrees, and validate them against regional bounding boxes.
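The DMS conversion and bounding-box validation described here can be sketched as follows. This is a simplified illustration: the accepted DMS notation, the function names, and the example bounding box are assumptions, and boxes that cross the antimeridian are not handled.

```python
import re

# Matches forms such as 35°00′42″N or 35°00'42"N (degrees, minutes, seconds, hemisphere).
DMS = re.compile(r'''(\d+)\s*°\s*(\d+)\s*[′']\s*(\d+(?:\.\d+)?)\s*[″"]\s*([NSEW])''')


def dms_to_decimal(text):
    """Convert a DMS string to signed decimal degrees."""
    match = DMS.search(text)
    if match is None:
        raise ValueError(f"not a DMS coordinate: {text!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and west are negative in decimal notation.
    return -value if hemisphere in ("S", "W") else value


def in_bbox(lat, lon, bbox):
    """bbox is (south, west, north, east) in decimal degrees."""
    south, west, north, east = bbox
    return south <= lat <= north and west <= lon <= east
```

With a rough, illustrative box of `(33.0, 134.0, 36.0, 137.0)` around the Kansai region, the Kyoto sample coordinates pass, while the same pair with latitude and longitude swapped is rejected.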

Hierarchy extraction

Rebuilding the geographical tree

Each article declares the region it belongs to through an "is in" relationship: the 'IsPartOf' template on English Wikivoyage, historically named 'IsIn'. We traverse these links to build a strict parent-child relational database, resolving circular references and orphaned articles.
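One way to walk the parent links while catching circular references and orphans is sketched below. It assumes the links have already been extracted into a plain child-to-parent dictionary; the function name and the return shape are hypothetical.

```python
def resolve_ancestors(parent_of, root):
    """Walk child -> parent links up to `root`.

    Returns (paths, cyclic, orphaned):
      paths    maps each resolvable article to its ancestors, nearest first
      cyclic   articles whose chain loops back on itself
      orphaned articles whose chain ends before reaching the root
    """
    paths, cyclic, orphaned = {}, set(), set()
    for article in parent_of:
        chain, seen, node = [], {article}, article
        while node != root:
            parent = parent_of.get(node)
            if parent is None:      # chain stops short of the root
                orphaned.add(article)
                break
            if parent in seen:      # we have been here before: a cycle
                cyclic.add(article)
                break
            chain.append(parent)
            seen.add(parent)
            node = parent
        else:
            # The loop ended without a break, so the root was reached.
            paths[article] = chain
    return paths, cyclic, orphaned
```

Cyclic and orphaned articles are reported rather than silently dropped, so they can be queued for manual review or re-parented by a separate rule.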

Differential updates

Scraping via the RecentChanges API

Instead of re-parsing 30,000 articles daily, we monitor the MediaWiki RecentChanges feed. We only fetch and parse articles that have been modified, drastically reducing latency and compute overhead.
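A rough sketch of the differential approach, using the MediaWiki Action API's `list=recentchanges` module against English Wikivoyage. The helper names are hypothetical, the sketch only builds the request URL and reduces a response to the latest revision per title, and result continuation (`rccontinue`) is left out for brevity.

```python
from urllib.parse import urlencode

API = "https://en.wikivoyage.org/w/api.php"


def recent_changes_url(since_iso, limit=500):
    """Build a query for article edits and creations back to `since_iso`."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|ids|timestamp",
        "rctype": "edit|new",
        "rcnamespace": 0,      # main (article) namespace only
        "rcend": since_iso,    # default direction is newest first, so this is the cutoff
        "rclimit": limit,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


def latest_revisions(payload):
    """Collapse a recentchanges response to {title: highest revid seen}."""
    latest = {}
    for change in payload.get("query", {}).get("recentchanges", []):
        title, revid = change["title"], change["revid"]
        if revid > latest.get(title, 0):
            latest[title] = revid
    return latest
```

Only the titles returned by `latest_revisions` need to be fetched and re-parsed; comparing each revision ID with the one stored from the previous crawl skips articles that are already up to date.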

Wikidata integration

Enriching POIs with external IDs

When Wikivoyage listings include a Wikidata Q-identifier, we resolve it to append external metadata, alternative language names, and official website URLs that might be missing from the local text.
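The enrichment rule implied here, filling gaps without overwriting what editors wrote locally, can be sketched as below. It assumes the Wikidata lookup has already been reduced to a flat dictionary of facts; the function names and that dictionary's shape are hypothetical.

```python
import re

# A Q-identifier is the letter Q followed by a positive integer.
QID = re.compile(r"Q[1-9]\d*")


def is_qid(value):
    """True when `value` looks like a Wikidata item identifier."""
    return isinstance(value, str) and QID.fullmatch(value) is not None


def enrich(listing, wikidata_facts):
    """Copy facts into the listing only where the local field is empty."""
    merged = dict(listing)
    for field, value in wikidata_facts.items():
        if merged.get(field) in (None, ""):
            merged[field] = value
    return merged
```

Preferring local values keeps traveller-oriented wording intact (for example a listing's own opening-hours note) while still recovering a missing website or alternative name.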

Applications

Who uses Wikivoyage data

Teams across industries use wikivoyage.org data to build competitive products and smarter operations.

Travel Aggregator Apps

Mobile applications use the destination hierarchies and POI listings to populate offline maps and city guides.

LLM Training & RAG

AI companies ingest the structured text to train travel-specific conversational agents and itinerary planners.

Map Data Enrichment

GIS platforms overlay Wikivoyage 'See' and 'Do' listings onto base maps to provide contextual tourist information.

Local SEO Seeding

Directory websites use the openly licensed listings to bootstrap local business databases across thousands of cities.

Risk Management

Corporate travel platforms monitor 'Stay safe' sections and regional warnings to alert employees of local hazards.

Translation Services

Language apps ingest the phrasebook data to build practical, travel-oriented translation dictionaries.

Why DataFlirt

"Wikivoyage contains the world's most comprehensive open travel dataset, but extracting structured POIs from wikitext requires complex parsing logic."

Most teams underestimate the difficulty of parsing MediaWiki templates. Reliable Wikivoyage extraction requires handling inconsistent VCard formats, nested regional hierarchies, and multi-language entity resolution. DataFlirt absorbs that complexity so your engineers can focus on product development.

Technical Spec

Wikivoyage scraper technical capabilities

Everything supported by our wikivoyage.org scraper, from MediaWiki API integration and template parsing to Wikidata resolution.

MediaWiki API integration

Native consumption of the Action API for bulk exports and revision tracking

Supported

VCard template parsing

Extraction of structured listings (See, Do, Eat, Sleep, Buy)

Supported

GeoJSON coordinate extraction

Parsing dynamic map data and inline coordinate templates

Supported

Multi-language mapping

Cross-referencing articles via interlanguage links

Supported

Revision history diffs

Change detection based on MediaWiki timestamp and revision IDs

Supported

Image metadata extraction

Capturing Wikimedia Commons URLs and attribution data

Supported

Wikidata Q-ID resolution

Mapping local articles to global Wikidata entities

Supported

User account settings

Private user preferences, watchlists, and email configurations

Partial

Private draft edits

Unpublished user sandboxes and deleted revisions

Partial

Admin IP logs

CheckUser data and private moderation logs

Partial

Infrastructure

Infrastructure powering the Wikivoyage pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus, MediaWiki API, mwparserfromhell

Wikitext Parsing Stack

We utilise specialised AST parsers in Python to safely decompose complex wikitext, extracting nested templates and ignoring formatting markup.

Geodata Infrastructure

PostGIS extensions in our PostgreSQL clusters validate extracted coordinates against known administrative boundaries to filter out anomalous data points.

Cloud-Native Orchestration

Pipelines run on Kubernetes. Airflow handles scheduling, dependency management, and SLA alerting. All state is stored in managed Postgres.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON

Nested structures preserving hierarchical region data

CSV

Flat file with typed columns for POI listings

XLS

Excel compatible format for editorial review

Parquet

Columnar format for BigQuery, Snowflake, Athena

AWS S3

Direct bucket delivery compatible with any data lake

Webhook

HTTP POST per record for real-time downstream processing

API

REST endpoints to query extracted destinations on demand

PostgreSQL

Direct database upserts with PostGIS coordinate types


// faq

Common questions.

About wikivoyage.org scraping, legality, and pipeline operations.

Ask us directly →

What is the licensing for Wikivoyage data?

Wikivoyage text is licensed under the Creative Commons Attribution-ShareAlike (CC BY-SA) license. Wikimedia projects moved from version 3.0 to version 4.0 in 2023, so check the footer of the edition you use for the version that applies. You are free to use, adapt, and distribute the data, provided you attribute the source and share adaptations under the same license. DataFlirt extracts the data; clients are responsible for complying with the license terms in their applications.

How do you handle broken wikitext templates?

Community editing results in frequent syntax errors. We use Abstract Syntax Tree (AST) parsers rather than simple regex. When standard VCard templates fail, our system falls back to heuristic text extraction to recover the POI data.

How frequently can the data be updated?

We can stream updates continuously by monitoring the MediaWiki RecentChanges API, ensuring your database reflects edits within minutes. Alternatively, we can provide full catalogue snapshots on a daily or weekly schedule.

Do you extract data from non-English Wikivoyage sites?

Yes. We support all language editions, including German, French, Italian, and Spanish. We can also provide cross-language mapping, linking the English article for 'Rome' to the Italian 'Roma' to build unified multi-lingual records.
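As a hedged illustration of the cross-language mapping, the sketch below uses the Action API's `prop=langlinks` module. The helper names are hypothetical, and the response shape assumed in `unified_titles` is the one returned with `formatversion=2`.

```python
from urllib.parse import urlencode


def langlinks_url(title, lang="en"):
    """Build a langlinks query against one language edition of Wikivoyage."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",
        "format": "json",
        "formatversion": 2,
    }
    return f"https://{lang}.wikivoyage.org/w/api.php?{urlencode(params)}"


def unified_titles(payload, source_lang="en"):
    """Return one {language code: article title} dict per page in the response."""
    records = []
    for page in payload.get("query", {}).get("pages", []):
        titles = {source_lang: page["title"]}
        for link in page.get("langlinks", []):
            titles[link["lang"]] = link["title"]
        records.append(titles)
    return records
```

Given a response in which the English page 'Rome' carries an Italian link to 'Roma', `unified_titles` returns `[{"en": "Rome", "it": "Roma"}]`, which can serve as the key for a unified multi-lingual record.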

Can you extract the map data (GeoJSON)?

Yes. We parse the <mapframe> tags and external map links to extract GeoJSON geometries, point coordinates, and map marker definitions associated with the articles.
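A minimal sketch of pulling point coordinates out of `<mapframe>` bodies, assuming the tag wraps a GeoJSON Feature, a FeatureCollection, or a list of either. The function name is hypothetical and a regular expression stands in for a proper wikitext parser; note that GeoJSON stores coordinates as longitude first, then latitude.

```python
import json
import re

# Match <mapframe ...>body</mapframe>, skipping self-closing <mapframe ... /> tags.
MAPFRAME = re.compile(
    r"<mapframe\b[^>]*(?<!/)>(.*?)</mapframe>", re.DOTALL | re.IGNORECASE
)


def extract_points(wikitext):
    """Return (latitude, longitude) tuples for every Point geometry found."""
    points = []
    for body in MAPFRAME.findall(wikitext):
        try:
            data = json.loads(body)
        except json.JSONDecodeError:
            continue  # malformed or empty body: skip rather than fail the article
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            # A FeatureCollection wraps its features; a bare Feature stands alone.
            for feature in item.get("features", [item]):
                geometry = feature.get("geometry") or {}
                if geometry.get("type") == "Point":
                    lon, lat = geometry["coordinates"][:2]
                    points.append((lat, lon))
    return points
```

Lines, polygons, and external data references inside a map frame are ignored by this sketch; only inline Point features are returned.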

Is it possible to get historical revisions?

Yes. We can target the MediaWiki history endpoints to extract previous versions of an article or track how a specific listing has changed over time.

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off database dump of global POIs or a continuous feed of travel warnings and new listings, tell us what you need.

Start a wikivoyage.org pipeline → View pricing

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h

