We extract destination guides, POI listings, itineraries, and transport details from Wikitravel. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.
Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.
Complete list of extractable fields for Destination Guides objects from wikitravel.org. All fields typed and schema-versioned.
"destination_id": "WT-3921", "title": "Kyoto", "hierarchy": "['Asia', 'Japan', 'Kansai', 'Kyoto']", "coordinates": "35.0116, 135.7681", "introduction": "Kyoto is the former capital of Japan...", "last_updated": "2023-10-14T08:12:00Z"
| # | destination_id | title | hierarchy | coordinates | introduction | get_in |
|---|---|---|---|---|---|---|
Complete list of extractable fields for POI Listings (See & Do) objects from wikitravel.org. All fields typed and schema-versioned.
"poi_id": "POI-8492", "category": "See", "name": "Kinkaku-ji", "address": "1 Kinkakujicho, Kita Ward, Kyoto", "hours": "09:00-17:00", "price": "¥400", "coordinates": "35.0393, 135.7292"
| # | poi_id | destination_id | category | name | alternate_name | address |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Accommodation (Sleep) objects from wikitravel.org. All fields typed and schema-versioned.
"listing_id": "SLP-1102", "type": "Ryokan", "name": "Tawaraya", "address": "Fuyacho-dori, Nakagyo-ku", "check_in": "15:00", "price_range": "¥40,000+", "coordinates": "35.0111, 135.7649"
| # | listing_id | destination_id | name | type | address | phone |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Dining (Eat & Drink) objects from wikitravel.org. All fields typed and schema-versioned.
"listing_id": "EAT-9932", "category": "Eat", "name": "Nishiki Market", "cuisine": "Street Food", "hours": "09:00-18:00", "price_range": "¥500-¥2000", "description": "Historic marketplace known as Kyoto's Kitchen."
| # | listing_id | destination_id | category | name | cuisine | address |
|---|---|---|---|---|---|---|
Complete list of extractable fields for Transport & Logistics objects from wikitravel.org. All fields typed and schema-versioned.
"destination_id": "WT-3921", "transit_type": "Train", "operator": "JR Central", "routes": "['Tokyo to Kyoto']", "duration": "2h 15m", "cost": "¥13,080"
| # | destination_id | transit_type | operator | routes | frequency | duration |
|---|---|---|---|---|---|---|
Our Wikitravel scraper handles the complexities of MediaWiki parsing: normalising vCard templates, extracting geocoordinates, and mapping deep geographical hierarchies.
Convert unstructured wiki markup into strict JSON. We remove formatting artifacts and output clean text.
Extract specific fields from standard Wikitravel listing templates, handling malformed user inputs gracefully.
Capture lat/long coordinates for destinations and individual POIs, standardising them as GeoJSON points.
Maintain continent-to-city parent-child relationships using breadcrumb links and category tags.
Extract guides across English, French, German, and 18 other language editions from respective subdomains.
Monitor specific pages for edits, capturing diffs and update timestamps from the MediaWiki API.
Extract image URLs, captions, and attribution data linked within articles.
Isolate 'Get in', 'See', 'Do', or 'Sleep' sections without pulling irrelevant surrounding text.
Extract structured translation tables and pronunciation guides for NLP training corpora.
Only push updates when a destination guide is modified, reducing redundant downstream processing.
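Section-level targeting of the kind listed above can be sketched as follows. This is a minimal illustration, not the production parser: it assumes the raw wikitext uses standard MediaWiki level-2 headings (`==See==`), and the function name `split_sections` is hypothetical.

```python
import re

# Level-2 headings only: "==See==" matches, "===Temples===" does not.
HEADING_RE = re.compile(r"^==[ \t]*([^=].*?)[ \t]*==[ \t]*$", re.MULTILINE)

def split_sections(wikitext: str) -> dict:
    """Map each level-2 heading to the text that follows it."""
    sections = {}
    matches = list(HEADING_RE.finditer(wikitext))
    for i, match in enumerate(matches):
        # A section ends where the next level-2 heading starts.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
        sections[match.group(1)] = wikitext[match.end():end].strip()
    return sections

sample = (
    "Kyoto is the former capital of Japan...\n"
    "==Get in==\nBy train.\n"
    "==See==\n* Kinkaku-ji\n===Temples===\n* Ryoan-ji\n"
    "==Sleep==\nTawaraya\n"
)
see_only = split_sections(sample)["See"]
```

Text before the first heading (the introduction) is deliberately left out of the result, and sub-headings stay inside their parent section.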
Brief in. Clean data out.
Provide target regions, languages, or specific categories. We design the extraction schema together.
We configure Scrapy crawlers, wikitext parsers, and hierarchy mappers for wikitravel.org.
Schema validation, null-rate checks, and POI coordinate verification before full launch.
JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.
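As a rough illustration of the null-rate and coordinate checks mentioned in the steps above, the sketch below works on already-extracted records. It assumes coordinates arrive as a `"lat, long"` string, as in the sample records on this page; the function names and any thresholds you apply to the results are hypothetical.

```python
def null_rates(records, fields):
    """Share of records where each field is missing or empty."""
    total = len(records)
    rates = {}
    for field in fields:
        missing = sum(1 for r in records if r.get(field) in (None, ""))
        rates[field] = missing / total if total else 0.0
    return rates

def invalid_coordinates(records, id_field="poi_id"):
    """Return the IDs of records whose coordinates are absent or out of range."""
    bad = []
    for record in records:
        raw = record.get("coordinates")
        try:
            lat_text, lon_text = raw.split(",")
            lat, lon = float(lat_text), float(lon_text)
        except (AttributeError, ValueError):
            # None, wrong number of parts, or non-numeric text.
            bad.append(record.get(id_field))
            continue
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            bad.append(record.get(id_field))
    return bad
```

A pilot run could compare `null_rates(...)` against per-field limits agreed during scoping and hold the launch if any field exceeds its limit.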
MediaWiki sites present unique parsing challenges. Here is how we convert free-text travel guides into strict relational data.
We use DOM traversal and custom parsers to convert raw wiki markup and nested tables into clean JSON, stripping out irrelevant formatting tags.
Wikitravel relies on vCard templates for POIs. We handle malformed or incomplete listing templates without breaking the schema, applying fallback heuristics.
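One way such a fallback could look is sketched below. It assumes a listing written in template form (`{{see | name=... | lat=... }}`); the exact template and parameter names on Wikitravel may differ, and `parse_listing`, `FIELDS`, and the phone pattern are illustrative choices rather than the production heuristics.

```python
import re

# Loose phone pattern, applied only to unlabelled text as a fallback (illustrative).
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{6,}\d")
FIELDS = ("name", "address", "phone", "lat", "long")

def parse_listing(wikitext: str) -> dict:
    """Parse one listing template; every schema field is always present."""
    record = {"type": None, **{field: None for field in FIELDS}}
    body = wikitext.strip()
    # Tolerate missing opening or closing braces.
    body = body[2:] if body.startswith("{{") else body
    body = body[:-2] if body.endswith("}}") else body
    parts = [p.strip() for p in body.split("|")]
    record["type"] = parts[0].lower() or None
    leftovers = []
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        key, value = key.strip().lower(), value.strip()
        if sep and key in FIELDS:
            record[key] = value or None
        else:
            leftovers.append(part)  # unlabelled or unknown text
    if record["phone"] is None:
        match = PHONE_RE.search(" ".join(leftovers))
        if match:
            record["phone"] = match.group(0)
    return record

# Missing closing braces and an unlabelled phone number.
broken = "{{see | name=Kinkaku-ji | lat=35.0393 | long=135.7292 | Call +81 75-461-0013"
listing = parse_listing(broken)
```

Splitting on `|` is a simplification: a value containing a piped wiki link such as `[[Kyoto|city]]` would need a real wikitext parser.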
We reconstruct the geographical tree from breadcrumb links and category tags, ensuring every city is correctly mapped to its parent region and country.
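The tree reconstruction can be pictured with a small sketch. It assumes each page yields an ordered breadcrumb trail like the `hierarchy` field in the sample record; `build_parent_map` and `ancestors` are hypothetical helper names, and conflicting trails are simply collected for review here rather than resolved.

```python
def build_parent_map(trails):
    """Turn breadcrumb trails into a child -> parent map, plus any conflicts."""
    parents, conflicts = {}, []
    for trail in trails:
        if not trail:
            continue
        parents.setdefault(trail[0], None)  # first crumb is a root unless known otherwise
        for parent, child in zip(trail, trail[1:]):
            known = parents.get(child)
            if known is None:
                parents[child] = parent
            elif known != parent:
                conflicts.append((child, known, parent))
    return parents, conflicts

def ancestors(parents, node):
    """Walk up from a node and return its ancestors, outermost first."""
    chain, seen = [], {node}
    current = parents.get(node)
    while current is not None and current not in seen:  # guard against cycles
        chain.append(current)
        seen.add(current)
        current = parents.get(current)
    return chain[::-1]

parents, conflicts = build_parent_map([
    ["Asia", "Japan", "Kansai", "Kyoto"],
    ["Asia", "Japan", "Kanto", "Tokyo"],
])
```

With the two trails above, `ancestors(parents, "Kyoto")` walks Kyoto, Kansai, Japan, Asia and returns the path from continent down to region.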
User-submitted lat/long formats vary wildly. We normalise these variations into GeoJSON points for immediate mapping use.
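A minimal sketch of this kind of normalisation is shown below, assuming only two input styles: decimal pairs (`35.0116, 135.7681`) and degrees-minutes-seconds with hemisphere letters. Real pages will contain more variants, and the function name `to_geojson_point` is hypothetical. Note that GeoJSON stores positions as `[longitude, latitude]`, the reverse of the usual written order.

```python
import re

DEC_RE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*[,;\s]\s*(-?\d+(?:\.\d+)?)\s*$")
DMS_RE = re.compile(
    r'''(\d+(?:\.\d+)?)\s*°\s*(?:(\d+(?:\.\d+)?)\s*['′]\s*)?(?:(\d+(?:\.\d+)?)\s*["″]\s*)?([NSEW])''',
    re.IGNORECASE,
)

def to_geojson_point(raw: str):
    """Return a GeoJSON Point dict, or None if the text cannot be parsed."""
    match = DEC_RE.match(raw)
    if match:
        lat, lon = float(match.group(1)), float(match.group(2))
    else:
        parts = DMS_RE.findall(raw)
        if len(parts) != 2:
            return None
        values = {}
        for deg, minutes, seconds, hemi in parts:
            value = float(deg) + float(minutes or 0) / 60 + float(seconds or 0) / 3600
            hemi = hemi.upper()
            if hemi in "SW":
                value = -value  # south and west are negative
            values["lat" if hemi in "NS" else "lon"] = value
        if set(values) != {"lat", "lon"}:
            return None
        lat, lon = values["lat"], values["lon"]
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return {"type": "Point", "coordinates": [lon, lat]}

point = to_geojson_point("35.0116, 135.7681")
```

Returning `None` for unparseable input keeps bad coordinates out of the output while leaving the rest of the record intact.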
Using MediaWiki revision IDs, we only process updated articles. This saves compute and provides you with a clean stream of actual content changes.
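The revision-ID comparison could be sketched roughly as follows. The query parameters (`action=query`, `prop=revisions`, `rvprop=ids|timestamp`) are standard MediaWiki Action API parameters, but `API_URL` is a placeholder, not a confirmed Wikitravel endpoint, and the helper names are hypothetical. The response shape assumed is the API's default JSON format, where `pages` is keyed by page ID.

```python
from urllib.parse import urlencode

API_URL = "https://example.org/w/api.php"  # placeholder endpoint (assumption)

def build_revision_query(titles):
    """Build a URL asking for the latest revision ID of each title."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "ids|timestamp",
        "titles": "|".join(titles),
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"

def changed_pages(api_response, last_seen):
    """Return {title: revid} for pages whose latest revision is new to us."""
    changed = {}
    pages = api_response.get("query", {}).get("pages", {})
    for page in pages.values():
        revisions = page.get("revisions") or []
        if not revisions:
            continue  # missing or deleted page
        revid = revisions[0]["revid"]
        if last_seen.get(page["title"]) != revid:
            changed[page["title"]] = revid
    return changed
```

Only the titles returned by `changed_pages` would then be re-fetched and re-parsed; everything else is skipped for that run.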
Populate new travel applications with baseline destination data, POIs, and introductory text.
Feed structured travel guides into RAG pipelines to ground AI travel assistants in accurate destination data.
Overlay Wikitravel POI coordinates onto custom maps for routing, discovery, and spatial analysis.
Use 'See' and 'Do' listings to train automated trip planning algorithms and recommendation engines.
Compare Wikitravel listings against proprietary databases to identify missing POIs or outdated information.
Utilise Wikitravel phrasebooks as parallel corpora for NLP models and translation software.
"Wikitravel holds decades of crowdsourced travel intelligence, but its wiki markup makes it notoriously difficult to query programmatically."
Converting crowdsourced travel guides into production-ready databases requires complex wikitext parsing, template normalisation, and geospatial standardisation. DataFlirt handles the extraction and normalisation layers so your engineering team can focus on building travel products.
Everything supported by our wikitravel.org scraper — rendered SPA elements, auth walls, rate-limit evasion and beyond.
Open-source tooling on proven cloud infra — no vendor lock-in, full observability.
Custom Scrapy middleware designed specifically to parse and normalise MediaWiki DOM structures and vCard templates into relational data.
Pipelines run on AWS Lambda and ECS. Airflow handles scheduling, dependency management, and SLA alerting. All state stored in managed Postgres.
Hash-based diffing ensures downstream systems only receive updates when a Wikitravel article is actually modified, reducing redundant processing.
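Hash-based diffing can be illustrated with a short sketch. It assumes records are plain dictionaries keyed by `destination_id` and that volatile fields such as `last_updated` should not trigger an update on their own; the function names and the choice of SHA-256 are illustrative, not a description of the production pipeline.

```python
import hashlib
import json

def content_hash(record, ignore=("last_updated",)):
    """Stable digest of a record, ignoring fields that change on every crawl."""
    stable = {k: v for k, v in record.items() if k not in ignore}
    # sort_keys makes the digest independent of key order.
    payload = json.dumps(stable, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def select_updates(records, known_hashes, key="destination_id"):
    """Return only new or modified records and remember their digests."""
    updates = []
    for record in records:
        digest = content_hash(record)
        if known_hashes.get(record[key]) != digest:
            updates.append(record)
            known_hashes[record[key]] = digest
    return updates
```

In practice `known_hashes` would live in persistent storage between runs so that an unchanged article produces no downstream write.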
Data delivered to where your team already works — no new tooling required.
About wikitravel.org scraping, legality, and pipeline operations.
Wikitravel content is generally available under Creative Commons licenses. Scraping public domain and openly licensed content is permissible. We adhere strictly to rate limits to ensure we do not degrade the site's performance.
User-generated content often breaks standard templates. We use fallback regex patterns and NLP heuristics to extract addresses, phone numbers, and coordinates even when the vCard markup is broken.
Yes. We support extraction across all active Wikitravel language subdomains, maintaining the schema structure regardless of the source language.
Yes. We extract and normalise geolocation data for both overarching destinations and individual POI listings.
We can configure daily, weekly, or monthly cadences. For high-frequency needs, we monitor the MediaWiki recent changes feed to trigger targeted updates.
We extract image URLs, captions, and attribution metadata. We do not typically download and host the binary image files, though this can be configured for custom pipelines.
Our minimum engagement typically starts with a defined target list of regions or cities. Contact us with your specific coverage requirements for a precise quote.
20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a full global destination dump or continuous updates for specific regions, we scope, build, and operate the pipeline. Tell us what you need.