RUN · 42 active pipelines · wikitravel.org live

Wikitravel data,
structured for scale.

We extract destination guides, POI listings, itineraries, and transport details from Wikitravel. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.

Get data from wikitravel.org → See how it works

Destinations mapped: 92.4K /run

POIs extracted: 1.8M /run

Revision updates: 14.2K /24h

Active pipelines: 42

Uptime: 99.98%

◆ Destination Hierarchies ◆ POI Listings Extraction ◆ Eat & Drink Directories ◆ Accommodation Data ◆ Transport Guides ◆ Itinerary Mapping ◆ Phrasebook Corpora ◆ Geolocation Coordinates ◆ Multi-Language Support ◆ Wikitext Parsing ◆ Managed Pipeline ◆ S3 / BigQuery Delivery ◆ Bengaluru HQ

Data Dictionary

Every field we extract from wikitravel.org

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Destination Guides objects from wikitravel.org. All fields typed and schema-versioned.

destination_id, title, hierarchy, coordinates, introduction, get_in, get_around, climate, languages, respect, stay_safe, url, last_updated

"destination_id": "WT-3921",
"title": "Kyoto",
"hierarchy": "['Asia', 'Japan', 'Kansai', 'Kyoto']",
"coordinates": "35.0116, 135.7681",
"introduction": "Kyoto is the former capital of Japan...",
"last_updated": "2023-10-14T08:12:00Z"

Complete list of extractable fields for POI Listings (See & Do) objects from wikitravel.org. All fields typed and schema-versioned.

poi_id, destination_id, category, name, alternate_name, address, directions, phone, email, website, hours, price, coordinates, description

"poi_id": "POI-8492",
"category": "See",
"name": "Kinkaku-ji",
"address": "1 Kinkakujicho, Kita Ward, Kyoto",
"hours": "09:00-17:00",
"price": "¥400",
"coordinates": "35.0393, 135.7292"

Complete list of extractable fields for Accommodation (Sleep) objects from wikitravel.org. All fields typed and schema-versioned.

listing_id, destination_id, name, type, address, phone, email, website, check_in, check_out, price_range, description, coordinates

"listing_id": "SLP-1102",
"type": "Ryokan",
"name": "Tawaraya",
"address": "Fuyacho-dori, Nakagyo-ku",
"check_in": "15:00",
"price_range": "¥40,000+",
"coordinates": "35.0111, 135.7649"

Complete list of extractable fields for Dining (Eat & Drink) objects from wikitravel.org. All fields typed and schema-versioned.

listing_id, destination_id, category, name, cuisine, address, phone, website, hours, price_range, description, coordinates, alcohol_served

"listing_id": "EAT-9932",
"category": "Eat",
"name": "Nishiki Market",
"cuisine": "Street Food",
"hours": "09:00-18:00",
"price_range": "¥500-¥2000",
"description": "Historic marketplace known as Kyoto's Kitchen."

Complete list of extractable fields for Transport & Logistics objects from wikitravel.org. All fields typed and schema-versioned.

destination_id, transit_type, operator, routes, frequency, duration, cost, booking_url, terminal_info, notes

"destination_id": "WT-3921",
"transit_type": "Train",
"operator": "JR Central",
"routes": "['Tokyo to Kyoto']",
"duration": "2h 15m",
"cost": "¥13,080"

Capabilities

Everything you need from Wikitravel, structured

Our Wikitravel scraper handles the complexities of MediaWiki parsing: normalising vCard templates, extracting geocoordinates, and mapping deep geographical hierarchies.

Wikitext Normalisation

Clean unstructured wiki markup into strict JSON. We remove formatting artifacts and output clean text.

VCard Template Parsing

Extract specific fields from standard Wikitravel listing templates, handling malformed user inputs gracefully.

Geospatial Extraction

Capture lat/long coordinates for destinations and individual POIs, standardising them into GeoJSON formats.

Hierarchical Mapping

Maintain continent-to-city parent-child relationships using breadcrumb links and category tags.

Multi-Language Scraping

Extract guides across English, French, German, and 18 other language editions from respective subdomains.

Revision Tracking

Monitor specific pages for edits, capturing diffs and update timestamps from the MediaWiki API.

Media & Image Metadata

Extract image URLs, captions, and attribution data linked within articles.
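As a rough illustration of the metadata step, the sketch below pulls file names and captions out of standard MediaWiki image links such as `[[Image:Name.jpg|thumb|Caption]]`. The function name `extract_images` and the `_LAYOUT` keyword set are assumptions made for this example; resolving a file name to a hosted URL and fetching attribution details are left out.

```python
import re

# Matches [[Image:...]] / [[File:...]] links without nested brackets in the caption.
_IMAGE = re.compile(r"\[\[(?:Image|File):([^|\]]+)(?:\|([^\]]*))?\]\]", re.IGNORECASE)

# Common layout keywords that should not be mistaken for a caption (illustrative subset).
_LAYOUT = {"thumb", "thumbnail", "frame", "frameless", "left", "right", "center", "none"}


def extract_images(wikitext):
    """Return a list of {'file', 'caption'} dicts for each image link found."""
    images = []
    for filename, options in _IMAGE.findall(wikitext):
        parts = [p.strip() for p in options.split("|") if p.strip()]
        last = parts[-1] if parts else ""
        # By convention the caption is the last parameter, unless it is a layout hint or a size.
        is_caption = bool(last) and last.lower() not in _LAYOUT and not last.lower().endswith("px")
        images.append({"file": filename.strip(), "caption": last if is_caption else None})
    return images
```

Captions that contain nested wiki links would need a real parser rather than a single regular expression.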

Section-Specific Targeting

Isolate 'Get in', 'See', 'Do', or 'Sleep' sections without pulling irrelevant surrounding text.
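One way to picture section targeting is to split raw wikitext on its level-2 headings (`==Heading==`) and keep only the requested ones. This is a minimal sketch under that assumption; `split_sections` and `extract_sections` are hypothetical helper names, not part of any library.

```python
import re

# Level-2 headings only: '==See==' matches, '===Temples===' does not.
_HEADING = re.compile(r"^==[ \t]*([^=\n].*?)[ \t]*==[ \t]*$", re.MULTILINE)


def split_sections(wikitext):
    """Map each level-2 heading to the text that follows it, up to the next level-2 heading."""
    matches = list(_HEADING.finditer(wikitext))
    sections = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
        sections[match.group(1)] = wikitext[match.end():end].strip()
    return sections


def extract_sections(wikitext, wanted):
    """Keep only the named sections; names that are absent are skipped silently."""
    sections = split_sections(wikitext)
    return {name: sections[name] for name in wanted if name in sections}
```

Sub-headings stay inside their parent section, so a request for 'See' also returns any nested sub-sections beneath it.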

Phrasebook Mining

Extract structured translation tables and pronunciation guides for NLP training corpora.

Continuous Diffing

Only push updates when a destination guide is modified, reducing redundant downstream processing.

// engagement pipeline

From wiki markup to warehouse record

Brief in. Clean data out.

Define Scope (day 0)

Provide target regions, languages, or specific categories. We design the extraction schema together.

Pipeline Build (days 2–4)

We configure Scrapy crawlers, wikitext parsers, and hierarchy mappers for wikitravel.org.

Validation & QA (days 4–6)

Schema validation, null-rate checks, and POI coordinate verification before full launch.
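A null-rate check of the kind mentioned above can be sketched in a few lines. The helper names and the 25% threshold in the usage note are illustrative assumptions, not agreed QA limits.

```python
def null_rates(records, fields):
    """Return the share of records in which each field is missing, None, or empty."""
    if not records:
        raise ValueError("cannot compute null rates for an empty batch")
    total = len(records)
    return {
        field: sum(1 for r in records if r.get(field) in (None, "")) / total
        for field in fields
    }


def failing_fields(rates, max_null_rate):
    """List the fields whose null rate exceeds the allowed maximum."""
    return sorted(field for field, rate in rates.items() if rate > max_null_rate)
```

For example, `failing_fields(null_rates(batch, ["name", "hours"]), 0.25)` would flag any field that is empty in more than a quarter of the batch.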

Delivery (ongoing)

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.

Under the hood

How our Wikitravel pipeline handles unstructured wikis

MediaWiki sites present unique parsing challenges. Here is how we convert free-text travel guides into strict relational data.

// fingerprinting

Identity rotation

TLS fingerprint: randomised

User-agent: rotated

IP pool: residential

Challenges blocked: 0

// pagination

Page coverage

48,291 pages queued, running

// observability

Pipeline health

Uptime: 99.9%

p99 latency: 142ms

Null rate: 0.3%

Unstructured Wikitext Parsing

We use DOM traversal and custom parsers to convert raw wiki markup and nested tables into clean JSON, stripping out irrelevant formatting tags.
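To make the idea concrete, here is a minimal sketch that reduces a small subset of wiki markup to plain text. It uses regular expressions as a stand-in for a full parser; the function name `strip_wikitext` and the choice of which markup to handle are assumptions for illustration only.

```python
import re


def strip_wikitext(text: str) -> str:
    """Reduce a small subset of wiki markup to plain text."""
    # [[Target|label]] -> label, [[Target]] -> Target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]+)\]\]", r"\1", text)
    # [http://example.org label] -> label
    text = re.sub(r"\[https?://\S+\s+([^\]]+)\]", r"\1", text)
    # '''bold''' and ''italic'' -> bare text
    text = re.sub(r"'{2,}", "", text)
    # Drop inline HTML tags such as <br> or <small>
    text = re.sub(r"<[^>]+>", "", text)
    # Collapse runs of whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()
```

Nested templates and tables are deliberately out of scope here; those are the cases where a dedicated parser earns its keep.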

VCard Template Extraction

Wikitravel relies on vCard templates for POIs. We handle malformed or incomplete listing templates without breaking the schema, applying fallback heuristics.
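The schema-preserving behaviour can be sketched as follows, assuming a listing written as `{{see | name=... | address=... }}`. The exact template syntax on the live site may differ, and `parse_listing` and `POI_FIELDS` are hypothetical names chosen for this example.

```python
import re

# Hypothetical target schema; field names follow the POI data dictionary.
POI_FIELDS = ("name", "address", "phone", "hours", "price", "lat", "long")


def parse_listing(template: str) -> dict:
    """Parse '{{type | key=value | ...}}' into a fixed-schema record."""
    # Strip the braces if present; a missing closing '}}' is tolerated.
    inner = re.sub(r"^\{\{|\}\}$", "", template.strip())
    parts = [p.strip() for p in inner.split("|")]
    # Every schema field is always present, defaulting to None.
    record = {field: None for field in POI_FIELDS}
    record["category"] = parts[0].capitalize() if parts and parts[0] else None
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        key, value = key.strip().lower(), value.strip()
        # Ignore positional junk, unknown keys, and empty values rather than failing.
        if sep and key in record and value:
            record[key] = value
    return record
```

Because the record starts from the full field list, an incomplete or partly broken template still yields the same keys, with `None` wherever a value could not be read.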

Hierarchy Resolution

We reconstruct the geographical tree from breadcrumb links and category tags, ensuring every city is correctly mapped to its parent region and country.
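A simplified view of tree reconstruction: treat each breadcrumb trail as a root-to-leaf path and record every adjacent pair as a parent-child edge. The helpers `build_parent_map` and `ancestors` are illustrative names, and the sketch assumes place names are unique within the tree.

```python
def build_parent_map(breadcrumbs):
    """Map each place to its parent, given root-to-leaf breadcrumb trails."""
    parents = {}
    for trail in breadcrumbs:
        for parent, child in zip(trail, trail[1:]):
            known = parents.setdefault(child, parent)
            if known != parent:
                raise ValueError(f"{child!r} has conflicting parents: {known!r} vs {parent!r}")
    return parents


def ancestors(place, parents):
    """Return the chain from the root down to `place`."""
    chain = [place]
    while chain[-1] in parents:
        if len(chain) > len(parents):  # guard against accidental cycles
            raise ValueError("cycle detected in hierarchy")
        chain.append(parents[chain[-1]])
    return chain[::-1]
```

Raising on conflicting parents surfaces pages whose breadcrumbs disagree, which is usually a sign that a region was moved or mislabelled.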

Coordinate Normalisation

User-submitted lat/long formats vary wildly, from decimal pairs to degrees-minutes-seconds notation. We normalise these variations into GeoJSON points for immediate mapping use.
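As an illustration, the sketch below accepts either a decimal pair or degrees-minutes-seconds notation and returns a GeoJSON Point, which stores coordinates in longitude, latitude order. The function name `to_geojson_point` is hypothetical, and the sketch assumes latitude is written first in the input.

```python
import re

_NUM = r"(\d+(?:\.\d+)?)"
# degrees, optional minutes, optional seconds, then a hemisphere letter
_DMS = re.compile(
    _NUM + r"\s*°\s*"
    + r"(?:" + _NUM + r"\s*['′]\s*)?"
    + r"(?:" + _NUM + r'\s*["″]\s*)?'
    + r"([NSEW])",
    re.IGNORECASE,
)


def _dms_to_decimal(deg, minutes, seconds, hemi):
    value = float(deg) + float(minutes or 0) / 60 + float(seconds or 0) / 3600
    # Southern and western hemispheres are negative.
    return -value if hemi.upper() in "SW" else value


def to_geojson_point(raw: str) -> dict:
    """Parse 'lat, lon' in decimal or DMS form into a GeoJSON Point."""
    dms = _DMS.findall(raw)
    if len(dms) == 2:
        lat, lon = (_dms_to_decimal(*match) for match in dms)
    else:
        numbers = re.findall(r"-?\d+(?:\.\d+)?", raw)
        if len(numbers) != 2:
            raise ValueError(f"unrecognised coordinate format: {raw!r}")
        lat, lon = map(float, numbers)
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"coordinates out of range: {raw!r}")
    # GeoJSON positions are ordered [longitude, latitude].
    return {"type": "Point", "coordinates": [lon, lat]}
```

Rejecting out-of-range values catches the common case where a contributor swapped latitude and longitude.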

Revision-based Change Detection

Using MediaWiki revision IDs, we only process updated articles. This saves compute and provides you with a clean stream of actual content changes.
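The comparison itself is simple. A minimal sketch, assuming two mappings from page title to revision ID (one freshly fetched, one stored from the previous run); `stale_titles` is a hypothetical helper name.

```python
def stale_titles(latest_revisions, processed_revisions):
    """Return titles whose latest revision ID differs from the one last processed."""
    return sorted(
        title
        for title, rev_id in latest_revisions.items()
        if processed_revisions.get(title) != rev_id
    )
```

Pages that have never been processed are treated as stale, so new articles flow through the same path as edited ones.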

Applications

Who uses Wikitravel data and how

Teams across industries use wikitravel.org data to build competitive products and smarter operations.

Travel App Content Seeding

Populate new travel applications with baseline destination data, POIs, and introductory text.

LLM Context Grounding

Feed structured travel guides into RAG pipelines to ground AI travel assistants in accurate destination data.

Geospatial Mapping

Overlay Wikitravel POI coordinates onto custom maps for routing, discovery, and spatial analysis.

Itinerary Generation

Use 'See' and 'Do' listings to train automated trip planning algorithms and recommendation engines.

Competitor POI Analysis

Compare Wikitravel listings against proprietary databases to identify missing POIs or outdated information.

Language Translation Training

Utilise Wikitravel phrasebooks as parallel corpora for NLP models and translation software.

Why DataFlirt

"Wikitravel holds decades of crowdsourced travel intelligence, but its wiki markup makes it notoriously difficult to query programmatically."

Converting crowdsourced travel guides into production-ready databases requires complex wikitext parsing, template normalisation, and geospatial standardisation. DataFlirt handles the extraction and normalisation layers so your engineering team can focus on building travel products.

Technical Spec

Wikitravel scraper technical capabilities

Everything supported by our wikitravel.org scraper, from listing-template extraction to revision-based change detection, along with the areas where coverage is only partial.

Capability · Description · Status

VCard listing extraction · Parses standard POI templates into structured fields · Supported
Geocoordinate standardisation · Normalises varying lat/long inputs into GeoJSON · Supported
Breadcrumb hierarchy mapping · Links cities to regions, countries, and continents · Supported
Revision history tracking · Captures timestamp and diffs for updated articles · Supported
Multi-language editions · Supports en, fr, de, and other language subdomains · Supported
Section-targeted scraping · Isolates specific headers like 'Get in' or 'Sleep' · Supported
Media URL extraction · Captures image links and attribution metadata · Supported
Change detection (diffs) · Only emit records with changed fields since last run · Supported
User account credentials & private drafts · Access to unpublished or user-gated draft pages · Partial
Admin-level deleted revision history · Access to revisions purged by site administrators · Partial

Infrastructure

Infrastructure powering the Wikitravel pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy, Playwright, Python 3.12, Redis, PostgreSQL, Apache Airflow, AWS Lambda, S3, CloudWatch, 2Captcha, CapSolver, Residential Proxies, Docker, Kubernetes, Grafana, Prometheus

MediaWiki Parsing Engine

Custom Scrapy middleware designed specifically to parse and normalise MediaWiki DOM structures and VCard templates into relational data.

Cloud-Native Orchestration

Pipelines run on AWS Lambda and ECS. Airflow handles scheduling, dependency management, and SLA alerting. All state is stored in managed Postgres.

Change Data Capture

Hash-based diffing ensures downstream systems only receive updates when a Wikitravel article is actually modified, reducing redundant processing.
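Hash-based diffing can be sketched with the standard library alone: serialise each record canonically, hash it, and emit the record only when the hash differs from the one stored for that key. The function names, the SHA-256 choice, and the in-memory `previous_hashes` dict are assumptions for illustration; a real pipeline would persist the hashes.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Hash a record in a key-order-independent way."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_records(records, previous_hashes, key="destination_id"):
    """Return records that are new or modified, updating previous_hashes in place."""
    changed = []
    for record in records:
        digest = record_hash(record)
        if previous_hashes.get(record[key]) != digest:
            changed.append(record)
            previous_hashes[record[key]] = digest
    return changed
```

Sorting keys before hashing means a record whose fields merely arrive in a different order is not reported as changed.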

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON

Newline-delimited or nested array structures

CSV

Flat file with typed columns for simple ingestion

XLS

Excel compatible output for business analysts

Parquet

Columnar format for BigQuery, Snowflake, Athena

AWS S3

Direct bucket delivery on defined schedules

Webhook

HTTP POST per record for immediate downstream processing

API

REST endpoints to query extracted datasets

PostgreSQL

Direct database inserts with conflict resolution

Direct bucket delivery — compatible with any data lake
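To show the difference between the two JSON layouts listed above, here is a small sketch that serialises the same records as newline-delimited JSON and as a single nested document. The function names and the `destinations` wrapper key are illustrative assumptions.

```python
import json


def to_ndjson(records):
    """One JSON object per line, suited to streaming and append-only loads."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def to_nested(records, key="destinations"):
    """A single JSON document with the records nested under one key."""
    return json.dumps({key: list(records)}, ensure_ascii=False, indent=2)


def from_ndjson(text):
    """Parse newline-delimited JSON back into a list of records."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]
```

Newline-delimited output can be processed one line at a time, while the nested form has to be parsed as a whole.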

// faq

Common questions.

About wikitravel.org scraping, legality, and pipeline operations.

Ask us directly →

Is scraping Wikitravel legal?

Wikitravel content is generally available under Creative Commons Attribution-ShareAlike licenses, which permit reuse provided the licence terms, such as attribution, are respected. We adhere strictly to rate limits so that we do not degrade the site's performance.

How do you handle malformed listing templates?

User-generated content often breaks standard templates. We use fallback regex patterns and NLP heuristics to extract addresses, phone numbers, and coordinates even when the vCard markup is broken.

Can you extract data in languages other than English?

Yes. We support extraction across all active Wikitravel language subdomains, maintaining the schema structure regardless of the source language.

Do you provide lat/long coordinates?

Yes. We extract and normalise geolocation data for both overarching destinations and individual POI listings.

How frequently can you update the data?

We can configure daily, weekly, or monthly cadences. For high-frequency needs, we monitor the MediaWiki recent changes feed to trigger targeted updates.
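For illustration, a recent-changes poll against the MediaWiki Action API could be prepared like this. The query parameters (`list=recentchanges`, `rcprop`, `rcend`, `rclimit`) are standard MediaWiki ones, but the `API_URL` path is a placeholder assumption and the helper names are hypothetical. The sketch only builds the request URL and parses a response body; it performs no network calls.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real api.php path may differ per language edition.
API_URL = "https://wikitravel.org/wiki/en/api.php"


def recent_changes_url(since_iso, limit=50):
    """Build a recentchanges query for edits made after `since_iso`."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|ids|timestamp",
        # Results are listed newest first by default, so rcend is the older bound.
        "rcend": since_iso,
        "rclimit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


def changed_titles(payload):
    """Extract unique page titles from a recentchanges response body."""
    changes = payload.get("query", {}).get("recentchanges", [])
    return sorted({change["title"] for change in changes if "title" in change})
```

The resulting titles can then be fed to a targeted re-crawl instead of waiting for the next scheduled run.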

Can you extract images from the articles?

We extract image URLs, captions, and attribution metadata. We do not typically download and host the binary image files, though this can be configured for custom pipelines.

What is the minimum viable engagement?

Our minimum engagement typically starts with a defined target list of regions or cities. Contact us with your specific coverage requirements for a precise quote.

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a full global destination dump or continuous updates for specific regions, we scope, build, and operate the pipeline. Tell us what you need.

Start a wikitravel.org pipeline → View pricing

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h

