
Italian B2B data,
at warehouse scale.

We extract company listings, contact details, VAT numbers, operating hours, and reviews from PagineGialle. Delivered as clean JSON, CSV, or Parquet to S3, BigQuery, or Snowflake on your cadence.

Businesses extracted: 1.2M /run
Contact updates: 340K /week
Reviews parsed: 85K /day
Active pipelines: 42
Uptime: 99.98%

Data Dictionary

Every field we extract from paginegialle.it

Structured, schema-consistent data across all major object types — delivered clean, typed, and ready to query.

Complete list of extractable fields for Business Listings objects from paginegialle.it. All fields typed and schema-versioned.

business_id, name, vat_number, primary_category, secondary_categories, description, phone_number, email, website, address_full, claim_status, rating_average, review_count, profile_url
business_listings
● 200 OK
{
  "business_id": "pg_8923145",
  "name": "Ristorante Da Mario",
  "vat_number": "IT01234567890",
  "primary_category": "Ristoranti",
  "phone_number": "+39 06 1234567",
  "email": "info@damarioroma.it",
  "rating_average": 4.5,
  "review_count": 128
}

Complete list of extractable fields for Operating Hours objects from paginegialle.it. All fields typed and schema-versioned.

business_id, monday_open, monday_close, tuesday_open, tuesday_close, wednesday_open, wednesday_close, thursday_open, thursday_close, friday_open, friday_close, saturday_open, saturday_close, sunday_open, sunday_close, is_open_now
operating_hours
● 200 OK
{
  "business_id": "pg_8923145",
  "monday_open": "12:00",
  "monday_close": "15:00",
  "tuesday_open": "12:00",
  "tuesday_close": "15:00",
  "saturday_open": "19:00",
  "saturday_close": "23:30",
  "is_open_now": false
}

Complete list of extractable fields for Reviews & Ratings objects from paginegialle.it. All fields typed and schema-versioned.

review_id, business_id, author_name, author_profile_url, rating, review_text, review_date, platform_source, helpful_votes, merchant_response, response_date
reviews_and_ratings
● 200 OK
{
  "review_id": "rev_459102",
  "business_id": "pg_8923145",
  "author_name": "Giuseppe R.",
  "rating": 5,
  "review_text": "Ottimo cibo e personale cortese. Consigliato.",
  "review_date": "2023-10-14",
  "platform_source": "PagineGialle",
  "helpful_votes": 12
}

Complete list of extractable fields for Location Data objects from paginegialle.it. All fields typed and schema-versioned.

business_id, latitude, longitude, street_address, city, province, region, zip_code, neighborhood, directions_url
location_data
● 200 OK
{
  "business_id": "pg_8923145",
  "latitude": 41.902783,
  "longitude": 12.496365,
  "street_address": "Via Roma 15",
  "city": "Roma",
  "province": "RM",
  "region": "Lazio",
  "zip_code": "00184"
}

Complete list of extractable fields for Search Results objects from paginegialle.it. All fields typed and schema-versioned.

keyword, location_query, position, business_id, business_name, is_sponsored, rating, review_count, snippet_text, profile_url, scraped_at
search_results
● 200 OK
{
  "keyword": "idraulico",
  "location_query": "Milano",
  "position": 3,
  "business_id": "pg_774129",
  "business_name": "Pronto Intervento Idraulico Milano",
  "is_sponsored": true,
  "rating": 4.8,
  "scraped_at": "2023-11-01T10:15:00Z"
}

Capabilities

Extract Italian business data with precision

Our PagineGialle scraper resolves dynamic phone numbers, parses operating hours, and extracts verified VAT numbers across every Italian province — bypassing IP blocks and CAPTCHAs automatically.

Full Profile Extraction

Capture company names, descriptions, primary and secondary categories, website links, and social media profiles directly from the listing.

Contact Data Mining

Extract email addresses and resolve click-to-reveal phone numbers using automated JavaScript execution.

VAT Number Capture

Isolate and validate Partita IVA (VAT numbers) for B2B enrichment and corporate identity verification.

Review & Rating Aggregation

Collect review text, author names, star ratings, dates, and merchant responses across all paginated review pages.

Geolocation & Map Coordinates

Extract precise latitude and longitude coordinates, street addresses, provinces, and ZIP codes for spatial analysis.

Operating Hours Formatting

Parse unstructured opening hours into a clean, queryable schema mapped to specific days of the week.
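
A minimal sketch of this kind of normalisation, for illustration only. It assumes the raw listing text arrives as one line per day, such as "Lunedì 12:00-15:00"; that input shape is an assumption, not something taken from paginegialle.it. The output keys follow the operating_hours fields listed in the data dictionary, and `parse_hours`, `DAY_FIELDS`, and `LINE_RE` are hypothetical names.

```python
import re
import unicodedata

# Italian day names mapped to the field prefixes used in the operating_hours schema.
DAY_FIELDS = {
    "lunedì": "monday", "martedì": "tuesday", "mercoledì": "wednesday",
    "giovedì": "thursday", "venerdì": "friday", "sabato": "saturday",
    "domenica": "sunday",
}

# Matches e.g. "Lunedì 12:00-15:00" or "sabato: 19.00 – 23.30" (assumed input shapes).
LINE_RE = re.compile(
    r"^\s*(?P<day>[^\W\d_]+)\s*:?\s*"
    r"(?P<oh>\d{1,2})[:.](?P<om>\d{2})\s*[-–]\s*"
    r"(?P<ch>\d{1,2})[:.](?P<cm>\d{2})\s*$"
)


def parse_hours(raw: str) -> dict:
    """Turn free-text opening hours into <day>_open / <day>_close fields."""
    # Normalise accents so "ì" compares equal however it was encoded.
    raw = unicodedata.normalize("NFC", raw)
    record = {}
    for line in raw.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines the pattern does not understand
        day = DAY_FIELDS.get(match["day"].lower())
        if day is None:
            continue  # first word was not a known day name
        record[f"{day}_open"] = f"{int(match['oh']):02d}:{match['om']}"
        record[f"{day}_close"] = f"{int(match['ch']):02d}:{match['cm']}"
    return record
```

Lines the pattern cannot read are skipped rather than guessed at, so unusual formats surface as missing fields in QA instead of as wrong values. Split shifts (lunch and dinner on the same day) would need a richer schema than this sketch handles.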

Sponsored vs Organic Tracking

Identify paid placements versus organic rank for any keyword and location combination.
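
Once each search result carries an `is_sponsored` flag, the organic rank can be derived from the raw position. The sketch below shows one way to do that, assuming rows shaped like the search_results sample; `add_organic_rank` and the `organic_position` key are hypothetical and not part of the published schema.

```python
def add_organic_rank(results: list[dict]) -> list[dict]:
    """Annotate each result with organic_position, leaving sponsored rows as None."""
    ranked = []
    organic = 0
    # Walk the results in on-page order so organic ranks follow the raw positions.
    for row in sorted(results, key=lambda r: r["position"]):
        if row["is_sponsored"]:
            ranked.append({**row, "organic_position": None})
        else:
            organic += 1
            ranked.append({**row, "organic_position": organic})
    return ranked
```

With this, a listing at raw position 3 behind two paid placements is reported as organic position 1.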

Category & Keyword Mapping

Scrape entire category trees and track how businesses rank for specific industry keywords in local searches.

Scheduled + Streaming Modes

Run one-off provincial exports or configure continuous pipelines to track new business registrations and closures.

// engagement pipeline

From target province to warehouse record

Brief in. Clean data out.

Define Scope
Day 0

Provide categories, regions, provinces, or specific search queries. We design the extraction schema together.

Pipeline Build
Days 2–4

We configure Scrapy / Playwright crawlers, Italian proxy rotation, and CAPTCHA handling for paginegialle.it.

Validation & QA
Days 4–6

Schema validation, null-rate checks, phone number resolution tests, and sample reviews before full launch.
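
A null-rate check of the kind mentioned above can be sketched in a few lines. This is an illustration under assumptions: the thresholds, the function names `null_rates` and `failing_fields`, and the choice to treat empty strings as null are all hypothetical.

```python
def null_rates(records, fields):
    """Fraction of records in which each field is missing, None, or empty."""
    total = len(records)
    rates = {}
    for field in fields:
        missing = sum(1 for r in records if r.get(field) in (None, ""))
        rates[field] = missing / total if total else 0.0
    return rates


def failing_fields(records, thresholds):
    """Return the fields whose null rate exceeds its allowed threshold."""
    rates = null_rates(records, thresholds.keys())
    return {f: rate for f, rate in rates.items() if rate > thresholds[f]}
```

A sample batch would pass QA only when `failing_fields` returns an empty dict for the agreed thresholds.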

Delivery
ongoing

JSON / CSV / Parquet pushed to your S3 bucket, BigQuery dataset, or Snowflake stage on agreed cadence.

Under the hood

How our PagineGialle pipeline handles the hard parts

Directory scraping requires specific regional infrastructure and dynamic rendering capabilities. Here is how we maintain reliable extraction.

pipeline-monitor · paginegialle.it · live · active
// fingerprinting
Identity rotation
TLS fingerprint: randomised
User-agent: rotated
IP pool: residential
Challenges blocked: 0
// pagination
Page coverage: 48,291 pages queued, running
// observability
Pipeline health
Uptime: 99.9%
p99 latency: 142ms
Null rate: 0.3%
Alerts: 2

Regional targeting
Italian residential proxy rotation

PagineGialle aggressively blocks traffic originating outside Italy and flags datacenter IPs. We route all requests through ISP-grade Italian residential proxies to ensure uninterrupted access and accurate localized search results.
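
As a rough illustration of per-request rotation, the sketch below cycles through a proxy pool and skips entries that have been marked as banned. The class name `ProxyRotator` and the proxy URLs in the usage note are hypothetical, and real pools usually add health checks and cool-down timers that are left out here.

```python
import itertools


class ProxyRotator:
    """Round-robin over a proxy pool, skipping proxies marked as banned."""

    def __init__(self, proxies):
        self._pool = list(proxies)
        if not self._pool:
            raise ValueError("proxy pool must not be empty")
        self._banned = set()
        self._cycle = itertools.cycle(self._pool)

    def next_proxy(self) -> str:
        """Return the next usable proxy; one call per outgoing request."""
        # At most one full pass over the pool before giving up.
        for _ in range(len(self._pool)):
            proxy = next(self._cycle)
            if proxy not in self._banned:
                return proxy
        raise RuntimeError("all proxies are banned")

    def ban(self, proxy: str) -> None:
        """Exclude a proxy after a block page or repeated failures."""
        self._banned.add(proxy)
```

In a Scrapy crawler, the value returned by `next_proxy()` (for example a made-up `http://it-proxy-1.example:8000`) would typically be assigned to `request.meta["proxy"]`, which Scrapy's built-in HTTP proxy middleware reads.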

Dynamic content
Click-to-reveal contact resolution

Phone numbers and certain email addresses are obfuscated behind JavaScript event listeners. We deploy headless Playwright sessions to trigger these elements and capture the unmasked data.
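
The pattern can be sketched with Playwright's sync API: wrap the click in `page.expect_response(...)` so the response triggered by the click is captured. The selector, the URL fragment, and the `"phone"` key in the JSON payload are all assumptions for illustration; the real page may deliver the number differently (for example by rewriting the DOM), in which case the text would be read from the element instead.

```python
from typing import Any, Optional


def reveal_phone(page: Any, button_selector: str, url_fragment: str) -> Optional[str]:
    """Click a reveal button and read the phone number from the response it triggers.

    `page` is expected to be a Playwright sync-API Page. `button_selector`,
    `url_fragment`, and the "phone" payload key are hypothetical placeholders.
    """
    # expect_response must wrap the action that fires the request,
    # otherwise the response can arrive before we start listening.
    with page.expect_response(lambda r: url_fragment in r.url) as response_info:
        page.click(button_selector)
    payload = response_info.value.json()
    return payload.get("phone")
```

A `page` would normally come from `sync_playwright()`, a launched browser, and `browser.new_page()` followed by `page.goto(profile_url)`; that setup is omitted to keep the sketch short.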

Pagination limits
Deep category traversal algorithms

Search results often cap at a fixed number of pages. We bypass this by programmatically subdividing queries by micro-regions (ZIP codes or neighborhoods) to extract the entire business catalogue without hitting hard limits.
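
The subdivision idea can be sketched as a recursive planner: if a query would exceed the result cap, split its area into smaller areas and plan each one separately. The 50-page figure comes from the technical spec on this page; the results-per-page value, the function name `plan_queries`, and the two callbacks are assumptions for illustration.

```python
PAGE_CAP = 50   # pages per query, per the 50-page search limit in the spec
PER_PAGE = 20   # results per page: an assumed value, not confirmed


def plan_queries(keyword, area, count_results, split_area, cap=PAGE_CAP * PER_PAGE):
    """Return (keyword, area) queries whose result counts fit under the cap.

    count_results(keyword, area) -> int and split_area(area) -> list of
    smaller areas are hypothetical callbacks supplied by the caller.
    """
    if count_results(keyword, area) <= cap:
        return [(keyword, area)]
    children = split_area(area)
    if not children:
        # Cannot subdivide further: keep the query and accept truncation.
        return [(keyword, area)]
    plan = []
    for child in children:
        plan.extend(plan_queries(keyword, child, count_results, split_area, cap))
    return plan
```

Because neighbouring micro-regions can return the same business, the combined output still needs deduplication on `business_id`.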

Schema stability
Resilient selectors for legacy DOMs

Directory DOM structures can be inconsistent across different business tiers (free vs premium listings). We use multi-layered fallback selectors to capture data reliably regardless of the profile template.
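
A fallback chain of this kind can be sketched as an ordered list of extractors where the first non-empty result wins. To stay within the standard library, the sketch uses regular expressions as stand-ins for CSS or XPath selectors; the patterns and template names are hypothetical and do not describe the actual paginegialle.it markup.

```python
import re
from typing import Callable, Optional, Sequence

Extractor = Callable[[str], Optional[str]]


def regex_extractor(pattern: str) -> Extractor:
    """Build an extractor that returns the first capture group, or None."""
    compiled = re.compile(pattern, re.IGNORECASE | re.DOTALL)

    def extract(html: str) -> Optional[str]:
        match = compiled.search(html)
        if not match:
            return None
        value = match.group(1).strip()
        return value or None  # treat an empty capture as a miss

    return extract


def first_match(html: str, extractors: Sequence[Extractor]) -> Optional[str]:
    """Try each extractor in priority order and return the first hit."""
    for extract in extractors:
        value = extract(html)
        if value:
            return value
    return None


# Hypothetical layers: premium template, free template, then a last resort.
NAME_EXTRACTORS = [
    regex_extractor(r'<h1[^>]*itemprop="name"[^>]*>(.*?)</h1>'),
    regex_extractor(r'<h1[^>]*class="[^"]*title[^"]*"[^>]*>(.*?)</h1>'),
    regex_extractor(r"<title>(.*?)</title>"),
]
```

Logging which layer matched for each record is a cheap way to notice when a template change has pushed most profiles onto the last-resort extractor.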

Change detection
Only re-scrape what's changed

For ongoing monitoring, we hash existing records and only push diffs when a business updates its hours, adds a phone number, or closes permanently — optimising your storage and compute costs.
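
A minimal sketch of hash-based change detection, assuming records are flat dicts keyed by `business_id` and that volatile fields such as `scraped_at` should not count as changes. The function names and the choice of SHA-256 over canonical JSON are illustrative assumptions.

```python
import hashlib
import json


def record_hash(record, ignore=("scraped_at",)):
    """Stable hash of a record, ignoring fields that change on every run."""
    stable = {k: v for k, v in record.items() if k not in ignore}
    # sort_keys gives the same serialisation regardless of key order.
    canonical = json.dumps(stable, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_records(previous_hashes, current, key="business_id"):
    """Records that are new or whose content hash differs from the last run."""
    return [r for r in current if previous_hashes.get(r[key]) != record_hash(r)]


def removed_ids(previous_hashes, current, key="business_id"):
    """Ids seen in the last run but absent now, e.g. candidate closures."""
    return set(previous_hashes) - {r[key] for r in current}
```

Only the output of `changed_records` needs to be pushed downstream; ids from `removed_ids` are candidates for a closure check rather than proof that a business has closed.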

Applications

Who uses PagineGialle data — and how

Teams across industries use paginegialle.it data to build competitive products and smarter operations.

01
B2B Lead Generation

Sales teams build targeted outreach lists using verified phone numbers, emails, and VAT data filtered by province and industry.

02
Local SEO & Competitor Analysis

Agencies track client ranking positions against competitors for specific local keywords and monitor review sentiment.

03
Market Research & Mapping

Consultancies map business density across Italian regions to identify underserved markets or plan retail expansion.

04
Data Enrichment

CRM administrators append missing VAT numbers, operating hours, and updated contact details to existing corporate databases.

05
Review Monitoring

Brands track customer feedback across franchise locations to maintain service quality and respond to negative sentiment.

06
Investment Due Diligence

Private equity firms analyze category growth, closure rates, and review velocity to evaluate specific regional markets.

Why DataFlirt

"PagineGialle holds the most comprehensive index of Italian SMEs, but extracting clean, structured VAT numbers and contact data requires bypassing aggressive rate limits."

Most engineering teams underestimate the friction of directory scraping. PagineGialle employs dynamic phone number obfuscation, strict IP rate limiting, and complex pagination. DataFlirt manages the residential proxies and DOM parsing and delivers structured business records directly to your warehouse, so you can focus on using the data.

Technical Spec

PagineGialle scraper — technical capabilities

Everything supported by our paginegialle.it scraper — rendered SPA elements, auth walls, rate-limit evasion and beyond.

Italian Residential IPs · Requests routed exclusively through IT-based ISP proxies · Supported
Click-to-reveal phones · Automated JavaScript execution to unmask obfuscated contact details · Supported
Pagination traversal · Geographic subdivision to bypass 50-page search limits · Supported
Review scraping · Extraction of all paginated reviews and merchant responses · Supported
Sponsored listing detection · Flags paid placements in category and keyword search results · Supported
Change detection (diffs) · Hash-based diff: only emit records with changed fields since last run · Supported
Webhook delivery · HTTP POST per record or batch, useful for real-time lead routing · Supported
Merchant dashboard analytics · Profile view counts and click-through rates (requires merchant login) · Partial
User account saved lists · Extraction of private user bookmarks or saved businesses · Partial
Infrastructure

Infrastructure powering the PagineGialle pipeline

Open-source tooling on proven cloud infra — no vendor lock-in, full observability.

Scrapy · Playwright · Python 3.12 · Redis · PostgreSQL · Apache Airflow · AWS Lambda · S3 · CloudWatch · 2Captcha · CapSolver · Residential Proxies · Docker · Kubernetes · Grafana · Prometheus
Scrapy + Playwright Stack

Scrapy handles crawl orchestration and deduplication. Playwright handles JavaScript rendering to unmask click-to-reveal phone numbers and emails.

Italian Proxy Infrastructure

We maintain dedicated pools of Italian residential ISP proxies. Rotation happens per-request to prevent IP bans and ensure localized search fidelity.

Cloud-Native Orchestration

Pipelines run on AWS Lambda and ECS. Airflow handles scheduling, dependency management, and SLA alerting. All state is stored in managed Postgres.

Output & Delivery

Your data, your destination

Data delivered to where your team already works — no new tooling required.

JSON
Newline-delimited or nested — schema versioned per run
CSV
Flat file with typed columns — Excel/Sheets compatible
XLS
Formatted Excel exports for non-technical stakeholders
Parquet
Columnar format for BigQuery, Snowflake, Athena
AWS S3
Direct bucket delivery — compatible with any data lake
Webhook
HTTP POST per record for real-time downstream processing
API
REST endpoints to query your extracted datasets
Postgres
Upsert into your existing schema with conflict resolution
// faq

Common questions.

About paginegialle.it scraping, legality, and pipeline operations.

Is scraping PagineGialle legal?

Scraping publicly available B2B information is generally permissible under EU law, provided it complies with GDPR. DataFlirt extracts only public business data (company names, business contacts, VAT numbers). We do not extract private consumer data or bypass authentication walls. Clients should ensure their use of the data (e.g., cold outreach) complies with local regulations.

How do you handle click-to-reveal phone numbers?

PagineGialle obfuscates phone numbers to prevent basic scraping. We use headless Playwright instances to render the DOM, simulate user interaction, and capture the network response containing the unmasked contact details.

Do you use Italian IP addresses?

Yes. PagineGialle heavily restricts non-Italian traffic. We utilize a strictly managed pool of Italian residential proxies to ensure access and prevent geo-blocking.

Can you extract Partita IVA (VAT numbers)?

Yes. VAT numbers are extracted and validated where present on the business profile, providing a unique identifier for CRM enrichment and deduplication.
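
For reference, a Partita IVA is 11 digits whose last digit is a Luhn-style check digit, so a basic structural check can be sketched as below. This is an illustrative sketch, not a description of the production validator; `is_valid_partita_iva` is a hypothetical name, and the check confirms only format and checksum, not that the number is actually registered.

```python
def is_valid_partita_iva(value: str) -> bool:
    """Check the length, digits, and check digit of an Italian VAT number."""
    digits = value.strip().upper().removeprefix("IT")
    if len(digits) != 11 or not (digits.isascii() and digits.isdigit()):
        return False
    total = 0
    for index, char in enumerate(digits[:10]):
        d = int(char)
        if index % 2 == 1:  # 2nd, 4th, ... digit counting from 1
            d *= 2
            if d > 9:
                d -= 9
        total += d
    # The 11th digit brings the weighted sum up to a multiple of 10.
    return (10 - total % 10) % 10 == int(digits[10])
```

Records that fail the check are better flagged than dropped, since a mistyped VAT number on a profile is still a useful signal during enrichment.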

How fresh is the data?

For continuous pipelines, we can configure daily or weekly runs to capture new business registrations, updated hours, or recent reviews. Full category refreshes depend on the requested volume.

What is the minimum viable engagement?

Our minimum engagement typically starts at a defined category or province extraction (e.g., all restaurants in Lombardy). For nationwide catalogues, we price based on total record volume and delivery frequency.

Can I request a sample dataset before committing?

Absolutely. We provide a sample run of up to 500 business listings or a specific local search query as part of the pre-engagement scoping process to validate schema fit and data quality.

$ dataflirt scope --new-project --source=paginegialle.it ready

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off regional export or a continuous feed of Italian B2B contacts — we scope, build, and operate the pipeline. Tell us what you need.

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h