
What is Great Expectations?

Great Expectations is an open-source Python framework for data validation, documentation, and profiling. In scraping pipelines, it acts as a declarative contract between the extraction layer and the delivery sink. Instead of writing custom assertion scripts to check whether a scraped price column contains valid numbers, you define an Expectation Suite. If the target site changes its DOM and prices suddenly extract as "N/A", the pipeline halts before poisoning the downstream warehouse.

Data Quality · Validation · Schema Contracts · Python · ETL
// 02 — definitions

Asserting
reality.

Why unit testing your code isn't enough when the data itself is a moving target.


TL;DR

Great Expectations (GX) shifts data quality from reactive debugging to proactive testing. You define rules like expect_column_values_to_be_between, and GX runs them against your scraped datasets (Pandas, Spark, or SQL). It catches schema drift, missing fields, and type coercion failures before the data is delivered.

01 Definition & core concepts
Great Expectations is a shared, open-source standard for data quality. It relies on three core concepts:
  • Expectation Suites: Collections of verifiable assertions about your data (e.g., "prices must be positive floats").
  • Checkpoints: The execution engine that runs a suite against a specific batch of data (like a daily scrape output).
  • Data Docs: Auto-generated HTML reports that translate the JSON expectations and validation results into human-readable documentation.
By decoupling the rules from the extraction logic, GX allows data engineers to treat data quality as a configuration problem rather than a coding problem.
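To make the decoupling concrete, here is a minimal sketch in plain Python. It does not use the Great Expectations library: the `SUITE` layout, the `CHECKS` registry, and `run_suite` are hypothetical stand-ins, and only the expectation names mirror GX. The point is that the rules live in data, so changing a threshold means editing configuration rather than extractor code.

```python
from typing import Any, Callable

# Hypothetical suite: the rules are plain data, separate from any extractor code.
SUITE = [
    {"type": "expect_column_values_to_not_be_null", "column": "sku"},
    {"type": "expect_column_values_to_be_between", "column": "price",
     "min_value": 0.01, "max_value": 100_000},
]

def _not_null(value: Any, **_: Any) -> bool:
    return value is not None and value != ""

def _between(value: Any, min_value: float, max_value: float, **_: Any) -> bool:
    return isinstance(value, (int, float)) and min_value <= value <= max_value

# Registry mapping rule names to row-level predicates (illustrative only).
CHECKS: dict[str, Callable[..., bool]] = {
    "expect_column_values_to_not_be_null": _not_null,
    "expect_column_values_to_be_between": _between,
}

def run_suite(records: list[dict], suite: list[dict]) -> list[dict]:
    """Evaluate every rule against every record and summarise the failures."""
    results = []
    for rule in suite:
        params = {k: v for k, v in rule.items() if k not in ("type", "column")}
        check = CHECKS[rule["type"]]
        failed = [r for r in records if not check(r.get(rule["column"]), **params)]
        results.append({"type": rule["type"], "column": rule["column"],
                        "success": not failed, "unexpected_count": len(failed)})
    return results

records = [
    {"sku": "A1", "price": 19.99},
    {"sku": "A2", "price": "N/A"},   # what a broken price selector might yield
]
results = run_suite(records, SUITE)
```

Here the `sku` rule succeeds and the `price` rule reports one unexpected value, without any validation logic living inside the extractor.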
02 How it works in a scraping pipeline
In a typical ETL scraping pipeline, the crawler fetches HTML and the extractor parses it into raw records. Before these records are merged into the main database, a GX Checkpoint is triggered. The checkpoint loads the raw batch into memory (via Pandas or Spark) and evaluates every row against the Expectation Suite. If the batch passes, the orchestration tool (like Airflow) moves the data to production. If it fails, the pipeline halts, the batch is quarantined, and an alert is fired with a link to the generated Data Docs showing exactly which rows failed.
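The gate described above can be sketched as a small function that either promotes or quarantines a batch. This is an illustration in plain Python, not GX's Checkpoint API; `GateOutcome`, `gate_batch`, and the check names are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class GateOutcome:
    action: str                                   # "promote" or "quarantine"
    failures: list = field(default_factory=list)  # names of failed checks

def gate_batch(batch, checks):
    """Run named batch-level checks and promote only if all of them pass."""
    failures = [name for name, check in checks.items() if not check(batch)]
    if failures:
        return GateOutcome(action="quarantine", failures=failures)
    return GateOutcome(action="promote")

# Hypothetical checks; a real suite would be far larger.
checks = {
    "row_count_in_range": lambda b: 1 <= len(b) <= 5,
    "sku_not_null": lambda b: all(r.get("sku") for r in b),
}

good_batch = [{"sku": "A1"}, {"sku": "A2"}]
bad_batch = [{"sku": "A1"}, {"sku": None}]

good_outcome = gate_batch(good_batch, checks)
bad_outcome = gate_batch(bad_batch, checks)
```

An orchestrator task would branch on `action`: move the batch forward on "promote", or park it and raise an alert listing `failures` on "quarantine".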
03 Handling scraping volatility with "mostly"
Web scraping is inherently messy. Target sites have missing data, malformed HTML, and edge cases. If you require 100% compliance on every field, almost every batch will fail validation and the pipeline will rarely deliver. GX addresses this with the mostly parameter. By setting mostly=0.95 on an expectation, you instruct the engine to pass that expectation as long as at least 95% of the records meet the criteria. This absorbs the natural noise of the web while still catching catastrophic selector failures.
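The threshold logic behind mostly can be sketched in a few lines. This is a hand-rolled illustration, not GX code: `passes_mostly` is a hypothetical helper, and skipping `None` values is an assumption made here on the basis that null-ness is covered by a separate not-null expectation.

```python
def passes_mostly(values, predicate, mostly=1.0):
    """Return (success, fraction); success means fraction >= mostly.

    None values are skipped (assumption: a separate not-null rule covers them).
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return True, 1.0
    fraction = sum(1 for v in non_null if predicate(v)) / len(non_null)
    return fraction >= mostly, fraction

def is_positive(value):
    return isinstance(value, (int, float)) and value > 0

# Natural noise: 4 bad prices out of 100.
noisy_prices = [9.99] * 96 + [-1.0] * 4
# Catastrophic selector failure: 60 bad prices out of 100.
broken_prices = [9.99] * 40 + [-1.0] * 60

noisy_ok, noisy_fraction = passes_mostly(noisy_prices, is_positive, mostly=0.95)
broken_ok, broken_fraction = passes_mostly(broken_prices, is_positive, mostly=0.95)
```

The noisy batch clears the 0.95 threshold while the broken one does not, which is the behaviour the parameter is meant to provide.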
04 How DataFlirt handles it
We integrate GX directly into our delivery layer. Every client data contract is mapped to a versioned Expectation Suite. We run these suites on Spark clusters for high-volume pipelines, ensuring that schema drift is caught before a single byte reaches the client's S3 bucket. When an expectation fails, our automated quarantine system isolates the specific bad records, allowing the healthy portion of the dataset to be delivered on time while our engineers patch the broken selectors.
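Record-level quarantine of this kind can be sketched as a simple partition. The code below is not DataFlirt's system; `split_batch` and `has_valid_price` are hypothetical names, and the price rule is an assumed example.

```python
def split_batch(records, is_valid):
    """Partition records into (healthy, quarantined) using a row-level predicate."""
    healthy, quarantined = [], []
    for record in records:
        (healthy if is_valid(record) else quarantined).append(record)
    return healthy, quarantined

def has_valid_price(record):
    # Hypothetical rule: price must be a positive float.
    price = record.get("price")
    return isinstance(price, float) and price > 0

batch = [
    {"sku": "A1", "price": 19.99},
    {"sku": "A2", "price": "N/A"},
    {"sku": "A3", "price": 4.50},
]
healthy, quarantined = split_batch(batch, has_valid_price)
```

The healthy list can be delivered while the quarantined records wait for a selector fix and a replay.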
05 Did you know?
Great Expectations was originally developed to handle the extreme complexity and regulatory requirements of healthcare data. Its design philosophy, that data documentation and data testing should be the exact same artifact, has made it one of the most widely adopted data quality tools in the modern data stack and a natural fit for the unpredictable nature of web-scraped datasets.
// 03 — the validation model

How strict should
expectations be?

DataFlirt tunes expectation strictness based on the pipeline's SLA. Requiring 100% compliance on a volatile e-commerce target means constant pipeline halts, so we use probabilistic expectations (a mostly threshold below 1.0) for non-critical fields.

Pass Rate = Valid_Records / Total_Records
Must meet or exceed the 'mostly' threshold to trigger delivery. (GX Core Mechanics)
Expectation Coverage = Columns_Tested / Total_Columns
Aim for 100% on primary keys, prices, and identifiers. (DataFlirt QA Standard)
DataFlirt Quarantine Ratio = Failed_Records / Total_Records
A ratio > 0.05 triggers an automatic selector review. (Internal SLO)
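As a worked example, the three ratios can be computed directly. The input figures below are made up for illustration, and `ratio` is a hypothetical helper; only the formulas and the 0.05 review threshold come from the text above.

```python
def ratio(numerator, denominator):
    """Plain division with an explicit guard against an empty batch."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator

# Hypothetical batch figures.
total_records = 50_000
failed_records = 6_200
valid_records = total_records - failed_records
columns_tested, total_columns = 9, 12

pass_rate = ratio(valid_records, total_records)
coverage = ratio(columns_tested, total_columns)
quarantine_ratio = ratio(failed_records, total_records)

mostly = 0.95
deliver = pass_rate >= mostly
needs_selector_review = quarantine_ratio > 0.05
```

With these inputs the pass rate is 0.876, below the 0.95 threshold, so delivery is blocked, and the quarantine ratio of 0.124 is above 0.05, so a selector review would be triggered.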
// 04 — expectation suite execution

Validating a scraped
product catalog.

A live GX validation run on a 50k-record dataset extracted from a retail target. The suite catches a silent selector failure on the 'stock_status' column.

Python 3.11 · Pandas Execution Engine · Data Docs
// execute checkpoint
$ gx checkpoint run retail_daily_check
Validating batch: retail_catalog_20260519.parquet

// running expectations
expect_table_row_count_to_be_between: [48000, 52000] ... PASS
expect_column_values_to_not_be_null(column="sku") ... PASS
expect_column_values_to_be_of_type(column="price", type_="float") ... PASS
expect_column_values_to_be_in_set(column="stock_status") ... FAIL
↳ Unexpected values: ["Contact Store", ""] (12.4% of records)

// validation results
Evaluated Expectations: 14
Successful Expectations: 13
Unsuccessful Expectations: 1
Success %: 92.9%

Action: Halting delivery. Routing batch to quarantine.
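The kind of set-membership check that fails in this run can be sketched as follows. This is plain Python rather than GX, `check_in_set` is a hypothetical helper, and the allowed `stock_status` values are assumed because the run output does not list them.

```python
def check_in_set(values, value_set):
    """Report values outside the allowed set and their share of the batch."""
    unexpected = [v for v in values if v not in value_set]
    percent = 100 * len(unexpected) / len(values) if values else 0.0
    return {
        "success": not unexpected,
        "unexpected_values": sorted(set(unexpected)),
        "unexpected_percent": percent,
    }

# Assumed allowed values; the real suite's value set is not shown in the run.
allowed = {"in_stock", "out_of_stock"}

# Scaled-down batch: 124 of 1,000 records carry values the selector never produced before.
stock_status = ["in_stock"] * 876 + ["Contact Store"] * 100 + [""] * 24
result = check_in_set(stock_status, allowed)
```

The result lists both offending values and their share of the batch, which is the information an engineer needs to decide whether to patch the selector or extend the allowed set.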
// 05 — failure modes

What expectations
catch most often.

The most frequently triggered expectations across DataFlirt's validation layer. Schema drift on the target site almost always manifests as one of these failures.

PIPELINES MONITORED · 300+ active
VALIDATION ENGINE · Spark / Pandas
UPDATED · 2026-05-19
01 · expect_column_values_to_not_be_null · Missing fields · Selector rot or conditional DOM changes
02 · expect_column_values_to_be_of_type · Type coercion · Currency symbols leaking into float columns
03 · expect_column_values_to_be_in_set · Categorical · New enum values added by the target site
04 · expect_table_row_count_to_be_between · Volume drop · Pagination failures or partial crawls
05 · expect_column_values_to_be_unique · Deduplication · Bad primary keys or infinite redirect loops
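Two of these failure modes are easy to reproduce in isolation. The sketch below uses plain Python with hypothetical helper names, not GX itself; it shows how a currency symbol turns a float column into a mixed-type column and how duplicate keys surface.

```python
def non_float_values(values):
    """Values that are not floats, e.g. '$19.99' scraped with its currency symbol."""
    return [v for v in values if not isinstance(v, float)]

def duplicate_values(values):
    """Values that occur more than once, e.g. the same SKU crawled twice."""
    seen, dupes = set(), set()
    for value in values:
        (dupes if value in seen else seen).add(value)
    return sorted(dupes)

prices = [19.99, "$24.50", 4.50]        # one price kept its currency symbol
skus = ["A1", "A2", "A1", "A3"]         # "A1" was emitted twice

bad_prices = non_float_values(prices)
repeated_skus = duplicate_values(skus)
```

An empty result from either helper corresponds to a passing expectation; anything else is the list of offending values to report.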
// 06 — our validation architecture

Test the data,

not just the code.

At DataFlirt, we treat data validation as a first-class citizen in the pipeline orchestration. We don't just run Great Expectations at the end of a batch; we run lightweight expectation suites at the micro-batch level during extraction. If a target site deploys a layout change that breaks the price selector, the expectation fails within the first 100 records. The worker halts, quarantines the batch, and alerts our engineering team — preventing the pipeline from burning proxy bandwidth on a crawl that will only yield garbage data.
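The early-halt behaviour can be sketched with a generator that stands in for the crawler. Everything here is illustrative: `crawl_with_early_halt`, `fake_crawl`, and the 0.95 threshold are assumptions, not DataFlirt's worker code. The `fetched` counter shows that the crawl stops pulling records once a micro-batch fails.

```python
from itertools import islice

def crawl_with_early_halt(record_stream, is_valid, micro_batch_size=100, mostly=0.95):
    """Consume a stream in micro-batches and stop at the first batch whose
    valid fraction falls below `mostly`."""
    accepted = []
    stream = iter(record_stream)
    while True:
        batch = list(islice(stream, micro_batch_size))
        if not batch:
            return {"status": "complete", "accepted": accepted, "quarantined": []}
        valid_fraction = sum(1 for r in batch if is_valid(r)) / len(batch)
        if valid_fraction < mostly:
            return {"status": "halted", "accepted": accepted, "quarantined": batch}
        accepted.extend(batch)

fetched = 0

def fake_crawl(n=1000):
    """Simulated crawler: the price selector breaks after the first 100 records."""
    global fetched
    for i in range(n):
        fetched += 1
        yield {"sku": f"S{i}", "price": 9.99 if i < 100 else None}

outcome = crawl_with_early_halt(fake_crawl(), lambda r: r["price"] is not None)
```

Because the stream is consumed lazily, the remaining 800 simulated requests are never made, which is where the proxy bandwidth saving comes from.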

Validation Checkpoint

State of a validation checkpoint on a B2B pricing feed.

suite.name b2b_pricing_v4
execution.engine SparkDFExecutionEngine
records.scanned 1.2M (complete)
expectations.total 42
expectations.passed 42 (100%)
data_docs.generated s3://df-docs/run_8912/published
pipeline.action promote_to_gold

// 07 — FAQ

Common
questions.

Common questions about implementing Great Expectations in scraping pipelines, handling volatile schemas, and managing quarantine queues.

What's the difference between Great Expectations and standard Python assertions?
Standard assertions are imperative and buried in your code. Great Expectations is declarative. It separates the rules from the execution, generates human-readable Data Docs automatically, and tracks validation history over time. It's a data contract, not just a script.
Does running Great Expectations slow down the scraping pipeline?
Yes, if you run it row-by-row in Python during extraction. The correct architectural pattern is to run GX on micro-batches (e.g., Parquet files in S3) using a vectorized engine like Pandas or Spark. This adds a few seconds per batch rather than extra latency on every request.
How do you handle target sites that change constantly?
You use the mostly parameter. If a site's secondary image field is flaky, you set expect_column_values_to_not_be_null(column="image_2", mostly=0.85). The expectation then passes as long as at least 85% of records have the image, absorbing minor volatility without halting the entire feed.
What happens when an expectation fails?
The batch is routed to a quarantine queue. We never silently drop the failed records, and we never write them to the client's production dataset. An engineer reviews the Data Docs, fixes the selector or updates the expectation, and replays the quarantined batch.
Can Great Expectations validate nested JSON API responses?
Natively, GX is designed for tabular data (dataframes, SQL tables). To validate complex nested JSON from an API scrape, you must first flatten the payload into a tabular structure during the extraction phase, or write custom expectations. We typically flatten to NDJSON before validation.
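One common way to flatten a nested payload is to join nested keys with a separator. The sketch below is a minimal version under stated assumptions: `flatten` is a hypothetical helper, the payload shape is invented, and lists are left untouched because exploding them into rows is a separate design decision.

```python
import json

def flatten(obj, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with dotted keys.

    Lists are kept as-is; only dict nesting is unrolled.
    """
    flat = {}
    for key, value in obj.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# Hypothetical API responses.
payloads = [
    {"sku": "A1", "price": {"amount": 19.99, "currency": "USD"}},
    {"sku": "A2", "price": {"amount": 4.50, "currency": "USD"}},
]

rows = [flatten(p) for p in payloads]
# One flattened record per line (NDJSON), ready to load into a dataframe.
ndjson = "\n".join(json.dumps(row) for row in rows)
```

After flattening, columns such as `price.amount` can be targeted by ordinary column-level expectations.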
How does DataFlirt integrate GX into its infrastructure?
We embed GX checkpoints as gating tasks in our Airflow orchestration DAGs. A pipeline cannot promote data from the 'silver' (extracted) layer to the 'gold' (client-ready) layer unless the GX checkpoint returns a 100% success rate on critical expectations.
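The distinction between critical and non-critical expectations can be sketched as a small decision function. This is not Airflow or GX code; `promotion_decision`, the result layout, and the expectation names are hypothetical.

```python
def promotion_decision(results, critical):
    """Block promotion only when a critical expectation fails;
    non-critical failures are surfaced as warnings."""
    blocking = [r["expectation"] for r in results
                if r["expectation"] in critical and not r["success"]]
    warnings = [r["expectation"] for r in results
                if r["expectation"] not in critical and not r["success"]]
    return {"promote": not blocking, "blocking": blocking, "warnings": warnings}

# Hypothetical validation results for one batch.
results = [
    {"expectation": "sku_not_null", "success": True},
    {"expectation": "price_is_float", "success": True},
    {"expectation": "image_2_not_null", "success": False},
]
decision = promotion_decision(results, critical={"sku_not_null", "price_is_float"})
```

In an orchestrated pipeline, the gating task would fail (and so stop downstream tasks) whenever `promote` is false, while warnings would only be logged.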