
What is Soda Core?

Soda Core is an open-source data reliability framework that uses YAML-based configuration to define and execute data quality checks across your pipelines. For scraping teams, it acts as the automated QA layer between raw extraction and client delivery, catching schema drift, null spikes, and type coercion failures before bad data poisons a downstream warehouse.

Data Quality · YAML Checks · Pipeline QA · Schema Validation · Open Source
// 02 — definitions

Test your
data.

How declarative data quality checks prevent silent extraction failures from becoming downstream analytics disasters.


TL;DR

Soda Core allows data engineers to write human-readable YAML contracts that assert what valid data should look like. It connects directly to your data warehouse or processing engine, runs the defined checks against the dataset, and fails the pipeline or alerts the team if thresholds — like maximum null rates or unexpected schema changes — are breached.

01 Definition & structure
Soda Core is a CLI tool and Python library that enables data engineers to write declarative data quality checks using YAML. Instead of writing complex SQL scripts or Python validation logic from scratch, you define what the data should look like (e.g., missing_count(id) = 0). Soda translates these YAML instructions into optimized SQL, executes them against your data source, and returns pass/warn/fail statuses.
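To make the YAML-to-SQL idea concrete, here is a minimal Python sketch, assuming an in-memory SQLite table standing in for the warehouse. The `missing_count` helper and the `products` table are hypothetical names for illustration, and the query is only the kind of aggregate such a check reduces to, not the SQL Soda actually generates.

```python
import sqlite3


def missing_count(conn: sqlite3.Connection, table: str, column: str) -> int:
    """Count NULLs in a column: the aggregate behind a check like missing_count(id) = 0."""
    # Identifiers are interpolated for brevity; only pass trusted names here.
    sql = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    return conn.execute(sql).fetchone()[0]


# Hypothetical staging table with one missing id and one missing price.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("a1", 9.99), ("a2", None), (None, 4.50)],
)

count = missing_count(conn, "products", "id")
status = "pass" if count == 0 else "fail"
print(count, status)  # 1 fail
```

The declarative check names the metric and the threshold; the query and the comparison are what a tool would derive from it.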
02 How it works in practice
In a typical scraping pipeline, data is extracted, transformed, and loaded into a staging table in a data warehouse (the ETL pattern). Once the load is complete, an orchestrator like Airflow triggers a soda scan. Soda connects to the warehouse, runs the YAML-defined checks, and outputs the results. If the checks pass, the data is promoted to the production tables. If they fail, the pipeline halts, preventing corrupted data from reaching business dashboards.
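The promote-or-halt step can be sketched as a small Python wrapper around the CLI. This is a sketch under assumptions: the exit codes follow Soda's CLI documentation as we understand it (0 pass, 1 warnings, 2 failures, 3 runtime error) and should be verified for your version, while the action names and the choice to promote on warnings are hypothetical policy, not part of Soda.

```python
import subprocess

# Assumed exit codes (verify against your Soda Core version):
# 0 = all checks passed, 1 = warnings, 2 = failures, 3 = scan/runtime error.
ACTIONS = {0: "promote", 1: "promote_with_alert", 2: "halt", 3: "halt"}


def decide(exit_code: int) -> str:
    """Map a scan exit code to a pipeline action; unknown codes halt."""
    return ACTIONS.get(exit_code, "halt")


def run_scan(data_source: str, config: str, checks: str) -> str:
    """Run `soda scan` and return the action the orchestrator should take."""
    result = subprocess.run(
        ["soda", "scan", "-d", data_source, "-c", config, checks],
        check=False,  # non-zero exit is expected on warnings and failures
    )
    return decide(result.returncode)
```

An orchestrator task would call something like `run_scan("snowflake_prod", "configuration.yml", "pricing_checks.yml")` after the load and branch on the returned action.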
03 Common scraping quality checks
Web scraping is inherently brittle, making automated QA critical. Common checks include:
  • Completeness: Ensuring critical fields (price, SKU, title) are not null.
  • Validity: Checking that prices are positive numbers and emails match a regex pattern.
  • Uniqueness: Asserting that primary keys (like product URLs) have no duplicates.
  • Freshness: Verifying that the scraped_at timestamp is within the last 24 hours.
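The four check families above can be sketched in plain Python over a list of scraped rows. This is an illustration of the logic only, not Soda's implementation: the field names (`url`, `seller_email`, `scraped_at`), the deliberately simple email pattern, and the sample rows are all assumptions, and the function expects at least one row.

```python
import re
from datetime import datetime, timedelta, timezone

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # simplistic on purpose


def run_checks(rows, now, max_age=timedelta(hours=24)):
    """Return check name -> bool for completeness, validity, uniqueness, freshness."""
    urls = [r["url"] for r in rows]
    return {
        # Completeness: critical fields must not be null.
        "completeness": all(
            r[field] is not None for r in rows for field in ("price", "sku", "title")
        ),
        # Validity: prices are positive and emails match the pattern.
        "validity": all(
            r["price"] is not None
            and r["price"] > 0
            and EMAIL_RE.match(r["seller_email"]) is not None
            for r in rows
        ),
        # Uniqueness: no duplicate product URLs.
        "uniqueness": len(urls) == len(set(urls)),
        # Freshness: the newest record is at most max_age old.
        "freshness": now - max(r["scraped_at"] for r in rows) <= max_age,
    }


now = datetime(2026, 5, 19, 12, 0, tzinfo=timezone.utc)
rows = [
    {"url": "https://example.com/p/1", "sku": "A1", "title": "Widget",
     "price": 9.99, "seller_email": "shop@example.com",
     "scraped_at": now - timedelta(hours=2)},
    # Same URL as above and a null price: breaks three of the four checks.
    {"url": "https://example.com/p/1", "sku": "A2", "title": "Gadget",
     "price": None, "seller_email": "shop@example.com",
     "scraped_at": now - timedelta(hours=3)},
]
results = run_checks(rows, now)
print(results)
```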
04 How DataFlirt handles it
We integrate programmatic data contracts directly into our delivery layer. Before any client receives a dataset, it must pass a rigorous suite of automated checks. We use a mix of open-source concepts and proprietary validation engines to ensure that schema drift on a target site results in an internal engineering alert, not a broken client dashboard. We guarantee the shape and quality of the data, not just the execution of the crawler.
05 The silent failure it prevents
The most dangerous scraping failure isn't a 403 Forbidden — it's a silent extraction failure. If a target site changes its CSS class for a price element, your scraper might successfully fetch 100,000 pages but extract null for every price. Without a tool like Soda Core checking the missing_percent(price), the pipeline will happily deliver 100,000 useless records to your analytics team.
// 03 — the quality model

How we measure
data reliability.

Soda Core translates business requirements into executable SQL queries. Here is how we mathematically model the thresholds that trigger pipeline alerts across our extraction fleet.

Null threshold: $N_{\text{rate}} = \dfrac{\text{null\_count}}{\text{row\_count}}$
Completeness check. Alerts if the percentage of missing values exceeds a defined tolerance.
Freshness bound: $T_{\text{lag}} = \text{now}() - \max(\text{scraped\_at})$
Freshness check. Ensures the dataset contains recently extracted records, catching stalled crawlers.
Volume anomaly: $V_{\text{diff}} = \dfrac{\lvert V_{\text{current}} - \mu(V_{\text{past\_7d}}) \rvert}{\sigma(V_{\text{past\_7d}})}$
Anomaly detection. Z-score for row counts. Catches pagination failures or massive target site purges.
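The three metrics above translate directly into a few lines of Python. This is a minimal sketch: the function names and sample numbers are illustrative, and using the sample standard deviation for σ is an assumption, since the formula does not say which estimator is intended.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, stdev


def null_rate(null_count: int, row_count: int) -> float:
    """N_rate = null_count / row_count."""
    if row_count <= 0:
        raise ValueError("row_count must be positive")
    return null_count / row_count


def freshness_lag(now: datetime, scraped_at_values) -> timedelta:
    """T_lag = now() - max(scraped_at)."""
    return now - max(scraped_at_values)


def volume_z_score(current: int, past_counts) -> float:
    """V_diff = |V_current - mean(past)| / stdev(past), over e.g. the last 7 days."""
    sigma = stdev(past_counts)  # sample standard deviation (assumption)
    if sigma == 0:
        # A perfectly flat history makes any change infinitely surprising.
        return 0.0 if current == mean(past_counts) else float("inf")
    return abs(current - mean(past_counts)) / sigma


now = datetime(2026, 5, 19, 12, 0, tzinfo=timezone.utc)
lag = freshness_lag(now, [now - timedelta(hours=5), now - timedelta(hours=1)])
z = volume_z_score(10, [98, 102, 100, 100, 98, 102, 100])
print(null_rate(34, 1000), lag, round(z, 1))
```

A crawl that returns 10 rows against a history hovering around 100 produces a very large z-score, which is the pagination-failure signature the volume check is meant to catch.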
// 04 — soda scan execution

Running a data
contract check.

A live trace of a Soda Core scan running against a freshly scraped e-commerce pricing dataset in Snowflake before client delivery.

soda-core 3.0 · YAML · Snowflake
// execute scan
$ soda scan -d snowflake_prod -c configuration.yml pricing_checks.yml

// reading configuration
Soda Core 3.0.45
Scan summary:
Data source: snowflake_prod
Dataset: raw_pricing_in

// executing checks
row_count > 0 [pass: 142,501]
missing_count(sku) = 0 [pass: 0]
duplicate_count(sku) = 0 [pass: 0]
missing_percent(price) < 2% [warn: 3.4%]
schema:
  columns:
    sku: varchar
    price: numeric

// scan results
Passes: 4
Warnings: 1
Failures: 0
Status: Completed with warnings
// 05 — failure modes

Where scraped data
fails validation.

The most common data quality violations caught by automated checks across active scraping pipelines. Schema drift remains the dominant cause of downstream data corruption.

Datasets scanned: 1,200+ daily
Check execution: post-load
Updated: 2026-05-19
01 Schema drift / missing fields · structural · Target site DOM changes break selectors
02 Type coercion errors · formatting · Strings in numeric fields (e.g., 'Out of Stock')
03 Volume anomalies · completeness · Pagination loops fail, returning 10% of expected rows
04 Stale data / freshness · temporal · Crawler blocked, returning cached or old records
05 Duplicate records · uniqueness · Overlapping crawl frontiers or bad deduplication
// 06 — our QA stack

Trust is good,

declarative data contracts are better.

At DataFlirt, we don't just deliver CSVs and hope for the best. Every dataset passes through a rigorous validation layer inspired by declarative models like Soda Core. We define strict contracts for completeness, accuracy, and format. If a target site changes its layout and our price extractor starts pulling empty strings, the pipeline halts and quarantines the batch. Bad data never reaches your warehouse.

pricing_checks.yml

Validation contract for a daily e-commerce pricing feed.

dataset              raw_pricing_in

Check                Threshold        Result
checks.row_count     > 100000         pass
checks.freshness     < 24h            pass
checks.missing_sku   = 0              pass
checks.missing_prc   < 2%             3.4%
checks.schema        strict match     pass

pipeline.action      deliver with warning
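How a contract like this resolves to a single pipeline action can be sketched in Python. Everything here is illustrative rather than DataFlirt's actual engine: the `Check` class, the metric names, the 3-hour freshness value, and the 10% hard-fail threshold for missing prices are assumptions added so the example is complete.

```python
from dataclasses import dataclass
from typing import Callable

Predicate = Callable[[float], bool]


def never(_value: float) -> bool:
    """Default warn rule: a check with no warn threshold never warns."""
    return False


@dataclass
class Check:
    metric: str
    fail_if: Predicate          # True when the value breaches the hard limit
    warn_if: Predicate = never  # True when the value breaches the soft limit


CONTRACT = [
    Check("row_count", fail_if=lambda v: v <= 100_000),
    Check("freshness_hours", fail_if=lambda v: v >= 24),
    Check("missing_sku", fail_if=lambda v: v != 0),
    # 2% is the warn threshold; the 10% hard limit is hypothetical.
    Check("missing_price_pct", fail_if=lambda v: v >= 10, warn_if=lambda v: v >= 2),
]


def evaluate(contract, metrics):
    """Return per-check outcomes and the resulting pipeline action."""
    outcomes = {}
    for check in contract:
        value = metrics[check.metric]
        if check.fail_if(value):
            outcomes[check.metric] = "fail"
        elif check.warn_if(value):
            outcomes[check.metric] = "warn"
        else:
            outcomes[check.metric] = "pass"
    if "fail" in outcomes.values():
        action = "quarantine"
    elif "warn" in outcomes.values():
        action = "deliver with warning"
    else:
        action = "deliver"
    return outcomes, action


metrics = {"row_count": 142_501, "freshness_hours": 3.0,
           "missing_sku": 0, "missing_price_pct": 3.4}
outcomes, action = evaluate(CONTRACT, metrics)
print(outcomes["missing_price_pct"], "->", action)  # warn -> deliver with warning
```

Keeping the warn and fail rules separate per check is what lets a 3.4% null rate ship with an alert while a wholesale extraction failure is quarantined.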


// 07 — FAQ

Common
questions.

About data quality frameworks, YAML checks, pipeline integration, and how DataFlirt ensures reliable data delivery.

How does Soda Core compare to Great Expectations?
Both are excellent data quality tools. Great Expectations is Python-heavy, highly customizable, and generates extensive documentation. Soda Core is YAML-first, heavily focused on SQL generation, and generally easier for analysts and non-engineers to write checks in. We find Soda's declarative syntax faster to deploy across hundreds of distinct scraping schemas.
Can Soda Core run on streaming data?
Soda Core is primarily designed for batch processing — it executes SQL queries against data at rest in a warehouse or database (like Snowflake, BigQuery, or PostgreSQL). For real-time streaming pipelines (like Kafka or Flink), you typically need a different validation architecture, though you can run Soda on micro-batches once they land in a staging table.
How do you handle dynamic thresholds for row counts?
Hardcoded row-count thresholds (e.g., `row_count > 50000`) break when catalogs grow or shrink naturally. Soda supports anomaly detection and change-over-time checks that compare the current scan against historical measurements; note that this history is stored in Soda Cloud, so these checks are not available with Soda Core alone. You can configure checks to fail if the current row count deviates by more than a specific percentage from the 7-day moving average, accommodating organic growth while catching catastrophic crawl failures.
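The arithmetic behind a moving-average threshold is simple enough to sketch in Python. This illustrates the calculation only and is not Soda's API: the function names, the 30% tolerance, and the sample history are assumptions.

```python
from statistics import mean


def row_count_deviation(current: int, history: list, window: int = 7) -> float:
    """Percent deviation of `current` from the mean of the last `window` counts."""
    recent = history[-window:]
    if not recent:
        raise ValueError("history must contain at least one row count")
    baseline = mean(recent)
    if baseline == 0:
        raise ValueError("baseline row count is zero")
    return abs(current - baseline) / baseline * 100


def check_row_count(current: int, history: list, max_deviation_pct: float = 30.0) -> str:
    """Pass while the deviation stays within tolerance, fail otherwise."""
    deviation = row_count_deviation(current, history)
    return "pass" if deviation <= max_deviation_pct else "fail"


# Hypothetical daily row counts for a slowly growing catalog.
history = [50_000, 50_500, 51_000, 51_200, 51_800, 52_000, 52_500]
organic = check_row_count(53_000, history)   # small drift from the baseline
collapsed = check_row_count(5_200, history)  # crawl returned ~10% of rows
print(organic, collapsed)  # pass fail
```

Because the baseline moves with the history, the same tolerance keeps working as the catalog grows, whereas a fixed `row_count > 50000` would need manual retuning.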
What happens when a check fails in production?
It depends on the severity. We categorize checks into warnings and errors. A warning (e.g., a slight uptick in null descriptions) sends an alert to our Slack channel but allows the data to be delivered. An error (e.g., 100% null prices or a missing primary key) halts the pipeline, quarantines the batch, and pages an engineer to fix the extractor.
Does running data quality checks add significant latency?
Minimal. Because Soda Core pushes the computation down to the data warehouse (using native SQL), it leverages the warehouse's distributed compute. For a dataset of a few million rows in BigQuery or Snowflake, a standard suite of 20 checks usually executes in under 15 seconds.
How does DataFlirt use data contracts?
We treat data contracts as binding agreements. Before we deliver a dataset to a client's S3 bucket or Snowflake instance, it must pass a versioned schema and quality check. If the target site changes and breaks our extraction, the contract fails, preventing us from silently delivering garbage data. We fix the scraper, backfill the data, and deliver a clean batch.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h