← Glossary / Few-Shot Extraction

What is Few-Shot Extraction?

Few-shot extraction is an AI-driven parsing technique where a Large Language Model (LLM) is given a target schema and a handful of labeled examples, then tasked with extracting structured data from raw, unstructured text or HTML. Unlike traditional CSS selectors that break when a site's layout changes, few-shot models rely on semantic understanding, making them highly resilient to DOM drift and ideal for long-tail scraping targets where writing custom parsers is economically unviable.

LLM ParsingSchema ExtractionSemantic ScrapingPrompt EngineeringResilience

// 02 — definitions

Semantics over
selectors.

Why brittle XPath rules are being replaced by probabilistic models that understand what a price looks like, regardless of where it sits in the DOM.

Ask a DataFlirt engineer →

TL;DR

Few-shot extraction uses LLMs to parse web data by providing 2–5 examples of the desired output in the prompt. It trades the low latency and compute cost of deterministic parsers (like BeautifulSoup) for extreme resilience against site changes. It is the backbone of modern long-tail scraping pipelines.

01Definition & structure

Few-shot extraction is a prompt engineering technique used to parse unstructured or semi-structured data. Instead of writing code to locate data, you provide an LLM with a system prompt, a strict JSON schema, and a "few" (typically 2 to 5) examples of input text mapped to the correct output. The model uses these examples to infer the extraction logic semantically, allowing it to parse highly variable layouts without breaking.

02How it works in practice

A typical pipeline fetches the raw HTML, strips out the boilerplate (scripts, navbars, footers) to save tokens, and injects the cleaned text into the LLM prompt alongside the few-shot examples. The model returns a JSON object. Because the model understands that "MSRP: $49" and "Price: 49 USD" mean the same thing, it successfully maps both to the price field in your schema, completely ignoring the underlying DOM structure.

03The context window constraint

The primary technical limitation of few-shot extraction is the context window. Raw HTML is incredibly token-dense. If you pass a 2MB DOM to an LLM, you will either hit the token limit, incur massive API costs, or suffer from "lost in the middle" syndrome where the model ignores data buried deep in the prompt. Effective few-shot extraction requires aggressive pre-processing to distill the page down to its semantic core before the LLM ever sees it.

04How DataFlirt handles it

We treat LLMs as a fallback, not the frontline. Our pipelines run fast, cheap deterministic extractors first. When our schema drift detection flags a broken selector, the pipeline automatically routes the failed record to our few-shot extraction cluster. The LLM parses the data, and we use that output to auto-heal the broken CSS selector for the next run. This hybrid architecture delivers 99.9% extraction success rates at a fraction of the cost of pure AI scraping.

05Did you know: JSON mode isn't enough

Simply turning on "JSON mode" in the OpenAI API does not guarantee schema adherence. JSON mode ensures the output is valid JSON, but it does not ensure the keys match your schema or that the values are the correct types. Few-shot examples are critical because they teach the model your specific type coercions—for example, showing it that "Out of Stock" should be extracted as null rather than a string.

// 03 — the economics

The cost of
semantic parsing.

LLM extraction introduces variable token costs and latency overheads that deterministic parsers don't have. DataFlirt models these constraints to route extraction tasks dynamically between regex, CSS selectors, and few-shot models.

Extraction Cost = C = (T_in × R_in) + (T_out × R_out)

Total cost per record based on input/output token counts and model rates. Standard LLM pricing model

Few-Shot Accuracy Gain = ΔA = A_few − A_zero

Providing 3 examples typically boosts schema adherence by 15-40% over zero-shot. DataFlirt AI benchmarks, 2025

DataFlirt Confidence Score = S = P(tokens) × SchemaMatch

Records scoring < 0.95 are routed to human-in-the-loop review. Internal validation SLO

// 04 — prompt execution

From raw HTML to
validated JSON.

A live trace of a few-shot extraction job parsing a real estate listing. The model receives the raw DOM text, 3 examples, and a strict JSON schema contract.

gpt-4o-miniJSON modeschema validation

edge.dataflirt.io — live

CAPTURED

// prompt assembly
task.id: "ext-re-092"
model: "gpt-4o-mini"
system_prompt: "Extract property details matching schema v4."
examples_loaded: 3 // few-shot context active
input_tokens: 4,102

// execution
status: "generating..."
latency: 840ms

// output payload
raw_response: "{ 'price': 450000, 'beds': 3, 'baths': 2, 'sqft': 1850 }"
output_tokens: 42

// deterministic validation
schema.match: true
type.price: integer
hallucination_check: passed // values exist in source HTML
pipeline.route: DELIVERED

// 05 — failure modes

Where semantic
parsing breaks.

LLMs are probabilistic. While they survive DOM changes, they introduce new failure modes that deterministic parsers avoid. Ranked by frequency across DataFlirt's AI extraction fleet.

PIPELINES MONITORED · 120+ AI-driven

AVG LATENCY · · · · 800–1200ms

UPDATED · · · · · · 2026-05-19

01

Hallucinated values

% of AI errors · Model infers missing data instead of returning null

02

Context window truncation

% of AI errors · Large DOMs exceed token limits, dropping data

03

Schema non-adherence

% of AI errors · Returning strings for integer fields

04

Rate limit bottlenecks

% of AI errors · Provider API limits throttling pipeline throughput

05

Example overfitting

% of AI errors · Model rigidly copies example formats over reality

// 06 — our architecture

Semantic resilience,

backed by deterministic validation.

DataFlirt does not blindly trust LLM outputs. We use few-shot extraction to navigate DOM chaos, but we pipe the resulting JSON through a strict, deterministic schema validation layer. If the LLM hallucinates a price that isn't in the source HTML, or coerces a string incorrectly, the record is quarantined. AI provides the flexibility; traditional engineering provides the safety.

AI Extraction Job Status

Live telemetry from a few-shot extraction worker processing unstructured news articles.

worker.id ai-ext-node-04

model.tier gpt-4o-minicost-optimized

records.processed 14,200

schema.pass_rate 99.1%nominal

hallucinations 12 records

avg.latency 910mswithin SLO

fallback.triggered 0 times

Stay ahead of the pipeline

Data engineering
intel, weekly.

Anti-bot shifts, scraping infrastructure updates, dataset delivery patterns, and business outcomes from our pipelines. Short, technical, no fluff.

// 07 — FAQ

Common
questions.

About LLM parsing, token economics, hallucination prevention, and how DataFlirt scales semantic extraction.

Ask us directly →

What is the difference between zero-shot and few-shot extraction? +

Zero-shot extraction gives the model a schema and asks it to parse the data with no examples. Few-shot provides 2 to 5 labeled examples of the input text and the exact desired JSON output. Few-shot drastically improves schema adherence, reduces hallucinations, and helps the model understand edge cases (like how to format specific dates or currencies).

Is few-shot extraction too expensive for high-volume scraping? +

If you run it on every record naively, yes. DataFlirt uses a hybrid approach: we use few-shot extraction to automatically generate and heal CSS selectors or regex patterns. The LLM runs only when the deterministic parser fails (schema drift). This gives you the resilience of AI with the unit economics of traditional scraping.

How do you prevent the LLM from hallucinating data? +

Through post-extraction validation. We run a deterministic check that verifies every extracted string or number actually exists in the raw source HTML. If the model returns a price of $500, but "500" does not appear in the input text, the record is flagged for quarantine. You cannot trust LLMs without a verification layer.

Does this replace CSS selectors entirely? +

No. CSS selectors are orders of magnitude faster and cheaper. Few-shot extraction shines for long-tail scraping (extracting data from 10,000 different websites where writing 10,000 custom parsers is impossible) or for highly unstructured text (like parsing product specs out of a dense paragraph).

How does DataFlirt handle context window limits on massive HTML pages? +

We don't send raw HTML to the LLM. We run a pre-processing step that strips boilerplate, scripts, styles, and irrelevant DOM nodes, converting the page to clean Markdown or simplified text. This reduces token consumption by 80-90%, lowering costs and keeping the payload well within the model's context window.

What about data privacy when sending scraped content to an LLM provider? +

We use enterprise API endpoints (like Azure OpenAI or AWS Bedrock) with zero-data-retention agreements. Your scraped data is never used to train foundational models. For highly sensitive pipelines, we deploy open-weight models (like Llama 3 or Mistral) on our own bare-metal infrastructure to guarantee data sovereignty.

$ dataflirt scope --new-project --target=few-shot-extraction READY

Tell us what
to extract.
We do the rest.

20-minute scoping call. Pilot dataset within the week. Production within two. Whether you need a one-off catalogue dump or a continuous feed across millions of records — we scope, build, and operate the pipeline.

Start a pipeline → View pricing

hello@dataflirt.com · Bengaluru · IST · typical reply < 4h