
What is Label Studio?

Label Studio is an open-source data annotation platform used to build ground-truth datasets for machine learning models. In scraping pipelines, it serves as the human-in-the-loop interface where raw scraped text, HTML, or images are manually tagged to train custom Named Entity Recognition (NER) or LLM-based extraction models. When DOM-based selectors become too brittle for complex unstructured pages, annotated datasets are what make AI-driven extraction possible.

Data Annotation · Human-in-the-loop · NER · Ground Truth · Fine-Tuning
// 02 — definitions

From raw text
to ground truth.

How unstructured scraped data is transformed into the labeled examples required to train robust extraction models.


TL;DR

Label Studio provides the UI and workflow for human annotators to highlight and classify entities within raw scraped data. It outputs structured JSON/XML that data engineering teams use to fine-tune extraction models, shifting the pipeline dependency from brittle CSS selectors to resilient semantic understanding.

01 Definition & structure
Label Studio is an open-source data labeling tool. It provides a configurable web interface where human annotators can view raw data (text, HTML, images, audio) and apply structured tags. In the context of web scraping, it is used to highlight specific fields—like product names, prices, or addresses—within unstructured text blocks. The output is a highly structured JSON file mapping the exact character offsets of each entity, which serves as the ground truth for training machine learning models.
02 The role in AI extraction
Traditional scraping relies on CSS selectors or XPath. When target websites are highly dynamic, lack consistent class names, or present data in free-text paragraphs, selectors fail. AI extraction solves this by using NLP models to "read" the text. Label Studio is the bridge between the old way and the new way: it is where you create the training data that teaches the AI what a "price" or "SKU" looks like in the wild.
03 Pre-annotation workflows
Starting from a blank slate is slow and expensive. Modern annotation pipelines use pre-annotation. A baseline model (like GPT-4 or a generic NER model) makes a first pass over the scraped data and populates Label Studio with its best guesses. The human annotator then acts as an editor, correcting mistakes rather than highlighting every word manually. This drastically reduces the time and cost required to build a dataset.
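A minimal sketch of what a pre-annotated task could look like, assuming a text NER project. The regex "model" is only a stand-in for a real baseline such as a zero-shot LLM, and the control names "label" and "text" are hypothetical: they must match the from_name and to_name in your own labeling config. The task-with-predictions shape is the one Label Studio accepts on import.

```python
import json
import re

# Hypothetical stand-in for a baseline model: two regex patterns.
PATTERNS = {
    "BRAND": re.compile(r"\b(?:Apple|Samsung|Google)\b"),
    "CAPACITY": re.compile(r"\b\d+\s?(?:GB|TB)\b"),
}

def baseline_predict(text):
    """Return sorted (start, end, label) guesses for one scraped string."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)

def to_preannotated_task(text, model_version="regex-baseline-v0"):
    """Wrap the guesses in a task that carries a 'predictions' list."""
    result = [
        {
            "from_name": "label",  # assumed name of the <Labels> control
            "to_name": "text",     # assumed name of the <Text> object
            "type": "labels",
            "value": {"start": s, "end": e, "text": text[s:e], "labels": [lab]},
        }
        for s, e, lab in baseline_predict(text)
    ]
    return {
        "data": {"text": text},
        "predictions": [{"model_version": model_version, "result": result}],
    }

task = to_preannotated_task(
    "Apple iPhone 15 Pro Max 256GB Natural Titanium - Unlocked"
)
print(json.dumps(task, indent=2))
```

The annotator then opens the task with BRAND and CAPACITY already highlighted and only has to add or fix the remaining fields.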
04 How DataFlirt handles it
We integrate Label Studio directly into our active learning pipelines. When our automated extraction layer encounters a record with a low confidence score, it automatically routes that specific record to our internal Label Studio instance. Our data operations team annotates the edge case, and the corrected record is immediately queued for the next model fine-tuning run. We don't label data we already understand; we only label the exceptions.
05 Common misconception
A frequent mistake is assuming that more labeled data always equals a better model. In practice, consistency often matters more than volume. If two annotators highlight the same field differently (e.g., one includes the currency symbol, the other doesn't), the model receives conflicting signals and performance degrades. 500 consistently annotated records can outperform 5,000 sloppily annotated ones.
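The currency-symbol case can be caught mechanically before training. The sketch below is one possible check, not a Label Studio feature: it assumes two annotations of the same text in the simplified "result" shape, and flags same-label spans that overlap but disagree on boundaries. The function names and the sample text are illustrative.

```python
def span_set(annotation):
    """Reduce an annotation to a set of (start, end, label) triples."""
    return {
        (r["value"]["start"], r["value"]["end"], r["value"]["labels"][0])
        for r in annotation["result"]
    }

def boundary_conflicts(text, ann_a, ann_b):
    """Return (label, text_a, text_b) for overlapping spans with different edges."""
    conflicts = []
    for sa, ea, la in sorted(span_set(ann_a)):
        for sb, eb, lb in sorted(span_set(ann_b)):
            overlaps = sa < eb and sb < ea
            if la == lb and overlaps and (sa, ea) != (sb, eb):
                conflicts.append((la, text[sa:ea], text[sb:eb]))
    return conflicts

text = "Price: $1,299.00 incl. tax"
# Annotator A includes the currency symbol, annotator B does not.
annotator_a = {"result": [{"value": {"start": 7, "end": 16, "labels": ["PRICE"]}}]}
annotator_b = {"result": [{"value": {"start": 8, "end": 16, "labels": ["PRICE"]}}]}

print(boundary_conflicts(text, annotator_a, annotator_b))
# [('PRICE', '$1,299.00', '1,299.00')]
```

A non-empty result is usually a sign that the labeling guideline needs an explicit rule for that field.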
// 03 — annotation metrics

How reliable is
the labeled data?

Annotation quality directly dictates extraction model performance. DataFlirt tracks inter-annotator agreement to ensure our training datasets don't introduce bias or noise into the AI extraction layer.

Inter-Annotator Agreement (Cohen's Kappa) = κ = (p_o − p_e) / (1 − p_e)
Measures agreement between two annotators on the same scraped record, where p_o is the observed agreement and p_e is the agreement expected by chance. κ > 0.8 is production-ready. (Standard ML metric)
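As a worked example, here is a small pure-Python sketch of the kappa calculation, assuming both annotators labeled the same ten tokens. The token-level framing and the sample labels are assumptions for illustration; the source does not say at which granularity agreement is measured.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # p_o: share of items where both annotators chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: agreement expected by chance from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

annotator_a = ["BRAND", "MODEL", "MODEL", "MODEL", "MODEL",
               "CAPACITY", "COLOR", "COLOR", "O", "O"]
annotator_b = ["BRAND", "MODEL", "MODEL", "MODEL", "O",
               "CAPACITY", "COLOR", "COLOR", "O", "O"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"{kappa:.3f}")  # 0.868: p_o = 0.90, p_e = 0.24
```

One disagreement in ten tokens still clears the 0.8 bar here, because the chance-agreement term p_e is low when labels are spread across several classes.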
Annotation Velocity = V = records_labeled / annotator_hours
Throughput metric. Pre-labeling with a zero-shot model increases V by ~3x. (DataFlirt pipeline ops)
Model Confidence Threshold = C = P(entity | context) > 0.92
Records scoring below C are automatically routed back to Label Studio for human review. (Active Learning Loop)
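The routing rule can be sketched in a few lines. This is an illustration under assumptions, not DataFlirt's implementation: the record layout and the "entity_scores" field are hypothetical, and scoring a record by its weakest entity is one possible aggregation among several.

```python
CONFIDENCE_THRESHOLD = 0.92  # the C value from the metric above

def record_confidence(record):
    """Score a record by its weakest entity (an assumed aggregation rule)."""
    return min(record["entity_scores"].values(), default=0.0)

def route(records, threshold=CONFIDENCE_THRESHOLD):
    """Split records into (auto_extracted, needs_review)."""
    auto_extracted, needs_review = [], []
    for record in records:
        if record_confidence(record) > threshold:
            auto_extracted.append(record)
        else:
            needs_review.append(record)
    return auto_extracted, needs_review

records = [
    {"id": 1, "entity_scores": {"BRAND": 0.99, "MODEL": 0.97}},
    {"id": 2, "entity_scores": {"BRAND": 0.98, "MODEL": 0.71}},
    {"id": 3, "entity_scores": {}},  # nothing extracted counts as zero confidence
]

auto_extracted, needs_review = route(records)
print([r["id"] for r in auto_extracted])  # [1]
print([r["id"] for r in needs_review])    # [2, 3]
```

Using the minimum score is deliberately conservative: one doubtful field sends the whole record to review.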
// 04 — the annotation payload

Exporting labeled
records for training.

A JSON export from Label Studio representing a single scraped product description, annotated for Named Entity Recognition (NER) to train a custom extraction model.

JSON export · NER task · ground truth
// label_studio_export.json
{
  "id": 49201,
  "data": {
    "text": "Apple iPhone 15 Pro Max 256GB Natural Titanium - Unlocked"
  },
  "annotations": [
    {
      "result": [
        { "value": { "start": 0, "end": 5, "labels": ["BRAND"] } },
        { "value": { "start": 6, "end": 23, "labels": ["MODEL"] } },
        { "value": { "start": 24, "end": 29, "labels": ["CAPACITY"] } },
        { "value": { "start": 30, "end": 46, "labels": ["COLOR"] } }
      ],
      "was_cancelled": false,
      "ground_truth": true
    }
  ]
}
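To show how such a record might feed an NER trainer, here is a hedged sketch that converts character-offset spans into token-level BIO tags. It assumes the simplified export shape shown here, whitespace tokenization, and end-exclusive offsets; the helper names are illustrative. Offsets are computed from the surface strings rather than typed by hand, so the example does not depend on any literal numbers.

```python
import re

def spans_from_task(task):
    """Return sorted (start, end, label) from the first non-cancelled annotation."""
    for annotation in task["annotations"]:
        if annotation.get("was_cancelled"):
            continue
        return sorted(
            (r["value"]["start"], r["value"]["end"], r["value"]["labels"][0])
            for r in annotation["result"]
        )
    return []

def to_bio(text, spans):
    """Whitespace-tokenize and tag each token B-/I-/O by span containment."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            if start <= match.start() and match.end() <= end:
                tag = ("B-" if match.start() == start else "I-") + label
                break
        tagged.append((match.group(), tag))
    return tagged

def span(text, surface, label):
    """Build one result entry with offsets derived from the surface string."""
    start = text.index(surface)
    return {"value": {"start": start, "end": start + len(surface), "labels": [label]}}

text = "Apple iPhone 15 Pro Max 256GB Natural Titanium - Unlocked"
task = {
    "data": {"text": text},
    "annotations": [{
        "result": [
            span(text, "Apple", "BRAND"),
            span(text, "iPhone 15 Pro Max", "MODEL"),
            span(text, "256GB", "CAPACITY"),
            span(text, "Natural Titanium", "COLOR"),
        ],
        "was_cancelled": False,
    }],
}

for token, tag in to_bio(task["data"]["text"], spans_from_task(task)):
    print(f"{token:10} {tag}")
```

A production pipeline would align spans to the model's own subword tokenizer instead of whitespace tokens; the containment logic stays the same.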
// 05 — annotation bottlenecks

Where labeling
projects stall.

Building ground truth is the most expensive phase of deploying AI extraction. These are the primary failure modes in annotation workflows.

AVG TIME PER RECORD: 12–45 seconds
AGREEMENT TARGET: κ > 0.85
UPDATED: 2026-05-19
01 Ambiguous labeling guidelines: annotators disagreeing on edge cases
02 Subject matter expertise gaps: complex B2B or medical data mislabeled
03 UI/UX friction for annotators: slow hotkeys and poor layout design
04 Imbalanced class distribution: rare fields lacking enough examples
05 Stale pre-annotations: human bias toward accepting bad model guesses
// 06 — active learning

Don't label everything,
only label what the model gets wrong.

DataFlirt uses an active learning loop to minimize human annotation time. We don't dump 10,000 random scraped records into Label Studio. Instead, a baseline model attempts extraction, and we only route the low-confidence predictions to human annotators. Once labeled, those edge cases are fed back into the training pipeline, continuously improving the model's accuracy while keeping human-in-the-loop costs strictly bounded.

Active Learning Queue

Live routing metrics for an AI-based invoice extraction pipeline.

pipeline.id extract-invoice-ai
records.processed 45,200
model.high_confidence 43,850 (auto-extracted)
model.low_confidence 1,350 (review required)
routed_to_label_studio 1,350 records
annotator.throughput 240 records/hr
model.retraining pending batch
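The figures in this panel can be turned into a quick cost estimate. The short calculation below uses only the numbers shown above; the variable names are illustrative.

```python
records_processed = 45_200
low_confidence = 1_350          # routed to Label Studio
throughput_per_hour = 240       # annotator.throughput

high_confidence = records_processed - low_confidence
review_rate = low_confidence / records_processed
annotator_hours = low_confidence / throughput_per_hour

print(high_confidence)            # 43850, matches model.high_confidence
print(f"{review_rate:.1%}")       # 3.0%
print(annotator_hours)            # 5.625
```

In other words, roughly 3% of the batch needs a human, which works out to under six annotator-hours at the stated throughput.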


// 07 — FAQ

Common
questions.

About data annotation, human-in-the-loop workflows, and how DataFlirt builds ground truth for AI extraction models.

Why use Label Studio instead of just writing regex or CSS selectors?
Regex and CSS selectors are brittle. When a target site redesigns its DOM, selectors break instantly. By annotating data in Label Studio and training an NLP or vision model, you teach the system to understand the semantic structure of the content. The model extracts the price because it looks like a price in context, not because it's inside a <div class="price-tag">.
How does Label Studio integrate with a scraping pipeline?
It sits parallel to the main pipeline. Raw scraped data is pushed to Label Studio via API. Annotators label the data, and the resulting JSON annotations are pulled into a training pipeline. Once the model is trained and deployed, it replaces the traditional parsing logic in the extraction layer.
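A rough sketch of the push step, assuming a self-hosted instance and Label Studio's task import endpoint under /api/projects/{id}/import. The base URL, project id, and token are placeholders, and the "Token" authorization scheme is an assumption that depends on your Label Studio version and token type. The request is only built here, not sent.

```python
import json
import urllib.request

def build_import_request(base_url, project_id, api_token, tasks):
    """Build (but do not send) a POST that imports tasks into a project."""
    url = f"{base_url.rstrip('/')}/api/projects/{project_id}/import"
    return urllib.request.Request(
        url,
        data=json.dumps(tasks).encode("utf-8"),
        headers={
            "Authorization": f"Token {api_token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Placeholder values for illustration only.
tasks = [{"data": {"text": "Apple iPhone 15 Pro Max 256GB Natural Titanium - Unlocked"}}]
request = build_import_request("http://localhost:8080", 1, "YOUR_API_TOKEN", tasks)

print(request.get_method(), request.full_url)
# Calling urllib.request.urlopen(request) would perform the actual import.
```

Keeping request construction separate from sending makes the payload easy to inspect and test before anything reaches the annotation queue.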
What is pre-annotation and why does it matter?
Pre-annotation uses an existing, imperfect model (like a zero-shot LLM) to guess the labels before a human sees the record. The annotator's job shifts from "highlight every entity from scratch" to "correct the model's mistakes." This typically increases annotation velocity by 300% to 500%, drastically reducing the cost of building ground truth.
How does DataFlirt handle data privacy during the annotation process?
We strip PII and sensitive identifiers before pushing records to the annotation UI. If the extraction task requires recognizing PII (e.g., extracting names from public directories), we use synthetic data generation or strict, air-gapped annotation environments with cleared personnel to ensure compliance with GDPR and CCPA.
Can Label Studio handle image and PDF scraping?
Yes. Label Studio supports bounding box and polygon annotation for computer vision tasks. For PDF scraping, we convert pages to images, annotate the spatial layout (tables, headers, paragraphs), and train Document AI models to extract structured data from visually complex, non-HTML documents.
How many annotated records do I need to train a custom extraction model?
It depends on the complexity of the schema and the base model. Fine-tuning a modern LLM for extraction might only require 50–200 high-quality examples. Training a smaller, faster BERT-based NER model on top of a pretrained encoder typically requires 1,000 to 5,000 annotated records to achieve production-grade F1 scores.