What is Feature Engineering?

Feature engineering is the process of transforming raw scraped data — HTML text, DOM structures, and metadata — into structured numerical representations that machine learning models can actually use. In the context of AI scraping, it bridges the gap between unstructured web chaos and predictable model inputs. If you feed raw HTML into a classifier without extracting the right signals, your model will drown in noise and compute costs will skyrocket.

AI Scraping · Data Transformation · NLP · Vectorization · ML Pipelines
// 02 — definitions

Signal from the noise.

How we turn messy, nested DOM elements and raw text into clean, mathematical inputs for downstream AI models.

TL;DR

Feature engineering takes raw scraped content and extracts specific, measurable attributes like text length, keyword frequency, or DOM depth. It is the critical step before vectorization or classification. Good features make simple models perform well; bad features make complex models fail.

01 Definition & structure

Feature engineering is the process of using domain knowledge to extract new variables (features) from raw data. In web scraping, the raw data is usually HTML, JSON, or unstructured text. Machine learning models require numbers. Feature engineering is the translation layer.

A standard feature engineering pipeline includes:

  • text.cleaning — lowercasing, stemming, stop-word removal
  • text.metrics — word counts, character counts, readability scores
  • dom.structure — tag density, link-to-text ratios, header hierarchies
  • categorical.encoding — turning text labels into binary vectors
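
As a rough illustration of three of the stages listed above, here is a minimal Python sketch using only the standard library. The stop-word list, the function names (`clean_text`, `text_metrics`, `one_hot`), and the sample inputs are assumptions made for illustration, not DataFlirt's implementation. Stemming and the `dom.structure` stage are left out to keep the example dependency-free and short.

```python
import re

# Tiny illustrative stop-word list (an assumption; real pipelines use larger ones).
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to"}


def clean_text(text: str) -> list[str]:
    """text.cleaning: lowercase, strip punctuation, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def text_metrics(text: str) -> dict[str, int]:
    """text.metrics: simple word and character counts."""
    return {"word_count": len(text.split()), "char_count": len(text)}


def one_hot(value: str, categories: list[str]) -> list[int]:
    """categorical.encoding: one binary column per known category."""
    return [1 if value == c else 0 for c in categories]


sample = "The battery is great"
tokens = clean_text(sample)
metrics = text_metrics(sample)
encoded = one_hot("in_stock", ["in_stock", "out_of_stock", "preorder"])
```

Each function returns plain Python values, so the outputs can be concatenated into a single numeric row per record.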

02 How it works in practice

Instead of passing a raw 5,000-word news article to a classifier, a feature engineering script parses the text first. It calculates TF-IDF scores for the top 100 keywords, counts the external links, and measures the average sentence length. The model receives a compact array of 102 numbers: 100 TF-IDF scores, the link count, and the average sentence length. This sharply reduces memory usage and, on narrow tasks, can let simpler algorithms such as Random Forests match or outperform much larger neural networks.
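
The 102-number vector described above could be assembled roughly as follows. This is a sketch under stated assumptions: the helper names (`build_vocab`, `tfidf_vector`, `featurize`), the two-document corpus, and the `news.example.com` domain are all hypothetical, and term frequency is normalized by document length, which is one common variant of TF.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())


def build_vocab(corpus: list[str], size: int = 100) -> list[str]:
    """Pick the `size` most frequent terms across the corpus."""
    counts = Counter(t for doc in corpus for t in tokenize(doc))
    return [term for term, _ in counts.most_common(size)]


def tfidf_vector(doc: str, corpus: list[str], vocab: list[str]) -> list[float]:
    """TF(t,d) * log(N / DF(t)) for each vocabulary term."""
    tokens = tokenize(doc)
    tf = Counter(tokens)
    doc_sets = [set(tokenize(d)) for d in corpus]
    vec = []
    for term in vocab:
        df = sum(1 for s in doc_sets if term in s)
        # Terms absent from the corpus get 0 to avoid division by zero.
        idf = math.log(len(corpus) / df) if df else 0.0
        vec.append((tf[term] / len(tokens) if tokens else 0.0) * idf)
    return vec


def count_external_links(hrefs: list[str], own_domain: str) -> int:
    """Count absolute links whose host differs from the article's own domain."""
    hosts = (urlparse(h).netloc for h in hrefs)
    return sum(1 for host in hosts if host and host != own_domain)


def avg_sentence_length(text: str) -> float:
    """Mean words per sentence, splitting on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def featurize(doc, hrefs, corpus, vocab, own_domain, vocab_size=100):
    tfidf = tfidf_vector(doc, corpus, vocab)
    # Pad so the vector length stays fixed when the corpus vocabulary is small.
    tfidf += [0.0] * (vocab_size - len(tfidf))
    return tfidf + [float(count_external_links(hrefs, own_domain)),
                    avg_sentence_length(doc)]


corpus = ["Markets rallied today. Tech stocks led the gains.",
          "The team won the final. Fans celebrated all night."]
hrefs = ["https://example.org/a", "/local", "https://news.example.com/b"]
vec = featurize(corpus[0], hrefs, corpus, build_vocab(corpus), "news.example.com")
```

Whatever the article length, `vec` always has 102 entries, which is what lets a fixed-width model consume it.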

03 Common feature types in scraping

Beyond standard NLP metrics, scraping provides unique structural features. The link-to-text ratio is a powerful feature for detecting spam or affiliate pages. The DOM depth of a text node often correlates with its importance — main content sits higher, while boilerplate sits deep in nested divs. Temporal features, like the time elapsed between a product's publication date and the scrape date, are critical for pricing models.
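
The structural and temporal features mentioned above can be sketched with the standard library's `html.parser`. The class name `StructureParser`, the sample markup, and the dates are assumptions for illustration, and the depth counter assumes reasonably well-formed HTML.

```python
from datetime import date
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change the depth counter.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class StructureParser(HTMLParser):
    """Track link text vs. total text and the nesting depth of each text node."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0
        self.in_link = 0
        self.text_chars = 0
        self.link_chars = 0
        self.text_depths: list[int] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.depth += 1
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        self.depth -= 1
        if tag == "a":
            self.in_link -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.text_chars += len(text)
        self.text_depths.append(self.depth)
        if self.in_link:
            self.link_chars += len(text)


html = ('<article><p>Great phone</p>'
        '<div><div><a href="/buy">Buy now</a></div></div></article>')
parser = StructureParser()
parser.feed(html)

# Share of visible text characters that sit inside links.
link_to_text_ratio = parser.link_chars / parser.text_chars

# Temporal feature: days between publication and scrape (illustrative dates).
days_since_published = (date(2024, 3, 15) - date(2024, 3, 1)).days
```

In this sample the review text sits at depth 2 while the link text sits at depth 4, matching the pattern described above.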

04 How DataFlirt handles it

We push feature engineering as close to the edge as possible. Our extraction workers do not just pull text; they compute baseline NLP features and structural metadata in memory before writing to the delivery sink. This means our clients receive datasets that are immediately ready for vectorization or model training, bypassing the usual messy Python pandas scripts required to clean raw scrape dumps.

05 The curse of dimensionality

A common mistake is generating too many features. If you use one-hot encoding on a scraped field with 10,000 unique values (like "author name"), you add 10,000 columns to your dataset. This leads to the curse of dimensionality, where the model overfits to noise and training time explodes. Good feature engineering is as much about dropping useless variables as it is about creating new ones.
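
To make the column explosion concrete, the sketch below counts the one-hot width of a synthetic "author name" field, then applies one simple mitigation: keeping only the most frequent values and bucketing the rest. Top-k bucketing is an illustrative technique chosen here, not one the text above prescribes, and the data and function names are hypothetical.

```python
from collections import Counter


def one_hot_width(values: list[str]) -> int:
    """One-hot encoding adds one column per distinct value."""
    return len(set(values))


def cap_cardinality(values: list[str], top_k: int,
                    other: str = "__other__") -> list[str]:
    """Keep the top_k most frequent values and bucket the rest together."""
    keep = {v for v, _ in Counter(values).most_common(top_k)}
    return [v if v in keep else other for v in values]


# Synthetic data: 10,000 distinct authors, two of them prolific.
authors = ([f"author_{i}" for i in range(10_000)]
           + ["author_0"] * 50 + ["author_1"] * 30)

width_before = one_hot_width(authors)
width_after = one_hot_width(cap_cardinality(authors, top_k=2))
```

Here the encoded width falls from 10,000 columns to 3, at the cost of losing the identity of rare authors.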

// 03 — the math

How we measure feature value.

Not all extracted features are useful. We evaluate feature importance mathematically to drop noise before it hits the model, reducing compute costs and improving accuracy across our AI pipelines.

TF-IDF Score = TF(t,d) × log(N / DF(t))
  Weighs term frequency against document frequency to find unique keywords. (Standard NLP metric)

Information Gain = H(T) - H(T|a)
  Measures how much a specific feature reduces entropy in the target variable. (Decision Tree algorithms)

DataFlirt Feature Density = non_null_features / (total_features × records)
  If density drops below 0.85, we prune the feature from the extraction schema. (Internal pipeline SLO)
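
A minimal sketch of the information gain and feature density formulas above, using only the standard library. The function names and sample data are hypothetical. Entropy uses log base 2, which is an assumption because the formulas above do not state a base.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    """Shannon entropy H(T) in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, target) -> float:
    """H(T) - H(T|a): target entropy minus its entropy after splitting on the feature."""
    n = len(target)
    groups: dict = {}
    for a, t in zip(feature, target):
        groups.setdefault(a, []).append(t)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(target) - conditional


def feature_density(records, feature_names) -> float:
    """non_null_features / (total_features * records)."""
    if not records or not feature_names:
        return 0.0
    non_null = sum(1 for r in records for f in feature_names
                   if r.get(f) is not None)
    return non_null / (len(feature_names) * len(records))


target = ["spam", "spam", "ham", "ham"]
gain_informative = information_gain([1, 1, 0, 0], target)   # splits target cleanly
gain_uninformative = information_gain([1, 0, 1, 0], target)  # tells us nothing

records = [{"price": 9.99, "rating": 4.5}, {"price": None, "rating": 3.0}]
density = feature_density(records, ["price", "rating"])
```

In this toy data the first feature recovers the full bit of target entropy and the second recovers none. The density of 0.75 falls under the 0.85 threshold quoted above.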
// 04 — feature extraction trace

Raw DOM to feature vector.

A live trace of a product review being parsed, cleaned, and engineered into a feature set for a sentiment classification model.

NLP · Text Cleaning · Vectorization
edge.dataflirt.io — live
CAPTURED
// input record
source.type: "html_review"
raw_text: "The battery life is terrible!!! Barely lasts 2 hrs."

// text normalization
text.lower: "the battery life is terrible!!! barely lasts 2 hrs."
text.no_punct: "the battery life is terrible barely lasts 2 hrs"

// feature extraction
feature.word_count: 9
feature.exclamation_count: 3 // high emotional valence
feature.has_numbers: 1
feature.sentiment_lexicon: -0.85

// n-gram generation
feature.bigrams: ["battery life", "life is", "is terrible", "barely lasts"]

// final vector output
vector.dense: [9.0, 3.0, 1.0, -0.85, 0.0, 1.0, ...]
pipeline.status: ready for inference
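
The normalization and counting steps in the trace above can be reproduced with a few lines of Python. This is a sketch, not the production worker: the function name `extract_review_features` is hypothetical, the sentiment lexicon score is omitted because the lexicon is not specified, and the bigram step emits every adjacent pair, whereas the trace lists only some of them (its filtering rule is not stated).

```python
import re
import string


def extract_review_features(raw_text: str) -> dict:
    lowered = raw_text.lower()
    # Remove ASCII punctuation, mirroring the text.no_punct step.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return {
        "text.lower": lowered,
        "text.no_punct": no_punct,
        "feature.word_count": len(tokens),
        # Counted on the raw text, before punctuation is stripped.
        "feature.exclamation_count": raw_text.count("!"),
        "feature.has_numbers": int(bool(re.search(r"\d", raw_text))),
        # All adjacent token pairs; no stop-word or frequency filtering.
        "feature.bigrams": [" ".join(p) for p in zip(tokens, tokens[1:])],
    }


features = extract_review_features(
    "The battery life is terrible!!! Barely lasts 2 hrs."
)
```

Counting exclamation marks before stripping punctuation matters: after normalization that signal is gone.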
// 05 — feature impact

Which features drive accuracy.

Ranked by their average contribution to model accuracy across our standard text classification pipelines. Structural features often punch above their weight compared to pure text.

MODELS EVALUATED: 45 active
FEATURE SPACE: 1,200+ dims
UPDATED: 2026-05-19

01 Domain-specific keywords (TF-IDF)
high impact · Core vocabulary drives 60% of classification accuracy

02 DOM structural depth
medium impact · Deeply nested text often indicates boilerplate

03 Sentiment lexicon scores
medium impact · Pre-computed valence speeds up downstream models

04 Text length and density
baseline · Filters out stub reviews and empty descriptions

05 Punctuation frequency
niche · Useful for spam and bot-generated content detection

// 06 — our pipeline

Compute features once, serve them everywhere.

DataFlirt computes standard NLP and structural features at the edge during the extraction phase. By the time the data hits your S3 bucket, it is already enriched with token counts, sentiment scores, and structural metadata. This saves your data science team hundreds of hours of repetitive preprocessing and ensures feature consistency across historical and real-time data feeds.

Feature extraction job

Live status of a feature engineering worker processing e-commerce reviews.

job.id: feat-eng-nlp-042
records.processed: 85,400
features.extracted: 12 per record
null_rate: 0.02%
compute.latency: 14ms / record
outliers.dropped: 142 records
output.status: writing to parquet

// 07 — FAQ

Common questions.

Common questions about feature engineering, text normalization, and preparing scraped data for machine learning models.

Why not just pass raw text directly to an LLM?
You can, but it is expensive and slow. LLMs charge by the token. If you pass raw HTML or uncleaned text, you pay for boilerplate, navigation menus, and script tags. Feature engineering strips the noise and extracts the core signals, allowing you to use cheaper, faster models like XGBoost or smaller fine-tuned transformers for classification tasks.
What is the difference between data cleaning and feature engineering?
Data cleaning fixes errors — removing duplicates, handling nulls, and standardizing date formats. Feature engineering creates new variables. Turning a date of birth into an "age" column is feature engineering. Extracting the count of exclamation marks from a review to measure anger is feature engineering.
How do you handle categorical variables from scraped data?
We use one-hot encoding for low-cardinality fields like "product category" or "stock status". For high-cardinality fields like "brand name", we use target encoding or embedding lookups to keep the feature space manageable and prevent the curse of dimensionality.
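
For reference, target encoding replaces each category with a single number derived from the target, so a high-cardinality field adds one column instead of thousands. The sketch below uses additive smoothing toward the global mean, which is one common variant and an assumption here; the function name, `smoothing` parameter, and sample data are hypothetical.

```python
from collections import defaultdict


def target_encode(categories, targets, smoothing=10.0):
    """Map each category to a smoothed mean of the target.

    Rare categories are pulled toward the global mean so that a brand
    seen once does not get an extreme encoding.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }


brands = ["acme", "acme", "acme", "zen"]
converted = [1, 1, 0, 0]  # illustrative binary target
encoding = target_encode(brands, converted, smoothing=1.0)
```

The mapping should be fitted on training data only and then applied to validation and live records; fitting it on the full dataset leaks target information.
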
Does DataFlirt provide pre-engineered features?
Yes. For our enterprise AI pipelines, we deliver datasets with pre-computed NLP features including token counts, language detection, TF-IDF vectors, and basic sentiment scores. This allows your data scientists to start modeling on day one instead of writing regex parsers.
How do you prevent data leakage during feature engineering?
Data leakage happens when information from outside the training dataset is used to create the model. In scraping pipelines, we ensure that time-series features (like rolling averages of product prices) are strictly computed using only data available prior to the timestamp of the target record.
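
A leakage-safe rolling average can be sketched as below: the feature for a record only sees prices observed strictly before that record's timestamp. The function name `prior_rolling_mean`, the window size, and the price history are assumptions for illustration.

```python
from datetime import date


def prior_rolling_mean(history, as_of, window=3):
    """Mean of the last `window` prices observed strictly before `as_of`.

    `history` is an iterable of (date, price) pairs in any order.
    Returns None when no earlier observation exists.
    """
    prior = sorted((d, p) for d, p in history if d < as_of)
    recent = [p for _, p in prior[-window:]]
    return sum(recent) / len(recent) if recent else None


history = [
    (date(2024, 1, 1), 10.0),
    (date(2024, 1, 2), 12.0),
    (date(2024, 1, 3), 14.0),
    (date(2024, 1, 4), 100.0),  # the target record's own price
]

# The Jan 4 price is excluded from its own feature.
feature_jan4 = prior_rolling_mean(history, as_of=date(2024, 1, 4))
feature_jan1 = prior_rolling_mean(history, as_of=date(2024, 1, 1))
```

The strict `<` comparison is the important detail: using `<=` would let the target record's own price into its feature.
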
What happens when the source website changes its DOM structure?
Structural features (like DOM depth or CSS class presence) will drift. Our schema validation layer monitors feature distributions in real-time. If the average DOM depth of a product description suddenly drops by 50%, the pipeline flags an anomaly and pauses delivery until the selectors are updated.
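
A drift check of the kind described above might look like the following sketch. It compares the mean DOM depth of a recent batch against a baseline and flags a drop of 50% or more. The function name `depth_drift`, the `max_drop` parameter, and the sample depths are hypothetical; a production monitor would likely compare full distributions rather than means alone.

```python
from statistics import mean


def depth_drift(baseline_depths, current_depths, max_drop=0.5):
    """Return (relative_change, flagged) for the mean DOM depth.

    relative_change is negative when depth has dropped; flagged is True
    when the drop is at least `max_drop` (0.5 means 50%).
    """
    base = mean(baseline_depths)
    cur = mean(current_depths)
    change = (cur - base) / base
    return change, change <= -max_drop


baseline = [8, 8, 9, 7]

# After a hypothetical redesign, descriptions sit much shallower in the DOM.
change_bad, flagged_bad = depth_drift(baseline, [4, 4, 4, 4])

# Ordinary batch-to-batch variation.
change_ok, flagged_ok = depth_drift(baseline, [7, 8, 8, 7])
```
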