
What is SAML Flow Scraping?

SAML Flow Scraping is the process of automating authentication across enterprise identity providers (IdPs) to access data behind single sign-on (SSO) walls. Unlike simple form logins, SAML requires capturing, decoding, and re-submitting base64-encoded XML assertions across multiple redirect hops while maintaining strict cookie state. For data pipelines targeting B2B portals, mastering this handshake is the difference between a reliable feed and a permanent 401 Unauthorized.

Auth Scraping · SSO · Identity Provider · Stateful · XML Assertions
// 02 — definitions

Navigating the
identity maze.

The mechanics of automating enterprise authentication flows without triggering anomalous login alerts at the IdP.


TL;DR

SAML flow scraping requires a stateful client to follow a strict sequence of HTTP redirects between a Service Provider (SP) and an Identity Provider (IdP). The scraper must capture the SAMLRequest, authenticate at the IdP, extract the base64-encoded SAMLResponse, and POST it back to the SP's Assertion Consumer Service (ACS) URL to establish a session.

01 Definition & structure
SAML Flow Scraping is the automation of the Security Assertion Markup Language (SAML) protocol to gain programmatic access to a target site. The flow involves three parties: the scraper (acting as the User Agent), the Service Provider (SP, the target site), and the Identity Provider (IdP, such as Okta or Microsoft Entra ID, formerly Azure AD). The scraper must follow a series of HTTP 302 redirects, submit credentials to the IdP, capture a base64-encoded XML SAMLResponse, and POST it to the SP's Assertion Consumer Service (ACS) URL to receive a valid session cookie.
02 How it works in practice
Because SAML relies heavily on browser redirects and hidden form submissions, naive HTTP scrapers often fail by dropping cookies across domain boundaries (e.g., moving from target.com to okta.com and back). A robust scraper must maintain a strict, unified cookie jar. When the IdP returns the HTML page containing the SAML assertion, the scraper must parse the DOM, extract the SAMLResponse and RelayState input values, and execute the final POST request exactly as a browser would.
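The extraction step described above can be sketched with Python's standard-library HTML parser. This is a minimal illustration, not DataFlirt's implementation: the `SamlFormParser` class, the `extract_saml_form` helper, and the sample page are all hypothetical, and the sketch assumes the IdP returns a single auto-submitting form whose hidden inputs carry `SAMLResponse` and `RelayState`.

```python
from html.parser import HTMLParser


class SamlFormParser(HTMLParser):
    """Collects the first form's action URL and every named input value."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attr.get("action")
        elif tag == "input" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_saml_form(html):
    """Return (acs_url, fields) from an IdP auto-submit page."""
    parser = SamlFormParser()
    parser.feed(html)
    parser.close()
    if "SAMLResponse" not in parser.fields:
        raise ValueError("no SAMLResponse input found in page")
    return parser.action, parser.fields


# Hypothetical IdP response page, shaped like a typical HTTP-POST binding form.
SAMPLE_PAGE = """
<html><body onload="document.forms[0].submit()">
<form method="post" action="https://b2b-portal.example.com/saml/acs">
  <input type="hidden" name="SAMLResponse" value="PHNhbWxwOlJlc3BvbnNlPg=="/>
  <input type="hidden" name="RelayState" value="https://b2b-portal.example.com/dashboard"/>
</form></body></html>
"""

acs_url, fields = extract_saml_form(SAMPLE_PAGE)
print(acs_url)
print(sorted(fields))
```

Reading the ACS URL from the form's `action` attribute, rather than hard-coding it, keeps the scraper aligned with whatever endpoint the IdP was configured to target.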
03 The challenge of IdP anti-bot systems
The hardest part of SAML scraping isn't the XML; it's the IdP. Enterprise Identity Providers deploy aggressive anti-bot measures to protect corporate credentials. If you attempt to script a login to Entra ID using basic HTTP requests, you will likely hit JavaScript challenges, conditional access policies, or CAPTCHAs. Overcoming this requires high-fidelity browser fingerprints and residential or static-ISP proxies that align with the expected geographic location of the user.
04 How DataFlirt handles it
We utilize a hybrid worker model. We spin up a fully headed, fingerprint-spoofed browser instance to navigate the IdP login, handle any MFA via injected TOTP seeds, and solve JS challenges. We intercept the network traffic at the browser level to catch the final ACS POST request. We then extract the resulting session cookies and immediately transfer them to our high-speed, stateless HTTP worker pool. This gives our clients the reliability of a real browser login with the throughput of a raw HTTP pipeline.
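The cookie handoff from the browser worker to the HTTP pool might look roughly like the sketch below. The cookie dictionary shape (`name`, `value`, `domain`, `expires`, with `-1` meaning a session cookie) is an assumption loosely modelled on what browser automation tools export, and `cookie_header_for` is a hypothetical helper, not a DataFlirt API.

```python
import time


def cookie_header_for(cookies, host, now=None):
    """Build a Cookie header from exported cookies that are valid for host."""
    now = time.time() if now is None else now
    pairs = []
    for c in cookies:
        domain = c["domain"].lstrip(".")
        # A cookie applies to its own domain and to any subdomain of it.
        in_scope = host == domain or host.endswith("." + domain)
        expires = c.get("expires")
        # Assumed convention: None or a non-positive value marks a session cookie.
        expired = expires is not None and expires > 0 and expires <= now
        if in_scope and not expired:
            pairs.append(f'{c["name"]}={c["value"]}')
    return "; ".join(pairs)


# Hypothetical cookies captured after the ACS POST completes.
captured = [
    {"name": "portal_session", "value": "987xyz",
     "domain": "b2b-portal.example.com", "expires": -1},
    {"name": "sid", "value": "00.abc123def",
     "domain": "idp.okta.com", "expires": -1},
    {"name": "stale", "value": "x", "domain": ".example.com", "expires": 1},
]

header = cookie_header_for(captured, "b2b-portal.example.com", now=1000)
print(header)  # portal_session=987xyz
```

Filtering by domain means the IdP session cookie never reaches the extraction workers, which only need the SP session.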
05 Did you know?
The RelayState parameter in a SAML flow is crucial for scrapers. It tells the Service Provider exactly which deep-link URL you were trying to access before you were redirected to the IdP. If your scraper drops or modifies the RelayState during the handshake, the SP will authenticate you but dump you on the default homepage, breaking your pipeline's URL routing logic.
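A small sketch of building the ACS POST body while passing RelayState through untouched. `build_acs_body` is a hypothetical helper and the field values are placeholders. The point it illustrates is that both values must be form-encoded (base64 can contain `+`, `/`, and `=`) but otherwise left exactly as the IdP issued them.

```python
from urllib.parse import parse_qs, urlencode


def build_acs_body(fields):
    """Form-encode the ACS POST body, forwarding RelayState unmodified."""
    body = {"SAMLResponse": fields["SAMLResponse"]}
    # RelayState is opaque to the client: forward it as-is when present.
    if "RelayState" in fields:
        body["RelayState"] = fields["RelayState"]
    return urlencode(body)


# Placeholder values; the SAMLResponse deliberately contains '+', '/' and '='.
fields = {
    "SAMLResponse": "a+b/c==",
    "RelayState": "https://b2b-portal.example.com/dashboard",
}

body = build_acs_body(fields)
round_trip = parse_qs(body)
print(body)
print(round_trip["RelayState"][0])
```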
// 03 — the auth model

How fragile is
a SAML handshake?

SAML flows are highly sensitive to state loss and timing. DataFlirt monitors these metrics to ensure automated SSO logins don't trigger IdP security policies or timeout windows.

SAML Session Success: $P(\text{success}) = \text{State}_{SP} \times \text{Auth}_{IdP} \times \text{State}_{ACS}$
Cookies must be perfectly preserved across all three domain transitions. Source: protocol requirement.
Assertion Expiry Window: $T_{\text{valid}} = \text{NotOnOrAfter} - \text{NotBefore}$
The XML assertion is typically only valid for 2–5 minutes. The POST must happen quickly. Source: SAML 2.0 Core Specification.
DataFlirt Auth Latency: $L_{\text{auth}} = T_{\text{redirects}} + T_{\text{idp\_login}} + T_{\text{acs\_post}}$
Target < 2.5s end-to-end to avoid triggering slow-login heuristics. Source: internal SLO.
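The assertion expiry window can be read straight from the decoded assertion. The sketch below assumes the assertion carries a `Conditions` element with both `NotBefore` and `NotOnOrAfter` attributes (both are optional in SAML 2.0, so a real parser would need fallbacks). The `validity_window` helper and the sample XML are illustrative only.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SAML_NS = {"saml": "urn:oasis:names:tc:SAML:2.0:assertion"}


def parse_instant(value):
    """Parse a SAML UTC timestamp such as 2026-05-19T10:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def validity_window(assertion_xml, now):
    """Return (t_valid_seconds, seconds_remaining) from saml:Conditions."""
    root = ET.fromstring(assertion_xml)
    cond = root.find(".//saml:Conditions", SAML_NS)
    if cond is None:
        raise ValueError("assertion has no Conditions element")
    not_before = parse_instant(cond.attrib["NotBefore"])
    not_on_or_after = parse_instant(cond.attrib["NotOnOrAfter"])
    t_valid = (not_on_or_after - not_before).total_seconds()
    remaining = (not_on_or_after - now).total_seconds()
    return t_valid, remaining


# Hypothetical assertion with a five-minute window.
SAMPLE = (
    '<saml:Assertion xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion">'
    '<saml:Conditions NotBefore="2026-05-19T10:00:00Z" '
    'NotOnOrAfter="2026-05-19T10:05:00Z"/>'
    "</saml:Assertion>"
)

now = datetime(2026, 5, 19, 10, 1, 30, tzinfo=timezone.utc)
t_valid, remaining = validity_window(SAMPLE, now)
print(t_valid, remaining)  # 300.0 210.0
```

A worker could compare `remaining` against its expected ACS POST latency and restart the flow rather than submit an assertion that is about to lapse.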
// 04 — what the network sees

A successful SAML
handshake trace.

A live trace of a scraper navigating an SP-initiated SAML 2.0 flow. Notice the domain hops and the critical handoff of the SAMLResponse payload.

SAML 2.0 · Okta IdP · Stateful HTTP
// 1. SP Initiated Login
GET https://b2b-portal.example.com/login
status: 302 Found
location: https://idp.okta.com/app/xyz/sso/saml?SAMLRequest=fVLLTsMw...

// 2. IdP Authentication
POST https://idp.okta.com/api/v1/authn
payload: { "username": "bot@df.com", "password": "***" }
status: 200 OK
set-cookie: sid=00.abc123def; Secure; HttpOnly

// 3. Extracting the Assertion
GET https://idp.okta.com/app/xyz/sso/saml
extract.SAMLResponse: "PHNhbWxwOlJlc3BvbnNl..."
extract.RelayState: "https://b2b-portal.example.com/dashboard"

// 4. ACS POST (The Critical Handoff)
POST https://b2b-portal.example.com/saml/acs
payload: SAMLResponse + RelayState
status: 302 Found
set-cookie: portal_session=987xyz; Secure; HttpOnly

// 5. Target Access
GET https://b2b-portal.example.com/api/data
status: 200 OK // CAPTURED
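For debugging step 1 of the trace, the SAMLRequest query parameter can be decoded to inspect what the SP is asking for. Under the SAML 2.0 HTTP-Redirect binding the XML is raw-DEFLATE compressed, then base64-encoded, then URL-encoded. The sketch below builds its own sample rather than using the truncated value from the trace. The helper names and the `idp.example.com` URL are hypothetical.

```python
import base64
import zlib
from urllib.parse import parse_qs, urlencode, urlsplit


def decode_saml_request(redirect_url):
    """Decode the SAMLRequest query parameter (HTTP-Redirect binding)."""
    query = parse_qs(urlsplit(redirect_url).query)
    raw = base64.b64decode(query["SAMLRequest"][0])
    # wbits=-15 selects raw DEFLATE with no zlib header or checksum.
    return zlib.decompress(raw, -15).decode("utf-8")


def encode_saml_request(xml):
    """Inverse helper, used here only to build a self-contained sample."""
    compressor = zlib.compressobj(wbits=-15)
    deflated = compressor.compress(xml.encode("utf-8")) + compressor.flush()
    return base64.b64encode(deflated).decode("ascii")


# Hypothetical AuthnRequest; real requests carry more attributes.
xml = (
    '<samlp:AuthnRequest '
    'xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol" ID="_example"/>'
)
url = "https://idp.example.com/sso/saml?" + urlencode(
    {"SAMLRequest": encode_saml_request(xml)}
)

print(decode_saml_request(url))
```

The SAMLResponse in step 3 travels over the HTTP-POST binding instead, where it is base64-encoded without the DEFLATE step.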
// 05 — failure modes

Where SAML flows
break down.

Ranked by frequency of authentication failures across DataFlirt's B2B pipelines. State loss during cross-domain redirects is the dominant issue, followed closely by IdP-level bot detection.

PIPELINES MONITORED: 140+ B2B targets
AUTH ATTEMPTS: 50k/day
UPDATED: 2026-05-19
01 Cross-domain cookie state loss: failing to persist cookies between SP and IdP
02 IdP anti-bot / MFA triggers: Okta/Entra flagging anomalous login velocity
03 Assertion timestamp expiry: NotOnOrAfter window missed during slow scrapes
04 Missing RelayState parameter: SP rejects the login due to missing context
05 IdP DOM layout changes: breaks credential submission on the login form
// 06 — our architecture

Headless when necessary,
protocol-level when possible.

DataFlirt handles SAML flows by isolating the authentication phase from the extraction phase. We use lightweight headless browsers solely to navigate complex IdP login screens (like Okta or Entra ID) and solve JavaScript challenges. Once the SAMLResponse is generated, we intercept the ACS POST, extract the resulting session cookies, and hand them off to high-concurrency HTTP workers for the actual data extraction. This hybrid approach keeps compute costs low while maintaining enterprise-grade session stability.

saml-auth-worker.log

Live state of a hybrid SAML authentication worker.

sp.target b2b-portal.example.com
idp.provider Okta
auth.method headless_hybrid
mfa.status bypassed_via_totp_seed
acs.post intercepted
session.cookie portal_session_id
handoff.status transferred to http_pool


// 07 — FAQ

Common
questions.

About SAML mechanics, legal boundaries, MFA handling, and how DataFlirt automates enterprise SSO at scale.

What is the difference between SAML and OAuth scraping?
SAML passes base64-encoded XML assertions through browser redirects and form POSTs, typically ending in a session cookie. OAuth 2.0 issues access tokens (often, though not always, JSON Web Tokens) that the client sends in the Authorization header of API requests. Because SAML is tied to browser behavior and DOM interactions, it is harder to automate with plain HTTP requests than OAuth.
Is it legal to scrape data behind a SAML login?
Authorized access is the governing principle. You must have legitimate, authorized credentials to the target system and adhere to the Service Provider's Terms of Service. DataFlirt only automates SAML flows for clients who own the underlying accounts and have the legal right to access the data they are extracting.
How do you handle MFA (Multi-Factor Authentication) during a SAML flow?
We inject TOTP (Time-Based One-Time Password) seeds directly into our authentication workers. When the IdP prompts for a 2FA code, the worker generates a valid token on the fly and submits it. For IdPs that rely on push notifications, we utilize session persistence to keep our worker devices "trusted" for 30-day windows.
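Generating a code from a TOTP seed follows RFC 6238. The sketch below is a generic standard-library implementation using HMAC-SHA1, checked against the RFC's published test vectors. It is not DataFlirt's worker code, and the seed shown is the RFC's public test secret, not a real credential.

```python
import base64
import hashlib
import hmac
import struct


def totp(secret_b32, at_time, step=30, digits=6):
    """RFC 6238 time-based one-time password using HMAC-SHA1."""
    key = base64.b32decode(secret_b32, casefold=True)
    # The moving factor is the number of whole time steps since the epoch.
    counter = struct.pack(">Q", int(at_time) // step)
    digest = hmac.new(key, counter, hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): low nibble of the last byte picks the offset.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# RFC 6238 test secret: the ASCII string "12345678901234567890".
SECRET = base64.b32encode(b"12345678901234567890").decode("ascii")

print(totp(SECRET, 59, digits=8))  # 94287082
```

In production the second argument would be the current Unix time, and most IdPs expect the default six digits with a 30-second step.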
Why not just use a headless browser for the entire scrape?
Cost and throughput. Headless browsers consume massive amounts of RAM and CPU. Using a browser just to complete the SAML handshake, and then passing the resulting session cookie to a lightweight HTTP client (like aiohttp or Go's net/http), is up to 20x faster and significantly cheaper at scale.
What happens when the SAML assertion expires?
The SAML assertion itself (the XML payload) is only valid for a few minutes—just long enough to complete the login. The resulting session cookie, however, usually lasts hours or days. We monitor the cookie's validity; when the SP returns a 401, our orchestration layer automatically spins up a browser worker to execute a fresh SAML flow and update the cookie jar.
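The re-authentication loop described above can be sketched as a small wrapper. Everything here is hypothetical: `fetch` stands in for an HTTP worker call returning `(status, body)`, and `reauthenticate` stands in for whatever triggers a fresh SAML flow and returns a new session. Neither is a DataFlirt API.

```python
def fetch_with_reauth(fetch, reauthenticate, session, max_reauth=1):
    """Call fetch(session); on a 401, refresh the session and retry."""
    for attempt in range(max_reauth + 1):
        status, body = fetch(session)
        if status != 401:
            return status, body, session
        if attempt < max_reauth:
            # Session expired at the SP: run a fresh login and try again.
            session = reauthenticate()
    return status, body, session


# Stubs standing in for the HTTP worker and the browser-based login.
calls = {"reauth": 0}


def fake_fetch(session):
    return (200, "data") if session == "fresh" else (401, "")


def fake_reauth():
    calls["reauth"] += 1
    return "fresh"


result = fetch_with_reauth(fake_fetch, fake_reauth, "stale")
print(result)  # (200, 'data', 'fresh')
```

Capping the number of re-authentications matters here, since repeated login attempts in a tight loop are exactly the kind of velocity an IdP may flag.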
How does DataFlirt prevent IdP lockouts from automated logins?
Enterprise IdPs like Entra ID monitor for "impossible travel" and anomalous IP ranges. We route all IdP authentication requests through dedicated static proxy IPs that match the geographic profile of the client's actual organization. This ensures the login context looks identical to a normal employee logging in from a corporate VPN.

hello@dataflirt.com  ·  Bengaluru  ·  IST  ·  typical reply < 4h