Web Scraping With Python Using Beautiful Soup

Updated 11 Jun 2026
Author: Nishant

Founder of DataFlirt.com. Logging web scraping shhhecrets to help data engineering and business analytics/growth teams extract and operationalise web data at scale.

TL;DR: Quick summary
  • Beautiful Soup parses HTML; it does not fetch it. Pair it with Requests for static pages, and reach for a headless browser only when content renders through JavaScript.
  • The tutorial code is runnable as written against a public practice site, covering a first parse, find_all versus CSS selectors, pagination, and saving results to CSV.
  • Polite scraping is what keeps scripts alive. Delays between requests, honest headers, robots.txt awareness, and logged-off collection prevent most blocks before they happen.
  • Beautiful Soup stops being enough at JavaScript rendering, aggressive anti-bot systems, and multi-source scheduled pipelines. Those are framework and infrastructure problems.
  • DataFlirt builds production scraping on the same open-source stack this tutorial teaches, so the path from your first script to a maintained data feed is a scoping call, not a rewrite.

Every web scraping tutorial shows you ten lines of Beautiful Soup that print a page title, and then you hit a real website and nothing works: the prices are in nested tags, the results span forty pages, and your tenth request comes back blocked. This guide is built for that gap. The code below runs as written, scrapes a site that exists for exactly this practice, and covers the parts most tutorials skip: pagination, polite request patterns, and the honest boundary where Beautiful Soup stops being the right tool.

A word on why this matters beyond the exercise. Python web scraping is how teams collect the pricing, listing, and review data their businesses run on, and DataFlirt builds production pipelines on this same open-source foundation every week. Learning it properly means you can build small things yourself and scope big things intelligently.

What Beautiful Soup does, and what it does not

Beautiful Soup is an HTML parser. You hand it a page’s HTML as a string, and it gives you a searchable tree you can query by tag, attribute, or CSS selector. That is the whole job, and it does the job superbly, including on the broken, mis-nested HTML that real websites ship.
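
A minimal sketch of that idea, run against a made-up HTML fragment (the markup, class names, and data-sku values are illustrative assumptions, not taken from a real site). It shows the three query styles mentioned above, and that an unclosed tag still parses into a usable tree:

```python
from bs4 import BeautifulSoup

# Illustrative fragment; the first card's <p> is deliberately left unclosed.
html = """
<div class="card" data-sku="A1"><h3><a href="/a1">First</a></h3><p class="price">£10.00
</div>
<div class="card" data-sku="B2"><h3><a href="/b2">Second</a></h3><p class="price">£12.50</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Query by tag name.
titles = [a.get_text(strip=True) for a in soup.find_all("a")]

# Query by attribute value.
second_card = soup.find("div", attrs={"data-sku": "B2"})

# Query by CSS selector.
prices = [p.get_text(strip=True) for p in soup.select("div.card p.price")]

print(titles)
print(second_card.a["href"])
print(prices)
```

Nothing here touches the network: the parser only ever sees the string you hand it.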

The two things it never does

Beautiful Soup does not fetch pages, so it always pairs with an HTTP library, and Requests is the standard partner. It also does not execute JavaScript, which means content rendered in the browser is invisible to it. Both limits matter for planning a project, and both have clean workarounds covered later. If you want the wider context first, our overview of what web scraping is sets the stage; this guide stays hands-on.

Where it fits in a real stack

In production, Beautiful Soup is usually the parsing layer inside something bigger: Scrapy handles crawling at volume, Playwright handles JavaScript pages, and Beautiful Soup or Scrapy’s own parsel handles extraction. DataFlirt’s pipelines use exactly these tools, which is worth knowing because the skills in this tutorial transfer directly to how professional web scraping is actually built.

Setting up Python for web scraping

Two commands of setup prevent the dependency conflicts that derail beginners. Always work inside a virtual environment with pinned versions, the same discipline DataFlirt enforces on client pipelines so a run today behaves like a run in six months.

Create and activate the environment

python -m venv scraping-env
source scraping-env/bin/activate    # macOS/Linux
# scraping-env\Scripts\activate     # Windows

Install pinned dependencies

pip install beautifulsoup4==4.15.0 requests==2.34.2

That is the entire stack for static pages: Requests fetches, Beautiful Soup parses. The built-in html.parser backend keeps the install lean; you can add lxml later if parsing speed ever becomes the bottleneck, which for most projects it does not.

Your first scrape: parsing a product grid

The target is books.toscrape.com, a public site built specifically for scraping practice, so you can run everything here without legal or ethical worry. It mimics an ecommerce catalog: product cards, prices, ratings, and fifty pages of pagination (1,000 books at twenty per page).

Fetch the page politely

import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0 Safari/537.36"
    )
}

response = requests.get(
    "https://books.toscrape.com/", headers=HEADERS, timeout=20
)
response.raise_for_status()
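# Pass bytes so Beautiful Soup reads the charset declared in the page itself;
# response.text can mis-decode the £ sign when the server omits a charset.
soup = BeautifulSoup(response.content, "html.parser")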

The user-agent string identifies your client, the timeout stops a dead server from hanging your script, and raise_for_status() fails loudly on errors instead of parsing an error page as if it were data.

Inspect the page before writing code

Open the target in your browser, right-click a book title, and choose Inspect. The elements panel shows the structure your selectors must match: each book sits in an article tag with the class product_pod, the title lives in a link inside an h3, and the price sits in a p tag classed price_color. Five minutes of inspection beats an hour of guessing, and it is the same first step DataFlirt’s engineers take on every new target. One caution applies, covered in the debugging section below: browsers tidy HTML before displaying it, so always confirm your selector against the raw response too.

Extract fields with CSS selectors

Each book on the page lives in an article tag with the class product_pod. The pattern below grabs every card, then pulls fields from inside each one:

books = []
for card in soup.select("article.product_pod"):
    link = card.select_one("h3 a")
    price = card.select_one("p.price_color")
    availability = card.select_one("p.availability")
    books.append({
        "title": link["title"],
        "url": link["href"],
        "price": price.get_text(strip=True).lstrip("£"),
        "in_stock": "In stock" in availability.get_text(),
    })

print(f"Scraped {len(books)} books")
print(books[0])

Two habits in this snippet pay off everywhere: scope your selectors to a parent card so fields from different products never mix, and clean values (strip=True, stripping the currency symbol) at extraction time rather than in a painful pass later.
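
If you want numbers rather than strings in the output, the cleaning step can go one notch further. The helper below is a hypothetical sketch, not part of the tutorial script: parse_price is a name invented here, and it assumes a dot as the decimal separator and a comma as the thousands separator, which will not hold for every locale.

```python
import re
from decimal import Decimal


def parse_price(raw):
    """Return the numeric part of a price string as a Decimal, or None.

    Assumes '.' is the decimal separator and ',' groups thousands.
    """
    match = re.search(r"\d+(?:\.\d+)?", raw.replace(",", ""))
    return Decimal(match.group()) if match else None


print(parse_price("£51.77"))
print(parse_price("£1,299.00"))
print(parse_price("Price on request"))
```

Decimal avoids the rounding surprises of float when prices are later summed or compared, and returning None for unparseable text keeps one odd listing from crashing the whole run.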

find_all versus select: choosing your selector style

Beautiful Soup gives you two query styles that reach the same elements. find() and find_all() match tags and attributes with Python arguments, while select() and select_one() take CSS selectors. Neither is wrong; they suit different ways of thinking about a page.

When find_all wins

find_all shines when your matching logic is programmatic: every a tag with an href containing a substring, or tags filtered by a custom function. It reads like Python because it is Python:

links = soup.find_all("a", href=True)
price_tags = soup.find_all("p", class_="price_color")

If your team already uses XPath elsewhere, note that Beautiful Soup does not support it natively; that is one reason heavier crawls graduate to Scrapy, which speaks both.
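
To make the programmatic matching described above concrete, here is a small sketch on an invented fragment (the URLs and the sale class are assumptions for illustration). find_all accepts a compiled regular expression for an attribute value, and a function for anything more involved:

```python
import re

from bs4 import BeautifulSoup

html = """
<a href="/catalogue/book-1.html">Book one</a>
<a href="/category/travel.html">Travel</a>
<a name="top">No href here</a>
<p class="price_color">£10.00</p>
<p class="price_color sale">£8.00</p>
"""
soup = BeautifulSoup(html, "html.parser")

# A compiled regex is searched against each tag's href value.
catalogue_links = soup.find_all("a", href=re.compile(r"/catalogue/"))


# A function receives each tag and returns True for the ones to keep.
def is_sale_price(tag):
    classes = tag.get("class", [])
    return tag.name == "p" and "price_color" in classes and "sale" in classes


sale_prices = soup.find_all(is_sale_price)

print([a["href"] for a in catalogue_links])
print([p.get_text(strip=True) for p in sale_prices])
```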

When CSS selectors win

select wins the moment your pattern is structural: a link inside a heading inside a product card is one expression, article.product_pod h3 a, instead of three chained finds. Anyone who has written CSS can read it, and selectors copied from browser devtools drop straight in. Most DataFlirt parsers standardize on CSS selectors for exactly that maintainability, switching styles only when a match needs logic CSS cannot express.
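
For comparison, the sketch below writes the same extraction both ways on a simplified stand-in for the product grid (the markup is trimmed and the promo block is an invented decoy). Both styles return the same titles; the difference is how much code carries the structure.

```python
from bs4 import BeautifulSoup

html = """
<article class="product_pod"><h3><a href="a.html" title="Book A">Book A</a></h3></article>
<article class="product_pod"><h3><a href="b.html" title="Book B">Book B</a></h3></article>
<aside><h3><a href="promo.html" title="Promo">Promo</a></h3></aside>
"""
soup = BeautifulSoup(html, "html.parser")

# Style 1: chained find calls, with a guard at each level.
chained = []
for article in soup.find_all("article", class_="product_pod"):
    heading = article.find("h3")
    link = heading.find("a") if heading else None
    if link:
        chained.append(link["title"])

# Style 2: one CSS selector expressing the same structure.
selected = [a["title"] for a in soup.select("article.product_pod h3 a")]

print(chained)
print(selected)
```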

Extracting more than text: attributes and detail pages

Real datasets need more than visible text. Ratings hide in class names, canonical IDs hide in attributes, and half the fields you want live one click deeper on the product’s own page. Both patterns take a few lines once you know where to look.

Reading data out of attributes

On books.toscrape.com, the star rating never appears as text. It is encoded in a class name, star-rating Three, which is a common pattern on real sites too:

RATINGS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

rating_tag = card.select_one("p.star-rating")
classes = rating_tag.get("class", [])
stars = next((RATINGS[c] for c in classes if c in RATINGS), None)

tag.get("class") returns a list, because class is a multi-valued attribute, while tag["href"] or tag.get("data-sku") returns a plain string for single-valued attributes. Bracket access raises KeyError when the attribute is missing; .get() returns None instead. Attributes are often more reliable than display text because they exist for the site’s own JavaScript, so redesigns change them less often, a stability trick DataFlirt’s parsers exploit wherever a target allows it.
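
A quick way to see what each kind of attribute lookup hands back, using a one-line invented tag (the data-sku value is an assumption for illustration):

```python
from bs4 import BeautifulSoup

html = '<p class="star-rating Three" data-sku="abc-123">rating</p>'
tag = BeautifulSoup(html, "html.parser").p

classes = tag.get("class")          # class can hold several values, so this is a list
sku = tag.get("data-sku")           # an ordinary attribute comes back as a string
missing = tag.get("data-missing")   # an absent attribute comes back as None

try:
    tag["data-missing"]             # bracket access on an absent attribute raises
    bracket_raised = False
except KeyError:
    bracket_raised = True

print(classes, sku, missing, bracket_raised)
```

Prefer .get() for anything optional and brackets only for attributes you expect on every matching tag.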

Following through to detail pages

Listing pages rarely carry every field, so the standard pattern is two-stage: collect URLs from the grid, then visit each detail page for the rest.

def scrape_detail(url):
    response = requests.get(url, headers=HEADERS, timeout=20)
    response.raise_for_status()
    # Pass bytes so Beautiful Soup picks up the charset declared in the page.
    soup = BeautifulSoup(response.content, "html.parser")
    # Map each spec-table header to its value; skip rows missing either cell.
    table = {
        row.th.get_text(strip=True): row.td.get_text(strip=True)
        for row in soup.select("table.table-striped tr")
        if row.th and row.td
    }
    # The description is optional, so look it up once and guard against None.
    description = soup.select_one("#product_description + p")
    return {
        "upc": table.get("UPC"),
        "availability": table.get("Availability"),
        "description": (
            description.get_text(strip=True) if description else None
        ),
    }

The dictionary comprehension turns the spec table into a lookup keyed by header, which survives row reordering. Mind the request count, though: a 1,000-product catalog now means 1,050 requests, so the politeness delay from the pagination section applies doubly here, and it is the main reason detail-page crawls are where DataFlirt clients most often graduate from scripts to managed pipelines.
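
The request arithmetic is worth running before you start a detail crawl. This is a back-of-envelope sketch: estimate_crawl is a name invented here, it assumes twenty products per listing page, and it counts only the deliberate delay, not the time each response takes.

```python
import math


def estimate_crawl(products, per_page, delay_seconds):
    """Return (total requests, minutes spent waiting between requests)."""
    listing_pages = math.ceil(products / per_page)
    total_requests = listing_pages + products  # one request per detail page
    minutes = total_requests * delay_seconds / 60
    return total_requests, minutes


total, minutes = estimate_crawl(products=1000, per_page=20, delay_seconds=1.5)
print(f"{total} requests, about {minutes:.0f} minutes of waiting")
```

At a 1.5 second delay the 1,050 requests cost roughly 26 minutes of waiting alone, which is usually the moment to decide whether every detail page is really needed.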

Debugging when your selector returns nothing

Every scraper author hits the moment where a selector that works in browser devtools returns None in Python. The causes are few and checkable in order.

First, confirm the data is in the response at all: print("51.77" in response.text) or search for another known string, leaving out symbols such as £ that can be mis-decoded. If it is absent, the content is JavaScript-rendered and no selector will ever find it, which sends you to the headless browser section below. Second, remember that browsers repair HTML before you inspect it; devtools shows tbody tags that the raw source may not contain, so selectors copied from the inspector can reference elements Beautiful Soup never sees. Print soup.prettify()[:2000] and write selectors against what is actually there. Third, watch the Python-specific traps: the keyword is class_ with an underscore in find_all, and chaining like tag.select_one(...).get_text() raises AttributeError the moment a field is missing, so guard optional fields with a None check the way the detail-page code above does. DataFlirt’s code reviews flag unguarded chaining more than any other defect in inherited scrapers, because it works in testing and crashes on page 37.
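
Those checks can be wrapped into two small helpers. Both names below, diagnose and text_or_none, are invented for this sketch, and the table fragment is a stand-in for a raw response that lacks the tbody a browser would add.

```python
from bs4 import BeautifulSoup


def diagnose(raw_html, selector, known_text):
    """Report whether the text is in the raw HTML and how many nodes match."""
    soup = BeautifulSoup(raw_html, "html.parser")
    return {
        "text_in_response": known_text in raw_html,
        "selector_matches": len(soup.select(selector)),
    }


def text_or_none(parent, selector):
    """Guarded replacement for parent.select_one(selector).get_text()."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else None


raw = "<table><tr><td class='price'>51.77</td></tr></table>"

# A selector copied from devtools often includes tbody, which this raw HTML lacks.
from_devtools = diagnose(raw, "table > tbody > tr > td.price", "51.77")
from_raw_html = diagnose(raw, "table tr td.price", "51.77")

soup = BeautifulSoup(raw, "html.parser")
print(from_devtools)
print(from_raw_html)
print(text_or_none(soup, "td.price"), text_or_none(soup, "td.discount"))
```

When text_in_response is True but selector_matches is 0, the selector is the problem; when both are empty, the data is not in the response and a different selector will not help.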

Crawling every page: pagination done right

One page is a demo; the dataset lives across all fifty. The pattern that generalizes to most catalogs is to follow the “next” link until it stops existing, rather than guessing URL numbers, because sites change their page counts and URL schemes without notice.

The full multi-page scraper

import csv
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://books.toscrape.com/"
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0 Safari/537.36"
    )
}


def parse_page(soup, page_url):
    for card in soup.select("article.product_pod"):
        link = card.select_one("h3 a")
        price = card.select_one("p.price_color")
        availability = card.select_one("p.availability")
        yield {
            "title": link["title"],
            "url": urljoin(page_url, link["href"]),
            "price": price.get_text(strip=True).lstrip("£"),
            "in_stock": "In stock" in availability.get_text(),
        }


def scrape_catalog(start_url):
    url = start_url
    while url:
        response = requests.get(url, headers=HEADERS, timeout=20)
        response.raise_for_status()
        # Bytes in, so the charset declared in the page decides the decoding.
        soup = BeautifulSoup(response.content, "html.parser")

        yield from parse_page(soup, url)

        next_link = soup.select_one("li.next a")
        url = urljoin(url, next_link["href"]) if next_link else None
        time.sleep(1.5)  # be polite between requests


if __name__ == "__main__":
    with open("books.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["title", "url", "price", "in_stock"]
        )
        writer.writeheader()
        for row in scrape_catalog(BASE_URL):
            writer.writerow(row)
    print("Done: books.csv written")

The details that prevent silent bugs

urljoin converts relative links into absolute URLs, which matters because this site’s “next” link is relative and changes shape after page one. The generator structure streams rows to CSV as they arrive, so a crash on page fifteen still leaves fourteen pages saved. And the time.sleep(1.5) is not decoration; pacing is the cheapest anti-block measure that exists. When clients bring DataFlirt a broken in-house scraper, missing absolute-URL handling and missing pacing are the two most common defects in it.
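
The urljoin point is easiest to see with the two link shapes side by side. The href values below are modeled on the pattern described above (one shape on the home page, another inside the catalogue folder) and should be treated as illustrative:

```python
from urllib.parse import urljoin

# On the home page the "next" href includes the folder.
first = urljoin("https://books.toscrape.com/", "catalogue/page-2.html")

# From page two onward the href is relative to the current folder.
second = urljoin(first, "page-3.html")

# Naive concatenation against the site root loses the folder.
naive = "https://books.toscrape.com/" + "page-3.html"

print(first)   # https://books.toscrape.com/catalogue/page-2.html
print(second)  # https://books.toscrape.com/catalogue/page-3.html
print(naive)   # https://books.toscrape.com/page-3.html
```

Resolving every link against the URL of the page it came from, as scrape_catalog does, is what keeps the second case correct.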

Scraping politely so you do not get blocked

Most beginner blocks are self-inflicted. A site seeing a hundred requests per second from one IP with a default python-requests user agent will respond with rate limiting, and it is right to. Polite scraping keeps you under those thresholds and on defensible ground.

The baseline practices

Check the site’s robots.txt for crawl preferences, keep one to a few seconds between requests, send a realistic user agent, and collect logged-off public pages rather than anything behind an account. These overlap heavily with the legal safe zone: US courts have held that scraping public, logged-off data does not violate the CFAA, while logins and terms of service create real risk. Our guide to crawling ethics covers the fuller picture, and for commercial projects, a review with qualified counsel is always the right move. DataFlirt bakes these defaults into every engagement, including a low footprint on target servers.
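
The robots.txt check can be automated with the standard library. The sketch below parses an invented rule set inline so it runs without a network call; the rules, the example.com URLs, and the my-scraper user agent are all assumptions for illustration. Against a real site you would call set_url() and read() on the parser instead of parse().

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-scraper"  # hypothetical name for your client

# Invented rules standing in for a site's real robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

allowed = rp.can_fetch(USER_AGENT, "https://example.com/catalogue/page-2.html")
blocked = rp.can_fetch(USER_AGENT, "https://example.com/private/report.html")
delay = rp.crawl_delay(USER_AGENT)  # None when the file sets no Crawl-delay

print(allowed, blocked, delay)
```

A reasonable default is to skip any URL where can_fetch returns False and to sleep for the larger of your own delay and the site's stated crawl delay.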

When blocks happen anyway

Defended sites escalate past politeness with IP reputation checks and CAPTCHA challenges, and at that point the fix is infrastructure, not code. Rotating proxies spread requests across many IPs, with residential proxies reserved for the targets that flag datacenter ranges on sight. Our guide to choosing a proxy service lays out the trade-offs. The honest steer: a practice site or small crawl needs none of this, and buying proxies before you are blocked is over-engineering. DataFlirt sizes the access layer to the target, which is why its quotes vary by site difficulty rather than a flat rate.

When Beautiful Soup is not enough

Beautiful Soup has a clean failure boundary, and recognizing it early saves days. If response.text does not contain the data you can see in your browser, no amount of parsing will find it.

JavaScript-rendered pages

Modern sites often ship a skeleton page and fill in prices, listings, or reviews with JavaScript after load. Requests never runs that JavaScript, so the data never reaches your parser. The fix is a headless browser: Playwright loads the page like a real browser, waits for rendering, and hands you the final HTML, which you can still parse with Beautiful Soup. Heavily dynamic targets are their own discipline, which is why DataFlirt runs a dedicated dynamic website service for them and keeps Playwright expertise in-house. Before reaching for a browser, also check the network tab: many sites load their data from a JSON endpoint you can request directly, which is faster than rendering anything.
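
A minimal sketch of that hand-off, assuming Playwright is installed (pip install playwright, then playwright install chromium). render_html and extract_prices are names invented here, and the p.price_color selector is borrowed from the practice site purely as an example; a real target needs its own selector to wait for.

```python
from bs4 import BeautifulSoup


def render_html(url, wait_selector, timeout_ms=15000):
    """Load url in headless Chromium and return the HTML after rendering."""
    # Imported lazily so the parsing half works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=timeout_ms)
            # Wait until the element we care about exists in the DOM.
            page.wait_for_selector(wait_selector, timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()


def extract_prices(html):
    """Parsing stays the same whether the HTML came from Requests or a browser."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.select("p.price_color")]


# Offline check of the parsing half; swap in render_html(...) for a live page.
sample = '<p class="price_color">£51.77</p><p class="price_color">£53.74</p>'
print(extract_prices(sample))
```

Keeping rendering and parsing in separate functions means the selectors can be tested on saved HTML without launching a browser each time.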

The JSON-LD shortcut

Many listing pages embed their data as structured markup inside a script type="application/ld+json" tag, because sites want search engines to read their products. Parsing that block through JSON-LD extraction is dramatically more stable than CSS selectors, since redesigns change layouts but rarely the structured data. Our used car data guide includes a full working extractor for this pattern. Checking for JSON-LD is the first thing DataFlirt’s engineers do on any new listing-site target, and it should be yours too.
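
As a rough sketch of the check itself (not the full extractor mentioned above), the function below collects every JSON-LD block on a page. extract_json_ld is a name invented here, and the embedded product is made-up sample data following the schema.org Product shape.

```python
import json

from bs4 import BeautifulSoup


def extract_json_ld(html):
    """Return all JSON-LD objects found in the page as a flat list."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for script in soup.find_all("script", type="application/ld+json"):
        if not script.string:
            continue
        try:
            data = json.loads(script.string)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than abort the page
        # A block may hold one object or a list of them.
        items.extend(data if isinstance(data, list) else [data])
    return items


html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Book",
 "offers": {"@type": "Offer", "price": "51.77", "priceCurrency": "GBP"}}
</script>
</head><body></body></html>
"""

products = [
    item for item in extract_json_ld(html)
    if isinstance(item, dict) and item.get("@type") == "Product"
]
print(products[0]["name"], products[0]["offers"]["price"])
```

If this returns the fields you need, you may not need CSS selectors for that page at all.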

Scale, scheduling, and many sources

A script is one site, one run, one machine. A weekly feed across five sources needs scheduling, retries, monitoring for the hard scraping tasks, and storage decisions like which database receives the rows. That is the point where projects either graduate to Scrapy with real orchestration or stop being a side project at all.
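
One small step toward that kind of feed is making repeated runs safe to store. The sketch below uses SQLite from the standard library purely as an example; the books table, its columns, and the save_rows helper are assumptions, and a production pipeline would likely choose a different database. Keying rows on the URL means a rerun updates existing records instead of duplicating them.

```python
import sqlite3


def save_rows(conn, rows):
    """Insert scraped rows, replacing any existing row with the same URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS books ("
        "url TEXT PRIMARY KEY, title TEXT, price TEXT, in_stock INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO books (url, title, price, in_stock) "
        "VALUES (:url, :title, :price, :in_stock)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path for a persistent store

# Two runs a week apart, with invented data: the price changes, the row count does not.
save_rows(conn, [{"url": "https://example.com/b1", "title": "Book", "price": "51.77", "in_stock": True}])
save_rows(conn, [{"url": "https://example.com/b1", "title": "Book", "price": "49.99", "in_stock": True}])

row_count = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
latest_price = conn.execute("SELECT price FROM books").fetchone()[0]
print(row_count, latest_price)
```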

From script to production pipeline

Everything above gets you to working scripts, and for plenty of needs a script is genuinely the right size. The question worth asking before scaling further is whether scraping is the job you want to own. Sites change weekly, anti-bot systems evolve, and the maintenance never finishes, which is why a feed’s true cost is the upkeep rather than the build.

When the answer is “we want the data, not the engineering project,” that is what DataFlirt is for. We build on the same open-source stack this tutorial uses, Requests and Beautiful Soup through Scrapy and Playwright, then add the proxy management, monitoring, and cleaning that production demands, and deliver query-ready data as CSV, JSON, or a feed straight into your database. Teams scraping marketplaces lean on us for sources like an Amazon scraper, a Flipkart scraper, an Etsy scraper, or a Myntra scraper; hiring and market teams pull from an Indeed scraper or a Glassdoor scraper; travel and local teams use a Booking scraper, a Tripadvisor scraper, or a Yelp scraper; and property teams start with a Zillow scraper. Each exists as a maintained pipeline, not a one-off script, and the ecommerce scraping service bundles the retail versions end to end.

The economics usually decide it. A script you run occasionally costs nothing but your time, while a feed your business depends on costs whatever it takes to keep it alive through every redesign and every new bot defense, and that recurring cost is the number to compare against a managed engagement.

Talk to DataFlirt about the sites and fields you need. Most projects are scoped within 48 hours, and we will send a sample dataset from your actual targets before you commit, so you judge the output, not the pitch.

Frequently asked questions

Does Beautiful Soup fetch web pages on its own?

No. Beautiful Soup only parses HTML you give it, so it pairs with an HTTP library such as Requests to fetch pages. Requests downloads the page, Beautiful Soup turns the HTML into a searchable tree, and your code extracts the fields. The two together cover most static websites.

Can Beautiful Soup scrape JavaScript-rendered websites?

It cannot. Beautiful Soup reads the HTML returned by the server, and JavaScript-rendered content never appears in that response. For pages that build their content in the browser, you need a headless browser such as Playwright to render the page first, then you can hand the rendered HTML to Beautiful Soup or extract directly.

Should I use find_all or CSS selectors in Beautiful Soup?

Use find and find_all when you are matching tag names and attributes in Python, and use select when you already think in CSS selectors or need nested patterns like article.product_pod h3 a in one expression. Both reach the same elements, so consistency within a project matters more than the choice itself.

How do I avoid getting blocked while scraping with Python?

Identify your scraper honestly where appropriate, keep request rates low with delays between pages, respect robots.txt, and scrape logged-off public pages rather than content behind a login. Blocks usually mean the site is rate limiting your IP, and the fix is slower requests or rotating proxies, never hammering harder.

Is Beautiful Soup good enough for production web scraping?

Beautiful Soup suits scripts and small to mid-size crawls on static sites. Production pipelines that run on a schedule across many sources need crawling frameworks, proxy management, monitoring, and storage, which is the point where teams either adopt Scrapy and Playwright or hand the pipeline to DataFlirt to build and maintain.

What does DataFlirt handle if I outsource my web scraping?

DataFlirt builds and maintains the full pipeline, including crawlers on Scrapy and Playwright, proxy rotation, anti-bot handling, data cleaning, and delivery as CSV, JSON, or a feed into your database. You define the sources and fields, and DataFlirt ships query-ready data on your schedule.
