Web Scraping with Python: Requests + BeautifulSoup Tutorial


Introduction

The web is full of data. But most websites do not offer an API. Web scraping lets you extract that data directly from HTML pages using Python.

Whether you want to monitor prices, collect product information, build datasets, or automate research, web scraping is one of the most practical Python skills to learn.

In this tutorial, you will learn how to:

  • Use requests to download web pages
  • Use BeautifulSoup to parse HTML
  • Extract text, links, and images
  • Handle pagination across multiple pages
  • Save scraped data to CSV
  • Handle common errors and avoid getting blocked

All examples use books.toscrape.com — a website built specifically for legal scraping practice. Every code block in this article can be run directly without modification.

All examples are tested on Python 3.12.


What You Need

Install the two required libraries:

pip install requests beautifulsoup4

These libraries have different roles:

  • requests sends HTTP requests and downloads HTML pages
  • BeautifulSoup parses that HTML and helps extract specific data

Together they form one of the simplest and most widely used web scraping stacks in Python.


Fetching a Web Page

Start by downloading a web page and inspecting the response:

import requests

url = "https://books.toscrape.com/"

response = requests.get(url)

print(response.status_code)
print(response.text[:500])

Expected output:

200
<!DOCTYPE html>
<html>
    <head>
        <title>
            All products | Books to Scrape - Sandbox
        ...

status_code tells you whether the request succeeded. 200 means success. response.text contains the full HTML source code of the page.


Why Headers Matter

Many websites block requests that do not look like real browsers. Adding a User-Agent header makes your request appear more like a normal browser visit:

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36"
    )
}

response = requests.get(url, headers=headers)

Without a User-Agent header, some websites return 403 Forbidden, an empty page, or a CAPTCHA challenge. It is good practice to include headers in every scraping request.
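
If you plan to make several requests to the same site, a requests.Session lets you set the header once and reuse it for every call. A minimal sketch:

import requests

# A Session keeps headers (and cookies) across requests and reuses connections
session = requests.Session()
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36"
    )
})

response = session.get("https://books.toscrape.com/")
print(response.status_code)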


Parsing HTML with BeautifulSoup

Once you have the HTML, pass it to BeautifulSoup to make it searchable:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")

print(soup.title.text)

Expected output:

All products | Books to Scrape - Sandbox

BeautifulSoup converts raw HTML into a structured object you can query using tags, classes, IDs, and CSS selectors.

Core BeautifulSoup Methods

Find the first matching tag:

soup.find("h1")

Find all matching tags — returns a list:

soup.find_all("a")

Use CSS selectors — they work the same way as in web development:

soup.select(".product_pod")   # by class
soup.select("#main")          # by ID
soup.select("article h3 a")   # nested selector

CSS selectors become especially useful on larger, more complex pages.
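
As a quick illustration, here is the same traversal with select() instead of find_all(), assuming the soup object from the example above:

# Select every book link nested inside a product container
for link in soup.select("article.product_pod h3 a"):
    print(link["title"])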


Understanding the HTML Structure

Before writing any scraping code, always inspect the page structure first. Press F12 in your browser to open DevTools, then click on any element to see its HTML.

For books.toscrape.com, you will notice:

  • Each book is wrapped in an <article class="product_pod"> tag
  • Titles are inside <h3> elements
  • Prices use the class price_color

Understanding the HTML hierarchy is the most important step in any scraping project. The selectors in your code must match the actual structure of the page.
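
One quick way to confirm the structure is to print the prettified HTML of a single book container and read it in your terminal. A short sketch, assuming the soup object from the earlier example:

# Print the first book container so you can see its nested tags
first_book = soup.find("article", class_="product_pod")
print(first_book.prettify()[:500])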


Extracting Real Data

Now extract actual book titles and prices from the homepage:

import requests
from bs4 import BeautifulSoup

url = "https://books.toscrape.com/"

headers = {
    "User-Agent": "Mozilla/5.0"
}

response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.text, "html.parser")

books = soup.find_all("article", class_="product_pod")

for book in books:
    title = book.find("h3").find("a")["title"]
    price = book.find("p", class_="price_color").text

    print(f"{title}: {price}")

Expected output:

A Light in the Attic: £51.77
Tipping the Velvet: £53.74
Soumission: £50.10
Sharp Objects: £47.82
...

find_all("article", class_="product_pod") returns a list of all book containers on the page. For each one, the title is pulled from the title attribute of the anchor tag inside <h3>, and the price is pulled from the paragraph with class price_color.


Extracting Links and Images

Extract all hyperlinks on a page:

links = soup.find_all("a")

for link in links:
    print(link.get("href"))

Extract all image sources:

images = soup.find_all("img")

for img in images:
    print(img.get("src"))

.get("href") and .get("src") safely retrieve HTML attributes. If an attribute does not exist, .get() returns None instead of raising an error.
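
The difference matters because not every tag carries every attribute. A small standalone illustration (not taken from the page above):

from bs4 import BeautifulSoup

snippet = BeautifulSoup("<a>no href here</a>", "html.parser")
link = snippet.find("a")

print(link.get("href"))   # prints None -- no error
# print(link["href"])     # would raise KeyError instead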


Handling Pagination

Most websites spread content across multiple pages. The following script scrapes five pages and collects all results into a single list:

import requests
from bs4 import BeautifulSoup
import time

base_url = "https://books.toscrape.com/catalogue/page-{}.html"

all_books = []

for page in range(1, 6):

    url = base_url.format(page)

    response = requests.get(
        url,
        headers={"User-Agent": "Mozilla/5.0"}
    )

    soup = BeautifulSoup(response.text, "html.parser")

    books = soup.find_all("article", class_="product_pod")

    for book in books:
        title = book.find("h3").find("a")["title"]
        price = book.find("p", class_="price_color").text

        all_books.append({
            "title": title,
            "price": price
        })

    print(f"Page {page}: {len(books)} books found")
    time.sleep(1)  # be respectful — pause between requests

print(f"Total: {len(all_books)} books")

Expected output:

Page 1: 20 books found
Page 2: 20 books found
Page 3: 20 books found
Page 4: 20 books found
Page 5: 20 books found
Total: 100 books

The time.sleep(1) call pauses for one second between pages. This reduces load on the server and lowers the chance of getting blocked.


Saving Data to CSV

After scraping, save the results to a CSV file that can be opened in Excel or any spreadsheet tool:

import csv

with open("books.csv", "w", newline="", encoding="utf-8") as f:

    writer = csv.DictWriter(f, fieldnames=["title", "price"])

    writer.writeheader()
    writer.writerows(all_books)

print("Saved to books.csv")

Expected output:

Saved to books.csv

Once saved, you can use pandas to clean and transform the scraped data before analysis.
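
For example, a short sketch of what that cleanup might look like (assuming pandas is installed; it is not required for the rest of this tutorial):

import pandas as pd

df = pd.read_csv("books.csv")

# Keep only digits and the decimal point, then convert prices to numbers
df["price"] = df["price"].str.replace(r"[^\d.]", "", regex=True).astype(float)

print(df["price"].describe())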

The resulting file contains one row per book with the title and price columns. If you would rather work in Excel, the openpyxl tutorial shows how to save data directly to .xlsx format instead.


Common Issues and Fixes

403 Forbidden

The most common cause is a missing User-Agent header. Add headers to every request. If the site still blocks you, add a longer delay between requests:

import time
time.sleep(2)

Avoid sending dozens of requests per second — this is both ineffective and inconsiderate.

Data Not Found

If your selectors stop returning results, the website likely changed its HTML structure. Open DevTools (F12), inspect the element you want, and update your selectors to match the new structure.

JavaScript-Rendered Content

Some websites load content dynamically using JavaScript. In these cases, requests only downloads the initial HTML shell — the actual data is not present. Check the page source (Ctrl+U) and search for the data you want. If it is missing, the site uses JavaScript rendering and you will need a browser automation tool like Playwright or Selenium instead.

Encoding Problems

If scraped text appears corrupted or garbled, set the encoding manually:

response.encoding = "utf-8"
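
If you are not sure which encoding the page uses, you can also let requests guess it from the response body itself:

# Let requests detect the encoding from the content
response.encoding = response.apparent_encoding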

Timeout and Network Errors

Always set a timeout so requests do not hang indefinitely, and wrap requests in error handling:

import requests

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Error: {e}")

raise_for_status() raises an exception for any 4xx or 5xx response codes, so failures are caught immediately rather than silently producing empty data.
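
If a site is occasionally flaky, one simple pattern (a sketch, not part of the scripts above) is to retry a failed request a few times before giving up:

import time
import requests

def fetch(url, retries=3):
    # Try the request a few times, pausing between attempts
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(
                url,
                headers={"User-Agent": "Mozilla/5.0"},
                timeout=10
            )
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt} failed: {e}")
            time.sleep(2)
    return None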


Legal and Ethical Considerations

Web scraping has legal and ethical boundaries that every developer should understand before starting a project.

Check robots.txt first. Most websites publish their scraping rules at https://example.com/robots.txt. This file specifies which pages automated tools are allowed to access. Respecting it is both a legal and ethical obligation.
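
Python's standard library can check these rules for you. A minimal sketch using urllib.robotparser:

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://books.toscrape.com/robots.txt")
parser.read()

# True if the rules allow a generic crawler to fetch this page
print(parser.can_fetch("*", "https://books.toscrape.com/"))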

Do not scrape login-protected or private data. Scraping personal information, private accounts, or content behind a login wall without permission is likely illegal in most jurisdictions.

Be respectful of server resources. Sending hundreds of requests per second can degrade performance for real users and may be treated as a denial-of-service attack. Always add delays between requests and limit the total number of requests per session.

Understand the terms of service. Many websites explicitly prohibit automated scraping in their terms of service. Commercial use of scraped data may require additional legal review depending on your country and the specific website.

The examples in this tutorial use books.toscrape.com, which exists specifically for scraping practice and has no restrictions.


Wrap-Up

Web scraping with requests and BeautifulSoup covers a wide range of real-world data collection tasks. With the techniques in this tutorial, you can download pages, extract structured data, handle pagination, and save results to CSV — all with standard Python tools.

The natural next step is to combine scraping with other automation workflows. You can schedule the scraper to run automatically every day using cron or Task Scheduler, or rename and organize the output files after each run. Understanding how Python imports modules is useful when structuring larger scraping projects into packages. For questions or future tutorial ideas, get in touch via the Contact page.