Web scraping has become a vital skill for Python developers, data analysts, and anyone working with datasets. When it comes to structured and rich data, tables found on websites are often goldmines of information. Whether you’re scouring the web for product catalogs, sports statistics, or financial data, the ability to extract and save table data using Python is an invaluable tool.
This practical guide takes you step by step through the process of scraping tables from websites using Python. By the end, you’ll know how to use popular libraries like requests, Beautiful Soup, and even pandas to access table data and store it in reusable formats like CSV files.
Before we jump into the technical details, here's what you'll need to follow along: a working Python 3 installation and the three libraries used throughout this guide: requests, Beautiful Soup, and pandas.
We will use the pip command to install the required libraries. Simply run the following command in your terminal to complete the installation:
pip install requests beautifulsoup4 pandas
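If you want to confirm that everything installed correctly, a quick one-liner like this should run without errors (note that beautifulsoup4 is imported under the name bs4):

python -c "import requests, bs4, pandas; print('All libraries installed!')"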
The first step in any web scraping project is to analyze the structure of the target website. In this example, we'll be scraping data from a sample website that features a table displaying the standings for hockey teams, with columns such as Team Name, Year, Wins, Losses, OT Losses, Win %, Goals For (GF), Goals Against (GA), and +/-.
Within the page's HTML, this data lives in a standard table element, which we'll break down in the next section.
The first step is fetching the webpage you want to scrape. We'll use the requests library to send an HTTP request and retrieve the HTML content of the sample website hosting the table.
import requests

url = "https://www.scrapethissite.com/pages/forms/"
response = requests.get(url)

# Proceed only if the server returned a successful response
if response.status_code == 200:
    print("Page fetched successfully!")
    html_content = response.text
else:
    print(f"Failed to fetch the page. Status code: {response.status_code}")
    exit()
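In practice, it's worth hardening this request a little. The sketch below is an optional variant (the header value and timeout are illustrative choices, not part of the original script): it sets a timeout so a stalled connection doesn't hang forever and sends a browser-like User-Agent header, which some sites expect.

import requests

url = "https://www.scrapethissite.com/pages/forms/"
headers = {"User-Agent": "Mozilla/5.0 (compatible; table-scraper-demo)"}

# timeout raises an exception instead of hanging if the server is slow
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
html_content = response.text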
In HTML, a table is a structured way to present data in rows and columns, just like in a spreadsheet. Tables are created using the <table> tag, and their content is divided into rows (<tr>) and cells (<td> for data cells or <th> for header cells). Here's a quick breakdown of how a table's structure works:

- <table> acts as the container for all rows and cells.
- <tr> (table row) represents a horizontal slice of the table.
- <td> tags hold individual data values (or <th> tags for headers).

For example, in this script, we locate the <table> tag with a specific class (class="table") and extract its rows and cells using Beautiful Soup. This allows us to pull out the data systematically and prepare it for analysis or saving.
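To make that concrete, here is a simplified sketch of the markup we're targeting. It is abridged (the real page has more columns and attributes), but the table class, the th headers, and the tr class="team" rows match what the script relies on:

<table class="table">
  <tr>
    <th>Team Name</th>
    <th>Year</th>
    <th>Wins</th>
    <!-- ...more header cells... -->
  </tr>
  <tr class="team">
    <td>Boston Bruins</td>
    <td>1990</td>
    <td>44</td>
    <!-- ...more data cells... -->
  </tr>
</table>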
from bs4 import BeautifulSoup

# Parse the raw HTML into a searchable tree
soup = BeautifulSoup(html_content, "html.parser")

# Find the first <table> element with class="table"
table = soup.find("table", {"class": "table"})
if not table:
    print("No table found on the page!")
    exit()
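As an aside, Beautiful Soup also supports CSS selectors, so an equivalent lookup could be written like this; it returns the same element, just through a different query style:

# select_one returns the first element matching the CSS selector, or None
table = soup.select_one("table.table")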
In this step, we’ll save the extracted table data into a CSV file for future use and also display it as a pandas DataFrame so you can see how the data is structured. Saving the data as a CSV allows you to analyze it later in tools like Excel, Google Sheets, or Python itself.
import pandas as pd

# Column names come from the <th> header cells
headers = [header.text.strip() for header in table.find_all("th")]

# Each <tr class="team"> row holds one team's statistics
rows = []
for row in table.find_all("tr", class_="team"):
    cells = [cell.text.strip() for cell in row.find_all("td")]
    rows.append(cells)

# Build a DataFrame, save it to CSV, and preview the result
df = pd.DataFrame(rows, columns=headers)
csv_filename = "scraped_table_data_pandas.csv"
df.to_csv(csv_filename, index=False, encoding="utf-8")
print(df.head())
print(f"Data saved to {csv_filename}")
When you run this code, pandas will create a file named scraped_table_data_pandas.csv
in your working directory, and the first few rows of the extracted data will be printed in the console as follows:
            Team Name  Year  Wins  Losses  OT Losses  Win %  Goals For (GF)  Goals Against (GA)  + / -
0       Boston Bruins  1990    44      24              0.55             299                 264     35
1      Buffalo Sabres  1990    31      30             0.388             292                 278     14
2      Calgary Flames  1990    46      26             0.575             344                 263     81
3  Chicago Blackhawks  1990    49      23             0.613             284                 211     73
4   Detroit Red Wings  1990    34      38             0.425             273                 298    -25
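Worth knowing: pandas can often collapse the parsing step into a single call with pd.read_html, which extracts every table on a page into a list of DataFrames. Here is a minimal sketch; it assumes you also have lxml or html5lib installed, since read_html depends on one of them, and that the standings table is the first table on the page:

from io import StringIO
import pandas as pd

# read_html returns a list of DataFrames, one per <table> found in the HTML
tables = pd.read_html(StringIO(html_content))
df = tables[0]  # assumption: the standings table is the first table on the page
print(df.head())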
Below is the complete Python script for scraping table data from a website, saving it to a CSV file, and displaying the extracted data. This script combines all the steps covered in this guide into a single cohesive workflow.
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Fetch the page
url = "https://www.scrapethissite.com/pages/forms/"
response = requests.get(url)

if response.status_code == 200:
    print("Page fetched successfully!")
    html_content = response.text
else:
    print(f"Failed to fetch the page. Status code: {response.status_code}")
    exit()

# Step 2: Parse the HTML and locate the target table
soup = BeautifulSoup(html_content, "html.parser")
table = soup.find("table", {"class": "table"})
if not table:
    print("No table found on the page!")
    exit()

# Step 3: Extract the column headers and the team rows
headers = [header.text.strip() for header in table.find_all("th")]

rows = []
for row in table.find_all("tr", class_="team"):
    cells = [cell.text.strip() for cell in row.find_all("td")]
    rows.append(cells)

# Step 4: Build a DataFrame, save it to CSV, and preview the data
df = pd.DataFrame(rows, columns=headers)
csv_filename = "scraped_table_data_pandas.csv"
df.to_csv(csv_filename, index=False, encoding="utf-8")
print(df.head())
print(f"Data saved to {csv_filename}")
This guide has walked you through the process step by step: understanding the website structure, extracting the data, and saving it for analysis. Whether you're building datasets for research, automating data collection, or simply exploring the possibilities of web scraping, mastering these techniques opens up a world of opportunities.
While scraping, you might encounter challenges like IP bans or rate limits imposed by websites. This is where proxies become crucial: they let you route your requests through different IP addresses, helping you avoid bans, stay under rate limits, and keep larger scraping jobs running reliably.
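For reference, the requests library accepts a proxies mapping on each call. The snippet below is only a sketch: the proxy host, port, username, and password are placeholders you would replace with your own provider's details.

import requests

# Hypothetical proxy endpoint; substitute your provider's host, port, and credentials
proxies = {
    "http": "http://username:password@proxy.example.com:8080",
    "https": "http://username:password@proxy.example.com:8080",
}

response = requests.get("https://www.scrapethissite.com/pages/forms/",
                        proxies=proxies, timeout=10)
print(response.status_code)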
ProxyScrape offers a wide range of proxies, including residential, premium, dedicated, and mobile proxies, tailored for web scraping. Their reliable and scalable proxy solutions can help you handle large-scale scraping projects without interruptions, ensuring smooth and efficient data collection.