Web scraping is the automated, programmatic process of collecting and restructuring data from websites. Suppose, for instance, that you want to extract the email IDs of everyone who commented on a Facebook post. You can do this in two ways. First, you can point the cursor at a person's email address string, then copy and paste it into a file. This method is known as manual scraping. But what if you want to gather 2,000 email IDs? With a web scraping tool, you can extract all of them in about 30 seconds instead of the 3 hours manual scraping would take.
You can use web scraping tools to extract information from websites with no programming knowledge; you only need to know how to click. These tools are resource-efficient and save both time and cost, letting you scrape millions of pages without worrying about network bandwidth. Some websites deploy anti-bot measures to discourage scrapers from collecting data, but good web scraping tools have built-in features to bypass these measures and deliver a seamless scraping experience.
Python has excellent tools for scraping data from the web. For instance, you can import the requests library to retrieve content from a webpage and bs4 (BeautifulSoup) to extract the relevant information. You can follow the steps below to web scrape in Python. We will be extracting information from this website.
You have to import the requests library to fetch the HTML of the website.
import requests
You have to make a GET request to the website. You can do this by passing the URL to the requests.get() function.
r = requests.get('http://www.cleveland.com/metro/index.ssf/2017/12/case_western_reserve_university_president_barbara_snyders_base_salary_and_bonus_pay_tops_among_private_colleges_in_ohio.html')
Extract the content of the response using r.content. This returns the content of the page as bytes.
c = r.content
You have to import the BeautifulSoup library as it makes it easy to scrape information from web pages.
from bs4 import BeautifulSoup
You have to create a BeautifulSoup object from the content; it parses the HTML so you can navigate it with the object's methods.
soup = BeautifulSoup(c, 'html.parser')  # pass a parser explicitly to avoid a parser-guessing warning
print(soup.get_text())
You will get the visible text of the whole page as output; only a part of it contains the data we want.
Next, we have to locate the elements that hold our desired data. We can find the main content of the webpage using the .find() method of the soup object, passing the tag name and its attributes.
main_content = soup.find('div', attrs = {'class': 'entry-content'})
The data sits inside an unordered list (a ul element), which we can retrieve as text using the .text attribute.
content = main_content.find('ul').text
print(content)
We retrieved the text of the list as a single string. But the information will be of far greater use if we extract specific parts of that string. To achieve this, we move on to regular expressions.
Regular expressions (RegEx) are sequences of characters that define a search pattern. The basic idea is that you describe the shape of the text you want, and the RegEx engine finds every substring that matches that description.
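For example, here is a minimal illustration of how a pattern matches text; the sample string and pattern below are ours, not taken from the scraped page.

import re

# \$ matches a literal dollar sign; \d[\d,]* matches digits with optional commas
matches = re.findall(r'\$\d[\d,]*', 'Salaries were $876,001 and $543,903.')
print(matches)  # ['$876,001', '$543,903']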
Suppose we want to extract three pieces of information from the text: the salaries, the names of the colleges, and the names of the presidents. You can extract them by following the steps mentioned below.
Import re. To extract the salaries, you have to build a salary pattern. Use the re.compile() method to compile a regular expression, provided as a string, into a RegEx pattern object. You can then use pattern.findall() to find all the matches and return them as a list of strings, each string representing one match.
import re
salary_pattern = re.compile(r'\$.+')  # a dollar sign followed by the rest of the line
salaries = salary_pattern.findall(content)
Repeat the same procedure for extracting the names of the colleges. Make a pattern and extract the names.
school_pattern = re.compile(r'(?:,|,\s)([A-Z]{1}.*?)(?:\s\(|:|,)')  # capitalized text after a comma, up to ' (', ':' or ','
schools = school_pattern.findall(content)
print(schools)
print(salaries)
Repeat the same procedure for extracting the names of the presidents. Make a pattern and extract the required names.
name_pattern = re.compile(r'^([A-Z]{1}.+?)(?:,)', flags=re.M)  # capitalized text at the start of each line, up to the first comma
names = name_pattern.findall(content)
print(names)
The salaries are still messy strings that we cannot compute with. So, we use a Python list comprehension to convert the string salaries into numbers, combining string slicing, split, and join to achieve the desired result.
salaries = ['$876,001', '$543,903', '$2,453,896']
# strip the leading '$', remove the commas, and convert each string to an int
[int(''.join(s[1:].split(','))) for s in salaries]
The output is: [876001, 543903, 2453896]
Data visualization helps you understand data visually, so that trends, patterns, and correlations can be exposed. You can translate a large amount of data into graphs, charts, and other visuals to identify outliers and gain valuable insights.
We can use matplotlib to visualize the data, as shown below.
Import the necessary libraries as shown below.
import pandas as pd
import matplotlib.pyplot as plt
Make a pandas DataFrame of the schools, names, and salaries. For instance, you can convert the schools into a DataFrame as:
df_school = pd.DataFrame(schools)
print(df_school)
The output is a single-column DataFrame listing the school names.
Likewise, you can do the same for the salaries and names, as shown in the sketch below.
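Before plotting, the three lists can be combined into one DataFrame. The following is a minimal sketch, assuming the names, schools, and salaries lists line up row for row; the column names 'President', 'School', and 'salary' are our choice, made to match the plot call below.

# column names are our assumption; adjust them to your own data
df = pd.DataFrame({
    'President': names,
    'School': schools,
    'salary': [int(''.join(s[1:].split(','))) for s in salaries],
})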
For data visualization, we can plot a bar graph as shown below.
df.plot(kind='barh', x='President', y='salary')
The output is a horizontal bar graph of salary by president.
Web scraping helps businesses extract useful information about markets and industries so that they can offer data-powered services and make data-driven decisions. Proxies are essential for scraping data effectively from various websites, for the following reasons.
Do you know how many proxies are needed to gain the above benefits? You can calculate the required number of proxies using this formula:
Number of proxies = Number of access requests / Crawl Rate
The number of access requests depends on the following parameters.
The crawl rate, on the other hand, is the number of requests a user can make in a given amount of time. Some websites permit only a limited number of requests per user in order to tell automated requests apart from human ones.
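As a quick worked example of the formula (all of the numbers here are hypothetical):

access_requests = 100_000  # hypothetical: total pages you plan to fetch
crawl_rate = 500           # hypothetical: requests one IP can safely make
proxies_needed = access_requests / crawl_rate
print(proxies_needed)  # 200.0, i.e. roughly 200 proxies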
You can use proxies in Python by following the steps below.
import requests
proxy = 'http://114.121.248.251:8080'  # a sample proxy; substitute your own
url = 'https://ipecho.net/plain'  # this service echoes back the requesting IP
page = requests.get(url, proxies={'http': proxy, 'https': proxy})
print(page.text)
The output is the IP address of the proxy server rather than your own, which confirms that the request went through the proxy.
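Since a single proxy is rarely enough at scale, a natural extension is to rotate through a pool of proxies. Here is a minimal sketch; the proxy addresses are placeholders, not working servers.

import itertools
import requests

# placeholder proxy addresses; substitute your own pool
proxy_pool = itertools.cycle([
    'http://114.121.248.251:8080',
    'http://203.0.113.10:3128',
])

url = 'https://ipecho.net/plain'
for _ in range(2):
    proxy = next(proxy_pool)  # take the next proxy in round-robin order
    page = requests.get(url, proxies={'http': proxy, 'https': proxy})
    print(page.text)  # prints the IP of whichever proxy handled the request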
We discussed how web scraping can be used to extract data from websites in place of manual scraping, making it a cost-efficient and time-saving process. Businesses use it to collect and restructure web information, make data-driven decisions, and gain valuable insights. Using proxies is essential for safe web scraping because a proxy hides the user's original IP address from the target website. You can use datacenter or residential proxies for web scraping, but prefer residential proxies, as they are fast and cannot be easily detected. Further, we can use regular expressions in Python to match or find sets of strings, which means we can extract any string pattern from text. We also saw that data visualization converts voluminous amounts of data into charts, graphs, and other visuals that help us detect anomalies and identify useful trends in the data.