Web Scraping for News Articles using Python

In this article, we will create a web scraper that scrapes the latest news articles from different newspapers and stores them as text. We will go through the following two steps for an in-depth look at how the whole process works.

  1. Surface-level introduction to web pages and HTML.
  2. Web scraping using Python and the famous library called BeautifulSoup.

Surface-level Introduction to Web Pages and HTML

If we want to extract important information from any website or webpage, it is important to know how that website works. When we go to a specific URL using a web browser (Chrome, Firefox, etc.), the page we see is a combination of three technologies:

HTML (HyperText Markup Language): HTML defines the content of the webpage. It is the standard markup language for adding content to a website. For example, if you want to add text, images, or other content to your website, HTML is what lets you do that.

CSS (Cascading Style Sheets): CSS is used for styling web pages. All the visual design you see on a website is handled by CSS.

JavaScript: JavaScript is the brain of a webpage. It handles all the logic and functionality of the page, which is what makes the content and styling interactive.

These three technologies together allow us to create and manipulate every aspect of a webpage.

For this article, I assume you know the basics of web pages and HTML. HTML concepts such as divs, tags, and headings will be very useful while building this web scraper. You don’t need to know everything, only the basics of how a webpage is designed and how information is contained in it, and we are good to go.

Web Scraping News Articles Using BeautifulSoup in Python

Python has several packages that allow us to scrape information from a webpage. We will continue with BeautifulSoup because it is one of the most famous and easy-to-use Python libraries for web scraping.

BeautifulSoup excels at parsing a page’s HTML content and navigating it through tags and attributes, which makes it convenient to extract specific pieces of text from a website.

With only 3-5 lines of code we can do the magic and extract any text we want from a website of our choice, which shows just how easy to use yet powerful the package is.

We start from the very basics. To install the library, type the following command into your Python environment:

! pip install beautifulsoup4

We will also use the requests module, as it is what fetches any page’s HTML code for BeautifulSoup. To install it, type the following command into your Python environment:

! pip install requests

The requests module will let us get the HTML code of the web page, which we will then navigate using the BeautifulSoup package. The two commands that will make our job much easier are:

find_all(element tag, attribute): This function takes tag and attributes as its parameters and allows us to locate any HTML element from a webpage. It will identify all the elements of the same type. We can use find() instead to get only the first one.

get_text(): Once we have located a given element, this command allows us to extract the inside text.
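
To make these two commands concrete, here is a minimal sketch that runs on a small hand-written HTML snippet (the markup, class names, and links here are purely illustrative, not taken from any real site):

from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet for demonstration only
html = """
<div class="example">
  <h2 class="headline"><a href="/article-1">First headline</a></h2>
  <h2 class="headline"><a href="/article-2">Second headline</a></h2>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all locates every matching element; find() would return only the first
headlines = soup.find_all('h2', class_='headline')

for h in headlines:
    print(h.get_text())         # the text inside the element
    print(h.find('a')['href'])  # the value of the href attribute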

To navigate our web page’s HTML code and locate the elements we want to scrape, we can right-click on the page and choose ‘Inspect’, or press Ctrl+Shift+I (F12 also works in most browsers). This lets us see the source code of the webpage.

Once we locate the elements of interest, we fetch the HTML code with the requests module and extract those elements with BeautifulSoup.

For this article, we will work with the El País English edition. We will first scrape the news article titles from the front page and then extract the text from the articles themselves.

If we inspect the HTML code of the news articles, we will see that an article on the front page has a structure like this:

The title is an <h2> element with the itemprop=”headline” and class=”articulo-titulo” attributes. Inside it there is an <a> element whose text is the headline and whose href attribute contains the link to the article. We will now extract the text using the following commands:

import requests
from bs4 import BeautifulSoup
import numpy as np   # used by the loops below (np.arange)
import pandas as pd  # used later to build the result DataFrames

We first define the URL of the front page; once we get its HTML content using the requests module, we can save it into the coverpage variable:

# Request
url = "https://elpais.com/elpais/inenglish.html"
r1 = requests.get(url)
r1.status_code
 
# We'll save in coverpage the cover page content
coverpage = r1.content
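
As a quick sanity check, we can also confirm that the request succeeded before parsing anything. This is an optional sketch, not part of the original walkthrough; a status code of 200 means the page was fetched correctly:

# Optional: stop early if the request did not succeed
if r1.status_code != 200:
    raise Exception(f"Request failed with status code {r1.status_code}")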

Next, we will define the soup variable:

# Soup creation
soup1 = BeautifulSoup(coverpage, 'html5lib')

In the following line of code we will locate the elements we are looking for:

# News identification
coverpage_news = soup1.find_all('h2', class_='articulo-titulo')

Since we use find_all, we get all the occurrences, so it returns a list in which each item is a news article.

To be able to extract the text, we will use the following command:

coverpage_news[4].get_text()

If we want to access the value of an attribute (in our case, the link), we can use the following command. Note that the href lives on the <a> element nested inside the headline, so we first locate it with find():

coverpage_news[4].find('a')['href']

This returns the link as plain text.
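
Putting the two commands together, here is a small illustrative sketch (reusing the fifth headline, as in the snippets above) that pairs a headline with its link:

# Title and link of the fifth headline on the cover page
title = coverpage_news[4].get_text().strip()
link = coverpage_news[4].find('a')['href']
print(title)
print(link)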

If you have grasped all the concepts up to this point, you can web scrape any content of your own choice.

The next step involves following each article’s href attribute, requesting its source code, locating the paragraphs in the HTML, and finally extracting them with BeautifulSoup. It is the same process as described above, but we need to define the tags and attributes that identify the news article content.

The code for the full functionality is given below. I will not explain each line separately, as the code is commented and you can get a clear understanding by reading those comments.

number_of_articles = 5
# Empty lists for content, links and titles
news_contents = []
list_links = []
list_titles = []
 
for n in np.arange(0, number_of_articles):
    
    # only news articles (there are also albums and other things)
    if "inenglish" not in coverpage_news[n].find('a')['href']:  
        continue
    
    # Getting the link of the article
    link = coverpage_news[n].find('a')['href']
    list_links.append(link)
    
    # Getting the title
    title = coverpage_news[n].find('a').get_text()
    list_titles.append(title)
    
    # Reading the content (it is divided in paragraphs)
    article = requests.get(link)
    article_content = article.content
    soup_article = BeautifulSoup(article_content, 'html5lib')
    body = soup_article.find_all('div', class_='articulo-cuerpo')
    x = body[0].find_all('p')
    
    # Unifying the paragraphs
    list_paragraphs = []
    for p in np.arange(0, len(x)):
        paragraph = x[p].get_text()
        list_paragraphs.append(paragraph)
    # Join the paragraphs only once, after the loop
    final_article = " ".join(list_paragraphs)

    news_contents.append(final_article)

Let’s put the extracted articles into:

  • A dataset that will feed the models (df_features).
  • A dataset with the title and the link (df_show_info).

# df_features
df_features = pd.DataFrame(
    {'Article Content': news_contents})
 
# df_show_info
df_show_info = pd.DataFrame(
    {'Article Title': list_titles,
     'Article Link': list_links})
df_features
df_show_info

To get a sense of the user experience, we will also measure the time the script takes to fetch the news. We will define a function for the whole job and then call it (the timed call is shown after the function). Again, I will not explain every line of code, as the code is commented; reading those comments gives a clear picture.

def get_news_elpais():
    
    # url definition
    url = "https://elpais.com/elpais/inenglish.html"
    
    # Request
    r1 = requests.get(url)
    r1.status_code
 
    # We'll save in coverpage the cover page content
    coverpage = r1.content
 
    # Soup creation
    soup1 = BeautifulSoup(coverpage, 'html5lib')
 
    # News identification
    coverpage_news = soup1.find_all('h2', class_='articulo-titulo')
    len(coverpage_news)
    
    number_of_articles = 5
 
    # Empty lists for content, links and titles
    news_contents = []
    list_links = []
    list_titles = []
 
    for n in np.arange(0, number_of_articles):
 
        # only news articles (there are also albums and other things)
        if "inenglish" not in coverpage_news[n].find('a')['href']:  
            continue
 
        # Getting the link of the article
        link = coverpage_news[n].find('a')['href']
        list_links.append(link)
 
        # Getting the title
        title = coverpage_news[n].find('a').get_text()
        list_titles.append(title)
 
        # Reading the content (it is divided in paragraphs)
        article = requests.get(link)
        article_content = article.content
        soup_article = BeautifulSoup(article_content, 'html5lib')
        body = soup_article.find_all('div', class_='articulo-cuerpo')
        x = body[0].find_all('p')
 
        # Unifying the paragraphs
        list_paragraphs = []
        for p in np.arange(0, len(x)):
            paragraph = x[p].get_text()
            list_paragraphs.append(paragraph)
        # Join the paragraphs only once, after the loop
        final_article = " ".join(list_paragraphs)

        news_contents.append(final_article)
 
    # df_features
    df_features = pd.DataFrame(
        {'Content': news_contents})
 
    # df_show_info
    df_show_info = pd.DataFrame(
        {'Article Title': list_titles,
         'Article Link': list_links,
         'Newspaper': 'El Pais English'})
    
    return (df_features, df_show_info)
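
Finally, here is one way to call the function and time it. This is a minimal sketch using Python’s built-in time module; any timing approach works just as well:

import time

start = time.time()
df_features, df_show_info = get_news_elpais()
elapsed = time.time() - start

print(f"Getting the news took {elapsed:.2f} seconds")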

Wrapping Up

In this article, we have covered the basics of web scraping by understanding how web pages are designed and structured, and we have gained hands-on experience by extracting data from news articles. Web scraping can do wonders if done right. For example, a fully optimized model can be built on the extracted data to predict categories and show summaries to the user. The most important thing is to figure out your requirements and then understand the page structure. Python has some very powerful yet easy-to-use libraries for extracting the data of your choice, which makes web scraping easy and fun.

It is important to note that this code is tailored to this particular webpage. If we want to scrape any other page, we need to adapt the code to that page’s structure. But once we know how to identify the right tags and attributes, the process is exactly the same.

I hope you found this article helpful and informative. See you in the next one!
