In this article, we will build a web scraper that extracts the latest news articles from different newspapers and stores them as plain text. We will go through the following steps for an in-depth look at how the whole process works.
Feel free to jump to any section to learn more about how to perform web scraping for news articles using Python.
If we want to extract important information from any website or webpage, it is important to know how that website works. When we visit a specific URL in any web browser (Chrome, Firefox, etc.), the page we see is built from a combination of three technologies: HTML, CSS, and JavaScript.
Together, these three technologies allow us to create and manipulate every aspect of a webpage.
This article assumes you know the basics of web pages and HTML. HTML concepts like divs, tags, and headings will be very useful while creating this web scraper. You don't need to know everything, just the basics of webpage design and how information is structured within it, and we are good to go.
Python has several packages that allow us to scrape information from a webpage. We will use BeautifulSoup because it is one of the best-known and easiest-to-use Python libraries for web scraping.
BeautifulSoup is excellent at parsing a URL's HTML content and accessing it by tags and labels, which makes it convenient for extracting specific pieces of text from a website.
With only three to five lines of code, we can extract almost any text from a website of our choice, which shows what an easy-to-use yet powerful package it is.
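As a taste of how little code this takes, here is a minimal sketch (example.com is just a stand-in page; any page with an <h1> element works):

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page, parse it, and print the text of its first <h1>
html = requests.get("https://example.com").content
soup = BeautifulSoup(html, "html.parser")
print(soup.find("h1").get_text())
```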
We start from the very basics. To install the library package, run the following command from your terminal:
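```
pip install beautifulsoup4
```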
We will also use the requests module, as it provides BeautifulSoup with any page's HTML code. To install it, run the following command:
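```
pip install requests
```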
The requests module will allow us to fetch the HTML code from the web page, which we then navigate using the BeautifulSoup package. The two methods that will make our job much easier are:
find_all(tag, attributes): This method takes a tag and attributes as its parameters and lets us locate any HTML element on a webpage. It identifies all the elements of the given type; we can use find() instead to get only the first match.
get_text(): Once we have located a given element, this method extracts the text inside it. A short demonstration of both methods follows below.
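Here is a small, self-contained sketch of both methods on an inline HTML snippet (the tag and class names are made up for illustration):

```python
from bs4 import BeautifulSoup

# A small HTML snippet to demonstrate both methods
html = """
<div class="news">
  <h2 class="headline">First headline</h2>
  <h2 class="headline">Second headline</h2>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() locates every matching element; find() would return only the first
for h2 in soup.find_all("h2", class_="headline"):
    # get_text() extracts the text inside the element
    print(h2.get_text())
```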
To navigate the webpage's HTML code and locate the elements we want to scrape, we can use the Inspect option by right-clicking on the page, or by pressing Ctrl+Shift+I (or F12 in most browsers). This lets you see the source code of the webpage.
Once we have located the elements of interest, we will fetch the HTML code with the requests module and extract those elements with BeautifulSoup.
If we inspect the HTML code of the news articles, we will see that each article on the front page has a structure like this:
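A simplified sketch of that markup (the class and attribute names are the ones described below; the link and headline text are placeholders):

```html
<h2 itemprop="headline" class="articulo-titulo">
  <a href="...">Headline text of the article</a>
</h2>
```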
The title sits in an <h2> element with itemprop="headline" and class="articulo-titulo" attributes, and the <a> tag nested inside it carries the link in its href attribute and the headline as its text. We will now extract these pieces using the following commands.
Once we get the HTML content using the requests module, we can save it into the coverpage variable:
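A sketch of this step; the front-page URL is an assumption (any newspaper front page with the structure shown above would do):

```python
import requests

# Front page of the newspaper we want to scrape
url = "https://elpais.com/"

# Download the page and keep its raw HTML in the coverpage variable
r1 = requests.get(url)
coverpage = r1.content
```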
Next, we will define the soup variable:
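For example (html.parser is Python's built-in parser; any parser BeautifulSoup supports would work here):

```python
from bs4 import BeautifulSoup

# Parse the raw HTML so we can navigate it by tags and attributes
soup = BeautifulSoup(coverpage, "html.parser")
```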
In the following line of code, we will locate the elements we are looking for:
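A sketch of that lookup (coverpage_news is an illustrative variable name):

```python
# Each front-page headline sits in an <h2> with the class we identified above
coverpage_news = soup.find_all("h2", class_="articulo-titulo")
```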
Using find_all, we get all the occurrences, so it returns a list in which each item is a news article.
To extract the text, we will use the following command:
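```python
# get_text() returns the plain headline text of, say, the first article
coverpage_news[0].get_text()
```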
If we want to access the value of an attribute (in our case, the link), we can use the following command:
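```python
# Attributes are read like dictionary keys; the link lives in the href
# attribute of the <a> tag nested inside the headline element
coverpage_news[0].find("a")["href"]
```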
This will allow us to get the link in plain text.
If you have grasped all the concepts up to this point, you can scrape any content of your choice from the web.
The next step involves accessing each news article's content with the href attribute, getting the source code to find the paragraphs in the HTML code, and finally getting them with BeautifulSoup. It's the same process as described above, but we need to define the tags and attributes that identify the content of each news article.
The code for the full functionality is given below. I will not explain each line separately, as the code is commented; you can clearly understand it by reading those comments.
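The following is a sketch of the full flow under the assumptions above; the front-page URL and the articulo-cuerpo class for the article body are assumptions you must adapt to your target page:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Front page of the newspaper (an assumption; adapt to your target site)
url = "https://elpais.com/"

# How many front-page articles to scrape
number_of_articles = 5

# Step 1: download and parse the front page
r1 = requests.get(url)
coverpage = r1.content
soup = BeautifulSoup(coverpage, "html.parser")

# Step 2: locate the headline elements identified earlier
coverpage_news = soup.find_all("h2", class_="articulo-titulo")

list_titles = []
list_links = []
news_contents = []

for n in range(min(number_of_articles, len(coverpage_news))):
    # Extract the link and the title of the n-th article
    anchor = coverpage_news[n].find("a")
    link = urljoin(url, anchor["href"])  # resolve relative links
    title = anchor.get_text().strip()
    list_links.append(link)
    list_titles.append(title)

    # Step 3: download and parse the article page itself
    soup_article = BeautifulSoup(requests.get(link).content, "html.parser")

    # The article body; this tag and class are assumptions that must be
    # adapted to the structure of the page you are scraping
    body = soup_article.find_all("div", class_="articulo-cuerpo")
    if not body:
        continue

    # Step 4: join the body paragraphs into one string
    paragraphs = body[0].find_all("p")
    final_article = " ".join(p.get_text() for p in paragraphs)
    news_contents.append(final_article)

print(f"Scraped {len(news_contents)} articles")
```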
Let's now store the extracted articles as text, as planned at the start:
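A minimal way to do that, assuming the list_titles and news_contents lists from the script above:

```python
import re

# Save each article as a separate .txt file, using a sanitized
# version of the headline as the file name
for title, content in zip(list_titles, news_contents):
    filename = re.sub(r"[^\w\-]+", "_", title)[:50] + ".txt"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(title + "\n\n" + content)
```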
To provide a better user experience, we will also measure the time the script takes to get the news. We will define a function for this and then call it. Again, I will not explain every line of code, as it is commented; reading the comments will give you a clear understanding.
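A sketch of the timing wrapper, assuming the scraping logic above is wrapped in a get_news() function:

```python
import time

def get_news():
    # The full scraping flow shown above would live here
    ...

# Measure how long fetching the news takes
start_time = time.time()
get_news()
elapsed_time = time.time() - start_time
print(f"The script took {elapsed_time:.2f} seconds to get the news")
```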
A dedicated datacenter proxy offers several features, such as unlimited bandwidth and concurrent connections, dedicated HTTP proxies for easy communication, and IP authentication for added security. With 99.9% uptime, you can rest assured that the dedicated datacenter proxy will always work during any session. Last but not least, ProxyScrape provides excellent customer service and will help you resolve your issues within 24-48 business hours.
Another feature of ProxyScrape's residential proxies is rotation. A rotating proxy helps you avoid a permanent ban on your account because it dynamically changes your IP address, making it difficult for the target server to detect whether you are using a proxy.
Apart from that, a residential proxy offers: unlimited bandwidth with concurrent connections; dedicated HTTP/s proxies; proxies available at any time during a session, thanks to a pool of more than 7 million proxies; username and password authentication for added security; and, last but not least, the ability to change the country of the server.
To change the proxy server based on country, you simply append the country's ISO_CODE at the end of the IP authentication or the username and password authentication.
The Python library is called "BeautifulSoup," and it can be used to scrape data from any news article. The only requirement is a basic knowledge of HTML, so you can locate the HTML tag in the page source code that contains the data to be scraped.
The answer depends on the website's terms and conditions. That said, most news articles can be scraped, since all the information is intentionally made available to the public. Public data can generally be scraped as long as your scraping method does not harm the data or the website owner.
You can scrape Google News or any news articles using Python with the help of the library "BeautifulSoup". Install the library and a reliable residential proxy to prevent IP blocks from the target server.
In this article, we have covered the basics of web scraping by understanding how a webpage is designed and structured, and we have gained hands-on experience by extracting data from news articles. Web scraping can do wonders if done right; for example, a fully optimized model can be built on the extracted data to predict categories and show summaries to the user. The most important thing is to figure out your requirements and understand the page structure. Python has some very powerful yet easy-to-use libraries for extracting the data of your choice, which has made web scraping easy and fun.
It is important to note that this code is only useful for extracting data from this particular webpage. To scrape any other page, we need to adapt the code to that page's structure. But once we know how to identify the right tags and attributes, the process is exactly the same.