
Concurrency and parallelism can appear to refer to the same thing when you talk about running computer programs in a multi-threaded environment. After looking at their definitions in the Oxford dictionary, you may be inclined to think so. However, when you look deeper at how the CPU executes program instructions, you’ll notice that concurrency and parallelism are two distinct concepts.

This article delves deeper into concurrency and parallelism, how they differ, and how they work together to improve program execution. Finally, it discusses which of the two strategies is more suitable for web scraping. So let’s get started.

What is concurrent execution?

First, to make things simpler, we’ll start with concurrency in a single application running on a single processor. Dictionary.com defines concurrency as a combined action or effort and the occurrence of simultaneous events. However, one could say the same about parallel execution, since parallel executions also coincide, so this definition is somewhat misleading in the world of computer programming.

In everyday life, you run things concurrently on your computer. For instance, you may read a blog article in your browser while listening to music in Windows Media Player. Meanwhile, another process may be downloading a PDF file from another web page. All of these are separate processes.

Before concurrently executing applications existed, CPUs executed programs sequentially. This meant that the instructions of one program had to complete execution before the CPU moved on to the next.

In contrast, concurrent execution alternates between the processes, running a little bit of each until all are complete.

In a single-processor multi-threaded execution environment, one program executes while another is blocked, for example while it waits for user input. Now you may ask what a multi-threaded environment is. It is one in which a collection of threads runs independently of one another. More on threads below.
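The alternation described above can be imitated in plain Python with generators. This is only a toy sketch: the task names and the `round_robin` helper are made up for illustration, and a real operating system scheduler is far more involved.

```python
from collections import deque


def task(name, steps):
    """A toy 'process' that does its work one step at a time."""
    for i in range(1, steps + 1):
        yield f"{name}: step {i}/{steps}"


def round_robin(tasks):
    """Run one step of each task in turn until every task has finished."""
    queue = deque(tasks)
    log = []
    while queue:
        current = queue.popleft()
        try:
            log.append(next(current))
            queue.append(current)  # not finished yet, go to the back of the line
        except StopIteration:
            pass  # this task is complete, so drop it
    return log


if __name__ == "__main__":
    for line in round_robin([task("browser", 2), task("music", 3), task("download", 1)]):
        print(line)
```

No task runs to completion before the others get a turn, which is the essential difference from sequential execution.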

Concurrency is not to be confused with parallel execution

Now, it’s easy to confuse concurrency with parallelism. What we mean by concurrency in the above examples is that the processes are not running in parallel.

Instead, let’s say that one process has to wait for an Input/Output operation to complete. The Operating System would then allocate the CPU to another process while the first one waits for its I/O. This procedure continues until all processes complete their execution.

However, since the Operating System switches between tasks within nanoseconds or microseconds, it appears to the user as if the processes are executing in parallel.
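A rough analogy for this "switch while waiting on I/O" behaviour can be written with Python’s asyncio. Note the assumptions: asyncio switches cooperatively inside a single thread rather than through the Operating System scheduler, `asyncio.sleep` stands in for a real I/O wait, and the task names are invented.

```python
import asyncio
import time


async def wait_for_io(name, seconds, log):
    log.append(f"{name} started")
    await asyncio.sleep(seconds)  # hand control back while "waiting on I/O"
    log.append(f"{name} finished")


async def run_all(log):
    # Both tasks are in flight at once, but only one runs at any instant.
    await asyncio.gather(
        wait_for_io("A", 0.2, log),
        wait_for_io("B", 0.1, log),
    )


if __name__ == "__main__":
    log = []
    start = time.perf_counter()
    asyncio.run(run_all(log))
    print(log)
    print(f"elapsed: {time.perf_counter() - start:.2f} s")
```

Task B starts while A is still waiting and finishes first, so the total time is close to the longest wait rather than the sum of both.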

What is a Thread?

Unlike in sequential execution, with current architectures the CPU may not execute a whole process/program at once. Instead, most computers can split a process into several lightweight components that run independently of one another in an arbitrary order. These lightweight components are called threads.

For example, Google Docs might have several threads that operate concurrently. While one thread automatically saves your work, another might run in the background, checking the spelling and grammar.  

The Operating System determines the order in which threads run and which threads to prioritize, and this is system dependent.
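As a minimal sketch of two threads inside one process, the snippet below mirrors the autosave and spell-check example. The worker names and the shared `events` list are hypothetical. Each thread keeps its own steps in order, but how the two threads interleave is left to the scheduler and may differ between runs.

```python
import threading


def worker(name, repeats, events, lock):
    """Record a few steps of work in a list shared by all threads."""
    for step in range(repeats):
        with lock:  # protect the shared list while appending
            events.append((name, step))


def run_workers():
    events = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=("autosave", 3, events, lock)),
        threading.Thread(target=worker, args=("spellcheck", 3, events, lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for both threads before reading the results
    return events


if __name__ == "__main__":
    print(run_workers())
```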

What is parallel execution?

Now you know how computer programs execute in an environment with a single CPU. In contrast, modern computers execute many processes simultaneously on multiple CPUs, which is known as parallel execution. Most current architectures have multiple CPUs.

As you can see in the diagram below, each thread belonging to a process runs on its own CPU, in parallel with the others.

[Figure: concurrency vs. parallelism]

In parallelism, the Operating System switches the threads to and from the CPUs within nanoseconds or microseconds, depending on the system architecture. For the Operating System to achieve parallel execution, computer programmers use the concept known as parallel programming. In parallel programming, the programmers develop code to make the best use of the multiple CPUs.
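Here is a small sketch of what parallel programming can look like in Python, using the standard multiprocessing module. The prime-counting function is only an illustrative CPU-heavy task and is not part of the scraping program. The `if __name__ == "__main__":` guard matters on platforms that start worker processes by re-importing the script, such as Windows.

```python
from multiprocessing import Pool, cpu_count


def count_primes(limit):
    """Count primes below `limit` by trial division (deliberately CPU-heavy)."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def parallel_counts(limits):
    """Hand each limit to a separate worker process."""
    with Pool(cpu_count()) as pool:
        return pool.map(count_primes, limits)


if __name__ == "__main__":
    print(parallel_counts([20_000, 30_000, 40_000, 50_000]))
```

Because each worker is a separate process, the four calculations can run on four different CPUs at the same time.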

How concurrency could speed up web scraping

With so many domains relying on web scraping to collect data from websites, a significant drawback is the time it takes to scrape vast quantities of data. If you’re not a seasoned developer, you may end up spending plenty of time experimenting with specific techniques before the code finally runs error-free.

The section below outlines some of the reasons why web scraping is slow.

Significant reasons why web scraping is slow

Firstly, the scraper has to navigate to the target website. Then it has to retrieve the entities you wish to scrape from the HTML tags. Finally, in most circumstances, you save the data to an external file, for example in CSV format.

So as you can see, most of the above tasks are heavily I/O-bound operations, such as pulling data from websites and then saving it to external files. Navigating to the target websites often depends on external factors such as network speed or waiting for a network to become available.

As you can see in the figure below, this slowness handicaps the scraping process even further when you have to scrape three or more websites, assuming that you carry out the scraping operations sequentially.

[Figure: concurrency vs. parallelism]

Therefore, one way or the other, you have to apply concurrency or parallelism to your scraping operations. We will look into concurrency first in the next section.

Concurrency in web scraping using Python

I’m sure you have an overview of concurrency and parallelism by now. This section will focus on concurrency in web scraping with a simple coding example in Python.

A simple example without concurrent execution

In this example, we will scrape the URLs of countries from Wikipedia’s list of national capitals by population. The program saves the links, then goes to each of the 240 pages and saves the HTML of those pages locally.

To demonstrate the effects of concurrency, we will show two programs: one that executes sequentially and one that executes concurrently with multiple threads.

Here is the code:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time

def get_countries():
    countries = 'https://en.wikipedia.org/wiki/List_of_national_capitals_by_population'
    all_countries = []
    response = requests.get(countries)
    soup = BeautifulSoup(response.text, "html.parser")
    countries_pl = soup.select('th .flagicon+ a')
    for link_pl in countries_pl:
        link = link_pl.get("href")
        link = urljoin(countries, link)
        
        all_countries.append(link)
    return all_countries
  
def fetch(link):
    res = requests.get(link)
    with open(link.split("/")[-1]+".html", "wb") as f:
        f.write(res.content)
  

        
def main():
    clinks = get_countries()
    print(f"Total pages: {len(clinks)}")
    start_time = time.time()
    for link in clinks:
        fetch(link)
 
    duration = time.time() - start_time
    print(f"Downloaded {len(clinks)} links in {duration} seconds")
main()

Code explanation

Firstly, we import the libraries, including BeautifulSoup, to extract the HTML data. The other libraries are requests to access the website, urllib to join the URLs as you will discover, and the time library to measure the total execution time of the program.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time

The program begins with the main function, which calls the get_countries() function. That function fetches the Wikipedia URL specified in the countries variable with requests and parses the response into a BeautifulSoup instance using the HTML parser.

It then finds the links to the countries in the table and extracts the value of the href attribute of each anchor tag.

The links that you retrieve are relative links. The urljoin function converts them to absolute links. These links are then appended to the all_countries list, which is returned to the main function.

Then the fetch function saves the HTML content behind each link as an HTML file. This is what the following piece of code does:
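To see what urljoin does with such a link, here is a short standalone sketch. The href values are example inputs chosen for illustration, not values taken from the scraped page.

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/List_of_national_capitals_by_population"

# A root-relative href is resolved against the scheme and host of the base URL.
print(urljoin(base, "/wiki/France"))              # https://en.wikipedia.org/wiki/France

# A URL that is already absolute is returned unchanged.
print(urljoin(base, "https://example.com/page"))  # https://example.com/page
```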

def fetch(link):
    res = requests.get(link)
    with open(link.split("/")[-1]+".html", "wb") as f:
        f.write(res.content)
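The expression `link.split("/")[-1]` works for the Wikipedia links used here, but it yields an empty name when a URL ends with a slash and keeps any query string in the file name. If you want something more forgiving, one possible sketch is the helper below. The name `page_filename` and the fallback name "index" are assumptions and not part of the original program.

```python
from urllib.parse import unquote, urlsplit


def page_filename(link):
    """Build a local file name from the last path segment of a URL."""
    path = urlsplit(link).path.rstrip("/")  # ignore query strings and trailing slashes
    name = unquote(path.split("/")[-1])     # decode %XX escapes in the segment
    return (name or "index") + ".html"      # fall back when the path is empty


if __name__ == "__main__":
    print(page_filename("https://en.wikipedia.org/wiki/France"))
```

Inside fetch, you could then open `page_filename(link)` instead of building the name inline.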

Lastly, the main function prints the time it took to save the files in HTML format. On our PC, it took 131.22 seconds.

Well, this could certainly be made faster. We’ll find out how in the next section, where the same program is executed with multiple threads.

The same program with concurrency

In the multithreaded version, we only have to make minor changes so that the program executes faster.

Remember, concurrency here is about creating multiple threads and executing the program with them. There are two ways to create threads: manually, and using the ThreadPoolExecutor class.

With the manual method, after creating and starting the threads, you call the join function on each of them. By doing so, the main function waits for all the threads to complete their execution.

In this program, we will execute the code with the ThreadPoolExecutor class, which is part of the concurrent.futures module. So first of all, you have to add the line below to the above program.
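The manual method can be sketched as follows. To keep the example runnable without network access, `fake_fetch` is a hypothetical stand-in that sleeps briefly instead of downloading, and the example.com links are placeholders.

```python
import threading
import time


def fake_fetch(link, saved):
    """Stand-in for the article's fetch(): wait briefly instead of downloading."""
    time.sleep(0.05)
    saved.append(link)


def fetch_all_manually(links):
    saved = []
    # One thread per link: simple, but it does not cap the number of threads.
    threads = [threading.Thread(target=fake_fetch, args=(link, saved)) for link in links]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # block until this thread has finished
    return saved


if __name__ == "__main__":
    links = [f"https://example.com/page{i}" for i in range(5)]
    print(len(fetch_all_manually(links)), "pages handled")
```

A thread pool does the same bookkeeping for you and also limits how many threads run at once.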

from concurrent.futures import ThreadPoolExecutor

After that, you change the for loop that saves the HTML content as follows:

    with ThreadPoolExecutor(max_workers=32) as executor:
        executor.map(fetch, clinks)

The above code creates a thread pool with a maximum of 32 threads. The best value for the max_workers parameter differs from machine to machine, so you need to experiment with different values. A higher number of threads does not necessarily mean a faster execution time.

On our PC, this version ran in 15.14 seconds, which is far better than when we executed it sequentially.
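One way to run that experiment is to time the same batch of work with several pool sizes. The sketch below uses a hypothetical `simulated_fetch` that sleeps for 0.05 seconds in place of a real download, so the numbers it prints only illustrate the method and say nothing about real network behaviour.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_fetch(link):
    time.sleep(0.05)  # assumed stand-in for network latency
    return link


def time_pool(links, max_workers):
    """Return (number of results, elapsed seconds) for one pool size."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(simulated_fetch, links))
    return len(results), time.perf_counter() - start


if __name__ == "__main__":
    links = [f"https://example.com/page{i}" for i in range(40)]
    for workers in (1, 4, 16, 32):
        count, seconds = time_pool(links, workers)
        print(f"max_workers={workers:>2}: {count} pages in {seconds:.2f} s")
```

With real downloads, repeat each measurement a few times, since network conditions vary between runs.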

So before we move on to the next section, here is the final code for the program with concurrent execution:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor
import time

def get_countries():
    countries = 'https://en.wikipedia.org/wiki/List_of_national_capitals_by_population'
    all_countries = []
    response = requests.get(countries)
    soup = BeautifulSoup(response.text, "html.parser")
    countries_pl = soup.select('th .flagicon+ a')
    for link_pl in countries_pl:
        link = link_pl.get("href")
        link = urljoin(countries, link)
        
        all_countries.append(link)
    return all_countries
  
def fetch(link):
    res = requests.get(link)
    with open(link.split("/")[-1]+".html", "wb") as f:
        f.write(res.content)


def main():
    clinks = get_countries()
    print(f"Total pages: {len(clinks)}")
    start_time = time.time()

    with ThreadPoolExecutor(max_workers=32) as executor:
        executor.map(fetch, clinks)

    duration = time.time() - start_time
    print(f"Downloaded {len(clinks)} links in {duration} seconds")
main()
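One caveat about the program above: executor.map returns an iterator, and an exception raised inside fetch only surfaces when the corresponding result is read from that iterator. Since the results are never read here, a failed download would go unnoticed. A possible way to collect failures, sketched with a hypothetical `fetch_all` helper, is to use submit together with as_completed:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(fetch, links, max_workers=32):
    """Run fetch(link) for every link and report which ones failed."""
    failures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which link each future belongs to.
        futures = {executor.submit(fetch, link): link for link in links}
        for future in as_completed(futures):
            error = future.exception()  # None if the call succeeded
            if error is not None:
                failures[futures[future]] = error
    return failures
```

In main, you could then call `failures = fetch_all(fetch, clinks)` and print or retry whatever ends up in the dictionary.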

How parallelism could speed up web scraping

Now we hope that you have gained an understanding of concurrent execution. To help you compare, let’s look at how the same program performs in a multiprocessor environment, with processes executing in parallel on multiple CPUs.

Firstly, you have to import the required module:

from multiprocessing import Pool, cpu_count

Python provides the cpu_count() function, which returns the number of CPUs in your machine. It is helpful in deciding how many tasks to run in parallel.

Now you have to replace the for loop in the sequential version with this code:

    with Pool(cpu_count()) as p:
        p.map(fetch, clinks)

After running this code, we got an overall execution time of 20.10 seconds, which is much faster than the sequential execution in the first program, though slower than the 15.14 seconds of the threaded version.
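One plausible factor in these timings, offered as a rough model rather than a measurement, is the number of workers: the thread pool ran up to 32 downloads at once, while the process pool is limited to cpu_count() workers. The sketch below assumes an 8-core machine, which the article does not state, and treats every download as taking equally long.

```python
import math


def waves(pages, workers):
    """Rounds needed if every worker handles one page per round (idealized)."""
    return math.ceil(pages / workers)


pages = 240                # number of pages quoted in the article
print(waves(pages, 32))    # thread pool with max_workers=32 -> 8
print(waves(pages, 8))     # process pool on an assumed 8-core machine -> 30
```

Because the work is mostly waiting on the network, more workers in flight matter more here than more CPUs, although process start-up and data transfer between processes also add overhead that this model ignores.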

Conclusion

At this point, we hope that you have a comprehensive overview of concurrent and parallel programming. The choice to use one over the other primarily depends on the particular scenario you face.

For the web scraping scenario, we recommend starting with concurrent execution and then moving to a parallel solution if needed. We hope that you enjoyed reading this article. Don’t forget to read the other articles about web scraping on our blog.

