cloud cloud cloud cloud cloud
how to scrape emails with python

Reaching the most potential clients is very important for most startups. In this way, they can generate better leads. One of the easiest ways to have a good clientage is to have as many business email addresses as possible and send them your service details time and again.

They are many scraping tools present on the internet that provide these services for free, but they have withdrawal data limits. They also offer unlimited data extraction limits, but they are paid. Why pay them when you can build one with your own hands?

This article will demonstrate how easy it is to build a simple web crawler in Python. Although it will be a very simple example but for beginners, it will be a learning experience, especially for those who are new to web scraping. This will be a step-by-step tutorial that will help you get email addresses without any limits. 

Let’s start with the building process of our intelligent web scraper. I will divide the whole code into different pieces by commenting on what’s going on so that you can get a deeper insight into how the whole process works. I will also share the entire code at the end of the post to fully analyze the whole process. 

Step 1: Importing Modules

We will be using the following six modules for our project.

import re
import requests
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import files

The details of the imported modules are given below:

  1. re is for regular expression matching.
  2. requests for sending HTTP requests.
  3. urlsplit for dividing the URLs into component parts.
  4. deque is a container that is in the form of a list used for appending and popping on either end.
  5. BeautifulSoup for pulling data from HTML files of different web pages.
  6. pandas for email formatting into DataFrame and for further operations.

Step 2: Initializing Variables

In this step, we will initialize a deque that will save scraped URLs, unscraped URLs, and a set of saving emails scraped successfully from the websites.

# read url from input
original_url = input("Enter the website url: ") 
 
# to save urls to be scraped
unscraped = deque([original_url])
 
# to save scraped urls
scraped = set()
 
# to save fetched emails
emails = set()  

Duplicate elements are not allowed in a set, so they are all unique.

Step 3: Starting the Scraping Process  

  1. The first step is to distinguish between the scraped and unscraped URLs. The way to do this is to move a URL from unscraped to scraped.
while len(unscraped):
    # move unsraped_url to scraped_urls set
    url = unscraped.popleft()  # popleft(): Remove and return an element from the left side of the deque
    scraped.add(url)
  1. The next step is to extract data from different parts of the URL. For this purpose, we will use urlsplit.
 parts = urlsplit(url)

urlsplit() returns a 5-tuple: (addressing scheme, network location, path, query, fragment, identifier).

I can’t show sample inputs and outputs for urlsplit() due to confidential reasons, but once you try, the code will ask you to input some value (website address). The output will display the SplitResult(), and inside the SplitResult() there would be five attributes.

This will allow us to get the base and path part for the website URL.

base_url = "{0.scheme}://{0.netloc}".format(parts)
    if '/' in parts.path:
      path = url[:url.rfind('/')+1]
    else:
      path = url
  1. This is the time to send the HTTP GET request to the website.
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors and continue with next url
        continue
  1. For extracting the email addresses we will use the regular experession and then add them to the email set.
# You may edit the regular expression as per your requirement
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.com", 
                  response.text, re.I)) # re.I: (ignore case)
    emails.update(new_emails)

Regular expressions are of massive help when you want to extract the information of your own choice. If you are not comfortable with them, you can have a look at Python RegEx for more details.

  1. The next step is to find all linked URLs to the website.
 # create a beutiful soup for the html document
    soup = BeautifulSoup(response.text, 'lxml')

The <a href=””> tag indicates a hyperlink that can be used to find all the linked URLs in the document.

 for anchor in soup.find_all("a"): 
        
        # extract linked url from the anchor
        if "href" in anchor.attrs:
          link = anchor.attrs["href"]
        else:
          link = ''
        
        # resolve relative links (starting with /)
        if link.startswith('/'):
            link = base_url + link
            
        elif not link.startswith('http'):
            link = path + link

Then we will find the new URLs and add them in the unscraped queue if they are not in the scraped nor in the unscraped.

When you try the code on your own, you will notice that not all the links are able to be scraped, so we also need to exclude them,

if not link.endswith(".gz" ):
          if not link in unscraped and not link in scraped:
              unscraped.append(link)

Step 4: Exporting Emails to a CSV file

To analyze the results in a better way, we will export the emails to the CSV file.

df = pd.DataFrame(emails, columns=["Email"]) # replace with column name you prefer
df.to_csv('email.csv', index=False)

If you are using Google Colab,you can download the file to your local machine by

from google.colab import files
files.download("email.csv")

As already explained, I can’t show the scrapped email addresses due to confidentiality issues. 

[Disclaimer! Some websites don’t allow to do web scraping and they have very intelligent bots that can permanently block your IP, so scrape at your own risk.]

Complete Code

import re
import requests
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import files
 
# read url from input
original_url = input("Enter the website url: ") 
 
# to save urls to be scraped
unscraped = deque([original_url])
 
# to save scraped urls
scraped = set()
 
# to save fetched emails
emails = set()  
 
while len(unscraped):
    url = unscraped.popleft()  
    scraped.add(url)
 
    parts = urlsplit(url)
        
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    if '/' in parts.path:
      path = url[:url.rfind('/')+1]
    else:
      path = url
 
    print("Crawling URL %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        continue
 
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.com", response.text, re.I))
    emails.update(new_emails) 
 
    soup = BeautifulSoup(response.text, 'lxml')
 
    for anchor in soup.find_all("a"):
      if "href" in anchor.attrs:
        link = anchor.attrs["href"]
      else:
        link = ''
 
        if link.startswith('/'):
            link = base_url + link
        
        elif not link.startswith('http'):
            link = path + link

Wrapping Up

In this article, we have explored one more wonder of web scraping by showing a practical example of scraping email addresses. We have tried the most intelligent approach by making our web crawler by using Python and its easiest and yet powerful library called BeautfulSoup. Web Scraping can be of massive help if done rightfully considering your requirements. Although we have written a very simple code for scraping email addresses, it is totally free of cost, and also, you don’t need to rely on other services for this. I tried my level best to simplify the code as much as possible and also added room for customization so you optimize it according to your own requirements. 

If you are looking for proxy services to use during your scraping projects, don’t forget to look at ProxyScrape residential and premium proxies

That was all for this article. See you in the next ones!

Leave a Reply

Your email address will not be published. Required fields are marked *

Looking for help with our proxies or want to help? Here are your options:

Thanks to everyone for the amazing support!

Latest blog posts

© Copyright 2021 – Thib BV | Brugstraat 18 | 2812 Mechelen | VAT BE 0749 716 760