Owning a list of email prospects can help marketers scale up their businesses. By scraping email addresses with Python scripts, business people can improve their outreach to their audience.
MailButler.io says that there are nearly 4.3 billion email users globally, a number estimated to reach 4.6 billion by 2025. These statistics suggest that people rely heavily on email as their official mode of communication. This article will guide you through the process of scraping email addresses using Python.
One of the easiest ways to build a good client base is to collect as many business email addresses as possible and send them your service details regularly. There are many scraping tools on the internet that provide this service for free, but they limit how much data you can extract. They also offer unlimited data extraction, but only on paid plans. Why pay them when you can build one with your own hands? Let us discuss the steps to build a quality scraping tool using Python.
Though it will be a very simple example for beginners, it will be a learning experience, especially for those who are new to web scraping. This will be a step-by-step tutorial that will help you get email addresses without any limits. Let’s start with the building process of our intelligent web scraper.
We will be using the following six modules for our project, plus a Google Colab helper for downloading the results.
import re
import requests
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import files  # only needed (and only available) in Google Colab
The details of the imported modules are given below:
re is for regular expression matching.
requests is for sending HTTP requests.
urlsplit is for dividing URLs into their component parts.
deque is a double-ended queue, a list-like container that supports fast appends and pops at either end.
BeautifulSoup is for pulling data out of the HTML of different web pages.
pandas is for putting the emails into a DataFrame for formatting and further operations.
files (from google.colab) is for downloading the resulting file when the script runs in Google Colab.
In this step, we will initialize a deque that holds the URLs still to be scraped, a set for the URLs already scraped, and a set for the emails scraped successfully from the websites.
# read url from input
original_url = input("Enter the website url: ")
# to save urls to be scraped
unscraped = deque([original_url])
# to save scraped urls
scraped = set()
# to save fetched emails
emails = set()
Duplicate elements are not allowed in a set, so they are all unique.
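To see why these two containers fit the job, here is a small sketch. The `example.com` URLs and the email address are placeholders for illustration, not values from the tutorial.

```python
from collections import deque

# Placeholder URLs, used only for illustration.
unscraped = deque(["https://example.com/"])
scraped = set()

# New links are appended on the right ...
unscraped.append("https://example.com/about")
unscraped.append("https://example.com/contact")

# ... and taken from the left, so pages are visited in the order they were found.
first = unscraped.popleft()
scraped.add(first)

# Adding the same value to a set twice keeps a single copy.
emails = set()
emails.add("info@example.com")
emails.add("info@example.com")

print(first)
print(len(unscraped), len(emails))
```

Taking URLs from the left while appending on the right gives a breadth-first crawl, which is why a deque is used here rather than a plain list.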
The first step is to distinguish between the scraped and unscraped URLs. The way to do this is to move a URL from unscraped to scraped.
while len(unscraped):
    # move the next URL from the unscraped queue to the scraped set
    url = unscraped.popleft()  # popleft(): remove and return an element from the left side of the deque
    scraped.add(url)
The next step is to extract data from different parts of the URL. For this purpose, we will use urlsplit.
    parts = urlsplit(url)
urlsplit() returns a 5-tuple: (addressing scheme, network location, path, query, fragment identifier).
I can’t share the inputs and outputs from my own run for confidentiality reasons, but when you try the code, it will ask you to enter a website address. The result is a SplitResult() with five attributes: scheme, netloc, path, query, and fragment.
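As a quick illustration of those five attributes, here is a minimal sketch using a placeholder URL (the `example.com` address is an assumption for demonstration, not a site from the tutorial).

```python
from urllib.parse import urlsplit

# Placeholder URL chosen for illustration only.
parts = urlsplit("https://example.com/blog/post.html?page=2#comments")

print(parts.scheme)    # addressing scheme
print(parts.netloc)    # network location
print(parts.path)      # path
print(parts.query)     # query string, without the leading '?'
print(parts.fragment)  # fragment identifier, without the leading '#'
```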
This will allow us to get the base and path part for the website URL.
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    if '/' in parts.path:
        path = url[:url.rfind('/')+1]
    else:
        path = url
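To make the two values concrete, the same logic can be wrapped in a small function and tried on sample input. The function name `base_and_path` and the `example.com` URLs are hypothetical, introduced only for this sketch.

```python
from urllib.parse import urlsplit

def base_and_path(url):
    """Return (base_url, path) computed the same way as in the loop body."""
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    # Keep everything up to and including the last '/', if the URL has a path.
    path = url[:url.rfind('/') + 1] if '/' in parts.path else url
    return base_url, path

with_path = base_and_path("https://example.com/blog/post.html")
without_path = base_and_path("https://example.com")

print(with_path)
print(without_path)
```

Note that in the second case `path` has no trailing slash, so appending a relative link such as `about.html` to it would produce `https://example.comabout.html`. Entering the start URL with a trailing slash avoids this.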
Now it is time to send the HTTP GET request to the website.
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors and continue with next url
        continue
To extract the email addresses, we will use a regular expression and then add the matches to the emails set.
    # You may edit the regular expression as per your requirement
    # (as written, this pattern only matches addresses ending in .com)
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.com",
                                response.text, re.I))  # re.I: (ignore case)
    emails.update(new_emails)
Regular expressions are of massive help when you want to extract the information of your own choice. If you are not comfortable with them, you can have a look at Python RegEx for more details.
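The pattern used in this tutorial only matches addresses ending in `.com`. If you need other domains, one possible adjustment is sketched below. The name `EMAIL_RE`, the broader pattern, and the sample text are assumptions for illustration; the pattern is a pragmatic approximation and not a full validator for every legal email address.

```python
import re

# Hypothetical broader pattern: accepts any top-level domain of two or more letters.
EMAIL_RE = re.compile(
    r"[a-z0-9._+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}",
    re.I,  # ignore case, as in the tutorial's pattern
)

# Made-up sample text; "foo@bar" has no dot after the @ and is skipped.
sample = "Contact sales@example.com or Support@Example.org; not an email: foo@bar"

found = set(EMAIL_RE.findall(sample))
print(sorted(found))
```

Because the group in the pattern is non-capturing, `findall()` returns the whole matched address rather than just the grouped part.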
The next step is to find all linked URLs to the website.
    # create a BeautifulSoup object for the HTML document
    soup = BeautifulSoup(response.text, 'lxml')
The <a href=""> tag indicates a hyperlink, so it can be used to find all the linked URLs in the document.
    for anchor in soup.find_all("a"):
        # extract linked url from the anchor
        if "href" in anchor.attrs:
            link = anchor.attrs["href"]
        else:
            link = ''
        # resolve relative links (starting with /)
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
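As an alternative to joining the strings by hand, the standard library's `urljoin` resolves relative links against the page URL. This is a minimal sketch of that alternative rather than the tutorial's own method, and the page URL and hrefs are placeholders.

```python
from urllib.parse import urljoin

# Placeholder page URL and hrefs, for illustration only.
page_url = "https://example.com/blog/post.html"
hrefs = ["/contact", "about.html", "../team", "https://other.example/x"]

# urljoin handles root-relative, page-relative, parent-relative and absolute links.
resolved = [urljoin(page_url, href) for href in hrefs]

for href, full in zip(hrefs, resolved):
    print(href, "->", full)
```

With this approach, the separate `base_url` and `path` variables would no longer be needed for resolving links.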
Then we will find the new URLs and add them to the unscraped queue if they are in neither the scraped set nor the unscraped queue.
When you try the code on your own, you will notice that not all links can be scraped, so we also need to exclude those:
        if not link.endswith(".gz"):
            if link not in unscraped and link not in scraped:
                unscraped.append(link)
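Checking `link in unscraped` scans the whole deque each time, which can get slow on large sites. One possible refinement, sketched here under the assumption that an extra set is acceptable, is to track every URL ever queued. The names `seen` and `enqueue` are hypothetical and the URLs are placeholders.

```python
from collections import deque

unscraped = deque(["https://example.com/"])  # placeholder start URL
seen = set(unscraped)                        # every URL that has ever been queued

def enqueue(link):
    """Queue a link once, skipping .gz files as the tutorial does."""
    if link.endswith(".gz") or link in seen:
        return False
    seen.add(link)
    unscraped.append(link)
    return True

results = [
    enqueue("https://example.com/about"),    # new link, gets queued
    enqueue("https://example.com/about"),    # already queued, skipped
    enqueue("https://example.com/dump.gz"),  # excluded extension, skipped
]
print(results)
print(list(unscraped))
```

Since a URL stays in `seen` after it has been popped and scraped, this single set covers both membership checks.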
To analyze the results in a better way, we will export the emails to a CSV file.
df = pd.DataFrame(sorted(emails), columns=["Email"])  # sorted() turns the set into an ordered list; replace with column name you prefer
df.to_csv('email.csv', index=False)
If you are using Google Colab, you can download the file to your local machine with:
from google.colab import files
files.download("email.csv")
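If pandas is not available, the same export can be sketched with the standard library's `csv` module. This is an alternative under that assumption, not the tutorial's method. The addresses are placeholders, and the example writes to an in-memory buffer so it runs without touching the disk.

```python
import csv
import io

emails = {"sales@example.com", "info@example.com"}  # placeholder data

# In-memory buffer for the demo; use open("email.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Email"])                      # header row
writer.writerows([e] for e in sorted(emails))   # one address per row, in sorted order

print(buffer.getvalue())
```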
As already explained, I can’t show the scraped email addresses due to confidentiality issues.
[Disclaimer! Some websites don’t allow web scraping, and they have very intelligent bots that can permanently block your IP, so scrape at your own risk.]
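The complete code is given below.
import re
import requests
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup
import pandas as pd
from google.colab import files  # only available in Google Colab

# read url from input
original_url = input("Enter the website url: ")
# to save urls to be scraped
unscraped = deque([original_url])
# to save scraped urls
scraped = set()
# to save fetched emails
emails = set()

while len(unscraped):
    # move the next url from the unscraped queue to the scraped set
    url = unscraped.popleft()
    scraped.add(url)

    # work out the base url and the path used to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    if '/' in parts.path:
        path = url[:url.rfind('/')+1]
    else:
        path = url

    print("Crawling URL %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors and continue with next url
        continue

    # collect the email addresses found on the page
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.com", response.text, re.I))
    emails.update(new_emails)

    soup = BeautifulSoup(response.text, 'lxml')
    for anchor in soup.find_all("a"):
        if "href" in anchor.attrs:
            link = anchor.attrs["href"]
        else:
            link = ''
        # resolve relative links
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        # queue links that have not been seen yet
        if not link.endswith(".gz"):
            if link not in unscraped and link not in scraped:
                unscraped.append(link)

# export the emails to a CSV file and download it
df = pd.DataFrame(sorted(emails), columns=["Email"])
df.to_csv('email.csv', index=False)
files.download("email.csv")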
Because businesses need numerous email addresses to build their contact lists, they have to collect data from multiple sources. Manual data collection can be tedious and time-consuming. Scrapers therefore often use proxies to speed up the process and bypass the restrictions that come their way. ProxyScrape provides high-bandwidth proxies that support unlimited data scraping and work 24/7 to ensure uninterrupted functionality. Their proxy anonymity level is high enough to hide the identity of the scrapers.
1. Why is it necessary to scrape email addresses?
Creating a contact list of qualified email addresses eases the process of reaching out to the target audience. As most people use email as their communication medium, it is easier to reach them through their email addresses.
2. Do we need proxies for scraping email addresses?
While scraping email addresses from multiple sources, scrapers may face challenges such as IP blocks or geographical barriers. In this case, a proxy replaces the user’s IP address with its own, which helps in getting past blocks when accessing restricted websites.
3. Is it legal to scrape email addresses?
Collecting publicly available data is generally considered legal, though the rules vary by jurisdiction and by each website’s terms of service. Scrapers must therefore make sure the data they are collecting is available in the public domain. If it is not, they should obtain prior permission before collecting it.
In this article, we have explored one more wonder of web scraping by showing a practical example of scraping email addresses. We have tried the most intelligent approach by making our web crawler using Python and its easiest and yet most powerful library, BeautifulSoup. Web scraping can be of massive help if done properly with your requirements in mind. Although we have written very simple code for scraping email addresses, it is totally free of cost, and you don’t need to rely on other services for this. I tried my level best to simplify the code as much as possible and also left room for customization so you can optimize it according to your own requirements.
If you are looking for proxy services to use during your scraping projects, don’t forget to look at ProxyScrape residential and premium proxies.