Web Scraping for Lead Generation: Thousands of Leads at Your Fingertips

Scraping, Mar-06-2024 | 5 mins read

Why Lead Generation Matters

Lead generation is an essential part of growing your business. If your sales team doesn’t have leads to approach, they can’t do their job. Cold-calling prospects is rarely effective, especially for brands that sell higher-value products where there’s some friction to the idea of making a purchase.

Every Sale Started as a Lead

Sales come from leads. The Technology Content Marketing: Benchmarks, Budgets and Trends report produced by the Content Marketing Institute and MarketingProfs highlights that 77% of tech marketers use marketing-qualified leads to drive sales (up from 64% in 2019).

Qualified leads are easier to convert because they’re people (or businesses) that have already expressed an interest in your product or service. By identifying your target audience and focusing your marketing efforts on those people specifically, you’ll save your sales team’s time and energy so they can focus on the highest-quality prospects.

The Power of the Web at Your Fingertips

Lead generation is easier today than it’s ever been. Instant communication, highly targeted social media marketing options, and access to databases containing almost any piece of information imaginable mean small business owners can reach almost any audience they set their minds to.

In the past, if you’d wanted to reach a specific target audience, you’d have had to pay a huge amount of money to a marketing company just to send leaflets in the post to the companies in their database.

Today, that’s not necessary. If you want to find a list of Mexican restaurants on the east coast or K-12 schools in your state, you can find that online. Companies operating in the B2B space can build up a database of prospects quickly and easily, then filter that list and send tailored marketing messages.

For B2B entities that are targeting a relatively small geographic area, a simple web search could be enough to find a list of prospective clients. If you’re looking to reach businesses statewide or even nationwide, however, manually collecting all that data would be hugely time-consuming.

Web scraping can save you and your marketing team a significant amount of time and money, gathering the data you need automatically.

What is Web Scraping?

Web Scraping is an automated technique for extracting data from a website or multiple websites, so you can use the data in other applications. For example, suppose you wanted to build a list of names and addresses of restaurants in your area, rather than manually visiting every single local restaurant listed on Yelp or Tripadvisor. In that case, you could use a web scraper to go through those pages and extract those details, creating a list you could use for mail-outs.

Web scraping can save businesses a lot of time and effort when it comes to building a marketing list. It’s also surprisingly easy to do if you have the right tools or programming know-how.

How Do Web Scrapers Work?

Web scrapers work by loading up the pages that you want to extract data from, then reading the page to look for the type of information you’re trying to find. That information could be:

  • Company names
  • Telephone numbers
  • Email addresses
  • Postal addresses
  • Website addresses

When a web scraper downloads a page, it reads the source code to look for patterns. Depending on the site you’re pulling the data from, it could simply look for something that matches the 123-456-7890 pattern of a telephone number or the name@example.com format of an email address.

Alternatively, the developer of the scraper may know that on a certain directory website, contact details are surrounded by a specific set of tags in the HTML and make the scraper extract the information from between those tags.
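As a rough illustration of both approaches, here’s a minimal Python sketch. The directory URL, HTML structure, and CSS class name are hypothetical; a real scraper would be tuned to whatever the target site actually serves.

```python
import re

import requests
from bs4 import BeautifulSoup

# Hypothetical directory page; real sites will differ.
html = requests.get("https://example-directory.com/restaurants", timeout=10).text

# Approach 1: look for anything matching the shape of an email address or phone number.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html)
phones = re.findall(r"\b\d{3}-\d{3}-\d{4}\b", html)

# Approach 2: pull text from between known tags, e.g. <span class="contact">...</span>.
soup = BeautifulSoup(html, "html.parser")
contacts = [span.get_text(strip=True) for span in soup.select("span.contact")]

print(emails, phones, contacts)
```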

Some scraper software can be configured by the end-user, so it can be taught to understand almost any website.

Challenges With Using Scrapers

One problem with using scraper software is that regulations such as the EU’s GDPR mean users have to be very careful with the data they gather and how it’s used. Under GDPR, an organization must have a lawful basis, such as the individual’s consent, to hold or process personal data about that individual.

Some websites try to protect their users’ privacy and protect their own server resources by attempting to block web scrapers. There are several options for doing this, including checking the ‘user agent’ returned by the client software and limiting the number of requests for pages that come from a specific IP address.

If you want to use scrapers effectively, you’ll need to make sure you understand the rules surrounding marketing in your country, process any data you collect responsibly, and know how to collect data from your chosen sources in an efficient, non-destructive way that won’t get you banned from that site.

For example, at ProxyScrape, we offer residential proxies that can be used for data collection purposes. We recommend that if you’re considering using those proxies, you make sure your scraper does not issue an excessive number of requests to a target website in a short period of time. Scrape responsibly so that you don’t cause harm to the websites you’re working with.

Choosing Data Sources for High-Quality Leads

Content scraping gives business owners access to huge amounts of information that would otherwise be difficult to gather, but that information is only as useful as the source it came from.

One of the challenges of gathering data from scraping is being sure that the information is up-to-date. There are thousands of directories on the web, and many of them are poorly curated and out-of-date.

If you gather data from an outdated, low-quality source, at best, you waste time on emails that won’t be read. In the worst-case scenario, you could find yourself faced with complaints for making repeated unsolicited phone calls to a number that no longer belongs to the business you thought it did.

So, how can you increase the chances of the data you gather being useful?

Choose Your Data Source Carefully

Before you start collecting data using a scraping tool, vet the website you’re considering working with manually. Collect a few leads by hand and investigate them.

Are the businesses still operating? Are the contact details still correct? Does it look like the directory owner is vetting information before it gets added?

Suppose half of the leads you gather manually are dead, outdated, or potentially fake. In that case, there’s a high chance any database you build by scraping that site would be low quality.

Larger directory sites such as Tripadvisor, Yelp, or FourSquare are more likely to have quality data than smaller, lesser-known directories because these platforms have a much bigger base of users updating them.

Niche directories may have value if you’re looking to market to an obscure interest group or a highly specialized type of company, but you should expect to have a lot of data cleaning to do before using the information you collect for marketing purposes.

Consider Sites That Require a Login

In many cases, you’ll get far more valuable data if you gather it from a site that requires a login. LinkedIn and Twitter, for example, can be scraped if you use a rate limiter to keep the number of requests your bot sends to a reasonable level and are logged in to the site when you’re making the requests.

Another option is to use an API instead of a simple HTTP scraper and gather details from one of the popular mapping services. For example, Google provides a business search API that can be used to collect information about organizations included in Google Maps, but you must agree to comply with Google’s terms and conditions before accessing the API.

In general, if an API is available, it’s better to gather your data using that API than to use web scraping. You’ll be far less likely to run into problems with website owners, and it will be easier to clean data delivered via an API.
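To give a feel for the API route, the snippet below queries what was, at the time of writing, Google’s Places “Text Search” endpoint. The exact endpoint, parameter names, and usage terms are Google’s to define and change, so treat the URL and fields here as assumptions and check the current documentation before relying on them.

```python
import requests

API_KEY = "YOUR_GOOGLE_API_KEY"  # issued via the Google Cloud console

# Endpoint and parameter names are assumptions based on the legacy Places
# Text Search API; verify against Google's current documentation and terms.
resp = requests.get(
    "https://maps.googleapis.com/maps/api/place/textsearch/json",
    params={"query": "mexican restaurants in Boston, MA", "key": API_KEY},
    timeout=10,
)
resp.raise_for_status()

for place in resp.json().get("results", []):
    print(place.get("name"), "-", place.get("formatted_address"))
```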

Construct Your Queries Properly

There’s a saying in computer programming of “garbage in, garbage out”, and that most certainly applies to data collection. Make sure you construct any searches you perform carefully.

For example, if you want to market to builders in Newcastle, don’t forget that there’s more than one Newcastle in England, and there’s a Newcastle in Australia too. If you’re searching for ‘Newcastle’ via a proxy, most websites will try to guess which Newcastle you mean by looking at which is closest to the geographic location of the proxy.

Try to narrow down the search as much as possible, providing city, state, and even country information if the target website allows. This will help you avoid ending up with a database full of contact details for organizations hundreds of miles away from your desired area.
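For example, if a directory’s search accepts separate location fields, being explicit about all of them removes the ambiguity. In this sketch the URL and parameter names are made up; the point is simply that a tighter query means less cleaning later.

```python
from urllib.parse import urlencode

# Hypothetical directory search URL and parameter names.
params = {
    "q": "builders",
    "city": "Newcastle upon Tyne",
    "region": "Tyne and Wear",
    "country": "GB",
}
url = "https://example-directory.com/search?" + urlencode(params)
print(url)  # the more specific the query, the less post-scrape cleaning you'll do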

Scraper Software Options: Popular Tools

Web scraping can be as simple or as complex as you want it to be. If you’re just trying scraping for the first time, there’s no need to spend a lot of money on sophisticated software.

Some good options include:

  • Scraper
  • ProWebScraper
  • Scrapy

Scraper is a web browser extension that allows users to extract data from web pages quickly and easily. If you want to pull information from a single results page or a small number of pages, Scraper is a simple and effective way of doing so, and you may find that it’s much easier to use than a more sophisticated web crawler.

ProWebScraper is a more advanced tool that has free and premium versions. The free tool can be used to scrape up to 100 pages, which means it should be sufficient for a smaller, niche business. ProWebScraper is relatively easy to use for scraping software, featuring a point-and-click interface and predesigned rules that allow you to set up scraping even if you’re not confident on the technical side.

ProWebScraper can download images and create JSON, CSV, or XML dumps. It can even be set up to scrape sites on a schedule so you can collect the data and update your marketing records.

Scrapy is a web scraping framework that is free and open source. This tool requires technical knowledge, but it is fast, flexible, and can be used to scrape large amounts of data. Scrapy can be run on your own Linux, OS X, Windows, or BSD computer or on a web server.

There’s an active Scrapy community, including IRC chat, Reddit, and StackOverflow. You can seek advice from the community and may be able to take advantage of extensions or modules created by the community, unlocking the power of Scrapy even if you’re not a confident developer yourself.
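To give a feel for what Scrapy involves, here’s a minimal spider. The start URL and CSS selectors are placeholders you’d adapt to whichever site you’re scraping.

```python
import scrapy


class DirectorySpider(scrapy.Spider):
    """Minimal example spider; the URL and selectors are hypothetical."""

    name = "directory"
    start_urls = ["https://example-directory.com/restaurants?page=1"]
    custom_settings = {"DOWNLOAD_DELAY": 5}  # be polite: wait between requests

    def parse(self, response):
        for listing in response.css("div.listing"):
            yield {
                "name": listing.css("h2::text").get(),
                "phone": listing.css("span.phone::text").get(),
            }
        # Follow pagination if a "next" link exists.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Running `scrapy runspider directory_spider.py -o leads.csv` would write the extracted items to a CSV file you can then filter and clean.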

Coding Your Own Scraper

If you need to gather a lot of data or plan on scraping regularly, free tools and GUI-based tools may not be powerful enough for your use-case. Coding your own scraper, or hiring a developer to do it for you, is a good option.

There are several free, open-source frameworks that can be used to code a scraper in popular languages such as Python, Perl, Java, R, or PHP.

One of the most popular libraries for web scraping is BeautifulSoup. It’s a Python library that can pull data out of HTML or XML files quickly and easily. You’ll need some programming knowledge to use it, but it does a lot of the detailed work of scraping for you, saving you from reinventing the wheel.

Once you’ve extracted the data, you can either export it as a CSV file or display it in various formats using a data processing library such as Pandas.
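A minimal sketch of that workflow might look like the following; the listing page and CSS classes are assumptions you’d replace with the structure of your actual target site.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical listing page; adapt the URL and selectors to your target site.
resp = requests.get("https://example-directory.com/plumbers", timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")

rows = []
for card in soup.select("div.listing"):
    name = card.select_one("h2")
    phone = card.select_one(".phone")
    rows.append({
        "name": name.get_text(strip=True) if name else None,
        "phone": phone.get_text(strip=True) if phone else None,
    })

df = pd.DataFrame(rows)
df.to_csv("leads.csv", index=False)  # export for your CRM or mail-merge tool
```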

The Pros and Cons of Coding Your Own Scraper

Coding your own scraper is a good idea if you have some programming knowledge. It may also be useful to code your own scraper if you need to extract a lot of data from an unusual web page that free scraping tools can’t handle.

Coding your own scraper or paying someone to do it for you can be a good idea if you have specific, sophisticated needs. A custom-coded scraper can be designed around a target page more effectively than a more general tool, so you’re less likely to encounter bugs or issues handling the data.

Custom-coded scrapers are also useful for smaller, simpler jobs: once you’ve written a scraper, you can tweak the parsing routine and reuse the same script to extract data from other pages.

The downside of using a custom-coded scraper is that it takes time to write the scraper for the first time, and if you’re not an experienced developer, you might spend more time struggling with JSON formatting or trying to learn a new library than it would take to just read the manual for ProWebScraper and configure it.

Depending on the task, it may be more cost-effective to pay for a tool than to write a custom one.

In addition, if you’re planning to write your own scraper, you’ll need to be aware of scraping best practices and coding issues (a short code sketch follows the list below), such as:

  • Using a User-Agent to identify your bot
  • How you handle authentication for sites that require a login
  • Compliance with any terms and conditions of the website
  • Rate limiting your requests to avoid putting undue load on the website
  • Sending properly formed requests
  • Using (and regularly rotating) proxies
  • Sanitizing any information that’s returned by the server
  • Data protection rules for how and where you store the information returned
  • CAPTCHA solving
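A minimal “polite scraper” skeleton covering a few of those points, namely identifying your bot, checking robots.txt, and rate limiting. The base URL, paths, and contact address are placeholders.

```python
import time
from urllib import robotparser

import requests

USER_AGENT = "LeadResearchBot/0.1 (contact: you@yourcompany.com)"  # identify yourself
BASE = "https://example-directory.com"

# Respect robots.txt before fetching anything.
robots = robotparser.RobotFileParser(BASE + "/robots.txt")
robots.read()

session = requests.Session()
session.headers.update({"User-Agent": USER_AGENT})

for path in ["/restaurants?page=1", "/restaurants?page=2"]:
    if not robots.can_fetch(USER_AGENT, BASE + path):
        continue  # the webmaster has asked bots to stay away from this page
    resp = session.get(BASE + path, timeout=10)
    resp.raise_for_status()
    # ... parse resp.text here ...
    time.sleep(10)  # rate limit: one request every ten seconds
```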

Writing a small scraper to pull information about a few hundred or a few thousand companies makes a lot of sense. If you’re pulling larger amounts of data, you may want to seek advice or work with an expert to make sure you’re fully compliant with local privacy regulations.

Golden Rules for Web Scraping

If you do decide to write your own scraper, remember to “be nice”. Make every effort to scrape in a considerate way, sending properly formed requests, scraping slowly, and using a range of IP addresses when you scrape.

Try to make your scraper look like a human. That means requesting pages slowly and trying not to follow a fixed pattern when going through the pages. Consider, for example, pulling a list of search results, making a list of the links on the results page, then going to those links in a random order, so it’s less obvious you’re a bot.
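A tiny sketch of that idea, assuming you’ve already parsed the result links out of a search page (the links here are placeholders):

```python
import random
import time

# result_links would come from parsing a search results page.
result_links = [
    "https://example-directory.com/biz/1",
    "https://example-directory.com/biz/2",
    "https://example-directory.com/biz/3",
]

random.shuffle(result_links)           # don't walk the results in page order
for link in result_links:
    # ... fetch and parse each listing here ...
    time.sleep(random.uniform(8, 20))  # vary the delay so requests don't look mechanical
```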

Don’t send multiple requests from the same IP at the same time. Anti-scraping tools will detect that you’re placing an abnormal load on the server.

Respect the information in the website’s robots.txt file. If there are pages the webmaster doesn’t want indexed, it would be unethical for you to ignore that.

Consider using a library such as Selenium to make your bot look more human by sending clicks to the page or otherwise interacting with it. Some more sophisticated anti-scraper tools look for ‘bot-like’ interaction patterns and will block an IP address if they notice a lack of scrolling, clicking, and other interaction.

There’s a technological arms race between scraper developers and those who try to block scrapers from their websites. It’s very hard to make a scraper that can collect huge volumes of data undetected. However, for smaller or mid-sized projects, if you scrape politely and don’t get greedy, you should be able to get the data you need with a slow, steady scraper and some proxies.

Remember, your bot can work 24 hours a day, gathering data in the background, so there’s no need to download the entire list of small businesses on Yelp in one go.

Troubleshooting Scraper Issues

There are several potential issues you might encounter when running a scraper. These can include:

  • Having your IP blocked by the webmaster
  • Having your scraping client blocked by the webmaster
  • Your scraper getting confused when attempting to navigate the website
  • Garbage data being collected through ‘honeypots’ hidden on sites
  • Rate limiting stopping your scraper from working quickly
  • Changes to site designs breaking a scraper that used to work

The good news is these issues can all be fixed if you understand how scrapers work.

Simple web scrapers follow a pattern (sketched in code after the list below):

  1. The scraper sends an HTTP request to a website
  2. The website sends a response, as it would to a normal web browser
  3. The scraper reads the response, looking for a pattern in the HTML
  4. The pattern is extracted and stored in a JSON file for processing later
  5. The scraper can then either continue reading the response looking for more patterns or send its next request
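Put into code, that loop might look like the following minimal sketch; the URL, the `div.listing` selector, and the output file are placeholders.

```python
import json
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://example-directory.com/cafes?page=1"
records = []

while url:
    resp = requests.get(url, timeout=10)             # 1. send an HTTP request
    soup = BeautifulSoup(resp.text, "html.parser")   # 2-3. read the response, look for patterns
    for card in soup.select("div.listing"):
        records.append({"name": card.get_text(strip=True)})  # 4. extract the pattern
    next_link = soup.select_one("a.next")            # 5. continue to the next page, if any
    url = urljoin(url, next_link["href"]) if next_link else None

with open("leads.json", "w") as f:
    json.dump(records, f, indent=2)                  # store for processing later
```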

There are a few areas where things can go wrong.

The Scraper Isn’t Picking Up Any Data

If the scraper isn’t picking up any data at all, this could be because of an issue with the way you’ve set up the parser, or it could be that the scraper isn’t seeing the same site as you do when you use a web browser.

To find out what’s gone wrong, set your scraper to output the HTML of the page, and compare that to normal browser output.

If you see an error or a different page, it could be that your scraping client has been banned. The site could have banned your IP address or the scraper client software.

Try changing the User-Agent your scraper identifies itself with to one that makes it look like a modern web browser such as Firefox or Chrome. This could help you get around simple restrictions on some sites.

If that doesn’t work, consider setting your scraper to use a proxy to connect to the website in question. A proxy is a server that sends web requests on your behalf, so the website can’t tell they’re coming from your internet connection.
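A minimal sketch of both fixes using the requests library; the User-Agent string is just an example of a browser-like value, and the proxy address and credentials are placeholders you’d replace with the details from your provider.

```python
import requests

headers = {
    # Present a browser-like User-Agent instead of the library default.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}

proxies = {
    # Placeholder address and credentials; use the details from your proxy provider.
    "http": "http://user:password@proxy.example.com:8080",
    "https": "http://user:password@proxy.example.com:8080",
}

resp = requests.get("https://example-directory.com/restaurants",
                    headers=headers, proxies=proxies, timeout=10)
print(resp.status_code)
print(resp.text[:500])  # compare this output with what you see in a normal browser
```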

If you see a ‘normal’ page, then the issue is more likely with the way you’ve set the scraper to extract data. Each scraping program has its own way of matching patterns, although most use some variation of regular expressions. Make sure there are no typographical errors in the pattern matching. Remember, the program is doing exactly what you tell it to, so even one small error will completely break the matching rules!

The Scraper Works for a While, Then Stops

Another common problem is for a scraper to work for a short time, then stop working. This usually means the website has blocked your IP address, either temporarily or permanently, because you’ve sent too many requests in a short time.

If this happens, you can get around the ban by using a proxy. ProxyScrape offers both premium and residential proxies for people to use for data scraping. Premium datacenter proxies are fast and offer unlimited bandwidth but have IP addresses that webmasters may recognize as coming from a datacenter. Residential proxies look like they’re ‘home users’, but the throughput available on these may be lower.

Consider changing the proxy you use after a few requests to reduce the risk of a proxy’s IP address getting banned. You can also reduce the risk of IP bans by reducing the speed at which your scraper sends requests.
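One simple way to do both, assuming you have a list of proxy addresses from your provider (the addresses and the target URLs below are placeholders):

```python
import time
from itertools import cycle

import requests

# Placeholder proxy addresses; substitute the ones issued by your provider.
proxy_pool = cycle([
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
])

urls = [f"https://example-directory.com/restaurants?page={n}" for n in range(1, 6)]

for i, url in enumerate(urls):
    if i % 3 == 0:                      # rotate to a fresh proxy every few requests
        proxy = next(proxy_pool)
    resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
    # ... parse resp.text here ...
    time.sleep(20)                      # slow and steady reduces the risk of a ban
```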

Remember that a scraper can work in the background, 24 hours a day, without breaks. Even if you cap the speed of the scraper to one page every 15-30 seconds, it will still work more quickly than a human being could.

Bear in mind that many websites, especially smaller ones, are hosted on servers that have limits to their speed and the amount of data they can transmit each month. You may feel your bot scraping some data is not unreasonable, but if many other users are doing the same thing, or your bot gets ‘lost’ and tries to endlessly download the same pages over and over again, you could impair the performance of the website for human users or cost the webmaster money by consuming excessive resources.

The Scraper Gets Confused and Goes Through an Endless Loop of Pages

Another common issue that marketers encounter when trying to use a web scraper is for the scraper to get confused and download pages it shouldn’t.

Let’s imagine that your scraper’s job is to find a list of bricklayers in your city, and you point it at a directory site to run that search. The scraper should:

  • Submit an HTTP request containing the desired search string
  • Download the results page
  • Parse the result page to find a link to the first result
  • Open that link
  • Extract the contact details from that new page
  • Continue parsing the results page to find the second result
  • Open that link
  • And so on…

Some websites are built to include ‘honeypots’ that will trap and confuse bots. These honeypots are pieces of HTML that are styled with ‘display: none’, so they won’t show up in a normal browser. Bots can see them, however, and if they’re not configured to ignore them, they’ll process them just like normal HTML.

It’s very hard to program a bot to completely ignore all bot-trapping HTML because some of these traps are incredibly sophisticated. What you can do, however, is set limits on how many links your bot will follow. You can also view the source of the page yourself and look for any obvious traps so that you can set the bot to ignore them.
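As a rough sketch of both defenses, you can skip links that are hidden with inline styles and cap the total number of links you follow. Real honeypots can be much subtler than this, so treat it as a starting point rather than a complete solution.

```python
from bs4 import BeautifulSoup

MAX_LINKS = 200  # hard cap so a confused bot can't loop forever


def visible_links(html):
    """Yield hrefs from anchors that aren't hidden with an inline style."""
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        style = (a.get("style") or "").replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            continue  # likely a honeypot link a human visitor would never see
        yield a["href"]


sample = '<a href="/real-listing">Corner Bakery</a><a href="/trap" style="display: none">hidden</a>'
links = list(visible_links(sample))[:MAX_LINKS]
print(links)  # ['/real-listing']
```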

Ethical Marketing: Use Your Scraped Leads Wisely

Web scraping is something that many sites frown upon and that business owners should tread carefully when doing. Under GDPR, for example, you can’t lawfully process an EU resident’s personal data without a legal basis such as their consent.

In addition, many websites that hide data behind a login screen explicitly ban web scraping in their terms and conditions. This means you run the risk of being banned from that website if you’re found to be using a scraper.

If you decide to use scraping to gather leads, try to do so sensibly. Think of scraping as a way of saving time when gathering leads you would have gathered anyway, rather than a way of mounting a massive marketing campaign.

Avoid casting a net too wide with scraping. It can be tempting to gather the contact details of every business or person in your area and the surrounding areas, in the hopes of converting one of those businesses into a customer, but such a wide, unfocused campaign will most likely backfire.

Clean and Maintain Your Database

Before you start your marketing campaign, run some checks on the data you’ve gathered. Clean the database to remove any obviously incorrect data, such as businesses that have closed down, duplicate records, or records for people that aren’t in your target area.

Once you start the campaign, keep the database up-to-date. If a lead asks to be removed from your database, delete them. If you’re legally able to do so in your jurisdiction, retain just enough data about them to add their email or phone number to a ‘do not contact’ list so that they can’t be re-added to your marketing database next time you go scraping.
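A minimal cleaning pass with pandas might look like this. The file names and column names (email, company_name, state) are assumptions about how you’ve stored your scraped leads and your suppression list.

```python
import pandas as pd

leads = pd.read_csv("leads.csv")  # scraped leads
do_not_contact = set(pd.read_csv("do_not_contact.csv")["email"].str.lower())

# Remove duplicates and obviously unusable rows.
leads["email"] = leads["email"].str.strip().str.lower()
leads = leads.drop_duplicates(subset=["email"])
leads = leads.dropna(subset=["email", "company_name"])

# Keep only the target area and drop anyone who has opted out.
leads = leads[leads["state"] == "NY"]
leads = leads[~leads["email"].isin(do_not_contact)]

leads.to_csv("leads_clean.csv", index=False)
```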

Some other things to remember when managing your marketing campaigns include:

  • Limit the number of emails or calls you make to cold leads
  • Provide opt-out information in any contacts you send out
  • Respect opt-out requests and perform them promptly
  • If someone responds to your marketing, update their details

There’s a fine line between proactive marketing and aggressive spam. Repeated contacts from marketers are a part of the customer journey, and it’s important to stay in touch with prospective customers, but overly-aggressive marketing could alienate prospects and give your brand a bad reputation.

Consider importing the data you get from scraping into a CRM system so you can keep track of each customer, what stage they’re at in the conversion process, and how they’ve been responding to marketing messages.

Doing this will not only help you to stay on top of individual customers, but it will also make it easier for you to see how your marketing campaigns are performing collectively so you can refine your messages.

Tracking the source of leads could also be helpful since it will give you an idea of which data sources contain the highest-quality information.