AutoScraper is a powerful, open-source web scraping library for Python that simplifies extracting data from websites. Unlike traditional scraping frameworks that require extensive code to parse HTML, AutoScraper automatically generates extraction rules from examples you provide. This makes it particularly well-suited for beginners: its simple interface and automatic rule generation keep it accessible even without much coding experience.
Suppose you want to scrape an e-commerce store without dealing with HTML parsing. AutoScraper lets you list a few example product names in the wanted_list, and it will automatically learn the HTML structure and extract the remaining products on its own.
Here is a clear example demonstrating the process, including how to use proxies:
First, you'll need to install AutoScraper. You can do this using pip:

pip install autoscraper

Then import it in your script:

from autoscraper import AutoScraper
Specify the URL you want to scrape and the elements or products you wish to extract. From these examples, AutoScraper learns the HTML structure and can accurately extract all similar elements on the page:
url = 'https://books.toscrape.com/'
wanted_list = [
    "Tipping the Velvet",
    "Soumission",
]
Use AutoScraper to build your scraping model, routing requests through your proxies via request_args:

scraper = AutoScraper()
proxies = {
    "http": 'http://test_user112:[email protected]:6060',
    "https": 'http://test_user112:[email protected]:6060',
}
result = scraper.build(url, wanted_list, request_args=dict(proxies=proxies))
print(result)

# save the model if you wish to use the same scraper again
scraper.save('books_to_scrape')

# later, load the saved model and reuse it on any page with the same structure
scraper = AutoScraper()
scraper.load('books_to_scrape')
result = scraper.get_result(url)  # returns (similar_results, exact_results)
['A Light in the ...',
'Tipping the Velvet',
'Soumission',
'Sharp Objects',
'Sapiens: A Brief History ...',
'The Requiem Red',
'The Dirty Little Secrets ...',
'The Coming Woman: A ...',
'The Boys in the ...',
'The Black Maria',
'Starving Hearts (Triangular Trade ...',
"Shakespeare's Sonnets",
'Set Me Free',
"Scott Pilgrim's Precious Little ...",
'Rip it Up and ...',
'Our Band Could Be ...',
'Olio',
'Mesaerion: The Best Science ...',
'Libertarianism for Beginners',
"It's Only the Himalayas",
'A Light in the Attic',
'Sapiens: A Brief History of Humankind',
'The Dirty Little Secrets of Getting Your Dream Job',
'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull',
'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics',
'Starving Hearts (Triangular Trade Trilogy, #1)',
"Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)",
'Rip it Up and Start Again',
'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991',
'Mesaerion: The Best Science Fiction Stories 1800-1849']
One of AutoScraper's main limitations is that it doesn't support JavaScript rendering, so it can't see dynamically loaded data on its own. But don't worry, there's a solution! Using a browser automation library like Selenium or Playwright, which do handle dynamic content, we can grab the rendered HTML and then let AutoScraper take care of the parsing for us.
If your target website employs anti-bot protection, at ProxyScrape, we offer a dependable web scraping API that takes care of everything for you, making your data collection process effortless and efficient.
Here is an example of how you can use our web scraping API with AutoScraper:

import requests
from autoscraper import AutoScraper


def send_request(url_to_scrape):
    api_key = 'your_api_key'
    data = {
        "url": url_to_scrape,
        "browserHtml": True  # use browserHtml for JavaScript rendering
    }
    headers = {
        "Content-Type": "application/json",
        "X-Api-Key": api_key
    }
    response = requests.post("https://api.proxyscrape.com/v3/accounts/freebies/scraperapi/request",
                             headers=headers, json=data)
    # return the HTML that the web scraping API extracted
    return response.json()['data']['browserHtml']


if __name__ == '__main__':
    target_url = 'https://books.toscrape.com/'
    # get the HTML using the web scraping API
    html_content = send_request(target_url)
    # parse that HTML using AutoScraper
    scraper = AutoScraper()
    wanted_list = [
        "Tipping the Velvet",
        "Soumission",
    ]
    result = scraper.build(wanted_list=wanted_list, html=html_content)
    print(result)
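Network calls to a scraping API can fail transiently, so in production you may want to retry them. A minimal sketch (send_request_with_retries is a hypothetical helper built around the same endpoint and payload as above):

```python
import time
import requests

def send_request_with_retries(url_to_scrape, api_key, retries=3, backoff=2.0):
    """POST to the scraping API, retrying on transient network or HTTP errors."""
    payload = {"url": url_to_scrape, "browserHtml": True}
    headers = {"Content-Type": "application/json", "X-Api-Key": api_key}
    for attempt in range(retries):
        try:
            response = requests.post(
                "https://api.proxyscrape.com/v3/accounts/freebies/scraperapi/request",
                headers=headers, json=payload, timeout=30,
            )
            response.raise_for_status()  # treat HTTP errors as failures worth retrying
            return response.json()['data']['browserHtml']
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(backoff * (attempt + 1))  # linear backoff between attempts
```

Drop this in as a replacement for send_request when you need more resilience; the AutoScraper side of the script stays the same.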
Web scraping is a powerful tool for data acquisition, and with the right combination of AutoScraper and proxies, you can unlock its full potential. By integrating ProxyScrape's premium proxies, you ensure that your scraping activities are efficient, anonymous, and uninterrupted. We've provided the essentials to get you started; if you want to go further with AutoScraper, check this gist.
Ready to elevate your web scraping game? Start exploring the capabilities of AutoScraper with ProxyScrape's premium proxies today. Visit ProxyScrape to sign up and take advantage of our state-of-the-art proxy solutions.
If you need assistance with web scraping, feel free to join our Discord channel where you can find support.
Happy Scraping!