Scrape Static & Dynamic Sites with Python and ProxyScrape API

Guides, Python · May-02-2024 · 5 mins read

In a world that is becoming ever more reliant on data, the ability to gather and analyze vast amounts of information can give businesses and professionals a significant competitive edge. Web scraping, the process of extracting data from websites, is a powerful tool in the arsenal of data analysts, web developers, digital marketers, and Python programmers. This guide takes you through basic and advanced web scraping techniques, highlights best practices, and introduces ProxyScrape's Web Scraping API as a flexible solution for both static and dynamic websites.

Identifying Whether a Website is Static or Dynamic

To determine if a website is static or dynamic:

  • Inspect the Page Source: Right-click and select "View Page Source." If all content is visible and matches what’s displayed on the page, it’s likely static.
  • Use Browser Developer Tools: Open the developer tools by right-clicking the page and selecting "Inspect," then look at the "Network" tab as you interact with the page. If new network requests are made in response to interactions, it's likely a dynamic site.
  • Disable JavaScript: Try disabling JavaScript in your browser settings and reload the page. If the page stops functioning correctly or shows very little content, it’s likely relying on JavaScript for data fetching and rendering, indicating a dynamic nature.

These methods represent just a few of the ways to determine whether a website is static or dynamic. Other strategies exist, but these are among the most widely used and effective; for a quick programmatic version of the first check, see the sketch below.
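
You can also automate a rough version of the page-source check. Below is a minimal sketch, assuming you already know a piece of text that is visible in the browser; the URL and text here are placeholders to replace with your own:

import requests

# Placeholders - substitute the page you want to classify and a piece
# of text you can see when viewing it in a browser
url = 'http://example.com'
visible_text = 'Example Domain'

# Fetch the raw HTML; requests does not execute JavaScript
html = requests.get(url).text

if visible_text in html:
    print('Text found in the raw HTML - the content is likely static')
else:
    print('Text missing from the raw HTML - likely rendered by JavaScript')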

Scraping Static Websites with Requests and BeautifulSoup

To scrape static content, Python offers robust libraries such as `requests` for making HTTP requests and `BeautifulSoup` for parsing HTML and XML documents. Here’s a simple example:

  • Making a Request: Use `requests` to retrieve the HTML content of the page.
  • Parsing with BeautifulSoup: Once you have the page content, `BeautifulSoup` can parse and extract specific information.
import requests
from bs4 import BeautifulSoup

# Fetch the page's HTML
response = requests.get('http://example.com')

# Parse the HTML so it can be queried
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the desired data - here, every <p> tag on the page
data = soup.find_all('p')
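
Continuing the snippet above, each element BeautifulSoup returns exposes its text and attributes. For example:

# Print the text of each extracted paragraph
for paragraph in data:
    print(paragraph.get_text(strip=True))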

This method is perfect for those just starting their web scraping journey. It’s effective for websites with static content, requiring minimal setup.

Scraping Dynamic Websites

Dynamic websites present a different challenge. These websites load their content asynchronously with JavaScript, meaning straightforward HTML scraping won't work because the data isn’t present in the initial page load.

There are two ways to approach dynamic website scraping:

  • The first approach is to use a browser automation library such as Playwright or Selenium to render the page, then parse the rendered content with BeautifulSoup.
  • The second approach is like playing detective with the Network tab: spot the endpoint the website uses to fetch its data, then replicate that request yourself with Python's requests module.

Navigating Dynamic Websites with Playwright

To scrape dynamic content, tools like Playwright mimic a real user’s interaction with the browser, allowing you to scrape data that’s loaded dynamically. Here's a brief insight into using Playwright with Python:

  • Installing Playwright: Install the Playwright package and its browser binaries.
    - pip install playwright
    - playwright install
  • Using Playwright to Simulate Interactions: Write a script that navigates the website and interacts with it as needed to trigger the loading of dynamic content.
from playwright.sync_api import sync_playwright

if __name__ == '__main__':
    with sync_playwright() as p:
        # Launch a headless Chromium browser
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto('https://www.scrapethissite.com/pages/ajax-javascript/')

        # Simulate interactions: clicking a year triggers the AJAX request
        page.click('//*[@id="2014"]')

        # Extract the dynamic content once the table has rendered
        content = page.inner_text('//*[@id="oscars"]/div/div[5]/div/table')
        print(content)

        browser.close()
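
If you prefer to parse with BeautifulSoup, as the first approach above suggests, you can grab the fully rendered HTML with page.content() and hand it off. Here is a minimal sketch; the table.table selector is an assumption about the page's markup, so adjust it to what you see in the inspector:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://www.scrapethissite.com/pages/ajax-javascript/')
    page.click('//*[@id="2014"]')  # trigger the AJAX load
    page.wait_for_selector('table.table')  # assumed selector for the results table
    html = page.content()  # rendered HTML, including the AJAX-loaded content
    browser.close()

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', class_='table')
print(table.get_text(strip=True) if table else 'Table not found')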

Analyzing the Network Panel to Get API Endpoints:

  • Open Developer Tools
    a.  Open the website you are interested in within your browser.
    b.  Right-click anywhere on the page, and select Inspect or press Ctrl+Shift+I (Cmd+Option+I on Mac) to open developer tools.
  • Inspect the Network Tab
    a.  Click on the Network tab in the developer tools. This tab is where you'll see every network request the website makes.
    b.  Refresh the page to start capturing the traffic from the start.
  • Filter and Identify AJAX Requests
    a.  You can filter the requests by types like XHR (XMLHttpRequest), which are commonly used for AJAX requests.
    b.  Interact with the page—like clicking buttons, filling forms, or scrolling—to trigger the dynamic loading of content.
    c.  Observe the network requests that appear when you perform these actions. Look for requests that fetch data you are interested in.
  • Analyze the Request
    a.  Click on a request in the Network tab that looks like it's retrieving data you need.
    b.  Check the Headers section to see the request method (GET, POST, etc.), the URL, and other headers.
  • Replicate the Request Using Python
    a.  Use the information from the Headers tab to replicate the request using Python’s requests library. Here's a basic example of how you might do it:
import requests

# URL from the AJAX request
url = 'https://example.com/api/data'

# Any headers you observed that are necessary, like user-agent, authorization tokens, etc.
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Mobile Safari/537.36',
    'Authorization': 'Bearer token_if_needed'
}

# For a GET request:
response = requests.get(url, headers=headers)

# If the endpoint expects a POST request, send the payload you observed instead:
# data = {'example_key': 'example_value'}
# response = requests.post(url, headers=headers, data=data)

# To view the response
print(response.json()) 

Utilizing a Web Scraping API for Both Scenarios

While mastering requests and Playwright, or any other HTTP client library, can be rewarding, these tools take time and effort to use correctly. An alternative approach is to leverage a Web Scraping API that abstracts away the complexity of scraping tasks. Not only does it handle sending HTTP requests for you, it also provides anti-ban techniques to help prevent getting blocked by certain websites.

Introducing ProxyScrape's Web Scraping API

ProxyScrape offers a Web Scraping API that simplifies data extraction from both static and dynamic websites.

The API features include:

  • Easy integration with dynamic and static sites.
  • Comprehensive support for different types of web scraping activities.
  • An extensive pool of IP addresses.
  • Up to 100,000 free requests, letting users explore the full potential of the API without immediate investment.
  • Sophisticated anti-ban technology, tailored for websites known for their scraping difficulties.
  • Actions that give precise control over when the website's output is captured, including waiting for a particular URL request, waiting for an element to appear on the page, acting after a scroll, and more.

ProxyScrape Web Scraping API with a static website:

Here is an illustration of how you can incorporate our Web Scraping API into your Python scripts for static websites, or for calling an API endpoint that you extracted from your browser's Network panel:

import requests
import base64

data = {
    "url": "https://books.toscrape.com/",
    "httpResponseBody": True
}

headers = {
    'Content-Type': 'application/json',
    'X-Api-Key': 'YOUR_API_KEY'
}

response = requests.post('https://api.proxyscrape.com/v3/accounts/freebies/scraperapi/request', headers=headers, json=data)

if response.status_code == 200:
    json_response = response.json()
    if 'browserHtml' in json_response['data']:
        print(json_response['data']['browserHtml'])
    else:
        print(base64.b64decode(json_response['data']['httpResponseBody']).decode())
else:
    print("Error:", response.status_code)

ProxyScrape Web Scraping API with a dynamic website:

Here's an example where we wait for the favicon request to start; that's usually the last request to kick off on the test website we're using.

import requests

url = 'https://api.proxyscrape.com/v3/accounts/freebies/scraperapi/request'

headers = {
    'Content-Type': 'application/json',
    'X-Api-Key': '<your api key>'  # Make sure to replace <your api key> with your actual API key
}

payload = {
    "url": "https://books.toscrape.com/",
    "browserHtml": True,
    "actions": [
        {
            "action": "waitForRequest",
            "urlPattern": "https://books.toscrape.com/static/oscar/favicon.ico",
            "urlMatchingOptions": "exact"
        }
    ]
}

response = requests.post(url, headers=headers, json=payload)

# Print the response from the server
print(response.text)  # Prints the response body as text
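
Because browserHtml returns the fully rendered page, you can feed it straight into BeautifulSoup just as you would for a static site. A minimal sketch, assuming the response follows the same structure as the static example above:

from bs4 import BeautifulSoup

# Pull the rendered HTML out of the API response
json_response = response.json()
html = json_response['data']['browserHtml']

soup = BeautifulSoup(html, 'html.parser')
print(soup.title.get_text() if soup.title else 'No title found')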

Best Practices in Web Scraping

Regardless of the tools or APIs you choose, respecting website terms of use, limiting request rates to avoid IP bans, and using proxies for anonymous scraping are critical best practices. ProxyScrape not only provides premium, residential, mobile, and dedicated proxies for such needs, but also encourages ethical web scraping. A simple pause between requests goes a long way, as in the sketch below.
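
As a minimal sketch of client-side rate limiting, the URLs and the one-second delay below are placeholders to tune for the site you are scraping:

import time
import requests

# Placeholder URLs - substitute the pages you actually need
urls = ['http://example.com/page1', 'http://example.com/page2']

for url in urls:
    response = requests.get(url)
    # ... process the response here ...
    time.sleep(1)  # pause between requests to keep the request rate polite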

Conclusion

  • Whether you're picking up web scraping as a hobby or integrating it into your professional toolkit, understanding the distinction between static and dynamic websites, and knowing how to scrape both effectively, is essential. Combining Python libraries like Requests and Playwright/Selenium with BeautifulSoup equips you to handle your web scraping challenges.
  • If your web scraping scripts are being detected as bots and subsequently blocked, or if you wish to optimize and simplify your request-sending process, consider exploring our Web Scraping API. It's designed to manage these issues efficiently on your behalf.
  • Remember, the future of web scraping is bright, and by sticking to best practices and leveraging cutting-edge tools, you can unlock a world of data waiting to be discovered.

Ready to begin your web scraping adventure? Sign up for ProxyScrape today and explore the endless possibilities of the web with our dedicated proxies, residential proxies, and comprehensive Web Scraping API.