How To Scrape Infinite Scrolling Pages Using Python

Web scraping gives you automated access to structured data on the web. For instance, you can use web scraping for:

  • Price monitoring
  • Lead generation
  • News monitoring
  • Market research
  • Price intelligence

Infinite scrolling, also known as endless scrolling, is a web design technique in which a website uses AJAX or JavaScript to load additional content dynamically as the user scrolls toward the bottom of the page. The technique gained popularity through its success on social media sites. Twitter, for instance, produces its infinite scroll through asynchronous loading: after the initial page load, it makes AJAX calls to keep appending new content as you scroll. Though infinite scrolling has many advantages, it is not recommended for goal-oriented finding tasks that require people to locate particular content.

Let’s first understand the benefits of scraping infinite scrolling pages.

Why Do You Need To Scrape Infinite Scrolling Pages?

The following benefits explain why so many websites use infinite scrolling, and why the content you want to scrape often sits behind it.

  • User Engagement – Infinite scrolling keeps users engaged on a page. Social media sites like Twitter and Facebook have endless streams of user-generated content to scroll through, so the user stays constantly engaged.
  • Fewer Clicks – Scrolling requires less action, and it is easier for users than clicking.
  • Ideal for Mobile – Infinite scrolling is great for mobile devices and touch screens. Users can swipe down to load new content instead of switching to new tabs.

Apart from the above benefits, infinite scrolling has some cons as well:

  • It is not great for Search Engine Optimization (SEO).
  • It is not easy for users with physical disabilities to navigate through the pages that have an infinite scroll.
  • Infinite scrolling pages can suffer from long load times, caused either on the user's end or the development end.

How To Scrape Infinite Scrolling Pages Using Python

Let’s see how to scrape infinite scrolling pages using Python with the help of the below-mentioned steps.

Import Libraries

You need to import the Selenium library along with Python's time module.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # keyboard constants, e.g. Keys.END, if you scroll by sending keystrokes
import time

Selenium Setup

Here you have to choose the browser that you want to use. We will go with Chrome as it offers more options than Firefox. 

def get_selenium():
    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--incognito')
    options.add_argument('--headless')
    # Recent Selenium versions take the options keyword; chrome_options is deprecated
    driver = webdriver.Chrome(options=options)
    return driver

The headless argument mentioned above is pretty important. When running headless, Selenium will not open Chrome in a new window. However, if you encounter a problem while scraping, you can comment out the headless option and watch what's going on in Chrome and what is loaded on the page.

The other two flags, --ignore-certificate-errors and --incognito, are optional and can be omitted.

If you encounter a captcha or a cookie banner that prevents your page from loading, you can click OK and proceed to the page normally. However, if the browser closes unexpectedly, you can use time.sleep() to pause the code, giving yourself ample time to debug.
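
For instance, a minimal sketch of this debugging pause, assuming the get_selenium() helper above with the headless option commented out:

selenium = get_selenium()
selenium.get("your/url")
time.sleep(30)  # keep the browser open for 30 seconds so you can inspect the page
selenium.quit()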

Fixing Infinite Scrolling

To handle infinite scrolling, you need to look into your page's HTML structure and follow the steps below.

  • You have to find the last element loaded onto the page.
  • You have to use Selenium to scroll down to that element.
  • To wait for the page to load more content, use time.sleep().
  • Scroll again to the last element that was loaded onto the page.
  • You need to repeat the same process until you reach the end of the page.

You can consider the example below for a better understanding.

selenium = get_selenium()
selenium.get("your/url")
last_elem = ''
while True:
    current_last_elem = "#my-div > ul > li:last-child"
    scroll = "document.querySelector('" + current_last_elem + "').scrollIntoView();"
    selenium.execute_script(scroll)
    time.sleep(3)
    # Read the last element's content back so we can tell whether new items loaded
    current_elem = selenium.execute_script(
        "return document.querySelector('" + current_last_elem + "').textContent;")
    if last_elem == current_elem:
        break
    else:
        last_elem = current_elem

In the above code, we executed JavaScript from within Python.

Here,

  • We used the selenium.get() function, which opens our URL page. However, if you want to add a keyword to your URL search, you can use the following line of code.
selenium.get("your/url.com/{0}".format(keyword))
  • We initialized last_elem by storing an empty string in it.
  • We used a while loop in which we used a CSS selector or XPath to get the current_last_elem. To get the path, follow the steps below.
  • Open your page.
  • To select the element you need the path to, use the browser's developer tools. In Chrome, you can select the element in the page's HTML structure and copy its selector or XPath.
  • For scrolling the page down to the selected element, we used JavaScript's scrollIntoView().
"document.querySelector('" + .. + "').scrollIntoView();"

Here, the format of the string must be correct, so you need to pay attention to the single and double quotes and the escape characters.
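
If the quoting gets hard to follow, a Python f-string is one way to avoid the escape characters; a minimal sketch:

current_last_elem = "#my-div > ul > li:last-child"
scroll = f"document.querySelector('{current_last_elem}').scrollIntoView();"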

  • We ran the js script using selenium.execute_script().
  • You need to give the page enough time to load so that it can find the last element. Therefore, the time.sleep() function is important, as it suspends execution for a few seconds. If you don't give the page enough time to load, it will stop scrolling, and you will get an undefined result.
  • We check if a new last element is found every time we scroll down to the bottom of the page. If it is found, it means we have not reached the end of the page yet, and we need to keep scrolling. If not found, it means the page has finished scrolling down, and we can break out of the loop.
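
As an alternative to hand-written js strings, you can locate the last element with Selenium's own selector API; a minimal sketch, assuming a recent Selenium version and the same page structure as above:

from selenium.webdriver.common.by import By

# Locate the last list item and scroll it into view without building a js string
last_li = selenium.find_element(By.CSS_SELECTOR, "#my-div > ul > li:last-child")
selenium.execute_script("arguments[0].scrollIntoView();", last_li)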

Fixing Frequent Problems

Some of the problems that frequently occur when doing infinite scrolling are as follows:

  • It can take some time to find the right XPath to the last element. Check the single and double quotes in the js script.
  • If you get undefined or the same last element every time, increase the time duration, i.e., increase time.sleep(), as the page might not have enough time to load completely (see the sketch after this list).
  • If everything is correct but it still does not work, comment out the headless option in get_selenium() and watch what is happening in the browser.
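
If increasing time.sleep() is not enough, Selenium's explicit waits are a more reliable option; a minimal sketch, assuming the same selector as above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the last list item to be present instead of sleeping blindly
WebDriverWait(selenium, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#my-div > ul > li:last-child")))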

Triggering js Within Python

It is possible to trigger a js script from within Python and get a list as a result. 

For instance, we can use the code below to get the sources from all the images on the page.

js_script = '''
var jslist = [];
document.querySelectorAll('img').forEach(i => jslist.push(i.src));
return jslist;
'''
python_list = selenium.execute_script(js_script)

In the above code,

  • We created an empty array called jslist.
  • We selected all the img tags in the page.
  • We used forEach for pushing each img.src in our array.
  • We returned the jslist.

We can use the same approach for the href links by:

  • Selecting all the “a” tags.
  • Pushing every a.href into our array.

Afterwards, we can run the script with selenium.execute_script() and store the value returned by the js in a Python variable, i.e., python_list.
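
For example, a sketch of the same pattern for the links:

js_script = '''
var jslist = [];
document.querySelectorAll('a').forEach(a => jslist.push(a.href));
return jslist;
'''
python_list = selenium.execute_script(js_script)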

This is how we can scrape infinite scrolling pages using Python.

Using a Proxy

A proxy is a third-party server that acts as an intermediary between a client requesting a resource and a server providing that resource. If you want to use proxies with Selenium and Python, you can use the following lines of code.

# hostname and port identify your proxy server
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s:%s' % (hostname, port))
driver = webdriver.Chrome(options=chrome_options)

For handling infinite scrolling, you can use scroll-proxy, a js library that supports programmatic scrolling of the scrollable views within a view hierarchy. If you use npm, you can install scroll-proxy using the command below. We will use js to demonstrate it.

npm install scroll-proxy --save

After installing scroll-proxy, you can instantiate a ScrollProxy object using the below code.

var myScroll = new ScrollProxy();

You can see that we did not pass any arguments to the ScrollProxy constructor; by default, it reports actions when the user scrolls the page.

However, if you want to get updates when the user scrolls inside some specific HTML element, you have to pass it into the constructor.

var myDiv = document.querySelector('.scrollable');
var myDivScroll = new ScrollProxy(myDiv);

Why Use Proxies For Scraping Infinite Scrolling?

Below are some reasons to use proxies while scraping infinite scrolling.

  • A captcha can cause your page to time out and can block your scraper. If you are getting frequent timeout errors, you can manually check the page for a captcha. Most captchas are triggered by security measures, and you can avoid them by using rotating residential proxies along with your scraper (see the sketch after this list).
  • Some sites filter out requests with suspicious headers on the assumption that the user agent may be a bot. To avoid signaling that you are a bot, you can use proxies to change your IP address, and send a realistic user-agent header, preventing red flags.
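
For example, a minimal sketch of rotating proxies with the Selenium setup from the previous section; the proxies list here is a hypothetical pool you would supply yourself:

import random

# Hypothetical pool of residential proxy endpoints (host:port)
proxies = ["111.111.111.111:8080", "122.122.122.122:8080"]

chrome_options = webdriver.ChromeOptions()
# Pick a different proxy for each browser session
chrome_options.add_argument('--proxy-server=%s' % random.choice(proxies))
driver = webdriver.Chrome(options=chrome_options)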

Conclusion

We discussed that infinite scrolling is preferred when the user isn't looking for specific information. News websites and social media feeds that constantly generate new content can benefit from infinite scrolling. On the other hand, business pages and e-commerce sites aren't good candidates for infinite scrolling, as users there seek specific information. Further, we discussed the steps involved in scraping infinite scrolling pages using Selenium. We can also use rotating residential proxies for handling infinite scrolling, as they help avoid captchas and suspicious-header filtering.

We hope you now have an understanding of how to scrape infinite scrolling pages using Python.