Do you know which is the correct selector to use in web scraping? Web scraping has been quite popular in the recent decade to extract data from the internet. It helps businesses acquire and analyze data to make better business decisions. Thanks to automated technologies, web scraping has never been easier than it is now.
However, regardless of the tool or framework you select, you must make a crucial decision to ensure that your scraper scrapes the data politely. That is whether to extract web elements using XPath or CSS selectors, which you”ll learn in this article.
Let’s dive in with some existing examples.
XPath stands for XML Path Language. However, it uses non-XML syntax to select tags or groups of tags from an XML document or HTML, as with web scraping. XPath enables you to write expressions to access an HTML or XML element directly without traversing the entire HTML tree.
To understand how you could access an element using the XPath, let’s dig deeper with an HTML code. I assume that you already know some basic HTML.
<!doctype html> <html xmlns=”http://www.w3.org/1999/xhtml” lang="en" xml:lang="en"> <head> <meta charset="utf-8"> <title>Awesome Products at your Fingertips</title> </head> <body> <h1>Description of product features</h1> <p>These products are great. So let's just look at the features !</p> <ul id="product-list" class=”basic-list”> <li>Item 1</li> <li>Item 2</li> <li>Item 3</li> </ul> </body> </html>
You can type the above code in an editor of your choice and save it as products.html. Then you may view it in a browser (preferably Google Chrome as we’ll walk through this example with it).
When the browser runs this code, it phrases the HTML and creates a tree representation of the elements. It is known as the DOM (Document Object Model) in the following form:
You can read more about the DOM at the given link. Now our focus here is on the XPath on how to navigate to each of these elements straight away without traversing the entire tree. So let’s begin with the basic terminology of the Xpath.
With XPath, the most fundamental element is a node. Well, nodes are simply the individual elements you just saw in the DOM tree. As we go along in this article, you will even further discover that nodes consist of tag elements, attributes, strings values assigned to it, and so on. There are seven in each XML or HTML page, and let’s look at each node type closely.
While the above three are the most significant ones, it is also vital to know the following four just for information’s sake.
There are two ways of doing this. First, Let’s demonstrate it or code an example. As I have mentioned above, I hope you have saved it on your local disk and are ready to be viewed in your browser.
When the page loads, hover your mouse over item 1 and right-click on it. Then select the Inspect from the menu items that appear as shown in the screenshot below:
Then you would be able to find the full XPath by clicking on the <li> element in the console and selecting “copy” from the drop-down menu, and then specifying “Copy full XPath as shown below:
Then after pasting it in a text file or somewhere, you would get:
This is known as the absolute path. I will explain below how you have derived it.
Step 1 => li //Here one denotes the first li element Step 2 => /li Step 3 => ul/li Step 4 => /ul/li Step 5 => body/ul/li Step 6 => /body/ul/li Step 7 => html/body/ul/li Step 8 => /html/body/ul/li
With this method, you need to work your way backward, starting from the target element all the way back to the root element. You add a forward slash before the element you have just added as you write each element. So let’s look at how you could work out the XPath for the first <li> element manually:
Although the above method seems to be lengthy, it will help you understand how to build the full XPath. Now over to the relative method.
As you can see, it is pretty short, and the path is relative to the parent <ul> element. Since the <li> element does not have an id attribute, its relative path is relative to the <ul> element.
The significant differences are that full XPath is not readable and hard to maintain. The other obvious concern is that if there are changes to any element starting from the root element, the absolute XPath will not be valid. So it makes sense to use the relative XPath.
However, before commenting further on that, let’s first look at the advantages and disadvantages.
With XPath, you don’t have to worry if you don’t know the name of an element because you can utilize the contains function to look for probable matches. Therefore, you can go up the DOM when querying for items to scrape.
The other significant benefit of CSS is that it works in older legacy browsers like outdated versions of Internet Explorer.
As you have learned above, its most significant disadvantage is easier to break when you change the elements in the path. It can be hard to comprehend compared to the CSS selectors that you will find out below.
Also, when retrieving elements from the XPath, its performance is pretty much slower than that of CSS.
As you already know, CSS stands for Cascading Style Sheets prominently used in styling HTML elements in a web page. These styles include applying colors to your font, background images, and colors, aligning and positioning elements, and increasing/decreasing spaces between paragraphs.
To set a style to an HTML element, you need to specify it through a CSS Selector. Let’s start with a simple example starting with the markup in the next section.
<h1 id="main-heading" class="header-styles" name="h1name">What are CSS Selectors?</h1>
So here is the CSS selector for the above element:
You could combine them as well:
h1.header-styles-This CSS selector selects h1 elements with class header-styles.
The > operator is used to select children. In contrast, the + operator chooses the first sibling, and the operator ~ is used to pick all siblings. A few instances are as follows:
Unlike XPath, which the Beautiful Soup does not support, CSS selectors are supported with the most effective scraping libraries. Also, unlike XPath, CSS selectors are easier to learn and maintain. Almost all the browsers support it, except legacy browsers below Internet Explorer version 8. However, people rarely use those browsers in this day and age.
Even though you take the older versions of Internet Explorer out of the equation, there could still be inconsistencies with how they render on different browsers.
Since there are various CSS versions, they could create confusion among developers and beginners alike.
Another vital factor in today’s web technology is the CSS’s security.
The apparent difference between XPath and CSS is that XPath is bidirectional. This means you can traverse in both directions in the DOM tree. However, you can only traverse from the parent node to child nodes with CSS, known as one-directional flow.
As discussed in previous sections, XPath is harder to maintain and not a good candidate for effective readability. On the other hand, although XPath may operate in legacy browsers, the way it renders differs from one browser to another.
Therefore in that regard, the CSS has the edge.
XPATH stands out as CSS could only traverse from parents to the child in specific areas such as traversing up the DOM tree. AS far as the speed is concerned, CSS has the edge.
However, the speed difference between XPath and CSS doesn’t count much when it comes to web scraping. There are other factors to consider, such as network latency in web scrapping.
CSS would be your first choice when it comes to Beautiful Soup as it does not support the XPath.
There is no precise answer on which selectors to use for your web scraping project. As you have discovered in this article, the XPath has the edge in certain situations, whereas the CSS stands out in others.
Therefore, you need to factor in specific vital points such as traversing, browser support, and some of the technical capabilities that we discussed. We hope you will practice what you have learned and stay tuned for more articles.