XPath is an incredibly powerful language for finding, navigating and selecting data in XML-style documents. HTML isn't strictly XML, but browsers and parsers turn it into the same kind of tree, which means most of the content on the web is selectable, and in turn scrapeable, with the help of XPath. That's pretty damn cool.
Since XPath is just a way of selecting parts of a document (or in our case, the DOM), it needs to be combined with some kind of tool before you can put it to work. As you might've guessed, SeoTools does this through the XPathOnUrl function.
It's also available as a column in the SeoTools Spider crawler.
Why you should scrape
There's a plethora of reasons you might want to scrape a website, but here are a couple of use cases to get you started.
Scraping can be a godsend when it comes to gathering data to make important SEO decisions. Sure, you can do it by hand, but do you really want to manually compare the social media penetration of 25 different blog posts? Trust me, you don't.
By comparing organic search results you can quickly find your competitors for certain search terms, grab their URLs and analyse their pages to see what keywords, titles and descriptions they're targeting.
Note: Google search result pages use Ajax, which doesn't play nice with XPath, so in order to perform the above example you first need to grab an HTML snapshot of the results page in question.
Find out what your competitors are doing by targeting specific parts of the DOM like meta tags, og tags or title tags. You can do this through SeoTools' XPathOnUrl function, or use one of the higher-level functions like HtmlMetaDescription.
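To make the idea concrete outside of Excel, here's a minimal sketch in Python using the standard library's ElementTree, which supports a subset of XPath. The page markup is a made-up stand-in for a competitor's page, and it assumes well-formed markup; real-world HTML usually needs a proper HTML parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a competitor's page.
PAGE = """<html>
  <head>
    <title>Ten Tips for Faster Pages</title>
    <meta name="description" content="A short guide to page speed." />
    <meta property="og:title" content="Ten Tips" />
  </head>
  <body><h1>Ten Tips</h1></body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Title tag: grab the text of the first <title> anywhere below the root.
title = root.findtext(".//title")

# Meta description: pick the <meta> by its name attribute, then read content.
description = root.find(".//meta[@name='description']").get("content")

# og tag: same idea, but Open Graph tags are keyed on the property attribute.
og_title = root.find(".//meta[@property='og:title']").get("content")

print(title)
print(description)
print(og_title)
```

The three selectors (.//title, .//meta[@name='description'] and .//meta[@property='og:title']) are the kind of expressions you'd hand to an XPath tool to pull the same fields.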
Websites > APIs
Websites are often a better data source than APIs, for a couple of reasons. For starters, website owners usually care more about their visitor-facing information than about their structured data feeds, and you can't really blame them. Sometimes they outright limit their APIs for superficial reasons, though. Take Twitter, for instance. That you can blame them for.
No rate limits
Some APIs have strict rate limits, which can be frustrating to say the least. Scraping the website instead sidesteps the issue.
The data is already there
The data is already there, ripe for the picking. No need to apply for API keys, wait for the data to be added to existing APIs or figure out that really weird authentication pattern that makes no sense.
So how do you actually use this powerful language? There are already great introductions to XPath on the web, like this one.
What they might not mention, however, is how useful Chrome DevTools can be in helping you create XPath selectors. I'm sure you've used it in some capacity before, but if you haven't, you bring it up by pressing F12 or by right-clicking anywhere on the page and choosing Inspect. It should open to the Elements panel by default.
What you see here is the DOM, the Document Object Model. It's a document structure defined by HTML tags. Now find the tag encompassing the content you'd like to scrape, right-click it to bring up the context menu and select "Copy XPath".
It'll copy something along these lines to your clipboard:
Easy, huh? You can paste that right in as a parameter in SeoTools.
But there's more!
If you prefer writing your own XPath selectors, you can use Chrome DevTools to validate them by bringing up the Elements panel, pressing CTRL + F and pasting in your XPath. If it matches, it'll highlight the element in the DOM.
Another handy feature of DevTools is $x(). If you pass an XPath selector, it'll return a list of the corresponding results. If there were zero matches, it'll return an empty list. If you didn't pay attention while reading this article and wrote invalid syntax, it'll throw an error to get you back on track.
$x() only understands XPath. For CSS selectors, DevTools has the sibling helpers $() and $$().
Constructing your own XPath selector
In the above example you can see that we're targeting an element that's nested inside several others. Let's dissect what's actually going on in that query.
- Select the 4th div that's a child of body, which in turn is a child of html.
- Select the second section element that's a child of main.
- Traverse through three div elements and return the value of the first h2 tag.
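The three steps above can be sketched in Python with the standard library's ElementTree. Both the selector and the page skeleton below are assumptions reconstructed from those steps, not the exact query copied from DevTools.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the copied selector, based on the three steps above.
COPIED_XPATH = "/html/body/div[4]/main/section[2]/div/div/div/h2[1]"

# Hypothetical page skeleton that matches that structure.
PAGE = """<html><body>
  <div>header</div>
  <div>navigation</div>
  <div>banner</div>
  <div>
    <main>
      <section><h2>Not this one</h2></section>
      <section>
        <div><div><div>
          <h2>The heading we want</h2>
          <h2>A later heading</h2>
        </div></div></div>
      </section>
    </main>
  </div>
</body></html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# ElementTree evaluates paths relative to an element and rejects absolute
# ones, so swap the leading "/html" for "." (the root itself).
relative_path = "." + COPIED_XPATH[len("/html"):]

heading = root.find(relative_path)
print(heading.text)
```

Note that XPath positions start at 1, so div[4] really is the fourth div and h2[1] the first h2.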
Let's say you'd like to fetch the titles of a blog you really like. That might look something like this:
This will select the h2 tags of the first 10 article elements and return their values.
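Here's a rough sketch of that kind of query in Python, again with the standard library's ElementTree. The blog markup is invented for illustration, and the XPath in the comment is one possible way to write such a selector, not necessarily the one used in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical blog index with 12 posts, each an <article> holding an <h2> title.
articles = "".join(
    f"<article><h2>Post {n}</h2><p>Teaser text</p></article>" for n in range(1, 13)
)
root = ET.fromstring(f"<html><body><main>{articles}</main></body></html>")

# In full XPath 1.0 this could be written as: (//article)[position() <= 10]/h2
# ElementTree only supports a subset of XPath, so the "first 10" limit is
# applied with a Python slice instead.
first_ten = root.findall(".//article")[:10]

# Read the text of the <h2> that is a direct child of each article.
titles = [article.findtext("h2") for article in first_ten]

print(titles)
```

The parentheses in (//article)[position() <= 10] matter: without them, the position is counted among the articles under each parent rather than across the whole document.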
Dealing with arrays
The XPathOnUrl function in SeoTools will return an array if your query results in more than one string. That's where Dump() comes in - it makes dealing with arrays in Excel a breeze. You simply pass your XPathOnUrl formula as a parameter to Dump() and it takes care of the rest.
Where to go from here
We've barely scraped the surface of what XPath can do, and there are loads of great articles on the internet about how you can apply it to your SEO game. For a more in-depth guide on how to use XPath with SeoTools, take a look at the documentation.