XML sounds like some weird technical language, and it kinda is. But it unlocks one of the most powerful Google Sheets formulas used by smart marketers (like yourself) for scraping webpages. Scrape all headers from a site in bulk, find internal linking opportunities, and spy on your competitors' affiliate links. With the IMPORTXML formula, you can do it all.
In this post, I'll share some of the formulas I use most often. After reading this, you'll be able to copy and paste these formulas and put their most powerful use cases straight to work.
Contents
- Import any <h> header from a page
- Import title from a page
- Import meta description from a page
- Import all image links
- Import all links from a page’s body
- Import all internal links from a page’s content
- Import all links that point to a specific domain (and find affiliate links in bulk!)
- Import all elements that contain a string or keyword
- Check if a page contains a specific script
- Extract the featured images from a list of pages
- Limitations of IMPORTXML
Import any <h> header from a page
To import all headers from a page, use this formula:
=IMPORTXML("URL", "//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]")
To import all h2 headers from a page, use this formula:
=IMPORTXML("URL", "//h2")
Important tip: Most sites also have headers in the site header, footer, or sidebar, which you generally want to filter out. In that case, it's important to narrow your XPath statement. You can do this by targeting only the <body> element. Alternatively, you can find the class of the element that wraps the content you're looking for. I build all my websites with GeneratePress, and the main content area on GP sites always carries the class “content-area”.
To import all h2 headers from a page, focusing only on the <body> element, use this formula:
=IMPORTXML("URL", "//body//h2")
To import all h2 headers from a page, focusing only on the element with the class “content-area”, use this formula:
=IMPORTXML("URL", "//*[@class='content-area']//h2")
To import both h2 and h3 headers from a page, use this formula:
=IMPORTXML("URL", "//*[self::h2 or self::h3]")
To import only the first h2 header from a page, use this formula:
=IMPORTXML("URL", "(//h2)[1]")
To import the first 3 h2 headers from a page, use this formula:
=IMPORTXML("URL", "(//h2)[position()<=3]")
Make sure to replace “URL” with the full URL of the page you want to scrape.
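You can also point the formula at a cell that contains the URL, which makes it easy to drag the formula down a whole list of pages. A quick sketch (the URL and cell reference here are just examples):
=IMPORTXML("https://example.com/some-post/", "//h2")
=IMPORTXML(A2, "//h2")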
Pretty cool, huh?
But it gets a lot better!
Import title from a page
To import the title of a page, use this formula:
=IMPORTXML("URL", "//title")
Import meta description from a page
To import the meta description of a page, use this formula:
=IMPORTXML("URL", "//meta[@name='description']/@content")
From here, you can quickly audit your meta description lengths using the LEN function. Use the formula below to quickly get the number of characters inside a page’s meta description:
=LEN(IMPORTXML("URL", "//meta[@name='description']/@content"))
Important tip: The length of a meta description is a good first check. Usually, the maximum ranges from 155 to 160 characters; anything longer gets truncated. But truncation actually depends on the total “pixel width” of your meta description. For example, the character M requires more pixels than the character i. The pixel width is currently capped at roughly 900 pixels, but this changes all the time (and Google does what it wants anyway!)
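To flag descriptions that risk truncation, you can wrap the character count in an IF. A sketch, using 160 as a rough, assumed cutoff:
=IF(LEN(IMPORTXML("URL", "//meta[@name='description']/@content"))>160, "Too long", "OK")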
Import all image links
To import all image links from a webpage, use this formula:
=IMPORTXML("URL", "//img/@src")
To import all images from within the element that’s classed “content-area” of a page, use this formula:
=IMPORTXML("URL", "//*[@class='content-area']//img/@src")
Import all links from a page’s body
To import all links from a page, use this formula:
=IMPORTXML("URL", "//a/@href")
To import all links that are within elements that contain the class “button”, use this formula:
=IMPORTXML("URL", "//*[contains(@class, 'button')]//a/@href")
Import all internal links from a page’s content
Now, this is where it gets more interesting!
To import all links that contain a specific URL in the link, use this formula:
=IMPORTXML("URL", "//a[contains(@href, 'huijer.co')]/@href")
Make sure to replace “huijer.co” with your own domain.
Import all links that point to a specific domain (and find affiliate links in bulk!)
To import all links pointing to a specific domain (like Amazon), use this formula:
=IMPORTXML("URL", "//a[contains(@href, 'amazon.com')]/@href")
This is great for analyzing a website’s content from a sitemap: How many URLs contain affiliate links? And therefore, how many “money posts” are there on the domain?
In fact, if you want to know how often a webpage links out to Amazon, you can wrap the IMPORTXML formula in a COUNTA formula. This will output the number of links that are imported:
=COUNTA(IMPORTXML("URL", "//a[contains(@href, 'amazon.com')]/@href"))
Alternatively, you can take this further and count only the links to a specific domain that sit within a specific element (like the element with the class “main-content”):
=COUNTA(IMPORTXML("URL", "//*[@class='main-content']//a[contains(@href, 'amazon.com')]/@href"))
Here’s a good trick that I’ve never seen anyone else use: Want to know the ratio between money posts and non-money posts on an entire website?
Step 1. Copy the sitemap of a website into your spreadsheet. You can often find it by appending “/post-sitemap.xml” to the domain. This works for all WordPress websites that have Yoast installed (which is a lot of them!)
Step 2. Create 2 columns for your IMPORTXML formulas, one for each of the counts in the next two steps.
Step 3. Count the number of Amazon links within the content area of each post with this formula: =COUNTA(IMPORTXML("URL", "//*[@class='main-content']//a[contains(@href, 'amazon.com')]/@href"))
Step 4. Count the number of all links within the content area of each post with this formula: =COUNTA(IMPORTXML("URL", "//*[@class='main-content']//a/@href"))
Step 5. Divide the first column by the second column, and you’ll have a ratio of affiliate links versus all links.
Step 6. Calculate the average of all the ratios, and you’ll know how affiliate-heavy a website is! (See the sketch below for how this can look in practice.)
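Here’s a minimal sketch of those steps as spreadsheet formulas. The domain, the cell references, and the “main-content” class are assumptions you’ll need to adapt to your own sheet. The sitemap import uses local-name() because XML sitemaps use a namespace, which a plain //loc query doesn’t match:
Step 1, in A2 (pulls all post URLs into column A):
=IMPORTXML("https://example.com/post-sitemap.xml", "//*[local-name()='loc']")
Steps 3 and 4, in B2 and C2:
=COUNTA(IMPORTXML(A2, "//*[@class='main-content']//a[contains(@href, 'amazon.com')]/@href"))
=COUNTA(IMPORTXML(A2, "//*[@class='main-content']//a/@href"))
Step 5, in D2 (IFERROR guards against posts with no links at all):
=IFERROR(B2/C2, 0)
Step 6, at the top of column E:
=AVERAGE(D2:D)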
Import all elements that contain a string or keyword
Sometimes, I want to retrieve a specific data field of a webpage, like the author. In that case, you can use IMPORTXML to look for elements that contain a string, like so:
=IMPORTXML("URL", "//*[contains(text(), 'written by')]")
Check if a page contains a specific script
You can also check <script> elements, as long as they’re part of the initial HTML source the server returns.
If you want to check which pages contain the Google Analytics tag, use this formula:
=IMPORTXML("URL", "//script[contains(text(), 'tagmanager')]")
Extract the featured images from a list of pages
This is another fun one that demonstrates how powerful Google Sheets can be.
The best use case here is to copy the sitemap of a website into a spreadsheet. For each link, retrieve the URL of the featured image by using this formula:
=IMPORTXML("URL", "//meta[@property='og:image']/@content")
If the website’s owner is correctly using the Open Graph protocol, this is where the featured images will be stored in the HTML.
If a website uses Yoast, the featured image is automatically set as the Open Graph image. Perfect!
But you can take it one step further:
Wrap this IMPORTXML formula inside an IMAGE formula, and the featured image of each page will render directly inside your spreadsheet cells. The formula used here is:
=IMAGE(IMPORTXML("URL", "//meta[@property='og:image']/@content"))
Limitations of IMPORTXML
There are some things that this wonderful formula cannot do:
- Google imposes undisclosed limits on the frequency and volume of IMPORTXML requests. Excessive use can lead to temporary blocks or the formula returning errors (see the sketch below for a simple guard).
- IMPORTXML cannot retrieve content dynamically generated by JavaScript. It only fetches the HTML content as initially served by the server, so any modifications to the DOM (Document Object Model) made by client-side scripts after the page loads are not captured.
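There’s no official workaround for the rate limits, but a simple guard is to wrap your formulas in IFERROR so a temporarily blocked request shows a placeholder instead of an error. A sketch (the fallback text is just an assumed label, and note that IFERROR also swallows genuine “no match” errors):
=IFERROR(IMPORTXML("URL", "//title"), "Blocked or no result, try again later")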