Really Easy Custom Extractions To Boost E-commerce SEO Performance

David Gossage · Published in Interaction · 8 min read · Feb 14, 2018

Important: In this guide we will provide some examples of individual pages to crawl. Please DO NOT CRAWL ENTIRE SITES unless you have permission from the website owner.

For any technical SEO project, crawling tools are essential. They are simply the best way to gather large amounts of data about your website, identify key issues and opportunities, and make our jobs a lot easier.

Our favourite crawling tools are DeepCrawl, for the depth of analysis and intuitive interface, and Screaming Frog, for its speed and ease of use. Both of these include custom extraction tools which can provide a vast range of awesome insights into your website. This is particularly useful for large ecommerce sites, which we will focus on here, where manual checks are not always feasible.

In this article, we want to share some of our favourite custom extractions with simple guides on how to do them with only a small amount of HTML knowledge.

Whilst this guide is aimed at beginners, we hope that there will be something that all SEOs can take away.

The Basics

Let’s start with the absolute basics by asking: “What is a custom extraction?”

Don’t worry, it’s nothing to do with the dentist. Instead of describing it, let’s give you a crash course in how to use it.

At the top of this page, the estimated reading time is displayed. If you want to crawl the site and sort the articles by their reading time, you can do that. In Google Chrome, right click on the reading time and click ‘Inspect’. You will see the code for the text as displayed below:

<span class="readingTime" title="8 min read"></span>

To run this in DeepCrawl, navigate to ‘Settings > Advanced > Extraction’ within a website report. Click ‘New Extraction’ and add <span class="readingTime" title="(.*?)"></span> in the settings, as in the example below:

The (.*?) is what makes this work: it is a lazy capture group that tells your crawler to extract whatever text sits in that position, here the value of the title attribute.
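If you want to sanity-check a pattern before running a full crawl, you can test it against a sample of the page’s HTML. Here is a minimal Python sketch using the standard re module (the sample HTML string is just for illustration):

import re

# The same pattern used in the crawler: (.*?) lazily captures
# whatever sits between title=" and the closing quote.
pattern = r'<span class="readingTime" title="(.*?)"></span>'

sample_html = '<span class="readingTime" title="8 min read"></span>'

match = re.search(pattern, sample_html)
if match:
    print(match.group(1))  # -> 8 min read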

Once the crawl is completed, go to the bottom of the sidebar in your reporting page and navigate to ‘Extraction > Custom Extraction > [your extraction name]’ to find the results.

To perform this extraction in Screaming Frog, go to ‘Configuration -> Custom -> Extraction’ as in the screenshot below:

Select ‘Regex’ from the first drop down menu and enter <span class="readingTime" title="(.*?)"></span> into the field, as in the following screenshot.

Then crawl this page. Go on, do it!

Once complete, click on the ‘Custom’ tab and select ‘Extraction’ from the drop down menu. There you can see the reading time for the selected page.

Simple as that! There are a number of possible uses for this tool, from finding pages on a blog written by a particular author to extracting the destinations of certain links.

Using XPath

An alternative to using regex snippets in Screaming Frog is to copy the XPath of the HTML element. If you right click on the text in Chrome and click ‘Inspect’, the code will appear. Right click on the appropriate HTML tag, then select ‘Copy -> XPath’.

Then paste it into the Custom Extraction field in Screaming Frog, selecting ‘XPath’ from the drop down menu:

The main advantage of using XPath is that the element you copy doesn’t have to be unique. For example, if you want to copy a <div> but it has no class, it will still work. The disadvantage is that sometimes the XPath can vary from page to page, even if it is part of a template.

Since this guide is aimed at relative beginners, the remaining examples will mainly use regex. However, feel free to play around with XPath, as it can be a lot more powerful if done right.
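To see the difference in practice, here is a short Python sketch using the lxml library (the URL is a placeholder). Note that the XPath below targets the element by its class rather than using the path Chrome copies for you, which tends to be more stable across pages:

import requests
from lxml import html

url = "https://example.com/blog/article"  # placeholder URL
tree = html.fromstring(requests.get(url, timeout=10).content)

# Chrome's copied XPath is often absolute, e.g. /html/body/div[2]/span[1],
# which can break between templates. A class-based XPath is usually sturdier:
reading_times = tree.xpath('//span[@class="readingTime"]/@title')
print(reading_times)  # e.g. ['8 min read']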

Thin Categories

One big problem for ecommerce websites is keeping track of thin or empty categories. If your website has a large number of categories which contain few or no products, search engines can see this as a sign of poor user experience.

Too many of these pages can cause your search rankings and conversion rate to suffer. Using a simple custom extraction can help to monitor these pages to ensure a healthy website architecture which benefits the user.

Have a look at this example from Tesco. The page displays the number of products within the category in the top left, like in the screenshot below:

Note: This is common on many ecommerce websites. If yours does not display this, it is worth raising a request with your web developers.

You can easily find out how many products are in every category by creating a custom extraction for this element.

For this example, the regex required is <div class="filter-productCount">(.*?)</div>, which can be obtained using the method described earlier based on the code in the screenshot below:

A crawl using this extraction will help you to identify how many products are in every category, thus allowing you to audit your overall architecture and decide if any pages need to be removed. Since thin pages can negatively affect SEO performance, it may be worth removing categories containing few or no products.
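As a rough illustration, here is a Python sketch that flags thin categories from a CSV export of the extraction results. The column names and the threshold of five products are assumptions; adjust them to match your own export and site:

import csv

THIN_THRESHOLD = 5  # assumption: flag categories with fewer than 5 products

# Assumes a CSV export of the crawl with 'url' and 'product_count' columns,
# where product_count holds the extracted text, e.g. "3 products".
with open("category_extraction.csv", newline="") as f:
    for row in csv.DictReader(f):
        digits = "".join(ch for ch in row["product_count"] if ch.isdigit())
        count = int(digits) if digits else 0
        if count < THIN_THRESHOLD:
            print(f"Thin category ({count} products): {row['url']}")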

Alternatively, categories containing a large number of products could represent an opportunity to expand the website, create new sub-categories and increase overall organic visibility.

Checking Product Stock Levels

It is vitally important that users can buy the products they see; otherwise they may go looking for the same product elsewhere. Whilst it is perfectly normal for a website to run out of a particular product, you need to be careful that out-of-stock pages don’t take over the site.

For this example, the stock status is displayed as in the screenshot below:

Be careful: some websites will use different HTML code depending on whether or not the product is in stock. For this example, we will need to run two separate extractions at the same time using the following regex queries:

<p class="availability out-of-stock"><span>(.*?)</span></p>

<p class="availability in-stock"><span>(.*?)</span></p>

Using both extractions together will provide a list of all the products which are in stock or out of stock. Too many out of stock products can signal a poor quality site and can negatively affect user engagement and search rankings. If a large percentage of products are out of stock then there may be a need to remove any which are unlikely to return to the website.
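A quick way to test both patterns is a small Python script like the one below, which classifies a page from its HTML. The sample markup is illustrative; check the exact classes your own templates use:

import re

IN_STOCK = re.compile(r'<p class="availability in-stock"><span>(.*?)</span></p>')
OUT_OF_STOCK = re.compile(r'<p class="availability out-of-stock"><span>(.*?)</span></p>')

def stock_status(page_html: str) -> str:
    """Classify a product page using the two extraction patterns."""
    if OUT_OF_STOCK.search(page_html):
        return "out of stock"
    if IN_STOCK.search(page_html):
        return "in stock"
    return "unknown"  # the markup may differ on this template

# Illustrative snippet of product-page HTML:
sample = '<p class="availability out-of-stock"><span>Out of stock</span></p>'
print(stock_status(sample))  # -> out of stock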

Duplicate Content

One common issue on ecommerce websites is duplicate product descriptions across a number of pages, particularly where similar products are available in different sizes, styles or colours. This should be avoided, but can be hard to keep track of on larger websites.

DeepCrawl has a built in feature which can find duplicate content for you. However, if you want to get results for a specific area of the page (such as the product description) or check for close matches then a custom extraction of the specific area of text is recommended.

For this example, I have used a product page from Pottery Barn below:

We want to check the highlighted text for duplication across the site. Using Inspect Element, we know that the text is contained within the following code:

Therefore, we want to run a custom extraction for the following regex:

<div class="accordion-tab-copy">(.*?)</div>

Once completed, use Excel to find duplicates or use a fuzzy lookup to find close matches. If a significant number of products are using duplicate or templated content, it will be worth auditing their search performance to see if they are being hindered.
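If you would rather stay out of Excel, the same checks can be scripted. The Python sketch below groups exact duplicates and uses difflib to flag close matches; the sample data and the 0.9 similarity threshold are assumptions for illustration:

from collections import defaultdict
from difflib import SequenceMatcher

# Assumes you have exported the extraction as (url, description) pairs.
pages = [
    ("/rug-blue", "A hand-tufted wool rug in a classic pattern."),
    ("/rug-red", "A hand-tufted wool rug in a classic pattern."),
    ("/lamp", "A brass table lamp with a linen shade."),
]

# Exact duplicates: group URLs by identical description text.
groups = defaultdict(list)
for url, desc in pages:
    groups[desc.strip().lower()].append(url)
for desc, urls in groups.items():
    if len(urls) > 1:
        print("Duplicate description:", urls)

# Near duplicates: flag pairs above a similarity threshold (0.9 is an assumption).
for i, (url_a, a) in enumerate(pages):
    for url_b, b in pages[i + 1:]:
        ratio = SequenceMatcher(None, a, b).ratio()
        if 0.9 <= ratio < 1.0:
            print(f"Close match ({ratio:.2f}): {url_a} vs {url_b}")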

If you find a multitude of products with the same or similar descriptions, it’s likely that they would benefit from being merged into a smaller number of configurable products. These allow a user to select their size or colour from a single page instead of navigating between different product pages.

GA Implementation

If you are using Google Analytics (other tracking tools are available) then it is vital that the tracking code is present on every page. Fortunately, there is a REALLY easy piece of regex code you can use to check this:

(UA-[0-9]+-[0-9]+)

If you are using DeepCrawl, this is actually one of the preset options, which makes it even simpler.

Alternatively, the same code will also work in Screaming Frog. If there are any pages with missing tracking code then the report will show a blank row. Easy peasy.
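You can also run the same check outside a crawler with a few lines of Python, for example to spot-check a handful of URLs (the URLs below are placeholders):

import re
import requests

GA_PATTERN = re.compile(r"UA-[0-9]+-[0-9]+")

# Placeholder list of URLs from your crawl export.
urls = ["https://example.com/", "https://example.com/category/"]

for url in urls:
    page = requests.get(url, timeout=10).text
    match = GA_PATTERN.search(page)
    print(url, "->", match.group(0) if match else "MISSING tracking code")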

Number of Reviews / Review Score

Reviews are a great way to boost your conversion rate, and your click-through rate too if your structured data is set up correctly. If you have a large number of reviews on a site, a custom extraction can be an effective way to find areas of the site where reviews are either low in quantity or poor in quality. Once a list of problem products has been identified, a strategy can be produced to encourage customers to leave positive reviews in the most important areas.

Websites display their review scores in a variety of ways, so have a play with your own site and work out the best way to implement your regex.

For this example, the review is displayed on the website as below:

In this snippet, using Inspect Element, we learned that the number of reviews is contained within <p class="rating-links">(.*?)</p>. The rating score is generated using <div class="rating" style="width:93%">, where 93% is the average score converted into a star rating. Therefore, we can use the regex <div class="rating" style="width:(.*?)"> to obtain the average review score for each product.
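Putting both patterns together, a short Python sketch shows how the extracted values can be turned into something useful, including converting the percentage width into a five-star score (93% of 5 stars is 4.65):

import re

# Illustrative snippet of product-page HTML:
sample = (
    '<p class="rating-links">12 Review(s)</p>'
    '<div class="rating" style="width:93%">'
)

reviews = re.search(r'<p class="rating-links">(.*?)</p>', sample)
width = re.search(r'<div class="rating" style="width:(.*?)">', sample)

print(reviews.group(1))  # -> 12 Review(s)

pct = float(width.group(1).rstrip("%"))  # -> 93.0
print(f"{pct / 100 * 5:.2f} stars")      # -> 4.65 stars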

These are just a few of the ways we use custom extractions at We Influence, but there is an incredible amount you can do with them to help improve the SEO and conversion performance of your website. If you have your own ideas for custom extractions or need any help setting them up, then let us know.

David Gossage
Senior Technical SEO Executive at We Influence. Self-confessed nerd and passionate about digital marketing.