Anomalies in measuring site indexation using Google Search Console

Pady Nair
Jun 8, 2017

If you have worked on search, you care about indexation. Indexation of your pages is one of the key metrics used to measure the health of a site. What is indexation? A page you create on your site and submit to Google via your sitemap file first gets crawled by Google and then gets added to the Google search index. Not all pages in the sitemaps get crawled, and not all pages crawled by Google get indexed. Crawling and indexing are two fundamental concepts in how a search engine works. Remember that a page needs to be in the Google index before Google can show it in search results and users can reach it.

Measuring indexation

How do you determine if your site is indexed properly?

There are a few different ways to measure indexation of your site. There is a “Google Index” menu item in Google Search Console (GSC) that shows the overall status of the pages indexed for the site. This data is trended over a year and is useful for spotting any major changes in the number of pages indexed.

The other option is to look at the Crawl->Sitemaps section in GSC. This section shows the number of indexed pages from your sitemaps.

You can also use third-party tools such as Screaming Frog, depending on the scale of your site. For a large site, accurately determining all indexed pages can be tricky.
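
As a starting point, you can at least confirm how many URLs you are actually submitting. Below is a minimal sketch (the sitemap URL is a placeholder, not one of ours) that counts the URLs in a sitemap so the figure can be compared against the indexed count GSC reports for that sitemap.

```python
# Count the URLs submitted in a sitemap, to compare against the "indexed"
# figure shown in GSC's Crawl -> Sitemaps view. Sitemap URL is a placeholder.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(SITEMAP_URL) as resp:
    tree = ET.parse(resp)

urls = [loc.text for loc in tree.getroot().findall("sm:url/sm:loc", NS)]
print(f"URLs submitted in sitemap: {len(urls)}")
```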

Display anomaly in Google Search Console

For one of our local language sites we found that the number of pages indexed from sitemaps was very low compared to the number of pages submitted.

We did some investigation to understand what was going on:

  • The overall index status showed a stable trend over the last year, with only expected changes
  • We looked into crawl behaviour for the site using server logs and found it in line with other similarly sized sites
  • We checked for common issues such as duplicate pages, incorrect canonical URLs and redirect chains to validate that the site was functioning as expected (a simple canonical check of this kind is sketched after this list)
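
For illustration, here is a minimal sketch of the kind of canonical check referred to above, using only the Python standard library; the page URL is a placeholder.

```python
# Fetch a page and confirm its rel="canonical" tag points back to the page
# itself. The URL below is a placeholder for illustration only.
import urllib.request
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "canonical":
            self.canonical = attrs.get("href")

def canonical_for(url):
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical

page = "https://example.com/some-page"  # placeholder
print(page, "->", canonical_for(page))
```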

The key difference for this site compared to our other language sites was that the URLs were localised using a non-Latin character set. The sitemaps carried these URLs percent-encoded into ASCII, as Google recommends, which produces sequences such as “B4%E0%B8%95”.
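
In Python, this percent-encoding and decoding looks like the sketch below; the non-Latin path segment is purely hypothetical, not one of our actual URLs.

```python
# Percent-encode a localised URL path for a sitemap, then decode it back.
# The Thai path segment is a hypothetical example.
from urllib.parse import quote, unquote

path = "/สตรี/บทความ"  # hypothetical non-Latin path
encoded = quote(path, safe="/")
print(encoded)           # e.g. /%E0%B8%AA%E0%B8%95%E0%B8%A3%E0%B8%B5/...
print(unquote(encoded))  # round-trips back to the original path
```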

Google didn’t have any problems showing the properly localised URLs in search results, and we did not see any issues with Googlebot crawling the URLs. So the hypothesis was that this must be an issue with how Google Search Console displays the sitemap indexation data. To validate this hypothesis, all the URLs for this site were transliterated to the Latin character set, and all the non-Latin URLs were 301 redirected to their Latin equivalents. Within a couple of weeks we could see a big difference in the sitemap indexation data, as shown below. There is still a long way to go as Google works through the refactored URLs, but this clearly shows that the fundamental issue was with the way GSC handled the sitemap URLs and not with the actual site indexation. The glitch in handling localised URLs in GSC resulted in a low indexed-page count being reported when, in fact, most of our pages were already in the index.
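
One way to sanity-check a migration like this is to confirm that each old URL returns a 301 with the expected Location header. Here is a minimal standard-library sketch; the old URL is a placeholder.

```python
# Check that an old (non-Latin) URL answers with a 301 and a Location header,
# without following the redirect. The URL is a placeholder.
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # don't follow; surface the redirect response instead

opener = urllib.request.build_opener(NoRedirect)
old_url = "https://example.com/%E0%B8%AA%E0%B8%95%E0%B8%A3%E0%B8%B5"  # placeholder
try:
    resp = opener.open(old_url)
    print(old_url, "did not redirect; status", resp.status)
except urllib.error.HTTPError as err:
    print(old_url, "->", err.code, err.headers.get("Location"))
```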

In conclusion

Use GSC to your heart’s content, as it is the most useful data you can get from Google about organic search, but watch out for data anomalies and issues such as the one outlined above. Validate any hypothesis with multiple sources of data, such as server logs to understand crawl behaviour and web analytics (Omniture, Google Analytics, etc.) to look at landing page performance, before arriving at a conclusion.
