How to QA Noindex Tags in a Staging Environment

Driven by Code
Published in
6 min read · May 16, 2022

By: Harrison DeSantis

Maximizing Google’s crawl and indexation efficiency is one of the most important pillars for a successful technical SEO program, especially for a large enterprise site like TrueCar. One major technical SEO concern is controlling which pages are indexed, as we don’t want to submit excessive pages to Google that are unworthy of indexation. However, the nature of staging environments makes it difficult to verify whether your indexation strategy is going to work as planned in a live environment. In this post, we’ll review some of the history around TrueCar’s indexing strategy and go in-depth into how we developed a QA process for testing meta noindex tags before promoting changes to the live site.

Photo by Maksym Kaharlytskyi on Unsplash

What’s a Meta Robots Tag?

A meta robots tag is an HTML snippet placed in the <head> of a page. It controls how search engines treat that page. It may look something like this:

<meta data-rh="true" name="ROBOTS" content="INDEX, FOLLOW"/>

One of the most common uses for the meta robots tag is the “noindex” directive. It looks like this:

<meta data-rh="true" name="ROBOTS" content="NOINDEX, NOFOLLOW"/>

A meta robots content="NOINDEX, NOFOLLOW" tag tells Google to neither index this page nor follow any of the links coded onto it.

What’s the Problem with Deploying Noindex Tags?

Before deploying any change to TrueCar.com, our team reviews the code in a staging environment to ensure no bugs are released. However, QA’ing meta robots bugs has been impossible because our staging site is already using the meta robots tag to tell Google, “Do not crawl or index this staging page. It’s only for testing.” Since that meta robots field is already being used to keep the staging site out of Google’s indexes, it cannot reflect what the meta robots field will read in production. Therefore, we have to wait until the change is merged, initiate a sitewide crawl with a bot emulator, compare the meta robots tags to the previous crawl, and hope that nothing incorrect was merged. It’s a risky workflow. If an indexation bug is merged, Google could catch it before our team does and remove valuable pages from its search results.

Solution: Faux Meta Robots and Botify Extract

In TrueCar’s staging environment, we include an additional header tag that looks very similar to a meta robots header, but with a slight difference. Whereas a normal header would read name="ROBOTS" prior to the directive, we add one that reads name="TC-PROD-ROBOTS" to reflect what the directive will be when the staging change gets merged. Because this header does not read name="ROBOTS", it is not a true meta robots tag. Therefore, Google does not acknowledge it as any kind of indexation instruction. It is completely inconsequential to the staging site’s indexability.

Real Meta Robots Tag: <meta data-rh="true" name="ROBOTS" content="NOINDEX, NOFOLLOW"/>

Faux Meta Robots Tag: <meta data-rh="true" name="TC-PROD-ROBOTS" content="NOINDEX, FOLLOW"/>
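To make the idea concrete, here is a minimal Python sketch of the rendering decision. This is not TrueCar’s actual template code; the function name, the env values, and the assumption that the intended production directive is known at render time are all illustrative.

```python
# Sketch only: decide which meta robots tags a page should render,
# based on environment and the directive intended for production.
# `build_robots_tags` is a hypothetical helper, not TrueCar's code.

def build_robots_tags(env: str, prod_directive: str) -> list[str]:
    """Return the meta robots tags to render for a page.

    env:            "production" or "staging"
    prod_directive: the directive the page should carry in production,
                    e.g. "INDEX, FOLLOW" or "NOINDEX, FOLLOW"
    """
    tag = '<meta data-rh="true" name="{name}" content="{content}"/>'
    if env == "production":
        # Production renders only the real directive.
        return [tag.format(name="ROBOTS", content=prod_directive)]
    # Staging always noindexes itself, but mirrors the intended
    # production directive in the faux TC-PROD-ROBOTS tag.
    return [
        tag.format(name="ROBOTS", content="NOINDEX, NOFOLLOW"),
        tag.format(name="TC-PROD-ROBOTS", content=prod_directive),
    ]
```

On staging, every page carries the real NOINDEX, NOFOLLOW plus a faux tag previewing production; on production, only the real tag is rendered.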

By adding this tag into the HTML, we now have a field we can extract to see what the index/follow status of a page will be. But we still need a way to validate this en masse, as we can’t check all our URLs one by one. This is where Botify (an enterprise site crawling platform) comes into play. With some custom coding magic and one simple HTML extract rule in Botify, there is now a scalable solution to see what the staging version’s index status will be once merged.

Botify HTML Extract

Now that there’s a faux meta robots tag in the staging code, we just need to run a crawl to extract that “TC-PROD-ROBOTS” value. Here’s how to do that.

  • Gather a list of sample URLs where we expect meta robots changes to take place. Put those URLs into a txt file. Let’s call it List.txt.
  • Take those same URLs, copy them into a separate text file, and search/replace the production domain with the staging domain. Ensure the staging URLs resolve in a Chrome browser. Then, add those staging URLs to the same txt file. List.txt now has two versions of each URL: a production version and a staging version.
  • Create an ad-hoc crawl project in Botify.
  • Under Settings > Crawler, upload our newly created List.txt.
  • Under the “Allowed Domains” section, input two domains: the staging and the production sites.
  • Create an HTML extract regex rule that searches for the faux meta robots field and returns the string. Name the rule something memorable (like “Faux Meta Robots”).
  • Create another HTML regex rule that searches for the real meta robots field and returns the string. Name the rule something memorable (like “Prod Meta Robots”).
  • After testing both rules to ensure they retrieve the strings, run the crawl!
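The two extract rules can also be spot-checked locally before setting them up in Botify. The sketch below is a rough Python equivalent — the regexes, domain names, and function names are illustrative assumptions, not Botify’s rule syntax.

```python
import re

# Illustrative stand-ins for the two Botify extract rules. These regexes
# assume the attribute order shown in the tags above.
FAUX_ROBOTS_RE = re.compile(r'<meta[^>]*name="TC-PROD-ROBOTS"[^>]*content="([^"]*)"')
PROD_ROBOTS_RE = re.compile(r'<meta[^>]*name="ROBOTS"[^>]*content="([^"]*)"')

def extract_robots(html: str) -> dict:
    """Return the real and faux meta robots values found in a page's HTML."""
    prod = PROD_ROBOTS_RE.search(html)
    faux = FAUX_ROBOTS_RE.search(html)
    return {
        "prod_meta_robots": prod.group(1) if prod else None,
        "faux_meta_robots": faux.group(1) if faux else None,
    }

def build_crawl_list(prod_urls, prod_domain, staging_domain):
    """Step 2 above: pair each production URL with its staging counterpart."""
    crawl_list = []
    for url in prod_urls:
        crawl_list.append(url)
        crawl_list.append(url.replace(prod_domain, staging_domain))
    return crawl_list
```

Running `extract_robots` against a staging page should return both the real NOINDEX, NOFOLLOW value and the faux production value side by side.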

Compare those “TC-PROD-ROBOTS” values to the “ROBOTS” values on production.

Once the Botify crawl is completed, enter Site Crawler > URL Explorer (in the Botify platform) and make a 3-column report. The only columns to select are as follows:

  • URL Card
  • Prod Meta Robots
  • Faux Meta Robots

Export this 3-column report to Excel.

To avoid making this post too Excel-heavy, we’ll just describe the desired output from this export. Strip the domain from each URL to get a domain-free slug (or “key”); because the production and staging versions of a page share the same slug, that key can be used to VLOOKUP against both the Prod and Faux Meta Robots columns. When this is complete, there will be a nice side-by-side comparison of tag status in both the production and staging environments. The output should be a simple 3-column report that’s very similar to the original export (but with half as many rows and a domain-free “URL Slug” instead of the full “URL Card”):

  • URL Slug
  • Prod Meta Robots
  • Faux Meta Robots
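For those who would rather script the join than VLOOKUP it, here is a rough Python equivalent of building that side-by-side report. The staging domain is a placeholder; the input rows mirror the 3-column export.

```python
from urllib.parse import urlparse

def side_by_side(rows, staging_domain="staging.example.com"):
    """Build the side-by-side comparison keyed on a domain-free slug.

    rows: (url, prod_meta_robots, faux_meta_robots) tuples straight from
    the 3-column export, containing both the production and staging
    versions of each page. The staging domain is a placeholder assumption.
    """
    merged = {}
    for url, prod_value, faux_value in rows:
        parsed = urlparse(url)
        slug = parsed.path  # dropping the domain lets prod/staging rows align
        entry = merged.setdefault(slug, {"prod": None, "faux": None})
        if parsed.netloc == staging_domain:
            entry["faux"] = faux_value  # staging rows supply the faux tag
        else:
            entry["prod"] = prod_value  # production rows supply the real tag
    return merged
```

Each slug ends up with exactly one “prod” value (from the production crawl) and one “faux” value (from the staging crawl), which is the same shape as the 3-column Excel report.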

QA the Staging Noindexes

We now have a list showing the URL slug and what the meta robots value is on both the prod site and staging site. The last thing to do is add another column with a simple formula: Faux meta robots cell = Prod meta robots cell.

Example: =D2=E2

If the two meta robots values are the same, the formula will return TRUE, and we can expect these URLs to merge with the same index values they have now. If it returns FALSE, the values differ, and there will be a change to these URLs’ index status. We’ll need to investigate those URLs and identify whether that change was intentional.
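The same check is trivial to script. A minimal sketch, assuming the side-by-side data is a dict mapping each slug to its production and faux (staging) meta robots values:

```python
def find_mismatches(report):
    """Return the slugs whose index status will change once merged.

    report: {slug: {"prod": <real ROBOTS value>, "faux": <faux value>}},
    i.e. the side-by-side comparison described above. A mismatch is the
    scripted equivalent of the FALSE rows in the Excel formula.
    """
    return sorted(slug for slug, vals in report.items()
                  if vals["faux"] != vals["prod"])
```

Anything this returns is a URL whose indexation will change on merge and deserves a closer look.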

Conclusion: Faux Noindex Has Been an SEO QA Game Changer

Since implementing this solution, our team has been able to QA for meta robots changes in our staging environment and prevent any unexpected changes from going live. This has helped us avoid unexpected dips in indexable pages and the headaches that follow. For companies needing a solution around how to QA meta robots tags before they go live, the faux meta tag method has helped TrueCar ensure our indexation strategy is deployed as planned.


Welcome to TrueCar’s technology blog, where we write about the interesting things we’re working on. Read, engage, and come work with us!