AEM SEO | robots.txt | sitemap.xml | Troubleshoot URL Indexing | Block Lower-Environment URLs from Indexing through Server Configuration
In this article, I’ll talk about approaches you can take to make sure your site works well with search engines. As we know, “there are many ways of solving a problem,” so I’ll restrict myself to describing a few common practices for Search Engine Optimization (SEO):
1. Server Configuration: ensure that only the correct content is being crawled.
2. Blocking search indexing:
2.1. Blocking our lower environments (dev, stage, …) from getting indexed by search engines.
2.2. Blocking search indexing with a noindex <meta> tag.
3. Troubleshooting existing AEM sites through Google Search Console: removing unwanted or lower-environment URL(s) that are indexed already.
Before we jump into server configuration, let’s understand how search engines work and how exactly your sites get indexed and served to your customers/visitors. Search engines have two main jobs:
A). Crawling the web to discover content and
B). Indexing that content so that it can be served up to searchers who are looking for information.
To crawl sites, search engines follow links to get from one site to another, ultimately crawling across many billions of links and websites. Now assume your new site has just gone live: how long will it take to crawl and index it? So, how do you get your pages indexed by search engines faster? Unfortunately there is no straight answer to this question. Here is what Google has to say about it:
“Crawling can take anywhere from a few days to a few weeks. Be patient and monitor progress using either the Index Status report or the URL Inspection tool. Requesting a crawl does not guarantee that inclusion in search results will happen instantly or even at all.”
Yes, “there are many ways of solving the problem.” There are some recommendations for straightening out that labyrinth of URLs and helping crawlers find more of your content faster. One of them is an optimized server configuration (sitemap and robots.txt).
1. Server Configuration — To ensure that only the correct content is being crawled.
- To make it easier for search engines to crawl your content, implement an XML sitemap. Make sure to include a mobile sitemap for mobile and/or responsive sites.
- Use a robots.txt file to block crawling of any content that should not be indexed. Block all crawling on test environments.
- When launching a new site with updated URLs, implement 301 redirects to ensure that your existing SEO ranking is not lost.
- Include a favicon for your site.
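For the 301-redirect point above, a minimal sketch at the dispatcher/web-server level (the old and new paths here are hypothetical, and mod_rewrite must be enabled):

```apache
# Hypothetical example: permanently redirect legacy URLs to the new
# structure so existing search rankings carry over to the new pages.
RewriteEngine On
RewriteRule ^/old-section/(.*)$ /new-section/$1 [R=301,L]
```

The R=301 flag tells crawlers the move is permanent, which is what transfers ranking signals; a temporary (302) redirect would not.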
Let’s talk about robots.txt; understanding it will also help you understand the importance of the XML sitemap. A robots.txt file gives you more control over what, and more importantly what not, to crawl (such as PDFs or other asset URLs). But how? Here is a sample robots.txt that fedex.com uses (https://www.fedex.com/robots.txt):
The /robots.txt file is publicly available: just add /robots.txt to the end of any root domain to see that website’s directives (if the site has a robots.txt file). This means anyone can see which pages you do or don’t want crawled, so don’t use it to hide private user information.
User-agent: *
Allow: /
Allow: /de/careers/?*
Allow: /?location=home
Allow: /Tracking?cntry_code=us
Allow: /apps/fedextrack/?action=track
Allow: /en-us/home.html?location=home
Allow: /locate/index.html?locale=en_US
Allow: /en-us/tracking.html?action=track
Allow: /global/choose-location.html?location=home
Allow: /ratefinder/home?cc=US&language=en&locId=express
Allow: /apps/fedextrack/?action=track&cntry_code=us&freight=yes
Allow: /fedextrack/?cntry_code=us&tab=1&tracknums=&clienttype=wtrk
Disallow: /*?*
Disallow: /libs/
Disallow: /en-us/quick-help/*
Disallow: /en-us/GDCMTestSites/*
Disallow: /us/developer/WebHelp/*
Disallow: /content/dam/fedex-com/hdn/
Sitemap: https://www.fedex.com/en-us/sitemap.xml
Sitemap: https://www.fedex.com/global/sitemap_index.xml
This file allows all search-engine bots to crawl the site subject to certain rules, expressed with Allow: and Disallow: statements. You’ll also find the sitemaps associated with this domain at the bottom (Sitemap: path/sitemap.xml). Validate your robots.txt with a robots.txt testing tool before publishing it.
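You can also sanity-check prefix rules programmatically. A small sketch using Python’s standard-library parser (the paths and domain below are illustrative, not from the FedEx file; note that urllib.robotparser implements plain path-prefix matching only, not Googlebot’s wildcard extensions such as `Disallow: /*?*`):

```python
# Sketch: how a crawler interprets simple robots.txt prefix rules,
# using Python's standard-library robots.txt parser.
from urllib.robotparser import RobotFileParser

# Illustrative rules: block the /libs/ and private DAM paths for all bots.
rules = """\
User-agent: *
Disallow: /libs/
Disallow: /content/dam/private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Ordinary pages remain crawlable; the disallowed prefixes are blocked.
print(parser.can_fetch("*", "https://www.example.com/en-us/about.html"))  # True
print(parser.can_fetch("*", "https://www.example.com/libs/foo.js"))       # False
```

This is handy in a CI check to make sure a production robots.txt never accidentally disallows the whole site.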
Wait a minute: where does robots.txt go on an AEM site?
To ensure your robots.txt file is found, upload it to a whitelisted DAM folder (where you can manage robots.txt files for multiple domains) and add a dispatcher rewrite rule so that user agents find the file at www.example.com/robots.txt. Otherwise it will not be discovered, and the site will be treated as if it had no robots.txt file at all.
RewriteRule ^/robots.txt$ /content/dam/yourproject/global/robots/exampledotcom/robots.txt [PT,L]
Now, let’s talk about Sitemap XML. How does it help?
Sitemaps are a good way to indicate which content Google should crawl, as opposed to which content it can or cannot crawl. Here is a sample sitemap index and sitemap.xml:
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.fedex.com/ar-ae/sitemap.xml</loc>
</sitemap>
<sitemap>
<loc>https://www.fedex.com/ar-sa/sitemap.xml</loc>
</sitemap>
<!-- additional <sitemap> entries omitted -->
</sitemapindex>
-------------------------------------------------------------------------------------------------------------------------------
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://ex1.host.com/content/yourproject/levelone/leveltwo/home.html</loc>
<lastmod>2022-12-24</lastmod>
</url>
<url>
<loc>https://ex1.hostname.com/content/yourproject/levelone/leveltwo/terms-of-use.html</loc>
<lastmod>2023-01-03</lastmod>
</url>
</urlset>
There are many ways of generating sitemap.xml in AEM, such as the Apache Sling Sitemap module, the ACS AEM Commons sitemap servlet, or a custom servlet.
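To make the format concrete, here is a minimal sketch that builds a sitemap.xml like the sample above from a list of pages (the domain, paths, and dates are illustrative):

```python
# Sketch: generate a minimal sitemap.xml from (path, lastmod) pairs.
import xml.etree.ElementTree as ET

def build_sitemap(base_url, pages):
    """pages: iterable of (path, lastmod-string) tuples."""
    urlset = ET.Element("urlset",
                        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for path, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = base_url + path
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

xml = build_sitemap("https://www.example.com", [
    ("/en-us/home.html", "2022-12-24"),
    ("/en-us/terms-of-use.html", "2023-01-03"),
])
print(xml)
```

In a real AEM project this logic would typically live in a servlet or be handled by a sitemap module, but the output structure is the same.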
Now you know that having these two files will boost your website’s SEO. Let’s put them to work and resolve the scenarios below:
1. Your website is live and you want to control access to your site on a page-by-page basis. For example, some of your published pages are indexed and you want them to disappear from search results, and you don’t have root access to your server.
2. How do you prevent lower environments from getting indexed by search engines?
3. What if your lower-environment URLs are already indexed by Google? How do you remove them and block them from getting indexed again?
Let’s talk about it…
Scenario 1: A very simple and direct fix is to implement the noindex <meta> tag, which gives you control on a page-by-page basis. To prevent all search engines that support the noindex rule from indexing a page on your site, place the following <meta> tag in the <head> section of the page:
<meta name="robots" content="noindex">
The <meta> tag approach has its own limitations: you can use it to hide one or more pages in production, but using it across lower environments (dev, stage, …) quickly becomes unmanageable.
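One more caveat: a <meta> tag only works for HTML pages. For assets such as PDFs (mentioned earlier as something you may want to keep out of the index), the same noindex rule can instead be sent as an HTTP response header. A hedged Apache sketch, assuming mod_headers is enabled (the file pattern is illustrative):

```apache
# Send the noindex rule as an HTTP header for PDF responses,
# since a <meta> tag cannot be embedded in non-HTML assets.
<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex"
</FilesMatch>
```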
Scenario 2: To prevent lower environments from getting indexed, it’s better to block them with a separate robots.txt per environment (one for dev publish, another for stage publish, and so on). Add the following to your lower environments’ robots.txt and publish it. Whenever crawlers visit these sites, robots.txt will instruct them not to crawl this instance at all:
User-agent: *
Disallow: /
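One way to wire this up, following the same DAM-plus-rewrite pattern shown earlier: point each environment’s dispatcher vhost at its own robots.txt in the DAM (the paths below are hypothetical):

```apache
# In the dev publish dispatcher vhost: serve an environment-specific
# robots.txt that disallows all crawling for this instance.
RewriteRule ^/robots.txt$ /content/dam/yourproject/global/robots/dev/robots.txt [PT,L]
```

The production vhost keeps its own rule pointing at the production robots.txt, so the same codebase serves different crawl policies per environment.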
Scenario 3: What if these URLs are already indexed and you want to remove them from search results? Use Google Search Console (create an account if you don’t have one already) and verify ownership of the domain (several verification methods are offered). Once you have added the property (your site), you can submit a URL removal request. When the request is submitted, it will bring down all the indexed stage or dev URLs from Google, though not necessarily immediately. Here is a snapshot of the request:
Yay! You have successfully cleared the indexed URLs, but make sure to add robots.txt for the lower environments (scenario 2) so that they won’t get indexed again in the future.
3. Troubleshooting Existing AEM sites through Google Search Console
3.1 URL Inspection: Indexed/Not Indexed
Check whether a noindex <meta> tag is already present on the page, and review the Excluded section of the report to see why a URL was left out of the index.
3.2 Submit new sitemap.xml
You can request re-indexing by submitting the path of your updated sitemap.xml.
Thanks for your time! Hope this was helpful. Do give it a like.
Regards,
Nishant Tiwary | AEM Developer
https://www.linkedin.com/nishant-tiwary