AEM Gyaan Time: A Guide to Configuring Robots.txt for AEM as a Cloud Service

Veena Vikraman
4 min read · Sep 27, 2023


A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or password-protect the page.

https://developers.google.com/search/docs/crawling-indexing/robots/intro
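
For reference, a minimal robots.txt that allows all crawlers and points them to a sitemap could look like the sketch below (the domain and sitemap URL are just placeholders for this example):

# Allow all crawlers to access the entire site
User-agent: *
Disallow:

# Optional: tell crawlers where the sitemap lives (placeholder URL)
Sitemap: https://xyz.com/sitemap.xml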

Recently, while setting up robots.txt on AEM as a Cloud Service, I encountered several issues that required extensive research and Googling to find appropriate solutions. Although the solutions were available, they were scattered and not adequately documented in a single, easily accessible location.

If you’re looking to add a robots.txt file to AEM as a Cloud Service, there are two common approaches:

Option 1: Adding Robots.txt to DAM

You can upload the robots.txt file directly to the DAM and enable access to it through dispatcher rules. Here’s how to do it:

  1. Create your robots.txt file and upload it to DAM. Let’s assume you’re uploading it to /content/dam/xyzproject/robots.txt.
  2. Check your default_filters configuration located at dispatcher/src/conf.dispatcher.d/filters/default_filters.any. You'll notice that by default, the txt extension is not allowed access from DAM.
# This rule allows content to be accessed
/0010 { /type "allow" /extension '(css|eot|gif|ico|jpeg|jpg|js|gif|pdf|png|svg|swf|ttf|woff|woff2|html|mp4|mov|m4v)' /path "/content/*" } # disable this rule to allow mapped content only

Important: Do not make any changes to default_filters.any. Instead, create a new filters.any file in the same location for all the filters related to your project, and include default_filters.any as the first line of that file. In most projects created using the latest archetype, a filters.any file is already provided by default.
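
For reference, a minimal project filters.any might look like the following sketch, with the default filters included first and project-specific rules added after them:

# Include the AEM-provided default filter rules first
$include "./default_filters.any"

# Project-specific filter rules go below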

To allow the filters for txt to be accessed from DAM, you can add a rule similar to the following:

# This rule allows txt content to be accessed
/0111 { /type "allow" /extension "txt" /path "/content/dam/xyzproject/*" }

Once the above is done, your robots.txt file will be accessible via your full URL, something like https://xyz.com/content/dam/xyzproject/robots.txt.

3. The next step is to enable access via https://xyz.com/robots.txt. To achieve this, you need to add a rewrite rule. You can add something similar to the following in your custom rewrite rule file located at dispatcher/src/conf.d/rewrites/rewrite.rules:

#####################################
# rewrite for robots.txt
#####################################
RewriteRule ^/robots\.txt$ /content/dam/xyzproject/robots.txt [PT,L]

Explanation of the Rewrite Rule:

RewriteRule: This is a directive for Apache's mod_rewrite module, which allows you to perform URL rewriting.

^/robots\.txt$: This is a regular expression pattern that matches requests for /robots.txt. The ^ and $ symbols denote the start and end of the URL, ensuring an exact match, and the backslash escapes the dot so it matches a literal period.

/content/dam/xyzproject/robots.txt: This is the internal path to your robots.txt file in DAM, which is the target of the rewrite.

[PT,L]: These are flags used with the rewrite rule:

PT (Pass Through): This flag tells Apache to hand the rewritten URL back to the URL-mapping pipeline as a URI, so that modules which run later (such as the dispatcher module) can still process it.

L (Last): This flag tells Apache to stop processing further rules if this one is applied. It prevents additional rules from being executed after a successful rewrite.

Now, when you access https://xyz.com/robots.txt, the dispatcher should pass the request through to the DAM path internally (it is a rewrite, not a redirect, so the URL in the browser does not change) and render the robots.txt.
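
A quick way to verify this behavior (assuming xyz.com is your publish domain) is to request only the response headers and confirm you get a 200 status directly, with no Location header, since this is an internal rewrite rather than a redirect:

# Expect a 200 status and no Location header in the response
curl -I https://xyz.com/robots.txt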

There is another issue that can occur when you try to access the URL: your robots.txt might get downloaded instead of being displayed in the browser. This happens because DAM assets such as txt files are typically served with a Content-Disposition header set to attachment. To resolve this, you need to adjust the Content-Disposition header in your dispatcher configuration (or on your web server) so that the file is served inline rather than as an attachment.

To resolve this issue, you can add a rule to the vhost file located at dispatcher/src/conf.d/available_vhosts, as shown below:

<LocationMatch "^/content/.*/robots\.txt$">
   Header unset "Content-Disposition"
   Header set Content-Disposition inline
</LocationMatch>

This rule, when added to the vhost file, will unset the existing Content-Disposition header and set it to inline, ensuring that the robots.txt file is displayed in the browser rather than being downloaded.
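
You can confirm the change with a quick header check (again assuming xyz.com is your publish domain):

# The Content-Disposition header should now come back as inline
curl -sI https://xyz.com/robots.txt | grep -i content-disposition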

To summarize this option: adding a robots.txt file to AEM as a Cloud Service can be a straightforward process when you follow the steps above. By uploading the file to the DAM and configuring dispatcher rules, you ensure that your robots.txt is accessible from the desired URLs.

Option 2: Maintaining robots.txt as a Page

You can choose to maintain the robots.txt file like a regular page. For detailed instructions on how to implement this approach, you can refer to Albin Issac’s blog post on https://www.albinsblog.com/2020/04/how-to-configure-robotxtxt-file-in-aem.html. This blog provides clear and comprehensive steps to guide you through the process.
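
As a rough sketch of how that approach ties into the dispatcher (not necessarily the exact steps from the linked post), the content is maintained by authors as a page, and a rewrite rule similar to the one in Option 1 points /robots.txt at that page; the page path below is purely hypothetical:

# Hypothetical page path maintained by authors; adjust to your project
RewriteRule ^/robots\.txt$ /content/xyzproject/robots.txt [PT,L]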

Thank you for reading, and best of luck with your AEM project! If you have any questions or need further assistance, please don’t hesitate to reach out.
