Technical SEO: an experiment to optimize Crawl Budget in big ecommerce sites

Adriano Rodrigues
b2w engineering -en
Apr 8, 2020 · 7 min read

Every day, search engine robots crawl the web searching for new websites and discovering new content, collecting all of this information into a huge index, or “library”.

To access all the content on the web, these robots “click” on every link they find, and in this way, they move around the web.

The search for the best and most up-to-date content is a continuous job that requires a lot of resources.

New information appears on the internet all the time, and robots need to revisit sites frequently to ensure that content is always up to date. This way, the index will reflect the best response to the user’s search.

To understand how search engines decide on how and when these robots visit your site, it is necessary to talk about “Crawl Budget”.

1. What is Crawl Budget?

Crawl Budget is the number of requests a search engine makes to a domain in a set period. Most sites don’t need to worry about Crawl Budget, since it will not be an issue unless the website has an enormous number of pages.

As Gary Illyes explained in the Google Webmaster Central Blog:

“If a site has fewer than a few thousand URLs, most of the time it will be crawled efficiently.”

On websites with millions of URLs, crawling can be a challenge. In this case, controlling when and how many accesses the robots are allowed becomes very relevant, as excessive access can affect server performance or hurt the user experience.

Gary Illyes, in his article for the official Google blog, states that the Crawl Budget is defined by two concepts: Crawl rate limit and Crawl demand.

Crawl rate limit refers to the number of parallel connections that robots make in order to crawl your site and the waiting time between calls. This limit may vary depending on the site's health or performance. For example, if a site responds quickly, this limit may increase automatically.

However, if a site is slow or has errors, the crawl rate may be limited. One way to modify the crawl rate limit is by configuring the crawl rate in Google Search Console.

However, increasing the crawl rate limit in Google Search Console does not guarantee that the Crawl Budget will increase.

Crawl demand is a measure of the importance of your site to Google and comprises two factors: popularity and recency. A popular URL tends to be crawled more often, to ensure that the content is up to date. Changes to site content may also cause an increase in crawl demand.

Google’s definition:

“Taking crawl rate and crawl demand together we define crawl budget as the number of URLs Googlebot can and wants to crawl.”

2. Crawl Budget Optimization

Several factors can negatively affect the robots’ behavior, including slow loading speed, 4xx errors, duplicate content, and excessive use of parameters and filters.

When filters or parameters are added on very large websites, there is a possibility of generating duplicate content, on product pages for example.

With ecommerce websites, product offerings can run into the millions of items, generating an almost infinite number of URLs due to the various combinations of filters and parameters.

It is also important to understand which pages the robot is visiting and how many times.

For example, if a website has 1,000 pages and the robot finds 1,000 URLs every day, it could be that the crawl budget is perfectly optimized, or it could mean that Google is hitting the same page 1,000 times.

Another important factor is the server capacity. There is no point in receiving thousands of visits from the robot each day if this hinders the site’s UX.

One of the ways to control and optimize the Crawl Budget is to create a robots.txt file with guidelines for crawling the website. Robots.txt files are created at the root domain level, for example, www.yourdomain.com.br/robots.txt.

The robots.txt file defines the instructions that robots follow when navigating the website, such as which pages they should access. A well-structured robots.txt file is crucial for helping robots find the right content.

Below is an example of a snippet from Google’s own robots.txt, which can be found at https://www.google.com/robots.txt. In these rules, Disallow blocks a path prefix, Allow creates exceptions to a broader Disallow, * matches any sequence of characters, and $ marks the end of the URL:

User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&gws_rd=ssl$
Disallow: /?hl=*&*&gws_rd=ssl

3. Case: Black Friday in B2W

As stated, “demand” is a relevant aspect of Crawl Budget. If Google understands that a website is important for users, it will increase the frequency of the robots’ visits.

On important dates such as Black Friday, search engines aim to provide all the best and latest content on the web, and to achieve this the crawl demand increases.

Black Friday is one of the most important shopping events in the world, featuring some of the biggest promotions of the shopping year. Many customers wait for this date to buy higher value items or to start Christmas shopping.

During this period, B2W’s e-commerce websites (Americanas.com, Shoptime, Submarino and SouBarato) have an increase in the number of visitors and, simultaneously, web crawlers also want to access the pages more frequently in order to keep content updated.

Analyzing the log files from Black Friday 2018, we noticed a peak in the number of accesses from Googlebot, Google’s robot that collects site information.

For 2019, the goal was to reduce the volume of crawler requests in the days preceding and over the Black Friday weekend. This experiment aimed to preserve website performance and provide an optimal shopping experience for visitors.

3.1. The experiment

Before Black Friday, a restriction was applied in the robots.txt file so that certain pages could not be found by search engines. However, the policy was not very restrictive and still allowed a large number of pages with filters to be found.

In 2019 there was an increase in bot accesses before Black Friday, starting on November 23rd. Between November 22nd and 25th, there was an 840% increase in the number of Googlebot accesses.

To reduce the growing number of non-human accesses and to preserve the user experience on the websites, the plan was to restrict crawler access via robots.txt.

In parallel, the crawl rate was reduced in Google Search Console. If the bot accesses were successfully contained, the robots.txt rules would then be modified again, in order to gradually give the crawlers more freedom to navigate.

In the first stage of the experiment, only the top-level pages of each domain were allowed and access to any pages with a filter was blocked.
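As an illustration of what such a first-stage policy could look like (a sketch only, assuming a hypothetical /categoria/ path; these are not the rules that were actually deployed):

User-agent: *
# Block everything by default (illustrative sketch, not the actual B2W rules)
Disallow: /
# Re-allow the homepage and the top-level category pages (hypothetical path)
Allow: /$
Allow: /categoria/
# Keep parameterized (filtered) category URLs blocked
Disallow: /categoria/*?

Google resolves conflicts between Allow and Disallow by applying the most specific (longest) matching rule, which is why the final Disallow overrides the Allow for filtered URLs.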

The rule was applied on Wednesday, November 27th, at 18:50 (UTC-3) and within 30 minutes we saw a drastic reduction in the number of crawler accesses.

About an hour after the change, the total volume of hourly accesses had been reduced by 95%, dropping bot accesses from 400k req/hour to around 20k req/hour. The number of accesses continued to fall and, on Thursday the 28th, returned to the normal level experienced before the peak.

The image below shows the number of accesses to an API during the experiment period. After the new robots.txt policy was implemented, a large reduction in the number of robots’ accesses can be seen.

Crawlers’ accesses to americanas.com.br a day before Black Friday

B2W’s tech ecosystem has around one thousand APIs, so controlling this throughput is crucial to preserve the user experience during high-traffic moments. The goal was to reduce the load while keeping the on-prem infrastructure able to process crawler requests.

The image below shows the throughput of an API that is directly connected to crawlers’ accesses during the experiment.

Throughput of an API directly connected to crawlers’ accesses

After Black Friday, on December 2nd at 15:30, a new rule was applied in the robots.txt file to allow crawlers to navigate all domains, including specific combinations of filters.
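A sketch of how specific filter combinations could be re-allowed while keeping the bulk of parameterized URLs blocked (again illustrative only, with a hypothetical “marca” parameter; not the rules actually used):

User-agent: *
# Parameterized URLs stay blocked by default (illustrative sketch)
Disallow: /*?
# Re-allow one specific, high-value filter (hypothetical parameter name);
# this longer rule takes precedence over the shorter Disallow above
Allow: /*?marca=
# Keep URLs that combine this filter with additional parameters blocked
Disallow: /*?marca=*&

This mirrors the Allow/Disallow pattern seen in Google’s own robots.txt snippet earlier in the article.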

In the days that followed, there was an increase in the number of requests, but it stayed close to the normal level.

3.2. Possible side effects

An important point was to assess and understand other aspects and KPIs that may have suffered side effects.

To understand other variables when changing the robots.txt file, the following metrics were evaluated: number of indexed pages, weighted average position, number of pages and number of keywords in the top 10 positions.

None of these metrics showed significant variation or indicated any turbulence during the period of the experiment, validating the effectiveness of the change in the robots.txt file.

Final considerations

For any e-commerce website, it is essential to experiment in order to evolve and guarantee the best user experience and business growth.

On large sites, with hundreds of millions of pages, it is necessary to closely watch the volume of accesses received, to ensure the smooth functioning of the servers.

This experiment brought positive results and it was possible to limit robot access, thus ensuring the best user experience during the Black Friday event!

I hope you enjoyed this article and that it can help you and your business. I would love to hear your feedback; feel free to send me a message through LinkedIn!

I would like to thank Richard Fenning for co-authoring the translation of this article; Pedro Gil Alcantara for his partnership in creating the robots.txt experiment described above; and my colleagues from the SEO Team here at B2W for reading and reviewing this article, especially Tiago Andrade for his support and suggestions. Thank you!!

For anyone who is interested in this topic, wants to continue the discussion, or simply wants to contribute to the community, we have two LinkedIn groups: SEO in RJ (Rio de Janeiro) and SEO in SP (São Paulo).

If you are looking for a development opportunity, working with innovation in a high-impact business, visit the B2W Careers portal! There you can find all available positions. Come join our team!

Adriano Rodrigues is an SEO and Data Analyst at B2W, with an MSc in Mechanical Engineering, passionate about using technology to leverage business performance.