Analyzing Server Logs for SEO

Benjamin Burkholder
Jun 30, 2018 · 10 min read

This article will teach you how to leverage your site’s server logs as an SEO tool to get a better sense of how search engine bots interact with your site. For this article, we’re only going to be concerning ourselves with Google’s Googlebot and Googlebot Smartphone bots. So from this point on, whenever I refer to bots, I’m referring to Google in particular.

Also important to note, server logs generally contain other interaction data as well but we’re only concerning ourselves with search engine bot traffic here.

What are Server Log Files?

In a nutshell, server log files are the records automatically generated by a server that list out each interaction a search engine bot has with the pages/resources housed on it. Put simply, each time a bot requests a page housed on the server, a log item is created displaying a variety of data points such as the ones below (a short parsing sketch follows the list):

*Warning: Jargon Zone*

  • What page or resource was requested?
  • What type of resource was requested? (HTML, JavaScript etc.)
  • What was the method used? (GET or POST)
  • When did this request occur?
  • What was the response code of the request? (200, 301, 404, 500 etc.)
  • And the data point that’s most important for us…*drumroll*… which search engine bot made the request and how many were made!! (Googlebot, Googlebot Smartphone, Bingbot, Baidu etc.)
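If you’ve never looked at a raw log line, here’s roughly what one looks like and how those data points can be pulled out of it. This is a minimal Python sketch assuming an Apache/Nginx “combined” log format; your server’s format (and the sample line itself) may differ, so treat both as placeholders to adapt.

```python
import re

# Pull the fields discussed above out of one "combined" format log line.
# Adjust the pattern if your server logs a different format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<user_agent>[^"]*)"'
)

# A made-up sample line for illustration only.
sample = ('66.249.75.206 - - [30/Jun/2018:10:12:34 +0000] '
          '"GET /blog/some-page HTTP/1.1" 200 5120 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

match = LOG_PATTERN.match(sample)
if match:
    entry = match.groupdict()
    # Which resource, which method, when, and what response code.
    print(entry['path'], entry['method'], entry['time'], entry['status'])
    # And the important part: which bot made the request.
    print('Googlebot request?', 'Googlebot' in entry['user_agent'])
```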

Excited yet? No? Well don’t worry, we’re getting to the good stuff now.

Why does Bot Traffic Matter?

Being able to determine, at a high-level glance, which pages are being requested by bots over a period of time gives us the ability to identify potential “problem” pages.

What’s a “problem” page?

I define these as pages which receive little to no bot requests over a particular time span, be it one month or three. For the sake of measurement, let’s say any page with ten or fewer bot requests over the course of a month is a potential “problem” page. Since this is a sliding scale, feel free to start at this number and adjust the threshold upward depending on the size of your website (e.g., fewer than 10, fewer than 30, fewer than 50, etc.).
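To make that concrete, here is a minimal sketch of the threshold idea: count Googlebot requests per page over the time span you’re auditing and flag anything at or below the cutoff. It assumes a list of parsed log entries like the one produced by the parsing sketch earlier; the helper name and threshold are just illustrative.

```python
from collections import Counter

THRESHOLD = 10  # slide this upward (10, 30, 50, ...) for larger sites

def flag_problem_pages(parsed_entries, threshold=THRESHOLD):
    # Count Googlebot requests per URL across the parsed log entries.
    hits = Counter(
        entry['path'] for entry in parsed_entries
        if 'Googlebot' in entry.get('user_agent', '')
    )
    low_traffic = [(path, count) for path, count in hits.items() if count <= threshold]
    # Note: pages with zero bot requests never appear in the logs at all,
    # so cross-check this list against a full crawl or sitemap export.
    return sorted(low_traffic, key=lambda pair: pair[1])
```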

What are some Reasons for Low Bot Traffic?

Here are a few possible scenarios which can contribute to low bot traffic to a page:

  • Pages living deep within website navigation, resulting in bots taking longer to reach them on average.
  • Content is thin or duplicative across multiple pages, resulting in bots placing less urgency or importance on requesting your pages.
  • Pages are not included in XML sitemap, resulting in bots relying solely on natural crawl progression to find them.
  • Lack of internal/external links pointing to pages, resulting in far fewer avenues for bots to find them.
  • Technical issues, such as an improperly formatted robots.txt file that blocks bots from crawling certain areas on a site (see the quick check after this list).
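For that last robots.txt scenario, one quick sanity check is to ask Python’s built-in parser whether Googlebot is allowed to fetch a given URL. The domain and path below are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Rule the robots.txt scenario in or out: does robots.txt allow Googlebot
# to fetch this page at all?
rp = RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')
rp.read()

print(rp.can_fetch('Googlebot', 'https://www.example.com/some/deep/page/'))
```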

What is the Impact of Low Bot Traffic?

Think of it this way, if pages are seldom being requested by bots then it could indicate:

  • The pages aren’t being picked up and indexed at all.
  • If page content is updated on indexed pages, the refreshed content could take significantly longer to be picked up and re-indexed, meaning visitors to your site could be seeing outdated content.

With that being said, it’s important to note that a page having low bot traffic is NOT always a cause for concern. This is merely a barometer to help identify potential problem pages; it’s not a guarantee that anything is actually wrong. It’s up to you to determine whether it will impact your site or not.

How Does Site Size Affect Bot Crawling?

Now that we’ve covered some of the possible reasons for low bot traffic to a page and the potential impact, it’s time to quickly talk about how website size frames all of this.

Before beginning to analyze the log files for your site, it’s important to understand where your site falls in terms of size. The reason? Knowing this sets your expectations and helps you recognize when something doesn’t seem quite right during the audit.

Using the rough breakdown below as a very high-level baseline, you can place your website into a category based on the number of pages it contains. Remember to think of this strictly as a high-level categorization. Many website sizes are relative to the verticals they operate in, so keep this in mind.

Now that you have a general idea of what the size of your site is, you can start setting expectations of what to potentially look for when digging into your log files.

What Should I Expect Depending on my Website’s Size?

  • Large — With a large website containing 1000+ pages, bots could have an issue consistently reaching deep pages via natural crawl progression. These are probably the pages you’ll be auditing with the log files to determine importance and prioritization.
  • Medium & Small — With both medium and small websites, containing roughly 500 pages or fewer, bots should be able to crawl most of the pages with little issue. In this scenario, you’ll probably be auditing the log files to find pages that logically should be crawled more often but for some reason are not.

Now, I’ll walk you through the process of actually auditing and identifying these potential “problem” pages.

Getting Started

Before you can start auditing the bot traffic on your website, you will first need two things:

1. Access to your website’s server logs.

a. Start by asking your IT team, they should be able to either provide access or point you in the right direction.

  • If your company doesn’t have direct access to, or doesn’t manage, the servers your domain is hosted on, it could prove difficult to gain access to your server log files. The best advice in this scenario is to ask the hosting platform directly whether they can provide access to just your files, or suggest an alternative solution.

2. A tool to parse the server logs and provide the meaningful data at a high level.

a. I’ll be using Screaming Frog’s Log Analyzer tool for this particular walk-through as it was built with SEO in mind; however, there are other suitable log analyzer tools available as well.

Starting the Audit

Once you’ve gained access to your site’s log files and have a tool to parse them, now you’re ready to start auditing for potential “problem” pages.

In terms of best practices, you’ll want to audit at least one month’s worth of data to paint an accurate picture of what bot activity looks like on your site. How much data you analyze really depends on the size of your site: the more pages you have, the more “events” there will be, which can really slow down the import.
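If the import is crawling along, one option is to pre-filter the raw logs down to just the Googlebot lines before feeding them to the analyzer. Here’s a minimal sketch; the file names are placeholders, and you’d need to adapt it for rotated or gzipped logs.

```python
# Strip a large raw log down to just Googlebot lines so the number of
# "events" (and therefore the import time) stays manageable.
SIGNATURE = 'Googlebot'  # matches both Googlebot and Googlebot Smartphone user agents

with open('access.log', encoding='utf-8', errors='replace') as source, \
        open('access_googlebot.log', 'w', encoding='utf-8') as target:
    for line in source:
        if SIGNATURE in line:
            target.write(line)
```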

Verifying Bots

Upon starting the import process, a box will appear asking you whether you’d like to “verify” the bots upon import. This process essentially involves performing a reverse DNS lookup on each bot’s IP address that made a request to the server. The reasoning behind this is to expose bots that are “spoofing” a search engine bot, which is to say that they are mimicking one and are not who they say they are.

As you can imagine…it dramatically slows down the import process if you wish to verify. In some instances we’re talking many, many hours depending on your computer specs and the amount of data being imported.

Obviously it’s not required to verify the bots when importing, but if you don’t then your data could be muddied by potentially fake bot traffic.

If only there were an easier way…oh wait, there is!

This is where the command prompt (filename: cmd.exe) comes to the rescue. It allows you to perform manual reverse DNS lookups on IP addresses and returns the domain each request originated from. I need to note that since I’m using Windows, my experience will look different from what you’ll see on another operating system. You’ll need to determine how to reach the command prompt equivalent for your OS.

You can reach the command prompt on Windows by:

1. Opening the “Start” menu on your desktop.

2. Inputting “run”.

3. With the box open, input “cmd” and click ok.

4. This should bring up the cmd.exe file window.

Don’t panic! This may look intimidating but what we’re using it for is really quite simple. We will be leveraging the “nslookup” command for this task.

Next we need to compile the list of IP addresses we wish to verify. You can do this directly within the log analyzer in the “IPs” tab, or export the data as well. In the upper right you can filter the bots to include only the ones which the log analyzer “thinks” are who they say they are. As stated before, we’re only concerned with Googlebot and Googlebot Smartphone right now.

Once you filter by one of the bots you’ll see a list of IP addresses in the far left column; make sure to sort the list by “Num Events” from greatest to fewest. This will allow us to prioritize the most important IPs to verify. In this example we’ll filter by Googlebot.

As you can see, the majority of Googlebot requests are being made by a relatively small set of IPs. These are the ones we’ll be verifying. The idea here is that if we verify the IPs performing the most requests then, directionally, we can make the assumption that they’re accounting for the bulk of Googlebot traffic to the site.
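If you’d rather build that prioritized list outside the tool, a quick tally of requests per IP over the Googlebot-filtered log does the same job. This sketch assumes combined-format lines (client IP first) and the placeholder filtered file from the earlier sketch.

```python
from collections import Counter

# Tally requests per IP, heaviest hitters first, so verification effort
# goes to the IPs driving the bulk of the Googlebot traffic.
ip_counts = Counter()
with open('access_googlebot.log', encoding='utf-8', errors='replace') as log:
    for line in log:
        ip_counts[line.split(' ', 1)[0]] += 1  # combined-format lines start with the client IP

for ip, count in ip_counts.most_common(10):
    print(f'{ip}\t{count}')
```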

Now for the fun part, time to verify some IPs!

Bring your command prompt window back up, type “nslookup [IP address]”, and press Enter.

Voila! You can see that the IP address you input returns the origin domain (googlebot.com) and the IP associated with it (66.249.75.206).

If the domain from which the Googlebot originated is either googlebot.com or google.com, then it’s a pretty good indicator that the bots coming from that IP are legitimate.

Google has provided additional context on verifying Googlebot via reverse DNS lookup; it’s definitely worth a read.
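Google’s documented check is actually two steps: a reverse DNS lookup on the IP, then a forward lookup on the returned host name to make sure it resolves back to the same IP. Here’s a small Python sketch of that round trip, using the example IP from above; the function name is just illustrative.

```python
import socket

def is_real_googlebot(ip):
    try:
        host, _, _ = socket.gethostbyaddr(ip)            # reverse lookup
    except OSError:
        return False
    # The host name should belong to googlebot.com or google.com.
    if not host.endswith(('.googlebot.com', '.google.com')):
        return False
    try:
        _, _, addresses = socket.gethostbyname_ex(host)  # forward lookup
    except OSError:
        return False
    return ip in addresses                               # must map back to the same IP

print(is_real_googlebot('66.249.75.206'))  # the article's example IP
```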

After Import is Complete

Once you’ve imported the amount of log file data you wish to analyze, one of the views I like best is the “Directories” tab. This lets you view the data in more of a natural folder/subfolder format, which mimics the structure of your site to a degree. Note the two Googlebot columns; these are the ones we will be focusing on later.

Next we’ll be exporting the data so we can better filter and manipulate it. You’ll find the export button to the left, right above the “Row” column.

Auditing the Data

Once the data has been exported into an Excel spreadsheet, you can filter the data by Googlebot and Googlebot Smartphone. Sorting the pages from the greatest number of requests down to the least will isolate the potential “problem” pages at the bottom of the list.

In this example, a plethora of pages on the site had only garnered one Googlebot request over the course of a whole month! This could potentially be an issue, though we’ll need to investigate further.
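If filtering in Excel gets unwieldy, the same isolation step can be done in pandas. The file name and column names below (“URL”, “Googlebot Num Events”, “Googlebot Smartphone Num Events”) are assumptions about what the export contains, so match them to the actual headers in your spreadsheet.

```python
import pandas as pd

THRESHOLD = 10  # same sliding-scale cutoff discussed earlier

# Load the exported spreadsheet (use pd.read_csv for a CSV export instead).
df = pd.read_excel('log_file_export.xlsx')

# Sum the two Googlebot columns and keep only low-request pages, fewest first.
bot_columns = ['Googlebot Num Events', 'Googlebot Smartphone Num Events']
df['total_bot_events'] = df[bot_columns].sum(axis=1)

problem_pages = (
    df[df['total_bot_events'] <= THRESHOLD]
    .sort_values('total_bot_events')
    [['URL'] + bot_columns + ['total_bot_events']]
)
print(problem_pages.head(20))
```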

Once you’ve compiled the list of pages you want to analyze, it’s time to revisit the earlier section about possible causes for low bot traffic. These different scenarios can essentially act as a checklist as you work through your list of potential “problem” pages:

  • Pages living deep within website navigation, resulting in bots taking longer to reach them on average.
  • Content is thin or duplicative across multiple pages, resulting in bots placing less urgency or importance on requesting your pages.
  • Pages are not included in XML sitemap, resulting in bots relying solely on natural crawl progression to find them.
  • Lack of internal/external links pointing to pages, resulting in far fewer avenues for bots to find them.
  • Technical issues, such as an improperly formatted robots.txt file that blocks bots from crawling certain areas on a site.

I would suggest ordering your list of pages from highest to lowest priority; this way you can spend more of your time optimizing the pages that will have the most impact for your site.

If/when you do find pages which require manual action, your next task is to make the necessary changes and then monitor the logs going forward to verify it was successful.

Conclusion

Server log files provide an unparalleled view into how search engine bots interact with a particular site, giving you the information needed to better understand why bots may be ignoring certain pages on your site and how best to optimize them. Armed with the knowledge and tips outlined in this article, you should feel confident going forth, identifying those potential “problem” pages, and leveraging bot insights to better guide your SEO strategy and tactics going forward.

Interested in learning about more ways server logs can be used to inform site wide SEO strategy? Check out an article published by Screaming Frog which outlines 22 ways to analyze server log files.

Written by Benjamin Burkholder
Digital Integration Specialist | Python Developer in Cleveland, OH.
