Automating Screaming Frog reports with Python and DataStudio

Screaming Frog is an amazing SEO tool for auditing almost any website. I have been using it for years, and it has saved me a lot of time when auditing projects. At some point, I started needing recurring SEO audits for some of my projects. Using the tool's "Scheduling" feature and a bit of Python code, I set up a solution that provided me and my team with up-to-date information about the status of the site.

The logic behind the solution

The idea behind the project is pretty simple. Using the scheduling features in Screaming Frog, we could set up a crawl to run overnight. After the crawl, the files we needed would be saved in a folder and picked up by a custom Python script, which we would use to modify the files and upload them to our database. From there, it was just DataStudio doing its magic, with a dashboard that included any information we needed.

Screaming Frog Scheduling Options

The first step towards implementing this solution is properly scheduling the crawl you want to perform. To do this, open Screaming Frog and go to File -> Scheduling. Click "+Add" to add a new scheduled crawl. In the "Start Options" tab, you can select the crawler mode, the crawl seed and the crawl configuration. (You can save a specific configuration for each project, depending on what you want to implement.)

In this tab you’ll also select which APIs you will use during the crawl.

In the export tab, you’ll select which tabs of the report you want to save. For this script, I only saved “Internal:All” and “All Inlinks” (bulk export).

Also, if you want to automate the report, I would recommend choosing "Overwrite files in output", which keeps the script simpler to write.

The Python Script to Automate Screaming Frog reporting

At this point, Screaming Frog will crawl your site and save the resulting exports to a folder. From that folder, the files will be imported using a Python script. In its simplest form, we could use just the following Python script to upload the raw data into the database, without even modifying the dataframes.
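As a rough sketch (the file names, the export folder and the connection string below are placeholders rather than values from the original script), pandas and SQLAlchemy are enough to push both exports straight into a database:

```python
import pandas as pd
from sqlalchemy import create_engine

# Folder where the scheduled crawl writes its exports (placeholder path)
EXPORT_FOLDER = "C:/screaming-frog/exports/"

# Connection string for the target database (replace with your own credentials)
engine = create_engine("mysql+pymysql://user:password@host:3306/seo_reports")

# Read the two exports produced by the scheduled crawl
internal_all = pd.read_csv(EXPORT_FOLDER + "internal_all.csv")
all_inlinks = pd.read_csv(EXPORT_FOLDER + "all_inlinks.csv")

# Upload the raw tables, overwriting the previous crawl's data
internal_all.to_sql("internal_all", engine, if_exists="replace", index=False)
all_inlinks.to_sql("all_inlinks", engine, if_exists="replace", index=False)
```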

With this script, it would be enough to upload the tables into the database. Nevertheless, we would be using space inefficiently, since many columns would be the same in both dataframes. The following functions may therefore help you shape the data before uploading it.

Dropping Columns of the Page Report you may not want

Passing a list of columns to drop into this function will help you avoid duplicate columns across your database tables.
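The original function is not reproduced here, but a minimal sketch could look like the one below (the column names in the usage comment are only examples):

```python
def drop_columns(df, columns_to_drop):
    """Drop the listed columns, ignoring any that are missing from the export."""
    return df.drop(columns=[col for col in columns_to_drop if col in df.columns])

# Usage: remove columns from the page report that the inlinks table already holds
# internal_all = drop_columns(internal_all, ["Meta Keywords 1", "Size (bytes)"])
```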

Modify the headings of each Dataframe in Python

Modify the column headers of each dataframe so they are database-friendly.
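As a sketch, one way to do this is to lower-case every header and replace anything that is not alphanumeric with an underscore:

```python
import re

def clean_headers(df):
    """Lower-case headers and replace spaces and special characters with underscores."""
    df = df.copy()
    df.columns = [re.sub(r"[^0-9a-z]+", "_", col.lower()).strip("_") for col in df.columns]
    return df

# Usage: "Status Code" becomes "status_code", "Indexability Status" becomes "indexability_status"
# internal_all = clean_headers(internal_all)
```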

Defining clusters and classifying data from the Page Report

Sometimes you may want to classify data and assign values to certain URLs. At first I tried to do so using CASE statements within DataStudio, which is a perfectly valid solution if your dataset is not very large. But when playing with big datasets, it is more efficient to handle this classification within the script. Functions like the following can help you create clusters based on different data:
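As an illustration only (the URL patterns, cluster names and the "address" column are assumptions about your data, not the original rules), such a function could look like this:

```python
def classify_url(url):
    """Assign a cluster label based on simple URL patterns (hypothetical rules)."""
    if "/blog/" in url:
        return "Blog"
    if "/product/" in url:
        return "Product"
    if url.rstrip("/").count("/") <= 2:  # e.g. https://www.example.com/
        return "Home"
    return "Other"

# Usage: add a "cluster" column to the page report before uploading it
# internal_all["cluster"] = internal_all["address"].apply(classify_url)
```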

Running the Python script in a Remote Desktop

To run this script, I took inspiration from this tutorial by IPullRank on how to run Screaming Frog on AWS. Applying the same logic to my script, I launched and configured an instance. To properly install and configure Screaming Frog, I recommend using a remote desktop application. Once you've installed Screaming Frog, you'll need to set up the proper environment for Python to run: just install all the packages you use in your script, and download the script to your desktop.

After that, you can create a text file with the following commands, and save it in .bat format:
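A minimal version of that file could be as short as the lines below; both paths are placeholders for your own Python installation and script location:

```bat
@echo off
REM Placeholder paths: point them at your Python interpreter and your upload script
"C:\Python39\python.exe" "C:\Users\seo\scripts\upload_crawl.py"
```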

Using Task Scheduler, you'll be able to set up your computer to run the script anytime you want.

Just select "Create a task", and under the "Actions" tab select "Start a program" and the name of your .bat file.

Plotting the results in DataStudio

To plot the results in DataStudio, I would recommend using Cloud SQL. It's fast and works perfectly fine with DataStudio, unless you are playing with a large amount of data. If that's the case, I would recommend switching from DataStudio to Tableau, and maybe using another database.

