Screaming Frog is an amazing SEO tool for auditing almost any website. I have been using it for years, and it has saved me a lot of time when auditing projects. At some point, I started needing recurring SEO audits for some of my projects. Using the tool's “Scheduling” feature and a bit of Python code, I set up a solution that provided my team and me with up-to-date information about the status of the site.
The logic behind the solution
The idea behind the project is pretty simple. Using the scheduling features in Screaming Frog, we could set up a crawl to run overnight. After the crawl, the files we needed would be saved in a folder and picked up afterwards by a custom Python script, which we used to modify and upload the files to our database. From there, it was just DataStudio doing its magic, with a dashboard that included any information we needed.
Screaming Frog Scheduling Options
The first step towards implementing this solution is properly scheduling the crawl you want to perform. To do this, open Screaming Frog and go to File -> Scheduling. Click on “+Add” to add a new scheduled crawl. In the “Start Options” tab, you can select the crawler mode, the crawl seed and the crawl configuration. (You can save a specific configuration for each project, depending on what you want to implement.)
In this tab you’ll also select which APIs you will use during the crawl.
In the “Export” tab, you'll select which tabs of the report you want to save. For this script, I only saved “Internal:All” and “All Inlinks” (bulk export).
Also, if you want to automate the report, I would recommend choosing “Overwrite files in output”, which keeps the script simpler.
The Python Script to Automate Screaming Frog reporting
At this point, Screaming Frog will crawl your site and save the resulting reports in a folder. From that folder, the files will be imported using a Python script. In its simplest form, the script could just upload the raw data into the database, without even modifying the dataframes:
# -- LIBRARIES IMPORT --
import pandas as pd
from sqlalchemy import create_engine

# Database Access (build the connection string from your own credentials)
engine = connector + user + password + ip + db
# Database Tables to be created, and the write method for each
tables = ["pages", "inlinks"]
method = ["replace", "replace"]
# DataFrames from the CSV Files exported by Screaming Frog
PR = pd.read_csv("internal_all.csv")
IL = pd.read_csv("all_inlinks.csv")
# Dataframes List for DB
Dataframes = [PR, IL]

def toDB(dataframes, tables, method, engine):
    sqlEngine = create_engine(engine, pool_recycle=3600)
    for df, assignedtable, methodus in zip(dataframes, tables, method):
        try:
            df.to_sql(assignedtable, sqlEngine, if_exists=methodus, index=False)
        except ValueError as vx:
            print(vx)
        except Exception as ex:
            print(ex)
        else:
            print("Table", assignedtable, "created successfully.")

toDB(Dataframes, tables, method, engine)
With this script, it would be enough to upload the tables into the database. Nevertheless, we would be using space inefficiently, since many columns would be duplicated across both dataframes. Therefore, the following functions may help you shape the data before uploading it.
Dropping Columns of the Page Report you may not want
Passing a list of columns to drop into a function like this one will help you avoid duplicated columns across your database tables.
def dropCols(df, toDrop):
    for col in toDrop:
        if col in df.columns:
            df.drop(col, axis=1, inplace=True)
        else:
            print('Could not drop', col)
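As a self-contained illustration of the idea (the frame and column names here are hypothetical stand-ins for the inlinks export):

```python
import pandas as pd

# Hypothetical frame standing in for the "All Inlinks" export
IL = pd.DataFrame({"Source": ["/a"], "Destination": ["/b"], "Title 1": ["Home"]})

# Columns already present in the page report, so we drop them here
toDrop = ["Title 1", "Meta Description 1"]
for col in toDrop:
    if col in IL.columns:
        IL = IL.drop(columns=col)
    else:
        print("Could not drop", col)

print(list(IL.columns))  # → ['Source', 'Destination']
```

Any column in the list that is not present in the frame is simply reported and skipped, so the same list can be reused across exports.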
Modify the headings of each Dataframe in Python
Modify the headings of each dataframe so they are “database-friendly”.
def cleanHeadings(DataFrames):
    for df in DataFrames:
        # One way to normalize: trim, lower-case, replace spaces
        df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
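A quick self-contained check of what this kind of renaming produces (the column names are hypothetical examples of Screaming Frog headings):

```python
import pandas as pd

# Hypothetical headings as Screaming Frog exports them
df = pd.DataFrame(columns=["Meta Description 1", "Status Code"])

# Make them database-friendly: trim, lower-case, replace spaces
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
print(list(df.columns))  # → ['meta_description_1', 'status_code']
```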
Defining clusters and classifying data from the Page Report
Sometimes you may want to classify data and assign values to certain URLs. At first I tried to do so using CASE statements within DataStudio, which is a perfectly valid solution if your dataset is not very heavy. But when playing with big datasets, it is more efficient to do this classification within the script. Functions like the following can help you create clusters based on different data:
def trafficClusters(row):
    try:
        if row['GA Users'] == 0 and row['Status'] == 'OK':
            val = "URLs with No Users"
        elif row['GA Users'] > 0 and row['GA Users'] <= 10:
            val = "Between 1 and 10 Users"
        elif row['GA Users'] > 10 and row['GA Users'] <= 30:
            val = "Between 11 and 30 Users"
        elif row['GA Users'] > 30 and row['GA Users'] <= 60:
            val = "Between 31 and 60 Users"
        elif row['GA Users'] > 60 and row['GA Users'] <= 100:
            val = "Between 61 and 100 Users"
        elif row['GA Users'] > 100 and row['GA Users'] <= 150:
            val = "Between 101 and 150 Users"
        elif row['GA Users'] > 150 and row['GA Users'] <= 250:
            val = "Between 151 and 250 Users"
        elif row['GA Users'] > 250 and row['GA Users'] <= 350:
            val = "Between 251 and 350 Users"
        elif row['GA Users'] > 350 and row['Status'] == 'OK':
            val = "More than 350 Users"
        else:
            val = "URLs with No Users"
        return val
    except Exception:
        print("Traffic Clusters Error")
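To show how such a function is wired into the dataframe, here is a condensed, self-contained sketch (the rows, the `clusterExample` helper and its reduced set of buckets are illustrative, not the full cluster logic):

```python
import pandas as pd

# Hypothetical rows standing in for the page report dataframe
PR = pd.DataFrame({"GA Users": [0, 5, 500], "Status": ["OK", "OK", "OK"]})

# Condensed version of the cluster logic, just to show the mechanics
def clusterExample(row):
    if row["GA Users"] == 0:
        return "URLs with No Users"
    elif row["GA Users"] <= 10:
        return "Between 1 and 10 Users"
    elif row["GA Users"] > 350:
        return "More than 350 Users"
    return "Other"

# Apply the function row by row and store the label in a new column
PR["Traffic Cluster"] = PR.apply(clusterExample, axis=1)
print(PR["Traffic Cluster"].tolist())
# → ['URLs with No Users', 'Between 1 and 10 Users', 'More than 350 Users']
```

The new column then lands in the database like any other, so the clusters are ready-made dimensions in the dashboard.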
Running the Python script in a Remote Desktop
In order to run this script, I got inspired by this tutorial by IPullRank on how to run Screaming Frog on AWS. Applying the same logic to my script, I launched and configured an instance. In order to properly install and configure Screaming Frog, I recommend you use a remote desktop application. Once you've installed Screaming Frog, you'll need to set up the proper environment for Python to run. To do so, just install all the packages you use in your script, and download the script to the desktop.
After that, you can create a text file with the following commands, and save it in .bat format:
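For illustration, the batch file could look something like this (the folder and script name are placeholders for your own setup):

```bat
:: run_audit.bat -- placeholder paths; adjust to your own folders
cd C:\Users\you\Documents\seo-audit
python sf_report.py
```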
Using Task Scheduler, you'll be able to set up your computer to run the script anytime you want.
Just select “Create a task”, and under “Actions” select “Start a program”, pointing it to your .bat file.
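If you prefer the command line, the same scheduled task can be created with Windows' schtasks utility (the task name, time and path below are placeholders):

```bat
schtasks /Create /TN "SF Audit Upload" /TR "C:\Users\you\Documents\seo-audit\run_audit.bat" /SC DAILY /ST 04:00
```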
Plotting the results in DataStudio
In order to plot the results in DataStudio, I would recommend using Cloud SQL. It's fast and works perfectly well with DataStudio, unless you are playing with a big amount of data. If that's the case, I would recommend switching from DataStudio to Tableau, and maybe using another database.