Automating Screaming Frog reports with Python and DataStudio

Jordi Cubiró · 5 min read · Aug 9, 2020

Screaming Frog is an amazing SEO tool for auditing almost any website. I have been using it for years, and it has saved me a lot of time when auditing projects. At some point, I started needing recurring SEO audits for some of my projects. Using the tool's scheduling features and a bit of Python code, I set up a solution that provides my team and me with up-to-date information about the status of the site.

The logic behind the solution

The idea behind the project is pretty simple. Using the scheduling features in Screaming Frog, we set up a crawl to run overnight. After the crawl, the files we need are saved to a folder and then picked up by a custom Python script, which modifies them and uploads them to our database. From there, it is just DataStudio doing its magic, with a dashboard that includes any information we need.

[Image: the logic behind the solution for automating Screaming Frog reporting in DataStudio]

Screaming Frog Scheduling Options

The first step towards implementing this solution is properly scheduling the crawl you want to perform. To do this, open Screaming Frog and go to File -> Scheduling. Click "+Add" to create a new scheduled crawl. In the "Start Options" tab, you can select the crawler mode, the crawl seed and the crawl configuration. (You can save a specific configuration for each project, depending on what you want to implement.)

In this tab you’ll also select which APIs you will use during the crawl.

In the "Export" tab, you'll select which tabs of the report you want to save. For this script, I only saved "Internal:All" and "All Inlinks" (a bulk export).

Also, if you want to automate the report, I recommend selecting "Overwrite files in output": the export file names then stay the same on every run, which keeps the script simpler.
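As a side note, recent versions of Screaming Frog can also run the same crawl headlessly from the command line, which can be handy on a server. The line below is only a sketch of the equivalent call on Windows: the flags follow the Screaming Frog CLI documentation, but check them against your version, and the URL and output path are just examples.

ScreamingFrogSEOSpiderCli.exe --crawl https://www.example.com --headless --output-folder "C:\Users\Administrator\Documents\yourauditfiles" --overwrite --export-tabs "Internal:All" --bulk-export "All Inlinks"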

The Python Script to Automate Screaming Frog reporting

At this point, Screaming Frog will crawl your site and save the resulting exports in a folder. From that folder, the files will be imported by a Python script. At its simplest, we could use just the following script to upload the raw data into the database, without even modifying the DataFrames.

# -- LIBRARIES IMPORT --
import pandas as pd
import pymysql  # driver used by SQLAlchemy's mysql+pymysql dialect
from sqlalchemy import create_engine

# Database access (replace the placeholders with your own credentials)
connector = 'mysql+pymysql://'
user = 'root'
password = 'your_password'
ip = '@0.00.000.00'
db = 'name_of_your_database'
engine = connector + user + ':' + password + ip + '/' + db

# Database tables to be created
tables = ["report", "inlinks"]
method = ["replace", "replace"]

# DataFrames from the CSV files exported by Screaming Frog
PR = pd.read_csv("internal_all.csv")
IL = pd.read_csv("all_inlinks.csv")

# DataFrames list for the DB
Dataframes = [PR, IL]

def toDB(dataframes, tables, method, engine):
    # Upload each DataFrame to its assigned table
    sqlEngine = create_engine(engine, pool_recycle=3600)
    for df, assignedtable, methodus in zip(dataframes, tables, method):
        dbConnection = sqlEngine.connect()
        try:
            df.to_sql(assignedtable, dbConnection, if_exists=methodus, index=False)
        except ValueError as vx:
            print(vx)
        except Exception as ex:
            print(ex)
        else:
            print("Table", assignedtable, "created successfully.")
        finally:
            dbConnection.close()

toDB(Dataframes, tables, method, engine)

This script alone would be enough to upload the tables into the database. Nevertheless, we would be using space inefficiently, since many columns would be identical in both DataFrames. The following functions may therefore help you shape the data before you upload it.

Dropping Columns of the Page Report that you may not want

Passing a list of columns to drop into this function will help you avoid duplicated columns across your database tables.

def droper(df, toDrop):
    # Remove unwanted or duplicated columns before uploading
    for col in toDrop:
        try:
            if col in df.columns:
                del df[col]
        except Exception:
            print('Could not drop', col)
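
For example, if the inlinks export repeats columns that already live in the page report, you could drop them from the inlinks DataFrame before uploading. The column names below are only illustrative; pass whatever overlaps in your own exports:

# Hypothetical list of columns duplicated across both exports
toDrop = ['Status Code', 'Status']
droper(IL, toDrop)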

Modify the headers of each DataFrame in Python

Modify the column headers of each DataFrame so they are "database-friendly".

def heads(DataFrames):
    # Normalize column names: replace spaces and hyphens, lowercase everything
    try:
        for df in DataFrames:
            df.columns = df.columns.str.replace(' ', '_')
            df.columns = df.columns.str.replace('-', '_')
            df.columns = df.columns.str.lower()
    except Exception:
        print("Heads Error")

Defining clusters and classifying data from the Page Report

Sometimes you may want to classify data and assign values to certain URLs. At first I tried to do this with CASE statements within DataStudio, which is a perfectly valid solution if your dataset is not very heavy. When playing with big datasets, though, it is more efficient to perform this classification within the script. Functions like the following can help you create clusters based on different data:

def TrafficClusters(row):
    # Bucket each URL by its number of GA users
    try:
        if row['GA Users'] == 0 and row['Status'] == 'OK':
            val = "URLs with No Users"
        elif row['GA Users'] > 0 and row['GA Users'] <= 10:
            val = "Between 1 and 10 Users"
        elif row['GA Users'] > 10 and row['GA Users'] <= 30:
            val = "Between 11 and 30 Users"
        elif row['GA Users'] > 30 and row['GA Users'] <= 60:
            val = "Between 31 and 60 Users"
        elif row['GA Users'] > 60 and row['GA Users'] <= 100:
            val = "Between 61 and 100 Users"
        elif row['GA Users'] > 100 and row['GA Users'] <= 150:
            val = "Between 101 and 150 Users"
        elif row['GA Users'] > 150 and row['GA Users'] <= 250:
            val = "Between 151 and 250 Users"
        elif row['GA Users'] > 250 and row['GA Users'] <= 350:
            val = "Between 251 and 350 Users"
        elif row['GA Users'] > 350 and row['Status'] == 'OK':
            val = "More than 350 Users"
        else:
            val = "URLs with No Users"
        return val
    except Exception:
        print("Traffic Clusters Error")

Running the Python script in a Remote Desktop

To run this script, I was inspired by this tutorial by IPullRank on how to run Screaming Frog on AWS. Applying the same logic to my script, I launched and configured an instance. To properly install and configure Screaming Frog, I recommend using a Remote Desktop application. Once you've installed Screaming Frog, you'll need to set up the proper environment for Python to run. To do so, just install all the packages your script uses, and download the script to your desktop.
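
For the script above, the environment amounts to the three libraries it imports. Assuming the standard Python launcher is available on the instance, something like:

py -m pip install pandas pymysql sqlalchemy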

After that, you can create a text file with the following commands and save it with a .bat extension:

cd\
cd C:\Users\Administrator\Documents\yourauditfiles
py thepythonscript.py

Using Task Scheduler, you'll be able to set up your computer to run the script anytime you want.

Just select "Create a task", and under "Actions" select "Start a program" and point it to your .bat file.
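
Alternatively, the same scheduled task can be created from the command line with Windows' schtasks utility. The task name, path and time below are placeholders:

schtasks /create /tn "SFReport" /tr "C:\Users\Administrator\Documents\yourauditfiles\runreport.bat" /sc daily /st 03:00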

Plotting the results in DataStudio

To plot the results in DataStudio, I would recommend using Cloud SQL. It's fast and works perfectly fine with DataStudio, unless you are playing with a large amount of data. In that case, I would recommend switching from DataStudio to Tableau, and maybe using another database.

[Image: plotting the Screaming Frog report in DataStudio]


Jordi Cubiró

Technical SEO Growth Strategist at Skale. Graduated in Business Management from Pompeu Fabra University. Check out my website: https://jordicubiro.com/