Finding Outliers in Google Search Console Clicks with Interquartile Range (IQR)

Masaki (o '_')--👍
3 min read · Dec 17, 2019


At work, I recently wrapped up setting my individual development plan for this upcoming year, which included taking 3 courses in data science to better support my team in SEO analyses.

What I forgot to include in the plan? A goal to build a number of Python scripts to help my team get the answers they need to get the job done quickly. So for now, let’s just say that goal is a modest 3 scripts with this being the first of three. :)

Question:

What landing pages and their respective queries in Search Console are losing significant amounts of clicks period over period?

Data:

For this script, we'll be using Search Console URL data from Google Data Studio with the following criteria:

  • Dimensions: Landing Page, Query
  • Metrics: Clicks, Clicks Delta (Absolute), Impressions, Impressions Delta (Absolute), CTR, CTR Delta (Absolute)

Sample data: https://github.com/saksters/iqr-gsc-outliers/blob/master/sample-pop-gsc-data.csv

Methodology:

One way of determining outliers in a data set is to calculate the interquartile range (IQR). IQR is used to identify the spread of the middle 50% of a data set and can help us see how concentrated or spread out our click differences are from the median. A higher IQR indicates that our clicks are spread out more widely from the median, whereas a low IQR indicates that clicks are more concentrated around the median.

Once we calculate the IQR, we only really need the landing pages and queries that are doing poorly, so we'll calculate the Lower Fence, or lower limit, of our data set and then export that data for optimizations.
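As a quick sketch of the math before we get to the real script (the click deltas below are toy numbers, not the sample GSC data), the IQR and Lower Fence work like this:

```python
import pandas as pd

# Hypothetical period-over-period click deltas (not the real GSC data)
clicks_diff = pd.Series([-120, -3, -1, 0, 2, 4, 5, 7, 9, 40])

q1 = clicks_diff.quantile(0.25)   # 25th percentile
q3 = clicks_diff.quantile(0.75)   # 75th percentile
iqr = q3 - q1                     # spread of the middle 50%
lower_fence = q1 - 1.5 * iqr      # anything below this is an outlier

outliers = clicks_diff[clicks_diff < lower_fence]
print(lower_fence)        # -11.625
print(outliers.tolist())  # [-120]
```

Here only the -120 delta falls below the fence, which is exactly the kind of sharp click loss we want to surface.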

Code:

1. Import Pandas

import pandas as pd

2. Import and Clean Dataset

In the data, you may notice the use of delta for some of the column headers. This makes referring to those columns a little bit tricky when working with the table, so we’ll go ahead and rename the columns to our preferred naming convention as soon as we read in the csv.

df = pd.read_csv(
    "sample-pop-gsc-data.csv",
    skiprows=1,
    names=['Landing Page', 'Query', 'URL Clicks', 'URL Clicks Diff',
           'Impressions', 'Impressions Diff', 'URL CTR', 'URL CTR Diff'],
)

You may also notice a number of NaN instances. This means that the cell is undefined, or in this case empty, which makes sense as there are instances where Clicks, Impressions, or Avg. Position did not change over the period. For our calculations, we only want to work with the landing pages and queries that saw a change over the period, so we'll call DataFrame.fillna() to fill in each NaN instance with a 0. From there we can filter out any landing pages that did not see a change over the period.

df = df.fillna(0)
df = df[df['URL Clicks Diff'] != 0]

3. Graph as a Box Plot

Just for fun, we can graph the difference in clicks as a box plot to see how spread out our data is.

df.boxplot(column=['URL Clicks Diff'])
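If you're running this as a plain script rather than in a notebook, the plot won't display on its own; a minimal sketch (the stand-in data and output file name are just examples) that saves the box plot with matplotlib:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in data; in the article, df comes from the GSC csv
df = pd.DataFrame({"URL Clicks Diff": [-120, -3, -1, 0, 2, 4, 5, 7, 9, 40]})

df.boxplot(column=["URL Clicks Diff"])
plt.savefig("clicks-diff-boxplot.png")  # write the figure to disk
plt.close()
```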

4. Calculate IQR and Lower Fence

Now we can go ahead and calculate the first and third quartiles using the pandas function DataFrame.quantile(), assign the difference as the IQR, and print the values.

Q1 = df['URL Clicks Diff'].quantile(0.25)
Q3 = df['URL Clicks Diff'].quantile(0.75)
IQR = Q3-Q1
print('Q1: '+str(round(Q1, 2)))
print('Q3: '+str(round(Q3, 2)))
print('IQR: '+str(round(IQR, 2)))

Since we only want to see the clicks that we've lost, we can calculate the Lower Fence, or lower bound, of our data.

LF = Q1-1.5*IQR
print('Lower Fence: '+str(round(LF, 2)))

5. Pull Outliers Out and Export to CSV

With the Lower Fence calculated, we can now filter for all landing pages that fall below the lower limit and export those URLs to a csv.

# write the outliers DataFrame to a csv
outliers = df[df['URL Clicks Diff'] < LF]
outliers.to_csv('outliers-pop-gsc-data.csv', index=False)
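Putting the steps together, the whole clean-filter-export flow can be wrapped in one small function. This is just a sketch: the function name is mine, and the toy DataFrame stands in for the real Data Studio export.

```python
import pandas as pd

def export_click_outliers(df: pd.DataFrame, out_path: str) -> pd.DataFrame:
    """Keep rows whose click delta falls below the IQR Lower Fence and export them."""
    df = df.fillna(0)
    df = df[df["URL Clicks Diff"] != 0]  # drop rows with no change
    q1 = df["URL Clicks Diff"].quantile(0.25)
    q3 = df["URL Clicks Diff"].quantile(0.75)
    lower_fence = q1 - 1.5 * (q3 - q1)
    outliers = df[df["URL Clicks Diff"] < lower_fence]
    outliers.to_csv(out_path, index=False)
    return outliers

# Toy data standing in for the Data Studio export
toy = pd.DataFrame({
    "Landing Page": [f"/page-{i}" for i in range(10)],
    "URL Clicks Diff": [-120, -3, -1, 0, 2, 4, 5, 7, 9, 40],
})
result = export_click_outliers(toy, "outliers-pop-gsc-data.csv")
print(result["Landing Page"].tolist())  # ['/page-0']
```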

Conclusion:

Although it’s nothing fancy like machine learning, this little formula has been working wonders for understanding which landing pages and queries have been suffering in our weekly reports. It beats staring at a large spreadsheet and “spotting” for outliers, right?

Thanks for being here and joining me in my practice sesh!
