Revisiting Google Search Console Data into Google BigQuery

Ben Powis
5 min read · Feb 11, 2020


Taking a fresh look at sending Google Search Console data into BigQuery, using Python, for free.

Back in 2018 I wrote my first post about loading Google Search Console data into BigQuery. I was using examples found online and adding in the BigQuery element.

This has worked just fine for a long time, but more recently I’ve had need to recreate this process, update the code and make it easier to run without storing local files.

I wanted my new code to be:

  • Easily transferable: no need for local files, and everything can be defined in one easy-to-read piece of code.
  • Written in Python 3 (obviously, but many of the older code examples for Search Console are still in Python 2).
  • Able to pull lots of data: you can now request up to 25,000 rows in a single call to the Search Console API, and I want to make use of this enhancement to grab as much data as I can!
  • Easy to scale, so that a new Search Console property can be added with minimal effort.

An updated approach

First things first: my old approach used a more complex method of authentication (for me, at least!) and relied on saving a local file, which I don’t want to do in my new code. As such, I am going to authenticate using a Google Cloud Platform service account. If you’re not familiar with service accounts, take a look at the Google documentation.

You’ll need the .json key file that is generated when you create the account, so keep this somewhere handy. I did say I didn’t want to use local files, but this is the exception, since it’s how I authenticate all my other Python jobs!

The Google client libraries provide a really nice way to authenticate access directly to a service. With a few lines of code we can build a service for accessing Search Console data:

from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ['https://www.googleapis.com/auth/webmasters']
SERVICE_ACCOUNT_FILE = 'my_key.json'

# Build credentials from the service account key file.
credentials = service_account.Credentials.from_service_account_file(
    SERVICE_ACCOUNT_FILE, scopes=SCOPES)

# Build the Search Console (webmasters) service.
service = build(
    'webmasters',
    'v3',
    credentials=credentials
)

To grab our data from Search Console we are going to use the very handy searchanalytics() function.

response = service.searchanalytics().query(siteUrl=site_url, body=request).execute()

This returns a Python dictionary of results, which we then turn into a pandas DataFrame so we can clean it up and fire it right back into BigQuery (or anywhere else, with some code changes!).
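For illustration, here is one way the response can be flattened into a DataFrame. This is a sketch rather than the exact clean-up from the repo, and the column names assume the ["page", "device", "query"] dimensions used later in this post:

import pandas as pd

# Each row in the response has a 'keys' list (one value per requested dimension)
# plus the clicks, impressions, ctr and position metrics.
df = pd.DataFrame(response.get("rows", []))

if not df.empty:
    # Split the 'keys' list into one column per dimension.
    df[["page", "device", "query"]] = pd.DataFrame(df["keys"].tolist(), index=df.index)
    df = df.drop(columns=["keys"])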

Making things repeatable

You’ll notice the code uses some pre-defined variables. My next step in making this code reusable and scalable is to wrap it all into a single function. I’ll spare you the full code here (a rough sketch follows the argument list below), but it starts like this:

def get_sc_df(site_url, start_date, end_date, start_row):

The arguments this function takes are:

  • The web URL of the Search Console property, e.g. http://google.com.
  • The start date — when do you want your data collection to start?
  • The end date.
  • The starting row: we know that we can return up to 25,000 rows of data with each API call, but what if we have more than 25,000 rows? We can grab these by setting our starting row to a value higher than 0 (more on this shortly!).
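To give a sense of the shape of that function, a minimal sketch might look like the following, reusing the service object built earlier. This is not the full code from the repo, just an illustration of how the arguments map onto the Search Analytics API request body:

def get_sc_df(site_url, start_date, end_date, start_row):
    """Pull up to 25,000 rows of Search Console data and return a DataFrame (sketch)."""
    request = {
        "startDate": start_date,
        "endDate": end_date,
        "dimensions": ["page", "device", "query"],  # most granular level
        "rowLimit": 25000,                          # maximum rows per call
        "startRow": start_row
    }
    response = service.searchanalytics().query(siteUrl=site_url, body=request).execute()
    df = pd.DataFrame(response.get("rows", []))
    # ... clean up the DataFrame and append it to BigQuery here ...
    return df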

Grabbing all our data

So we know that we have many rows of data, more than we can acquire in a single API call (up to 25,000 rows), but we are greedy and want it all! How can we get it?

We can use the function we have defined in a loop, for example:

for x in range(0, 100000, 25000):
    y = get_sc_df("https://www.google.com", "2020-01-01", "2020-01-01", x)
    y.info()

    # Fewer than 25,000 rows returned means there is no more data to fetch.
    if len(y) < 25000:
        break
    else:
        continue

Let’s break this down: we are passing the value of x into our function, which if you remember is the starting row. x takes values between 0 and 100,000 in steps of 25,000, giving us every multiple of 25,000 up to 100,000 and enabling us to loop through all our results from Search Console.

Note that if you expect to have more than 100,000 rows of data per day you should increase this number!

My code is written to pull Search Console data in its most granular form, down to page level. The exact dimensions used are ["page", "device", "query"]; you can change this, but a few other changes will be required when the final DataFrame is cleaned and constructed.

Now, great — we can loop through all our results and send them to BigQuery in chunks of 25,000 rows. But we have multiple Search Console properties, can we grab data for them all, in one go?

Sure we can, let’s wrap an additional loop around our current one:

for p in properties:
    for x in range(0, 100000, 25000):
        y = get_sc_df(p, "2020-01-01", "2020-01-01", x)

        # Move on to the next property once we run out of rows.
        if len(y) < 25000:
            break
        else:
            continue

The loop requires properties, which is a list of URL strings relating to the Search Console properties you want to pull data from. Remember that you’ll need to add the service account user to each Search Console property so that you have permission to access this data.
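For example, properties could simply be defined at the top of the script (these URLs are placeholders):

# Placeholder list of Search Console properties to loop through.
properties = [
    "https://www.example.com/",
    "https://www.example.co.uk/"
]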

Wrapping up!

So here we go: we have a single block of Python code where we define a single function and, by setting a few custom variables, we can set this up to loop through many properties every day and download all our Search Console data into BigQuery.

The full code is on GitHub here, along with an explanatory Jupyter notebook. Feel free to share any comments or suggestions!

Technical notes

Techy steps you’ll need to take to get everything working:

  • Create a Google Cloud project.
  • Activate the Google Search Console API on your GCP project.
  • Create a service account user, with access to write to BigQuery.
  • Add this service account user to each Search Console property that you want to download data from.
  • Edit the search_console_bq.py file and set the parameters at the top of the code, populating them with your values.
  • The code is designed to load all data to a single BigQuery table, appending each time. This table does not need to exist before the first run.
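As a rough sketch of that final load step (the dataset, table and project names here are placeholders), appending a DataFrame to BigQuery can be done with pandas-gbq, which creates the table on the first run if it doesn’t already exist:

# Append the cleaned DataFrame to a single BigQuery table (placeholder names).
df.to_gbq(
    destination_table="search_console.sc_data",  # dataset.table
    project_id="my-gcp-project",
    if_exists="append",                          # create on first run, append thereafter
    credentials=credentials
)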


Ben Powis

Machine Learning Specialist working in online retail, sharing my interests in data, visualization and code.