Learning to analyze huge BigQuery datasets using Python on Kaggle

Over the last few weeks I’ve learned how to analyze some of BigQuery’s cool public datasets using Python. If you’re not familiar with it, BigQuery makes it very easy to query terabytes of data in seconds.

In this post, I’m going to share some tips and tricks for analyzing BigQuery data using Python in Kernels, Kaggle’s free coding environment. By the end, you’ll see how easy it is to write and execute your own queries. Plus, you’ll learn which license is the most popular on GitHub, based on hundreds of thousands of open source repositories.

Background

Recently, Kaggle made several BigQuery public datasets like GitHub Repos and Hacker News accessible. There are a few things that are nice about this:

  • You get a 5TB 30-day free quota
  • You can use Kernels to query the data and analyze it in Python
  • No credit card required and zero risk of paying any money

The free quota plus Kernels’ one-click notebook environment makes this the perfect sandbox to learn to work with big data. All you need is a Kaggle account.

If you’d prefer to follow this tutorial interactively, skip ahead to my kernel here. Just click the blue “Fork Notebook” button and execute the code yourself.

Step One: BigQuery Datasets on Kaggle

The first step is to find the BigQuery datasets accessible on Kaggle. Go to Kaggle Datasets and select “BigQuery” in the “File Types” dropdown. You’ll get a list like this:

I’m going to go for the GitHub Repos dataset. You can read Felipe Hoffa’s introduction to this amazing, 3TB dataset here. In fact, you’ve probably seen his analyses comparing tabs versus spaces.

Here’s what the GitHub Repos dataset page looks like on Kaggle:

Now that I’m ready to start analyzing the data, all I have to do is click “New Kernel”. Fortunately, I don’t have to worry about package installation or downloading anything, so from here on I can focus on the code (and so can you!).

Step Two: Querying Data

In this part of the tutorial, I’ll demonstrate how to use this BigQuery helper module to safely query the largest BigQuery dataset available on Kaggle, GitHub Repos. Weighing in at 3TB total, you can see how it would be easy for me to quickly exhaust my 5TB 30-day quota scanning large tables.

The bq_helper module simplifies the common read-only tasks we can do using the BigQuery Python client library on Kaggle.

You’ll learn how to do the following in this section using bq_helper:

  1. Create a BigQueryHelper object
  2. Understand the tables in the dataset
  3. Estimate the size of your queries before you make them
  4. Execute a query safely

Create a BigQueryHelper Object

To start, we use bq_helper to create a BigQueryHelper object. This requires the project name and dataset name as its two arguments.

Below is the code I used to create a BigQueryHelper object for the GitHub dataset; I named it bq_assistant. Behind the scenes, this establishes an authenticated session and prepares a reference to the dataset that lives in BigQuery.
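If you want to type it out yourself, it looks roughly like this (a minimal sketch; the parameter names reflect the version of bq_helper pre-installed in Kernels):

```python
# bq_helper comes pre-installed in Kaggle Kernels
import bq_helper

# Point the helper at the GitHub Repos dataset, which lives in the
# public "bigquery-public-data" project as the "github_repos" dataset
bq_assistant = bq_helper.BigQueryHelper(active_project="bigquery-public-data",
                                        dataset_name="github_repos")
```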

When you run this code in a kernel on Kaggle, you’ll see a BigQueryHelper object in the kernel editor’s environment variable viewer like below.

We only need to create this object once in our session and it’s ready to use for any queries we throw at it.

Understand the tables in the dataset

With our BigQueryHelper object for the GitHub dataset, bq_assistant, there are a few super simple read-only functions we can use to learn more about the data, including:

  • Listing tables: bq_assistant.list_tables()
  • Getting table schema: bq_assistant.table_schema("licenses")
  • Inspecting table rows: bq_assistant.head("licenses", num_rows=10)

These are easy ways to get a feel for the data you want to work with without expending quota.
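Strung together in a kernel, that exploration might look like this (a sketch reusing the bq_assistant object from above):

```python
# List every table in the github_repos dataset
print(bq_assistant.list_tables())

# Look at the column names and types of the "licenses" table
print(bq_assistant.table_schema("licenses"))

# Preview the first ten rows of the "licenses" table
bq_assistant.head("licenses", num_rows=10)
```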

Of course if you follow along by forking this kernel, you can also get all of this information by clicking the “Data” tab and looking at the file previews.

Estimate the size of your queries before you make them

Now for the really cool part. We’re eager to start analyzing the GitHub BigQuery dataset to uncover insights into open source software development, but we should be careful about how much data we scan in our queries. Let’s make those free 5TB count!

Fortunately, the bq_helper module gives us tools to very easily estimate the size of our queries before we get ahead of ourselves. See below:
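Here’s a sketch of that estimate (I’m assuming a query that pulls a couple thousand commit messages from the commits table, which is what the numbers below describe):

```python
# Pull 2,000 commit messages from the commits table
QUERY = """
        SELECT message
        FROM `bigquery-public-data.github_repos.commits`
        LIMIT 2000
        """

# Ask BigQuery how many GB this query would scan, without running it
bq_assistant.estimate_query_size(QUERY)
```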

The result of `estimate_query_size`

Without needing to run the query, we now know it would “cost” nearly 18GB of our quota. You can see why it’s extremely helpful to know this before actually spending any of it!

Note that the query obviously wouldn’t return an object that’s 17.6GB — it would only return 2000 rather short commit messages.

Let’s change the query and see what happens:
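One tempting change is to shrink the LIMIT (a sketch; the exact tweak in my kernel may differ, but the takeaway is the same):

```python
# Same query, but asking for far fewer rows
QUERY = """
        SELECT message
        FROM `bigquery-public-data.github_repos.commits`
        LIMIT 10
        """

# The estimate is essentially unchanged: BigQuery still scans
# the entire message column no matter how small the LIMIT is
bq_assistant.estimate_query_size(QUERY)
```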

The result of the second `estimate_query_size`

Again, we’re scanning the same 17.6GB, which is the full size of the message field. The lesson: the data you scan is not the same as the data you expect the query to return as its result. Check out BigQuery’s best practices docs for more information.

Don’t be discouraged by this. When you do make a query request, your results will be cached by default. As long as you’re in the same interactive kernel session, you can run the same cell/query multiple times without expending multiples of the quota cost. Whew! The nice caching behavior is especially helpful because when you click “New Snapshot” to save your work, your kernel will be freshly executed from top to bottom.

Execute a query safely

Now that we’re comfortable with the huge amounts of data at our fingertips and know we can confidently estimate the size of our queries, let’s actually try fetching some data!

The query_to_pandas_safe function is another bq_helper function, and it’s the one that actually executes our query. It has two advantages over using the base BigQuery Python client library:

  • By default it won’t execute a query with a size estimate over 1GB
  • It returns a convenient pandas dataframe

Let’s test out query_to_pandas_safe on our first query. If it works as intended, the query will fail to execute, since we know it scans 17.6GB of data.
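Here’s a sketch of that test, relying on the default 1GB limit:

```python
# The first query from above: 2,000 commit messages, ~17.6GB scanned
QUERY = """
        SELECT message
        FROM `bigquery-public-data.github_repos.commits`
        LIMIT 2000
        """

# query_to_pandas_safe refuses to run any query whose estimated scan
# exceeds max_gb_scanned (1GB by default), so this call is canceled
commit_messages = bq_assistant.query_to_pandas_safe(QUERY)
```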

Our query was canceled because it would scan more than 1GB of data

As expected, the query was canceled because its estimated scan size exceeded 1GB.

With these functions, we’re now safe and ready to write and execute a query that will help answer our question: What are the most popular licenses used for open source projects shared on GitHub?

Let’s use the best practices we learned with estimate_query_size and query_to_pandas_safe to query the licenses table of the GitHub BigQuery dataset:
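Here’s a sketch of that query (the column alias number_of_repos is just a name I chose for the count):

```python
# Count how many repositories use each license
QUERY = """
        SELECT license, COUNT(*) AS number_of_repos
        FROM `bigquery-public-data.github_repos.licenses`
        GROUP BY license
        ORDER BY number_of_repos DESC
        """

# The licenses table is tiny compared to commits, so this is well
# under the 1GB default limit -- but it never hurts to check first
bq_assistant.estimate_query_size(QUERY)

# Fetch the results into a pandas dataframe
licenses = bq_assistant.query_to_pandas_safe(QUERY)
licenses.head()
```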

Executing this query takes seconds. And now that we’ve got a pandas dataframe of our results, we’re in very familiar data science territory.

Step Three: Visualize Most Popular GitHub Project Licenses

The results of our query (and our efforts learning BigQuery) all become real once we visualize them! I’m going to use matplotlib with seaborn to create a bar plot showing the count of GitHub projects by type of license.

Here’s the code I used to do that in my kernel:
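It looked roughly like this (a sketch using the licenses dataframe from the query above; the exact styling in my kernel may differ):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Bar plot of the number of repositories per license, most popular first
plt.figure(figsize=(12, 6))
sns.barplot(x="license", y="number_of_repos", data=licenses)
plt.title("Most popular licenses for open source projects on GitHub")
plt.xlabel("License")
plt.ylabel("Number of repositories")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
```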

And here’s the resulting bar plot:

This shows that MIT is by far the most popular license, well ahead of the next most common licenses: Apache 2.0, GPL 2.0, and GPL 3.0. And because this is Python, I wrote more lines trying to visualize my data than I did to retrieve query results from BigQuery. It’s amazing how simple and quick it is to analyze big data in BigQuery.

For the record, the Python kernel I wrote is Apache 2.0-licensed. :)

Conclusion

I hope you’ve enjoyed this walkthrough and learned something new. Personally, I felt confident about exploring this massive 3TB dataset knowing the functions available to me in bq_helper would prevent me from accidentally spending more quota than I intended.

Here are some additional resources for working with BigQuery datasets in Kaggle Kernels:


Let me know what you thought in the comments or follow me on Twitter. Thanks to Sohier Dane for creating the BigQueryHelper module.