How to get into BigQuery analysis on Kaggle with Python?

Can you substitute BigQuery SQL queries with Python?

Andras Novoszath
Oct 21 · 4 min read
A lazy python

This post gives an overview of the steps needed to start using BigQuery on Kaggle. Initially, I believed that using Python can be enough. As I explain here, however, for effective work, learning some BigQuery seems unavoidable.

(This post is part of a series about analyzing BigQuery blockchain data with Python. For an overview of the project, see this link.)

Outline of this post:

  • An overview of the Python libraries and methods I considered to use to access the blockchain data.
  • A short discussion of the reasons why current Python methods are not enough and the need of BigQuery.
  • Resources to learn BigQuery with the focus of using it on Kaggle.

Methods to access BigQuery with Python

At the very first, I wanted to avoid learning too much BigQuery SQL and hoped to rely on Python as much as possible. I am most familiar with Python’s pandas, which has some libraries and methods to handle BigQuery.

The ones I looked into were:

However, all of these turned out to be limited in several ways:

  • Ibis: does not support BigQuery’s UNNEST* operation, and while it is in the roadmap, as of today, it is the least developed issue.
  • BigQuery’s Client methods: It is handy when starting the analysis, but, again, neither it supports UNNEST and nor it is in its scope. Though, when I asked, they provided some workaround code.
  • bigquery_helper: Strangely, although almost all the Kaggle tutorials on BigQuery and SQL are using it, it throws a warning that it is going to be deprecated. (By the time I finished writing this post, they actually rewrote some of the tutorials. They do not use this library any more). Thus, I chose to avoid it.
  • pandas-read_gpq: This method itself takes an SQL query and turns it into a pandas dataframe, so its use is very limited.

(* UNNEST is the BigQuery operation for flattening nested arrays. I might write about it in another post.)

While, from among these, Ibis seems to be the most promising tool for future use, this will not happen soon. Furthermore, BigQuery also has the option to run machine learning algorithms on the data, which is not even in the scope of Ibis.

UNNEST with Ibis: not anytime soon…

Assessing the need of BigQuery

In an ideal world, even if I would use BigQuery for pulling the data, I still would do most of the data munging and analysis with pandas. The reason behind this that, first, I am used to it, second, because I can parameterize it more efficiently by trying out different queries relatively quickly.

However, because of the sheer size of the data, and because of query limits, this is not viable. As a consequence, it is worth to give some time and think through exactly what data we want to pull and, therefore, about the SQL query we need to write.

So, in a nutshell, I decided to get a firmer hold on BigQuery than I intended originally.

A sad panda. Photo by Dušan Smetana on Unsplash

Resources to learn BigQuery

I found several resources floating around in Kaggle for SQL practice. Because I found it a bit confusing to find my way around them (especially with the issue with bigquery-helper), here is a quick overview of the ones I found the most useful:

Kaggle also had a great series of more practical SQL tutorials called the SQL Scavenger Hunt, which, however, uses the bigquery-helper. Accordingly, I would not advise it to learn BigQuery, but I think it is still a useful resource for practice.

All of these tutorials, nevertheless, touch only the surface of SQL and its BigQuery specifics. Unfortunately, I did not find too much BigQuery material around, and besides the ones above, all of them assume that you use BigQuery on Google Cloud. Nevertheless, for its use on the Kaggle datasets, the Query reference guide seems to be particularly useful.

I also looked into blockchain-specific tutorials, which I do not discuss here. I might do this in a later post.

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem

Andras Novoszath

Written by

Freelance Data Analyst

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade