How to get into BigQuery analysis on Kaggle with Python?
Can you substitute BigQuery SQL queries with Python?
This post gives an overview of the steps needed to start using BigQuery on Kaggle. Initially, I believed that using Python can be enough. As I explain here, however, for effective work, learning some BigQuery seems unavoidable.
(This post is part of a series about analyzing BigQuery blockchain data with Python. For an overview of the project, see this link.)
Outline of this post:
- An overview of the Python libraries and methods I considered to use to access the blockchain data.
- A short discussion of the reasons why current Python methods are not enough and the need of BigQuery.
- Resources to learn BigQuery with the focus of using it on Kaggle.
Methods to access BigQuery with Python
At the very first, I wanted to avoid learning too much BigQuery SQL and hoped to rely on Python as much as possible. I am most familiar with Python’s pandas, which has some libraries and methods to handle BigQuery.
The ones I looked into were:
- The Python Ibis project
- BigQuery’s client-side library.
- bigquery_helper developed by the folks at Kaggle.
- Pandas’ pandas-read_gbq method and the pandas-gbq library behind it.
However, all of these turned out to be limited in several ways:
- Ibis: does not support BigQuery’s UNNEST* operation, and while it is in the roadmap, as of today, it is the least developed issue.
- BigQuery’s Client methods: It is handy when starting the analysis, but, again, neither it supports UNNEST and nor it is in its scope. Though, when I asked, they provided some workaround code.
- bigquery_helper: Strangely, although almost all the Kaggle tutorials on BigQuery and SQL are using it, it throws a warning that it is going to be deprecated. (By the time I finished writing this post, they actually rewrote some of the tutorials. They do not use this library any more). Thus, I chose to avoid it.
- pandas-read_gpq: This method itself takes an SQL query and turns it into a pandas dataframe, so its use is very limited.
(* UNNEST is the BigQuery operation for flattening nested arrays. I might write about it in another post.)
While, from among these, Ibis seems to be the most promising tool for future use, this will not happen soon. Furthermore, BigQuery also has the option to run machine learning algorithms on the data, which is not even in the scope of Ibis.
Assessing the need of BigQuery
In an ideal world, even if I would use BigQuery for pulling the data, I still would do most of the data munging and analysis with pandas. The reason behind this that, first, I am used to it, second, because I can parameterize it more efficiently by trying out different queries relatively quickly.
However, because of the sheer size of the data, and because of query limits, this is not viable. As a consequence, it is worth to give some time and think through exactly what data we want to pull and, therefore, about the SQL query we need to write.
So, in a nutshell, I decided to get a firmer hold on BigQuery than I intended originally.
Resources to learn BigQuery
I found several resources floating around in Kaggle for SQL practice. Because I found it a bit confusing to find my way around them (especially with the issue with bigquery-helper), here is a quick overview of the ones I found the most useful:
- A general introduction to BigQuery on Kaggle.
- Kaggle’s maintainers just have updated their SQL tutorials (previously these were using the deprecated bigquery-helper library). They are very basic, but still useful: Intro to SQL and Advanced SQL.
- A useful tutorial on the client object, using schemas, and estimating query size.
- A practical overview of BigSQL on Kaggle and especially about how to control resource use.
Kaggle also had a great series of more practical SQL tutorials called the SQL Scavenger Hunt, which, however, uses the bigquery-helper. Accordingly, I would not advise it to learn BigQuery, but I think it is still a useful resource for practice.
All of these tutorials, nevertheless, touch only the surface of SQL and its BigQuery specifics. Unfortunately, I did not find too much BigQuery material around, and besides the ones above, all of them assume that you use BigQuery on Google Cloud. Nevertheless, for its use on the Kaggle datasets, the Query reference guide seems to be particularly useful.
I also looked into blockchain-specific tutorials, which I do not discuss here. I might do this in a later post.