BigQuery and Public Datasets. Overview for Data Analysts

Konstantin Dorichev
3 min read · Jul 2, 2019

In this article we’ll briefly explore what BigQuery is and how a data analyst can access and use it through various interfaces, using publicly available datasets.

BigQuery Introduction

BigQuery [1] is a Google Cloud service that provides a highly scalable, enterprise-grade warehouse for big data. BigQuery uses an ANSI SQL 2011-compliant dialect and lets you query up to 1 TB of data (up to 5 TB when accessed via Kaggle) and store 10 GB of data for free each month.

Although designed to store and process petabytes of data in seconds, BigQuery is also useful for smaller tasks and datasets.

Public Datasets

Google Cloud hosts quite a few (113 as of the date of publication) public datasets [2] for anyone to explore. These are also available via BigQuery. We will use some of these datasets in our BigQuery demonstrations below.

Interacting with BigQuery

There are currently four ways to interact with BigQuery:

  1. The BigQuery web UI in the Google Cloud Platform (GCP) Console
  2. The BigQuery client libraries
  3. The BigQuery command-line tool
  4. The BigQuery REST API

There is also the BigQuery classic web UI, but it is scheduled for deprecation in June 2020.

Let’s explore the first two of these interfaces.

BigQuery web UI

Using the web user interface (UI) [3] with BigQuery implies that you have set up an account with Google Cloud Platform and enabled billing in the Google Cloud Console.

Let’s get data from the International Census Data by the United States Census Bureau public dataset [4] and write a query to answer this question: Which countries have the largest proportion of their population under 25?
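The query from the original screenshot is not reproduced here, but a sketch along the following lines answers the question. The table names midyear_population_agespecific and midyear_population and their columns are assumed from the dataset’s schema, so verify them in the console before running:

    SELECT
      age.country_name,
      ROUND(SUM(age.population) / pop.midyear_population * 100, 2) AS pct_under_25
    FROM
      `bigquery-public-data.census_bureau_international.midyear_population_agespecific` AS age
    INNER JOIN
      `bigquery-public-data.census_bureau_international.midyear_population` AS pop
      ON age.country_code = pop.country_code AND age.year = pop.year
    WHERE
      age.year = 2019
      AND age.age < 25
    GROUP BY
      age.country_name, pop.midyear_population
    ORDER BY
      pct_under_25 DESC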

Note that the table’s name in the FROM clause is in backticks.

This SQL query produces the result shown in the screenshot below.

[Screenshot: BigQuery web UI SQL request and result.]

Should you wonder about your own country, here is the full output.

BigQuery Client Libraries

The Google client libraries [5] for BigQuery support quite a few languages, namely C#, Go, Java, Node.js, PHP, Python, and Ruby. In this article we will explore Python.

My experience with the Python client library was smooth thanks to Kaggle’s Kernels (a.k.a. Jupyter Notebooks). All you need to do is import the library and set up a client:
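The code from the original notebook is not shown here; a minimal sketch looks like this. Run in a Kaggle kernel, creating the client prints the line below:

    from google.cloud import bigquery

    # Create a client; in a Kaggle kernel the credentials
    # are supplied by the environment
    client = bigquery.Client()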

Using Kaggle's public dataset BigQuery integration.

Interaction with BigQuery happens through this client object. Let’s set up a reference to another public dataset, Hacker News [6].
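A sketch of fetching the dataset; the dataset id hacker_news under the bigquery-public-data project is taken from its BigQuery listing:

    # Build a reference to the dataset and fetch its metadata
    dataset_ref = bigquery.DatasetReference("bigquery-public-data", "hacker_news")
    dataset = client.get_dataset(dataset_ref)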

Let’s get the list of tables in the dataset:
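For example, with the client and dataset from above:

    # List the tables in the dataset and print their ids
    for table in client.list_tables(dataset):
        print(table.table_id)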

comments
full
full_201510
stories

And one table’s schema:
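A sketch, assuming we inspect the comments table whose schema is listed below:

    # Fetch the comments table and print each schema field
    table = client.get_table(dataset_ref.table("comments"))
    for field in table.schema:
        print(field.name, field.field_type, field.description)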

id       INTEGER    Unique comment ID
by       STRING     Username of commenter
author   STRING     Username of author
time     INTEGER    Unix time
time_ts  TIMESTAMP  Human readable time in UTC (format: YYYY-MM-DD hh:mm:ss)
text     STRING     Comment text
parent   INTEGER    Parent comment ID
deleted  BOOLEAN    Is deleted?
dead     BOOLEAN    Is dead?
ranking  INTEGER    Comment ranking

Now let’s actually send an SQL query. We’ll request the commenters who have made more than 10,000 comments, along with their comment counts, sorted in descending order.

There is a neat way to ensure that our query won’t eat up the whole free quota and thus require a payment: the maximum_bytes_billed parameter of the job configuration, which makes a query fail instead of billing more than the cap.
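A sketch of both pieces; the 1 GB cap and the use of to_dataframe (which requires pandas) are illustrative choices, not the original notebook’s exact code:

    # The query: commenters with more than 10,000 comments,
    # sorted by comment count in descending order
    query = """
        SELECT `by` AS author, COUNT(1) AS comment_count
        FROM `bigquery-public-data.hacker_news.comments`
        GROUP BY author
        HAVING comment_count > 10000
        ORDER BY comment_count DESC
    """

    # Cap the bytes billed at ~1 GB; the job errors out
    # instead of exceeding the cap
    safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
    query_job = client.query(query, job_config=safe_config)

    # Materialize the result as a pandas DataFrame
    results = query_job.to_dataframe()
    print(results)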

The library’s classes, attributes, and methods are well described in the API Reference [7].

You are welcome to view or fork my kernel on Kaggle and play with it. Comments are always welcome.

Conclusion

We’ve explored a few ways to interact with the public datasets on BigQuery. I hope this helps you get going with these or your own datasets.

References

  1. BigQuery service description
  2. Google Cloud Public Datasets
  3. BigQuery web UI (GCP Console)
  4. International Census Data (United States Census Bureau) dataset
  5. BigQuery client libraries
  6. Hacker News public dataset
  7. Google Cloud BigQuery Python API Reference
