Step-by-Step Guide to Load Data into BigQuery

Nikhil (Srikrishna) Challa · Published in Frontend Weekly · 8 min read · Aug 15, 2020

In this Part 6 of the series, “Modernisation of a Data Platform”, we focus a little more on BigQuery’s key concepts, which are essential for designing a data warehouse (DWH).

We will see how to approach table design in BigQuery using different methods, load a covid19_tweets dataset, and run a query to analyse the data.

Creating a Schema:

We can create a schema in BigQuery either while migrating data from an existing data warehouse or while ingesting data into BigQuery from various sources, whether in the cloud or on-premises.

Other than manually defining the schema, BigQuery also gives an option to auto-detect the schema.

How does this auto-detect work?

BigQuery compares the header row of an input file against a representative sample of about 100 records taken from row 2 onwards. If the data types of the sampled rows differ from those in the first row, BigQuery treats the first row as a header and uses its values as column names. The user just has to enable auto-detect, and the schema is created automatically while the load happens.

Datatypes in BigQuery:

While most of the data types are standard ones such as INTEGER, FLOAT, NUMERIC, BOOLEAN and so on, one special data type that we need to discuss is STRUCT.

This data type is used for nested and repeated fields. The best example to represent a STRUCT is addresses. Normally, an address has multiple sub-fields such as is_active, address line 1, address line 2, town, city, post code, number of years at the address, and so on.

All these fields can be nested under the parent field ‘Addresses’. While scalar fields are typically ‘nullable’ or ‘non-nullable’, a STRUCT that holds repeated records, such as multiple addresses per user, is given the mode ‘REPEATED’.
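As an illustration, here is a minimal sketch of such a schema using the Python client library; the parent and sub-field names simply mirror the address example above and are hypothetical:

from google.cloud import bigquery

# Parent field 'Addresses' declared as a RECORD (STRUCT) with mode REPEATED,
# so one user row can hold many addresses. Sub-field names are illustrative.
addresses = bigquery.SchemaField(
    "Addresses",
    "RECORD",
    mode="REPEATED",
    fields=[
        bigquery.SchemaField("is_active", "BOOLEAN"),
        bigquery.SchemaField("address_line_1", "STRING"),
        bigquery.SchemaField("address_line_2", "STRING"),
        bigquery.SchemaField("town", "STRING"),
        bigquery.SchemaField("city", "STRING"),
        bigquery.SchemaField("post_code", "STRING"),
        bigquery.SchemaField("number_of_years_addr", "INTEGER"),
    ],
)

schema = [
    bigquery.SchemaField("user_id", "INTEGER", mode="REQUIRED"),
    addresses,
]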

Creating Tables & Managing Access:

The easiest way to create a table in BigQuery is via the Cloud Console. The UI is extremely friendly, and users can navigate to the BigQuery console to create tables in a dataset.

Alternatively, there is a REST API that can be used to insert tables with a specific schema into a dataset.

BigQuery provides an option to restrict access at the dataset level. There is also a beta feature (at the time of writing) to grant access at the table or view level. Access can be granted as data viewer, data editor, data owner, admin, and so on.

BigQuery allows users to copy a table, delete a table, alter a table’s expiration time, update its description, and so on. All these actions are possible via the Console, the REST API, the bq command-line tool, or the client libraries.
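As a rough sketch of how some of these actions look through the Python client library (the project, dataset, and table names here are hypothetical placeholders):

from datetime import datetime, timedelta, timezone
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.mydataset1.users"  # hypothetical table id

# Create a table with an explicit schema (equivalent to a tables.insert API call)
table = bigquery.Table(table_id, schema=[
    bigquery.SchemaField("user_id", "INTEGER"),
    bigquery.SchemaField("user_name", "STRING"),
])
table = client.create_table(table)

# Update the table's description and expiration time
table.description = "Registered users"
table.expires = datetime.now(timezone.utc) + timedelta(days=30)
client.update_table(table, ["description", "expires"])

# Copy the table, then delete the original
client.copy_table(table_id, "my-project.mydataset1.users_copy").result()
client.delete_table(table_id)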

What is Partitioning & Clustering?

Query performance is paramount in BigQuery, and one of the key features that enables it is table partitioning. Dividing large tables into smaller partitions is key to improving query performance and fetching results quicker.

Partitioning can be done by one of the following methods:

· Ingestion time — based on the time the data arrives in BigQuery.

Ex: If data is loaded into BigQuery on a daily basis, a partition is created for every day. A pseudo-column called “_PARTITIONTIME” is created, which records the date of the load. All partitions share the same table schema, and BigQuery provides an option to change the schema later.

· Date/Timestamp — based on a date or timestamp column of the table.

Ex: A partition can be created based on the timestamp at which a row is created. In a user table, every user who registers has a registration timestamp, so we can partition on that timestamp at daily or even hourly granularity.

· Integer range — based on the range of values in an integer column.

Perhaps the simplest one, where partitions are defined on the basis of IDs. Ex: Customer IDs 1 to 100 in partition 1, and so on…

In case we do not want to choose a column to partition on, we can use clustering, which sorts and co-locates the data within the table based on the chosen columns. Both improve query performance, but partitioning is more explicit and of the user’s choice. A short client-library sketch of both is shown below.
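A minimal sketch with the Python client library, assuming a hypothetical user table partitioned daily on a registration timestamp and clustered by a location column:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table for illustration
table = bigquery.Table(
    "my-project.mydataset1.users",
    schema=[
        bigquery.SchemaField("user_id", "INTEGER"),
        bigquery.SchemaField("user_location", "STRING"),
        bigquery.SchemaField("registration_ts", "TIMESTAMP"),
    ],
)

# Partition on the registration timestamp at daily granularity
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="registration_ts",
)

# Cluster (sort and co-locate) the data within each partition by location
table.clustering_fields = ["user_location"]

client.create_table(table)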

Let us load a dataset which is a collection of some of the tweets related to Covid19 and do a short analysis.

Step 1: Create a project on Google Cloud, e.g. “Test Project”.

Step 2: Enable the BigQuery API to allow calls from the client libraries.

The BigQuery sandbox lets users store up to 10 GB of data and query up to 1 TB per month, free of cost, without enabling a billing account.

Step 3: Install the Cloud SDK to run the commands from your local machine. Alternatively, you can log in to the Google Cloud console and click the Cloud Shell icon adjacent to the search bar in the console header.

In the Cloud SDK, type gcloud init to initialise.

Follow the instructions, such as selecting the credentials with which you choose to log in. The credentials are normally the Gmail credentials associated with the Google Cloud project. Most of it is self-explanatory.

In case you choose to operate via Google Cloud Shell, the following commands will set it up.

gcloud config set project <project-id>
sudo apt-get update
sudo apt-get install virtualenv
virtualenv -p python3 venv
source venv/bin/activate

The above commands set the project, install a virtual environment, and activate it.

Once the venv is activated, install Google Cloud’s BigQuery library into the Cloud Shell VM’s virtual environment.

pip install --upgrade google-cloud-bigquery

On the Cloud Console, navigate to IAM & Admin > Service Accounts > Create service account, and download the JSON key file to your local machine. Then export it:

export GOOGLE_APPLICATION_CREDENTIALS='PATH TO JSON FILE'

This allows us to operate BigQuery via the client libraries as well as via native Cloud Shell commands.
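As a quick sanity check (a minimal sketch that assumes the exported key file is valid), you can construct a client and print the datasets it can see in the project:

from google.cloud import bigquery

# Picks up the key file via GOOGLE_APPLICATION_CREDENTIALS
client = bigquery.Client()

for dataset in client.list_datasets():
    print(dataset.dataset_id)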

Let us now create a Google Cloud Storage bucket to store the CSV file. Via the Cloud Console, you can simply navigate to Storage > Create Bucket; it is as simple as creating a folder on Google Drive and uploading files to it. You can also do the same with shell commands.

For now, to create the GCS bucket and the BigQuery dataset and tables, we will use the command-line interface/Cloud Shell: the Cloud Console UI makes it almost too simple, while the client libraries/API need additional setup that makes it a little more complex. Hence the middle path:

gsutil mb -b on -l us-east1 gs://<bucket-name>

The above command is split as below:

gsutil — A command used for accessing Google Cloud Storage via Cloud Shell as well as Cloud SDK

mb — The sub-command to create (make) a bucket

-b on — Turns on uniform bucket-level access for the bucket

-l <location name> — Creates the bucket in a specific location of choice. This effectively means that the data will be stored on servers in that location.

gs://<bucket-name> — The bucket name we wish to assign

Once the bucket is created, we can upload the data, which is in CSV format. Keeping the data in a Google Cloud Storage bucket gives very high availability and protects it from loss, as the data is replicated redundantly, as we discussed in one of the previous posts.

gsutil cp <path of csv on local> gs://<bucket-name>

In simple words, the cp command copies the file from the local path we mention and places it inside the bucket we created previously.
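If you prefer the client libraries here as well, an equivalent upload would look roughly like the sketch below; it assumes the google-cloud-storage package is installed, and the local path is a placeholder:

from google.cloud import storage

client = storage.Client()

# Placeholder names for illustration
bucket = client.bucket("<bucket-name>")
blob = bucket.blob("covid19_tweets.csv")

# Upload the local CSV into the bucket
blob.upload_from_filename("/path/to/covid19_tweets.csv")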

BigQuery is democratisation in its true sense: it makes its features easily understandable and accessible to everyone, irrespective of their technical acumen.

Now we have the file we want to load into BigQuery, available in the Google Cloud Storage bucket. Next task is to load it into BigQuery for further analysis.

Similar to “gsutil” for Google Cloud Storage, the shell command for BigQuery is “bq”.

We can perform “bq” commands either via Cloud SDK or Cloud Shell in the console.

BigQuery datasets are created in a particular geography, just as we create VMs in a particular location.

bq --location=<location-name> mk -d <dataset-name>

We can add additional parameters to the command, such as a default table expiration, a default partition expiration, or a description for the dataset. Ex:

bq --location=location mk \
--dataset \
--default_table_expiration integer1 \
--default_partition_expiration integer2 \
--description description \
project_id:dataset
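Roughly the same dataset creation via the Python client library would look like the sketch below; the expiration values and description are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder dataset id and settings
dataset = bigquery.Dataset("<project-id>.mydataset1")
dataset.location = "US"
dataset.default_table_expiration_ms = 30 * 24 * 60 * 60 * 1000      # 30 days
dataset.default_partition_expiration_ms = 7 * 24 * 60 * 60 * 1000   # 7 days
dataset.description = "Dataset for the covid19_tweets exercise"

client.create_dataset(dataset)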

After creating a dataset, we create a table in BigQuery. The command below creates a table called covid19_tweets in the dataset called mydataset1. We are creating an empty table, without any schema for now.

bq mk -t mydataset1.covid19_tweets

We will now load the data using another bq command. We also need to create the schema while loading the data; for that we use the --autodetect flag with bq load. We give the Cloud Storage path and the destination table details in the command.

bq load --autodetect  --source_format=CSV mydataset1.covid19_tweets gs://srikrishna1234/covid19_tweets.csv

Once run, we see messages like the below on the terminal, indicating that the job is in progress and finally DONE.

Waiting on bqjob_r716b7160069a1bce_00000173ede404b9_1 ... (0s) Current status: RUNNING
Waiting on bqjob_r716b7160069a1bce_00000173ede404b9_1 ... (1s) Current status: RUNNING
...
Waiting on bqjob_r716b7160069a1bce_00000173ede404b9_1 ... (11s) Current status: DONE
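For completeness, the same load could also be performed through the Python client library, roughly as sketched below, using the Cloud Storage path from earlier and assuming the CSV has a header row:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.autodetect = True        # let BigQuery infer the schema
job_config.skip_leading_rows = 1    # treat the first row as a header

load_job = client.load_table_from_uri(
    "gs://srikrishna1234/covid19_tweets.csv",
    "mydataset1.covid19_tweets",
    job_config=job_config,
)
load_job.result()  # waits for the load job to complete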

So far, we have mostly used the Console UI and the CLI to operate BigQuery. Now that the data is loaded, let us run a query to analyse a scenario using Python’s client libraries.

Let us find out from the dataset how many users from the United Kingdom have made controversial tweets, or rather used the word “controversial” in their tweets.

The query would look like this:

with X as (
  select user_name, user_location, user_description
  from `<project-name>.mydataset1.covid19_tweets`
  where user_description like '%controversial%'
)
select count(user_location) as count
from X
where user_location in ('United Kingdom')

A simple Python program with the SQL query that hits BigQuery and extracts the information for us:

from google.cloud import bigquery
def query_tweeples():
    client = bigquery.Client()
    query_job = client.query("""
        with X as (
          select user_name, user_location, user_description
          from `canvas-seat-267117.mydataset1.covid19_tweets`
          where user_description like '%controversial%'
        )
        select count(user_location) as count
        from X
        where user_location in ('United Kingdom')
    """)
    results = query_job.result()  # Waits for the job to complete.
    for row in results:
        print("{} people".format(row.count))

if __name__ == "__main__":
    query_tweeples()

Save the above code as a .py file. Ex: bigquery.py

Remember, we have already created a service account, downloaded the JSON key file, exported it as the application credentials, and installed the google-cloud-bigquery library. Time to run the Python code.

cd <path where the python file is placed>
python bigquery.py

You will find the output as ‘43 people’, which means there are 43 people in the dataset who are in the United Kingdom and had the word ‘controversial’ in their tweets.

In case you wish to use a bq command for the above activity, use the one below:

bq query --use_legacy_sql=false --destination_table mydataset1.<destination-table> '<query>'

So, in this part of the series, we learnt how to interact with BigQuery via the console and the command-line interface using bq commands, and also performed some actions using the client libraries.

Hope you are enjoying the series. In the next part, we will learn about streaming data into BigQuery.
