
Cracking BigQuery and Cloud SQL

Ryle Zhou
Feb 21 · 13 min read

SQL (Structured Query Language) is a standard language for data operations that allows you to ask questions of, and get insights from, structured datasets. It's commonly used in database management, for tasks ranging from writing transaction records into relational databases to petabyte-scale data analysis.

Below, you will learn how to:

  • Distinguish databases from tables and projects.
  • Use the SELECT, FROM, and WHERE keywords to construct simple queries.
  • Identify the different components and hierarchies within the BigQuery console.
  • Load databases and tables into BigQuery.
  • Execute simple queries on tables.
  • Learn about the COUNT, GROUP BY, AS, and ORDER BY keywords.
  • Execute and chain the above commands to pull meaningful data from datasets.
  • Export a subset of data into a CSV file and store that file into a new Cloud Storage bucket.
  • Create a new Cloud SQL instance and load your exported CSV file as a new table.
  • Run CREATE DATABASE, CREATE TABLE, DELETE, INSERT INTO, and UNION queries in Cloud SQL.

The Basics of SQL

Databases and Tables

As mentioned earlier, SQL allows you to get information from "structured datasets". Structured datasets have clear rules and formatting and are often organized into tables, that is, data formatted in rows and columns.

An example of unstructured data would be an image file. Unstructured data cannot be queried with SQL and cannot be stored in BigQuery datasets or tables (at least not natively). To work with image data, for instance, you would use a service like Cloud Vision, perhaps through its API directly.

The following is an example of a structured dataset — a simple table:
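
(The names and values below are illustrative examples.)

User     Price      Shipped
Sean     $35.00     Yes
Rocky    $50.00     No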

If you’ve had experience with Google Sheets, then the above should look quite similar. As we see, the table has columns for User, Price, and Shipped and two rows that are composed of filled in column values.

A database is essentially a collection of one or more tables. SQL lets you manage whole structured databases, but quite often you will be running queries on one table or a few tables joined together, not on whole databases.

SELECT and FROM

SQL reads much like plain English, and before running a query it's always helpful to first figure out what question you want to ask your data (unless you're just exploring for fun).

SQL has predefined keywords which you use to translate your question into pseudo-English SQL syntax so you can get the database engine to return the answer you want.

The most essential keywords are SELECT and FROM:

  • Use SELECT to specify what fields you want to pull from your dataset.
  • Use FROM to specify the table or tables you want to pull your data from.

An example may help. Assume that we have a table named example_table, which has columns USER, PRICE, and SHIPPED (the same kind of data as the table shown above).

And let’s say that we want to just pull the data that’s found in the USER column. We can do this by running the following query that uses SELECT and FROM:

SELECT USER FROM example_table

If we executed the above command, we would select all the names from the USER column that are found in example_table.

You can also select multiple columns with the SQL SELECT keyword. Say that you want to pull the data that's found in the USER and SHIPPED columns. To do this, modify the previous query by adding another column value to our SELECT query (making sure it's separated by a comma!):

SELECT USER, SHIPPED FROM example_table

And just like that you’ve covered two fundamental SQL keywords! Now to make things a bit more interesting.

WHERE

The WHERE keyword is another SQL command that filters tables for specific column values. Say that you want to pull the names from example_table whose packages were shipped. You can supplement the query with a WHERE, like the following:

SELECT USER FROM example_table WHERE SHIPPED='YES'

Running the above returns all USERs whose packages have been shipped.
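
Assuming example_table holds the same illustrative rows shown earlier, only Sean's package was marked as shipped, so the query would return a one-column, one-row result:

USER
Sean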

Now that you have a baseline understanding of SQL’s core keywords, apply what you’ve learned by running these types of queries in the BigQuery console.

Exploring the BigQuery Console

The BigQuery paradigm

BigQuery is a fully-managed, petabyte-scale data warehouse that runs on Google Cloud. Data analysts and data scientists can quickly query and filter large datasets, aggregate results, and perform complex operations without having to worry about setting up and managing servers. It comes in the form of a command-line tool (preinstalled in Cloud Shell) and a web console, both ready for managing and querying data housed in Google Cloud projects.

Open BigQuery Console

In the Google Cloud Console, select Navigation menu > BigQuery

Uploading queryable data

In this section you pull some public data into your project so you can practice running SQL commands in BigQuery.

Click the + ADD DATA link, then select Explore public datasets.

In the search bar, enter “london”, then select the London Bicycle Hires tile, then View Dataset.

A new tab will open, and you will now have a new project called bigquery-public-data added to the Explorer panel.

Click on bigquery-public-data > london_bicycles > cycle_hire. You now have data that follows the BigQuery paradigm:

  • Google Cloud Project → bigquery-public-data
  • Dataset → london_bicycles
  • Table → cycle_hire

Now that you are in the cycle_hire table, click the Preview tab in the center of the console.

Inspect the columns and values populated in the rows. You are now ready to run some SQL queries on the cycle_hire table.

Running SELECT, FROM, and WHERE in BigQuery

You now have a basic understanding of SQL querying keywords and the BigQuery data paradigm, plus some data to work with. Run some SQL commands using this service.

If you look at the bottom right corner of the console, you will notice that there are 24,369,201 rows of data, or individual bikeshare trips taken in London between 2015 and 2017 (not a small amount by any means!)

Now take note of the seventh column key: end_station_name, which specifies the end destination of bikeshare rides. Before we get too deep, let's first run a simple query to isolate the end_station_name column. Copy and paste the following command into the Query editor:

SELECT end_station_name FROM `bigquery-public-data.london_bicycles.cycle_hire`;

Then click Run.

After about 20 seconds, the query should return 24,369,201 rows containing the single column you queried for: end_station_name.

Why don’t you find out how many bike trips were 20 minutes or longer?

Clear the query from the editor, then run the following query that utilizes the WHERE keyword:

SELECT * FROM `bigquery-public-data.london_bicycles.cycle_hire` WHERE duration>=1200;

This query may take a minute or so to run.

SELECT * returns all column values from the table. Duration is measured in seconds, which is why you used the value 1200 (60 * 20).

If you look in the bottom right corner you see that 7,334,890 rows were returned. As a fraction of the total (7334890/24369201), this means that ~30% of London bikeshare rides lasted 20 minutes or longer (they’re in it for the long haul!)

More SQL Keywords: GROUP BY, COUNT, AS, and ORDER BY

GROUP BY

The GROUP BY keyword aggregates result-set rows that share common criteria (e.g. a column value) and returns all of the unique entries found for those criteria.

This is a useful keyword for figuring out categorical information on tables. To get a better picture of what this keyword does, clear the query from the editor, then copy and paste the following command:

SELECT start_station_name FROM `bigquery-public-data.london_bicycles.cycle_hire` GROUP BY start_station_name;

Without the GROUP BY, the query would have returned the full 24,369,201 rows. GROUP BY will output the unique (non-duplicate) column values found in the table. You can see this for yourself by looking in the bottom right corner. You will see 880 rows, meaning there are 880 distinct London bikeshare starting points.
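
You can apply the same idea to the other end of each trip. For example, grouping by end_station_name (an illustrative variation, not one of the lab's listed queries) returns the distinct end stations:

SELECT end_station_name FROM `bigquery-public-data.london_bicycles.cycle_hire` GROUP BY end_station_name;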

COUNT

The COUNT() function will return the number of rows that share the same criteria (e.g. column value). This can be very useful in tandem with a GROUP BY.

Add the COUNT function to the previous query to figure out how many rides begin at each starting point. Clear the query from the editor, then copy and paste the following command and click Run:

SELECT start_station_name, COUNT(*) FROM `bigquery-public-data.london_bicycles.cycle_hire` GROUP BY start_station_name;

This shows how many bikeshare rides begin at each starting location.
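
COUNT also pairs naturally with the WHERE filter from earlier. For example, a query along these lines (an illustrative variation, not one of the lab's steps) would return the number of rides lasting 20 minutes or longer as a single value:

SELECT COUNT(*) FROM `bigquery-public-data.london_bicycles.cycle_hire` WHERE duration>=1200;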

AS

SQL also has an AS keyword, which creates an alias of a table or column. An alias is a new name that's given to the returned column or table—whatever AS specifies.

Add an AS keyword to the last query you ran to see this in action. Clear the query from the editor, then copy and paste the following command:

SELECT start_station_name, COUNT(*) AS num_starts FROM `bigquery-public-data.london_bicycles.cycle_hire` GROUP BY start_station_name;

The COUNT(*) column in the returned table is now set to the alias name num_starts. This is a handy keyword to use, especially if you are dealing with large sets of data; forgetting what an ambiguous table or column name refers to happens more often than you think!
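
AS works on tables as well as columns. For instance, you could give the long fully qualified table name a short alias (an illustrative variation, not part of the lab):

SELECT hires.start_station_name, COUNT(*) AS num_starts FROM `bigquery-public-data.london_bicycles.cycle_hire` AS hires GROUP BY hires.start_station_name;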

ORDER BY

The ORDER BY keyword sorts the data returned by a query in ascending or descending order based on a specified criterion or column value. You will add this keyword to the previous query to do the following:

  • Return a table that contains the number of bikeshare rides that begin in each starting station, organized alphabetically by the starting station.
  • Return a table that contains the number of bikeshare rides that begin in each starting station, organized numerically from lowest to highest.
  • Return a table that contains the number of bikeshare rides that begin in each starting station, organized numerically from highest to lowest.

Each of the commands below is a separate query. For each command, clear the Query editor, copy and paste the command into the Query editor, and then click Run. Examine the results.

SELECT start_station_name, COUNT(*) AS num FROM `bigquery-public-data.london_bicycles.cycle_hire` GROUP BY start_station_name ORDER BY start_station_name;
SELECT start_station_name, COUNT(*) AS num FROM `bigquery-public-data.london_bicycles.cycle_hire` GROUP BY start_station_name ORDER BY num;
SELECT start_station_name, COUNT(*) AS num FROM `bigquery-public-data.london_bicycles.cycle_hire` GROUP BY start_station_name ORDER BY num DESC;

Working with Cloud SQL

Exporting queries as CSV files

Cloud SQL is a fully-managed database service that makes it easy to set up, maintain, manage, and administer your relational PostgreSQL and MySQL databases in the cloud. There are two formats of data accepted by Cloud SQL: dump files (.sql) or CSV files (.csv). You will learn how to export subsets of the cycle_hire table into CSV files and upload them to Cloud Storage as an intermediate location.

Back in the BigQuery Console, this should have been the last command that you ran:

SELECT start_station_name, COUNT(*) AS num FROM `bigquery-public-data.london_bicycles.cycle_hire` GROUP BY start_station_name ORDER BY num DESC;

In the Query Results section, click SAVE RESULTS > CSV (local file) > SAVE. This initiates a download that saves the query results as a CSV file. Note the location and the name of this downloaded file; you will need it soon.

Click COMPOSE NEW QUERY, then copy and run the following in the query editor:

SELECT end_station_name, COUNT(*) AS num FROM `bigquery-public-data.london_bicycles.cycle_hire` GROUP BY end_station_name ORDER BY num DESC;

When the query finishes, click SAVE RESULTS > CSV (local file) > SAVE again so you also have a local CSV file containing the end_station_name data.

Upload CSV files to Cloud Storage

Go to the Cloud Console where you’ll create a storage bucket where you can upload the files you just created.

Select Navigation menu > Storage > Browser, and then click Create bucket. Give the bucket a globally unique name and click Create.

Click Upload files and select the CSV that contains start_station_name data. Then click Open. Repeat this for the end_station_name data.

Rename your start_station_name file by clicking the three dots on the far side of the file and clicking Rename. Name the file start_station_data.csv.

Rename your end_station_name file by clicking the three dots on the far side of the file and clicking Rename. Name the file end_station_data.csv.

Create a Cloud SQL instance

In the console, select Navigation menu > SQL.

Click Create Instance.

From here, you will be prompted to choose a database engine. Select MySQL.

Now enter a name for your instance (like "qwiklabs-demo") and a secure password in the Root password field (remember it!), then click Create.

New Queries in Cloud SQL

CREATE keyword (databases and tables)

Now that you have a Cloud SQL instance up and running, create a database inside of it using the Cloud Shell Command Line.

Activate Cloud Shell

Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5GB home directory and runs on the Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources.

In the Cloud Console, in the top right toolbar, click the Activate Cloud Shell button.

Run the following command in Cloud Shell to connect to your SQL instance, replacing qwiklabs-demo if you used a different name for your instance:

gcloud sql connect qwiklabs-demo --user=root

It may take a minute to connect to your instance.

When prompted, enter the root password you set for the instance.

A Cloud SQL instance comes with pre-configured databases, but you will create your own to store the London bikeshare data.

Run the following command at the MySQL server prompt to create a database called bike:

CREATE DATABASE bike;

You should receive the following output:

Query OK, 1 row affected (0.05 sec)
MySQL [(none)]>
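
If you want to confirm that the database exists, MySQL's SHOW DATABASES statement lists every database on the instance (an optional check, not part of the lab steps):

SHOW DATABASES;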

Make a table inside of the bike database by running the following command:

USE bike;
CREATE TABLE london1 (start_station_name VARCHAR(255), num INT);

This statement uses the CREATE keyword again, but this time with the TABLE clause to specify that it should build a table instead of a database. The USE keyword specifies the database that you want to connect to. You now have a table named "london1" that contains two columns, "start_station_name" and "num". VARCHAR(255) specifies a variable-length string column that can hold up to 255 characters, and INT specifies a column of type integer.
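
To double-check the schema you just defined, you can use MySQL's DESCRIBE statement (optional):

DESCRIBE london1;

The output lists each column with its type, which should match the VARCHAR(255) and INT definitions above.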

Create another table named “london2” by running the following command:

USE bike;
CREATE TABLE london2 (end_station_name VARCHAR(255), num INT);

Now confirm that your empty tables were created. Run the following commands at the MySQL server prompt:

SELECT * FROM london1;
SELECT * FROM london2;

You should receive the following output for both commands:

Empty set (0.04 sec)

Upload CSV files to tables

Return to the Cloud SQL console. You will now upload the start_station_name and end_station_name CSV files into your newly created london1 and london2 tables.

  1. In your Cloud SQL instance page, click IMPORT.
  2. In the Cloud Storage file field, click Browse, and then click the arrow opposite your bucket name, and then click start_station_data.csv. Click Select.
  3. Select CSV as the File format.
  4. Select the bike database and type in "london1" as your table.
  5. Click Import.
  6. Do the same for the other CSV file.
  7. In your Cloud SQL instance page, click IMPORT.
  8. In the Cloud Storage file field, click Browse, and then click the arrow opposite your bucket name, and then click end_station_data.csv. Click Select.
  9. Select CSV as the File format.
  10. Select the bike database and type in "london2" as your table.
  11. Click Import.

You should now have both CSV files uploaded to tables in the bike database.

Return to your Cloud Shell session and run the following command at the MySQL server prompt to inspect the contents of london1:

SELECT * FROM london1;

You should receive 881 lines of output: one row for each of the 880 unique station names, plus one extra row created from the CSV header (more on that shortly).

Run the following command to make sure that london2 has been populated:

SELECT * FROM london2;

You should receive 883 lines of output: again, one row per unique station name plus one extra row created from the CSV header.
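
If you would rather not scroll through hundreds of rows just to spot-check the data, MySQL's LIMIT clause (not otherwise covered in this article) lets you preview a handful of rows:

SELECT * FROM london2 LIMIT 10;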

DELETE keyword

Here are a couple more SQL keywords that help us with data management. The first is the DELETE keyword.

Run the following commands in your MySQL session to delete the first row of the london1 and london2 tables:

DELETE FROM london1 WHERE num=0;
DELETE FROM london2 WHERE num=0;

You should receive the following output after running both commands:

Query OK, 1 row affected (0.04 sec)

The rows deleted were the column headers from the CSV files. The DELETE keyword does not remove the "first row" of the file per se; it removes every row of the table where the specified column (in this case "num") holds a specified value (in this case "0"). If you run the SELECT * FROM london1; and SELECT * FROM london2; queries and scroll to the top of the results, you will see that those rows no longer exist.
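
As a general habit, running the matching SELECT before a DELETE lets you see exactly which rows will be removed; for the deletes above, that check (optional, not part of the lab) would have looked like this:

SELECT * FROM london1 WHERE num=0;
SELECT * FROM london2 WHERE num=0;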

INSERT INTO keyword

You can also insert values into tables with the INSERT INTO keyword. Run the following command to insert a new row into london1, which sets start_station_name to "test destination" and num to "1":

INSERT INTO london1 (start_station_name, num) VALUES ("test destination", 1);

The INSERT INTO keyword requires a table (london1) and will create a new row with the columns specified by the terms in the first set of parentheses (in this case "start_station_name" and "num"). Whatever comes after the VALUES clause will be inserted as the values of the new row.

You should receive the following output:

Query OK, 1 row affected (0.05 sec)

If you run the query SELECT * FROM london1; you will see an additional row added at the bottom of the "london1" table.
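
You can also look up just the new row directly instead of scanning the whole table (an optional check):

SELECT * FROM london1 WHERE start_station_name = "test destination";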

UNION keyword

The last SQL keyword that you'll learn about is UNION. This keyword combines the output of two or more SELECT queries into a single result set. You will use UNION to combine subsets of the "london1" and "london2" tables.

The following chained query pulls specific data from both tables and combines them with the UNION operator.

Run the following command at the MySQL server prompt:

SELECT start_station_name AS top_stations, num FROM london1 WHERE num>100000
UNION
SELECT end_station_name, num FROM london2 WHERE num>100000
ORDER BY top_stations DESC;

The first SELECT query selects the two columns from the "london1" table and creates an alias for "start_station_name", which gets set to "top_stations". It uses the WHERE keyword to only pull station names where more than 100,000 rides start their journey.

The second SELECT query selects the two columns from the "london2" table and uses the WHERE keyword to only pull station names where more than 100,000 rides end their journey.

The UNION keyword in between combines the output of these queries by assimilating the "london2" data with "london1". Since "london1" is the first table in the union, the column names that take precedence in the result are "top_stations" and "num".

ORDER BY sorts the final, unioned table by the "top_stations" column in descending (reverse alphabetical) order.

As you see, 13/14 stations share the top spots for rideshare starting and ending points. With some basic SQL keywords you were able to query a sizable dataset, which returned data points and answers to specific questions.
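
If you would rather rank the combined stations by ride volume instead of by name, you can sort on the num column instead; this variation (not part of the lab) uses the same UNION:

SELECT start_station_name AS top_stations, num FROM london1 WHERE num>100000
UNION
SELECT end_station_name, num FROM london2 WHERE num>100000
ORDER BY num DESC;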
