05# SQL for Data Science

Andre Vianna
My Data Science Journey
5 min readNov 21, 2021

--

Dataset: Crypto BitCoin, Chicago Taxi Trips

Example: How many Bitcoin transactions are made per month?

We’re going to use a CTE to find out how many Bitcoin transactions were made each day for the entire timespan of a bitcoin transaction dataset.

We’ll investigate the transactions table. Here is a view of the first few rows. (The corresponding code is hidden, but you can un-hide it by clicking on the "Code" button below.)

Photo by Pierre Borthiry on Unsplash

Since the block_timestamp column contains the date of each transaction in DATETIME format, we'll convert these into DATE format using the DATE() command.

We do that using a CTE, and then the next part of the query counts the number of transactions for each date and sorts the table so that earlier dates appear first.

First BitCoin Transaction: 2009–01–03

Since they’re returned sorted, we can easily plot the raw results to show us the number of Bitcoin transactions per day over the whole timespan of this dataset.

Chicago Taxi Trips

Photo by Antonio Janeski on Unsplash

Introduction

You are getting to the point where you can own an analysis from beginning to end. So you’ll do more data exploration in this exercise than you’ve done before. Before you get started, run the following set-up code as usual.

Exercises

You are curious how much slower traffic moves when traffic volume is high. This involves a few steps.

1) Find the data

Before you can access the data, you need to find the table name with the data.

Hint: Tab completion is helpful whenever you can’t remember a command. Type client. and then hit the tab key. Don't forget the period before hitting tab.

2) Peek at the data

Use the next code cell to peek at the top few rows of the data. Inspect the data and see if any issues with data quality are immediately obvious.

3) Determine when this data is from

If the data is sufficiently old, we might be careful before assuming the data is still relevant to traffic patterns today. Write a query that counts the number of trips in each year.

Your results should have two columns:

  • year - the year of the trips
  • num_trips - the number of trips in that year

Hints:

  • When using GROUP BY and ORDER BY, you should refer to the columns by the alias year that you set at the top of the SELECT query.
  • The SQL code to SELECT the year from trip_start_timestamp is SELECT EXTRACT(YEAR FROM trip_start_timestamp)
  • The FROM field can be a little tricky until you are used to it. The format is:
  1. A backick (the symbol `).
  2. The project name. In this case it is bigquery-public-data.
  3. A period.
  4. The dataset name. In this case, it is chicago_taxi_trips.
  5. A period.
  6. The table name. You used this as your answer in 1) Find the data.
  7. A backtick (the symbol `).

4) Dive slightly deeper

You’d like to take a closer look at rides from 2017. Copy the query you used above in rides_per_year_query into the cell below for rides_per_month_query. Then modify it in two ways:

  1. Use a WHERE clause to limit the query to data from 2017.
  2. Modify the query to extract the month rather than the year.

5) Write the query

It’s time to step up the sophistication of your queries. Write a query that shows, for each hour of the day in the dataset, the corresponding number of trips and average speed.

Your results should have three columns:

  • hour_of_day - sort by this column, which holds the result of extracting the hour from trip_start_timestamp.
  • num_trips - the count of the total number of trips in each hour of the day (e.g. how many trips were started between 6AM and 7AM, independent of which day it occurred on).
  • avg_mph - the average speed, measured in miles per hour, for trips that started in that hour of the day. Average speed in miles per hour is calculated as 3600 * SUM(trip_miles) / SUM(trip_seconds). (The value 3600 is used to convert from seconds to hours.)

Restrict your query to data meeting the following criteria:

  • a trip_start_timestamp between 2017-01-01 and 2017-07-01
  • trip_seconds > 0 and trip_miles > 0

You will use a common table expression (CTE) to select just the relevant rides. Because this dataset is very big, this CTE should select only the columns you’ll need to create the final output (though you won’t actually create those in the CTE — instead you’ll create those in the later SELECT statement below the CTE).

This is a much harder query than anything you’ve written so far. Good luck!

--

--

Andre Vianna
My Data Science Journey

Software Engineer & Data Scientist #ESG #Vision2030 #Blockchain #DataScience #iot #bigdata #analytics #machinelearning #deeplearning #dataviz