TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial…

SQL from A to Z: Part 1

Stephanie A.
TDS Archive
Published in
6 min readFeb 26, 2021

--

Why learn SQL?

As data scientists, we utilize data to provide insights and recommendations to inform product strategy, growth and marketing, operations, and many other areas of the business. In many cases, data is stored in a relational database format, uses a structure that allows access to data points that are related to one another (source: AWS). SQL (Structured Query Language) is a programming language used to access relational databases. As data scientists, we use SQL to execute queries against a database to retrieve data, insert, update, and delete records from the database, and create new databases and tables. The syntax may differ slightly depending on which relational database management system (e.g., using a semicolon at the end of a query), but the general logic should hold across.

The data is stored in the database as objects called tables, which is a collection of data entries in the form of columns and rows. Columns, or also called as fields, contain the column name and its properties. A row, also called a record, contains data for each column.

Photo by Austin Distel on Unsplash

Example

Let’s use a simple example to go through the most commonly used SQL queries. Suppose there’s a company that serves content to users through an app similar to Facebook, Reddit, Pinterest, etc.

When a user logs into the app, the company serves content to the users. This is logged in the content_insertion table. This table stores what content was served to whom and the timestamp.

We also have a table called user_content_activity that stores all the data when a user makes an action with the content. It includes the user id, content id, the action type and the timestamp the user made the action.

Lastly we have two tables that give the properties of the users and contents, named users and contents respectively. The users table stores the country the user lives and the date the user joined (or signed up) for the app. The contents table stores the content type.

Table name: user_content_activity

+--------+-----------+--------+---------------------+
| userid | contentid | action | timestamp |
+--------+-----------+--------+---------------------+
|1 | 5 | view | 2021-01-05 10:30:20 |
|1 | 5 | click | 2021-01-05 10:30:55 |
|2 | 21 | view | 2021-01-06 03:12:25 |
|3 | 100 | view | 2021–02–04 06:25:12 |
+--------+-----------+--------+---------------------+

Table name: users

+--------+---------------+------------+
| userid | country | join_date |
+--------+---------------+------------+
|1 | United States | 2018-05-06 |
|2 | Italy | 2019-12-31 |
|3 | Japan | 2020-03-05 |
|4 | United States | 2021–01-26 |
+--------+---------------+------------+

Table name: contents

+-----------+-----------+
| contentid | type |
+-----------+-----------+
|1 | video |
|2 | photo |
|3 | text |
|4 | photo |
+-----------+-----------+

The Basic Form

SELECT {column1, column2, …} 
FROM {tablename}

This is the very basic form of retrieving information from the database. When writing queries, the first question to ask is ‘which table(s) have the information I need?’ And then, ‘which column(s) have the appropriate information?’ For example,

  • Q: Give me the list of countries that our users are from
    > SELECT country FROM users
  • Q: Give me all the users, contents and the actions the users took
    > SELECT userid, contentid, action FROM user_content_activity
  • Q: Give me all the data from the content_insertion table
    > SELECT userid, contentid, timestamp FROM content_insertion
    > SELECT * FROM content_insertion

The symbol * is an easy way to retrieve all columns from the table instead of listing out all the columns. This is particularly useful when you want to see some data from the table. To do this, we recommend using SELECT * FROM content_insertion LIMIT 100 which will give only the first 100 rows of the table. “LIMIT” always come last in the query.

Adding Conditions

SELECT {column1, column2, …} 
FROM {tablename}
WHERE {conditional statements}

We add the “WHERE” clause when we want to add some conditions to the data we are retrieving. In many cases, we will need to use the well-known math symbols =, !=, >, <, >=, <= or logics such as AND, OR, NOT, as well as many others. We’ll go through the most common ones in the following examples.

  • Q: Give me the list of users from the US
    > SELECT userid FROM users WHERE country = ‘United States’

Note that we used ‘United States’ instead of ‘US’ as asked in the question. This is because when we look at the table, US is stored as United States. A great clarifying question should be “Is the United States stored as ‘US’, ‘United States’, or both?”

We used the symbol “=” to denote country exactly matching the symbol on the right. If the data was stored as ‘the United States’ that row will not be retrieved as the string does not match exactly. If we want to retrieve both, then using the clause WHERE country LIKE ‘%United States’ can solve the problem (read more about wildcard characters).

  • Q: Give me the unique list of countries of users who joined in 2018
    > SELECT DISTINCT userid FROM users WHERE join_date BETWEEN ‘2018–01–01’ AND ‘2018–12–31’; or
    > SELECT DISTINCT userid FROM users WHERE join_date >= ‘2018–01–01’ AND join_date <= ‘2018–12–31’

DISTINCT is a useful syntax to retrieve unique values of a column.

There are many date functions, but the most important ones to know are CURRENT_DATE, DATE_ADD, DATE_SUB, DATE_DIFF, DATE_FORMAT. For example,
> SELECT userid FROM users WHERE join_date BETWEEN CURRENT_DATE AND DATE_SUB(‘day’, -6, CURRENT_DATE) gives you users who joined between a week ago and today.

  • Q: Give me the list of contents that are in the form of photos and videos
    > SELECT contentid FROM contents WHERE type IN (‘photo’, ‘video’)

IN or NOT IN are used when we want to match more than one string with the logical statement “OR”. The above query is the same to
> SELECT contentid FROM contents WHERE (type = ‘photo’ OR type = ‘video)
> SELECT contentid FROM contents WHERE type != ‘text’
> SELECT contentid FROM contents WHERE type NOT IN (‘text’)

Aggregate functions

Often we want to find an aggregate of a variable, such as sum, average, counts, min, max. The general syntax for using this is

SELECT column1, SUM(column2), COUNT(column3) 
FROM {tablename} WHERE {conditional statements}
GROUP BY column1

Note that we can have one or many aggregate functions, and if we have any conditional statements, that always come after FROM and before GROUP BY.

  • Q: How many users are there in each country?
    > SELECT country, COUNT(userid) FROM users GROUP BY country

It is generally a good habit to rename the aggregated results. Aliases are used to give a table, or a column in a table, a temporary name. For example, use COUNT(userid) AS number_users

If we believe there are duplicate rows, we can change this by
SELECT country, COUNT(DISTINCT userid) AS number_users FROM users GROUP BY country

  • Q: What’s the contentid that had the most clicks on 2020–01–03
    > SELECT contentid, COUNT(*) AS view FROM user_content_activity WHERE action = ‘click’ AND CAST(timestamp AS DATE) = ‘2020–01–03’ GROUP BY contentid
    ORDER BY COUNT(*) DESC LIMIT 1

ORDER BY is used to sort the result-set in ascending (default) or descending order (by specifying DESC)

  • Q: What are the contents that had more than 100 views
    > SELECT contentid, COUNT(*) AS views FROM user_content_activity WHERE action = ‘view’ GROUP BY contentid HAVING count(*) > 100

HAVING is a great one to use when you are extracting information conditional upon the aggregated results. It always comes after the GROUP BY statement.

So far, we have gone through the basic format of retrieving data using SQL. In Part 2, we will get into more fun stuff of taking advantage of the relational database.

--

--

TDS Archive
TDS Archive

Published in TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Stephanie A.
Stephanie A.

Written by Stephanie A.

Ph.D. | Data Science Manager | Mother