How to do everything in SQL

Giovanna Fernandes
8 min read · Aug 30, 2019


Most data scientists will agree that Pandas is an absolutely wonderful library. Not only powerful when it comes to data manipulation and analysis, but also great for visualizations — although I must admit I’m partial to Seaborn on top of Matplotlib for generating my pretty plots.

Then, we have SQL. The basics are easy enough: connect to the database, create a cursor, execute simple queries such as SELECT * FROM SimpleTable. One can quickly learn how to group, filter, join, and so on, but there's so much more to get out of SQL.

In this article, I'll walk you through a project, focusing on its SQL side. I hope you'll find it useful, and please feel free to comment if you spot anything that can be improved. We all learn better from each other!

What is SQL?

If you're very new to data science you might be asking this question, so before we dive into anything else, let's get a quick understanding of what SQL is. SQL is the third most common skill a data scientist is expected to possess, with 51% of job postings requiring it, as per this Glassdoor analysis. It stands for Structured Query Language, and it is a language designed for retrieving and managing data stored in a relational database management system. It's been around since the 1970s and is still very much used in the field of Data Science. OK, with that, let's get to some actual SQL work!

First Things First

  1. Know your beast

One may think that the first thing to do when getting data using SQL is to connect to the server. In my opinion, that is not the first step. A SQL database is a collection of tables, each similar to what you'd see in a CSV or Excel file, connected to one another through keys: a primary key in one table is referenced as a foreign key in another. It is paramount to know the database structure, or schema, before you run any query. Let's have a look at the database structure for this project, the Northwind dataset (which you can find here, if you'd like to explore it yourself).

We have many tables with a lot of interesting information for our pretend import-export business. For this project, I had a proposed hypothesis to test, and was also to come up with a few extra hypotheses on my own. One good step for coming up with insights is to get familiar with your data. For that, you can query just a few entries from each table and have a look at them. It is also common that column names are not exactly the same as the ones shown in the schema diagram, so it's good to check them to ensure your queries will work properly later on.

Now, the next step after having a look at your structure is to connect to the database. We'll use Python and SQLite in a Jupyter Notebook, but you may as well use other tools like SQL Workbench, MySQL, Postgres or others for your SQL queries, depending on the work and client.

2. Connect to database

We import the sqlite3 library and connect to the database. We will also need to import Pandas to wrap our data into dataframes we can keep working with in our Jupyter Notebook. The Northwind database has been downloaded and saved to disk.

import pandas as pd
import sqlite3
# connect to database
conn = sqlite3.connect('Northwind_small.sqlite')
cur = conn.cursor()
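Before writing any query, you can ask SQLite itself what the schema looks like, which helps with the "know your beast" step above. A minimal sketch, using an in-memory stand-in table since the Northwind file isn't bundled here (the OrderDetail columns are assumptions based on the tables discussed later):

```python
import sqlite3

# in-memory stand-in for the Northwind file, so this sketch is self-contained
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE OrderDetail (Id TEXT, OrderId INTEGER, ProductId INTEGER, Quantity INTEGER)")

# list every table in the database
cur.execute("SELECT name FROM sqlite_master WHERE type = 'table';")
tables = [row[0] for row in cur.fetchall()]
print(tables)   # ['OrderDetail']

# list the actual column names of one table, to catch schema-diagram mismatches
cur.execute("PRAGMA table_info(OrderDetail);")
columns = [row[1] for row in cur.fetchall()]
print(columns)  # ['Id', 'OrderId', 'ProductId', 'Quantity']
```

Against the real Northwind file, the same two queries would list all its tables and their true column names.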

Since my first task is to analyze whether discounts affect product quantity, I wanted to have a look at the OrderDetail table. If all you want is a peek at the data, you can limit how much you bring in, which is especially useful for large datasets. The asterisk after SELECT brings in all columns from the table, and LIMIT sets the number of rows you'd like to import into your working notebook.

# execute a query
cur.execute("""SELECT *
FROM OrderDetail
LIMIT 5;""")
# fetch data and wrap it into a dataframe
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
# check resulting dataframe
df.head()
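As a side note, the fetchall-plus-description pattern above can be collapsed into one call with pd.read_sql_query, which runs the query and builds the dataframe with column names in one step. A sketch against a tiny made-up stand-in table (the Id value is invented for illustration):

```python
import sqlite3
import pandas as pd

# stand-in database so the example runs on its own
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE OrderDetail (Id TEXT, Quantity INTEGER)")
conn.execute("INSERT INTO OrderDetail VALUES ('10248/11', 12)")
conn.commit()

# read_sql_query executes the SQL and returns a ready dataframe
df = pd.read_sql_query("SELECT * FROM OrderDetail LIMIT 5;", conn)
print(df.columns.tolist())   # ['Id', 'Quantity']
```

Either way works; the cursor approach just makes the fetch and column-naming steps explicit.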

We can notice that the column Id is not listed in our data structure; it combines OrderId and ProductId (which act as keys linking this table to the Order and Product tables). When querying a database I am particular in that I only want to bring in what I'll need, so this is important to know. For my initial hypothesis this table has everything I'll need, so a simple query listing only the columns I'll be working with will be sufficient. This brings us to the next important point.

3. Get only what you’ll need.

When you are writing a query, more is not necessarily better. Too much data that you won't need can not only hinder your machine's performance, but also overshadow your insights. For my second hypothesis I wanted to look at the impact of discounts on revenue for different regions, so I'd need to join some tables. I was also set on doing a lot of the data wrangling within the query as a challenge, so I set out to calculate the revenue, make it into a new column and bring in only that column. This was my query:

cur.execute("""SELECT SUM(Quantity*(OrderDetail.UnitPrice - (OrderDetail.UnitPrice*Discount)))
AS 'Revenue'
,ShipRegion
,OrderDetail.OrderID
AS 'Order_ID'
FROM OrderDetail
LEFT OUTER JOIN `Order`
ON OrderDetail.OrderID = `Order`.Id
WHERE Discount!=0
GROUP BY OrderDetail.OrderID
ORDER BY Revenue DESC;""")

df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df.head()

The first line of my query is not as compact as I'd like, but I was able to calculate revenue with the discount applied and bring only the result into my dataframe. With SQL you can perform arithmetic operations in your query, as well as use aggregation functions like SUM(), AVG(), MAX(), and so on. Isn't that neat?
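The arithmetic-plus-aggregation idea can be seen end to end on a toy version of OrderDetail. This is a sketch with made-up rows, not the real Northwind data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# toy OrderDetail rows: (OrderId, UnitPrice, Quantity, Discount)
cur.execute("CREATE TABLE OrderDetail (OrderId INTEGER, UnitPrice REAL, Quantity INTEGER, Discount REAL)")
cur.executemany("INSERT INTO OrderDetail VALUES (?, ?, ?, ?)",
                [(1, 10.0, 2, 0.1), (1, 5.0, 1, 0.1), (2, 20.0, 3, 0.0)])

# arithmetic inside the SELECT, summed per order, discounted rows only
cur.execute("""SELECT OrderId,
                      SUM(Quantity * (UnitPrice - UnitPrice * Discount)) AS Revenue
               FROM OrderDetail
               WHERE Discount != 0
               GROUP BY OrderId
               ORDER BY Revenue DESC;""")
result = cur.fetchall()
print(result)   # [(1, 22.5)] — order 2 is excluded because its discount is 0
```

Order 1's two discounted lines sum to 2 × 9.0 + 1 × 4.5 = 22.5, exactly the per-order total the real query computes.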

You can also create new columns, as I did with the result of my calculation, which went into a new column I've called 'Revenue'. We do this simply by using AS after the expression or column name we are querying, and it is also helpful for correcting column names that might be wrong, or for giving a column a clearer name that explains the data it holds.

One tip that I've learned and adopted ever since is to place the comma that separates column names at the beginning of each line. This helps with keeping track of the columns and avoids mistakes if you later change which columns you're bringing in.

After the column names, you state FROM which table you are getting the columns. In this case, I needed information from more than one table, and that is where JOIN comes in handy. I used a LEFT JOIN, which returns all rows from the table on the left (in this case OrderDetail, which I'll need) even if there are no matches in the table on the right (in this case, Order).

  • Note that we need to wrap the table name Order in square brackets or backticks (double quotes also work) because ORDER is a reserved SQL keyword.

We use the ON clause after the LEFT JOIN to specify which keys the tables are joined on. Then, I used a WHERE clause to filter the joined rows. I'll only need the rows where a discount was applied, so I'm stating that I want all rows where Discount is not 0. Again, I'm bringing in only the data that I'll need for my hypothesis testing.
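The LEFT JOIN, ON, and reserved-word quoting can be demonstrated together on two tiny stand-in tables (the rows are invented; the second OrderDetail row deliberately has no match in Order to show the LEFT JOIN keeping it):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# "Order" must be quoted because ORDER is a reserved keyword
cur.execute('CREATE TABLE "Order" (Id INTEGER PRIMARY KEY, ShipRegion TEXT)')
cur.execute("CREATE TABLE OrderDetail (OrderId INTEGER, Quantity INTEGER)")
cur.execute("INSERT INTO \"Order\" VALUES (1, 'Western Europe')")
cur.executemany("INSERT INTO OrderDetail VALUES (?, ?)", [(1, 5), (2, 3)])

# every OrderDetail row survives the LEFT JOIN, matched or not
cur.execute("""SELECT OrderDetail.OrderId, Quantity, ShipRegion
               FROM OrderDetail
               LEFT OUTER JOIN [Order]
               ON OrderDetail.OrderId = [Order].Id
               ORDER BY OrderDetail.OrderId;""")
rows = cur.fetchall()
print(rows)   # [(1, 5, 'Western Europe'), (2, 3, None)]
```

Order 2 has no match on the right, so its ShipRegion comes back as NULL (None in Python) instead of the row being dropped, which is exactly the LEFT JOIN guarantee described above.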

Furthermore, I'm grouping my data with the GROUP BY command, since I'm looking at total revenue per order (I don't need the information per product as in my previous analysis).

Lastly, I'm asking that my data be displayed in descending order using ORDER BY with DESC, so that after SQL sums up the Revenue rows when grouping my data by OrderID, it displays the totals from largest order revenue to smallest.

Voilá! Here’s my ready-to-work-on resulting table:

For my next hypothesis test, I wanted to look at the employees who applied the most discounts, and how the revenue these employees brought in compared with that of employees who used fewer discounts on their orders. For this, my query was the following:

cur.execute("""SELECT FirstName || ' ' || LastName
AS 'Name'
,Region
,Quantity*(OrderDetail.UnitPrice - (OrderDetail.UnitPrice*Discount))
AS 'Revenue'
,Discount
FROM [Order]
LEFT OUTER JOIN Employee
ON [Order].EmployeeId=Employee.Id
LEFT OUTER JOIN OrderDetail
ON [Order].Id=OrderDetail.OrderId
ORDER BY Revenue DESC;""")
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df.info()

This query has some similarity with the previous one: I've again calculated Revenue and created a new column, and this time I'm querying primarily from the Order table, making a LEFT JOIN to the Employee table, from which I want the employees' names.

What is new here is that I wanted the employees' full names, and the table has separate columns for first and last names. I could very well fetch both columns and join them with Pandas, but my challenge was to bring in the data exactly as I needed it. For this, I used the `||` operator to concatenate columns, adding a blank space between the two strings with ' ', and created a new column named 'Name' to hold the full names. Here's my resulting dataframe:
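The concatenation step can be isolated in a small sketch. The two names below are sample Employee rows invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Employee (FirstName TEXT, LastName TEXT)")
cur.executemany("INSERT INTO Employee VALUES (?, ?)",
                [("Nancy", "Davolio"), ("Andrew", "Fuller")])

# || concatenates strings; ' ' supplies the space between first and last name
cur.execute("SELECT FirstName || ' ' || LastName AS Name FROM Employee ORDER BY LastName;")
names = [row[0] for row in cur.fetchall()]
print(names)   # ['Nancy Davolio', 'Andrew Fuller']
```

Note that `||` is the SQL-standard concatenation operator that SQLite uses; some other databases (MySQL, for example) use a CONCAT() function instead.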

Good job Nancy and Andrew!

As you can see, SQL offers many powerful tools for our queries to get exactly what we need from a database. There are many other possibilities that I didn't cover in this article since I didn't need them for my tasks. I'd advise anyone who wants to explore and learn SQL to look at the documentation — here's the one for SQLite — and at articles from other data scientists on the internet (I've myself found a lot of great information here on Medium). If you'd like to play around, I've found a good tool on this website where you can build schemas and test queries.

Hope you have fun with SQL!
