Reproducing SQL queries in Python

Rewrite SQL queries using sqlite3 and pandas with example python codes!

Published in

The Startup

5 min readFeb 1, 2020

This blog is mainly about how to rewrite SQL queries in python codes. I found that sometimes it’s really convenient to do all the work in python instead of switching between python and SQL.

Hope this blog can help the people like me who are struggling with data manipulation.

What is the first thing I need to do to get started?

First, you need to ensure that python is installed on your computer.

Personally I recommend installing anaconda, which is a free and open-source distribution for several programming languages(python, R, etc.). One of the best thing about anaconda is that it provides Jupyter Notebooks and tons of pre-installed packages(including pandas, which is the main package we are going to use).

Jupyter Notebooks is almost the first choice when it comes to data analysis and data manipulation in Python. Here is the introduction that may guide you in getting familiar with it.

Second, if you want to run REAL SQL queries in python, you can install the sqlite3 package. Here is the introduction for sqlite3 using python.

Also, for the following contents, I may assume that you have basic knowledge about SQL queries (at least familiar with the SQL keywords).

What is the general approach for rewriting SQL queries in python?

The basic idea is to use different functions of pandas to implement each step of a SQL query following an executing order.

Each SQL query consists of several keywords and other specific conditions.

SELECT … FROM…
WHERE… JOIN…
GROUP BY… HAVING…
ORDER BY…LIMIT/OFFSET…

For SQL, we write queries in the above form but the actual execution order inside the computer is a little bit different. It’s very important to understand the actual execution order.

When we are writing python codes, we need to ensure the order of codes we write would be exactly the same order as our expected execution order because python won’t re-order the codes for us.

Here is the actual execution orders for basic SQL queries:

FROM…JOIN…
WHERE. …
GROUP BY. …HAVING. …
SELECT …
ORDER BY…LIMIT / OFFSET…

And this is the order that we need to follow when writing python codes!

What is the Python codes corresponding to each SQL keywords?

Here is a general corresponding relationship between basic SQL queries and Python codes!

Basic SQL queries and their corresponding pandas codes; Screenshot by author

Real examples for implementing SQL queries in Python!

You can find the complete codes and datasets on my Github.

Basic query with FROM & WHERE & LIMIT

“ \” is used to separate one line of codes into two lines. Remember there is a blank space before “\”. ; Screenshot by author

When adding several filter conditions in pandas codes, use table_name[(condition_1)&(condition_2)]to ensure every condition is met.

Basic query with WHERE & IN

~ in python means the reverse of boolean value. (Turning True into False or vice-versa)

Thus ~table_name["column_name"].isin([value_list])actually returns the rows that are NOT in the value_list.

Basic query with ORDER BY using different order DESC & ASC

When assigning values to by & ascending in sort_values, ensure that the corresponding orders are correct.

Basic query with WHERE & ORDER BY & selecting specific columns

Pandas uses [column_list] following a data frame to select several columns.

Basic query with UNION & Other key words

In concat()function, axis=0means to concat two tables by columns vertically (normally used when they have the same number of columns). While axis=1means to concat two tables horizontally (normally used when they have the same number of rows).

The axis parameter is a little bit tricky when it comes to rows and columns in python, what I said above only applies to concat() function. You can do more research on it to figure out how it is applied in other functions!

Aggregation query with GROUP BY & SUM & ORDER BY

Use reset_index()to transform the aggregated columns(index) into columns again after using groupby(), so that we can select these columns as outcome.

Aggregation query with GROUP BY & SUM & HAVING & WHERE

For the query containing both HAVING & WHERE, I have split the python codes into two parts to avoid it being too long.
Like I mentioned earlier, the order of codes should be the same as the execution order, which means filter the conditions for WHERE and then filter the conditions HAVING after group by.

Aggregation query with different aggregation functions

Using agg() after groupby() can apply different aggregation functions to different columns.
In python, after applying several aggregation functions to one columns, the columns will be transformed into two-dimension columns. We can use a tuple to select certain column ( e.g: loc[:,("vendor_id","")] ).

Query with INNER JOIN & LEFT JOIN

We can use left_on and right_on in merge() when the column names in two tables are different.

Summary

To sum up, the key point of rewriting SQL queries in python is to implement the steps one by one following the SQL execution order.

My codes are not the only solutions and please feel free to reach out to me if you have anything related to this blog or data analysis that you would like to discuss with me!

If you are interested in data manipulation using Python, you can check my other blogs:

Exhaustive Introduction to Pandas in Python (1) — Guide for selecting single data point in pandas and loc[],iloc[] .
Exhaustive Introduction to Pandas in Python (2) — Guide for selecting multiple data points and handling missing values.