Reproducing SQL queries in Python
Rewrite SQL queries using sqlite3 and pandas with example python codes!
This blog is mainly about how to rewrite SQL queries in python codes. I found that sometimes it’s really convenient to do all the work in python instead of switching between python and SQL.
Hope this blog can help the people like me who are struggling with data manipulation.
What is the first thing I need to do to get started?
First, you need to ensure that python is installed on your computer.
Personally I recommend installing anaconda, which is a free and open-source distribution for several programming languages(python, R, etc.). One of the best thing about anaconda is that it provides Jupyter Notebooks and tons of pre-installed packages(including pandas, which is the main package we are going to use).
Jupyter Notebooks is almost the first choice when it comes to data analysis and data manipulation in Python. Here is the introduction that may guide you in getting familiar with it.
Second, if you want to run REAL SQL queries in python, you can install the sqlite3 package. Here is the introduction for sqlite3 using python.
Also, for the following contents, I may assume that you have basic knowledge about SQL queries (at least familiar with the SQL keywords).
What is the general approach for rewriting SQL queries in python?
The basic idea is to use different functions of pandas to implement each step of a SQL query following an executing order.
Each SQL query consists of several keywords and other specific conditions.
SELECT … FROM…
WHERE… JOIN…
GROUP BY… HAVING…
ORDER BY…LIMIT/OFFSET…
For SQL, we write queries in the above form but the actual execution order inside the computer is a little bit different. It’s very important to understand the actual execution order.
When we are writing python codes, we need to ensure the order of codes we write would be exactly the same order as our expected execution order because python won’t re-order the codes for us.
Here is the actual execution orders for basic SQL queries:
FROM…JOIN…
WHERE. …
GROUP BY. …HAVING. …
SELECT …
ORDER BY…LIMIT / OFFSET…
And this is the order that we need to follow when writing python codes!
What is the Python codes corresponding to each SQL keywords?
Here is a general corresponding relationship between basic SQL queries and Python codes!
Real examples for implementing SQL queries in Python!
You can find the complete codes and datasets on my Github.
Basic query with FROM & WHERE & LIMIT
When adding several filter conditions in pandas codes, use table_name[(condition_1)&(condition_2)]
to ensure every condition is met.
Basic query with WHERE & IN
~
in python means the reverse of boolean value. (Turning True into False or vice-versa)
Thus ~table_name["column_name"].isin([value_list])
actually returns the rows that are NOT in the value_list.
Basic query with ORDER BY using different order DESC & ASC
When assigning values to by
& ascending
in sort_values
, ensure that the corresponding orders are correct.
Basic query with WHERE & ORDER BY & selecting specific columns
Pandas uses [column_list]
following a data frame to select several columns.
Basic query with UNION & Other key words
In concat()
function, axis=0
means to concat two tables by columns vertically (normally used when they have the same number of columns). While axis=1
means to concat two tables horizontally (normally used when they have the same number of rows).
The axis
parameter is a little bit tricky when it comes to rows and columns in python, what I said above only applies to concat()
function. You can do more research on it to figure out how it is applied in other functions!
Aggregation query with GROUP BY & SUM & ORDER BY
Use reset_index()
to transform the aggregated columns(index) into columns again after using groupby()
, so that we can select these columns as outcome.
Aggregation query with GROUP BY & SUM & HAVING & WHERE
For the query containing both HAVING & WHERE, I have split the python codes into two parts to avoid it being too long.
Like I mentioned earlier, the order of codes should be the same as the execution order, which means filter the conditions for WHERE and then filter the conditions HAVING after group by.
Aggregation query with different aggregation functions
Using agg()
after groupby()
can apply different aggregation functions to different columns.
In python, after applying several aggregation functions to one columns, the columns will be transformed into two-dimension columns. We can use a tuple to select certain column ( e.g: loc[:,("vendor_id","")]
).
Query with INNER JOIN & LEFT JOIN
We can use left_on
and right_on
in merge()
when the column names in two tables are different.
Summary
To sum up, the key point of rewriting SQL queries in python is to implement the steps one by one following the SQL execution order.
My codes are not the only solutions and please feel free to reach out to me if you have anything related to this blog or data analysis that you would like to discuss with me!
If you are interested in data manipulation using Python, you can check my other blogs:
- Exhaustive Introduction to Pandas in Python (1) — Guide for selecting single data point in pandas and
loc[],iloc[]
. - Exhaustive Introduction to Pandas in Python (2) — Guide for selecting multiple data points and handling missing values.