Using SQL with Pandas DataFrames

You can use SQL syntax for shaping and analyzing pandas DataFrames with ease.

Josep Ferrer
DataBites
6 min readAug 23, 2022

--

Self-made image. Using SQL with pandas DataFrames.
Self-made image. Using SQL with pandas DataFrames.

While conducting a job search for Data Scientist or Data Analyst, three letters often pop up in the job description section: SQL or also known as Structured Query Language.

This is a standardized programming language that is used to extract and organize stored data in a relational database. Most people who work with data are already familiar with its syntax — and with its simplicity.

This is why if you are dealing with datasets in pandas, I bet you probably have felt like most of the commands take too much effort to manipulate — especially when compared to SQL syntax.

Basic manipulation of data frames and table creation and editing is much easier when written in SQL. Thus this article is your lucky strike!

There is a method for using SQL queries and manipulating the pandas DataFrames within python. Want to know how?

Let’s discover it out together! 👇🏻

Pandasql

pandasqlallows you to query pandas data frames using SQL syntax. pandasql provides a more familiar — and easier — way of manipulating and cleaning data for people new to python or pandas.

#1. How to install?

You can follow the process explained in the pandasql documentation webpage. However, it is as easy as executing the following command.

pip install pandasql

If you are using Jupyter-Notebook, you can directly execute the following command in a cell.

!pip install pandasql

Once we have the library installed in our local environment, we just need to import the library — as you already do with any other library you use — and it will be ready to be used.

import pandasql

#2. How does it work?

The main function used in pandasql is sqldf. The sqldf is a simple wrapper to run SQL (SQLite) queries on Pandas data frame objects. There are some limitations in sqldf which are the limitations that come from SQLite.

sqldf can be directly imported running the following command.

from pandasql import sqldf

sqldf accepts 2 parametrs — a SQL query string — and a set of environment variables (locals() or globals()) which is completely optional (and can be omitted).

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

Any pandas data frames will be automatically detected by pandasql. You can query them as you would do in any regular SQL table. With its syntax sqldf(sql_query) , sqldf gives a pandas data frame as output.

We can query any pandas DataFrame using SQL in the same way as we extract data from any table using SQL.

This means, we just need a data frame to start exploring.

#3. Time to explore — Querying around! 🚀

You can have a first try of the library importing a sample data frame using the load_meat() and load_births() methods.

  1. We first import the library.
import pandasql
from pandasql import sqldf

2. I usually declare the function pysqldf to query in Python, but the default function sqldf can be used directly as well.

pysqldf = lambda q: sqldf(q)

3. pandasqlcontains two functions to load two sample data frames. We import both functions and declare their corresponding data frames.

from pandasql import load_meat, load_births
meats = load_meat()
births = load_births()

The meats data frame looks like follows:

4. Now that we have a sample data frame, we can execute any query towards it. In this case, I am going to display all rows that do contain a non-null turkey value.

pysqldf("SELECT * FROM meats WHERE turkey IS NOT NULL")

We can try more a little more complicated query. In this case, I want to know how much beef, veal and pork has been consumed year by year. To do so, I can extract the year from each date and sum all three kinds of meats while grouping by year.

pysqldf("SELECT strftime('%Y',DATE(date))  AS year, SUM(beef) AS beef ,SUM(veal) AS veal, SUM(pork) AS pork FROM meats GROUP BY year ORDER BY year DESC ")

The following data frame is obtained as the result of the previous query.

We can easily observe the temporal behaviour of the consuption of all three kinds of meat when plotting all three rows using the matplotlib library of python.

However, it is quite hard to understand the queries when they are stacked up on the same line.

This is why, here goes a really important — and useful — tip! 🚨🚨🚨

  • We can use triple comma (“““ ”””) to give structure to our query.
pysqldf("""            SELECT strftime('%Y',DATE(date))  AS year,
SUM(beef) AS beef,
SUM(veal) AS veal,
SUM(pork) AS pork
FROM meat
GROUP BY YEAR
ORDER BY YEAR DESC
""")

I bet you already got the idea right? Let’s see some more examples of queries that can be performed using pandasql.

#EXAMPLE 1: LEFT JOIN

The first example will consist on performing a left join. To do so, we import the library and get the two dataframes — in this case I will use directly both sample meat and births datasets. To perform a left join — we need two tables with a common column, so I grouped by both the evolution of births and consumption of meat by year.

You can observe the left join code below.

The final result is the table containing both meat and birth total numbers by year.

#EXAMPLE 2: CASE WHEN

As you can see in the table obtained in the previous examples, there are some years that do not contain birth data. In this case, I perform a when case query to indicate which years have no data available.

The output query is the following one:

You can find my whole jupyter notebook in the following link. Hope you find it easy to replicate :)

Data always has a better idea — trust it.

Don’t forget to follow ForCode’Sake to get more articles like this one! ✨

You can subscribe to my Medium Newsletter to stay tuned and receive my content. I promise it will be unique!

If you are not a full Medium member yet, just check it out here to support me and many other writers. It really helps :D

You can find me on Twitter and LinkedIn as well!

--

--

Josep Ferrer
DataBites

Crush into the world of data with DataBItes: databites.tech | Outstand using data. ML, SQL, Python and DataViz | 👉🏻Inquiries in rfeers@gmail.com