Using SQL syntax to pull data from Pandas Dataframes in Python

Steven Kyle
Analytics Vidhya
Published in
3 min readAug 30, 2020

*If you know SQL and are comfortable with its syntax, you might find this blog post helpful when working with Pandas Data Frames.*

Gathering data from multiple Pandas Data Frames can be a headache, especially if you are new to Python and unfamiliar with pandas. However, there is a neat package called pandasql that can simplify it all. Kind of like a translator, Pandasql is able to take in queries written in SQL syntax and apply them to Pandas DataFrames.

Setting up pandasql

First things first, pandasql needs to be installed. To do this you can simply run this line in the terminal:

$ pip install -U pandasql

The package can then be imported into python as:

from pandasql import sqldf

sqldf takes 2 parameters to function. The first being the query denoted as q and the second being a set of session/environmental variables (locals() or globals()). A quick function can be made to help make things easier so that we don’t have to keep typing in the second parameter:

pysqldf = lambda q: sqldf(q, globals())

Now we are all set up. Pandasql is automatically able to detect Data Frames, so as long as the names are typed correctly in the query the function should work. The only thing we’ll have to do is write the query and pass it through pysqldf.

So what is a query? A query is basically a request for information. There is a lot of things that can be done using a query but the basic structure can be broken down into 3 things:

  • SELECT — This is the information that we want
  • FROM — This is the Data Frame the information is in
  • WHERE — This is the conditional we want to select with

Now let’s go through some DataFrames to see how it all comes together.

Disclaimer: The information in the Data Frames are made up for the purpose of these examples.

Setting up the first Data Frame

Now let’s write a query that will select all columns (SELECT *) that have a life span equal or greater than 10 (WHERE Lifespan ≥10) from the Data Frame (FROM DogsLifespan).

Great! We see that we were able to select for the correct dog breeds. The query syntax is written exactly the same as how it would be written in a SQL database. Now let’s say we have a new table that has dog sizes and we want to know the lifespan of bigger dogs.

Setting up the second Data Frame

In the query we can use the statement JOIN to combine the Data Frames. The JOIN statement joins Data Frames by USING columns that contain the same type of information. In this example we can use the column named “Breeds”. In the same query we will also SELECT for Breed and Lifespan WHERE dogs are 20 inches or taller.

The above two examples are very simple but you can imagine how using SQL syntax could be a quick and easy way of gathering data from big/multiple Data Frames.

--

--

Steven Kyle
Analytics Vidhya

25 year old Texan in the midst of a career change into DataScience.