Using SQL with Pandas DataFrames

7 min readOct 17, 2023

Introduction

In this tutorial, we will be introducing the Pandas data frame. Data frames are a popular method of storing and accessing tabular data. In addition to learning how to connect Python, SQL, and Pandas to create your data frames, you will learn about the power of using Python with SQL statements! data frame.

Prerequisite/Requirements

Intermediate Python Knowledge
Basic SQL Knowledge
Familiarity with Python Pandas
Basic knowledge of Pandas DataFrames.

Now let’s get started.

SQL

It is a structured programming language for extracting and classifying data from relational databases. Thus, many individuals who work with data are already familiar with its syntax and ease of use.

And that is why, if you work with datasets in Pandas, I’m completely convinced you’ve felt like most commands require too much effort to manipulate — especially when it comes to SQL syntax.

What Can SQL do?

Below are some of the most important use cases of SQL that make it unique:

It can run queries across databases.
It can retrieve information from a database.
A database’s records can be updated with SQL.
It has the ability to delete records from a database.
It can be used to create new tables in an existing database
It can create stored procedures in a database
It may be used to give access to tables, processes, and views.

Python Overview

Python is a powerful, interactive, object-oriented, and interpreted scripting language. Python has been created to be very readable.

Python has a method for using SQL queries and manipulating Pandas DataFrames.

What can Python do?

Python may be used to build web applications on a server.
Python can also be used with other software to create workflows.
It can read and modifies files; Python can also be used to connect to database systems.
Python can also handle big data and perform complex mathematics.
Python may be used for fast prototyping as well as full-fledged program development.
Python syntax is easy to comprehend; below is a simple output code in python:

print("Hello, World!")

PandaSQL Overview

Pandas is a powerful open-source data analysis and manipulation tool., powerful, flexible, and simple to use.

Pandasql allows the use of SQL syntax to query Pandas DataFrames. It functions similarly to sqldf in R. For people new to Python or pandas, pandasql tries to make data manipulation and cleanup more familiar.

You can use pandasql to query Panda data frames using SQL syntax. For people new to python or pandas, pandasql provides a more familiar and easier way of manipulating and cleaning data.

To get started, simply install the pandasql library in Python using the command below!

pip install pandasql

If you are using Jupyter Notebook, type the below command in it and press Shift+Enter.

!pip install pandasql

Sqldf is the primary function in pandasql. As a result, it can be imported directly using the following command line:

from pandasql import sqldf

How does it function?

sqldf is a simple wrapper that allows you to run SQL (SQLite) queries on Pandas data frame objects. There are some limitations in sqldf that are caused by SQLite.

To import sqldf directly, run the following command:

from pandasql import sqldf

A SQL query string and a set of environment variables (locals() or globals()) are the only two parameters that sqldf accepts (and can be omitted).

When called from within a function, Locals() returns all of the names that may be accessed locally from that function. When called from within a function, Globals() returns all of the names that may be accessed globally from that function. Both of these methods return a dictionary.

To avoid having to type it each time you wish to perform a query, you may use the following command:

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

Pandasql will immediately detect any Pandas data frames. They may be queried in the same way as any other standard SQL table. The sqldf command generates a pandas data frame with the syntax sqldf (sql query).

In the same way, we can extract data from any table using SQL, we can query any Pandas DataFrame using SQL.

This indicates that all we need to begin researching is a data frame.

Generate Sample Data

With the codes and commands below, two data frames will be produced for the exhibition.

import pandas as pd

import numpy as np

from pandasql import sqldf

#next in the line we will set a random seed to be 4

np.random.seed(4)

# Gender: 70% male 40% female 20% unknown
# Age from Poisson distribution with lambda=20
# score a random number ranging from 0 to 100

my_df = pd.DataFrame({'gender':np.random.choice(a=['m','f', 'u'], size=18, p=[0.6,0.3, 0.1]),
'age':np.random.poisson(lam=20, size=18),
'score_a':np.random.randint(100, size=18),
'score_b':np.random.randint(100, size=18),
'score_c':np.random.randint(100, size=18)})
gender = pd.DataFrame({'gender':['m','f', 'u'], 'full':['male','female', 'unknown']})

Then, we use the below command line to get the expected data out in tabular form.

my_df.head()

Age.head()

Query The Pandas Data Frames with SQL

Pandas is the traditional data container, and the DataFrame has long been the preferred tabular data storage structure in the Python data environment. The Pandas DataFrame has its own standards for accessing, manipulating, and executing calculations on composite data, specifications that can be mastered with time and persistence due to its Python syntax conformance; if you are intimately familiar with Python, you’re already well on your way to mastering the regular Pandas API.

Below we will try to query Pandas data frames using SQL

Pandasql’s principal function is sqldf. Sqldf takes two arguments.

SQL query string
Set of session/environment variables either(locals() or globals())

Let us calculate the average score of each column in the my_df data frame by gender and number of observations.

pysqldf("""select Gender, count(*) as obs, avg(age) as avg_age, avg(score_a) as avg_score_a,

avg(score_b) as avg_score_b,

avg(score_c) as avg_score_c

from my_df

group by gender""")

Suppose we wish to combine this table with the gender table.

pysqldf("""select * from gender
           inner join (select Gender, count(*) as obs, avg(age) as avg_age, avg(score_a) as avg_score_a,
           avg(score_b) as avg_score_b, 
           avg(score_c) as avg_score_c
           from my_df
           group by gender) b
           on gender.gender = b.gender""")

The output data:

It should be noted that the result is a Pandas Data Frame and that it may be saved:

my_output = pysqldf("""select * from gender
           inner join (select Gender, count(*) as obs, avg(age) as avg_age, avg(score_a) as avg_score_a,
           avg(score_b) as avg_score_b, 
           avg(score_c) as avg_score_c
           from my_df
           group by gender) b
           on gender.gender = b.gender""")
 
my_output

Using The SQLDF

Additionally, we may use the sqldf package, a wrapper for running SQL (SQLite) queries on Pandas objects in a DataFrame (Python).

Installation

The following are the installation requirements:

Python >= 3.5
Pandas >= 1.0

# Install the SQLDF package from PyPI
!pip install sqldf -U

Then, after running the above code, you will get a successful response of “Requirement already up-to-date.”

Now let us import the required library to use the sqldf

# Importing librariesp
import sqldf
import pandas as pd
import numpy as np

Using The SELECT Function

Let us carry the Panda data frame using SQL to demonstrate how the Select function works

# Define a SQL (SQLite3) query
query = """
select Gender, count(*) as obs, avg(age) as avg_age, avg(score_a) as avg_score_a,
avg(score_b) as avg_score_b,
avg(score_c) as avg_score_c
from my_df
group by gender
"""
# Run the query
df_view = sqldf.run(query)

df_view

You’ll see that sqldf has a problem with the parenthesis and that it appears that sub-queries cannot be executed.

Example 2 LEFT JOIN:

## Importing pandasql 
import numpy as np 
import pandas as pd
from pandasql import sqldf

After running the above command, it will import the required library, just as seen above. Then

np.random.seed(0)

The code below imports the NumPy and Pandas libraries and uses the pandasql library to enable SQL queries to be run on Pandas data frames. It sets a random seed for NumPy’s random number generator to ensure reproducibility.

# transactions
left_df = pd.DataFrame({'transaction_id': ['A', 'B', 'C', 'D'], 
                       'user_id': ['Unicorn', 'Davidking', 'Asaolu', 'Topzy'],
                       'value': np.random.randn(4),
                      })
# users
right_df = pd.DataFrame({'user_id': ['Paul', 'Mary', 'John',
                                     'Anna'],
                        'favorite_color': ['brown', 'blue', 'red', 
                                           np.NaN],
                       })
left_df

Note that Unicorn is not in the users table and Topzy doesn’t have a favorite color

The Output:

It looks that adding the user’s favorite color to the transaction database is accurate using a left join on the user id:

left_df.merge(right_df, on='user_id', how='left')

The code above is performing a left outer join on the two dataframes left_df and right_df, using the column ‘user_id’ as the key for the join. This means that all rows from the left_df data frame will be included in the resulting data frame, and any matching rows from the right_df dataframe will be included as well.

Please feel free to view further samples

Conclusion

After using SQL with Pandas DataFrames, it is clear that this method provides a powerful and efficient way to manipulate and analyze large datasets. By utilizing the SQL syntax, complex queries and operations can be easily performed on the data, allowing for quick and accurate analysis. This integration of SQL and Pandas allows for seamless data manipulation and analysis, making it a valuable tool for data professionals.

After going through this article, you will understand how to query Panda data frames using SQL.