Pandas ← → SQL Basics Reference

As someone who started learning SQL after becoming proficient using pandas, I found it helpful to do a comparison for myself of the similar functions between the two. This post will focus on the basics of data querying and manipulation.

While I may prefer pandas methods over the use of SQL, SQL is an optimized data analysis language that is not going away anytime soon and having a strong working ability to manipulate data using SQL is an essential skill for data analysts. I would encourage anyone trying to learn pandas to continuously refer to the pandas documentation and for those learning SQL basics to try out the tutorials by Mode Analytics.

pandas — The fun new kid on the block

Pandas is an open-source, python based data analysis library whose name comes from ‘Panel Data’ — referring to multidimensional time series data. As the name implies, pandas is particularly great for easily manipulating time series data and using this library is likely a mainstay for many data scientists manipulating data using Python.

SQL — The old wise way

SQL (for some reason commonly pronounced “sequel”), or ‘Structured Query Language’ is a programming language designed for manipulating relational data bases. While its syntax may not be as friendly as pandas, it has the benefit of speed and optimization for database queries, and is a necessary step in the data gathering, cleaning, and manipulation process.

No matter your preferred tool set for data manipulation, it’s most likely that your data analysis will work best while using a combination of both SQL and pandas.

Basic pandas ← → SQL

Setup — in Python it is customary to use the following name convention for importing the pandas library

import pandas as pd

To read in a csv to a data frame in pandas and then show the first (default 5) rows:

df = pd.read_csv(csv_file_name_and_pathway)

df.head()

Selecting Data:

SQL

SELECT column1, column2, column3

FROM table_name;

Pandas

df[['column1', 'column2', 'column3']]

Including a boolean statement:

SQL

SELECT column1, column2, column3

FROM table_name

WHERE column2 = filter_value;

If we needed to include a second boolean statement, it would be easy to do so using an AND statement directly after the WHERE clause:

AND column1 > value;

Pandas

df[['column1', 'column2', 'column3']][df['column2'] == filter_value]

Similarly, if we needed a second boolean statement here we could use & directly following the first statement that filters the data frame:

df[['column1', 'column2', 'column3']][(df['column2'] == filter_value) & (df['column1'] > value)]

Note that in the above statement, if we wanted all columns in the data frame we could leave off the initial bracket statement specifying the subset of columns to call.

As these statements become longer and more complex, it may be suggested to use variables that are filtered data frames.

Grouping Column Values and Applying an Aggregation Function:

SQL

SELECT column2, AVG(column1)

FROM table_name

GROUP BY column2;

The above statement will return the average of column1 by specific value of column2, i.e. if column2 has names of groups, it will return the average of the column1 value by column2 group name.

Pandas

df[['column1', 'column2']].groupby('column2').mean()

Joining or Merging Tables:

The following statements would perform an inner join between two tables or data frames, resulting in a new single table or data frame that is the subset of the two tables where the specified data exists in both of the tables with the same column key. Any rows where the data does not exist in both tables will not be shown.

SQL

SELECT *

FROM table1

INNER JOIN table2

ON table1.columnkey = table2.columnkey;

Pandas

pd.merge(table1, table2, on='columnkey')

The above examples of basic data manipulation with SQL and pandas are useful references, however it should be noted that SQL will be better optimized for these basic queries and the real power of Pandas will likely be of most help with more sophisticated functionality and time series manipulations.

Links:

Pandas Documentation Reference