As someone who started learning SQL after becoming proficient with pandas, I found it helpful to compare the similar functions between the two. This post will focus on the basics of data querying and manipulation.
While I may prefer pandas methods over SQL, SQL is an optimized data analysis language that is not going away anytime soon, and the ability to manipulate data fluently with SQL is an essential skill for data analysts. I would encourage anyone trying to learn pandas to continuously refer to the pandas documentation, and those learning SQL basics to try out the tutorials by Mode Analytics.
pandas — The fun new kid on the block
Pandas is an open-source, Python-based data analysis library whose name comes from ‘Panel Data’ — referring to multidimensional time series data. As the name implies, pandas is particularly great for easily manipulating time series data, and the library is a mainstay for many data scientists working with data in Python.
SQL — The old wise way
SQL (for some reason commonly pronounced “sequel”), or ‘Structured Query Language’, is a programming language designed for manipulating relational databases. While its syntax may not be as friendly as pandas, it has the benefit of speed and optimization for database queries, and it plays a necessary role in the data gathering, cleaning, and manipulation process.
No matter your preferred tool set for data manipulation, your data analysis will most likely work best using a combination of both SQL and pandas.
Basic pandas ← → SQL
Setup — in Python it is customary to use the following naming convention for importing the pandas library:
import pandas as pd
To read a CSV file into a data frame in pandas and then show the first (default 5) rows:
df = pd.read_csv(csv_file_name_and_pathway)
df.head()
SELECT column1, column2, column3
FROM table1;
df[['column1', 'column2', 'column3']]
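As a quick sanity check, the pandas selection above can be run against a small hypothetical data frame (the names and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical example data, not from the original post
df = pd.DataFrame({
    'column1': [10, 20, 30],
    'column2': ['a', 'b', 'a'],
    'column3': [1.5, 2.5, 3.5],
    'column4': ['x', 'y', 'z'],
})

# Equivalent of SELECT column1, column2, column3 FROM df
subset = df[['column1', 'column2', 'column3']]
print(subset.columns.tolist())  # ['column1', 'column2', 'column3']
```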
Including a boolean statement:
SELECT column1, column2, column3
FROM table1
WHERE column2 = filter_value;
If we needed to include a second boolean statement, it would be easy to do so with an AND clause directly after the WHERE clause:
AND column1 > value;
df[['column1', 'column2', 'column3']][df['column2'] == filter_value]
Similarly, if we needed a second boolean statement here, we could use & directly after the first condition that filters the data frame:
df[['column1', 'column2', 'column3']][(df['column2'] == filter_value) & (df['column1'] > value)]
Note that in the above statement, if we wanted all columns in the data frame we could leave off the initial bracket statement specifying the subset of columns to call.
As these statements become longer and more complex, it can help to assign the filtered data frames to intermediate variables.
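For example, with a small hypothetical data frame (the column names and values below are made up for illustration), assigning the boolean filter to a named variable keeps a longer statement readable:

```python
import pandas as pd

# Hypothetical example data, not from the original post
df = pd.DataFrame({
    'column1': [5, 15, 25],
    'column2': ['a', 'a', 'b'],
    'column3': [1, 2, 3],
})

# Assign the combined boolean filter to a variable first
mask = (df['column2'] == 'a') & (df['column1'] > 10)

# Then subset the columns and apply the filter
filtered = df[['column1', 'column2', 'column3']][mask]
print(filtered)  # only the row where column2 == 'a' and column1 > 10
```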
Grouping Column Values and Applying an Aggregation Function:
SELECT column2, AVG(column1)
FROM table1
GROUP BY column2;
The above statement returns the average of column1 for each distinct value of column2; e.g., if column2 holds group names, it returns the average column1 value per group.
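In pandas, the equivalent grouping can be sketched with groupby and an aggregation method (the data frame below is hypothetical, just to show the pattern):

```python
import pandas as pd

# Hypothetical example data, not from the original post
df = pd.DataFrame({
    'column1': [10, 20, 30, 40],
    'column2': ['a', 'a', 'b', 'b'],
})

# Equivalent of SELECT column2, AVG(column1) ... GROUP BY column2
result = df.groupby('column2')['column1'].mean()
print(result)  # group 'a' -> 15.0, group 'b' -> 35.0
```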
Joining or Merging Tables:
The following statements would perform an inner join between two tables or data frames, resulting in a new single table or data frame that is the subset of the two tables where the specified data exists in both of the tables with the same column key. Any rows where the data does not exist in both tables will not be shown.
SELECT *
FROM table1
INNER JOIN table2
ON table1.columnkey = table2.columnkey;
pd.merge(table1, table2, on='columnkey')
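To see the inner-join behavior concretely, here is a small hypothetical pair of data frames with partially overlapping keys (the names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical example tables, not from the original post
table1 = pd.DataFrame({'columnkey': [1, 2, 3], 'val1': ['a', 'b', 'c']})
table2 = pd.DataFrame({'columnkey': [2, 3, 4], 'val2': ['x', 'y', 'z']})

# pd.merge defaults to an inner join (how='inner'), so only
# keys 2 and 3, which appear in both tables, survive the merge
merged = pd.merge(table1, table2, on='columnkey')
print(merged['columnkey'].tolist())  # [2, 3]
```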
The above examples of basic data manipulation with SQL and pandas are useful references; note, however, that SQL will be better optimized for these basic queries, while the real power of pandas shows in more sophisticated functionality and time series manipulations.
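As one small taste of that time series functionality, pandas can resample a datetime-indexed frame in a single line, something far more verbose in plain SQL (the index and values here are hypothetical):

```python
import pandas as pd

# Hypothetical daily data, not from the original post
idx = pd.date_range('2020-01-01', periods=6, freq='D')
ts = pd.DataFrame({'value': [1, 2, 3, 4, 5, 6]}, index=idx)

# Average the values over non-overlapping 3-day windows
windowed = ts['value'].resample('3D').mean()
print(windowed)  # Jan 1-3 -> 2.0, Jan 4-6 -> 5.0
```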