Fifteen years ago, there were only a few skills a software developer would need to know well, and he or she would have a decent shot at 95% of the listed job positions. Those skills were:
- Object-oriented programming.
- Scripting languages.
SQL was a go-to tool when you needed to get a quick-and-dirty look at some data, and draw preliminary conclusions that might, eventually, lead to a report or an application being written. This is called exploratory analysis.
These days, data comes in many shapes and forms, and it’s not synonymous with “relational database” anymore. You may end up with CSV files, plain text, Parquet, HDF5, and who knows what else. This is where Pandas library shines.
What is Pandas?
Python Data Analysis Library, called Pandas, is a Python library built for data analysis and manipulation. It’s open-source and supported by Anaconda. It is particularly well suited for structured (tabular) data. For more information, see http://pandas.pydata.org/pandas-docs/stable/index.html.
What can I do with it?
All the queries that you were putting to the data before in SQL, and so many more things!
Great! Where do I start?
This is the part that can be intimidating for someone used to expressing data questions in SQL terms.
SQL is a declarative programming language: https://en.wikipedia.org/wiki/List_of_programming_languages_by_type#Declarative_languages.
With SQL, you declare what you want in a sentence that almost reads like English.
Pandas’ syntax is quite different from SQL. In Pandas, you apply operations on the dataset, and chain them, in order to transform and reshape the data the way you want it.
We’re going to need a phrasebook!
The anatomy of a SQL query
A SQL query consists of a few important keywords. Between those keywords, you add the specifics of what data, exactly, you want to see. Here is a skeleton query without the specifics:
SELECT… FROM… WHERE…
GROUP BY… HAVING…
There are other terms, but these are the most important ones. So how do we translate these terms into Pandas?
First we need to load some data into Pandas, since it’s not already in database. Here is how:
I got this data at http://ourairports.com/data/.
SELECT, WHERE, DISTINCT, LIMIT
Here are some SELECT statements. We truncate results with LIMIT, and filter them with WHERE. We use DISTINCT to remove duplicated results.
SELECT with multiple conditions
We join multiple conditions with an &. If we only want a subset of columns from the table, that subset is applied in another pair of square brackets.
By default, Pandas will sort things in ascending order. To reverse that, provide ascending=False.
IN… NOT IN
We know how to filter on a value, but what about a list of values — IN condition? In pandas, .isin() operator works the same way. To negate any condition, use ~.
GROUP BY, COUNT, ORDER BY
Grouping is straightforward: use the .groupby() operator. There’s a subtle difference between semantics of a COUNT in SQL and Pandas. In Pandas, .count() will return the number of non-null/NaN values. To get the same result as the SQL COUNT, use .size().
Below, we group on more than one field. Pandas will sort things on the same list of fields by default, so there’s no need for a .sort_values() in the first example. If we want to use different fields for sorting, or DESC instead of ASC, like in the second example, we have to be explicit:
What is this trickery with .to_frame() and .reset_index()? Because we want to sort by our calculated field (size), this field needs to become part of the DataFrame. After grouping in Pandas, we get back a different type, called a GroupByObject. So we need to convert it back to a DataFrame. With .reset_index(), we restart row numbering for our data frame.
In SQL, you can additionally filter grouped data using a HAVING condition. In Pandas, you can use .filter() and provide a Python function (or a lambda) that will return True if the group should be included into the result.
Top N records
Let’s say we did some preliminary querying, and now have a dataframe called by_country, that contains the number of airports per country:
In the next example, we order things by airport_count and only select the top 10 countries with the largest count. Second example is the more complicated case, in which we want “the next 10 after the top 10”:
Aggregate functions (MIN, MAX, MEAN)
Now, given this dataframe or runway data:
Calculate min, max, mean, and median length of a runway:
A reader pointed out that SQL does not have median function. Let’s pretend you wrote a user-defined function to calculate this statistic (since the important part here is syntactic differences between SQL and Pandas).
You will notice that with this SQL query, every statistic is a column. But with this Pandas aggregation, every statistic is a row:
Nothing to worry about —simply transpose the dataframe with .T to get columns:
Use .merge() to join Pandas dataframes. You need to provide which columns to join on (left_on and right_on), and join type: inner (default), left (corresponds to LEFT OUTER in SQL), right (RIGHT OUTER), or outer (FULL OUTER).
UNION ALL and UNION
Use pd.concat() to UNION ALL two dataframes:
To deduplicate things (equivalent of UNION), you’d also have to add .drop_duplicates().
So far, we’ve been selecting things, but you may need to modify things as well, in the process of your exploratory analysis. What if you wanted to add some missing records?
There’s no such thing as an INSERT in Pandas. Instead, you would create a new dataframe containing new records, and then concat the two:
Now we need to fix some bad data in the original dataframe:
The easiest (and the most readable) way to “delete” things from a Pandas dataframe is to subset the dataframe to rows you want to keep. Alternatively, you can get the indices of rows to delete, and .drop() rows using those indices:
I need to mention one important thing — immutability. By default, most operators applied to a Pandas dataframe return a new object. Some operators accept a parameter inplace=True, so you can work with the original dataframe instead. For example, here is how you would reset an index in-place:
However, the .loc operator in the UPDATE example above simply locates indices of records to updates, and the values are changed in-place. Also, if you updated all values in a column:
or added a new calculated column:
these things would happen in-place.
The nice thing about Pandas is that it’s more than just a query engine. You can do other things with your data, such as:
- Export to a multitude of formats:
- Plot it:
to see some really nice charts!
- Share it.
The best medium to share Pandas query results, plots and things like this is Jupyter notebooks (http://jupyter.org/). In facts, some people (like Jake Vanderplas, who is amazing), publish the whole books in Jupyter notebooks: https://github.com/jakevdp/PythonDataScienceHandbook.
It’s that easy to create a new notebook:
- navigate to localhost:8888
- click “New” and give your notebook a name
- query and display the data
- create a GitHub repository and add your notebook (the file with .ipynb extension).
GitHub has a great built-in viewer to display Jupyter notebooks with Markdown formatting.
And now, your Pandas journey begins!
I hope you are now convinced that Pandas library can serve you as well as your old friend SQL for the purposes of exploratory data analysis — and in some cases, even better. It’s time to get your hands on some data to query!