How to rewrite your SQL queries in Pandas, and more

Irina Truong
Mar 4, 2018 · 6 min read
Image for post
Image for post

Fifteen years ago, there were only a few skills a software developer would need to know well, and he or she would have a decent shot at 95% of the listed job positions. Those skills were:

  • Object-oriented programming.
  • Scripting languages.
  • JavaScript, and…
  • SQL.

SQL was a go-to tool when you needed to get a quick-and-dirty look at some data, and draw preliminary conclusions that might, eventually, lead to a report or an application being written. This is called exploratory analysis.

These days, data comes in many shapes and forms, and it’s not synonymous with “relational database” anymore. You may end up with CSV files, plain text, Parquet, HDF5, and who knows what else. This is where Pandas library shines.

What is Pandas?

What can I do with it?

Great! Where do I start?

SQL is a declarative programming language: https://en.wikipedia.org/wiki/List_of_programming_languages_by_type#Declarative_languages.

With SQL, you declare what you want in a sentence that almost reads like English.

Pandas’ syntax is quite different from SQL. In Pandas, you apply operations on the dataset, and chain them, in order to transform and reshape the data the way you want it.

We’re going to need a phrasebook!

The anatomy of a SQL query

SELECT… FROM… WHERE…

GROUP BY… HAVING…

ORDER BY…

LIMIT… OFFSET…

There are other terms, but these are the most important ones. So how do we translate these terms into Pandas?

First we need to load some data into Pandas, since it’s not already in database. Here is how:

I got this data at http://ourairports.com/data/.

SELECT, WHERE, DISTINCT, LIMIT

SELECT with multiple conditions

ORDER BY

IN… NOT IN

GROUP BY, COUNT, ORDER BY

Below, we group on more than one field. Pandas will sort things on the same list of fields by default, so there’s no need for a .sort_values() in the first example. If we want to use different fields for sorting, or DESC instead of ASC, like in the second example, we have to be explicit:

What is this trickery with .to_frame() and .reset_index()? Because we want to sort by our calculated field (size), this field needs to become part of the DataFrame. After grouping in Pandas, we get back a different type, called a GroupByObject. So we need to convert it back to a DataFrame. With .reset_index(), we restart row numbering for our data frame.

HAVING

Top N records

Image for post
Image for post

In the next example, we order things by airport_count and only select the top 10 countries with the largest count. Second example is the more complicated case, in which we want “the next 10 after the top 10”:

Aggregate functions (MIN, MAX, MEAN)

Image for post
Image for post

Calculate min, max, mean, and median length of a runway:

A reader pointed out that SQL does not have median function. Let’s pretend you wrote a user-defined function to calculate this statistic (since the important part here is syntactic differences between SQL and Pandas).

You will notice that with this SQL query, every statistic is a column. But with this Pandas aggregation, every statistic is a row:

Image for post
Image for post

Nothing to worry about —simply transpose the dataframe with .T to get columns:

Image for post
Image for post

JOIN

UNION ALL and UNION

To deduplicate things (equivalent of UNION), you’d also have to add .drop_duplicates().

INSERT

There’s no such thing as an INSERT in Pandas. Instead, you would create a new dataframe containing new records, and then concat the two:

UPDATE

DELETE

Immutability

However, the .loc operator in the UPDATE example above simply locates indices of records to updates, and the values are changed in-place. Also, if you updated all values in a column:

or added a new calculated column:

these things would happen in-place.

And more!

  • Export to a multitude of formats:
  • Plot it:

to see some really nice charts!

Image for post
Image for post
  • Share it.

The best medium to share Pandas query results, plots and things like this is Jupyter notebooks (http://jupyter.org/). In facts, some people (like Jake Vanderplas, who is amazing), publish the whole books in Jupyter notebooks: https://github.com/jakevdp/PythonDataScienceHandbook.

It’s that easy to create a new notebook:

After that:
- navigate to localhost:8888
- click “New” and give your notebook a name
- query and display the data
- create a GitHub repository and add your notebook (the file with .ipynb extension).

GitHub has a great built-in viewer to display Jupyter notebooks with Markdown formatting.

And now, your Pandas journey begins!

j-bennet codes

On coding and data analysis, by Irina Truong…

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store