From Pandas-wan to Pandas-master

Rudolf Höhn
Unit8 - Big Data & AI
Jul 9, 2019


At Unit8 we support our clients and create projects ranging from structuring data sources to leveraging state-of-the-art machine-learning models, to give the business the right tools for making an impact. In our journey to succeed in our mission, we use different tools. One of these tools is the Python library pandas.

With this article, you will hopefully discover one — or two — or more — new ways to code with pandas.

This blog post is about best practices and it is aimed at anyone who uses Pandas, frequently or not, or who wants to use it. And even if you’ve never used it, it is never too late to start, right?

The scope of this article is to cover the following:

  • Current development status of Pandas
  • Memory optimization
  • Indexing
  • Method chaining
  • Random tips

While reading the article, I recommend that you read the docstrings of every function you do not know. A simple Google search and a few seconds reading Pandas documentation will make the reading much more enjoyable.

Pandas definition and status

So what is Pandas?

Pandas is “an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language” (source: Pandas website). In summary, it provides data abstractions called DataFrame and Series (for those who use Panel, it has been deprecated). Overall, it manages indexes to access the data fast, performs analytical and transformation operations, and even plots figures (using a matplotlib backend).

As of the time of writing this article, the latest release is v0.24.2.

Right, Pandas is working its way up to version 1.0, and to get there, a few things have to change in how people are used to working with it. There is a very interesting talk, “Towards Pandas 1.0”, given by Marc Garcia, one of the core contributors of Pandas.

The next release, v0.25, is set to come out in July 2019 (v0.25rc0 was released on July 4th). It shares the same codebase as v1.0, but adds warning messages for soon-to-be-deprecated methods. So, if you are planning to move to v1.0, you should pay close attention to all the deprecation warnings when you run your codebase with v0.25.

As a “one-sentence” summary, Pandas v1.0 mainly improves stability (e.g. time series) and removes unused parts of the codebase (e.g. SparseDataFrame).

Data

Let’s do it. As a toy dataset, we will use “Suicide rates per country between 1985 and 2016” (from Kaggle). It is simple yet sufficient to get your hands dirty with Pandas.

Before diving into the code, if you want to reproduce the results, here is the small data preparation to perform to ensure that you have the right column names and types.
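
The original preparation snippet is not reproduced here, so below is a minimal sketch of such a step. The raw column names on the left-hand side of the rename and the file name are assumptions; adapt them to the CSV you download from Kaggle.

import pandas as pd

df = pd.read_csv('suicide_rates.csv')  # hypothetical file name
df = df.rename(columns={
    'suicides/100k pop': 'suicides_per_100k',   # assumed raw name
    'country-year': 'country_year',             # assumed raw name
    ' gdp_for_year ($) ': 'gdp_year',           # assumed raw name
    'gdp_per_capita ($)': 'gdp_capita',         # assumed raw name
})
# make sure the yearly GDP is numeric (the raw column may contain thousands separators)
df['gdp_year'] = pd.to_numeric(df['gdp_year'].astype(str).str.replace(',', ''))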

Tip: if you read a large file, use the argument chunksize=N in read_csv(); it will return an iterator that yields DataFrame objects.
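
For example, a minimal sketch (the file name and process() are placeholders):

# each chunk yielded by the iterator is a regular DataFrame of at most 100 000 rows
for chunk in pd.read_csv('very_large_file.csv', chunksize=100_000):
    process(chunk)  # replace with your own per-chunk logic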

Here is some description of the dataset.

>>> df.columns
Index(['country', 'year', 'sex', 'age', 'suicides_no', 'population',
'suicides_per_100k', 'country_year', 'HDI for year', 'gdp_year',
'gdp_capita', 'generation'],
dtype='object')

There are 101 countries, years ranging from 1985 to 2016, two genders, six generations, and six age groups. To get that information, a couple of useful methods might be handy.

  • unique() and nunique() to get unique values (or the number of unique values) inside columns.
>>> df['generation'].unique()
array(['Generation X', 'Silent', 'G.I. Generation', 'Boomers',
'Millenials', 'Generation Z'], dtype=object)
>>> df['country'].nunique()
101
  • describe() to output different statistics (e.g. min, max, mean, count) for every numerical column, and — if specified with include='all' — the number of unique elements and the top (i.e. most frequent) element for every object column.
  • head() and tail() to visualize small portions of the data frame.

With all of these methods, you can have insights pretty quickly about the tabular file you are analyzing.

Memory optimization

Before handling data, an important step is to understand the data and to choose the right types for the columns of our data frame.

Internally, Pandas stores data frames as numpy arrays, one per type (e.g. one float64 matrix, one int32 matrix).

Here are two methods that can drastically lower your memory consumption.

Pandas comes with a method memory_usage() that analyzes the memory consumption of a data frame. In the code, deep=True is specified to make sure that the actual system usage is taken into account.
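
The helper functions used in the measurements below are not reproduced in this version of the article; here is a minimal sketch of what mem_usage() and convert_df() could look like (the 50% threshold matches the behaviour described further down):

def mem_usage(df: pd.DataFrame) -> str:
    """Total memory footprint of a data frame, object contents included (deep=True)."""
    return f'{df.memory_usage(deep=True).sum() / 1024 ** 2:3.2f} MB'

def convert_df(df: pd.DataFrame, deep_copy: bool = True) -> pd.DataFrame:
    """Cast a column to 'category' when less than 50% of its values are unique."""
    new_df = df.copy(deep=deep_copy)
    for col in new_df.columns:
        if new_df[col].nunique() / len(new_df) < 0.5:
            new_df[col] = new_df[col].astype('category')
    return new_df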

Understanding column types is important. It can save you up to 90% of the memory usage with two simple things:

  1. Knowing which types your data frame is using
  2. Knowing which types exist that could be used by your data frame to reduce memory usage (e.g. float64 might create unnecessary overhead if it is used in a price column where values range between 0 and 59 with one decimal)

Apart from lowering the size of the numerical types you use (int32 instead of int64), Pandas comes with a categorical type.

If you are an R developer, you will recognize it as the same thing as the factor type.

This categorical type allows you to replace repetitive values with an index and to store the actual values someplace else. The schoolbook example is countries. Instead of storing the same strings “Switzerland” or “Poland” many times, why not simply replace them with 0 and 1 and store a dictionary?

categorical_dict = {0: 'Switzerland', 1: 'Poland'}

Pandas does pretty much the same, while adding all of the methods needed to actually use this type and still have the country names displayed.
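
As a quick, illustrative check (the actual savings depend on your data):

# cast one repetitive string column to the categorical dtype
df_cat = df.assign(country=df['country'].astype('category'))

# deep=True makes memory_usage count the string contents as well
print(df['country'].memory_usage(deep=True))
print(df_cat['country'].memory_usage(deep=True))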

Back to our method convert_df(): it automatically converts a column’s type to category if less than 50% of the values in the column are unique. This number is arbitrary, but since casting types in a data frame means moving data between numpy arrays, the gain needs to be worth it.

Let’s see what happens with our data.

>>> mem_usage(df)
10.28 MB
>>> mem_usage(df.set_index(['country', 'year', 'sex', 'age']))
5.00 MB
>>> mem_usage(convert_df(df))
1.40 MB
>>> mem_usage(convert_df(df.set_index(['country', 'year', 'sex', 'age'])))
1.40 MB

By using our “smart” converter, the data frame uses more than 7x (7.34x to be precise) less memory.

Indexing

Pandas is powerful, but that power comes at a price. When you load a DataFrame, it creates indices and stores the data inside numpy arrays. So what does this mean? Once the data frame is loaded, you can access the data incredibly fast, provided you manage your indices properly.

There are mainly two ways to access data: indexing and querying. Depending on the situation, you will prefer one or the other. However, an Index (or MultiIndex) is, in most cases, the best choice. Let’s take the following example.

>>> %%time
>>> df.query('country == "Albania" and year == 1987 and sex == "male" and age == "25-34 years"')
CPU times: user 7.27 ms, sys: 751 µs, total: 8.02 ms
# ==================
>>> %%time
>>> mi_df.loc['Albania', 1987, 'male', '25-34 years']
CPU times: user 459 µs, sys: 1 µs, total: 460 µs

What? 20x speedup?

You’re going to ask yourselves, how much time does it take to create this MultiIndex?

%%time
mi_df = df.set_index(['country', 'year', 'sex', 'age'])
CPU times: user 10.8 ms, sys: 2.2 ms, total: 13 ms

This is about 1.5x the time it takes to execute the query. If you only want to retrieve the data once (which is rarely the case), query is the right method. Otherwise, stick with indices, your CPU will thank you for that.

.set_index(drop=False) allows you to keep (rather than drop) the column(s) used as the new index

.loc[] / .iloc[] methods perform really well when you want to read a data frame, but not when you modify it. If you need to construct a data frame by hand (e.g. using loops), consider another data structure (e.g. a dictionary or a list) and create your DataFrame once you have all the data ready, as in the sketch below. Otherwise, for every new row added to your DataFrame, Pandas will update the index, which is not a simple hashmap.
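
A minimal sketch of this “collect first, build once” pattern (the columns and values are made up for illustration):

# cheap: append plain dictionaries to a Python list...
rows = []
for i in range(1000):
    rows.append({'a': i, 'b': i ** 2})

# ...and build the DataFrame (and its index) a single time at the end
df_built = pd.DataFrame(rows)

# as opposed to growing a DataFrame row by row, which updates the index at every step:
# df_slow = pd.DataFrame(columns=['a', 'b'])
# for i in range(1000):
#     df_slow.loc[i] = [i, i ** 2]

Note also that a pandas index is not a plain hashmap: as the next snippet shows, it can even contain duplicate labels.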

>>> (pd.DataFrame({'a': range(2), 'b': range(2)}, index=['a', 'a'])
...  .loc['a']
... )
   a  b
a  0  0
a  1  1

Because an index is not guaranteed to be unique or sorted, an unsorted index can reduce performance. To check whether an index is sorted, and to sort it, there are mainly two methods.

%%time
>>> mi_df.sort_index()
CPU times: user 34.8 ms, sys: 1.63 ms, total: 36.5 ms
>>> mi_df.index.is_monotonic
True

For more details, have a look at the Pandas documentation on indexing.

Method chaining

Method chaining with DataFrame is the act of chaining multiple methods that return a DataFrame (and are therefore methods of the DataFrame class). In the current version of Pandas, the reason to use method chaining is to avoid storing intermediate variables and to avoid code like the following:
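
The original snippet is not reproduced here; the following sketch illustrates the style in question (the operations themselves are illustrative, not the exact ones from the original).

df_filtered = df[df['sex'] == 'female']
df_grouped = df_filtered.groupby('country')
df_summed = df_grouped['suicides_no'].sum()
df_result = df_summed.sort_values(ascending=False).head(10)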

And replace it with the following chain.
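
Here is a sketch of the same illustrative operations written as a single chain.

df_result = (
    df
    .query('sex == "female"')
    .groupby('country')
    ['suicides_no']
    .sum()
    .sort_values(ascending=False)
    .head(10)
)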

Let’s be honest, the second code snippet is way nicer and cleaner.

The toolbox of method-chainers consists of different methods (e.g. apply, assign, loc, query, pipe, groupby, agg) that output DataFrame or Series objects (or DataFrameGroupBy).

The best way to understand what they do is to actually use them. Let’s start with some simple examples.

Here is a simple chain to get all unique generation labels per age range. In the resulting data frame, the “age” column is the index.
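
The chain itself is not reproduced here; a plausible reconstruction, based on the decomposition below, is:

(
    df
    .groupby('age')
    .agg({'generation': 'unique'})
)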

Apart from learning that “Generation X” covers three age groups, let’s decompose the chain. The first step groups the rows by age range. This returns a DataFrameGroupBy object, on which each group is aggregated by selecting its unique generation labels.

The aggregation method is, in this case, 'unique', but agg can also receive any (lambda) function.

In the latest release (v0.25), Pandas introduced a new way to use agg.
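
This new syntax is called named aggregation; a sketch of the same chain using it could look as follows.

(
    df
    .groupby('age')
    .agg(unique_generation=('generation', 'unique'))  # keyword = (column, aggfunc)
)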

Here is how to find the countries and years with the most suicides, first using sort_values and head, then using nlargest.
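
The original snippets are not reproduced here; the two sketches below are consistent with the result described next (the aggregation details are assumptions).

# using sort_values and head
(
    df
    .groupby(['country', 'year'])
    .agg(suicides_sum=('suicides_no', 'sum'))
    .sort_values('suicides_sum', ascending=False)
    .head(10)
)

# using nlargest
(
    df
    .groupby(['country', 'year'])
    .agg(suicides_sum=('suicides_no', 'sum'))
    .nlargest(10, columns='suicides_sum')
)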

In both cases, the output is the same: a DataFrame with a two-level MultiIndex (“country” and “year”) and a new column suicides_sum containing the 10 largest values, sorted.

nlargest(10) is more efficient than sort_values(ascending=False).head(10)

Another interesting method is unstack, which allows pivoting an index level.
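
A sketch of such a chain, consistent with the result described below (the aggregation is an assumption):

(
    df
    .groupby(['age', 'sex'])
    .agg({'suicides_no': 'sum', 'population': 'sum'})
    .unstack('sex')   # pivot the "sex" index level into a second column level
)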

In the resulting data frame, “age” is the index, and the columns “suicides_no” and “population” gain a second column level, “sex”.

The next method, pipe, is one of the most versatile. It allows piping operations — like in shell scripts — and performing even more operations within a chain.

One simple but powerful use of pipe is to log different pieces of information.

Here are different log functions to use with pipe.
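
The original helpers are not reproduced here; below is a minimal sketch of such logging functions (log_shape matches the shape = … lines printed further down, log_columns is a hypothetical extra).

def log_shape(df: pd.DataFrame) -> pd.DataFrame:
    print(f'shape = {df.shape}')
    return df   # return the frame unchanged so the chain can continue

def log_columns(df: pd.DataFrame) -> pd.DataFrame:
    print(f'columns = {df.columns}')
    return df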

For example, say we want to validate that the column country_year is consistent with the column year.

Pipeline to validate the year in the “country_year” column:
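
A sketch of such a pipeline, assuming country_year is the concatenation of country and year:

(
    df
    .assign(valid_country_year=lambda _df:
            _df['country_year'] == _df['country'] + _df['year'].astype(str))
    .query('valid_country_year == False')  # keep only the inconsistent rows
    .pipe(log_shape)
)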

The output of this pipeline is a DataFrame, but it also prints to the standard output (console / REPL).

shape = (0, 13)

You can also put several pipes in one chain.

Which year, per country, had the most suicides among females?
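
A sketch of such a chain, consistent with the shapes printed below (the exact aggregation is an assumption):

(
    df
    .pipe(log_shape)
    .query('sex == "female"')
    .groupby(['year', 'country'])
    .agg({'suicides_per_100k': 'sum'})
    .pipe(log_shape)
    .sort_values('suicides_per_100k', ascending=False)
)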

In the resulting DataFrame, the indexes are “year” and “country”.

The following is printed to the standard output.

shape = (27820, 12)
shape = (2321, 1)

Apart from logging to the console, pipe can also be used to apply functions directly to the data frame’s columns.

Is a decrease in GDP correlated with an increase in the suicide rate? Is it gender related?
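
The original snippet is not reproduced here; below is a sketch consistent with the printed output. norm_df() follows the description given after the output, with sklearn's minmax_scale standing in for the min-max scaling mentioned there; the grouping itself is an assumption.

from sklearn.preprocessing import minmax_scale

def norm_df(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    # {column_name: scaling_function} built by dictionary comprehension,
    # then unpacked into keyword arguments for assign()
    return df.assign(**{col: (lambda _df, c=col: minmax_scale(_df[c]))
                        for col in columns})

for sex in ['male', 'female']:
    print(sex)
    print(
        df
        .query('sex == @sex')
        .groupby(['country', 'year'])
        .agg(suicides_per_100k_sum=('suicides_per_100k', 'sum'),
             gdp_year_mean=('gdp_year', 'mean'))
        .pipe(norm_df, columns=['suicides_per_100k_sum', 'gdp_year_mean'])
        .corr()
    )
    print()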

The code above prints the following lines to the console.

male
                       suicides_per_100k_sum  gdp_year_mean
suicides_per_100k_sum               1.000000       0.421218
gdp_year_mean                       0.421218       1.000000

female
                       suicides_per_100k_sum  gdp_year_mean
suicides_per_100k_sum               1.000000       0.452343
gdp_year_mean                       0.452343       1.000000

Let’s dive into the code. norm_df() takes as input a DataFrame and a list of columns to scale using min-max scaling. Using a dictionary comprehension, it creates a dictionary {column_name: method, …}, which then gets unpacked into keyword arguments (column_name=method, ...) for assign().

In this particular case, the min-max scaling does not change the outcome of the correlation; it is only used for the sake of the argument :)

In the (far?) future, lazy evaluation could appear for method chaining, so investing in chains is a good idea.

Last (random) tips

The following tips are incredibly useful but did not fit in any of the previous sections.

  • itertuples() is much more efficient than iterrows() for iterating through the rows of a data frame.
>>> %%time
>>> for row in df.iterrows(): continue
CPU times: user 1.97 s, sys: 17.3 ms, total: 1.99 s
>>> for tup in df.itertuples(): continue
CPU times: user 55.9 ms, sys: 2.85 ms, total: 58.8 ms

Note: tup is a namedtuple

  • join() uses merge().
  • In Jupyter notebooks, %%time at the beginning of a cell will measure time effectively.
  • The nullable UInt8 dtype (note the capital U, unlike numpy’s uint8) supports NaN values alongside integers.
  • Remember that, for now, any intensive I/O (e.g. flattening a big CSV dump) will perform much better with lower-level methods (using Python core functions as much as possible).

There are also useful methods and data structures not covered in this article that are worth taking the time to understand.

Summary

Thanks to this short article, hopefully you are now able to better understand how Pandas works behind the scenes and what the current development status of the library is. You have also seen different tools to optimize the memory usage of data frames and to get quick insights into your data. Hopefully, indexing and querying are now less obscure. And finally, you may now try to write longer chains with method chaining.

As a supporting document, here is a notebook which contains, apart from all the code presented in this article, timing measurements comparing the performance of a simple numerically-indexed data frame (df) and a multi-indexed data frame (mi_df).

Practice makes the difference so keep working on your skills and help us build a better world.

PS: sometimes a pure NumPy implementation is faster (Julien Herzen ;) )
