Pandas is a Friend. But Are You Doing Justice to it?

Gatha Varma, PhD
Published in WiCDS
4 min read · Feb 21, 2021

The people in your life should lessen your stress and not be the cause of it.

Pandas: the Python package that welcomed you into the world of data science, the friend that eased you into data-crunching and story-telling. Sure, it is easy to use and helps you find your way around. But are you sure that you are using it the right way?

The function apply() is often used to construct new columns or update existing ones based on other columns present in the dataframe. It is very easy to declare a lambda function and apply it to the dataframe, and one single call statement can work on thousands of rows. But remember what that means: while the call statement might look small, a row-wise apply() calls your function once for every row under the hood.
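To make the hidden iteration visible, here is a minimal sketch that counts how often a function runs inside apply(). The dataframe, its column names and the helper row_total are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; real dataframes would have thousands of rows.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

calls = []  # one entry is appended every time the function runs


def row_total(row):
    """Compute price * qty for a single row and record the call."""
    calls.append(1)
    return row["price"] * row["qty"]


# A single, innocent-looking statement...
df["total"] = df.apply(row_total, axis=1)

# ...that ran the Python function at least once per row.
print(len(calls), "calls for", len(df), "rows")
```

Three rows are harmless, but the same per-row call happens on three million rows too.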

The next time you start to write an apply() call for that snazzy lambda function, how about you check if it can be done using built-in functionality? Is the column data numeric? Then the operation can be done using vectorization. The concept of vectorization treats columns as arrays (they are Series, after all!), and Pandas offers vectorized functions for numbers, strings and even aggregation operations. You may also achieve more elegant and faster code using a list comprehension.
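As a rough sketch of those alternatives, the snippet below computes the same column with apply(), with vectorized arithmetic and with a list comprehension, and adds a vectorized string call and an aggregation. The data and column names are invented for the example.

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({
    "name": ["ada", "grace", "linus"],
    "price": [10.0, 20.0, 30.0],
    "qty": [1, 2, 3],
})

# Row-wise apply: the lambda runs once per row.
total_apply = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Vectorized arithmetic: whole columns are multiplied as arrays.
total_vec = df["price"] * df["qty"]

# List comprehension over the zipped columns.
total_lc = [p * q for p, q in zip(df["price"], df["qty"])]

# Vectorized string function through the .str accessor.
name_upper = df["name"].str.upper()

# Vectorized aggregation.
grand_total = total_vec.sum()
```

All three totals hold the same values; which one is fastest on your data is worth timing yourself, for example with timeit.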

You might be wondering if all the functions that ship under the hood of Pandas are efficient. The answer depends on the availability of alternatives as well, and you, my friend, will have to play around to decide what suits your requirement. For instance, Pandas offers the explode() function to literally explode a column of lists, Series, tuples or ndarrays so that each element gets a row of its own. It can handle a mix of list and scalar values as well as empty lists, but at the time of writing it could only be called for a single column at a time (pandas 1.3 and later also accept a list of columns). Again, a generic vectorized function can be used to implement a more elegant solution for multiple columns.
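Here is a small sketch of explode() on a single column that mixes a list, a scalar and an empty list. The order data is invented for the example.

```python
import pandas as pd

# Hypothetical orders: a list, a bare scalar and an empty list.
df = pd.DataFrame({
    "order": ["A", "B", "C"],
    "items": [["pen", "ink"], "book", []],
})

# Each list element becomes its own row; the original index label is
# repeated, the scalar is kept as it is, and the empty list becomes NaN.
exploded = df.explode("items")
print(exploded)
```

Order A now spans two rows, B keeps its single row, and C survives with a missing value instead of being dropped.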

Yes, Pandas dataframes can be iterated over using loops. But is that a good idea? No. Real-world data is multi-dimensional and has a large number of observations, or in simple words, lots of columns and rows. Looping over it row by row is not a good idea, especially if the same output can be achieved using built-in calls.
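For comparison, a minimal sketch of the same calculation done with an explicit iterrows() loop and with built-in column operations. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Loop version: every row is materialised as a Series, one at a time.
revenue_loop = 0.0
for _, row in df.iterrows():
    revenue_loop += row["price"] * row["qty"]

# Built-in version: one column multiplication and one aggregation.
revenue_builtin = (df["price"] * df["qty"]).sum()

print(revenue_loop, revenue_builtin)
```

Both give the same number; the second one simply gets there without visiting the rows in Python.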

Python is a dynamically typed language: it infers the type of a variable from the value assigned to it. Neat. When creating or loading a Pandas dataframe, passing a type dictionary through the dtype argument is optional but a good practice to follow. Pandas will happily infer mixed data types on its own, but a type dictionary can speed up loading because that guesswork is skipped. Moreover, it helps manage memory better, since allocation is driven by the types passed when the dataframe is declared.
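A quick sketch of a type dictionary in action, assuming the data arrives as CSV. The file contents, column names and chosen types are hypothetical; an in-memory string stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "user_id,age,score,city\n1,34,0.5,Pune\n2,28,0.75,Delhi\n"

# Type dictionary: one entry per column, chosen to fit the known value ranges.
dtypes = {
    "user_id": "int32",
    "age": "int8",
    "score": "float32",
    "city": "category",
}

# Without dtype, Pandas infers the widest safe types on its own.
df_default = pd.read_csv(io.StringIO(csv_text))

# With dtype, the columns are loaded directly as the declared types.
df_typed = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

print(df_default.dtypes)
print(df_typed.dtypes)
```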

Now you may ask how the use of dtype can promote better memory management. The answer stems from the fact that Pandas stores numeric values as NumPy ndarrays and allocates contiguous blocks of memory to them. The basic numeric types are made up of different subtypes. For instance, there are three subtypes, namely float16, float32, and float64, that lie under the umbrella of type float. These subtypes need 2, 4 and 8 bytes of storage per value respectively. If you already know that the values of a certain attribute fit well within a float16, then declaring that in the dtype dictionary saves space during allocation.
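The byte counts are easy to check for yourself. The sketch below stores the same 1,000 made-up values in each float subtype and asks Pandas how much memory the values occupy.

```python
import numpy as np
import pandas as pd

# 1,000 hypothetical measurements between 0 and 1.
s64 = pd.Series(np.linspace(0.0, 1.0, 1000), dtype="float64")
s32 = s64.astype("float32")
s16 = s64.astype("float16")  # keeps only about three significant decimal digits

# Memory taken by the values alone (the index is excluded).
memory = {str(s.dtype): s.memory_usage(index=False) for s in (s64, s32, s16)}
print(memory)
```

The savings come at the cost of precision, so the smaller subtypes only make sense when the values really fit.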

Plotting straight from the dataframe is pretty much an intuitive advantage. Apart from plot(), you have other calls under the plot accessor, like area(), bar(), barh(), box(), hexbin(), hist(), kde(), density(), line(), pie() and scatter(), that save on extra code writing. Why not go for it!
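A short sketch of two of those calls, assuming matplotlib is installed, since Pandas uses it for drawing by default. The sales figures and column names are invented, and the Agg backend is chosen only so that the example also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line in a notebook

import pandas as pd

# Hypothetical monthly figures.
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar"],
    "sales": [120, 150, 90],
    "returns": [5, 8, 3],
})

# One call per chart; each returns a matplotlib Axes for further tweaking.
ax_bar = df.plot.bar(x="month", y="sales")
ax_scatter = df.plot.scatter(x="sales", y="returns")
```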

The beauty of open-source is the whole ecosystem that developers build around a technology. Pandas in itself has been an exciting development and many other packages have been added to help you leverage it to the best.

Hope you enjoyed viewing the images as much as I enjoyed making them. Looking forward to your comments and learning more in the process.

Written by Gatha Varma, PhD

Research Scientist @Censius Inc. Find more of my ramblings at: gathavarma.com