Top 3 Pandas Tricks I Wish I Knew Before

Important tips and tricks on some useful functions in Pandas

Hemant Rattey
Nerd For Tech
5 min read · May 23, 2021



Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.

Many data analysts and data scientists use Python, and they rely heavily on Pandas during the cleaning and pre-processing steps of their work.

Hence, I thought of writing this article: I struggled a lot to find solutions to these problems myself and wish I had known how to solve them earlier.

1. Percentage within each group

One of the most frustrating things I struggled with was how to find the percentage of each value within a group. After going through answers on Stack Overflow and a lot of articles, I found a solution using apply() and a lambda function.

Consider the DataFrame below:

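Since the original screenshot isn't reproduced here, below is a minimal sketch of such a DataFrame. Hemant's Order Cost values (9, 2, 4, 1) match the walkthrough that follows; the Item column and the values for the other names are made up for illustration.

```python
import pandas as pd

# A small example DataFrame; Hemant's Order Cost values match the
# walkthrough below, the other rows are made-up stand-ins.
df = pd.DataFrame({
    'Name': ['Hemant', 'Hemant', 'Hemant', 'Hemant',
             'John', 'John', 'ABC', 'XYZ'],
    'Item': ['Pizza', 'Coke', 'Fries', 'Ketchup',
             'Burger', 'Coke', 'Pizza', 'Fries'],
    'Order Cost': [9, 2, 4, 1, 5, 5, 3, 7],
})
```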

Now, given the DataFrame, I would like to find the percentage each row contributes within its group, i.e. each item's share of the group total. This can be done with groupby() combined with apply() and a lambda function.

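A sketch of the command, using the column names from the example above. Note that on pandas 2.x you need group_keys=False so the result keeps the original index (it is harmless on older versions); transform() is an equivalent, index-safe alternative.

```python
# For every row, divide its Order Cost by the sum of its group.
# group_keys=False keeps the original index so the result can be
# assigned straight back as a new column.
df['Percent'] = (
    df.groupby('Name', group_keys=False)['Order Cost']
      .apply(lambda x: x / x.sum() * 100)
)

# Equivalent alternative that always preserves the original index:
# df['Percent'] = df.groupby('Name')['Order Cost'].transform(lambda x: x / x.sum() * 100)
```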

As you can see, we first group by Name, which gives four groups: Hemant, John, ABC, and XYZ. Then, instead of aggregating the data, we apply the lambda function x / x.sum() to the Order Cost column. For example, for the group Hemant, the total order cost is 9 + 2 + 4 + 1 = 16, so the percentages are 9/16 = 56%, 2/16 = 12%, 4/16 = 25%, and 1/16 = 6% respectively. Within each group, every row (x) is divided by the sum of its group.

2. Convert a column of dictionaries into separate columns

A lot of times, especially when working with JSON, the data you get might have dictionaries stored in a column. For example, in the DataFrame below, the Pollutants column holds a dictionary with up to 3 keys: a, b, and c.

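The original table was shown as an image, so here is a hypothetical stand-in with the same shape: a column of dictionaries where some rows are missing a key (the station labels and pollutant numbers are made up).

```python
import pandas as pd

# Hypothetical data: each row stores a dict; some rows miss a key.
df = pd.DataFrame({
    'Station': ['S1', 'S2', 'S3'],
    'Pollutants': [
        {'a': 46, 'b': 3, 'c': 12},
        {'a': 36, 'b': 5},           # no 'c' key
        {'b': 8, 'c': 20},           # no 'a' key
    ],
})
```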

We can convert this dictionary column into multiple columns, with the keys as the headers, using two methods: pd.json_normalize() and .apply(pd.Series). Let's see them in action below:

a) Using pd.json_normalize()

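A sketch using the example DataFrame above:

```python
# Expand the dicts into their own DataFrame; missing keys become NaN.
expanded = pd.json_normalize(df['Pollutants'].tolist())
print(expanded)
#       a  b     c
# 0  46.0  3  12.0
# 1  36.0  5   NaN
# 2   NaN  8  20.0
```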

As you can see, after json_normalize(), wherever the key doesn’t exist, the value will be NaN, and everywhere else the values will be filled accordingly.

Since this returns a DataFrame, we can concatenate it back to the original DataFrame.

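For example, continuing with the same data (this assumes df has a default RangeIndex, so the rows line up positionally):

```python
# Drop the dict column and attach the expanded columns alongside
# the rest of the data.
df_expanded = pd.concat(
    [df.drop(columns='Pollutants'),
     pd.json_normalize(df['Pollutants'].tolist())],
    axis=1,
)
```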

b) Using .apply(pd.Series)

Similarly, the same result can be obtained by applying the pd.Series constructor to the column containing the dictionaries.
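A sketch, again on the example DataFrame:

```python
# Applying the pd.Series constructor row-wise turns each dict's keys
# into columns; missing keys again become NaN.
expanded = df['Pollutants'].apply(pd.Series)
```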

Another way to add these columns to the original DataFrame is to assign them directly, selecting a number of new columns equal to the number of keys.
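For instance, with the three keys from the example above:

```python
# Assign the expanded columns directly; the list of new column names
# must match the number of keys ('a', 'b', 'c' here).
df[['a', 'b', 'c']] = df['Pollutants'].apply(pd.Series)
```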

3. Grouping and resampling time series data

Resampling is very similar to grouping in the sense that you end up with groups of data, but resample() is used specifically for DateTime data.

When grouping a DataFrame, you group by some column, which can be of any data type and can be an index or a regular column. resample(), on the other hand, always operates on DateTime objects. Also, most of the time when working with time series data, you will set the DateTime column as the index, as it makes analysis easier.

The basic idea of resampling is that you pass a frequency alias, like 'D' for days or 'M' for months; the full list of aliases can be found in the pandas documentation.

Let's look at an example of how resampling works. Consider the following data:
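The construction code was shown as an image in the original; here is a sketch under the assumption of a 31-day daily index starting on an arbitrary date, with random integer values.

```python
import numpy as np
import pandas as pd

# 31 consecutive days as a DatetimeIndex, with random values.
# The start date and value range are assumptions for illustration.
idx = pd.date_range('2021-05-01', periods=31, freq='D')
df = pd.DataFrame({'Values': np.random.randint(0, 100, size=31)}, index=idx)
```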

This gives us a DataFrame with 31 days of data and random values in the Values column.


a) Upsampling the Data

Upsampling is the process of increasing the frequency of the DateTime index, i.e. converting a higher unit like years to a lower unit like days. For example, if you have data at yearly intervals and want it at the month level, you can pass a monthly frequency and pandas will insert rows for each month between the yearly intervals.

Let's say we want to add hourly rows to all the days. We can do this by resampling hourly; wherever a value doesn't exist, we get NaN.

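A sketch of both steps with the example data above (on pandas 2.2 and later the preferred alias is 'h' instead of 'H'):

```python
# Upsample from daily to hourly: asfreq() exposes the inserted NaNs,
# ffill() carries each day's value forward across its hours.
hourly = df.resample('H').asfreq()     # new hourly rows are NaN
hourly_filled = df.resample('H').ffill()
```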

Here, since each day starts at 00:00, hourly intervals get inserted, and we use ffill() to fill the NaNs in the dataset. Usually, you would end up interpolating or forward-filling the NaNs when you upsample data.

b) Downsampling the Data

Downsampling is the reverse of upsampling: you reduce the frequency, e.g. go from days to months, grouping the data together and applying some aggregate function.

Let's say we want to downsample the daily data to weekly data and get the average value weekly.

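A sketch with the example data above:

```python
# Downsample to weekly bins and aggregate each week with the mean.
weekly = df.resample('W').mean()
```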

Here, we have grouped the index weekly and applied mean() as the aggregate function.

These are some of the tricks I wish I had known earlier. I hope they help you out when you get stuck.

Please feel free to give me feedback.

Come say Hi! to me on Twitter.


Hemant Rattey
Data Scientist | Writing about Deep Learning and NLP | Portfolio: hemantrattey.github.io