Top 3 Pandas Tricks I Wish I Knew Before

Important tips and tricks on some useful functions in Pandas

Hemant Rattey
Nerd For Tech
5 min read · May 23, 2021



Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.

Many data analysts and data scientists use Python, and they rely heavily on Pandas during the cleaning and pre-processing steps of their work.

Hence, I thought of writing this article: I struggled a lot to find solutions to these problems myself and wish I had known how to solve them earlier.

1. Percentage within each group

One of the most frustrating things I struggled with was how to find the percentage of each value within a group. After going through answers on Stack Overflow and a lot of articles, I found a solution using apply() and a lambda function.

Consider the DataFrame below:

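Since the original screenshot isn't reproduced here, below is a minimal sketch of such a DataFrame. Hemant's Order Cost values (9, 2, 4, 1) match the walkthrough that follows; the Item column and the values for the other names are made up for illustration.

```python
import pandas as pd

# A small example DataFrame; Hemant's Order Cost values match the
# walkthrough below, the other rows are made-up stand-ins.
df = pd.DataFrame({
    'Name': ['Hemant', 'Hemant', 'Hemant', 'Hemant',
             'John', 'John', 'ABC', 'XYZ'],
    'Item': ['Pizza', 'Coke', 'Fries', 'Ketchup',
             'Burger', 'Coke', 'Pizza', 'Fries'],
    'Order Cost': [9, 2, 4, 1, 5, 5, 3, 7],
})
```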

Now, given the DataFrame, I would like to find the percentage each row contributes within its group, i.e. each item's share of the group total. This can be done with groupby() combined with apply() and a lambda function.

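A sketch of the command, using the column names from the example above. Note that on pandas 2.x you need group_keys=False so the result keeps the original index (it is harmless on older versions); transform() is an equivalent, index-safe alternative.

```python
# For every row, divide its Order Cost by the sum of its group.
# group_keys=False keeps the original index so the result can be
# assigned straight back as a new column.
df['Percent'] = (
    df.groupby('Name', group_keys=False)['Order Cost']
      .apply(lambda x: x / x.sum() * 100)
)

# Equivalent alternative that always preserves the original index:
# df['Percent'] = df.groupby('Name')['Order Cost'].transform(lambda x: x / x.sum() * 100)
```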

As you can see, we first group by Name, which gives four groups: Hemant, John, ABC, and XYZ. Then, instead of aggregating the data, we apply the lambda function x / x.sum() to the Order Cost column. For example, for the group Hemant, the total order cost is 9 + 2 + 4 + 1 = 16, so the percentages are 9/16 = 56%, 2/16 = 12%, 4/16 = 25%, and 1/16 = 6% respectively. Within each group, every row (x) is divided by the sum of its group.

2. Convert a column of dictionaries into separate columns

A lot of times, especially when working with JSON, the data you get might have dictionaries stored in a column. For example, in the DataFrame below, the Pollutants column holds a dictionary with up to 3 keys: a, b, and c.

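The original table was shown as an image, so here is a hypothetical stand-in with the same shape: a column of dictionaries where some rows are missing a key (the station labels and pollutant numbers are made up).

```python
import pandas as pd

# Hypothetical data: each row stores a dict; some rows miss a key.
df = pd.DataFrame({
    'Station': ['S1', 'S2', 'S3'],
    'Pollutants': [
        {'a': 46, 'b': 3, 'c': 12},
        {'a': 36, 'b': 5},           # no 'c' key
        {'b': 8, 'c': 20},           # no 'a' key
    ],
})
```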

We can convert this dictionary column into multiple columns, with the keys as the headers, using two methods: pd.json_normalize() and .apply(pd.Series). Let's see them in action below:

a) Using pd.json_normalize()

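A sketch using the example DataFrame above:

```python
# Expand the dicts into their own DataFrame; missing keys become NaN.
expanded = pd.json_normalize(df['Pollutants'].tolist())
print(expanded)
#       a  b     c
# 0  46.0  3  12.0
# 1  36.0  5   NaN
# 2   NaN  8  20.0
```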

As you can see, after json_normalize(), wherever the key doesn’t exist, the value will be NaN, and everywhere else the values will be filled accordingly.

Since this returns a DataFrame, we can concatenate it back to the original DataFrame.

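For example, continuing with the same data (this assumes df has a default RangeIndex, so the rows line up positionally):

```python
# Drop the dict column and attach the expanded columns alongside
# the rest of the data.
df_expanded = pd.concat(
    [df.drop(columns='Pollutants'),
     pd.json_normalize(df['Pollutants'].tolist())],
    axis=1,
)
```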

b) Using .apply(pd.Series)

Similarly, the same result can be obtained by applying the pd.Series constructor to the column containing the dictionaries.
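A sketch, again on the example DataFrame:

```python
# Applying the pd.Series constructor row-wise turns each dict's keys
# into columns; missing keys again become NaN.
expanded = df['Pollutants'].apply(pd.Series)
```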

Another way to add these columns to the original DataFrame is to assign them directly, selecting a number of new columns equal to the number of keys.
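For instance, with the three keys from the example above:

```python
# Assign the expanded columns directly; the list of new column names
# must match the number of keys ('a', 'b', 'c' here).
df[['a', 'b', 'c']] = df['Pollutants'].apply(pd.Series)
```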

3. Grouping and resampling time series data

Resampling is very similar to grouping in the sense that you end up with groups of data, but resample() is used specifically for DateTime data.

When grouping a DataFrame, you group by some column, which can be of any data type and can be an index or a regular column. resample(), on the other hand, always operates on DateTime objects. Also, most of the time when working with time series data, you will set the DateTime column as the index, as it makes analysis easier.

The basic idea of resampling is that you pass a frequency alias, like 'D' for days or 'M' for months; the full list of aliases can be found in the pandas documentation.

Let's look at an example of how resampling works. Consider the following data:
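The construction code was shown as an image in the original; here is a sketch under the assumption of a 31-day daily index starting on an arbitrary date, with random integer values.

```python
import numpy as np
import pandas as pd

# 31 consecutive days as a DatetimeIndex, with random values.
# The start date and value range are assumptions for illustration.
idx = pd.date_range('2021-05-01', periods=31, freq='D')
df = pd.DataFrame({'Values': np.random.randint(0, 100, size=31)}, index=idx)
```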

This gives us a DataFrame with 31 days of data and random values in the Values column.


a) Upsampling the Data

Upsampling is the process of increasing the frequency of the DateTime index, i.e. converting a higher unit like years to a lower unit like days. For example, if you have data at yearly intervals and want it at the month level, you can pass a monthly frequency and pandas will insert rows for each month between the yearly intervals.

Let's say we want to add hourly rows to all the days. We can do this by resampling hourly; wherever a value doesn't exist, we get NaN.

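A sketch of both steps with the example data above (on pandas 2.2 and later the preferred alias is 'h' instead of 'H'):

```python
# Upsample from daily to hourly: asfreq() exposes the inserted NaNs,
# ffill() carries each day's value forward across its hours.
hourly = df.resample('H').asfreq()     # new hourly rows are NaN
hourly_filled = df.resample('H').ffill()
```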

Here, since each day starts at 00:00, hourly intervals get inserted, and we use ffill() to fill the NaNs in the dataset. Usually, you would end up interpolating or forward-filling the NaNs when you upsample data.

b) Downsampling the Data

Downsampling is the reverse of upsampling: you reduce the frequency, e.g. go from days to months, grouping the data together and applying some aggregate function.

Let's say we want to downsample the daily data to weekly data and get the average value weekly.

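A sketch with the example data above:

```python
# Downsample to weekly bins and aggregate each week with the mean.
weekly = df.resample('W').mean()
```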

Here, we have grouped the index weekly and applied mean() as the aggregate function.

These are some of the tricks I wish I had known earlier. I hope they help you out when you get stuck.

Please feel free to give me feedback.

Come say Hi! to me on Twitter.


Hemant Rattey
Data Scientist | Writing about Deep Learning and NLP | Portfolio: hemantrattey.github.io