5 amazing things about pandas!

Published in

Analytics Vidhya

6 min readJul 18, 2020

NO! not this one. I am talking about the python library that has changed the way we work with data, “Pandas”. With the growing popularity and adaptation of python, pandas came in as an efficient and handy tool for working with data. Over the past decade, this open-source tool has developed immensely offering better code intuitive and user-friendly environment to developers. While every functionality offered by pandas is unique in itself, there are some lesser-known and implemented functions that often come in handy. You probably will not use these functions every time, but they might save your efforts and time to solve any critical task in some rare projects. Without further ado, let's jump right onto them.

1. Pivot

The ‘pivot’ function helps the user to reshape the dataframe and make the resultant dataframe focus on a particular column. Hence the target column becomes our axis and navigating data becomes much simpler. Let's try understanding this with an example, assume we have the following data

As we can see in the above data there are 5 unique students and their scores in 3 subjects are recorded above. While the data is structured navigating through it is tedious. Using the pivot function we can convert this data into a data frame that is focused on students and makes reading the data very easy.

data.pivot(index='Student_name',columns='Subject',values='Marks')

The pivot function takes 3 values

index: The target column, the dataframe is reshaped on its basis. Every unique value in this column will become axes of the resultant dataframe.
columns: The column that will act as a header to the new dataframe.
values: the respective resultant data of the index and columns that will populate the data frame, in case there is no data the result will be ‘NaN’.

Hence the result of pivot action on our input data will be like this:

You can see the result is much easier to navigate and understand. The axes of resultant dataframe are the unique values from the specified ‘index’, while the unique values listed in specified ‘columns’ become the header. ‘values’ describe the relationship between index and columns. You can also add multiple values to the resultant dataframe like this

data.pivot(index='Student_name',columns='Subject',values=['Marks','Grade'])

The output dataframe will have sections for each column specified in values like this

One important thing to note is “pivot DOESNT SUPPORT AGGREGATION, hence multiple values will result in multiple indexes”.

2. Pivot table

The “pivot table” function could be understood as an extension to the “pivot” function, but it does more than just reshaping the dataframe. Using pivot_table we can summarize the data in a more meaningful approach. Unlike the pivot function which is a method of class dataframe, the pivot table is a method of pandas itself. The pivot table is used to perform tasks similar to pivot, with an additional benefit of aggregating the data. Let's understand using the same above example

pd.pivot_table(data,values='Marks',index='Subject',columns='Grade',
aggfunc='sum')

The parameters values, index, and columns are the same as the pivot function, ‘data’ is the object that holds the dataframe and ‘aggfunc’ is used to specify the data aggregation method to be implemented. We can use basic aggregation methods used are ‘sum’, ‘mean’, ‘median’, etc. The output dataframe will look something like this

The dataframe adds all the marks of students falling in the respective grade category for each subject. Hence we can generate further insights into the data by using pivot_table and aggregation function. Unlike pivot function that requires all 3 parameters( index, columns, and values), pivot_table can be used without values and columns

pd.pivot_table(data,values='Marks',index='Subject',aggfunc='sum')
(or)
pd.pivot_table(data,index='Subject',aggfunc='sum')

The output in both cases will be the same, like this

The dataframe sums all the marks of all subject

Similarly, unlike pivot, we can use pivot_table with multiple indexes. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.

pd.pivot_table(data, values='Marks',
index=['Subject','Student_name'], aggfunc='sum')

The multi-index output will be

Apart from these, pivot_table also offers features like ‘dropna’ and ‘fill_value’ that are worth experimenting with.

3. Arithmetic operation on dataframe

As the self-explanatory heading suggests, often we are required to perform various arithmetic operations of the entire dataframe or some specified columns. Instead of looping and iterating over each element, implementing arithmetic operations directly to the target column makes the task easy and saves a lot of time. Let's consider the same student-mark example, what if you were to add another column ‘percentage’ to the dataframe that consist percentage obtained in that specific subject, it could be done as simply as

data['Percentage'] = data['Marks']/100

This will add a new column percentage to our dataframe, and the result will look like:

While such arithmetic operation can be done directly using common mathematical symbols, pandas dataframe have following inbuilt method to do the same

dataframe.add()
dataframe.sub()
dataframe.mul()
dataframe.div()

So what’s the difference between mathematical operators and built-in methods? The built-in methods help to substitute a “fill_value” for missing data.

Both methods enable us to perform arithmetic operations between two dataframes, dataframe and series, dataframe and sequence, and dataframe and scaler. Pandas also offer more such arithmetic and comparative operators.

4. Apply

“apply” is a dataframe method that allows users to implement any pre-defined or user-defined functions on any row or column. This provides the same functionality as offered in Microsoft Excel, where users can define a formula that takes available data as input and generates results to populate the specified column. This functionality is remarkably a time saver. Let's understand this using the same example. We will generate a new column called ‘Notes’ by performing conditional formating using students’ grades. For this purpose, we will define our own function

def notes(x):
   if str(x) == 'C':
       return 'Poor result'
   elif str(x) == 'B':
       return 'Good'
   elif str(x) == 'A':
       return 'Excellent'data['Notes'] = data['Grade'].apply(notes)

The resultant dataframe will have an additional column ‘Notes’ like this

We can also integrate “apply” with “lambda”. If you haven't heard about (or forgotten) what ‘lambda’ is, lambda is a keyword used to define anonymous function ie; function without any name, in python. Lambda is a single line function that can take any number of arguments but it can have only one expression. It behaves exactly like the function we define using keyword def. Lambda is generally used in situations where the function is small and probably not going to be used again in code. We can perform the same task as above using lambda in apply

data['Notes'] = data['Grade'].apply(lambda x: 
'Poor result' if x == 'C' else ('Good' if x == 'B' else'Excellent'))

Note: the only way to implement “elif” in case if lambda is by using nested if-else condition

5. Is Unique

Now, this is something, maybe, many of you are familiar with. “is_unique” method takes columns name as input and returns True if all the values in the column are unique, or else False. It’s a really simple yet useful function, often while working on ETL processes or data modeling you are required to identify columns that are unique. Depending on the task on hand such columns can be either dropped or counted in. The is_unique function works with panda’s series datatype.

test = pd.Series(data=[1,2,3,4,5,6,7])
test.is_unique#output
#True

We can also take any specific column from the dataframe and test if it’s unique.

data['Student_name'].is_unique#output
#False

So these were some simple, useful, and time-saving functionality in pandas that will make help you optimize your code. Such an efficient and code intuitive eco-system helps users to focus more on implementing rather than coding.

References

https://pandas.pydata.org/

5 amazing things about pandas!

1. Pivot

2. Pivot table

3. Arithmetic operation on dataframe

4. Apply

5. Is Unique

Published in Analytics Vidhya

Written by om patel

No responses yet