5 amazing things about pandas!
NO! not this one. I am talking about the python library that has changed the way we work with data, “Pandas”. With the growing popularity and adaptation of python, pandas came in as an efficient and handy tool for working with data. Over the past decade, this open-source tool has developed immensely offering better code intuitive and user-friendly environment to developers. While every functionality offered by pandas is unique in itself, there are some lesser-known and implemented functions that often come in handy. You probably will not use these functions every time, but they might save your efforts and time to solve any critical task in some rare projects. Without further ado, let's jump right onto them.
1. Pivot
The ‘pivot’ function helps the user to reshape the dataframe and make the resultant dataframe focus on a particular column. Hence the target column becomes our axis and navigating data becomes much simpler. Let's try understanding this with an example, assume we have the following data
As we can see in the above data there are 5 unique students and their scores in 3 subjects are recorded above. While the data is structured navigating through it is tedious. Using the pivot function we can convert this data into a data frame that is focused on students and makes reading the data very easy.
data.pivot(index='Student_name',columns='Subject',values='Marks')
The pivot function takes 3 values
- index: The target column, the dataframe is reshaped on its basis. Every unique value in this column will become axes of the resultant dataframe.
- columns: The column that will act as a header to the new dataframe.
- values: the respective resultant data of the index and columns that will populate the data frame, in case there is no data the result will be ‘NaN’.
Hence the result of pivot action on our input data will be like this:
You can see the result is much easier to navigate and understand. The axes of resultant dataframe are the unique values from the specified ‘index’, while the unique values listed in specified ‘columns’ become the header. ‘values’ describe the relationship between index and columns. You can also add multiple values to the resultant dataframe like this
data.pivot(index='Student_name',columns='Subject',values=['Marks','Grade'])
The output dataframe will have sections for each column specified in values like this
One important thing to note is “pivot DOESNT SUPPORT AGGREGATION, hence multiple values will result in multiple indexes”.
2. Pivot table
The “pivot table” function could be understood as an extension to the “pivot” function, but it does more than just reshaping the dataframe. Using pivot_table we can summarize the data in a more meaningful approach. Unlike the pivot function which is a method of class dataframe, the pivot table is a method of pandas itself. The pivot table is used to perform tasks similar to pivot, with an additional benefit of aggregating the data. Let's understand using the same above example
pd.pivot_table(data,values='Marks',index='Subject',columns='Grade',
aggfunc='sum')
The parameters values, index, and columns are the same as the pivot function, ‘data’ is the object that holds the dataframe and ‘aggfunc’ is used to specify the data aggregation method to be implemented. We can use basic aggregation methods used are ‘sum’, ‘mean’, ‘median’, etc. The output dataframe will look something like this
The dataframe adds all the marks of students falling in the respective grade category for each subject. Hence we can generate further insights into the data by using pivot_table and aggregation function. Unlike pivot function that requires all 3 parameters( index, columns, and values), pivot_table can be used without values and columns
pd.pivot_table(data,values='Marks',index='Subject',aggfunc='sum')
(or)
pd.pivot_table(data,index='Subject',aggfunc='sum')
The output in both cases will be the same, like this
The dataframe sums all the marks of all subject
Similarly, unlike pivot, we can use pivot_table with multiple indexes. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.
pd.pivot_table(data, values='Marks',
index=['Subject','Student_name'], aggfunc='sum')
The multi-index output will be
Apart from these, pivot_table also offers features like ‘dropna’ and ‘fill_value’ that are worth experimenting with.
3. Arithmetic operation on dataframe
As the self-explanatory heading suggests, often we are required to perform various arithmetic operations of the entire dataframe or some specified columns. Instead of looping and iterating over each element, implementing arithmetic operations directly to the target column makes the task easy and saves a lot of time. Let's consider the same student-mark example, what if you were to add another column ‘percentage’ to the dataframe that consist percentage obtained in that specific subject, it could be done as simply as
data['Percentage'] = data['Marks']/100
This will add a new column percentage to our dataframe, and the result will look like:
While such arithmetic operation can be done directly using common mathematical symbols, pandas dataframe have following inbuilt method to do the same
- dataframe.add()
- dataframe.sub()
- dataframe.mul()
- dataframe.div()
So what’s the difference between mathematical operators and built-in methods? The built-in methods help to substitute a “fill_value” for missing data.
Both methods enable us to perform arithmetic operations between two dataframes, dataframe and series, dataframe and sequence, and dataframe and scaler. Pandas also offer more such arithmetic and comparative operators.
4. Apply
“apply” is a dataframe method that allows users to implement any pre-defined or user-defined functions on any row or column. This provides the same functionality as offered in Microsoft Excel, where users can define a formula that takes available data as input and generates results to populate the specified column. This functionality is remarkably a time saver. Let's understand this using the same example. We will generate a new column called ‘Notes’ by performing conditional formating using students’ grades. For this purpose, we will define our own function
def notes(x):
if str(x) == 'C':
return 'Poor result'
elif str(x) == 'B':
return 'Good'
elif str(x) == 'A':
return 'Excellent'data['Notes'] = data['Grade'].apply(notes)
The resultant dataframe will have an additional column ‘Notes’ like this
We can also integrate “apply” with “lambda”. If you haven't heard about (or forgotten) what ‘lambda’ is, lambda is a keyword used to define anonymous function ie; function without any name, in python. Lambda is a single line function that can take any number of arguments but it can have only one expression. It behaves exactly like the function we define using keyword def. Lambda is generally used in situations where the function is small and probably not going to be used again in code. We can perform the same task as above using lambda in apply
data['Notes'] = data['Grade'].apply(lambda x:
'Poor result' if x == 'C' else ('Good' if x == 'B' else'Excellent'))
Note: the only way to implement “elif” in case if lambda is by using nested if-else condition
5. Is Unique
Now, this is something, maybe, many of you are familiar with. “is_unique” method takes columns name as input and returns True if all the values in the column are unique, or else False. It’s a really simple yet useful function, often while working on ETL processes or data modeling you are required to identify columns that are unique. Depending on the task on hand such columns can be either dropped or counted in. The is_unique function works with panda’s series datatype.
test = pd.Series(data=[1,2,3,4,5,6,7])
test.is_unique#output
#True
We can also take any specific column from the dataframe and test if it’s unique.
data['Student_name'].is_unique#output
#False
So these were some simple, useful, and time-saving functionality in pandas that will make help you optimize your code. Such an efficient and code intuitive eco-system helps users to focus more on implementing rather than coding.
References