Fastest Way to Learn Pandas — A Practical Guide — Part 2
Let’s finish and get ready for real implementation
Pandas is an essential package for data engineers, data analysts, and data scientists. Pandas is an easy to use python package library for data manipulation and analysis.
This is the second part of the series Pandas — A Practical Guide. In Part 1, you have learned about how to read data, get the structure information, filtering, and sorting data. You can check part 1 of this series here:
I will use the same dataset as in previous post. The dataset contains 10 columns about provinces in Indonesia
Summarizing Data
Pandas provides aggregation statistical functions, like count, sum, min, max and more. These functions can be applied to columns. For example count() function. This function can be used to count the number of records for each column.
But keep in mind, count() only counts for records that are not null. In the example, the number of records for each column is the same, which is 34 and none of them has a null value.
If we want to count the number of records for a specific column, the count() function can be used as below
or it could be written as follows
In the same way, we can use different statistical functions like sum(), min(), max(), mean() and so on.
In the output above, you can see province and island column have slightly different result. This happens because those columns are not numeric type
Another usage of statistical function
Grouping
Like SQL, Pandas has groupby() function to summarize columns value based on unique values according to the selected column.
For example, we want to count the number of records in time_zone grouped by their unique values
From the displayed data above, it can be seen that time_zone has 3 unique values: 1, 2 and 3
Other summary functions such as sum, min, max, mean, first, last, can be used in groupby() to get the statistical value for each group. Suppose we want to get the first value for each time_zone
Calculation total amount of rainfall for each time_zone can be done as below
Calculation total amount of rainfall and expenses_food_urban for each time_zone can be done as below
To perform multiple statistical calculations grouped based on the unique value of a column, you can combine groupby() and agg() functions.
Calculate each time_zone with a different summary function for each column, shown as below
NamedAgg() function can be used to change the name of a column, making it easier to understand
We can do it in another way
Column Transformation
Another thing that is often done is to perform column transformation. For example adding new columns from certain calculated results. We will add a new column called expenses_urban from the sum of expenses_food_urban and expenses_other_urban
To delete column, you can do this with drop() function
Changing the column order can be done in a simple way as follows
Join With Reference Data
In many cases, we often add columns with new data using reference data. The use of reference data is usually to make data maintenance easier, so we don’t need to change the code. As an example, we will add a zone time name, from the timezone reference data.
In real case, we don’t use reference data which is generated using code like above. Reference data using files obtained from the database. The Dataframe above has 2 columns, time_zone which has the same type with time_zone column on df dataframe. The second column is the name of time zone.
In this example, we will transform the time_zone in df dataframe by adding a new column, zone_name. The function used is merge with how = ‘left’ parameter. This means that we will do a left join between df and timezone_df
Before You Go
Reading data, checking, filtering, transforming and joining using pandas package in python is important and often done by data scientist, data analyst and data engineer.
Trying with more complex datasets as an exercise that can help get used to processing data using the Pandas package.
For further lessons visit
https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html
Notebook in this post available in my github.
Please connect on LinkedIn if you want to have a chat!
Enjoy learning and have fun with data !