# Fastest Way to Learn Pandas — A Practical Guide — Part 2

Let’s finish and get ready for real implementation

Pandas is an essential package for data engineers, data analysts, and data scientists. Pandas is an easy to use python package library for data manipulation and analysis.

This is the second part of the series Pandas — A Practical Guide. In Part 1, you have learned about how to read data, get the structure information, filtering, and sorting data. You can check part 1 of this series here:

I will use the same dataset as in previous post. The dataset contains 10 columns about provinces in Indonesia

# Summarizing Data

Pandas provides aggregation statistical functions, like count, sum, min, max and more. These functions can be applied to columns. For example **count()** function. This function can be used to count the number of records for each column.

But keep in mind, **count()** only counts for records that are not null. In the example, the number of records for each column is the same, which is 34 and none of them has a null value.

If we want to count the number of records for a specific column, the **count()** function can be used as below

or it could be written as follows

In the same way, we can use different statistical functions like **sum()**, **min()**, **max()**, **mean()** and so on.

In the output above, you can see **province** and **island** column have slightly different result. This happens because those columns are not numeric type

Another usage of statistical function

# Grouping

Like SQL, Pandas has **groupby()** function to summarize columns value based on unique values according to the selected column.

For example, we want to count the number of records in **time_zone** grouped by their unique values

From the displayed data above, it can be seen that **time_zone **has 3 unique values: 1, 2 and 3

Other summary functions such as sum, min, max, mean, first, last, can be used in **groupby()** to get the statistical value for each group. Suppose we want to get the first value for each **time_zone**

Calculation total amount of rainfall for each **time_zone **can be done as below

Calculation total amount of rainfall and **expenses_food_urban **for each **time_zone **can be done as below

To perform multiple statistical calculations grouped based on the unique value of a column, you can combine **groupby()** and **agg()** functions.

Calculate each time_zone with a different summary function for each column, shown as below

**NamedAgg()** function can be used to change the name of a column, making it easier to understand

We can do it in another way

# Column Transformation

Another thing that is often done is to perform column transformation. For example adding new columns from certain calculated results. We will add a new column called **expenses_urban **from the sum of **expenses_food_urban** and **expenses_other_urban**

To delete column, you can do this with **drop()** function

Changing the column order can be done in a simple way as follows

# Join With Reference Data

In many cases, we often add columns with new data using reference data. The use of reference data is usually to make data maintenance easier, so we don’t need to change the code. As an example, we will add a zone time name, from the timezone reference data.

In real case, we don’t use reference data which is generated using code like above. Reference data using files obtained from the database. The Dataframe above has 2 columns, **time_zone **which has the same type with **time_zone** column on **df** dataframe. The second column is the name of time zone.

In this example, we will transform the **time_zone **in **df** dataframe by adding a new column, zone_name. The function used is **merge** with how = ‘left’ parameter. This means that we will do a **left join** between df and timezone_df

# Before You Go

Reading data, checking, filtering, transforming and joining using pandas package in python is important and often done by data scientist, data analyst and data engineer.

Trying with more complex datasets as an exercise that can help get used to processing data using the Pandas package.

For further lessons visit

https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html

Notebook in this post available in my github.

Please connect on LinkedIn if you want to have a chat!*Enjoy learning and have fun with data !*