Powering Up Your Pandas Part I — Understanding The Essentials

Published in

Data Folks Indonesia

5 min read · Jan 11, 2022

Pandas is an essential package for data scientists who work with data. It is usually used together with NumPy and Matplotlib. We typically organize data into rows and columns. This layout is known as rectangular or tabular data, and in pandas we call it a DataFrame.

A DataFrame is a two-dimensional data structure that consists of rows and columns in tabular form. The most common example for beginner data scientists is the ‘Iris Flower Dataset’. You can find the data on Kaggle.

In this dataset, every species is described by four characteristics. That is only 150 rows with five columns, so it is still easy to understand. But have you ever tried organizing data with more than ten thousand rows? Doing it manually wastes your time and energy, and this is where the pandas package can help a lot.

In this series, I want to share what I know about pandas. I will split it into parts because we couldn’t build the Eiffel Tower in just one night, right? To master anything, you should take it step by step. 😊

In this first part, let’s go through the essential functions used in pandas. You can code along while reading the article by accessing this notebook.

Basic functions

Importing pandas

To use the pandas package, you need to import it first. By convention, most data scientists alias it as ‘pd’, like this.

import pandas as pd

Reading the dataset

After importing the package, you can read the data you need using one of the read functions. Pandas supports reading several types of files.

The most commonly used file format is CSV (Comma-Separated Values), which is read with read_csv(). Here the data is saved in the ‘df’ variable.

df = pd.read_csv('IRIS.csv')
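If you don't have the file at hand, here is a minimal sketch you can still run. It assumes a small inline CSV text in place of 'IRIS.csv'; the column names and the three rows are made up for illustration and may not match the Kaggle file exactly.

```python
import io

import pandas as pd

# Hypothetical inline CSV text standing in for a file such as 'IRIS.csv'.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# read_csv accepts a file path or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

With a real file, you would pass the path instead of the StringIO object, as in the line above.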

Display all data

After reading the data with pandas, it is stored as a data frame, so you can display the whole table by calling the variable you used. For a large data frame, a notebook usually shows a truncated view.

df

Head()

Head is a simple method to display the first five rows of the data frame.

df.head() # Displaying first 5 rows

Five is just the default number for the method, and you can change it to any number you want.

df.head(10) # Displaying first 10 rows
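To try head() without a dataset, here is a small sketch on a made-up 12-row data frame (the 'id' and 'value' columns are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical 12-row data frame, only for illustration.
df = pd.DataFrame({
    "id": range(1, 13),
    "value": [x * 0.5 for x in range(1, 13)],
})

first_five = df.head()    # default: first 5 rows
first_three = df.head(3)  # first 3 rows

print(first_five)
print(first_three)
```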

Tail()

Tail is a simple method to display the last five rows of the data frame.

df.tail() # Displaying last 5 rows

Just like with head, you can adjust the number of rows you want to display by changing the method’s argument.

df.tail(10) # Displaying last 10 rows
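The same kind of sketch works for tail(). The data frame below is made up for illustration; note that the rows keep their original index labels.

```python
import pandas as pd

# Hypothetical 12-row data frame, only for illustration.
df = pd.DataFrame({
    "id": range(1, 13),
    "value": [x * 0.5 for x in range(1, 13)],
})

last_five = df.tail()    # default: last 5 rows
last_three = df.tail(3)  # last 3 rows

# The returned rows keep the index labels they had in df.
print(last_five)
print(last_three)
```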

Shape

As I said before, a data frame consists of rows and columns. You can quickly get the number of rows and columns with the shape attribute, which returns them as a (rows, columns) tuple.

df.shape
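Because shape is a plain tuple, you can unpack it into two variables. A minimal sketch on a made-up data frame:

```python
import pandas as pd

# Hypothetical data frame with 3 rows and 2 columns.
df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# shape is an attribute, not a method, so there are no parentheses.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```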

Columns

Sometimes a data frame has more than ten columns, and our monitor can’t display them all well. To get all the column names, you can read them using this attribute.

df.columns
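The attribute returns a pandas Index object. If you prefer a plain Python list, you can convert it, as in this small sketch on a made-up data frame:

```python
import pandas as pd

# Hypothetical data frame, only for illustration.
df = pd.DataFrame({
    "sepal_length": [5.1, 7.0],
    "sepal_width": [3.5, 3.2],
    "species": ["Iris-setosa", "Iris-versicolor"],
})

column_index = df.columns        # a pandas Index object
column_names = list(df.columns)  # a plain Python list of names

print(column_index)
print(column_names)
```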

Isnull() / isna()

Data is not always perfect. Sometimes information is missing, and it can negatively affect your machine learning model. To count the missing values in each column, you can use the isnull() or isna() method and combine it with the sum() method. The two methods do the same thing.

df.isnull().sum() # or, equivalently: df.isna().sum()
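Here is a minimal sketch of how the counting works, using a made-up data frame with a few missing entries. isna() marks each cell as True or False, and sum() adds up the True values per column.

```python
import pandas as pd

# Hypothetical data frame with some missing entries (None).
df = pd.DataFrame({
    "sepal_length": [5.1, None, 6.3],
    "species": ["Iris-setosa", None, None],
})

# True where a value is missing, then summed per column.
missing_per_column = df.isna().sum()

# Summing once more gives the total for the whole data frame.
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```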

Duplicated()

Again, data is not always perfect. Besides missing values, rows can also be duplicated. Both can have a bad effect on the model. Just like finding the missing values, by combining the duplicated() method with the sum() method, we can count the duplicated rows in the data.

df.duplicated().sum()
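A small sketch on a made-up data frame shows what gets counted. By default, duplicated() marks a row as True when an identical row already appeared earlier, so the first occurrence itself is not counted.

```python
import pandas as pd

# Hypothetical data frame: rows 1 and 3 repeat row 0.
df = pd.DataFrame({
    "species": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-setosa"],
    "sepal_length": [5.1, 5.1, 6.3, 5.1],
})

duplicate_flags = df.duplicated()       # False, True, False, True
n_duplicates = int(duplicate_flags.sum())

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = df.drop_duplicates()

print(n_duplicates)
print(deduplicated)
```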

Dtypes

To see the data type of each column in a data frame, you can use the dtypes attribute. Several types are commonly used in a data frame, such as:

object: This type holds mixed kinds of data stored in a column; strings are usually stored as object (“Cold”, “Old”, “New”, etc.).
float64: This type holds floating-point values (0.5, 4.1, 2.2, etc.).
int64: This type holds integer values (1, 100, 5000, etc.).
datetime64[ns]: This type holds datetime values (“2021-03-10”, “2021-04-15”, “2021-05-20”, etc.).

df.dtypes
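To see these types side by side, here is a sketch with one made-up column per type, using the example values from the list above. The column names are assumptions, and the exact labels printed for the string and datetime columns may differ between pandas versions.

```python
import pandas as pd

# Hypothetical data frame with one column per type from the list above.
df = pd.DataFrame({
    "condition": ["Cold", "Old", "New"],
    "ratio": [0.5, 4.1, 2.2],
    "quantity": [1, 100, 5000],
    "date": pd.to_datetime(["2021-03-10", "2021-04-15", "2021-05-20"]),
})

# One dtype per column.
print(df.dtypes)

# select_dtypes() keeps only the columns of the requested kind.
numeric_only = df.select_dtypes(include="number")
print(numeric_only.columns.tolist())
```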

Values

A data frame holds two-dimensional data; to get the underlying values as a NumPy array, you can use the values attribute.

df.values
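A minimal sketch on a made-up, all-numeric data frame. The pandas documentation recommends the to_numpy() method for the same job, so both are shown here.

```python
import pandas as pd

# Hypothetical all-float data frame, only for illustration.
df = pd.DataFrame({
    "sepal_length": [5.1, 7.0],
    "sepal_width": [3.5, 3.2],
})

array_from_values = df.values        # NumPy array, one inner row per data frame row
array_from_method = df.to_numpy()    # same content via the recommended method

print(array_from_values)
print(array_from_values.shape)
```

The column names and the index are not part of the array; only the cell values are.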

Unique()

A column sometimes has many repeated values; if you want to find the unique values in a column, you can get them with the unique() method.

df['column_name'].unique() # or: df.column_name.unique()

With the unique() method, it is easier to see which distinct values a column contains.
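Here is a small sketch with a made-up 'species' column. The values come back in the order they first appear, and nunique() gives just the count.

```python
import pandas as pd

# Hypothetical column with repeated values.
df = pd.DataFrame({
    "species": [
        "Iris-setosa",
        "Iris-setosa",
        "Iris-versicolor",
        "Iris-virginica",
        "Iris-versicolor",
    ],
})

species = df["species"].unique()     # bracket access
same_result = df.species.unique()    # attribute access, same values
n_species = df["species"].nunique()  # number of distinct values

print(list(species))
print(n_species)
```

Attribute access only works when the column name is a valid Python identifier, so bracket access is the safer habit.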

Info()

If you are an efficient person who wants simplicity, you can run the info() method. It prints the number of rows and columns, the names of the columns, the count of non-null values per column (which reveals the missing values), the data type of each column, and the memory usage of the data frame.

df.info()
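info() prints its summary instead of returning it. If you want the text as a string, one option is to pass a buffer, as in this sketch on a made-up data frame:

```python
import io

import pandas as pd

# Hypothetical data frame with one missing value.
df = pd.DataFrame({
    "sepal_length": [5.1, None, 6.3],
    "species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# By default info() writes to the screen; buf redirects it.
buffer = io.StringIO()
df.info(buf=buffer)
summary_text = buffer.getvalue()

print(summary_text)
```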

Describe()

Pandas provides a statistical summary for every numeric column, including the count, mean, standard deviation, min value, max value, and percentiles. Simply call the describe() method to get it.

df.describe()
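A minimal sketch on a made-up data frame. The result is itself a data frame, so you can pick single statistics out of it. By default, non-numeric columns such as 'label' below are left out.

```python
import pandas as pd

# Hypothetical data frame with one numeric and one text column.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "label": ["a", "b", "c", "d"],
})

summary = df.describe()

print(summary)
print(summary.loc["mean", "x"])  # a single statistic for one column
```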

Conclusion

We have finished the first part of powering up your pandas. I hope you can use these functions wisely to save your time and energy. If you want to follow along with the notebook, access my notebooks here.

Next Work

In the next part, I want to share how to sort and subset data frames. See you there.

Source

DataCamp