Powering Up Your Pandas Part I — Understanding The Essentials
Pandas is an essential package built for data scientists to explore their data, and it is usually used together with NumPy and Matplotlib. Traditionally, we organize data into rows and columns. In data work, this is known as rectangular/tabular data, and in pandas it is stored in a structure called a DataFrame.
A DataFrame is a two-dimensional data structure consisting of rows and columns in a tabular shape. The most common example for beginner data scientists is the ‘Iris Flower Dataset’, which you can download from Kaggle.
As you can see from the picture above, every species has four characteristics. That is only 150 rows with five columns, which we can still scan by eye. But have you ever tried organizing data with more than ten thousand rows? Doing it manually wastes your time and energy, and that is where a package named Pandas helps a lot.
In this story, I want to share everything about pandas, but I will split it into parts, because we can’t build the Eiffel Tower in just one night, right? To master anything you want, you should take it step by step. 😊
In this part 1, let’s go through the essential functions used in pandas. You can code along while reading the article by accessing this notebook.
Importing the pandas
To use the pandas package, data scientists conventionally import it under the alias ‘pd’, like this.
import pandas as pd
Reading the dataset
After importing the package, you can load the data you need using one of the read functions. Pandas supports reading several file types, such as:
The most common file format is CSV (Comma-Separated Values); here the data is saved in a variable named ‘df’.
df = pd.read_csv('IRIS.csv')
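Here is a minimal, self-contained sketch of reading a CSV. The CSV content is kept in a string so the example runs anywhere; in practice you would pass a file path such as 'IRIS.csv', and the column names here are made up for illustration.

```python
import pandas as pd
from io import StringIO

# Tiny made-up CSV standing in for a real file on disk
csv_text = "sepal_length,sepal_width,species\n5.1,3.5,setosa\n6.2,2.9,versicolor\n"

# StringIO lets read_csv treat the string exactly like a file
df = pd.read_csv(StringIO(csv_text))
print(df.shape)  # (2, 3): 2 rows, 3 columns
```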
Display all data
After reading data with pandas, the data automatically becomes tabular, so you can display it simply by calling the variable you stored it in.
Head is a simple method that displays the first five rows of the data frame.
df.head() # Displaying the first 5 rows
Five is just the default for this method; you can change it to any number you want.
df.head(10) # Displaying the first 10 rows
Tail is a simple method that displays the last five rows of the data frame.
df.tail() # Displaying the last 5 rows
Just as with head, you can adjust the number of rows displayed by changing the method’s argument.
df.tail(10) # Displaying the last 10 rows
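The behavior above can be sketched with a small made-up DataFrame (the column names are placeholders standing in for the Iris data):

```python
import pandas as pd

# Six rows so head/tail have something to slice
df = pd.DataFrame({'sepal_length': [5.1, 4.9, 6.2, 5.8, 6.4, 5.0],
                   'species': ['setosa', 'setosa', 'setosa',
                               'virginica', 'virginica', 'virginica']})

print(df.head(2))      # first 2 rows
print(df.tail(2))      # last 2 rows
print(len(df.head()))  # 5 — head() defaults to five rows
```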
As I said before, a data frame consists of rows and columns. You can quickly get the number of rows and columns from the shape attribute.
Sometimes a data frame has more than ten columns, and your monitor can’t display them all well. To grab all the column names, you can read the columns attribute.
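Both attributes can be illustrated with a small sketch (the column names are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({'sepal_length': [5.1, 4.9, 6.2],
                   'sepal_width': [3.5, 3.0, 2.9],
                   'species': ['setosa', 'setosa', 'versicolor']})

print(df.shape)          # (3, 3) — a (rows, columns) tuple
print(list(df.columns))  # ['sepal_length', 'sepal_width', 'species']
```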
isnull() / isna()
Data is not always perfect. Sometimes information is missing, and missing values negatively affect your machine learning model. To count the missing values, you can use the isna() (or isnull()) method and combine it with sum().
df.isnull().sum() or df.isna().sum()
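A runnable sketch with some deliberately missing entries (the frame and column names are invented for illustration):

```python
import pandas as pd
import numpy as np

# np.nan and None both count as missing values in pandas
df = pd.DataFrame({'sepal_length': [5.1, np.nan, 6.2],
                   'species': ['setosa', 'versicolor', None]})

print(df.isna().sum())  # missing-value count per column
```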
Again, data is not always perfect: besides missing values, rows can be duplicated, and both hurt the model. Just as with missing values, combining the duplicated() method with sum() tells you how many duplicated rows the data contains.
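A minimal sketch of counting duplicates, using a made-up frame where one row repeats:

```python
import pandas as pd

df = pd.DataFrame({'sepal_length': [5.1, 5.1, 6.2],
                   'species': ['setosa', 'setosa', 'virginica']})

# duplicated() marks a row True when it exactly repeats an earlier row
print(df.duplicated().sum())  # 1
```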
To check the data types in a data frame, you can use the dtypes attribute. Several types commonly appear in data frames, such as:
- object: a mixed type stored in a column; strings are usually stored as object (“Cold”, “Old”, “New”, etc.).
- float64: floating-point values (0.5, 4.1, 2.2, etc.).
- int64: integer values (1, 100, 5000, etc.).
- datetime64[ns]: datetime values (“2021-03-10”, “2021-04-15”, “2021-05-20”, etc.).
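The four types above can all be seen in one small invented frame:

```python
import pandas as pd

df = pd.DataFrame({'species': ['setosa', 'versicolor'],     # object
                   'sepal_length': [5.1, 6.2],              # float64
                   'count': [1, 2],                         # int64
                   'measured': pd.to_datetime(
                       ['2021-03-10', '2021-04-15'])})      # datetime64[ns]

print(df.dtypes)  # one dtype per column
```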
A DataFrame is backed by a two-dimensional array; to access the raw values of the data, you can print the values attribute.
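A quick sketch showing that values hands back a plain NumPy array (column names are arbitrary here):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

arr = df.values   # the underlying 2-D NumPy array, without labels
print(type(arr))  # <class 'numpy.ndarray'>
print(arr)
```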
A column often contains many duplicate values; if you want to find the distinct values in a column, you can access them with the unique() method, which makes it easy to see what a column holds.

df['column_names'].unique() or df.column_names.unique()
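A minimal example with a made-up species column; nunique() is a related method that just counts the distinct values:

```python
import pandas as pd

df = pd.DataFrame({'species': ['setosa', 'setosa', 'virginica', 'setosa']})

# unique() keeps the order of first appearance
print(df['species'].unique())   # ['setosa' 'virginica']
print(df['species'].nunique())  # 2
```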
If you are an efficient person who wants simplicity, you can run the info() method. In one call, it shows the shape of your data frame (rows and columns), the column names, the non-null counts, the data types, and the memory usage.
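A sketch of that one-call summary, on a small invented frame with one missing value:

```python
import pandas as pd

df = pd.DataFrame({'sepal_length': [5.1, None, 6.2],
                   'species': ['setosa', 'versicolor', 'virginica']})

# Prints row count, column names, non-null counts,
# dtypes, and memory usage to stdout
df.info()
```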
Pandas provides a statistical summary for every numeric column, including the count, mean, standard deviation, min, max, and percentiles. Simply call the describe() method to get it.
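A runnable sketch on a tiny made-up numeric column:

```python
import pandas as pd

df = pd.DataFrame({'sepal_length': [5.1, 4.9, 6.2, 5.8]})

# Rows of the summary: count, mean, std, min, 25%, 50%, 75%, max
print(df.describe())
```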
We have now finished the first part of Powering Up Your Pandas. I hope these functions save you time and energy. If you want to follow along in code, you can access my notebooks here.
In the next part, I will share how to sort and subset data frames. See you there.