Introduction to Pandas

Thangarajnagulraj · Published in featurepreneur · 6 min read · Sep 27, 2022

Pandas is a Python library used to analyze, organize, and structure data. It is widely adopted in the Python community and is used in many other packages, frameworks, and modules. Pandas has a wide variety of use cases and is hugely flexible for preparing your input data for machine learning, deep learning, and neural network models. With tools like DataFrames and Series you can create two-dimensional tables and one-dimensional sequences, which allow powerful indexing and slicing. Pandas is open source and BSD-licensed. So, let's see how you can structure, manipulate, and organize your data with Pandas.

Installing Pandas

Pandas installs through pip:

pip install pandas

Pandas integrates with many tabular file types and is good at working with them by converting them into Pandas DataFrames. A DataFrame in Pandas is a two-dimensional data table. With DataFrames you can do things like manipulate data, handle time-series data, plot graphs, reshape data structures, combine data, sort data, and filter data.

You can import Pandas into your Python code as follows:

import pandas as pd

There are multiple ways to create Pandas DataFrames. You can convert dictionaries, lists, tabular data, and Pandas Series objects into DataFrames, or you can create them with the pd.DataFrame() constructor. The Series object is like a one-column DataFrame, so, as you can imagine, a DataFrame is a collection of one or more Series. You can create a Series object using pd.Series().
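
For instance, here is a minimal sketch of building a DataFrame from a dictionary (the column names and values are made up for illustration):

import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values.
df = pd.DataFrame({'name': ['Ada', 'Alan'], 'score': [95, 88]})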

Exploring Pandas Series

Basic Pandas Series construction looks like this.

pd.Series([1, 2, 3, 4])

Usually you will want to name your Series by assigning it to a variable.

my_series = pd.Series([1, 2, 3, 4])

You can combine named Pandas Series into a DataFrame as follows.

first_series = pd.Series([1, 2, 3, 4])
second_series = pd.Series(['one', 'two', 'three', 'four'])
df = pd.DataFrame({'numbers': first_series, 'words': second_series})

Exploring Pandas DataFrames

You can convert many file types to Pandas DataFrames. Here are some of the more common methods used for this.

read_csv(filepath_or_buffer[, sep, …])
read_excel(*args, **kwargs)
read_json(*args, **kwargs)
read_html(*args, **kwargs)
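
As a quick sketch, reading a few of these file types might look like this (the file names here are hypothetical):

df_csv = pd.read_csv("data.csv")
df_excel = pd.read_excel("data.xlsx")  # requires an Excel engine such as openpyxl
df_json = pd.read_json("data.json")
tables = pd.read_html("https://example.com/tables")  # returns a list of DataFrames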

You can also work with Excel files directly using some of these commonly used classes and methods.

ExcelFile.parse([sheet_name, header, names, …])
ExcelWriter(path[, engine])
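
For example, a minimal sketch of writing a DataFrame back out to Excel (the file and sheet names are made up):

# Assumes an Excel engine such as openpyxl is installed.
with pd.ExcelWriter("report.xlsx") as writer:
    df.to_excel(writer, sheet_name="summary", index=False)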

You also have methods you can use on DataFrames to help structure and manipulate your data. I recommend reading the docs to understand everything you can do with DataFrames, but these are some of the most used methods and attributes.

# constructor
DataFrame([data, index, columns, dtype, copy])

The DataFrame constructor has five parameters: data, index, columns, dtype, and copy.
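
Here is a small sketch that uses all five (the values are made up for illustration):

data = {'a': [1, 2], 'b': [3, 4]}
# index labels the rows, columns selects and orders the columns,
# dtype casts the values, and copy controls whether the input data is copied.
df = pd.DataFrame(data, index=['r1', 'r2'], columns=['a', 'b'], dtype=float, copy=True)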

DataFrames are used as input for many machine learning, deep learning, and neural network models. They are also good for EDA (Exploratory Data Analysis). Knowing and using at least the basics of Pandas is a must for most Data Science with Python projects.

Working with Data in Pandas (Example)

I will do examples on a customer churn dataset that is available on Kaggle.

Let’s start by reading the csv file into a pandas dataframe.

import numpy as np
import pandas as pd

df = pd.read_csv("/content/churn.csv")

df.shape
(10000, 14)

df.columns
Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
       'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Exited'],
      dtype='object')

1. Dropping columns

The drop function is used to drop columns and rows. We pass the labels of rows or columns to be dropped.

df.drop(['RowNumber', 'CustomerId', 'Surname', 'CreditScore'], axis=1, inplace=True)

df.shape
(10000, 10)

The axis parameter is set as 1 to drop columns and 0 for rows. The inplace parameter is set as True to save the changes. We dropped 4 columns so the number of columns reduced to 10 from 14.
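
As a quick sketch, dropping rows instead passes row labels with axis=0 (the labels here are just the first three default index values):

# Returns a new dataframe without rows 0, 1, and 2; df itself is unchanged without inplace=True.
df.drop([0, 1, 2], axis=0)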

2. Select particular columns while reading

We can read only some of the columns from the csv file. The list of columns is passed to the usecols parameter while reading. It is better than dropping later on if you know the column names beforehand.

df_spec = pd.read_csv("/content/churn.csv", usecols=['Gender', 'Age', 'Tenure', 'Balance'])

df_spec.head()

3. Reading a part of the dataframe

The read_csv function allows reading only a part of the csv file in terms of rows. There are two options. The first one is to read the first n rows.

df_partial = pd.read_csv("/content/churn.csv", nrows=5000)

df_partial.shape
(5000, 14)

Using the nrows parameter, we created a dataframe that contains the first 5000 rows of the csv file.

We can also select rows from the end of the file by using the skiprows parameter. skiprows=5000 means that we will skip the first 5000 rows while reading the csv file.
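
One caveat worth a sketch: a plain skiprows=5000 also skips the header row, so passing a range that starts at 1 keeps the column names intact (assuming the same churn file):

# Skip data rows 1 through 5000 but keep row 0, the header.
df_tail = pd.read_csv("/content/churn.csv", skiprows=range(1, 5001))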

4. Sample

After creating a dataframe, we may want to draw a small sample to work with. We can use either the n parameter or the frac parameter to determine the sample size.

  • n: The number of rows in the sample
  • frac: The ratio of the sample size to the whole dataframe size
df_sample = df.sample(n=1000)
df_sample.shape
(1000, 10)

df_sample2 = df.sample(frac=0.1)
df_sample2.shape
(1000, 10)

5. Checking the missing values

The isna function detects the missing values in a dataframe. By combining isna with the sum function, we can see the number of missing values in each column.

df.isna().sum()

There are no missing values.

6. Adding missing values using loc and iloc

This example is a chance to practice loc and iloc. These methods select rows and columns based on label or integer position.

  • loc: selects by label
  • iloc: selects by integer position

Let’s first create 20 random indices to select.

missing_index = np.random.randint(10000, size=20)

We will use these indices to change some values to np.nan (missing value).

df.loc[missing_index, ['Balance','Geography']] = np.nan

There are now missing values in the “Balance” and “Geography” columns (up to 20 each, since np.random.randint can repeat indices). Let’s do another example using the indices instead of labels.

df.iloc[missing_index, -1] = np.nan

“-1” is the index of the last column, which is “Exited”.

Although we’ve referenced the columns differently for loc (by label) and iloc (by position), the row values have not changed. The reason is that this dataframe uses the default numerical index, so a row’s label and its integer position are the same.
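
As a quick sketch of that equivalence (assuming the default RangeIndex created by read_csv):

# Both lines select the same cell: row 0, last column ('Exited').
df.loc[0, 'Exited']
df.iloc[0, -1]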

The number of missing values has changed.

7. Filling missing values

The fillna function is used to fill the missing values. It provides many options. We can use a specific value, an aggregate function (e.g. mean), or the previous or next value.

For the geography column, I will use the most common value.

mode = df['Geography'].value_counts().index[0]
df['Geography'].fillna(value=mode, inplace=True)

Similarly, for the balance column, I will use the mean of the column to replace missing values.

avg = df['Balance'].mean()
df['Balance'].fillna(value=avg, inplace=True)

The method parameter of the fillna function can be used to fill missing values based on the previous or next value in a column (e.g. method='ffill'). It can be pretty useful for sequential data (e.g. time series).
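
A minimal sketch of forward fill on a toy Series (note that newer pandas versions prefer the equivalent .ffill() method):

s = pd.Series([1.0, np.nan, np.nan, 4.0])
s.fillna(method='ffill')  # each NaN takes the previous value: 1.0, 1.0, 1.0, 4.0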

8. Dropping missing values

Another way to handle missing values is to drop them. There are still missing values in the “Exited” column. The following code will drop rows that have any missing value.

df.dropna(axis=0, how='any', inplace=True)

The axis=1 setting is used to drop columns with missing values instead. We can also set a threshold for the number of non-missing values required for a column or row to be kept. For instance, thresh=5 means that a row must have at least 5 non-missing values not to be dropped; rows with 4 or fewer non-missing values will be dropped.
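
Here is a small sketch of thresh on a toy dataframe (the values are made up):

demo = pd.DataFrame({'a': [1, np.nan, np.nan], 'b': [2, 5, np.nan]})
# Keep rows with at least 2 non-missing values; only the first row survives.
demo.dropna(thresh=2)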

The dataframe does not have any missing values now.

df.isna().sum().sum()
0

9. Selecting rows based on conditions

In some cases, we need the observations (i.e. rows) that fit some conditions. For instance, the below code will select customers who live in France and have churned.

france_churn = df[(df.Geography == 'France') & (df.Exited == 1)]

france_churn.Geography.value_counts()
France    808

Conclusion

The Pandas library is really an amazing tool to have in Python. This article only scratches the surface of what you can accomplish with the Pandas API. You begin to see the true capabilities Pandas has to offer once you start working with data in Python. Learning Pandas and how it works will improve your Data Science experience in Python by giving you more control over your input data. This gives you more flexibility and power not only when exploring data, but also when working directly with it to achieve your programmatic, computational, or scientific goals. I hope this helps anyone wanting to learn more about the Pandas API in Python. Happy coding!
