Getting started with Pandas!

Akshada Gaonkar
Published in Analytics Vidhya
5 min read · May 19, 2020

Pandas is a Python package widely used to work with structured data.

In this post, we will look at some of the most useful Pandas methods for analyzing and transforming data and for generating basic statistics from it. We will be using a dataset from Kaggle named Insurance_Dataset.

Let’s start with importing the Pandas library.

import pandas as pd

Reading data

Now, let’s read the dataset into a Pandas dataframe.

A Pandas dataframe is a tabular data structure with labelled axes (rows and columns).

The read_csv() function reads a comma-separated file into Pandas. If the values are separated by some other character (say |), pass that character as the delimiter argument to read_csv().

e.g. pd.read_csv("insurance.csv", delimiter = "|")

insurance = pd.read_csv("insurance.csv")
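As a minimal sketch of the delimiter case, the snippet below reads pipe-separated text from an in-memory buffer instead of a file. The rows are made-up sample values, not taken from the Kaggle dataset, and the names raw and sample are only for illustration.

```python
import io

import pandas as pd

# Made-up, pipe-separated sample standing in for a file on disk.
raw = io.StringIO(
    "age|sex|bmi|smoker\n"
    "19|female|27.9|yes\n"
    "18|male|33.77|no\n"
)

# delimiter (an alias of sep) tells read_csv which character splits the fields.
sample = pd.read_csv(raw, delimiter="|")
print(sample)
```

read_csv() accepts a file path or a file-like object, so the same call should work with a path such as "insurance.csv".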

Too much data to go through?

Check out the first 5 or the last 5 rows using head() or tail() methods respectively. You can also pass the number of rows ‘n’ you wish to see to both the functions. Default number of rows is 5.

insurance.head()
insurance.head(n = 10)
insurance.tail()
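To see how n changes the result without loading the full dataset, here is a small sketch on a hypothetical 12-row dataframe (the row_id column is a placeholder, not a column of the insurance data).

```python
import pandas as pd

# Hypothetical 12-row frame; the values are placeholders.
demo = pd.DataFrame({"row_id": range(12)})

first_five = demo.head()     # n defaults to 5
first_three = demo.head(3)   # n can also be passed as n=3
last_two = demo.tail(n=2)    # the final two rows

print(first_three)
print(last_two)
```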

You can find the data type of each column in a dataframe by using the dtypes attribute.

The common data types in Pandas are float64, int64, object, bool, category, timedelta64 and datetime64.

insurance.dtypes

Use select_dtypes() to view only the columns that have the data types you want.

insurance.select_dtypes(include = ['float64', 'int64'])
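The sketch below shows dtypes and select_dtypes() on a small made-up dataframe. It assumes you want all numeric columns at once, so it uses the "number" selector rather than listing float64 and int64 separately.

```python
import pandas as pd

# Made-up rows; only the column names mirror the insurance data.
sample = pd.DataFrame(
    {
        "age": [19, 18, 28],
        "bmi": [27.9, 33.77, 33.0],
        "sex": ["female", "male", "male"],
    }
)

print(sample.dtypes)

# "number" matches every numeric dtype (integers and floats alike).
numeric = sample.select_dtypes(include="number")

# exclude works the other way round and keeps the non-numeric columns.
non_numeric = sample.select_dtypes(exclude="number")
```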

Summary of dataframe

Let’s check the number of rows and columns present in the dataframe.

insurance.shape

So this dataframe consists of 1338 rows and 7 columns.

The describe() method shows basic statistics such as count, mean, std, min, max and the quartiles (the 50% row is the median) for the numeric columns of a dataframe.

insurance.describe()
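Here is a small sketch of shape and describe() on made-up values. It also shows how a single statistic can be read out of the summary; the variable names are only illustrative.

```python
import pandas as pd

# Made-up rows for illustration.
sample = pd.DataFrame(
    {"age": [20, 30, 40, 50], "sex": ["female", "male", "female", "male"]}
)

# shape is a (rows, columns) tuple.
n_rows, n_cols = sample.shape

# By default describe() summarises only the numeric columns.
summary = sample.describe()
print(summary)

# The summary is itself a dataframe, so single values can be looked up.
median_age = summary.loc["50%", "age"]
mean_age = summary.loc["mean", "age"]
```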

Copying the dataframe

To avoid making changes to the original dataframe, it’s better to create a copy using the copy() method and work on that.

df = insurance.copy()

Here, if we change the sex in the 4th row and the region in the 2nd row:

df.loc[3, 'sex'] = 'female'
df.loc[1, 'region'] = 'northwest'

This does not affect the original dataframe.

insurance.head()
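The reason copy() matters is that plain assignment does not duplicate a dataframe; it only gives the same object a second name. A minimal sketch with made-up values (the names original, alias and independent are hypothetical):

```python
import pandas as pd

# Made-up rows for illustration.
original = pd.DataFrame(
    {"sex": ["male", "male"], "region": ["southwest", "southeast"]}
)

alias = original               # a second name for the same object
independent = original.copy()  # a separate object with its own data

# Editing the copy leaves the original untouched.
independent.loc[0, "sex"] = "female"

print(original.loc[0, "sex"])
print(independent.loc[0, "sex"])
```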

Missing values

Let’s check the number of null values in each of the columns.

The isnull() method marks each value in the dataframe as null (missing) or not, and sum() then counts the null values in each column.

print(df.isnull().sum())

We can see there are no null or missing values in our dataframe. But if there were, we could use the dropna() method to remove the null values or the fillna() method to replace them with a desired value.
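Since the insurance data has no missing values, the sketch below uses a made-up dataframe with one missing bmi to show dropna() and fillna(). Filling with the column mean is just one possible choice of replacement value, assumed here for illustration.

```python
import pandas as pd

# Made-up rows with one missing bmi value.
sample = pd.DataFrame(
    {"age": [19, 18, 28], "bmi": [27.9, float("nan"), 33.0]}
)

# Number of missing values per column.
missing_per_column = sample.isnull().sum()

# Option 1: drop every row that contains a missing value.
dropped = sample.dropna()

# Option 2: replace missing bmi values with the column mean
# (mean() ignores missing values by default).
filled = sample.fillna({"bmi": sample["bmi"].mean()})

print(missing_per_column)
```

Both dropna() and fillna() return new dataframes here, so sample itself still contains the missing value.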

Removing unwanted columns

What if there’s a column that is of no use to you in the analysis?

Get rid of such columns using the drop() method.

For example, let’s drop the region column.

df.drop("region", axis = 1, inplace = True)

Setting inplace = True modifies df itself; without it, drop() returns a new dataframe and leaves df unchanged.
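The difference between the two styles of drop() can be sketched on a made-up dataframe. The columns keyword used in the first call is an alternative to passing axis = 1.

```python
import pandas as pd

# Made-up rows for illustration.
sample = pd.DataFrame(
    {
        "age": [19, 18],
        "region": ["southwest", "southeast"],
        "charges": [16884.92, 1725.55],
    }
)

# Without inplace, drop() returns a new dataframe; sample keeps all columns.
without_region = sample.drop(columns="region")

# With inplace=True, sample itself is modified and nothing is returned.
sample.drop("charges", axis=1, inplace=True)

print(without_region.columns.tolist())
print(sample.columns.tolist())
```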

Filtering and Aggregating Data

How many male and female smokers and non-smokers are there?

df.groupby(['sex'])['smoker'].value_counts()

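groupby() is one of the most useful Pandas methods. It splits the rows into groups based on the values of one or more columns, so that a calculation can be applied to each group.

value_counts() counts the number of entries for each value in the column it is applied to.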

Now, we’ll take a look at the average insurance charges of these people categorized on the basis of sex and smoker or not.

df.groupby(['sex', 'smoker'])['charges'].mean()
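Both group-wise calculations can be tried on a tiny made-up dataframe, which makes the shape of the result easier to follow. The values below are invented and are not the figures from the insurance data.

```python
import pandas as pd

# Made-up rows for illustration.
sample = pd.DataFrame(
    {
        "sex": ["female", "female", "male", "male", "male"],
        "smoker": ["yes", "no", "no", "yes", "yes"],
        "charges": [100.0, 20.0, 30.0, 200.0, 300.0],
    }
)

# Count smokers and non-smokers within each sex.
counts = sample.groupby("sex")["smoker"].value_counts()

# Average charges for every (sex, smoker) combination.
mean_charges = sample.groupby(["sex", "smoker"])["charges"].mean()

print(counts)
print(mean_charges)
```

Both results are indexed by (sex, smoker) pairs, so a single group can be looked up with, for example, mean_charges.loc[("male", "yes")].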

Which group has the most people with a low bmi?

Let’s filter the data to include only those people with a low bmi (here taken to be 19 or less).

low_bmi = df[df['bmi'] <= 19]
low_bmi.head()

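Notice that the index labels are no longer consecutive, because filtering keeps the original row labels. To fix this we can use the reset_index() method. It returns a new dataframe, so the result has to be assigned back.

low_bmi = low_bmi.reset_index(drop = True)
low_bmi.head()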
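The effect of filtering on the index, and of reset_index(), can be seen on a small made-up dataframe (the names low and renumbered are only illustrative):

```python
import pandas as pd

# Made-up bmi values for illustration.
sample = pd.DataFrame({"bmi": [27.9, 17.4, 33.0, 18.3]})

# The comparison builds a boolean mask; only rows where it is True are kept,
# and they keep their original labels (1 and 3 here).
low = sample[sample["bmi"] <= 19]

# reset_index returns a renumbered frame; drop=True discards the old labels
# instead of keeping them as an extra column.
renumbered = low.reset_index(drop=True)

print(low.index.tolist())
print(renumbered.index.tolist())
```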

We’ll now check which age group and gender has the most people with a low bmi.

low_bmi.groupby(['sex'])['age'].value_counts().idxmax()

idxmax() is another very useful function: it returns the index label of the largest value. Applied to the counts above, it gives the (sex, age) pair with the highest count.
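A minimal sketch of the same pattern on made-up rows shows what idxmax() returns when the counts are grouped by two keys:

```python
import pandas as pd

# Made-up rows for illustration.
sample = pd.DataFrame(
    {
        "sex": ["female", "female", "female", "male", "male"],
        "age": [18, 18, 21, 18, 30],
    }
)

# Counts per (sex, age) pair.
counts = sample.groupby("sex")["age"].value_counts()

# idxmax returns the index label of the largest count,
# which here is a (sex, age) tuple.
most_common = counts.idxmax()

print(counts)
print(most_common)
```

If several pairs share the highest count, idxmax() reports only the first one it encounters, so it is worth looking at the full counts as well.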

print(df['bmi'].min())
print(df['bmi'].max())
print(df['bmi'].mean())
print(df['bmi'].std())

Saving to new file

Finally, if you want to save changes made to the dataframe to a separate file, it can be done by using the to_csv() method.

df.to_csv("new_insurance.csv")
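By default to_csv() also writes the row index as an extra first column. If you don't want that, you can pass index=False, as in this sketch. It writes to an in-memory buffer rather than a file so the output can be inspected directly; the data is made up.

```python
import io

import pandas as pd

# Made-up rows for illustration.
sample = pd.DataFrame({"age": [19, 18], "sex": ["female", "male"]})

# Write to an in-memory text buffer; a file path would work the same way.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)  # index=False leaves out the row labels
csv_text = buffer.getvalue()

# Reading the text back gives the same columns and rows.
round_trip = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
```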

Thanks for reading! Hope this post helped you!

LinkedIn: https://www.linkedin.com/in/akshada-gaonkar-9b8886189/

Intern at SAS • MTech Student at NMIMS • Data Science Enthusiast!