Getting started with Pandas!

Published in

Analytics Vidhya

5 min read · May 19, 2020

Pandas is a Python package widely used to work with structured data.

In this blog, we will discuss some very useful Pandas methods for analyzing and transforming data and for generating basic statistics from it. We will be using a dataset from Kaggle named Insurance_Dataset.

Let’s start with importing the Pandas library.

import pandas as pd

Reading data

Now, let’s read the dataset into a Pandas dataframe.

A Pandas dataframe is a tabular data structure with labelled axes (rows and columns).

The read_csv() function reads a comma-separated file into a dataframe. If the data is separated by some other character (say, | ), pass that character as the delimiter argument to read_csv().

e.g. pd.read_csv("insurance.csv", delimiter = "|")

insurance = pd.read_csv("insurance.csv")
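To try the delimiter option without a file on disk, here is a minimal sketch. The pipe-separated text and the name sample are made up for illustration and are not part of the insurance dataset; sep is the more commonly used name for the same option as delimiter.

```python
import io

import pandas as pd

# Hypothetical pipe-separated text standing in for a file on disk.
raw = "age|sex|bmi\n19|female|27.9\n18|male|33.77\n"

# sep (alias: delimiter) tells read_csv which character splits the fields.
sample = pd.read_csv(io.StringIO(raw), sep="|")

print(sample.shape)
print(list(sample.columns))
```

Without sep="|", each line would be read as a single column, because no commas are present.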

Too much data to go through?

Check out the first 5 or the last 5 rows using the head() or tail() methods respectively. You can also pass the number of rows n you wish to see to either method; the default is 5.

insurance.head()

insurance.head(n = 10)

insurance.tail()

You can find the data type of each column in a dataframe by using the dtypes attribute.

The data types you will most commonly see in Pandas are float64, int64, object, bool, category, timedelta64 and datetime64.

insurance.dtypes

Use select_dtypes() to view only the columns with the data types you want.

insurance.select_dtypes(include = ['float64', 'int64'])
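select_dtypes() also accepts an exclude argument, and the generic "number" selector covers all numeric types at once. A small sketch on a made-up dataframe (the name sample and its values are assumptions, not the insurance data):

```python
import pandas as pd

# Small hypothetical dataframe with two numeric columns and one text column.
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "bmi": [27.9, 33.77, 33.0],
    "sex": ["female", "male", "male"],
})

# "number" matches every numeric dtype (int64, float64, ...).
numeric = sample.select_dtypes(include="number")

# exclude keeps everything except the listed dtypes.
non_numeric = sample.select_dtypes(exclude="number")

print(list(numeric.columns))
print(list(non_numeric.columns))
```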

Summary of dataframe

Let’s check the number of rows and columns present in the dataframe.

insurance.shape

So this dataframe consists of 1338 rows and 7 columns.

The describe() method shows basic summary statistics for the numeric columns of a dataframe: count, mean, std, min, max and the quartiles (the 50% row is the median).

insurance.describe()
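The output of describe() is itself a dataframe, so individual statistics can be picked out with loc. A minimal sketch on made-up values (the name sample and its numbers are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column.
sample = pd.DataFrame({
    "charges": [100.0, 200.0, 600.0],
    "smoker": ["yes", "no", "no"],
})

# By default only numeric columns are summarised.
summary = sample.describe()

# The "50%" row holds the median of each column.
median_charges = summary.loc["50%", "charges"]
mean_charges = summary.loc["mean", "charges"]

print(list(summary.columns))
print(median_charges, mean_charges)
```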

Copying the dataframe

To avoid making changes to the original dataframe, it’s usually better to create a copy using the copy() method and work on that.

df = insurance.copy()

Here, if we change the sex of the 4th row and the region of the 2nd row:

df.loc[3, 'sex'] = 'female'
df.loc[1, 'region'] = 'northwest'

This will not affect the original dataframe.

insurance.head()
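To see why copy() matters, compare it with plain assignment, which only creates a second name for the same dataframe. This is a sketch on a tiny made-up dataframe; the names original, alias and independent are illustrative.

```python
import pandas as pd

# Tiny hypothetical dataframe.
original = pd.DataFrame({
    "sex": ["male", "male"],
    "region": ["southwest", "southeast"],
})

alias = original               # second name for the same object
independent = original.copy()  # separate dataframe with its own data

# Changing the copy leaves the original untouched.
independent.loc[0, "sex"] = "female"

# Changing the alias changes the original, because they are one object.
alias.loc[1, "region"] = "northwest"

print(original)
```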

Missing values

Let’s check the number of null values in each of the columns.

The isnull() method tells us whether each value in the dataframe is null or missing, and sum() then counts such values in each column.

print(df.isnull().sum())

We can see there are no null or missing values in our dataframe. But if there were, we could use the dropna() method to remove the null values or the fillna() method to replace them with a desired value.
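Since the insurance data has no missing values, here is a minimal sketch of dropna() and fillna() on a made-up dataframe with one missing bmi. Filling with the column mean is just one possible choice, picked here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with one missing bmi value.
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "bmi": [27.9, np.nan, 33.0],
})

# dropna() returns a new dataframe without the rows that contain nulls.
dropped = sample.dropna()

# fillna() returns a new dataframe with nulls replaced; here the bmi
# column is filled with its own mean.
filled = sample.fillna({"bmi": sample["bmi"].mean()})

print(len(dropped))
print(filled["bmi"].isnull().sum())
```

Both methods return new dataframes, so sample itself still contains the missing value afterwards.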

Removing unwanted columns

What if there’s a column that is of no use to you in the analysis?

Get rid of such columns using the drop() method.

For example, let’s drop the region column.

df.drop("region", axis = 1, inplace = True)

The inplace = True argument modifies df directly instead of returning a new dataframe with the column removed.
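The difference between the two styles of drop() can be sketched as follows, on a made-up dataframe (sample, trimmed and result are illustrative names). Without inplace, drop() returns a new dataframe; with inplace = True it changes the dataframe and returns None.

```python
import pandas as pd

# Hypothetical dataframe with a column we want to remove.
sample = pd.DataFrame({
    "age": [19, 18],
    "region": ["southwest", "southeast"],
})

# Without inplace, drop() returns a new dataframe and leaves sample as is.
trimmed = sample.drop(columns="region")
still_there = "region" in sample.columns

# With inplace=True, sample itself is modified and nothing is returned.
result = sample.drop("region", axis=1, inplace=True)

print(list(trimmed.columns), still_there, result)
```

Assigning the returned dataframe to a name, as with trimmed, is a common alternative to inplace = True.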

Filtering and Aggregating Data

How many male and female smokers and non-smokers are there?

df.groupby(['sex'])['smoker'].value_counts()

groupby() is one of the most useful Pandas methods. It splits the rows into groups that share the same value in the given column(s).

value_counts() counts the number of entries for each value in the column it is applied on.

Now, we’ll take a look at the average insurance charges of these people, grouped by sex and by whether they smoke.

df.groupby(['sex', 'smoker'])['charges'].mean()
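Grouping by two columns gives a result indexed by both of them. As a sketch on made-up numbers (not the real charges), unstack() can turn one of the two levels into columns, which is often easier to read:

```python
import pandas as pd

# Hypothetical data: one row per sex/smoker combination.
sample = pd.DataFrame({
    "sex": ["female", "female", "male", "male"],
    "smoker": ["yes", "no", "yes", "no"],
    "charges": [300.0, 100.0, 400.0, 200.0],
})

# Mean charges per (sex, smoker) pair; the result has a two-level index.
means = sample.groupby(["sex", "smoker"])["charges"].mean()

# Move the smoker level into columns to get a small table.
table = means.unstack("smoker")

print(means)
print(table)
```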

Which group has the most people with a low bmi?

Let’s filter the data to include only those people with a low bmi (consider it to be 19 or below).

low_bmi = df[df['bmi'] <= 19]
low_bmi.head()

Notice that the index labels are no longer consecutive. To fix this we can use the reset_index() method. It returns a new dataframe, so we assign the result back to low_bmi.

low_bmi = low_bmi.reset_index(drop = True)
low_bmi.head()
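A quick way to check what reset_index() does is to look at the index before and after. This sketch uses a made-up bmi column; note that the filtered dataframe keeps its old labels until the result of reset_index() is stored somewhere.

```python
import pandas as pd

# Hypothetical bmi values; rows 1 and 3 fall at or below 19.
sample = pd.DataFrame({"bmi": [27.9, 17.3, 33.0, 18.5]})

# Filtering keeps the original row labels.
low = sample[sample["bmi"] <= 19]

# reset_index(drop=True) returns a new dataframe numbered from 0;
# drop=True discards the old labels instead of keeping them as a column.
renumbered = low.reset_index(drop=True)

print(list(low.index))
print(list(renumbered.index))
```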

We’ll now check which age and gender is most common among people with a low bmi.

low_bmi.groupby(['sex'])['age'].value_counts().idxmax()

idxmax() is another very useful function: it returns the index label of the maximum value. Applied to the counts above, it gives the (sex, age) combination with the highest frequency.
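Here is a small sketch of the same pattern on made-up data, so the result is easy to verify by eye. The dataframe and the names counts and top are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data: two 18-year-old women, one 21-year-old woman
# and one 18-year-old man.
sample = pd.DataFrame({
    "sex": ["female", "female", "female", "male"],
    "age": [18, 18, 21, 18],
})

# Count how often each age occurs within each sex.
counts = sample.groupby("sex")["age"].value_counts()

# idxmax() returns the index label of the largest count; with a
# two-level index that label is a (sex, age) tuple.
top = counts.idxmax()

print(top)
print(counts.max())
```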

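We can also compute summary statistics for a single column, for example the minimum, maximum, mean and standard deviation of bmi:

print(df['bmi'].min())
print(df['bmi'].max())
print(df['bmi'].mean())
print(df['bmi'].std())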

Saving to new file

Finally, if you want to save the changes made to the dataframe to a separate file, you can use the to_csv() method.

df.to_csv("new_insurance.csv")
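By default, to_csv() also writes the row index as the first column, which shows up as an extra "Unnamed: 0" column when the file is read back. The sketch below uses in-memory buffers and a made-up dataframe instead of real files; passing index = False is a common way to avoid the extra column.

```python
import io

import pandas as pd

# Hypothetical dataframe to write out.
sample = pd.DataFrame({"age": [19, 18], "bmi": [27.9, 33.77]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reread_with_index = pd.read_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
restored = pd.read_csv(without_index)

print(list(reread_with_index.columns))
print(list(restored.columns))
```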

Thanks for reading! Hope this post helped you!

LinkedIn: https://www.linkedin.com/in/akshada-gaonkar-9b8886189/