Published in MLearning.ai

Jupyter Notebook: Part 2

Pandas library in Python to explore data

Explore data using Pandas in Jupyter Notebook

Prerequisite: review "Jupyter Notebook: Part 1" to set up your Jupyter Notebook environment.

Get Pandas

Pandas is an open-source Python library for data analysis and data manipulation. It can be used to read, write, explore and visualize data.

Pandas is not included with a standard Python installation. Install it as follows:

  • Run Command Prompt as administrator.
  • Enter: pip install pandas
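To confirm that the installation worked, a quick check like the one below can be run in a notebook cell or at a Python prompt. It only assumes that the pip command above finished without errors.

```python
import pandas as pd

# If the import succeeds, Pandas is installed; the version string shows which release.
print(pd.__version__)
```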

Download an example dataset

Download the Iris dataset to your machine. It is one of the most widely used datasets for learning data analysis.

Download the CSV file (source: https://datahub.io/machine-learning/iris#resource-iris)

Data Exploration

Using Jupyter Notebook, explore the dataset to understand the data:

Import pandas

Code: import pandas as pd

Read the dataset

Code: iris_dataframe = pd.read_csv("<path to downloaded csv>")
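As a rough sketch of what read_csv does, the snippet below loads a few CSV rows from an in-memory string instead of a file on disk. The rows are a hand-written stand-in for the downloaded file, and the column names are the ones the later examples in this article use (Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species). The file you download may label its columns differently, in which case adjust the names in the examples.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded CSV: a header plus four rows.
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,7.0,3.2,4.7,1.4,Iris-versicolor
4,6.3,3.3,6.0,2.5,Iris-virginica
"""

# read_csv accepts a file path or any file-like object.
iris_dataframe = pd.read_csv(io.StringIO(csv_text))

# The header row becomes the column names; column types are inferred.
print(iris_dataframe.dtypes)
```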

Some common Pandas functions that can be used to explore the data

head() displays the first 5 rows of the dataset by default

Code: iris_dataframe.head()
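head() also accepts the number of rows to return, and tail() is its counterpart for the end of the dataset. A minimal sketch, using a small made-up DataFrame (an assumption for illustration) in place of the Iris file:

```python
import pandas as pd

# Made-up data standing in for the Iris file.
df = pd.DataFrame({"Id": range(1, 9)})

first_five = df.head()    # default: first 5 rows
first_three = df.head(3)  # first 3 rows
last_two = df.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```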


sample(n) returns n randomly selected rows from the dataset

Code: iris_dataframe.sample(10)
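Because sample() picks rows at random, its result changes from run to run. Passing random_state makes the selection repeatable, which helps when sharing a notebook. The sketch below uses a made-up 20-row DataFrame rather than the Iris file, and the seed value 42 is arbitrary.

```python
import pandas as pd

# Made-up data standing in for the Iris file.
df = pd.DataFrame({"Id": range(1, 21)})

# The same seed produces the same random selection every time.
first_draw = df.sample(5, random_state=42)
second_draw = df.sample(5, random_state=42)

print(first_draw.equals(second_draw))
```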


shape returns the number of rows and columns in the dataset as a tuple. It is an attribute, not a function, so it is written without parentheses

Code: iris_dataframe.shape


columns lists all the column names of the dataset. It is an attribute, not a function, so it is written without parentheses

Code: iris_dataframe.columns
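Both shape and columns are read without parentheses. A short sketch of how their values can be used, with a three-row made-up DataFrame standing in for the Iris file:

```python
import pandas as pd

# Made-up data standing in for the Iris file.
df = pd.DataFrame({
    "Id": [1, 2, 3],
    "PetalWidthCm": [0.2, 0.2, 1.4],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor"],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# columns is an Index object; tolist() converts it to a plain Python list.
column_names = df.columns.tolist()
print(column_names)
```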


Display specific rows

# this example prints the rows at positions 5 to 10 (positions start at 0, and the end of the slice, 11, is excluded)

Code: iris_dataframe[5:11]
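Row slicing with square brackets follows Python's usual slice rules: the start position is included and the end position is not. A small sketch with a made-up DataFrame whose Id column starts at 1, so positions and Id values differ by one:

```python
import pandas as pd

# Made-up data standing in for the Iris file; Id starts at 1, positions start at 0.
df = pd.DataFrame({"Id": range(1, 16)})

# Positions 5, 6, 7, 8, 9 and 10 are returned; position 11 is excluded.
subset = df[5:11]
print(subset)
```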


Display specific columns

# this example prints the first 10 rows of only the Id and Species columns

Code: iris_dataframe[["Id", "Species"]].head(10)
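The double square brackets matter here: a list of column names returns a DataFrame, while a single name in single brackets returns a Series. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up data standing in for the Iris file.
df = pd.DataFrame({
    "Id": [1, 2, 3],
    "PetalWidthCm": [0.2, 0.2, 1.4],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor"],
})

one_column = df["Species"]           # single brackets: a Series
two_columns = df[["Id", "Species"]]  # list of names: a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
```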


Select data or filter data

loc is label-based. It is an indexer used with square brackets rather than a function called with parentheses. You specify rows or columns by label, or pass a True/False condition, to select or filter data.

# In this example, filter on data where Species is Iris-setosa and PetalWidthCm > 0.4

Code: iris_dataframe.loc[(iris_dataframe["Species"] == "Iris-setosa") & (iris_dataframe["PetalWidthCm"] > 0.4)]
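The filter works by building a True/False mask with one entry per row and keeping the rows where the mask is True. Each condition needs its own parentheses because & binds more tightly than == and >. The sketch below shows the mask as a separate step, on made-up rows rather than the real file:

```python
import pandas as pd

# Made-up data standing in for the Iris file.
df = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-setosa"],
    "PetalWidthCm": [0.2, 0.5, 1.4, 0.6],
})

# One boolean per row; & combines the two conditions element by element.
mask = (df["Species"] == "Iris-setosa") & (df["PetalWidthCm"] > 0.4)
print(mask.tolist())

# loc keeps only the rows where the mask is True.
filtered = df.loc[mask]
print(filtered)
```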


# In this example, use loc to select the rows labeled 11 to 13 (with loc, both ends of the slice are included)

Code: iris_dataframe.loc[11:13]


iloc is position-based. Like loc, it is used with square brackets, and you specify rows or columns by their integer position, counting from 0.

# In this example, select the row at position 5 (the sixth row)

Code: iris_dataframe.iloc[5]
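One practical difference between the two indexers is how they treat the end of a slice: loc includes it, iloc excludes it. A short sketch on a made-up 20-row DataFrame with the default 0-based row labels:

```python
import pandas as pd

# Made-up data standing in for the Iris file; row labels default to 0..19.
df = pd.DataFrame({"Id": range(1, 21)})

by_label = df.loc[11:13]      # labels 11, 12 and 13: three rows
by_position = df.iloc[11:13]  # positions 11 and 12: two rows
single_row = df.iloc[5]       # one row, returned as a Series

print(len(by_label), len(by_position))
print(single_row["Id"])
```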


Calculate sum, mean, median for a specific column

Code:

col_sum = iris_dataframe["PetalWidthCm"].sum()
col_mean = iris_dataframe["PetalWidthCm"].mean()
col_median = iris_dataframe["PetalWidthCm"].median()
print("Sum:", col_sum, "\nMean:", col_mean, "\nMedian:", col_median)


Get min, max for a specific column

Code:

col_min = iris_dataframe["PetalWidthCm"].min()
col_max = iris_dataframe["PetalWidthCm"].max()
print("Minimum:", col_min, "\nMaximum:", col_max)
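The sum, mean, median, minimum and maximum can also be collected in a single call with agg(), which takes a list of statistic names. This is an alternative to the separate calls above, sketched here on four made-up values:

```python
import pandas as pd

# Made-up values standing in for the PetalWidthCm column.
df = pd.DataFrame({"PetalWidthCm": [0.2, 0.4, 1.4, 2.0]})

# agg returns a Series indexed by the names of the requested statistics.
stats = df["PetalWidthCm"].agg(["sum", "mean", "median", "min", "max"])
print(stats)
```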


value_counts() counts how many times each distinct value occurs in a column

Code: iris_dataframe["Species"].value_counts()
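value_counts() can also report proportions instead of raw counts when called with normalize=True. A minimal sketch on a made-up Species column:

```python
import pandas as pd

# Made-up values standing in for the Species column.
species = pd.Series(
    ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-versicolor"],
    name="Species",
)

counts = species.value_counts()                # occurrences per value
shares = species.value_counts(normalize=True)  # fraction of rows per value

print(counts)
print(shares)
```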


Data Manipulation

Add columns

Code:

iris_dataframe["new_col"] = iris_dataframe["PetalWidthCm"] * 10

iris_dataframe.head()
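Assigning to a new column name changes the DataFrame in place. If the original should stay untouched, assign() returns a copy with the extra column instead. In the sketch below the data is made up and the column name PetalWidthMm is a hypothetical name chosen for illustration.

```python
import pandas as pd

# Made-up values standing in for the Iris file.
df = pd.DataFrame({"PetalWidthCm": [0.2, 0.4]})

# assign returns a new DataFrame; df itself keeps its original columns.
with_mm = df.assign(PetalWidthMm=df["PetalWidthCm"] * 10)

print(df.columns.tolist())
print(with_mm.columns.tolist())
```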


Rename columns

Code:

renamed_cols = {
    "SepalLengthCm": "sepalLength",
    "SepalWidthCm": "sepalWidth",
    "PetalLengthCm": "petalLength",
    "PetalWidthCm": "petalWidth",
}

iris_dataframe.rename(columns=renamed_cols, inplace=True)

iris_dataframe.head()
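Two details of rename() are worth knowing. Without inplace=True it returns a renamed copy and leaves the original alone, and by default it silently skips keys in the mapping that match no column, so a misspelled name goes unnoticed. A sketch on made-up data, where NotAColumn is a deliberately wrong name:

```python
import pandas as pd

# Made-up data standing in for the Iris file.
df = pd.DataFrame({"SepalLengthCm": [5.1], "PetalWidthCm": [0.2]})

# "NotAColumn" matches nothing, so it is ignored rather than raising an error.
renamed = df.rename(columns={"SepalLengthCm": "sepalLength", "NotAColumn": "ignored"})

print(df.columns.tolist())       # original is unchanged
print(renamed.columns.tolist())  # only the matching column was renamed
```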


Conditional formatting

Code: iris_dataframe.head(10).style.highlight_max()


Find and remove missing values

isnull() returns True for each missing value, else False

Code: iris_dataframe.isnull()


# this example counts the missing values in each column

Code: iris_dataframe.isnull().sum()
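Once missing values have been found, one common way to remove them is dropna(), which drops every row that contains a missing value. The sketch below uses a made-up DataFrame with one deliberately missing entry. Whether dropping rows is the right choice depends on the analysis, so treat this as one option rather than the only approach.

```python
import pandas as pd

# Made-up data with one missing PetalWidthCm value.
df = pd.DataFrame({
    "Id": [1, 2, 3],
    "PetalWidthCm": [0.2, float("nan"), 1.4],
})

# Count the missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# dropna returns a new DataFrame without the rows that have missing values.
cleaned = df.dropna()
print(cleaned)
```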


These are some of the functions you can use to explore and manipulate your data to prepare for data analysis.

Divya Sikka


Student
