How do I analyze the data before building a machine-learning model?

Gowtham S R
8 min read · Sep 12, 2022

photo from Unsplash by UX Indonesia

Table of Contents:

· Basic questions to be asked when we read the data
What is the shape of the data?
What does the data look like?
What is the data type of each column?
Are there any missing values in the data?
How does the data look mathematically?
Are there duplicate values in the data?
How are the columns correlated?
· Types of Data
Numerical Data
Categorical Data
· Exploratory Data Analysis
· Univariate Analysis
· Bivariate and Multivariate Analysis

To build a machine learning model that performs well in production, we first need to understand the data. We cannot build a good model without knowing the data it learns from.

So the first step in building a machine learning model is to explore the dataset and find its underlying patterns. In this blog, I will share some of the ways I go about getting those insights.

Basic questions to be asked when we read the data

There are 7 basic questions that we should ask when we first read a dataset. Let us work through them using the Titanic survival data, loaded into a pandas DataFrame called df.

1. What is the shape of the data?

df.shape
(891, 12)

The output shows that the dataset has 891 rows and 12 columns.

2. What does the data look like?

df.head()

The head() method shows the first five rows, which gives a first impression of what the data looks like.

df.sample(5)

Sometimes the first rows are not representative, for example when the data is sorted so that contiguous rows look alike. To get around this, we can look at a random sample of rows instead.
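
As a rough illustration of the difference between head() and sample(), here is a minimal sketch on a made-up frame that is sorted by class. The toy frame and its values are hypothetical and are not the Titanic data. Passing random_state makes the random sample reproducible.

```python
import pandas as pd

# Hypothetical toy data, sorted by class so the first rows all look alike.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Fare": [80.0, 75.5, 90.0, 71.3, 26.0, 13.0, 21.0, 7.9, 8.1, 7.3],
})

first_rows = toy.head(4)                      # only class 1 shows up here
random_rows = toy.sample(4, random_state=0)   # rows drawn from anywhere in the frame

print(first_rows["Pclass"].nunique())  # number of distinct classes seen by head()
print(random_rows.index.tolist())      # positions the sample was drawn from
```

Re-running the cell with the same random_state returns the same rows, which is handy when sharing a notebook.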

3. What is the data type of each column?

df.info()

Next we need to know the data type of each column, the memory usage, and how many non-null values each column has. All of this can be checked with the pandas method info().

4. Are there any missing values in the data?

df.isnull().sum()

It is important to know how many null values each column contains; the code above counts them per column.
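
Raw counts are easier to judge next to the share of rows they represent. The sketch below is one possible way to show both, using a small made-up frame (the toy data and the summary column names are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps, standing in for the real dataset.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

missing_count = toy.isnull().sum()                   # number of nulls per column
missing_pct = (toy.isnull().mean() * 100).round(1)   # share of nulls per column, in percent

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary.sort_values("percent", ascending=False))
```

This works because isnull() returns True/False values, so the mean of each column is the fraction of missing entries.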

5. How does the data look mathematically?

df.describe()

The code above gives summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numerical columns in our dataset.
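
By default describe() skips text columns. A minimal sketch of how the non-numerical columns could be summarized as well, assuming a small made-up frame (the toy data is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mixing one numerical and two text columns.
toy = pd.DataFrame({
    "Age": [22, 38, 26, 35],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S"],
})

numeric_summary = toy.describe()                   # count, mean, std, min, quartiles, max
text_summary = toy.describe(exclude=[np.number])   # count, unique, top, freq

print(numeric_summary)
print(text_summary)
```

For text columns, top is the most frequent value and freq is how often it occurs.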

6. Are there duplicate values in the data?

df.duplicated().sum()
0

Duplicate rows can distort a machine learning model, so it is better to check whether the dataset contains any. Here the result is 0, so there are no duplicate rows.
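
The Titanic data has no duplicates, so here is a small sketch of what the check and the clean-up could look like when some are present. The toy frame is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data in which the third row repeats the first.
toy = pd.DataFrame({
    "Name": ["Anna", "Ben", "Anna", "Cara"],
    "Age": [29, 41, 29, 35],
})

n_duplicates = toy.duplicated().sum()               # rows identical to an earlier row
duplicate_rows = toy[toy.duplicated(keep=False)]    # every copy, for inspection
deduplicated = toy.drop_duplicates().reset_index(drop=True)  # keep the first copy only

print(n_duplicates)
print(duplicate_rows)
```

Looking at the duplicated rows before dropping them is a cheap way to confirm they are true repeats and not distinct records that happen to match.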

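7. How are the columns correlated?

df.corr(numeric_only=True)

Before building a model, it helps to know how the columns are related to each other. The numeric_only=True argument (available from pandas 1.5) limits the calculation to numerical columns; from pandas 2.0 onward, calling corr() without it fails on text columns such as Name.

df.corr(numeric_only=True)['Survived']

Use the above code to get the Pearson correlation of each numerical feature with the target variable (in our case, ‘Survived’).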
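
To read the correlations with the target more easily, they can be ranked by strength. This is a minimal sketch on made-up numbers (the toy frame is hypothetical); it selects the numerical columns explicitly, which is one way to stay independent of the pandas version.

```python
import pandas as pd

# Hypothetical toy data; "Name" is text and must be left out of the correlation.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 1, 3, 2, 3],
    "Fare": [7.3, 71.3, 53.1, 8.1, 13.0, 8.5],
    "Name": ["a", "b", "c", "d", "e", "f"],
})

numeric = toy.select_dtypes(include="number")
target_corr = numeric.corr()["Survived"].drop("Survived")  # Pearson r with the target

# Order features by absolute correlation, strongest first, keeping the sign.
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

Sorting by absolute value keeps strongly negative features near the top, where a plain sort would push them to the bottom.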

Exploratory Data Analysis

After asking the 7 basic questions we get some knowledge about the data. To have a deeper understanding of the data, we need to do an Exploratory Data Analysis.

The purpose of Exploratory Data Analysis (EDA) is to get in-depth knowledge about the data using statistical and visualization tools. The insights it produces in turn help in building a good model. EDA has three main types of analysis:

  1. Univariate Analysis
  2. Bivariate Analysis
  3. Multivariate Analysis

When we analyze one variable at a time, we are doing univariate analysis (uni = single, variate = variable). In a dataset with multiple columns, each column is a variable, and analyzing each of them independently is univariate analysis.

When we analyze two variables at the same time, it is called bivariate analysis (bi = two), and analyzing more than two variables at the same time is called multivariate analysis.
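
The three levels can be sketched with plain pandas summaries before any plotting. The example below assumes a small made-up frame (hypothetical values, with column names borrowed from the Titanic data).

```python
import pandas as pd

# Hypothetical toy data with three variables.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 1],
    "Survived": [0, 1, 1, 0, 0, 1],
})

# Univariate: one variable on its own.
univariate = toy["Survived"].value_counts()

# Bivariate: survival rate broken down by one other variable.
bivariate = toy.groupby("Sex")["Survived"].mean()

# Multivariate: survival rate broken down by two other variables at once.
multivariate = toy.groupby(["Sex", "Pclass"])["Survived"].mean()

print(univariate, bivariate, multivariate, sep="\n\n")
```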

Types of Data:

Data is one of two types: numerical or categorical.

Numerical Data:

Numerical data refers to data that is in the form of numbers, e.g., a person's age, height, or weight.

Categorical Data:

Categorical data refers to information that is divided into groups, where each value belongs to one of the categories, e.g., whether a person is male or female, or rich or poor.

When we start EDA, we need to check if the column is a numerical or a categorical column and perform the analysis accordingly, which I will show in this blog.
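
One possible way to sort columns into the two groups is sketched below. It assumes a made-up frame and a hypothetical cut-off (max_levels) for spotting categories that are stored as numbers, such as Pclass; the right cut-off depends on the dataset.

```python
import pandas as pd

# Hypothetical toy data; Pclass is stored as numbers but is really a category.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Pclass": [3, 1, 3, 1, 1, 3],
})

numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Numeric columns with very few distinct values are often categories in disguise.
max_levels = 3  # assumed cut-off, chosen only for this illustration
coded_categories = [c for c in numeric_cols if toy[c].nunique() <= max_levels]

categorical_cols = text_cols + coded_categories
numerical_cols = [c for c in numeric_cols if c not in coded_categories]

print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```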

Univariate Analysis:

Univariate analysis of categorical data can be performed with a count plot or a pie chart; for numerical data, we can use a histogram, a distplot, or a boxplot.

Categorical Data

Countplot

import seaborn as sns
import matplotlib.pyplot as plt
# Recent seaborn versions expect the column to be passed by keyword
sns.countplot(x='Survived', data=df)

The count plot shows that about 550 people did not survive, and only about 340 people survived.

df['Survived'].value_counts()
0    549
1    342
Name: Survived, dtype: int64
df['Survived'].value_counts().plot(kind = 'bar')

The same counts can also be visualized using the built-in pandas method plot().

sns.countplot(x='Pclass', data=df)

About 210 passengers were traveling in passenger class 1, about 190 in class 2, and about 490 in class 3.

sns.countplot(x='Sex', data=df)

About 580 passengers were male and about 300 were female.

sns.countplot(x='Embarked', data=df)

About 650 passengers boarded at Southampton (S), about 170 at Cherbourg (C), and about 80 at Queenstown (Q).

Pie Chart

df['Survived'].value_counts().plot(kind = 'pie' , autopct = '%.2f')

The pie chart below shows that 62% of the passengers did not survive and only 38% survived.

df['Pclass'].value_counts().plot(kind = 'pie' , autopct = '%.2f')

55% of the passengers traveled in class 3, 21% in class 2, and 24% in class 1.

df['Sex'].value_counts().plot(kind = 'pie' , autopct = '%.2f')

65% of the passengers were male, and 35% were female.

df['Embarked'].value_counts().plot(kind = 'pie' , autopct = '%.2f')

About 72% of passengers boarded at Southampton (S), 19% at Cherbourg (C), and about 9% at Queenstown (Q).

Numerical Data:

Histogram

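# Age has missing values, so drop them before plotting
plt.hist(df['Age'].dropna(), bins=10)
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()  # finish this figure so the next histogram gets its own axes
plt.hist(df['Fare'], bins=10)
plt.xlabel('Fare')
plt.ylabel('Count')
plt.show()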

Distplot

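# sns.distplot is deprecated in recent seaborn versions; histplot with kde=True draws a comparable histogram with a density curve
sns.histplot(x=df['Age'], kde=True, stat='density')
sns.histplot(x=df['Fare'], kde=True, stat='density')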

Boxplot

sns.boxplot(x=df['Age'])
sns.boxplot(x=df['Fare'])
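
With seaborn's default settings, the points a boxplot draws beyond the whiskers are the values lying more than 1.5 times the interquartile range (IQR) outside the quartiles. A minimal sketch of the same rule in pandas, on made-up fares (the values are hypothetical):

```python
import pandas as pd

# Hypothetical fares with one extreme value.
fares = pd.Series([7.3, 7.9, 8.1, 13.0, 26.0, 31.0, 53.1, 512.3], name="Fare")

q1 = fares.quantile(0.25)   # first quartile
q3 = fares.quantile(0.75)   # third quartile
iqr = q3 - q1               # interquartile range

lower = q1 - 1.5 * iqr      # values below this are flagged
upper = q3 + 1.5 * iqr      # values above this are flagged

outliers = fares[(fares < lower) | (fares > upper)]
print(outliers)
```

Whether a flagged value is an error or a genuine extreme is a judgment call that the plot alone cannot settle.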

Bivariate and Multivariate Analysis

import pandas as pd
tips = sns.load_dataset('tips')
titanic = pd.read_csv('titanic_train_data.csv')
flights = sns.load_dataset('flights')
iris = sns.load_dataset('iris')

1. Scatterplot (Numerical-Numerical)

sns.scatterplot(x='total_bill', y='tip', data=tips)
sns.scatterplot(x='total_bill', y='tip', hue='sex', data=tips)
sns.scatterplot(x='total_bill', y='tip', hue='sex', style='smoker', data=tips)
plt.figure(figsize=(12,6))
# size must come from the tips data, not from the Titanic frame df
sns.scatterplot(x='total_bill', y='tip', hue='sex', style='smoker', size='size', data=tips)

2. Bar Plot (Numerical-Categorical)

sns.barplot(x='Pclass', y='Age', data=titanic)
sns.barplot(x='Pclass', y='Fare', data=titanic)
sns.barplot(x='Pclass', y='Fare', hue='Sex', data=titanic)
sns.barplot(x='Pclass', y='Age', hue='Sex', data=titanic)

3. Box Plot (Numerical-Categorical)

sns.boxplot(x='Sex', y='Age', data=titanic)
sns.boxplot(x='Sex', y='Age', hue='Survived', data=titanic)

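4. Distplot (Numerical-Categorical)

# sns.distplot(..., hist=False) is deprecated in recent seaborn versions; kdeplot draws the same density curve
sns.kdeplot(x=titanic[titanic['Survived']==0]['Age'].dropna(), color='red')
sns.kdeplot(x=titanic[titanic['Survived']==1]['Age'].dropna(), color='green')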

5. Heatmap (Categorical-Categorical)

pd.crosstab(titanic['Pclass'] , titanic['Survived'])
sns.heatmap(pd.crosstab(titanic['Pclass'] , titanic['Survived']), annot=True)
round(titanic.groupby('Pclass')['Survived'].mean()*100,2)
Pclass
1    62.96
2    47.28
3    24.24
Name: Survived, dtype: float64
(titanic.groupby('Pclass')['Survived'].mean()*100).plot(kind = 'bar')
round(titanic.groupby('Embarked')['Survived'].mean()*100,2)
Embarked
C    55.36
Q    38.96
S    33.70
Name: Survived, dtype: float64
(titanic.groupby('Embarked')['Survived'].mean()*100).plot(kind = 'bar')
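
The counts and the survival percentages can also come from a single crosstab by normalizing each row. A minimal sketch on made-up data (the toy frame is hypothetical, not the Titanic file):

```python
import pandas as pd

# Hypothetical toy data: class and survival flag for ten passengers.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

counts = pd.crosstab(toy["Pclass"], toy["Survived"])  # raw counts per cell

# normalize="index" divides each row by its total, so every row sums to 100%.
row_pct = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index") * 100

print(counts)
print(row_pct.round(2))
```

The column for Survived = 1 in row_pct matches what groupby('Pclass')['Survived'].mean()*100 returns, and the table can be passed to sns.heatmap in the same way as the count table.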

6. Pairplot

sns.pairplot(iris)
sns.pairplot(iris , hue = 'species')

7. Lineplot (Numerical - Numerical)

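# Sum only the passengers column; the month column is categorical and cannot be summed
new = flights.groupby('year')['passengers'].sum().reset_index()
sns.lineplot(x='year', y='passengers', data=new)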
