Deep Dive in Machine Learning with Python

Part — XIII: Converse with the data

Published in Analytics Vidhya · 3 min read · Feb 8, 2020

Welcome to another blog of Deep Dive in Machine Learning with Python. Up to this point, we have worked extensively on the concepts of Python programming, data wrangling, and data visualization. In the last blog, we gained an understanding of heatmaps and pair plots.

In today’s blog, we will explore the various levels of data analysis and learn how to make raw data reveal its hidden patterns.

As a Data Scientist or ML Engineer, your primary task is to listen to the data in as many ways as possible so that you can tell a credible story about it.

Levels of Data Analysis

Every stage of the ML problem requires a distinct type of Data Analysis to be performed.

As the image above shows, there are four levels of data analysis, and each has its own significance in an ML task.

Initial Data Analysis (IDA)

IDA is the first activity we perform when working on a Data Science problem, and it addresses the following issues:

Detect duplicate records
Detect if any feature requires character encoding
Detect the datatype incompatibility in features
Detect NULL or missing values
Detect the categorical variables which can be coded as 0 or 1
Detect the quantitative features with different measurement units
Detect dual-peak (bimodal) or skewed distributions in features
Detect outliers in the data
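To make a few of these checks concrete, here is a minimal sketch using pandas. The dataset, its column names, and the 1.5 × IQR outlier rule are assumptions chosen for illustration; they are not taken from a real project, and other outlier rules are equally valid.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset, invented purely for illustration.
df = pd.DataFrame({
    "age": [25, 32, 47, 32, 51, 29],
    "income": [40_000, 52_000, np.nan, 52_000, 1_000_000, 45_000],
    "gender": ["F", "M", "M", "M", "F", "F"],
    "height": ["170", "182", "175", "182", "168", "160"],  # numbers stored as text
})

# Duplicate records: count rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

# Datatype incompatibility: non-numeric columns whose values all parse as numbers.
text_cols = df.select_dtypes(exclude="number").columns
numeric_as_text = [
    col for col in text_cols
    if pd.to_numeric(df[col], errors="coerce").notna().all()
]

# NULL or missing values per column.
missing_counts = df.isna().sum()

# Categorical variables with exactly two levels can be coded as 0 or 1.
binary_cols = [col for col in text_cols if df[col].nunique() == 2]

# Skewed distribution: a clearly positive value hints at a long right tail.
income_skew = df["income"].skew()

# Outliers: values outside 1.5 * IQR from the quartiles (one common rule of thumb).
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
income_outliers = df.loc[outlier_mask, "income"].tolist()

print("duplicate rows:", n_duplicates)
print("numeric columns stored as text:", numeric_as_text)
print("missing values per column:")
print(missing_counts)
print("binary categorical columns:", binary_cols)
print("income outliers:", income_outliers)
```

Each check only detects an issue. How to resolve it (drop, impute, convert, rescale) depends on the dataset and the problem at hand.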

Exploratory Data Analysis (EDA)

After performing IDA, we investigate the dataset with the aim of discovering patterns. This is known as Exploratory Data Analysis.

It addresses the broad question “what is going on here?”.
It requires detective-like work from you to gain an understanding of the data.
It focuses mainly on learning from the data.
It answers questions about the data using visual methods.
In EDA, we consider numerous hypotheses, look for patterns, and suggest new hypotheses based on the data.
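As a small illustration of this pattern-hunting, the sketch below uses numeric summaries on a made-up dataset. The column names and values are hypothetical. It covers only the non-visual first pass; plots such as the heatmaps and pair plots from the last blog would be the natural next step.

```python
import pandas as pd

# Hypothetical study-habits data, invented for illustration only.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "exam_score": [52, 55, 61, 64, 70, 74, 79, 85],
    "group": ["A", "B", "A", "B", "A", "B", "A", "B"],
})

# Broad overview: count, mean, spread, and quartiles of the numeric columns.
summary = df.describe()

# Do the two groups differ in their average score?
group_means = df.groupby("group")["exam_score"].mean()

# Is there a linear relationship between study time and score?
correlation = df["hours_studied"].corr(df["exam_score"])

print(summary)
print(group_means)
print("correlation between hours studied and exam score:", round(correlation, 3))
```

A strong correlation here only suggests a hypothesis ("more study time goes with higher scores"). Confirming it is a separate step.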

Pre-Confirmatory Data Analysis (PCDA)

In Pre-CDA, we make an initial assessment of all the credible models using evaluation methods.
In this step, we pick the best model based on the results of cross-validation carried out on different subsets of the data.
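A minimal sketch of this model comparison is shown below, assuming k-fold cross-validation with mean squared error as the evaluation method. The synthetic data, the two candidate models, and the helper names (`k_fold_mse`, `fit_mean`, `fit_line`, and so on) are all hypothetical and hand-rolled with NumPy for illustration. In practice, a library such as scikit-learn provides ready-made cross-validation utilities.

```python
import numpy as np


def k_fold_mse(x, y, fit, predict, k=5, seed=0):
    """Return the mean squared error averaged over k cross-validation folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train_idx], y[train_idx])  # train on k - 1 folds
        residuals = y[test_idx] - predict(model, x[test_idx])  # score on held-out fold
        errors.append(np.mean(residuals ** 2))
    return float(np.mean(errors))


# Candidate 1: a baseline that always predicts the training mean.
def fit_mean(x, y):
    return y.mean()


def predict_mean(model, x):
    return np.full(len(x), model)


# Candidate 2: a straight line fitted by least squares.
def fit_line(x, y):
    return np.polyfit(x, y, deg=1)


def predict_line(model, x):
    return np.polyval(model, x)


# Synthetic data (an assumption for this sketch): a noisy linear trend.
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=50)

scores = {
    "mean_baseline": k_fold_mse(x, y, fit_mean, predict_mean),
    "linear": k_fold_mse(x, y, fit_line, predict_line),
}
best_model = min(scores, key=scores.get)  # lowest cross-validated error wins

print(scores)
print("best model:", best_model)
```

Both candidates are scored on exactly the same folds (same `seed`), so the comparison reflects the models and not a lucky split.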

Confirmatory Data Analysis (CDA)

In confirmatory data analysis, we test specific hypotheses using probabilistic techniques such as confidence intervals or significance tests.
CDA is the process of assessing evidence by questioning the assumptions made about the data. It’s like examining evidence and questioning witnesses in a trial to determine the guilt or innocence of the defendant.
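The two techniques named above can be sketched with the Python standard library alone. The measurements are invented for illustration, and the helper names are hypothetical. The confidence interval uses a normal approximation (for a sample this small, a t-based interval would be more appropriate), and the significance test is a permutation test, which is only one of several possible choices.

```python
import random
from statistics import NormalDist, mean, stdev

# Hypothetical measurements, invented for illustration only.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.0, 12.1]
treated = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2, 12.6, 13.0, 12.9, 13.1]


def mean_confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for the sample mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(sample) / len(sample) ** 0.5
    centre = mean(sample)
    return centre - half_width, centre + half_width


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


ci_low, ci_high = mean_confidence_interval(treated)
p_value = permutation_p_value(control, treated)

print("95% CI for the treated mean:", (round(ci_low, 2), round(ci_high, 2)))
print("permutation p-value:", p_value)
```

A small p-value means that random relabelling of the groups rarely produces a difference as large as the observed one, which is evidence against the hypothesis that the two groups share the same mean.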

The goal of EDA is indictment; the goal of CDA is conviction. (By: Behrens & Smith)

Congratulations, we have come to the end of this blog. To summarize, we covered the different levels of data analysis. In the next blog, I’ll demonstrate each level of data analysis on real-life datasets.

Thank you and happy learning!!

Blog-14: Initial Data Analysis (IDA) with example