Chapter-4 Knowledge from the data and Data Exploration Analysis

Ashish Patel
Published in ML Research Lab
6 min read · Jul 3, 2018

Machine Learning Series!!!

Hello folks! I hope you are doing well, building strong data science skills, and getting the best results in your career. So far in this series you have seen the articles listed below, which cover the basics you need to know. If you are not familiar with these topics, go through them first.

  1. Chapter-1 Machine Learning Introduction
  2. Chapter-2 Data and It’s Different Types
  3. Chapter-3 Bias and Variance Trade-off in Machine Learning
  4. Your Practice Book

Today we discuss the starting point of any data science process: data exploration. Before that, I will explain the Knowledge Data Discovery (KDD) process. I have split this article into two parts so that you can follow it more easily.

Outline

  1. Knowledge Data Discovery
  2. Data Exploration
  3. Practical Guide of Data Exploration ← Click on this for Practice
  4. Your practice Guide

1. Knowledge Data Discovery

Definition: “Data mining, also popularly referred to as knowledge discovery from data (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, other massive information repositories or data streams.”

What is KDD?

  • computational theories and tools that assist humans in extracting useful information (i.e. knowledge) from digital data
  • development of methods and techniques for making sense of data
  • mapping of low-level data into other forms that are more compact, more abstract, or more useful
  • the core of the KDD process employs “data mining”

Why KDD?

  • datasets are growing extremely large: billions of records, each with hundreds to thousands of fields
  • analysis of data must be automated
  • computers enable us to generate amounts of data too large for humans to digest, thus we should use computers to discover meaningful patterns and structures from the data.

An Outline of the Steps of the KDD Process

Figure: KDD process (Source)

1. Develop an understanding of the application domain and identify the goal.

2. Create a target dataset

  • selecting a dataset or focusing on a subset of samples or variables on which to make discoveries

3. Data cleaning and preprocessing

  • removal of noise and outliers
  • collecting necessary information to model or account for noise
  • handling of missing data
  • accounting for time-sequence information

4. Data reduction and projection

  • finding useful features to represent the data relative to the goal
  • dimensionality reduction or transformation to reduce the number of variables
  • identification of invariant representations
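The dimensionality-reduction bullet can be illustrated with a small sketch. It assumes NumPy is installed and uses principal component analysis (PCA) as one possible technique; the KDD outline itself does not prescribe a method. The helper name `pca_project` and the synthetic data are made up for illustration.

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto the top principal components (hypothetical helper)."""
    X_centered = X - X.mean(axis=0)            # PCA works on mean-centered columns
    # SVD of the centered data: the rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)      # share of variance per component
    return X_centered @ Vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 2))
# Five observed variables, all linear mixes of two hidden factors plus small noise
X = hidden @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))

Z, explained = pca_project(X, n_components=2)
print(Z.shape)          # five variables reduced to two
print(explained.sum())  # fraction of variance kept by the two components
```

Because the five columns were built from only two factors, two components keep almost all of the variance. On real data the number of components to keep is a judgment call tied to the goal of the analysis.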

5. Selection of appropriate data-mining task

  • summarization, classification, regression, clustering, etc.

6. Selection of data-mining algorithm(s)

  • methods to search for patterns
  • decision of which models and parameters may be appropriate
  • match method to goal of KDD process

7. Data mining

  • searching for patterns of interest in one or more representational forms

8. Interpretation and visualization

  • interpretation of mined patterns
  • visualization of extracted patterns and models
  • visualization of the data given the extracted models

9. Consolidating discovered knowledge

  • incorporating the discovered knowledge into another system
  • documenting and reporting knowledge to interested parties
  • checking for inconsistencies with other prior extracted or believed knowledge

The terms knowledge discovery and data mining are distinct.

KDD refers to the overall process of discovering useful knowledge from data. It involves the evaluation and possibly interpretation of the patterns to make the decision of what qualifies as knowledge. It also includes the choice of encoding schemes, preprocessing, sampling, and projections of the data prior to the data mining step.
Data mining refers to the application of algorithms for extracting patterns from data without the additional steps of the KDD process.

2. Exploratory Data Analysis

Exploratory Data Analysis (EDA) is used, on the one hand, to answer questions, test business assumptions, and generate hypotheses for further analysis. On the other hand, you can also use it to prepare the data for modeling. What these two uses have in common is that both require a good knowledge of your data, either to get the answers you need or to develop an intuition for interpreting the results of future modeling.

There are no shortcuts for data exploration. If you believe that machine learning can sail you away from every data storm, trust me, it won’t. At some point you will find yourself struggling to improve your model’s accuracy. In that situation, data exploration techniques will come to your rescue. The video and cheat sheet in this article summarize data exploration and will give you a basic understanding of EDA.

The steps for data exploration are in this order:

1. Variable Identification:

First, we have to identify the type of every variable (continuous, categorical, and so on) and its role in the dataset (input variable or output variable).
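Variable identification can be sketched in a few lines, assuming pandas is installed. The data, the choice of `churned` as the output variable, the helper `variable_kind`, and the "few distinct values means categorical" heuristic are all assumptions made for illustration; on real data the types should be checked against a data dictionary.

```python
import pandas as pd

# Made-up data; "churned" is assumed to be the output (target) variable
df = pd.DataFrame({
    "age": [23, 35, 41, 29],
    "income": [32000.0, 54000.5, 61000.0, 45000.0],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "churned": [0, 1, 0, 1],
})
target = "churned"

def variable_kind(series, max_categories=2):
    """Heuristic label: numeric columns with many distinct values are continuous."""
    if pd.api.types.is_numeric_dtype(series) and series.nunique() > max_categories:
        return "continuous"
    return "categorical"

# One row per variable: its type and its role in the dataset
summary = pd.DataFrame(
    {
        "kind": [variable_kind(df[col]) for col in df.columns],
        "role": ["output" if col == target else "input" for col in df.columns],
    },
    index=df.columns,
)
print(summary)
```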

2. Univariate Analysis:

2.1 For continuous variables: We can build a histogram and a boxplot for each continuous variable independently. These figures give us an understanding of each variable’s central tendency and spread.

2.2 For categorical variables: We build a bar chart visualization that shows the frequencies in each category.
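The univariate plots described in 2.1 and 2.2 might look like the following sketch, assuming matplotlib, NumPy, and pandas are installed. The column names and the random data are invented, and the `Agg` backend is chosen only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs on a server
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "income": rng.normal(50000, 12000, size=300),    # continuous variable
    "segment": rng.choice(["A", "B", "C"], size=300),  # categorical variable
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Continuous: the histogram shows the shape, the boxplot shows median and spread
axes[0].hist(df["income"], bins=20)
axes[0].set_title("Histogram of income")
axes[1].boxplot(df["income"])
axes[1].set_title("Boxplot of income")

# Categorical: bar chart of the frequency of each category
counts = df["segment"].value_counts().sort_index()
axes[2].bar(list(counts.index), counts.values)
axes[2].set_title("Frequency of segment")

fig.tight_layout()
```

In a notebook, `plt.show()` displays the figure; in a script, `fig.savefig(...)` writes it to a file of your choice.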

3. Bivariate Analysis:

3.1 Continuous & Continuous: We can build a scatter plot to see how two continuous variables relate to each other.

3.2 Categorical & Categorical: A Stacked Column Chart is a good visualization that shows how the frequencies are spread between the two categorical variables.

3.3 Categorical & Continuous: Here, the best visualization in my opinion is a boxplot combined with a swarm plot.
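Each of the three bivariate cases has a numeric summary that sits behind the corresponding plot. The sketch below computes those summaries, assuming NumPy and pandas are installed; the columns and the synthetic data are invented, and `score` is deliberately constructed to depend on `hours_studied`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 200
hours = rng.uniform(0, 10, size=n)
df = pd.DataFrame({
    "hours_studied": hours,
    "score": 40 + 5 * hours + rng.normal(0, 5, size=n),  # built to depend on hours
    "gender": rng.choice(["F", "M"], size=n),
    "passed": rng.choice(["no", "yes"], size=n),
})

# Continuous & continuous: Pearson correlation summarizes what a scatter plot shows
r = df["hours_studied"].corr(df["score"])

# Categorical & categorical: this frequency table is what a stacked column chart draws
table = pd.crosstab(df["gender"], df["passed"])

# Categorical & continuous: per-category quartiles are what side-by-side boxplots draw
by_group = df.groupby("passed")["score"].describe()

print(round(r, 2))
print(table)
print(by_group[["25%", "50%", "75%"]])
```

If seaborn is available, `seaborn.boxplot` followed by `seaborn.swarmplot` on the same axes gives the combined view mentioned in 3.3.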

4. Detecting / Treating missing values:

This phase is more of an art than a systematic approach, and it usually depends on the problem at hand. However, I will describe what I usually do in two different situations:

4.1 If we have only a few missing values and they appear to be random, we can simply delete those cases.

4.2 If we have many missing values, we do not want to delete them, because that would leave us with a much smaller dataset and hurt the predictive model’s performance. In this case I would either replace the values with the median, mean, or mode, and/or add another column that indicates whether the original variable was missing. In the latter case, the newly added column should ideally correlate with the output variable, which would make it a potentially good predictor.
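Both situations can be sketched with pandas, assuming it and NumPy are installed. The data is made up, and median imputation is just one of the options named above; the `_was_missing` suffix is an arbitrary naming choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, np.nan, 52],
    "income": [30000, 42000, np.nan, 58000, 61000, 75000],
})

# Detection: count the missing values in each column
missing_counts = df.isna().sum()

# Situation 4.1: few missing values that look random, so drop the affected rows
dropped = df.dropna()

# Situation 4.2: many missing values, so keep the rows, flag them, then impute
imputed = df.copy()
for col in ["age", "income"]:
    imputed[col + "_was_missing"] = df[col].isna().astype(int)  # indicator column
    imputed[col] = df[col].fillna(df[col].median())             # median imputation

print(missing_counts)
print(imputed)
```

The indicator columns are created before imputation so that the information about which rows were missing is not lost.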

5. Detecting / Treating outliers:

Having many outliers in the dataset can harm the predictive model’s performance, so it is worth treating them. In this phase, too, there is no systematic approach, so I will just share some of my thoughts here.

5.1 During the detection phase, the best visualization approaches are boxplots for univariate analysis and scatter plots for bivariate analysis.

5.2 During the treatment phase, we can delete the outliers if they are very few. If not, we can apply a special treatment such as imputing them, or treat them separately by giving them their own predictive model.
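A boxplot flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles, and the same rule can be applied numerically. The sketch below assumes pandas is installed and uses made-up values. The 1.5 multiplier is the common boxplot convention rather than something this article specifies, and imputing with the median of the remaining values is only one possible treatment.

```python
import pandas as pd

values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 95], name="delivery_days")

# Detection: the boxplot whisker rule, 1.5 * IQR beyond the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Treatment A: delete the outliers when they are very few
trimmed = values[~is_outlier]

# Treatment B: impute the outliers with the median of the remaining values
imputed = values.where(~is_outlier, trimmed.median())

print(values[is_outlier])
print(imputed.tolist())
```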

6. Feature Engineering:

During this phase we try to derive better variables (predictors) from the existing ones. For example, if we have a date variable, we can create new variables from it such as weekday/weekend, the day of the week (Monday, Tuesday, and so on), and others. A newly created variable can turn out to be a good predictor if it correlates with the output variable.
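The date example can be sketched with the pandas `.dt` accessor, assuming pandas is installed. The column names and the values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2018-07-02", "2018-07-03", "2018-07-07", "2018-07-08"]
    ),
    "amount": [120.0, 80.5, 310.0, 95.0],
})

# Derive new candidate predictors from the date column
df["day_name"] = df["order_date"].dt.day_name()    # Monday, Tuesday, ...
df["day_of_week"] = df["order_date"].dt.dayofweek  # Monday=0 ... Sunday=6
df["is_weekend"] = df["day_of_week"] >= 5          # Saturday or Sunday
df["month"] = df["order_date"].dt.month

print(df[["order_date", "day_name", "is_weekend"]])
```

Whether any of these derived columns helps depends on how they relate to the output variable, which can be checked with the bivariate analysis from step 3.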

Video: What is Exploratory Data Analysis? What does Exploratory Data Analysis mean?
Figure: EDA (Source)

7. Visualization guide

Here are the types of visualizations and the Python packages we find most useful for data exploration.

Univariate

Categorical:

Continuous:

Bivariate

Categorical x categorical

Categorical x continuous

Continuous x continuous

Multivariate

Timeseries

  • Line plots
  • Any bivariate plot with time or time period as the x-axis.

Panel data

  • Heat map with rows denoting observations and columns denoting time periods.
  • Multiple line plots
  • Strip plot where time is on the x-axis, each entity has a constant y-value and a point is plotted every time an event is observed for that entity.
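For the heat map bullet, panel data usually has to be reshaped first so that each row is an observation and each column is a time period. A minimal sketch of that reshaping, assuming pandas is installed and using invented store and sales data:

```python
import pandas as pd

# Long-format panel data: one row per (entity, period) pair; values are made up
long_df = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "month": ["2018-01", "2018-02", "2018-03", "2018-01", "2018-02", "2018-03"],
    "sales": [10, 12, 15, 7, 9, 8],
})

# Rows = observations, columns = time periods: the grid a heat map would color
grid = long_df.pivot(index="store", columns="month", values="sales")
print(grid)
```

The resulting `grid` can be passed to `seaborn.heatmap`, or `grid.values` to `matplotlib.pyplot.imshow`, to draw the heat map itself.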

Geospatial

  • Choropleths: regions colored according to their data value.

Helpful packages

  • matplotlib: basic plotting
  • seaborn: prettier versions of some matplotlib figures
  • mpld3: interactive plotting
  • folium: geospatial plotting

References:

  1. https://www.tutorialspoint.com/data_mining/dm_knowledge_discovery.htm
  2. https://www.datacamp.com/community/tutorials/preprocessing-in-data-science-part-1-centering-scaling-and-knn
  3. https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/
  4. http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
