Exploratory Data Analysis in [ML]

Sonal Dev
Published in Catalysts Reachout
5 min read · Nov 2, 2022

What is exploratory data analysis?

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task, and it provides a better understanding of data set variables and the relationships between them. It can also help determine whether the statistical techniques you are considering for data analysis are appropriate. Originally developed by American mathematician John Tukey in the 1970s, EDA techniques continue to be widely used in the data discovery process today.

Why is exploratory data analysis important in data science?

The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.

Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to any desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the right questions. EDA can help answer questions about standard deviations, categorical variables, and confidence intervals. Once EDA is complete and insights are drawn, its features can then be used for more sophisticated data analysis or modeling, including machine learning.

Programming Language Used

Python: an interpreted, object-oriented programming language with dynamic semantics. Its high-level, built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for rapid application development, as well as for use as a scripting or glue language to connect existing components together. Python and EDA can be used together to identify missing values in a data set, which is important so you can decide how to handle missing values for machine learning.
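As a minimal sketch of the missing-value check mentioned above, the snippet below counts missing entries per column with pandas. The data set and its column names (`age`, `city`, `income`) are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data set; the column names and values are illustrative only.
df = pd.DataFrame(
    {
        "age": [25, np.nan, 47, 31, np.nan],
        "city": ["Pune", "Delhi", None, "Mumbai", "Delhi"],
        "income": [52000, 61000, 58000, np.nan, 45000],
    }
)

# Number of missing values in each column.
missing_counts = df.isna().sum()

# Share of missing values in each column, as a percentage.
missing_pct = df.isna().mean() * 100

print(missing_counts)
print(missing_pct.round(1))
```

Columns with a high share of missing values are the ones where a decision (drop, impute, or flag) is usually needed before modeling.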

Types of Exploratory Data Analysis:

  1. Univariate Non-graphical
  2. Multivariate Non-graphical
  3. Univariate graphical
  4. Multivariate graphical

1. Univariate Non-graphical: Because only one variable is examined at a time, this is the most basic type of data analysis. The basic objective of univariate non-graphical EDA is to understand the sample distribution of the underlying data in order to draw conclusions about the population. The analysis also includes outlier detection. The population distribution’s characteristics include:

  • Central tendency: The central tendency, or location, of a distribution describes its typical or middle values. The mean, the median, and occasionally the mode are the usual measures of central tendency, with the mean being the most common. The median may be preferred when the distribution is skewed or when outliers are a concern.
  • Spread: Spread indicates how far from the centre we should look to find the data values. The variance and the standard deviation are two useful measures of spread. The variance is the mean of the squared individual deviations, and the standard deviation is the square root of the variance.
  • Skewness and kurtosis: Two more useful univariate descriptors are the skewness and kurtosis of the distribution. Skewness is a measure of asymmetry, and kurtosis is a more subtle measure of peakedness compared to a normal distribution.
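The descriptors above can be sketched with Python's standard library alone. The sample is hypothetical and deliberately contains one large value, so the gap between mean and median is visible. Skewness and excess kurtosis are computed here in their simple moment-based form, which is an assumption of this sketch; libraries such as pandas apply bias corrections and report slightly different values.

```python
import statistics

# Hypothetical sample with one large value (95) acting as an outlier.
sample = [12, 15, 14, 10, 18, 16, 14, 13, 95]

n = len(sample)

# Central tendency.
mean = statistics.mean(sample)
median = statistics.median(sample)
mode = statistics.mode(sample)

# Spread: sample variance (divides by n - 1) and its square root.
variance = statistics.variance(sample)
std_dev = statistics.stdev(sample)

# Moment-based skewness and excess kurtosis (dividing by n, no bias correction).
m2 = sum((x - mean) ** 2 for x in sample) / n
m3 = sum((x - mean) ** 3 for x in sample) / n
m4 = sum((x - mean) ** 4 for x in sample) / n
skewness = m3 / m2 ** 1.5
excess_kurtosis = m4 / m2 ** 2 - 3

print(f"mean={mean}, median={median}, mode={mode}")
print(f"variance={variance:.2f}, std_dev={std_dev:.2f}")
print(f"skewness={skewness:.2f}, excess_kurtosis={excess_kurtosis:.2f}")
```

Here the single outlier pulls the mean well above the median and produces a positive skewness, which is the situation in which the median is the safer summary.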

2. Multivariate Non-graphical: Multivariate non-graphical EDA techniques typically use cross-tabulation or summary statistics to show the relationship between two or more variables.

  • For categorical data, an extension of tabulation called cross-tabulation is extremely useful. For two variables, cross-tabulation is performed by making a two-way table with column headings that match the levels of one variable and row headings that match the levels of the other variable, then filling in the counts of all subjects that share the same pair of levels.
  • For one categorical variable and one quantitative variable, we compute statistics for the quantitative variable separately for each level of the categorical variable, then compare the statistics across the levels of the categorical variable.
  • Comparing the means is an informal version of one-way ANOVA, and comparing medians is a robust version of one-way ANOVA.
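A rough illustration of both ideas with pandas follows, assuming a small hypothetical survey table. The column names `gender`, `plan`, and `spend` are invented for this sketch.

```python
import pandas as pd

# Hypothetical survey data; column names and values are illustrative only.
df = pd.DataFrame(
    {
        "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
        "plan": ["basic", "basic", "premium", "basic",
                 "premium", "premium", "basic", "basic"],
        "spend": [20, 25, 60, 22, 75, 58, 18, 30],
    }
)

# Two-way table of counts: one row per gender level, one column per plan level.
counts = pd.crosstab(df["gender"], df["plan"])

# Statistics of the quantitative variable, computed separately for each plan level.
by_plan = df.groupby("plan")["spend"].agg(["count", "mean", "median", "std"])

print(counts)
print(by_plan)
```

Reading the `mean` and `median` columns of `by_plan` side by side is the informal comparison across levels described in the list above.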

3. Univariate graphical: Non-graphical methods are quantitative and objective, but they do not give the complete picture of the data; therefore, graphical methods, which involve a degree of subjective analysis, are also required. Common types of univariate graphics are:

  • Histogram: The most basic graph is the histogram, which is a barplot in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values. Histograms are one of the simplest ways to quickly learn a lot about your data, including central tendency, spread, modality, shape and outliers.
  • Stem-and-leaf plots: A simple substitute for a histogram is the stem-and-leaf plot. It shows all data values and the shape of the distribution.
  • Boxplots: Another very useful univariate graphical technique is the boxplot. Boxplots are excellent at presenting information about central tendency and show robust measures of location and spread, as well as providing information about symmetry and outliers, although they can be misleading about aspects like multimodality. One of the best uses of boxplots is in the form of side-by-side boxplots.
  • Quantile-normal plots: The final univariate graphical EDA technique is the most intricate. It is called the quantile-normal (QN) plot or, more generally, the quantile-quantile (QQ) plot. It is used to see how well a particular sample follows a particular theoretical distribution. It allows detection of non-normality and diagnosis of skewness and kurtosis.
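To make two of these displays concrete, here is a standard-library sketch that prints a text stem-and-leaf plot and computes the point pairs a quantile-normal plot would draw. The sample values and the helper names `stem_and_leaf` and `qq_points` are hypothetical, and the plotting position `(i + 0.5) / n` is one common convention among several.

```python
from collections import defaultdict
from statistics import NormalDist, mean, stdev

# Hypothetical sample of two-digit measurements.
sample = [23, 25, 31, 34, 34, 37, 41, 42, 45, 48, 52, 58, 67]


def stem_and_leaf(values):
    """Return text lines with the tens digit as stem and the units digits as leaves."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    return [
        f"{stem} | {''.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(groups.items())
    ]


def qq_points(values):
    """Pair each sorted value with the normal quantile expected at its rank."""
    ordered = sorted(values)
    n = len(ordered)
    # Normal distribution with the sample's own mean and standard deviation.
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # (i + 0.5) / n is one common choice of plotting position.
    return [(fitted.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(ordered)]


for line in stem_and_leaf(sample):
    print(line)

for theoretical, observed in qq_points(sample):
    print(f"{theoretical:6.1f}  {observed}")
```

If the observed values track the theoretical quantiles along a roughly straight line, the sample is consistent with a normal distribution; systematic curvature points to skewness or heavy tails.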

4. Multivariate graphical: Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The only one used commonly is a grouped barplot, with each group representing one level of one of the variables and each bar within a group representing a level of the other variable.

Other common types of multivariate graphics are:

  • Scatterplot: For two quantitative variables, the essential graphical EDA technique is the scatterplot, which has one variable on the x-axis, the other on the y-axis, and a point for every case in your dataset.
  • Run chart: A line graph of data plotted over time.
  • Heat map: A graphical representation of data in which values are depicted by color.
  • Multivariate chart: A graphical representation of the relationships between factors and a response.
  • Bubble chart: A data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
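Two of these graphics, the scatterplot and the heat map, can be sketched with NumPy and Matplotlib as below. The variables `hours`, `score`, and `noise` are randomly generated, hypothetical data, and using a correlation matrix as the heat map's input is a choice made for this example only.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: three quantitative variables, two of them related.
rng = np.random.default_rng(seed=0)
hours = rng.uniform(0, 10, size=50)
score = 5 * hours + rng.normal(0, 5, size=50)
noise = rng.normal(0, 1, size=50)

names = ["hours", "score", "noise"]
corr = np.corrcoef([hours, score, noise])  # 3 x 3 correlation matrix

fig, (ax_scatter, ax_heat) = plt.subplots(1, 2, figsize=(10, 4))

# Scatterplot: one variable per axis, one point per case.
ax_scatter.scatter(hours, score)
ax_scatter.set_xlabel("hours")
ax_scatter.set_ylabel("score")
ax_scatter.set_title("Scatterplot")

# Heat map: the color of each cell encodes a correlation value.
image = ax_heat.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax_heat.set_xticks(range(len(names)))
ax_heat.set_xticklabels(names)
ax_heat.set_yticks(range(len(names)))
ax_heat.set_yticklabels(names)
ax_heat.set_title("Correlation heat map")
fig.colorbar(image, ax=ax_heat)

fig.tight_layout()
```

Calling `plt.show()` or `fig.savefig(...)` afterwards displays or stores the figure.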
