EDA-DataScience

Basic Questions on EDA Analysis for Data Science

Ajay Maurya
4 min read · Jul 17, 2022

What is EDA, and why do we need it?

  1. EDA stands for exploratory data analysis. It helps us decide which columns of the given data set need to be considered.
  2. Identify whether each variable is categorical or numerical.
  3. Run univariate analysis on each variable.
  4. Use bivariate analysis on any two variables to see how they affect each other.
  5. Visualise the data with a pair plot.
  6. Transform variables into numeric form where the analysis requires it.
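The first few steps above can be sketched in plain Python. This is only a minimal illustration: the column names, the sample values and the helper names `column_kind` and `univariate_summary` are assumptions made for this example, not part of any library.

```python
from collections import Counter
from statistics import mean, median


def is_numeric(value):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def column_kind(values):
    """Label a column as numerical or categorical from its observed values."""
    observed = [v for v in values if v is not None]
    if observed and all(is_numeric(v) for v in observed):
        return "numerical"
    return "categorical"


def univariate_summary(values):
    """Basic univariate analysis: statistics for numbers, counts for categories."""
    observed = [v for v in values if v is not None]
    if column_kind(values) == "numerical":
        return {
            "min": min(observed),
            "max": max(observed),
            "mean": mean(observed),
            "median": median(observed),
        }
    return dict(Counter(observed))


# Hypothetical data set, used only for illustration
data = {
    "age": [25, 31, 40, 28],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
}

kinds = {name: column_kind(col) for name, col in data.items()}
summaries = {name: univariate_summary(col) for name, col in data.items()}
print(kinds)
print(summaries)
```

In practice a data-frame library would usually do this work; the sketch only shows what "identify the variable type" and "univariate analysis" amount to.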

Are missing values always missing due to chance alone?

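  1. No. Suppose a survey asks people for their income and some people do not respond. If that happens purely by chance, the data is missing completely at random (MCAR).
  2. Under MCAR, the probability of a value being missing is the same for every row in the column.
  3. If one subset, for example the rows where the designation is manager, has more missing values (say, because managers did not reveal their income), the data is missing at random (MAR).
  4. Under MAR, the missingness depends on another observed column. Here the designation column itself has no missing values.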
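One rough way to explore this is to compare the share of missing values across the groups of another column. The sketch below assumes rows are plain dictionaries and that a missing income is stored as `None`; the data and the helper name `missing_rate_by_group` are hypothetical.

```python
from collections import defaultdict


def missing_rate_by_group(rows, group_key, value_key):
    """Return the fraction of missing (None) values of value_key per group."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_key]
        total[group] += 1
        if row[value_key] is None:
            missing[group] += 1
    return {group: missing[group] / total[group] for group in total}


# Hypothetical survey rows, used only for illustration
rows = [
    {"designation": "manager", "income": None},
    {"designation": "manager", "income": None},
    {"designation": "manager", "income": 90000},
    {"designation": "manager", "income": None},
    {"designation": "engineer", "income": 60000},
    {"designation": "engineer", "income": 65000},
    {"designation": "engineer", "income": None},
    {"designation": "engineer", "income": 70000},
]

rates = missing_rate_by_group(rows, "designation", "income")
print(rates)
```

Similar rates across groups are consistent with MCAR, while a clear gap between groups, as in this made-up sample, suggests the missingness depends on the designation. On a sample this small the gap proves nothing; it is only a prompt to look closer.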

What is your approach to handling missing values?

  1. Evaluate all variables in the given data set and check each column the analysis requires.
  2. Get the percentage or total number of rows affected by missing values. If the percentage is very low, drop those rows; otherwise continue with the steps that follow.
  3. If the share of missing values is very high, check the quantiles or the median of the column.
  4. Fill the missing values (NA) if the column is relevant and may affect the analysis.
  5. Use the mean or median for numerical data and the mode for categorical data.
  6. Drop rows only when the values are missing completely at random.
  7. Replace or impute the values with the mean, median, etc.
  8. Create another level for missing categorical data.
  9. Run predictive models to impute missing data: take the column with missing values as the target column. The training data is the rows that have a value in the target column, and the test data is the rows where the value is missing.
  10. Build the model using the training data, then use it to predict the missing values.
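The simpler imputation options in the list (mean or median for numbers, mode or an extra level for categories) can be sketched as follows. Missing values are assumed to be stored as `None`, and the function names and sample columns are hypothetical, chosen only for this example.

```python
from statistics import mean, median, mode


def impute_numeric(values, strategy="median"):
    """Fill None entries with the median (default) or mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if strategy == "median" else mean(observed)
    return [fill if v is None else v for v in values]


def impute_categorical(values, new_level=None):
    """Fill None entries with the mode, or with a separate level if one is given."""
    observed = [v for v in values if v is not None]
    fill = new_level if new_level is not None else mode(observed)
    return [fill if v is None else v for v in values]


# Hypothetical columns, used only for illustration
ages = [25, None, 31, 40, None, 28]
cities = ["Pune", None, "Delhi", "Pune"]

ages_filled = impute_numeric(ages)                        # median of the observed ages
cities_mode = impute_categorical(cities)                  # most frequent city
cities_level = impute_categorical(cities, new_level="Missing")
print(ages_filled, cities_mode, cities_level)
```

The median is often the safer default when the column has extreme values, because the mean gets pulled towards them.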

Does a correlation coefficient of 0 between two numeric variables mean there is no relationship between them?

Correlation doesn’t imply causation. The correlation coefficient measures the LINEAR relationship between two variables, but it does not show that one causes the other. It can be positive, negative or zero. With positive correlation (up to +1), when one value increases, the other one increases as well. Negative correlation (down to -1) means that when one value increases, the other one decreases. Zero (0) correlation means that the rise or fall of one value does not visibly affect the other in a linear way.

While zero correlation shows no linear relationship between the values, it does not mean that there is no relationship between them at all.
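A small sketch makes this concrete. The Pearson coefficient is computed by hand below for two made-up series: one exactly linear (y = 2x + 1) and one exactly quadratic (y = x²) over x values placed symmetrically around zero. The function name `pearson_r` and the data are assumptions for this illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


xs = [-3, -2, -1, 0, 1, 2, 3]

# y is fully determined by x in both cases
r_linear = pearson_r(xs, [2 * x + 1 for x in xs])
r_quadratic = pearson_r(xs, [x ** 2 for x in xs])
print(r_linear, r_quadratic)
```

In the quadratic case y depends completely on x, yet the coefficient comes out as zero because the positive and negative halves cancel. A scatter plot would reveal the relationship that the coefficient misses.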

For example: the number of people who drowned by falling into a pool vs the number of movies Nicolas Cage appeared in. The two numbers may move together, but one does not cause the other: people did not fall into pools because Cage appeared in a movie.

Coefficient values:

  • Close to +1 (for example +0.9): strong positive relation. Close to -1 (for example -0.9): strong negative relation. 0: no linear relation.
  • It is observed that the more ice creams are sold in a city, the higher the number of murders in that city, i.e. there seems to be a high correlation between the number of ice creams sold and the number of murders committed in a city.

Can we say that higher ice cream sales in the city cause more murders?

Correlation between two variables does not always mean causation. In the example above we cannot say that ice cream makes people murder each other or that higher ice cream sales cause higher murder rates. A hidden factor, such as hot weather, may drive both numbers. We can see an example of causation in lightning and thunder. The more lightning strikes, the more thunder you will hear, because the lightning causes the thunder.

How do we confirm causation between the variables?

Causation can be implied only when the two variables depend on each other and remain related once hidden factors and time are taken into account. For example, drinking impure water causes more disease, so in this case the two variables are causally related. Even so, we may not say that impure water is the only cause of the disease.

For a pair of two numeric variables, which plot should be used?

  1. Use a scatter plot for two numeric variables.
  2. Use a line plot when the points are connected in sequence, for example values that continue over time.

While dealing with multi-dimensional data, how can we visualise more than two variables (say 3 or 4 variables) in 2-D without using dimension reduction techniques?

Faceting on attributes (splitting one plot into small panels, one per attribute value) helps to plot 4-D data.

How do you handle outliers?

  1. Drop the observations.
  2. Replace the extreme values with fixed values.
  3. Impute the outliers with the mean, median, etc.
  4. Run a model for predicting outliers below the 5th percentile and above the 99th percentile.
  5. Run models for predicting outlier observations.
  6. Transform the variable.
  7. Bin the continuous variable so that extreme values are handled within the data set.
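Two of these options, replacing extreme values with fixed values and binning a continuous variable, can be sketched as below. The bounds, bin edges, labels and income figures are all hypothetical; in a real analysis they would come from percentiles or domain knowledge.

```python
from bisect import bisect_right


def cap_outliers(values, lower, upper):
    """Replace values outside [lower, upper] with the nearest bound."""
    return [min(max(v, lower), upper) for v in values]


def bin_value(value, edges, labels):
    """Map a continuous value to a label; edges are the sorted bin boundaries."""
    return labels[bisect_right(edges, value)]


# Hypothetical incomes in thousands, with one extreme value
incomes = [22, 25, 31, 38, 41, 47, 52, 500]

capped = cap_outliers(incomes, lower=20, upper=100)
binned = [bin_value(v, edges=[30, 50], labels=["low", "mid", "high"]) for v in incomes]
print(capped)
print(binned)
```

After binning, the extreme value simply lands in the top bin, so it no longer dominates the scale of the variable.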

What would you prefer as a measure of variability of your data: standard deviation or the interquartile range (IQR)?

The interquartile range (IQR) is the difference between the 75th percentile (Q3) and the 25th percentile (Q1), i.e. the box in a box plot. It is recommended when outliers are present and when calculating which percentile range of the data to drop or impute. If the given data set has no extreme variation, we can use the standard deviation, which measures the spread around the mean value.

lower (first) quartile Q1 = 7

median (second quartile) Q2 = 8.5

upper (third) quartile Q3 = 9

interquartile range, IQR = Q3 - Q1 = 2

lower 1.5*IQR whisker = Q1 - 1.5 * IQR = 7 - 3 = 4. (If there is no data point at 4, then the lowest point greater than 4.)

upper 1.5*IQR whisker = Q3 + 1.5 * IQR = 9 + 3 = 12. (If there is no data point at 12, then the highest point less than 12.)
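The whisker arithmetic above can be written as a short sketch. The quartiles Q1 = 7 and Q3 = 9 are taken from the example. The sample list is hypothetical and is used only to show how the whisker ends are picked; the quartiles are not computed from it. The function names are also assumptions for this illustration.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the lower and upper fences, k * IQR away from the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def whisker_ends(data, lower_fence, upper_fence):
    """Whiskers end at the most extreme data points that lie inside the fences."""
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return min(inside), max(inside)


lower, upper = tukey_fences(7, 9)  # quartiles from the example above

# Hypothetical sample, used only to illustrate where the whiskers stop
sample = [2, 5, 7, 7, 8, 9, 9, 9, 11, 15]
ends = whisker_ends(sample, lower, upper)
print(lower, upper, ends)
```

Points outside the fences, here 2 and 15, are the ones a box plot would draw as individual outliers.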

IQR is preferred for data containing extreme values.

If extreme values are not present, use the standard deviation, which is sensitive to them.
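To see the difference in sensitivity, the sketch below compares both measures on a made-up sample before and after adding one extreme value. It uses `statistics.quantiles` with its default method, so the quartiles may differ slightly from other tools. The data and the helper name `iqr` are assumptions for this example.

```python
from statistics import quantiles, stdev


def iqr(data):
    """Interquartile range using the default method of statistics.quantiles."""
    q1, _, q3 = quantiles(data, n=4)
    return q3 - q1


# Hypothetical sample, then the same sample with one extreme value added
clean = [6, 7, 7, 8, 8, 9, 9, 9, 10, 10]
with_outlier = clean + [100]

sd_ratio = stdev(with_outlier) / stdev(clean)
iqr_ratio = iqr(with_outlier) / iqr(clean)
print(round(sd_ratio, 1), round(iqr_ratio, 2))
```

On this sample a single extreme value inflates the standard deviation many times over, while the IQR barely moves.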

Ajay Maurya

Senior Software Engineer - Mobile Android Architect | Data Science Enthusiast