Data Exploration & Preparation | Data Science | Machine Learning
When to start Data Exploration
We start with the Hypothesis.
Data Exploration
Understand your data first!
1.Variable Identification
2.Univariate Analysis
3.Bivariate Analysis
1.Variable Identification
Given the business problem, identify the predictor (input) and target (output) variables, and the data type of each variable.
2.Univariate Analysis
Explore variables one by one. The method depends on the variable type (categorical or continuous). Univariate analysis is also used to highlight missing and outlier values.
Continuous variable :
1.Central tendency
2.Spread of variable
Categorical variable :
1.Frequency table
2.Bar chart
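As a minimal sketch of the univariate summaries listed above, the snippet below uses only Python's standard library. The `ages` and `cities` values are made up for illustration, and the text bar chart is just a stand-in for a real plot.

```python
import statistics
from collections import Counter

# Hypothetical sample data, invented purely for illustration.
ages = [23, 25, 31, 35, 35, 42, 58]
cities = ["Hyderabad", "Pune", "Hyderabad", "Delhi", "Hyderabad", "Pune", "Delhi"]

# Continuous variable: central tendency.
mean_age = statistics.mean(ages)
median_age = statistics.median(ages)

# Continuous variable: spread.
std_age = statistics.stdev(ages)              # sample standard deviation
q1, _, q3 = statistics.quantiles(ages, n=4)   # quartiles
iqr_age = q3 - q1                             # interquartile range
age_range = max(ages) - min(ages)

# Categorical variable: frequency table.
freq = Counter(cities)

# Categorical variable: a very simple text bar chart.
for city, count in freq.most_common():
    print(f"{city:<10} {'#' * count} ({count})")
```

The same numbers also help to spot missing values (a count lower than expected) and outliers (a maximum far from the median).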
3. Bivariate Analysis
To know the relationship between two variables
Continuous & Continuous: With Scatterplot
Correlation = Covariance(X,Y) / SQRT( Var(X)* Var(Y))
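The correlation formula above can be sketched in plain Python as follows. The function name `pearson_correlation` and the `hours` / `score` data are illustrative assumptions, not part of the original material.

```python
import math

def pearson_correlation(x, y):
    """Correlation = Covariance(X, Y) / sqrt(Var(X) * Var(Y))."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample covariance and variances; the (n - 1) factors cancel in the ratio.
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    return cov_xy / math.sqrt(var_x * var_y)

# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 70, 72]
r = pearson_correlation(hours, score)
```

The result lies between -1 (perfect negative linear relationship) and +1 (perfect positive linear relationship); a value near 0 means no linear relationship. The sketch does not handle a variable with zero variance.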
Categorical & Categorical: Chi-Square Test
To derive the statistical significance of the relationship between the categorical variables.
The test returns a probability (p-value) for the chi-square statistic:
•Probability of 0: It indicates that both categorical variables are dependent
•Probability of 1: It shows that both variables are independent
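A minimal sketch of the chi-square test for a contingency table is shown below. The function name and the 2x2 table are hypothetical. The closed-form p-value used here is valid only for 1 degree of freedom (a 2x2 table), and no continuity correction is applied.

```python
import math

def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical 2x2 table: rows = customer group, columns = bought / did not buy.
observed = [[30, 10],
            [20, 40]]
stat, dof = chi_square_statistic(observed)

# For 1 degree of freedom only, the p-value has a closed form via erfc.
p_value = math.erfc(math.sqrt(stat / 2))
```

A p-value close to 0 suggests the two variables are dependent; a p-value close to 1 is what you would expect from independent variables.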
Categorical & Continuous: Z-Test/ T-Test
To assess whether the means of two groups are statistically different from each other or not.
•If the probability of Z is small, then the difference between the two averages is more significant.
•The T-test is very similar to the Z-test, but it is used when the number of observations in both categories is less than 30.
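The Z-test described above can be sketched as follows. This is a two-sample version that uses the sample variances, with made-up groups of 40 observations each; the function name is an assumption. For smaller groups the same statistic would be compared against a t distribution instead of the normal distribution, which this sketch does not cover.

```python
import math
from statistics import NormalDist, mean, variance

def two_sample_z_test(group_a, group_b):
    """Return (z statistic, two-sided p-value) for the difference of two means."""
    n_a, n_b = len(group_a), len(group_b)
    # Standard error of the difference between the two sample means.
    se = math.sqrt(variance(group_a) / n_a + variance(group_b) / n_b)
    z = (mean(group_a) - mean(group_b)) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical continuous measurements for two categories (40 observations each).
group_a = [50 + (i % 7) for i in range(40)]
group_b = [53 + (i % 7) for i in range(40)]
z, p_value = two_sample_z_test(group_a, group_b)
```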
Data Preparation
How your data should be for modeling?
1.Missing values treatment
2.Outlier treatment
3.Variable transformation
4.Variable creation
1.Missing Values Treatment
How Will It Affect the Model?
Reasons for missing values :
Missing completely at random
Missing that depends on unobserved predictors
Handling missing values:
1.Deletion: If missing values make up less than about 5% of your data, you can delete the affected records
2.Mean/ Mode/ Median Imputation: Mean or Median (numeric attribute) or Mode (categorical attribute)
3.KNN Imputation: A missing value is filled in from the records that are most similar to the record with the missing value, where similarity is measured with a distance function. It is very time-consuming on a large database
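The imputation options above can be sketched with the standard library as below. Missing values are represented as `None`; the function names, the choice of Euclidean distance, and `k=2` are assumptions made for the example.

```python
import math
import statistics

def impute_simple(values, strategy="median"):
    """Replace None with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": statistics.mean,
            "median": statistics.median,
            "mode": statistics.mode}[strategy](observed)
    return [fill if v is None else v for v in values]

def impute_knn(rows, target, k=2):
    """Fill None in column `target` with the mean of the k nearest complete rows.

    Distance is Euclidean over the other columns, which must have no missing values.
    """
    others = [c for c in range(len(rows[0])) if c != target]
    complete = [r for r in rows if r[target] is not None]
    result = []
    for row in rows:
        filled = list(row)
        if row[target] is None:
            # Every incomplete row is compared with every complete row,
            # which is why this gets slow on large data.
            nearest = sorted(
                complete,
                key=lambda r: math.dist([r[c] for c in others],
                                        [row[c] for c in others]),
            )[:k]
            filled[target] = statistics.mean(r[target] for r in nearest)
        result.append(filled)
    return result

# Hypothetical data.
incomes = impute_simple([30, None, 50, 40, None], strategy="median")
colors = impute_simple(["red", None, "blue", "red"], strategy="mode")
people = impute_knn([[25, 30.0], [27, 32.0], [52, 80.0], [26, None]], target=1)
```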
Imputation of Missing Data, How?
2.Outlier Treatment
Outlier is an observation that appears far away and diverges from an overall pattern in a sample.
•It increases the error variance and reduces the power of statistical tests
•If the outliers are non-randomly distributed, they can decrease normality
How to detect Outliers?
Visualization is the most common way to identify outliers, for example with a box plot, histogram or scatter plot.
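When a plot is not at hand, the rule that a box plot commonly uses to draw its whiskers can be applied directly. The sketch below flags values more than 1.5 times the interquartile range outside the quartiles; the 1.5 multiplier is a widely used convention and the `salaries` data is made up.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical salaries (in thousands) with one extreme value.
salaries = [32, 35, 36, 38, 40, 41, 43, 45, 250]
outliers = iqr_outliers(salaries)
```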
How to remove Outliers?
1.Deleting observations: If there are only a few outlier observations, delete them
2.Treat separately: If there is a significant number of outliers, treat them as a separate group and analyze them separately
3.Transforming: This is done to reduce the variation caused by extreme values
4.Binning values: To group a number of more or less continuous values into a smaller number of “bins”
Example: if you have data about a group of people, you might want to arrange their ages into a smaller number of age intervals
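Two of the treatments above, transforming and binning, can be sketched as follows. The age intervals, labels, and data are arbitrary choices made for the example, and the natural log is only one possible transformation.

```python
import math
from bisect import bisect_right

# Binning: group ages into a small number of intervals.
# Edges and labels are assumptions chosen for illustration.
edges = [18, 40, 65]
labels = ["0-17", "18-39", "40-64", "65+"]

def age_bin(age):
    """Return the label of the interval that contains `age`."""
    return labels[bisect_right(edges, age)]

ages = [3, 17, 25, 34, 41, 58, 67, 102]
binned = [age_bin(a) for a in ages]

# Transforming: a log transform pulls extreme values closer to the rest.
incomes = [20, 25, 30, 35, 1000]
log_incomes = [math.log(v) for v in incomes]
```

After binning, the extreme age of 102 simply falls into the "65+" group, and after the log transform the gap between the largest and smallest income shrinks from 980 to about 3.9.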
3.Feature Engineering
Predictive power of several models improves significantly with the application of feature engineering.
What is Feature Engineering?
Extracting more information from the existing data by making the data we already have more useful
Business Problem: Footfall reduction in a shopping mall?
Process of Feature Engineering
•Variable transformation
•Variable / Feature creation
*Spend most of your time on these processes*
Variable transformation
Why Transform Your Data?
Some machine learning algorithms require the data to be in a specific form.
Transformations Varieties:
How to transform?
Data Pre-Processing Methods
Which algorithms need data transformation:
Instance-based methods (KNN, LVQ), along with support vector machines and neural networks: work better with scaled data
Regression methods: can work better with standardized data
Tree and rule-based methods: transformation is not so useful
1.Scale Data
The scale transform calculates the standard deviation for an attribute and divides each value by that standard deviation.
On its own, scaling does not center the data. The values become z-scores (a method of standardization) only when scaling is combined with centering.
2.Center Data
The center transform calculates the mean for an attribute and subtracts it from each value.
- Find the mean of Sepal length
- Subtract that mean from each record of Sepal length
3.Standardize Data
Combining the scale and center transforms will standardize your data. Attributes will have a mean value of 0 and a standard deviation of 1.
Each value then tells you how many standard deviations it lies from the average.
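The scale, center, and standardize transforms can be sketched in a few lines of plain Python. The function names mirror the transform names used here, and the sepal length values are a small illustrative sample, not a full dataset.

```python
import statistics

def center(values):
    """Subtract the mean from each value."""
    m = statistics.mean(values)
    return [v - m for v in values]

def scale(values):
    """Divide each value by the (sample) standard deviation."""
    s = statistics.stdev(values)
    return [v / s for v in values]

def standardize(values):
    """Center, then scale: the result has mean 0 and standard deviation 1."""
    return scale(center(values))

# A few illustrative sepal length values.
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0]
z_scores = standardize(sepal_length)
```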
4.Normalize Data
Data values can be scaled into the range 0 to 1
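A minimal sketch of this normalization, assuming the usual min-max formula (value minus minimum, divided by the range):

```python
def normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant attribute")
    return [(v - lo) / (hi - lo) for v in values]

normalized = normalize([10, 20, 30])
```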
Power Transforms
1. Box-Cox Transform
When it is used:
When the data is skewed and you want to bring it closer to a Normal distribution
Note: It assumes all values are positive
2. Yeo-Johnson Transform
Similar to the Box-Cox Transform, but it also accepts zero and negative values
Important points to keep in mind
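The two power transforms can be written out directly from their standard definitions. In this sketch the parameter `lam` (lambda) is supplied by hand; in practice it is normally estimated from the data, which is not shown here.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a single positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

def yeo_johnson(x, lam):
    """Yeo-Johnson transform; also defined for zero and negative values."""
    if x >= 0:
        if lam == 0:
            return math.log1p(x)
        return ((x + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log1p(-x)
    return -((1 - x) ** (2 - lam) - 1) / (2 - lam)

# Hypothetical right-skewed data, transformed with lambda = 0 (log).
skewed = [1, 2, 2, 3, 4, 50]
transformed = [box_cox(v, 0) for v in skewed]
```

With `lam = 1` both transforms leave the shape of the data unchanged, which makes a handy sanity check.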
1. Don’t skip over this step as it has a huge impact on the accuracy of your final models.
2.First view your data with a summary function such as summary()
3. It is a good idea to visualize the distribution of your data
Variable Creation
Creation of new variables based on existing variable(s) that may have a better relationship with the target variable.
Creating dummy variables:
Used for Regression Analysis with Categorical Variable
Convert categorical variables into numerical variables while giving equal importance to each category.
For a categorical variable with n classes (more than two), create n or n-1 dummy variables.
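A minimal sketch of dummy variable creation, assuming a hypothetical helper `make_dummies`. With `drop_first=True` it produces n-1 columns, and the dropped category becomes the baseline.

```python
def make_dummies(values, drop_first=False):
    """Return a dict of 0/1 columns, one per category (or n-1 if drop_first)."""
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]   # the first category becomes the baseline
    return {f"is_{c}": [1 if v == c else 0 for v in values] for c in categories}

# Hypothetical categorical variable with three classes.
cities = ["Delhi", "Hyderabad", "Pune", "Delhi"]
n_dummies = make_dummies(cities)                          # n columns
n_minus_1_dummies = make_dummies(cities, drop_first=True)  # n-1 columns
```

The n-1 form is the usual choice for regression with an intercept, because the n-th column can be derived from the others.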
Extract maximum information out of your data
1.Develop variables for the difference in date, time and addresses :
People normally use date and time values as they are, but is that enough?
Kukatpally phase 9 HIG 126 Near Park 21–10–1988 (before)
|Kukatpally |phase 9 | HIG 126 | Near Park |21– | 10– | 1988 | (after)
Separate them and make more variables and then you can find out influential factors from them
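The split shown above can be sketched like this. The sketch assumes the raw record is comma-separated and the date uses plain hyphens in day-month-year order, which the original example does not state; the feature names and the reference date are also made up.

```python
from datetime import date, datetime

# Hypothetical raw record; the comma delimiter is an assumption.
record = "Kukatpally, Phase 9, HIG 126, Near Park, 21-10-1988"

locality, phase, house, landmark, raw_date = [p.strip() for p in record.split(",")]
parsed = datetime.strptime(raw_date, "%d-%m-%Y").date()

# Fixed reference date so the example is reproducible.
reference = date(2020, 1, 1)

features = {
    "locality": locality,
    "phase": phase,
    "house": house,
    "landmark": landmark,
    "day": parsed.day,
    "month": parsed.month,
    "year": parsed.year,
    # Difference in dates is often more informative than the date itself.
    "days_to_reference": (reference - parsed).days,
}
```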
2.Create new ratios and proportions :
Credit card sales / Number of salespersons
Use this ratio instead of the absolute number of cards sold in the branch
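As a small illustration of why the ratio can be more informative than the absolute number, consider two hypothetical branches (all figures invented):

```python
# Hypothetical branch data.
branches = [
    {"branch": "A", "cards_sold": 120, "salespersons": 10},
    {"branch": "B", "cards_sold": 90, "salespersons": 5},
]

# New variable: cards sold per salesperson.
for b in branches:
    b["cards_per_salesperson"] = b["cards_sold"] / b["salespersons"]

best = max(branches, key=lambda b: b["cards_per_salesperson"])
```

Branch A sells more cards in total, but branch B sells more per salesperson, which the absolute number alone would hide.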
3. Standard transformations should be applied:
4. Check variables for seasonality and create the model for the right period
Businesses face seasonality driven by tax, festivals or weather
In this case, data variables are chosen for the right period.
Other General Mandatory Data preparation
•Convert a variable to a different data type
•Transpose a table
•Sort Data
•Merge data
•Subset data
•Delete data
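These general operations can be sketched on plain Python lists and dicts as below. The `customers` and `orders` data are invented for the example; on real projects a data-frame library would normally do the same work.

```python
# Hypothetical data: customer records and order totals keyed by customer id.
customers = [
    {"id": 2, "name": "Ravi", "age": "34"},
    {"id": 1, "name": "Anu", "age": "28"},
    {"id": 3, "name": "Sam", "age": "45"},
]
orders = {1: 250.0, 3: 120.0}

# Convert a variable to a different data type (string -> int).
for c in customers:
    c["age"] = int(c["age"])

# Sort data.
by_age = sorted(customers, key=lambda c: c["age"])

# Merge data (keeps every customer; missing orders become None).
merged = [{**c, "order_total": orders.get(c["id"])} for c in customers]

# Subset data.
over_30 = [c for c in merged if c["age"] > 30]

# Delete data (drop a column).
without_name = [{k: v for k, v in c.items() if k != "name"} for c in merged]

# Transpose a table (rows become columns).
table = [[1, 2, 3], [4, 5, 6]]
transposed = [list(col) for col in zip(*table)]
```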