11 Important Analytical Steps for your Data Science project

Mehul Gupta
Data Science in your pocket
5 min read · Jun 25, 2019


Image courtesy: Audiencetools

Data Analysis is a crucial part of the Data Science domain. Often beginners in the field start off by applying fancy algorithms without any preprocessing and hence don’t get the expected results. It is important to know that building an ML model is not the first step of any Data Science problem.

In this article, I will highlight some important steps you should apply before feeding your data to an ML model.

1. KNOW YOUR DATA: Before you apply any processing steps or build any ML model on your data, the first thing to do is understand what type of data you have. This can be done simply using info() & describe() (in Python).
    - info() highlights datatypes & null values
    - describe() shows summary statistics like min, max, std, etc. for all columns

This helps us know which fields need filling, which fields are textual & hence must be converted to numeric form, and gives a look at the distributions to decide whether scaling is needed.
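
A minimal sketch with pandas (the toy columns here are made up, standing in for your real dataset):

```python
import pandas as pd

# Toy data standing in for your real dataset
df = pd.DataFrame({
    "age": [22, 35, None, 41],
    "city": ["A", "B", "B", None],
    "income": [40000, 52000, 48000, 61000],
})

df.info()             # datatypes & non-null counts per column
print(df.describe())  # count, mean, std, min, max, quartiles for numeric columns
```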

2. FILL MISSING VALUES: You can fill missing values in many ways, e.g. with NA, the mean, the mode, or the median, or even drop them (as mentioned earlier). But do remember to use the mode/median to fill a categorical field (converted to numeric), as the mean may give you decimal values that don't represent any category (for label encoding).

Example -> Let there be 2 categories, 'A' & 'B', which LabelEncoding converts to 1 & 2. The mean is then 1.5, which doesn't represent any category. The same problem won't arise with OHE.
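
A small illustration with pandas (the toy columns are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "income": [40000, 52000, None, 61000],  # numeric field
    "city": ["A", "B", None, "A"],          # categorical field
})

# Numeric field: the mean (or median) is a reasonable fill
df["income"] = df["income"].fillna(df["income"].mean())

# Categorical field: use the mode, never the mean
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```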

3. CHECK FOR OUTLIERS: You can use the z-score for this purpose: any data point with a z-score below -3 or above 3 is an outlier. Data visualization can also be handy here.
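
A quick sketch of z-score-based outlier detection with scipy (synthetic data, purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = np.append(rng.normal(50, 5, 100), [120, -40])  # two injected outliers

z = stats.zscore(data)
outliers = data[np.abs(z) > 3]
print(outliers)  # recovers the injected values 120 and -40
```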

You can refer to sections 2 and 3 of the following article, which cover missing value and outlier treatment. It discusses the different reasons we have missing values in data and different methods to fill them; the next section covers how outliers affect a dataset and methods to eliminate them — Guide to Data Exploration

4. DROP UNIMPORTANT COLUMNS: If you have a constant field (a column with only one value) or a column with a large share of missing values (about 70% or more), such variables should be removed from the data: they add no useful information to the model but increase the dimensionality of the dataset.
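
A sketch with pandas (the toy DataFrame is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "constant": [1, 1, 1, 1],                  # only one value
    "mostly_na": [np.nan, np.nan, np.nan, 5],  # 75% missing
    "useful": [3, 1, 4, 1],
})

# Columns with a single unique value
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
# Columns with more than 70% missing values
sparse_cols = df.columns[df.isnull().mean() > 0.70].tolist()

df = df.drop(columns=constant_cols + sparse_cols)
print(df.columns.tolist())  # ['useful']
```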

5. CONVERT STRINGS TO NUMERICAL VALUES: Most machine learning algorithms cannot deal with strings, so we need to convert strings to numbers using OneHotEncoding or LabelEncoding. OneHotEncoding is used when the strings have no order between them, while LabelEncoding is used when they do.

Example -> Suppose a field has 25 categories. With LabelEncoding, let 'A' be converted to 1 & 'B' to 2; following the same pattern, 'Y' would be converted to 25. Hence 'A' appears closer to 'B' than to 'Y', but this isn't actually the case: all categories are equally different. OHE eliminates this problem but increases the number of dimensions.

Also, perform this step before filling NAs, as you might have to find some logic to fill textual data with another textual value (you can fill with 'NA', but if converted to numeric after filling, it would be treated as just another category). It is easy to fill NA values using built-in functions like mean, median, or mode when data is numeric, but this won't apply to textual data. Hence "first convert, then fill" would be my suggestion.
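
A minimal sketch of both encodings (the columns and the ordinal mapping are made up; pd.get_dummies is one simple way to one-hot encode):

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["S", "M", "L", "M"],      # has a natural order
    "city": ["NY", "LA", "NY", "SF"],  # no order between values
})

# Label (ordinal) encoding via an explicit mapping that respects the order
df["size"] = df["size"].map({"S": 0, "M": 1, "L": 2})

# One-hot encoding for the unordered field
df = pd.get_dummies(df, columns=["city"])
print(df)
```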

I encourage you to check out this very comprehensive article that talks about one-hot encoding and label encoding in much detail — Simple Methods to deal with Categorical Variables in Predictive Modeling.

6. FEATURE ENGINEERING: This refers to building new features from existing ones. But do remember that a new feature shouldn't have a high correlation with the other existing features.

Example -> total_area = area_1st_floor + area_ground_floor, where total_area is a new feature while the other two already exist in the dataset.
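
A sketch of this exact example (column names and values assumed):

```python
import pandas as pd

df = pd.DataFrame({
    "area_ground_floor": [55, 80, 64],
    "area_1st_floor": [40, 75, 50],
})

# New feature built from existing ones
df["total_area"] = df["area_ground_floor"] + df["area_1st_floor"]

# Check how strongly the new feature correlates with its parents
print(df.corr())
```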

Section three of this article introduces you to the art of feature engineering and some basic methods to perform it — Guide to Data Exploration. Also, there is a very popular library for automated feature engineering, called AutoML, and the following article provides a basic introduction to it: Automated Feature Engineering using AutoML

7. SKEWNESS/KURTOSIS: For regression problems, do check the target for skewness/kurtosis, and if found guilty, apply a log transform or Box-Cox transform. This matters because many models assume the data follows a normal distribution.
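
A sketch with scipy on a synthetic right-skewed target (note that Box-Cox requires strictly positive values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
target = rng.lognormal(mean=3, sigma=0.8, size=1000)  # right-skewed target

print("skew before:", stats.skew(target))

# Log transform (log1p is safe near zero)
print("skew after log:", stats.skew(np.log1p(target)))

# Box-Cox finds the best power transform; values must be > 0
transformed, lam = stats.boxcox(target)
print("skew after Box-Cox:", stats.skew(transformed))
```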

8. SCALING: Scaling is sometimes necessary because without it, some features dominate others, e.g. 'Age' & 'Income'. Though both may matter equally, the difference in scale can make income dominant (in some ML models only). For this, RobustScaler, StandardScaler & MinMaxScaler are available. RobustScaler is robust to outliers & hence my first choice.
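
A sketch with scikit-learn (toy Age/Income values):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler  # StandardScaler / MinMaxScaler also available

X = np.array([[25, 30000],
              [32, 120000],
              [47, 55000],
              [51, 700000]])  # Age & Income on very different scales

# RobustScaler centres on the median and scales by the IQR,
# so the 700000 outlier distorts the result less
print(RobustScaler().fit_transform(X))
```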

9. FEATURE CORRELATION: If any two features are found to be correlated, either both can be retained or one of them can be dropped (check the results on training data for both cases).
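
A sketch of spotting a correlated pair (toy data; the 0.9 threshold is an arbitrary choice):

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [170, 180, 165, 175],
    "height_in": [66.9, 70.9, 65.0, 68.9],  # near-duplicate of height_cm
    "weight_kg": [65, 66, 80, 58],
})

corr = df.corr().abs()
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and corr.loc[a, b] > 0.9]
print(high)  # [('height_cm', 'height_in')]
```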

10. DIMENSIONALITY REDUCTION: If the number of features exceeds the number of data rows, you might need PCA for dimensionality reduction (e.g. reducing the number of features from 100 to 10 without any major information loss) to avoid the CURSE OF DIMENSIONALITY.
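
A sketch with scikit-learn's PCA on random data, just to show the mechanics (50 rows, 100 features, reduced to 10 components):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100))  # more features than rows

pca = PCA(n_components=10)  # 100 features -> 10 components
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
print("variance kept:", pca.explained_variance_ratio_.sum())
```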

11. CHECK FOR IMBALANCE IN TARGET (the target field): this means one target value is in a heavy majority over the others.

Example -> If the target has Yes & No, with Yes at about 95% and No at only 5%.

This can be resolved using upsampling (replicating 'No' rows) or downsampling (removing 'Yes' rows) in the training dataset.
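
A sketch of upsampling with sklearn.utils.resample (toy 95/5 target, as in the example above):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(20),
                   "target": ["Yes"] * 19 + ["No"]})  # ~95% / 5%

majority = df[df["target"] == "Yes"]
minority = df[df["target"] == "No"]

# Replicate minority rows (with replacement) up to the majority count
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)

balanced = pd.concat([majority, minority_up])
print(balanced["target"].value_counts())
```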

Apart from this, using data visualization alongside these steps will help you get better results. Also, do split your data for validation purposes, using the 80:20 rule. And don't forget to apply all changes to both the training and testing datasets.
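
A sketch of the 80:20 split with scikit-learn (toy arrays):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# 80:20 split; fit preprocessing on the training part only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```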

These were some of the basic analytical steps that might get you better results in a classification or regression problem. Problems related to NLP, Time Series, etc. may require some more steps, but the basic approach remains the same. Always remember, there are no hard-and-fast rules in Data Science (you might get the best results without any of the above steps) & hence

Explore more, Learn more!!!
