Feature Engineering in ML - Part 1

Sameer Kumar · Analytics Vidhya · Sep 18, 2020

Introduction

Feature Engineering is considered to be the most important step in the life cycle of any Data Science project. This part of the project ultimately decides the fate of your machine learning or deep learning model, as the accuracy of your model is heavily dependent on what you do during feature engineering. In fact, most practitioners spend the majority of their time engineering features. But what exactly is Feature Engineering, and why is it important?

Let us first understand the basic meaning of a feature and how it is related to our ML model.

Feature

A feature is nothing but an attribute or a piece of information which has an impact on the output of our model.

Suppose I have a data set consisting of the height and weight of 50 people, and I have to create an ML model which determines whether a person is obese or not. The columns height, weight and obesity are the features in this example.

Height and weight are the two parameters which decide whether the person is obese or not, so they are called independent features, and obesity is a dependent feature because its value depends on height and weight.
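To make this concrete, here is a minimal sketch of what such a dataset might look like in pandas; the values are purely hypothetical:

```python
import pandas as pd

# Hypothetical toy data: height (cm) and weight (kg) are the independent
# features, and obese (1 = yes, 0 = no) is the dependent feature.
df = pd.DataFrame({
    "height": [170, 155, 180, 162, 175],
    "weight": [95, 52, 110, 70, 68],
    "obese":  [1, 0, 1, 0, 0],
})

X = df[["height", "weight"]]  # independent features
y = df["obese"]               # dependent feature (target)
print(X.shape, y.shape)
```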

A feature can be basically anything, depending on the context, and not just numbers:

  1. Numbers
  2. Category (e.g. gender)
  3. Text
  4. Images
  5. Audio and video

With this basic understanding of what a feature is, let us now look at the concept of Feature Engineering.

Feature Engineering

Feature Engineering is the process of converting the raw form of data into a more suitable format or set of features, so that our algorithm can understand the data and predict patterns in unseen data.

Let me explain this through a basic real-life example:

Imagine you are invited to your aunt’s house for a party and there you see a four-year-old kid sitting with a picture book in his hand. Your aunt hands you a book of pictures and numbers and asks you to teach the kid about pictures of fruits. So, how do you approach him?

You start teaching him from a basic level and you present your information (data) about fruits in a way that he understands. You show him various examples until he learns and is able to predict the answer for unseen fruits.

This is what happens in Feature Engineering: you keep extracting information from the data and check what kind of representation improves your model’s accuracy.

The data that we obtain from third-party sources is unclean and in raw form. Our job is to perform operations on such data and make it more presentable.

The success of all ML models depends on the way we present our data. The more information we are able to extract from the data, the better our model will perform.

Importance

If we have performed the feature engineering well, then even weaker or less complex models can get us good results. This gives us the kind of flexibility we want while selecting the model and its parameters.

Feature Engineering steps

Following are the six steps of data pre-processing, i.e. the way we handle the raw data and convert it into a more suitable format. This is where we decide what kind of input we will provide to our model. Let us understand each one of them in detail.

  1. Handling Missing values in the data
  2. Encoding the categorical(nominal and ordinal) data
  3. Removal of outliers
  4. Feature Selection
  5. Feature Scaling
  6. Transformation of Variables

Handling missing values in data

Missing data

Missing data is one of the most common problems we see in a data set. Our first job is to check whether the data set has null values or not. We use the isnull() function from pandas to check for missing values.
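As a quick sketch, assuming the familiar Titanic training data is available locally as train.csv (the dataset the counts below refer to), the check could look like this:

```python
import pandas as pd

# Assumes the Titanic training data is available locally as train.csv
df = pd.read_csv("train.csv")

# Number of missing values in each column
print(df.isnull().sum())

# The same information as a fraction of the rows, sorted
print(df.isnull().mean().sort_values(ascending=False))
```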

We can infer that the Age column has 177 missing values, the Embarked column has 2 missing values, and the Cabin column has 687 missing values. Missing values usually arise from incorrectly filled surveys or incorrect information.

Types of Missing value

  1. Missing Completely at Random (MCAR): A variable is MCAR if the probability of being missing is the same for all observations. There is absolutely no relationship between the missing data and the observed values.
  2. Missing Not at Random (MNAR): A variable is MNAR if the probability of being missing depends on the value that is missing itself, i.e. on information that was never observed.

So, based on the type of missing value, we select techniques to handle it. There is no single fixed formula for handling missing values, so I have listed two of the methods that I follow:

  1. Mean/median imputation

In this method we replace the missing values of a particular column with the mean or median of the observed values of that column.

[Figure: median imputation]

Here I created a separate column for the median-imputed values and then dropped the original Age column, replacing it with the Age_median column, which has no nulls.
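A minimal sketch of this step, again assuming the Titanic data in train.csv, might look like:

```python
import pandas as pd

# Assumes the Titanic training data is available locally as train.csv
df = pd.read_csv("train.csv")

# Replace missing Age values with the median of the observed ages
median_age = df["Age"].median()
df["Age_median"] = df["Age"].fillna(median_age)

# Drop the original column once the imputed one is in place
df = df.drop(columns=["Age"])
print(df["Age_median"].isnull().sum())  # 0 — no missing values remain
```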

Disadvantage of mean/median imputation

One of the major problems with this method is that imputing the median Age changes or distorts the original variance. The image below shows the decrease in standard deviation. This method also fails to preserve the relationships between variables, i.e. it affects the correlation.

So there is a second method to handle missing values which overcomes the above problem: Random Sample Imputation.

[Figure: the difference in standard deviation]

2. Random Sample Imputation

This method is much like median imputation, but it fills the null values with random observations drawn from the non-null part of the column.

For that, we first apply the dropna() function to get the observed (non-null) values and then use the sample() function, drawing as many random observations as there are nulls, to fill them in.
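Here is a sketch of random sample imputation on the same assumed train.csv; the Age_random column name is just illustrative:

```python
import pandas as pd

# Assumes the Titanic training data is available locally as train.csv
df = pd.read_csv("train.csv")

# Start from a copy of the original column, NaNs included
df["Age_random"] = df["Age"]

# Draw as many random non-null observations as there are missing values
n_missing = df["Age"].isnull().sum()
samples = df["Age"].dropna().sample(n_missing, random_state=0)

# Align the sampled values with the index positions of the missing rows,
# then fill only those rows
samples.index = df[df["Age"].isnull()].index
df.loc[df["Age"].isnull(), "Age_random"] = samples

# The standard deviation is preserved much better than with median imputation
print(df["Age"].std(), df["Age_random"].std())
```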

[Figure: random sample imputation]
[Figure: standard deviation for random sample imputation]

From the image above we can see that the disadvantage we faced in median imputation has been overcome in this method.

This shows that random sample imputation is preferable to median imputation, as it does not distort the variance nearly as much.

After dealing with missing values, the next important step is to deal with categorical data.

2) Encoding the categorical data:

The thing with categorical data is that the algorithm does not understand categories, so we convert the categorical feature into 1s and 0s, called dummy variables. This process of converting a category into a numerical representation is called One Hot Encoding.

Based on the number of categories in a feature, that many columns are created, but only n-1 of them are kept, since the dropped column can be fully represented by the remaining ones. Keeping all n columns leads to what is called the Dummy Variable Trap.

One hot encoding is only performed when the categories do not follow any particular order or rank (e.g. gender). Categorical variables which do not follow any rank are called nominal variables.

On the other hand, when categories do follow a particular order or rank, they are called ordinal variables. For ordinal variables we perform label encoding, where we assign ranks, instead of one hot encoding.
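As an illustration, here is a small sketch using pandas' get_dummies for a nominal column of the assumed train.csv, plus a simple mapping for a hypothetical ordinal "size" column; scikit-learn's encoders would work just as well:

```python
import pandas as pd

# Assumes the Titanic training data is available locally as train.csv
df = pd.read_csv("train.csv")

# One hot encoding for a nominal feature: drop_first=True keeps n-1 dummy
# columns and avoids the dummy variable trap.
sex_dummies = pd.get_dummies(df["Sex"], prefix="Sex", drop_first=True)
df = pd.concat([df.drop(columns=["Sex"]), sex_dummies], axis=1)
print(df.filter(like="Sex").head())

# Label encoding for a hypothetical ordinal feature with a natural rank
size_order = {"small": 0, "medium": 1, "large": 2}
sizes = pd.Series(["small", "large", "medium", "small"])
print(sizes.map(size_order))
```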

This was all about categorical data; now let us move on to the next important step, Feature Scaling.

Feature Scaling

Feature scaling is the process of bringing the values of the features onto a similar scale, which makes calculations easier and training faster.

Let us understand why feature scaling is required and how the process is carried out in detail.

In a dataset we have multiple features like f1, f2, f3, …, fn, and each feature has two properties:

  1. Magnitude
  2. Unit

The magnitude is the value of that particular feature, and the unit is the dimension in which it is measured.

Now, different features have different magnitudes and different units, so while calculating the Euclidean distance between points, the features with larger magnitudes dominate the distance and the computation eventually takes more time.

Consider an example where I have height and weight as my two features and there is a huge difference in their magnitudes. In such cases we scale the values down to a similar range so that no single feature dominates the Euclidean distance.

There are two techniques for scaling down values (both are sketched in code after this list):

  1. Normalization (min-max scaler): in this technique we rescale each feature to the same range, between 0 and 1.
  2. Standardization (standard scaler): this technique rescales the values of each feature so that they follow a standard normal distribution, i.e. the transformed data has a mean of 0 and a standard deviation of 1.
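Here is the minimal sketch promised above, using scikit-learn's MinMaxScaler and StandardScaler on hypothetical height/weight values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data: height in metres (small magnitudes) and weight in kg (large magnitudes)
X = np.array([[1.60,  55.0],
              [1.75,  90.0],
              [1.82, 110.0]])

# Normalization: each feature is rescaled to the [0, 1] range
print(MinMaxScaler().fit_transform(X))

# Standardization: each feature is rescaled to mean 0 and standard deviation 1
print(StandardScaler().fit_transform(X))
```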

Where is Feature Scaling used?

Feature Scaling is used in algorithms where the Euclidean distance between points is calculated, such as KNN, and it also helps gradient-based models like linear and logistic regression converge faster.

[Figure: Euclidean distance]
[Figure: Standard Scaler]

That was all about feature scaling, where it is used, and why it is necessary.

Now let us discuss the next feature engineering technique, known as Feature Selection.

Feature Selection

The process of selecting the best subset of relevant features, the ones which have a direct impact on the dependent variable, is called Feature Selection.

Having a very large number of features in a dataset can lead to something called the curse of dimensionality.

Curse of Dimensionality: it is an observed phenomenon that as we increase the number of features in the dataset, the accuracy also increases, since the model has more information to deduce the output, but this is not always the case.

There exists a threshold after which, if we keep adding features, the accuracy decreases.

Why does this happen?

Up to the threshold, the model is able to learn more and more information from the data, but when the number of features keeps growing beyond it, the model gets confused because we are feeding it too much.

That is why we select only the few features which have an impact on the output and ignore the irrelevant ones.

How do we select those relevant features?

There are various methods for making the right selection, but here I will discuss two main methods (a code sketch of both follows the list):

  1. Forward Selection: Forward Selection is an iterative method in which we start with no features in the model. In each iteration we add the feature which improves the model's performance the most, until adding a new feature no longer improves it.
  2. Backward Elimination: in this method we start with all the features and remove the least significant feature at each iteration, as long as doing so improves the performance of the model. We repeat this until no improvement is observed upon removing a feature.
[Figure: forward selection]
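As a sketch of both approaches, scikit-learn's SequentialFeatureSelector supports forward and backward directions; the estimator and the public dataset below are just stand-ins for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A small public dataset stands in for whatever data you are working with
X, y = load_breast_cancer(return_X_y=True)
estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Forward selection: start with no features and greedily add the one that
# improves cross-validated performance the most at each step.
forward = SequentialFeatureSelector(estimator, n_features_to_select=5,
                                    direction="forward")
forward.fit(X, y)
print(forward.get_support())  # boolean mask of the selected features

# Backward elimination: start with all features and drop the least useful
# one at each step.
backward = SequentialFeatureSelector(estimator, n_features_to_select=5,
                                     direction="backward")
backward.fit(X, y)
print(backward.get_support())
```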

This method is not really suitable for huge datasets, as it takes a lot of time, so there is another method which I use most frequently: Feature Importance.

A feature importance plot assigns a score to each feature, and based on those scores we can drop the features which do not have much impact on the output variable.
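One common way to obtain such scores, sketched here with a tree ensemble from scikit-learn on a stand-in dataset, is to read off the fitted model's feature_importances_:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier

# A small public dataset stands in for whatever data you are working with
data = load_breast_cancer()

# Fit a tree ensemble and read off its impurity-based importance scores
model = ExtraTreesClassifier(n_estimators=200, random_state=0)
model.fit(data.data, data.target)

importances = pd.Series(model.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head(10))
```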

Advantages of Feature Selection

  1. Reduces overfitting
  2. Improves accuracy, since there is less misleading data, and reduces the complexity of the model.

With that, let us wrap things up.

Conclusion

There are one or two other steps in feature engineering, such as handling outliers and Gaussian transformation, which I will discuss in my next article.

So these were some of the most important steps followed while engineering features, which ultimately improves the accuracy of our model!

You can connect with me on my LinkedIn profile to see more exciting projects.

https://www.linkedin.com/in/sameer-kumar-20988b1a6?lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_contact_details%3BAgbG55feQWmzROuOlvderQ%3D%3D
