Intro to Feature Engineering and Ensembling Techniques

Utkarsh I · Published in Nybles · Aug 21, 2021 · 7 min read

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. It is at the heart of training machine learning models and getting the highest possible precision and accuracy.

Here are some of the common techniques.

1. Handling NaN values

Whenever we get a dataset, there is a chance it contains some NaN values, and these create problems for our model. The most common way of handling NaN values is to drop the affected rows; this is efficient when there are few NaN values and the dataset is large. But if there are few data points, e.g. around 1000, then dropping rows is not a good idea; instead, we can fill each missing cell with the average value of the corresponding column.
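A minimal pandas sketch of both options on toy data (the column names here are invented just for the example):

```python
import pandas as pd
import numpy as np

# Toy data with missing values (stand-in for any real dataset)
df = pd.DataFrame({"sqft": [1050, 1500, np.nan, 2100],
                   "price": [45.0, np.nan, 62.5, 95.0]})

# Option 1: drop every row that contains a NaN (fine when the dataset is large)
df_dropped = df.dropna()

# Option 2: fill NaNs with the mean of the corresponding column (better for small datasets)
df_filled = df.fillna(df.mean())

print(df_dropped)
print(df_filled)
```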

2. Removing outliers

Outliers are a curse for our data: a single outlier can affect our model to a great extent (in terms of accuracy) and can also increase the chance of overfitting.

We can identify them by drawing a scatter plot, which gives a better visualization of the data, and then apply certain conditions to remove those points.

Apart from sketching the graph, a more systematic way is to use the normal distribution: the z-score indicates how many standard deviations a data point lies from the sample’s mean, and any data point beyond roughly 2.5 to 3 standard deviations is considered an outlier.
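For instance, a minimal sketch of z-score based filtering (the 3-standard-deviation threshold and the toy data are purely illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 200 roughly normal values plus one extreme outlier
sqft = np.append(rng.normal(loc=1200, scale=200, size=200), 15000)
df = pd.DataFrame({"sqft": sqft})

# z-score: how many standard deviations each point lies from the mean
z = (df["sqft"] - df["sqft"].mean()) / df["sqft"].std()

# Keep only the points within 3 standard deviations
df_clean = df[np.abs(z) < 3]
print(len(df), "rows ->", len(df_clean), "rows after removing outliers")
```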

3. Data visualization

Plotting data on the graph could be very helpful when we are dealing with a huge chunk of data. There are many types of graphs that we could draw like a bar chart, scatter plot, histogram, violin plot, etc.

Some commonly used plots are-

  • Line plot

This is the plot you can see in the nooks and corners of almost any analysis of two variables. A line plot is nothing but a series of data points connected with straight lines.

  • Count plot and Word cloud

As the name suggests, this plot gives the count of different classes of any numerical or categorical data.

Similar to this is the word cloud, which gives a visualization of the words in a dataset. For example, if we have a paragraph, the word cloud will render its words so that a word with higher frequency appears bigger, whereas a word with lower frequency appears smaller.

  • Scatter plot

This plot gives us a representation of where each point in the dataset lies with respect to any two or three features (columns). It is widely used, most commonly to identify outliers or when we are dealing with clustering problems or a kNN learning algorithm.

A simple code snippet of the histogram chart.
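A rough sketch of such a snippet; the file name and the total_sqft column are assumptions about the Kaggle dataset linked at the end:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Bengaluru_House_Data.csv")  # assumed file name for the linked dataset
sqft = pd.to_numeric(df["total_sqft"], errors="coerce").dropna()

# Histogram of house areas between 500 and 6000 sqft
plt.hist(sqft[(sqft >= 500) & (sqft <= 6000)], bins=50)
plt.xlabel("Total square feet")
plt.ylabel("Number of houses")
plt.title("Distribution of house area (sqft)")
plt.show()
```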

This graph shows the sqft area of houses ranging from 500–6000.

4. Scaling

Sometimes data comes with a large variation in range: some features could be in decimals while others could range in the millions, so to bring all the features onto a comparable scale we perform scaling.

Basically, there are two techniques:

Normalization-

Normalization (or min-max normalization) scales all values into a fixed range between 0 and 1. This transformation does not change the shape of the feature’s distribution, but because the standard deviation decreases, the effect of outliers increases. Therefore, it is recommended to handle outliers before normalization.

Standardization-

Standardization (or z-score normalization) scales the values while taking the standard deviation into account. If the standard deviations of the features differ, their resulting ranges differ from each other as well. This reduces the effect of outliers in the features.

In Python, the sklearn library provides a MinMaxScaler that normalizes the data to the range 0–1 by default (the range can be changed).
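A minimal sketch of both scalers using scikit-learn (the toy feature values are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1000.0], [2500.0], [4000.0], [6000.0]])  # one feature with a wide range

# Normalization: squeeze values into [0, 1] (the default feature_range, which can be changed)
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: zero mean, unit variance
X_std = StandardScaler().fit_transform(X)

print(X_minmax.ravel())
print(X_std.ravel())
```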

5. One-hot encoding

One-hot encoding is one of the most common encoding methods in machine learning. This method spreads the values in a column to multiple flag columns and assigns 0 or 1 to them. These binary values express the relationship between grouped and encoded columns.

This is useful when we are dealing with categorical data where there is no ordinal relation between the values. The algorithm cannot understand string data, but if we one-hot encode it, it can work with the labels. For example, if user 2 lives in Madrid, the model cannot interpret the string “Madrid” directly, but after one-hot encoding a separate column is created for each city, with a 1 in the Madrid column for that user and 0 in the rest.
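A minimal sketch of one-hot encoding with pandas (scikit-learn’s OneHotEncoder works similarly; the toy users and cities are made up):

```python
import pandas as pd

# Toy categorical data, similar to the city example above
df = pd.DataFrame({"user": [1, 2, 3],
                   "city": ["Paris", "Madrid", "Berlin"]})

# get_dummies spreads the 'city' column into one flag column per city,
# assigning 1 to the matching city and 0 to the rest
encoded = pd.get_dummies(df, columns=["city"])
print(encoded)
```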

6. Imbalanced dataset

In machine learning, we often come across cases where one class has far more data points than another. For example, in fraud detection problems there are very few cases of fraud, whereas legitimate cases are plentiful.


Let’s understand this with the help of an example.

Ex: In an utilities fraud detection data set you have the following data:

Total Observations = 1000

Fraudulent Observations = 20

Non Fraudulent Observations = 980

Event Rate = 20/1000 = 2%

There are mainly two algorithms to solve this problem (a code sketch for both follows below) —

  • SMOTE (Synthetic Minority Oversampling Technique) — Oversampling

SMOTE synthesizes new minority instances between existing minority instances. It generates virtual training records by linear interpolation for the minority class: for each example in the minority class, one or more of its k-nearest neighbours are selected at random and new points are created along the line segments joining them.

In the above case Non-Fraudulent Observations =980

Fraudulent Observations after replicating the minority class observations= 400

Total Observations in the new data set after oversampling=1380

Event Rate for the new data set after oversampling = 400/1380 = 29%

  • NearMiss Algorithm — Undersampling

This is an undersampling technique where the majority class is undersampled, i.e. we eliminate some data points from the majority class. When instances of the two classes are very close to each other, we remove majority-class instances to increase the space between the two classes. This helps in classification problems.

In this case, we take a 10% sample without replacement from the non-fraud instances and combine it with the fraud instances.

Non Fraudulent Observations after random under sampling = 10 % of 980 =98

Total Observations after combining them with Fraudulent observations = 20+98=118

Event Rate for the new dataset after under sampling = 20/118 = 17%
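To make both resampling techniques concrete, here is a minimal sketch assuming the imbalanced-learn (imblearn) package is installed; the synthetic data only mimics the roughly 2% event rate of the fraud example:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss

# Synthetic imbalanced data: roughly 2% minority class, like the fraud example
X, y = make_classification(n_samples=1000, weights=[0.98], random_state=42)
print("Original:", Counter(y))

# Oversampling the minority class with SMOTE
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
print("After SMOTE:", Counter(y_sm))

# Undersampling the majority class with NearMiss
X_nm, y_nm = NearMiss().fit_resample(X, y)
print("After NearMiss:", Counter(y_nm))
```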

7. Boosting

Boosting is a commonly used ensemble technique that combines several weak learners into a better, more efficient learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.

AdaBoost is a way of doing this where it takes multiple weak learners into account and combines them to form a stronger learner. The weak learners in AdaBoost are decision trees with a single split, called decision stumps. When AdaBoost creates its first decision stump, all observations are weighted equally.

That said, AdaBoost is not very efficient, so in its place we often use Gradient Boosting. It works in a similar way, adding predictors sequentially to an ensemble, but instead of re-weighting the misclassified instances, it fits each new predictor to the residual errors made by its predecessor.

Despite both these algorithms, we often don’t use them directly in our models nowadays. There is an algorithm named XGBoost (eXtreme Gradient Boosting), which is a direct, highly optimized application of gradient boosting to decision trees.
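A minimal XGBoost sketch, assuming the xgboost package is installed (the hyperparameter values are arbitrary, not recommendations):

```python
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic classification data in place of a real dataset
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Gradient-boosted decision trees; n_estimators, learning_rate and max_depth are the usual knobs
model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```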

8. Voting Classifier

After everything is done, there is always one question left: which algorithm is most suitable for our training model? A very simple way to create an even better classifier is to aggregate the predictions of several classifiers and predict the class that gets the most votes; this majority-vote classifier is called a hard voting classifier. The other way round is to check the accuracy of each algorithm and select the highest-rated one using GridSearchCV and cross-validation.
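A minimal sketch of a hard voting classifier evaluated with cross-validation, using scikit-learn (the choice of base estimators is arbitrary):

```python
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=42)

# Hard voting: each base classifier gets one vote and the majority class wins
voting_clf = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(random_state=42)),
    ("svc", SVC()),
], voting="hard")

# 5-fold cross-validation gives a quick estimate of how well the ensemble generalizes
print("Mean CV accuracy:", cross_val_score(voting_clf, X, y, cv=5).mean())
```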

To know more about cross-validation, check this out.

Using the above methods, we can select the appropriate algorithm for our training dataset.

  • To understand some of the concepts, we can have a look at the following notebook —

Note — Not every technique is shown in this notebook.

Dataset used — https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data/
