Feature Engineering Techniques for Machine Learning

Aditi Mittal
Published in Nerd For Tech · Mar 13, 2023

While working with data scientists, I have realized that experts spend almost 80% of their time on feature engineering, because it is a time-consuming and difficult process, yet an important step for building accurate models. In this article, I will discuss what feature engineering is, why it matters, and a few of the most important feature engineering techniques.


What is feature engineering?

Feature engineering is the process of selecting, manipulating, and transforming existing features into features that are more useful for a machine learning model to learn from. It can produce new features for both supervised and unsupervised learning. It generally simplifies and speeds up data transformations while also improving model accuracy. A terrible feature will have a direct, negative impact on the model, independent of the model type or data.

Importance of feature engineering

Feature engineering refers to the process of designing features for a machine learning algorithm, which the algorithm then uses to improve its performance. The main concern is to obtain an accurate model that learns the relationships between features well and generates correct predictions. When feature engineering is performed correctly, the final dataset is optimal and contains all the important features that affect the use case. As a result, the most accurate predictive model can be built.

Feature Engineering Techniques

Now I’ll discuss a few feature engineering techniques that can be used in most cases.

1. Imputation

It is very common for a machine learning dataset to have missing values, and missing values hurt the performance of machine learning models. Removing all rows with missing values is one solution, but it can also remove valuable information from other features. The main goal of imputation is to handle these missing values. There are two types of imputation:

  • Numerical Imputation: Numerical imputation fills missing values in numeric columns, for example with the mean of the column or with 0.

#Filling all missing values with 0
data = data.fillna(0)
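If filling with 0 would distort the feature, the column mean mentioned above is the other common choice. A minimal sketch, assuming the same pandas DataFrame data and a purely illustrative numeric column 'age':

#Filling missing values in a numeric column with its mean
data['age'] = data['age'].fillna(data['age'].mean())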

  • Categorical Imputation: Categorical imputation is used to fill missing values in categorical columns. Missing values are replaced by the most commonly occurring value in the other records. However, if the values in the column are evenly distributed and there is no clearly dominant value, adding a category like "Other" may be a better choice.

#Fill a categorical column with its most frequent value
df['column'] = df['column'].fillna(df['column'].value_counts().idxmax())

It is really important to be careful when using this technique: retaining the full dataset size can come at the cost of degraded data quality, so the fill value needs to be chosen thoughtfully.

2. Handling Outliers

Outliers are unusually high or low values, and they are present in almost every real-world dataset. Because outliers can adversely affect the efficiency of a machine learning model, they must be handled appropriately, and this should be done before model training. Outlier handling can be applied at a variety of scales to produce a more accurate representation of the data. A few methods for handling outliers are given below:

  1. Removal: Outlier-containing rows are deleted from the dataset. However, if there are many outliers across different features, this can remove a huge subset of the dataset and lose valuable information.
  2. Replacing values: Outliers can be treated as missing values and replaced using a suitable imputation technique.
  3. Capping: In this technique, values beyond a maximum and minimum threshold are replaced with an arbitrary value or a value drawn from the distribution (a short sketch follows this list).
  4. Discretization: Discretization is the process of converting continuous variables into discrete ones. This is done by creating intervals (or bins) that span the range of the variable. It can be applied to numerical as well as categorical values.
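As an illustration of capping, the sketch below clips a numeric column to the 1.5×IQR fences. The DataFrame df and the column 'Price' are only assumptions for the example, and the 1.5×IQR rule is just one common choice of threshold:

#Capping outliers in 'Price' at the 1.5*IQR fences
q1 = df['Price'].quantile(0.25)
q3 = df['Price'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df['Price'] = df['Price'].clip(lower=lower, upper=upper)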

3. Log Transform

Log transform is mostly used to turn a skewed distribution into a normal or less-skewed one. In this transform, we replace the values of a column with their logarithms, which brings the data closer to a normal distribution. Note that it can only be applied to strictly positive values.

import numpy as np

df['Price'] = np.log(df['Price'])

4. Encoding

One-hot encoding is a type of encoding in which each element of a finite set of categories is assigned an index in the range [0, n-1] and represented as a binary vector with a 1 at that index and 0 everywhere else. Unlike plain integer (label) encoding, this avoids implying an artificial ordering between categories.
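In pandas, one straightforward way to one-hot encode a column is get_dummies; the column name 'color' below is only illustrative:

import pandas as pd

#One-hot encoding the 'color' column into binary indicator columns
df = pd.get_dummies(df, columns=['color'])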

5. Scaling

After a scaling operation, the continuous features become comparable in range. Distance-based algorithms like k-NN and k-Means require scaled continuous features as model input. There are two common scaling approaches:

a) Normalization: Normalization scales all values into a specified range, typically between 0 and 1. This transformation does not change the shape of the feature’s distribution, but it can amplify the effect of outliers, since an extreme value squeezes all the other values into a narrow part of the range. Thus, outliers should be dealt with prior to normalization.

b) Standardization: Standardization scales values while taking the standard deviation into account, which reduces the effect of outliers. Each data point has the column mean subtracted from it, and the result is divided by the standard deviation. This produces a distribution with mean 0 and variance 1.
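A minimal sketch of both approaches on a numeric pandas column (the column name 'Age' is assumed only for illustration):

#Normalization: rescale 'Age' into the [0, 1] range
df['Age_norm'] = (df['Age'] - df['Age'].min()) / (df['Age'].max() - df['Age'].min())

#Standardization: zero mean and unit variance
df['Age_std'] = (df['Age'] - df['Age'].mean()) / df['Age'].std()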

6. Feature Splitting

Sometimes splitting a feature into parts can also improve the dataset and lead to a better model. For example, if a column contains a date together with a time, it may be better to use just the date rather than the raw combination.
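As an illustration, a combined date-time column can be split into separate parts; the sketch below assumes a pandas column 'purchase_time' holding timestamps (the column name is hypothetical):

#Splitting a datetime column into separate date and hour features
df['purchase_time'] = pd.to_datetime(df['purchase_time'])
df['purchase_date'] = df['purchase_time'].dt.date
df['purchase_hour'] = df['purchase_time'].dt.hour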

7. Creating Features

Creating features involves deriving new features from existing ones, using simple mathematical operations such as the mean, median, mode, sum, difference, or product of existing values. Even though these features are derived directly from the existing ones, a carefully chosen aggregation can have a real impact on the performance of the model.
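A hypothetical example, assuming the dataset has 'income' and 'spending' columns:

#Deriving new features from existing ones
df['savings'] = df['income'] - df['spending']
df['spending_ratio'] = df['spending'] / df['income']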

Conclusion

These were a few of the techniques most commonly used by data scientists to improve their datasets and build more efficient machine learning models.

Thanks for reading!
