Comprehensive Guide to Machine Learning (Part 2 of 3)

Tapas Das · Published in Analytics Vidhya · 7 min read · Aug 29, 2020


Pic Courtesy: https://www.educative.io/track/become-ml-engineer

Welcome to the second part of the “Comprehensive Guide to Machine Learning” series. In the first part of this series, we explored the following machine learning concepts.

  • Acquiring data
  • Data Cleansing
  • Exploratory Data Analysis

I hope you guys had fun playing around with these concepts on your own datasets. In this post, let’s dive deeper and explore the following concepts, which will significantly help in building a robust machine learning model.

  • Feature Engineering
  • Feature Selection
  • Train/Validation/Test split

Note: These three concepts form the backbone of the entire machine learning model development life-cycle. Any machine learning model is only as good as the features it’s been trained on. A good data engineer or data scientist should be able to identify and create intuitive features, as well as discard non-intuitive ones.

4) Feature Engineering

Let’s first understand what exactly a “feature” means. According to Wikipedia —

A feature is an attribute or property shared by all of the independent units on which analysis or prediction is to be done. Any attribute could be a feature, as long as it is useful to the model.

So, there are two fundamental properties of a “feature” in the context of machine learning.

  • All features should be independent of each other. That is, there should be minimal or no correlation between them. (We already discussed correlation analysis in the first part of this series.)
  • Each feature should be intuitive and useful to the model. It’s generally considered best practice to discard non-intuitive or non-useful features.

Now, let’s come to the definition of “Feature Engineering”.

Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms.

Domain knowledge is one of the keys to building a good machine learning model. It’s advisable to acquire sufficient knowledge of the domain or business area, before actually starting work on developing a machine learning model.

Now that we’ve got the theoretical knowledge of feature engineering, let’s look at the code for it.

  • Date-time features

Given a date-time value, we can derive the below attributes from it.

  1. Year
  2. Month
  3. Quarter
  4. Week
  5. Day of Year
  6. Day of Month
  7. Day of Week
  8. Whether the date falls on a weekend or not
  9. Hour
  10. Minute
  11. Second
  12. Minutes elapsed

The pandas python library already has ready-to-use functionality to derive all of the features mentioned above.
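Here is a minimal sketch of deriving these features with pandas’ `dt` accessor. The column name “ts” and the sample timestamps are assumptions for illustration; “minutes elapsed” is computed relative to the earliest timestamp in the column.

```python
import pandas as pd

# Toy data frame with a single datetime column (names are assumptions).
df = pd.DataFrame({"ts": pd.to_datetime([
    "2020-08-29 14:35:10",  # a Saturday
    "2020-08-30 09:05:45",
])})

df["year"] = df["ts"].dt.year
df["month"] = df["ts"].dt.month
df["quarter"] = df["ts"].dt.quarter
df["week"] = df["ts"].dt.isocalendar().week
df["day_of_year"] = df["ts"].dt.dayofyear
df["day_of_month"] = df["ts"].dt.day
df["day_of_week"] = df["ts"].dt.dayofweek          # Monday=0 ... Sunday=6
df["is_weekend"] = df["ts"].dt.dayofweek >= 5
df["hour"] = df["ts"].dt.hour
df["minute"] = df["ts"].dt.minute
df["second"] = df["ts"].dt.second
# Minutes elapsed since the earliest timestamp in the column.
df["minutes_elapsed"] = (df["ts"] - df["ts"].min()).dt.total_seconds() / 60
```

Note that `dt.isocalendar().week` is the modern replacement for the deprecated `dt.week` accessor.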

  • Categorical Features

Generally, the below two techniques are used for feature engineering of categorical variables.

  1. One-hot encoding

This can be done using the “get_dummies” function of the pandas library.
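A quick sketch of one-hot encoding with `pd.get_dummies`; the “color” column and its values are assumptions for illustration. Each distinct category level becomes its own binary indicator column.

```python
import pandas as pd

# Hypothetical categorical column (name and values are assumptions).
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One binary indicator column per distinct category level.
one_hot = pd.get_dummies(df["color"], prefix="color")
df = pd.concat([df, one_hot], axis=1)
```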

  2. Label encoding

This can be done using the “factorize” function of the pandas library.
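A sketch of label encoding with `pd.factorize`; the “size” column is an assumption for illustration. Each distinct value is mapped to an integer code in order of first appearance, and the returned `uniques` array holds the reverse mapping.

```python
import pandas as pd

# Hypothetical categorical column (name and values are assumptions).
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Integer code per distinct value, in order of first appearance.
df["size_code"], uniques = pd.factorize(df["size"])
```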


  • Continuous Features

For continuous features, it comes down to domain knowledge to generate new ones. For example, I derived “area” and “diagonal” values from the “length” and “height” features.

Similarly, I generated new features using different arithmetic combinations of the existing variables.
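A minimal sketch of both ideas, assuming hypothetical “length” and “height” columns: domain-driven features (area, diagonal) plus generic arithmetic combinations.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns (names and values are assumptions).
df = pd.DataFrame({"length": [3.0, 6.0], "height": [4.0, 8.0]})

# Domain-driven features.
df["area"] = df["length"] * df["height"]
df["diagonal"] = np.sqrt(df["length"] ** 2 + df["height"] ** 2)

# Generic arithmetic combinations of existing variables.
df["sum_lh"] = df["length"] + df["height"]
df["diff_lh"] = df["length"] - df["height"]
df["ratio_lh"] = df["length"] / df["height"]
```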


5) Feature Selection

Let’s first understand why it’s so important to perform feature selection.

Machine learning works on a simple rule — if you put garbage in, you will only get garbage to come out. By garbage here, I mean noise in data.

This becomes even more important when the number of features is very large. You need not use every feature at your disposal to create a model. You can assist your algorithm by feeding in only the features that are really important.

Below are the advantages of using feature selection.

  • It enables the machine learning algorithm to train faster.
  • It reduces the complexity of a model and makes it easier to interpret.
  • It improves the accuracy of a model if the right subset is chosen.
  • It reduces over-fitting.

In this post, I’m only going to discuss using PCA as a feature selection method.

PCA (Principal Component Analysis) is a dimensionality reduction technique that projects the data into a lower-dimensional space.

After finishing feature engineering, I ended up with 667 features, not all of them intuitive or useful for the final machine learning model. Now let’s apply PCA to compress them into a smaller set of components that retain most of the information.
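A sketch of the PCA step with scikit-learn. The feature matrix here is synthetic random data standing in for the 667 engineered features; standardizing first matters because PCA is sensitive to feature scale.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a matrix of 667 engineered features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 667))

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Keep the first 100 principal components.
pca = PCA(n_components=100)
X_reduced = pca.fit_transform(X_scaled)
```

`pca.explained_variance_ratio_` shows how much of the total variance each retained component captures, which is a useful guide when picking `n_components`.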

In this case, only 100 components are retained for training the machine learning model.


6) Train/Validation/Test split

When training a machine learning model, it’s essential to monitor the model’s performance and to ensure that there’s no “under-fitting” or “over-fitting”. The meaning of these two terms will become clear shortly.

Let’s say we have a data set of 10,000 records to build the machine learning model with. There are three possible scenarios when training the model.

  • Train the model on the entire data set of 10,000 records. But then, there’s no further data to validate whether the model is working as expected.
  • Separate out 2,000 records (test data set) and train the model on the remaining 8,000 records (train data set). This way, the model can be validated on the 2,000 held-out records once training is done. The problem with this approach is that as we tune the hyper-parameters and retrain/re-validate the model, it becomes biased toward the test data set, since the model has now seen both data sets.
  • Separate out 1,000 records (test data set), which the model only sees once we’re pretty sure it’s working as expected. Further separate out another 1,000 records (validation data set) to validate the model with different hyper-parameters, and use the remaining 8,000 records (train data set) to train the model. This solves the bias issue of the second approach.

Now that we have a theoretical idea of the train/validation/test data set split, let’s briefly discuss “under-fitting” and “over-fitting”.

Underfitting — When the model performs poorly on both the train and validation data sets. For example, we can say the model is under-fitting if it gives 70% accuracy on training data and 55% accuracy on validation data. In such scenarios, the best practice is to make the model more complex or to increase the number of features used to train it.

Overfitting — When the model performs excellently on training data, but poorly on validation data. For example, we can say the model is over-fitting if it gives 99% accuracy on training data and only 60% accuracy on validation data. In such scenarios, the best practice is to simplify the model or to decrease the number of features used to train it.

Now let’s look at the code for the train/validation/test data set split. The “sklearn” python library provides two functions for achieving this.

  • sklearn.model_selection.train_test_split — I usually prefer this for regression problems.
  • sklearn.model_selection.StratifiedShuffleSplit — I usually prefer this for classification problems.

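A sketch of both approaches on the 10,000-record scenario described above; `X` and `y` are synthetic stand-ins. `train_test_split` is chained twice to carve out the test and validation sets, and `StratifiedShuffleSplit` preserves the class proportions in each fold, which is why it suits classification problems.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Hypothetical data set of 10,000 records (X and y are assumptions).
X = np.arange(10_000).reshape(-1, 1)
y = X.ravel() % 2  # binary target

# Regression-style approach: two chained random splits.
# First hold out 1,000 test records, then carve 1,000 validation
# records out of the remaining 9,000.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=1_000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=1_000, random_state=42)

# Classification approach: preserve class proportions in each split.
sss = StratifiedShuffleSplit(n_splits=1, test_size=1_000, random_state=42)
train_idx, test_idx = next(sss.split(X, y))
```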


Concluding Remarks

This concludes the second part of the comprehensive machine learning guide. In the next post, I’ll cover model building, hyper-parameter tuning, and model validation techniques. Then we’ll make predictions on the test data set to ensure that the model is working as expected.

As always, you can find the codebase for this post at the below link. I’d highly recommend getting your own dataset (either from Kaggle or via web scraping) and trying out the different feature engineering and feature selection methods detailed in this post.

Please visit my blog (link below) to explore more on Machine Learning and Linux Computing.
