How to Improve the Quality of Your Machine Learning Models

Rohan · Published in The Startup · Nov 1, 2020

Photo by Luke Chesser on Unsplash

If you have some background in machine learning and you’d like to learn how to quickly improve the quality of your models, you’re in the right place!

In this blog, you will accelerate your machine learning expertise by learning how to:

  • tackle data types often found in real-world datasets (missing values, categorical variables),
  • design pipelines to improve the quality of your machine learning code,
  • use advanced techniques for model validation (cross-validation).

1. Handling Missing Values

A. Simple Option: Drop Columns with Missing Values

The simplest option is to drop columns with missing values.

Unless most values in the dropped columns are missing, the model loses access to a lot of (potentially useful!) information with this approach. As an extreme example, consider a dataset with 10,000 rows, where one important column is missing a single entry. This approach would drop the column entirely!

Photo by Kaggle.com
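A minimal sketch of this approach in pandas, assuming X_train and X_valid are your training and validation DataFrames:

```python
# Identify columns in the training data that contain any missing values.
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop the same columns from both the training and validation data,
# so the two sets stay consistent.
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
```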

B. A Better Option: Imputation

Imputation fills in the missing values with some number. For instance, we can fill in the mean value along each column.

The imputed value won’t be exactly right in most cases, but it usually leads to more accurate models than you would get from dropping the column entirely.

Photo by Kaggle.com
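With scikit-learn, mean imputation can be sketched roughly as follows (X_train and X_valid are assumed to be numeric DataFrames):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Fit the imputer on the training data only, then apply it to both sets,
# filling each missing entry with the mean of its column.
my_imputer = SimpleImputer(strategy="mean")
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# fit_transform() returns a plain array, so restore the column names.
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
```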

C. An Extension To Imputation

Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which weren’t collected in the dataset). Or rows with missing values may be unique in some other way. In that case, your model would make better predictions by considering which values were originally missing.

In this approach, we impute the missing values, as before. And, additionally, for each column with missing entries in the original dataset, we add a new column that shows the location of the imputed entries.

In some cases, this will meaningfully improve results. In other cases, it doesn’t help at all.

Photo by Kaggle.com
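Here is a sketch of that extension, building on the imputer above: for each column with missing entries we first add a boolean indicator column, then impute as before.

```python
# Make copies to avoid changing the original data.
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Add an indicator column marking which entries will be imputed.
for col in cols_with_missing:
    X_train_plus[col + "_was_missing"] = X_train_plus[col].isnull()
    X_valid_plus[col + "_was_missing"] = X_valid_plus[col].isnull()

# Impute, then restore the column names as before.
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns
```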

2. Categorical Variables

A categorical variable takes only a limited number of values.

  • Consider a survey that asks how often you eat breakfast and provides four options: “Never”, “Rarely”, “Most days”, or “Every day”. In this case, the data is categorical, because responses fall into a fixed set of categories.
  • If people responded to a survey about which brand of car they owned, the responses would fall into categories like “Honda”, “Toyota”, and “Ford”. In this case, the data is also categorical.

You will get an error if you try to plug these variables into most machine learning models in Python without preprocessing them first. In this section, we’ll compare three approaches that you can use to prepare your categorical data.

Three Approaches

A) Drop Categorical Variables

The easiest approach to dealing with categorical variables is to simply remove them from the dataset. This approach will only work well if the dropped columns do not contain useful information.
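In pandas, this can be as simple as excluding the object-dtype columns (which usually hold text categories):

```python
# Keep only the non-text columns in both sets.
drop_X_train = X_train.select_dtypes(exclude=["object"])
drop_X_valid = X_valid.select_dtypes(exclude=["object"])
```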

B) Label Encoding

Label encoding assigns each unique value to a different integer.

Photo by Kaggle.com

This approach assumes an ordering of the categories: “Never” (0) < “Rarely” (1) < “Most days” (2) < “Every day” (3).

This assumption makes sense in this example, because there is an indisputable ranking to the categories. Not all categorical variables have a clear ordering of their values, but we refer to those that do as ordinal variables. For tree-based models (like decision trees and random forests), you can expect label encoding to work well with ordinal variables.
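A sketch using scikit-learn’s OrdinalEncoder, assuming a hypothetical “breakfast” column holding the survey responses. Note that without an explicit categories argument, the encoder assigns integers alphabetically rather than by the true ranking:

```python
from sklearn.preprocessing import OrdinalEncoder

# Spell out the intended order so "Never" (0) < "Rarely" (1) < ... holds.
ordinal_encoder = OrdinalEncoder(
    categories=[["Never", "Rarely", "Most days", "Every day"]]
)

# "breakfast" is an assumed column name for this example.
label_X_train = X_train.copy()
label_X_train[["breakfast"]] = ordinal_encoder.fit_transform(
    X_train[["breakfast"]]
)
```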

C) One-Hot Encoding

One-hot encoding creates new columns indicating the presence (or absence) of each possible value in the original data. To understand this, we’ll work through an example.

Photo by Kaggle.com

In the original dataset, “Color” is a categorical variable with three categories: “Red”, “Yellow”, and “Green”. The corresponding one-hot encoding contains one column for each possible value, and one row for each row in the original dataset. Wherever the original value was “Red”, we put a 1 in the “Red” column; if the original value was “Yellow”, we put a 1 in the “Yellow” column, and so on.

In contrast to label encoding, one-hot encoding does not assume an ordering of the categories. Thus, you can expect this approach to work particularly well if there is no clear ordering in the categorical data (e.g., “Red” is neither more nor less than “Yellow”). We refer to categorical variables without an intrinsic ranking as nominal variables.

One-hot encoding generally does not perform well if the categorical variable takes on a large number of values (i.e., you generally won’t use it for variables taking more than 15 different values).
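A sketch with scikit-learn’s OneHotEncoder, using the “Color” column from the example above (in scikit-learn versions before 1.2, pass sparse=False instead of sparse_output=False):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# handle_unknown="ignore" zeroes out categories unseen during training
# instead of raising an error on the validation data.
oh_encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

oh_cols = pd.DataFrame(
    oh_encoder.fit_transform(X_train[["Color"]]),
    index=X_train.index,
    columns=oh_encoder.get_feature_names_out(["Color"]),
)

# Replace the original categorical column with the one-hot columns.
oh_X_train = pd.concat([X_train.drop("Color", axis=1), oh_cols], axis=1)
```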

3. Pipelines

Pipelines are a simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.

Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:

  1. Cleaner Code: Accounting for data at each step of preprocessing can get messy. With a pipeline, you won’t need to manually keep track of your training and validation data at each step.
  2. Fewer Bugs: There are fewer opportunities to misapply a step or forget a preprocessing step.
  3. Easier to Productionize: It can be surprisingly hard to transition a model from a prototype to something deployable at scale. We won’t go into the many related concerns here, but pipelines can help.
  4. More Options for Model Validation: You will see an example in the next section, which covers cross-validation.
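As a sketch of what this looks like in scikit-learn (the column selection, the choice of RandomForestRegressor, and the names X_train, y_train, and X_valid are illustrative assumptions, not the only option):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Split columns by dtype: numeric columns get imputed, text categories
# get one-hot encoded.
numeric_cols = X_train.select_dtypes(exclude=["object"]).columns
categorical_cols = X_train.select_dtypes(include=["object"]).columns

preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="mean"), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Bundle preprocessing and modeling into a single estimator.
my_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(n_estimators=100, random_state=0)),
])

# The whole bundle now behaves like one model.
my_pipeline.fit(X_train, y_train)
preds = my_pipeline.predict(X_valid)
```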

4. Cross-Validation

What is cross-validation?

In cross-validation, we run our modeling process on different subsets of the data to get multiple measures of model quality.

For example, we could begin by dividing the data into 5 pieces, each 20% of the full dataset. In this case, we say that we have broken the data into 5 “folds”.

Photo by Kaggle.com

Then, we run one experiment for each fold:

  • In Experiment 1, we use the first fold as a validation (or holdout) set and everything else as training data. This gives us a measure of model quality based on a 20% holdout set.
  • In Experiment 2, we hold out data from the second fold (and use everything except the second fold for training the model). The holdout set is then used to get a second estimate of model quality.
  • We repeat this process, using every fold once as the holdout set. Putting this together, 100% of the data is used as holdout at some point, and we end up with a measure of model quality that is based on all of the rows in the dataset (even if we don’t use all rows simultaneously).
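With scikit-learn, this whole procedure is one call to cross_val_score. Here X and y are assumed to be the full feature matrix and target, and my_pipeline is the pipeline from the previous section (reusing a pipeline keeps the preprocessing inside each fold, which avoids leaking validation data into the imputer or encoder):

```python
from sklearn.model_selection import cross_val_score

# cv=5 creates the 5 folds described above. scikit-learn reports
# "neg_mean_absolute_error" (higher is better), so negate it to get MAE.
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring="neg_mean_absolute_error")

print("MAE for each fold:", scores)
print("Average MAE:", scores.mean())
```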

When should you use cross-validation?

Cross-validation gives a more accurate measure of model quality, which is especially important if you are making a lot of modeling decisions. However, it can take longer to run, because it estimates multiple models (one for each fold).

So, given these tradeoffs, when should you use each approach?

  • For small datasets, where extra computational burden isn’t a big deal, you should run cross-validation.
  • For larger datasets, a single validation set is sufficient. Your code will run faster, and you may have enough data that there’s little need to re-use some of it for holdout.

There’s no simple threshold for what constitutes a large vs. small dataset. But if your model takes a couple of minutes or less to run, it’s probably worth switching to cross-validation.

Alternatively, you can run cross-validation and see whether the scores for each experiment seem close. If each experiment yields similar results, a single validation set is probably sufficient.

Citation: https://www.kaggle.com/learn/intermediate-machine-learning

DISCLAIMER: I hereby declare that I do not own the rights to the images/content. All rights belong to Kaggle Inc. No copyright infringement intended. This blog is just for educational purposes.
