Choosing & Fine-Tuning your Machine Learning Model

Big Data at Berkeley
16 min read · Jul 3, 2020


By Aurum Kathuria


How do I pick the right Machine Learning Model? How do I get it to perform well? If these are questions that you’re asking yourself, read on. This article will give you an overview of how to choose and fine-tune your supervised Machine Learning (ML) model.

Some Assumptions About You

I’m going to assume a few things about you, reader. Yes, you. Even if these assumptions aren’t valid, I’m sure this article will still be valuable for you. If they are valid, however, you’ll get even more out of the content here. The first two assumptions below are super important; the remaining three a little less so, but they’re significant regardless.

Assumptions

  1. You’re familiar with some of the basics of machine learning
    a. Train vs. Test Split
    b. Bias-Variance Tradeoff
  2. You don’t need an unsupervised algorithm
    a. I’m only covering supervised ML models here, which means this only really applies if you have labels or a target variable for your model.
  3. You have a specific use-case in mind
    a. This is so you can make decisions about what model to use most effectively, for your specific use case.
  4. You understand the nuances of your use case
    a. How much data you have
    b. How accurate the data is
    c. How many features you’ll be using
    d. Etc.
  5. You know about the American response to COVID-19
    a. This article is COVID-themed, so I’ll be sprinkling in a bunch of references to how people are responding to the crisis (more like drowning, but oops!)

How To Understand If Your Task Is Regression or Classification


To choose the right kind of machine learning model, you first need to decide which models will work for your task and which won’t. The first step is deciding whether your task is regression or classification. Let’s start with some definitions.

Regression: the prediction of a target value (numerical label) based on individual characteristics


Let’s see a few examples of this.

  1. Predicting the number of COVID-19 cases in a country
  2. Predicting the amount of the loan a small business can get under the Paycheck Protection Program
  3. Quantifying the risk of infection for a retail worker in specific areas
  4. Predicting the number of masks available for healthcare workers

Classification: the task of assigning classes (categorical labels) to data based on shared characteristics


And let’s get a few examples here as well.

  1. Classifying a symptomatic individual as infected with COVID-19 or with the common cold
  2. Determining the severity of infection of an infected individual — asymptomatic, mild, moderate, or severe
  3. Classifying a business as essential or non-essential in California
  4. Classifying a respiratory virus’s spread as an outbreak, epidemic, or pandemic

A Quick Note

In many ways, classification can be considered a discrete version of regression, and regression a continuous version of classification. Here are some quick examples.

Classification problem (1) could become a probability assignment to COVID-19 and not common cold (or vice versa).
Regression problem (3) could become a classification problem, determining if the infection risk is low, moderate, or high.

If you choose to transition from classification to regression, you can usually gain precision (a specific number instead of a broad category) but at the cost of accuracy (predicting “infection risk is 13%” is a lot harder than predicting “infection risk is low”).

This is not always the case. However, when it is, you can adjust your approach accordingly, sometimes with large improvements.

How To Choose Your ML Model


The meat of the article. What you’ve alllll been waiting for. The moment … has arrived.

Here’s how I’ve structured my analysis.

Every model gets 5 parts, each designed to maximize how much you understand about the model in about a page or less. Next to the name of each model is either a bolded R or a bolded C, possibly both if the model can do both regression and classification.

  1. How the model works
    a. This is a relatively brief, quick intro that avoids too much math and focuses more on intuitive ideas. It’s anything interesting or relevant to improving your understanding of what’s happening behind the scenes, so you better understand its strengths and weaknesses.
  2. Pros
    a. Pretty straightforward — I’ll include anything that might be helpful in understanding where the model performs well. Also, in every model, I’ll put the bias and variance of the model here or below.
  3. Cons
    a. A shocker after the last one, I know. Pro tip: think about your use case and identify a few things that make your problem uniquely hard or easy, and note down which models are able to benefit from those easy parts and can ignore those hard parts.
  4. Libraries
    a. Any common modules that contain an implementation of this model. A simple import of what I put here (like from sklearn.linear_model import LinearRegression) should be enough to get you rolling with that model.
  5. Relevant Resources
    a. Usually a link to the documentation and some good articles that will help you implement and understand the models better.

With that said, let’s dive in!

Linear Regression [R]

How It Works

With Linear Regression, we’re trying to find a line that cuts through the middle of our data. We determine how “middle” is middle enough by minimizing the sum of the squared errors between the line’s prediction and the actual value, for every point. As a result, Linear Regression is also commonly called the Ordinary Least Squares (OLS) model.
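
Here’s a minimal sketch of what that looks like in scikit-learn (the numbers below are made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: day number vs. cumulative COVID-19 case count
X = np.array([[1], [2], [3], [4], [5]])   # feature: days since the first case
y = np.array([3, 7, 12, 18, 25])          # target: number of cases

model = LinearRegression()
model.fit(X, y)                            # finds the line minimizing the squared errors
print(model.coef_, model.intercept_)       # slope and intercept of the fitted line
print(model.predict([[6]]))                # extrapolate to day 6
```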

Pros

  • Extremely effective for linear relationships and simple modeling
  • Low Variance
  • The fitted line stays very similar even if the training points change a little
  • Super easy and fast to train
  • Flexible model
  • Take the log of an exponential relationship to get a linear relationship
  • Take the square root of a quadratic relationship to get a linear relationship
  • Can generalize to data outside the range of the training data

Cons

  • High Bias
  • Very few relationships can be modeled in a linear fashion
  • Very sensitive to outliers
  • Because OLS minimizes the squared error, a few far-off points grow the error heavily and thus the model adjusts drastically to compensate
  • Not effective for non-polynomial relationships
  • Can’t capture interactions between variables unless you add interaction terms yourself
  • Very Simple

Libraries

sklearn.linear_model.LinearRegression

Relevant Resources

Documentation: sklearn.linear_model.LinearRegression — scikit-learn 0.23.1 documentation

Articles: A Beginner’s Guide to Linear Regression in Python with Scikit-Learn

Polynomial Regression [R]


How It Works

With Polynomial Regression, we’re doing something similar to Linear Regression. We’re still trying to find the curve that minimizes the squared error, except now it’s a polynomial whose degree you can set arbitrarily, rather than a straight line.
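
In scikit-learn, this is usually done by expanding the features and then running ordinary Linear Regression on them. A minimal sketch, with made-up numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1], [2], [3], [4], [5]])   # hypothetical single feature
y = np.array([2, 9, 20, 35, 54])          # roughly quadratic target

# degree=2 adds x^2 (and a bias term) as extra features before the linear fit
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
print(poly_model.predict([[6]]))
```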

Pros

  • Allows you to more easily understand polynomial relationships between your input and target data
  • Low Variance (assuming low maximum degree)
  • As long as the maximum degree of the polynomial you set is below ~3, you can expect consistent performance and outputs for not-too-small datasets
  • Easy to train
  • Generalizes to data outside the domain of the input data

Cons

  • Moderate-to-High Bias
  • For non-polynomial relationships, it’s still difficult for this model to internalize the nuances of the relationship
  • Very easy to overfit
  • Setting the maximum degree to more than 3 means you have at least 3 times as many features to train
  • Sensitive to outliers
  • Because points far from the prediction are penalized heavily, this picks up on outliers and considerably alters itself to compensate

Libraries

sklearn.linear_model.LinearRegression & sklearn.preprocessing.PolynomialFeatures

Resources

Documentation: sklearn.linear_model.LinearRegression — scikit-learn 0.23.1 documentation
sklearn.preprocessing.PolynomialFeatures — scikit-learn 0.23.1 documentation

Articles: Python | Implementation of Polynomial Regression
Machine Learning: Polynomial Regression with Python

Logistic [RC]

How It Works

Logistic regression takes in input data belonging to one of two classes and fits a logistic curve to maximize the probability of a correct prediction at any point. Then, it outputs the probability that a data point is in one of the classes (and, because there are only two classes, the probability that it’s in the other class is just one minus that).
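
A minimal sketch with scikit-learn, using made-up symptom data (the features and labels here are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [temperature (F), days of cough]; label: 1 = COVID-19, 0 = common cold
X = np.array([[98.6, 1], [100.4, 3], [99.1, 2], [102.0, 5], [98.7, 0], [101.2, 4]])
y = np.array([0, 1, 0, 1, 0, 1])

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict_proba([[100.8, 3]]))  # [P(cold), P(COVID)] -- set your own threshold on these
```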

Pros

  • Gives probabilities, not just classes
  • This allows you to determine your own threshold depending on your own situation
  • In medicine, a low threshold is used — if there’s even a 10% chance you have COVID, we want to know about that
  • Modest Variance
  • Pretty consistent model, not too affected by outliers
  • Modest Bias
  • Pretty good at being accurate so long as your data is close to linearly separable
  • Will suck if your data overlaps a lot (meaning you have very similar or overlapping inputs but opposite outputs)
  • Can be used for classification easily

Cons

  • Struggles with many features
  • Output space is restricted to input space
  • You can’t predict things you weren’t explicitly prepared for
  • Generally limited regression use-case
  • Weak for non-linearly separable inputs
  • Overall, dependent on a good starting point

Libraries

sklearn.linear_model.LogisticRegression

Resources

Documentation: sklearn.linear_model.LogisticRegression — scikit-learn 0.23.1 documentation

Articles: Building a Logistic Regression in Python
Logistic Regression vs Decision Trees vs SVM: Part II

kNN [RC]


How It Works

A k-Nearest Neighbors algorithm (kNN) uses its training data in an unusual way: it simply stores all of it. It then compares new data to the stored training data and finds the k stored data points most similar to the new data.

Classification: Then, it outputs the most common class of those k points (mode).

Regression: Then, it outputs the average of the values of those k points (mean).
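
A minimal sketch of the classification case with scikit-learn (toy numbers, purely illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical features: [temperature (F), days of cough]; label: 1 = COVID-19, 0 = common cold
X = np.array([[98.6, 1], [100.4, 3], [99.1, 2], [102.0, 5], [98.7, 0], [101.2, 4]])
y = np.array([0, 1, 0, 1, 0, 1])

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3 nearest stored points
knn.fit(X, y)                               # "training" is essentially just storing the data
print(knn.predict([[100.0, 2]]))            # the mode of the 3 nearest labels
```

For regression, sklearn.neighbors.KNeighborsRegressor works the same way but returns the mean of the k neighbors’ values.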

Pros

  • Almost 0 training time
  • Easy to interpret
  • Very insensitive to outliers
  • If you’re far away, you can’t be one of my nearest neighbors!
  • Modest Bias
  • Moderate Variance
  • No assumptions about the data
  • Great for nonlinear/non-polynomial data!
  • Works for both regression and classification problems

Cons

  • Impractical to scale because of its massive prediction (testing) time
  • You compare new data to every training data point
  • Large memory requirement
  • You need to store all the training data, not just some final computed values
  • Sensitive to noisy data

Libraries

sklearn.neighbors.KNeighborsClassifier
sklearn.neighbors.KNeighborsRegressor
sklearn.neighbors.NearestNeighbors

Resources

Documentation: sklearn.neighbors.NearestNeighbors — scikit-learn 0.23.1 documentation

Articles: The k-Nearest Neighbors Algorithm In Python

Support Vector Machines [RC]

How It Works

Support Vector Machines (SVM) use what’s called the “kernel trick” — here’s a quick recap:

  • The model takes your data and, through some clever linear algebra, implicitly transforms it into a space with more features/dimensions, where the data becomes much easier to separate.
  • For regression, the SVM fits a function in this transformed space and outputs a value.
  • For classification, it finds the boundary with the widest margin between the classes and outputs a class.
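
A minimal sketch of an SVM classifier with an RBF kernel in scikit-learn (the toy points below are arbitrary):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [1, 0], [0, 1], [2, 2], [2, 3]])
y = np.array([0, 0, 0, 0, 1, 1])

svm = SVC(kernel="rbf", C=1.0)   # the kernel implicitly maps the data to a higher-dimensional space
svm.fit(X, y)
print(svm.predict([[1.5, 2.0]]))
```

For regression, sklearn.svm.SVR offers the same idea with a continuous output.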

Pros

  • Works great for classification
  • Works for regression
  • Modest Bias
  • Modest Variance
  • Works for nonlinear data

Cons

  • Difficult to interpret
  • Does not scale well
  • Transforming data is an expensive procedure
  • Moderately sensitive to noisy data

Libraries

libsvm
sklearn.svm.SVC
sklearn.svm.NuSVC
sklearn.svm.LinearSVC

Resources

Documentation: sklearn.svm.SVC — scikit-learn 0.23.1 documentation
sklearn.svm.NuSVC — scikit-learn 0.23.1 documentation
sklearn.svm.LinearSVC — scikit-learn 0.23.1 documentation
Support Vector Machine (LibSVM)

Articles: Support Vector Machines — Introduction to Machine Learning Algorithms

Decision Tree [RC]

How It Works

A Decision Tree trains iteratively:

  • First, it finds the feature and value that most easily “splits” the data into two parts.
  • If a data point meets the criteria (say, coughing == True or temperature > 100), the Decision Tree decides to put that point in one subset; if not, it decides to put the point in the other one.
  • It repeats this finding and deciding process on each subset, until no more splits are possible (usually when there’s only one object left in each subset).

For classification, it finds the feature and value that lead to the best separation of classes.

For regression, it instead finds the feature and value that result in the lowest standard deviation (spread of target values) within each subset.

When testing, it simply follows its decision rules, based on those chosen features and values, and determines what to do with a data point.
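
A minimal sketch of a classification tree in scikit-learn, with the learned rules printed so you can see the interpretability for yourself (the symptom data is made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [coughing (0/1), temperature (F)]; label: 1 = infected
X = np.array([[1, 101.5], [0, 98.6], [1, 99.0], [0, 100.8], [1, 102.3], [0, 98.2]])
y = np.array([1, 0, 0, 1, 1, 0])

tree = DecisionTreeClassifier(max_depth=2)   # limiting depth is one simple form of pruning
tree.fit(X, y)
print(export_text(tree, feature_names=["coughing", "temperature"]))
print(tree.predict([[1, 100.2]]))
```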

Pros

  • Easy to interpret
  • You can easily tell exactly how the algorithm is making its decisions
  • Works for both regression and classification
  • Great starting point for classification tasks
  • Great for non-linear features

Cons

  • Low Bias
  • Will memorize the training set entirely
  • High Variance
  • Changing a few points can drastically change the decision boundaries
  • Very prone to overfitting
  • Needs significant pruning and regularization in order to achieve great results
  • No confidence scores to understand how sure a prediction is

Libraries

sklearn.tree.DecisionTreeClassifier
sklearn.tree.DecisionTreeRegressor

Resources

Documentation: sklearn.tree.DecisionTreeClassifier — scikit-learn 0.23.1 documentation
sklearn.tree.DecisionTreeRegressor — scikit-learn 0.23.1 documentation

Articles: How Decision Tree Algorithm works
Logistic Regression vs Decision Trees vs SVM: Part II

Random Forest [RC]

How It Works

A RandomForest creates a bunch of DecisionTrees, each trained on a random sample of the data (and a random subset of the features), and outputs the most common (classification) or average (regression) prediction of all its member DecisionTrees. The naming is yet another shocker, I know.
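
A minimal sketch in scikit-learn, reusing the made-up symptom data from the DecisionTree example:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features: [coughing (0/1), temperature (F)]; label: 1 = infected
X = np.array([[1, 101.5], [0, 98.6], [1, 99.0], [0, 100.8], [1, 102.3], [0, 98.2]])
y = np.array([1, 0, 0, 1, 1, 0])

forest = RandomForestClassifier(n_estimators=100)   # 100 trees, each on a bootstrap sample of the data
forest.fit(X, y)
print(forest.predict([[1, 100.2]]))                 # the most common prediction across the trees
```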

Pros

  • Modest Bias
  • Moderate Variance
  • By averaging over a bunch of different DecisionTrees, it smooths out the high variance of any single tree
  • Much more effective than a DecisionTree
  • Great for classification, good for regression
  • Great for non-linear data

Cons

  • Hard to visualize and understand
  • It doesn’t get much more accurate with more data
  • Limited generalizability for data outside the input space
  • Moderate training time

Libraries

sklearn.ensemble.RandomForestRegressor

sklearn.ensemble.RandomForestClassifier

Resources

Documentation: sklearn.ensemble.RandomForestRegressor — scikit-learn 0.23.1 documentation
sklearn.ensemble.RandomForestClassifier — scikit-learn 0.23.1 documentation

Articles: Understanding Random Forest Models
What Are The Advantages And Disadvantages For A Random Forest Algorithm?

How To Fine-Tune Your Models


Regularization

Regularization is a technique for avoiding overfitting and helping your model generalize effectively. If you’re using regularization, you’re changing the way you evaluate your model. Normally, you simply evaluate a regression model by how close it is to the answer, and a classification model by whether it predicts the right class or not (with more complexity than I’m mentioning here). With regularization, however, you add a penalty to your model for its weights; in other words, you evaluate the model both by how well it performs and how complex it is.

A complex model is likely to have larger weights and more weights; a simpler model has smaller weights and fewer weights. By adding regularization, you “encourage” your model to be simpler and focus on both getting it right and not memorizing the training set. By not memorizing the training set, the model is now able to pick out the trends more effectively and generalize better to new data.

There are two common regularization loss functions, L1 loss and L2 loss. Let’s start with L1 loss.

L1 Loss

L1 loss adds a penalty for the sum of the absolute values of the weights of your model. It’s closely related to Mean Absolute Error (MAE), which applies the same absolute-value penalty to prediction errors rather than weights. As a result, a weight that’s large is “punished” more than a weight that’s small; however, a small reduction in a small weight has the same effect on the loss function as a small reduction in a large weight. That’s why L1 loss often results in a few large weights and a bunch of zero weights. Dropping those smaller weights gives a noticeable reduction in loss without a significant change in accuracy, so the model begins to prefer the large weights that make a big difference in the accuracy of its predictions. Another implication of using L1 loss is that you get a little more speed out of your model during inference. Multiplying by 0 is pretty easy and fast compared to multiplying by almost anything else, so your model spends less time going through computations, especially as you increase the complexity (as measured by the number of weights!) of your model.

L2 Loss

L2 loss adds a penalty for the sum of the squares of the weights of your model. It’s closely related to Mean Squared Error (MSE), a term you’re probably familiar with, which applies the same squared penalty to prediction errors rather than weights. With L2 loss, a large weight is punished much, much more than a smaller weight. Reducing a small weight by a small amount reduces the loss by a very small amount; reducing a large weight by a small amount reduces the loss by a large amount. Thus, L2 loss doesn’t penalize a bunch of small weights as much as it punishes even one large weight. This means L2 loss often encourages your model to have many small weights and very few large weights; it doesn’t care about those small weights nearly as much as it cares about the larger ones.

These are not the only ones; there are many others, like Huber Loss and Log-Cosh. Take a look at this great article on loss functions for more information.

L1 Loss: Retains only the most important weights

L2 Loss: Reduces all weights
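
As a quick illustration, scikit-learn’s Lasso and Ridge models are linear regressions with L1 and L2 penalties on the weights, respectively. A minimal sketch on synthetic data where only two of ten features actually matter:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))              # 10 features; only the first 2 influence the target
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)          # L1 penalty: most weights driven to exactly 0
ridge = Ridge(alpha=0.1).fit(X, y)          # L2 penalty: all weights shrunk, few exact zeros
print(lasso.coef_)
print(ridge.coef_)
```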

Cross-Validation

You’re always ensuring your model performs well on your validation set, but what if your validation set is a little biased? Or worse, your training set is biased? To counter these possibilities, you can use cross validation.

Let’s run through a quick example of how to do cross validation:

  • Split your data into 3 subsets.
  • Train on the first two, run validation on the last.
  • Then, train on the first and last, validate on the middle.
  • Then, train on the last two, validate on the first.

What you’ve just done is 3-fold cross validation. You can increase from 3 to any integer k, which is how we get k-fold cross validation.

Here’s the essence of what happens when you’re doing cross validation. You create a few models that are pretty similar to your final model, and you evaluate them on different subsets of your data. By looking at how these models perform on the different validation sets, you get a good sense of how your overall model will perform on your test set, and more broadly how well this kind of model works for your problem.

This technique is especially powerful when data is scarce. It’s almost like pretending you have more data by reusing it in slightly different contexts, and getting a sense of how you’re doing before you dive in. Because every point gets used for validation at some stage, the performance estimate you end up with is both less biased and more stable, so you get a better sense of how your final model will perform.

However, it does slow your process down, since you have k times as many training and validation sets to run through.

An easy way to do cross-validation in Python: sklearn.model_selection.cross_val_score
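
A minimal sketch of that helper in action, on synthetic data (the model and fold count are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 4))                       # hypothetical features
y = (X[:, 0] + X[:, 1] > 0).astype(int)            # hypothetical binary labels

scores = cross_val_score(LogisticRegression(), X, y, cv=3)   # 3-fold CV, as in the example above
print(scores)          # one accuracy score per fold
print(scores.mean())   # a more reliable estimate of out-of-sample performance
```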

Ensembling

Ensembling is a creative technique that I love exploring. When you’re ensembling, you create, say, 10 different, relatively simple models on subsets of the data. Then, you put them together: use some way of aggregating their predictions and call that your final answer. This could be taking the mean or median, or for classification maybe it’s the mode. What you’re doing is letting their combined predictions cancel out the individual models’ mistakes, cutting down variance (and often bias) and thus making the ensemble more accurate. (Note that you can ensemble with any number of models, not just 10!)

You might recognize this concept. See if you can find which model you read about earlier utilizes this technique. The answer is at the bottom of this section.

There are some drawbacks to using an ensemble model, however. Since you have more models, you’re going to need more time to both train and evaluate your model. Plus, it’s harder to get an understanding of what’s going wrong if something is going wrong, as you’re looking at a bunch of different models at once instead of one.
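
One simple way to ensemble by majority vote is scikit-learn’s VotingClassifier; a minimal sketch with three arbitrary member models:

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))                # hypothetical features
y = (X[:, 0] - X[:, 2] > 0).astype(int)     # hypothetical binary labels

ensemble = VotingClassifier(estimators=[
    ("logreg", LogisticRegression()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("tree", DecisionTreeClassifier(max_depth=3)),
])                                          # default voting="hard": take the mode of the three predictions
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```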

More Data!!

For almost every ML model, the more practice it gets, the better it learns. So if you can get more data, your model is more capable of understanding the relationships you’re looking for it to identify.

Fitting a million parameter neural network to a million data points is akin to fitting a curve to every single point — the real learning just isn’t happening. Instead, fit a model of 10k parameters to 10 million data points and allow it to really learn the patterns and relationships that matter!

As usual, adding more data means more time spent training. Data collection may also be difficult, expensive, or otherwise resource-intensive, so this is often the hardest thing to do.

Hyperparameter Optimization


Hyperparameter optimization only applies to models that have settings you choose yourself. The number of DecisionTrees in a RandomForest; the k in k-NearestNeighbors; the learning rate in neural networks. These are all values that you have to set manually (the model doesn’t learn them from data), and they’re thus called hyperparameters. Finding the value of each hyperparameter that gets you the best result can make a large positive impact on your model.

Some common ways to do this:

  • Test models with a set of parameters on small subsets of your data
  • Run full-on k-fold cross validation on a series of similar models, only changing the hyperparameters each time.

By doing these, you’re able to get a sense of what hyperparameters would work best for your task.
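
The second approach is exactly what scikit-learn’s GridSearchCV does. A minimal sketch, with an arbitrary model and parameter grid chosen for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))               # hypothetical features
y = (X[:, 0] * X[:, 1] > 0).astype(int)     # hypothetical nonlinear labels

param_grid = {"n_estimators": [50, 100], "max_depth": [2, 4, None]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)   # k-fold CV for every combination
search.fit(X, y)
print(search.best_params_, search.best_score_)
```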

This again takes more time to complete, especially if you have a lot of data or a lot of hyperparameters to test. To mitigate both of these issues, there are some clever approaches people have developed, but those are outside the scope of this article.

Ensembling Answer: RandomForests! They take a bunch of DecisionTrees and output something based on each of the trees’ predictions.

Quick Recap

Here are the key concepts we covered:

Choosing Your ML Model

  • Linear Regression [R]
  • Polynomial Regression [R]
  • Logistic [RC]
  • k-Nearest Neighbors [RC]
  • Support Vector Machines [RC]
  • DecisionTrees [RC]
  • RandomForests [RC]

Fine-Tuning Techniques

  • Regularization
  • Cross-Validation
  • Ensembling
  • More Data
  • Hyper-Parameter Optimization

Conclusion

Well, this was a long read (and trust me, a much much longer write), but I hope you got something valuable out of this article! Everything has its benefits and drawbacks, and the best choice for you depends on your specific use-case. So when it comes to choosing what model to run or how to best rein it in, keep your problem in mind. The things that make your problem uniquely challenging or simple will define how you use these ideas.

Now, it’s your turn. Go do something great — you’ve got the entire Big Data at Berkeley team cheering you on!

As usual, we love hearing your thoughts and feedback. If you enjoyed this article, please drop a clap below! (We’re not in a theater, but I promise your claps are heard!) Feel free to comment about any models you want us to discuss in the future, let me know if I missed something important, or just share how this article helped you! (Or just tell me I suck if you hate my writing, the choice is yours.) All feedback is welcome!

Special shoutout to Kendall Kikkawa, Ronak Laddha, and Riya Master for helping with this article!
