Choosing & Fine-Tuning your Machine Learning Model

Big Data at Berkeley
16 min read · Jul 3, 2020


By Aurum Kathuria


How do I pick the right Machine Learning Model? How do I get it to perform well? If these are questions that you’re asking yourself, read on. This article will give you an overview of how to choose and fine-tune your supervised Machine Learning (ML) model.

Some Assumptions About You

I’m going to assume a few things about you, reader. Yes, you. Even if these assumptions aren’t valid, I’m sure this article will still be valuable for you. If they are valid, however, you’ll get even more out of the content here. The first two assumptions below are super important; the remaining three a little less so, but they’re significant regardless.

Assumptions

  1. You’re familiar with some of the basics of machine learning
    a. Train vs. Test Split
    b. Bias-Variance Tradeoff
  2. You don’t need an unsupervised algorithm
    a. I’m only covering supervised ML models here, which means this only really applies if you have labels or a target variable for your model.
  3. You have a specific use-case in mind
    a. This is so you can make decisions about what model to use most effectively, for your specific use case.
  4. You understand the nuances of your use case
    a. How much data you have
    b. How accurate the data is
    c. How many features you’ll be using
    d. Etc.
  5. You know about the American response to COVID-19
    a. This article is COVID-themed, so I’ll be sprinkling in a bunch of references to how people are responding to the crisis (more like drowning, but oops!)

How To Understand If Your Task Is Regression or Classification


To choose the right kind of machine learning model, you first need to decide which models will work for your task and which won’t. The first step is deciding whether your task is regression or classification. Let’s start with some definitions.

Regression: the prediction of a target value (numerical label) based on individual characteristics


Let’s see a few examples of this.

  1. Predicting the number of COVID-19 cases in a country
  2. Predicting the amount of the loan a small business can get under the Paycheck Protection Program
  3. Quantifying the risk of infection for a retail worker in specific areas
  4. Predicting the number of masks available for healthcare workers

Classification: the task of assigning classes (categorical labels) to data based on shared characteristics


And let’s get a few examples here as well.

  1. Classifying a symptomatic individual as infected with COVID-19 or with the common cold
  2. Determining the severity of infection of an infected individual — asymptomatic, mild, moderate, or severe
  3. Classifying a business as essential or non-essential in California
  4. Classifying a respiratory virus’s spread as an outbreak, epidemic, or pandemic

A Quick Note

In many ways, classification can be considered a discrete version of regression, and regression a continuous version of classification. Here are some quick examples.

Classification problem (1) could become a probability assignment to COVID-19 and not common cold (or vice versa).
Regression problem (3) could become a classification problem, determining if the infection risk is low, moderate, or high.

If you choose to transition from classification to regression, you can usually gain precision (a specific number instead of a broad category) but at the cost of accuracy (predicting “infection risk is 13%” is a lot harder than predicting “infection risk is low”).

This is not always the case. However, when it is, you can adjust your approach accordingly, sometimes with large improvements.

How To Choose Your ML Model


The meat of the article. What you’ve alllll been waiting for. The moment … has arrived.

Here’s how I’ve structured my analysis.

Every model gets 5 parts, each designed to maximize how much you understand about the model in about a page or less. Next to the name of each model is either a bolded R or a bolded C, possibly both if the model can do both regression and classification.

  1. How the model works
    a. This is a relatively brief, quick intro that avoids too much math and focuses more on intuitive ideas. It’s anything interesting or relevant to improving your understanding of what’s happening behind the scenes, so you better understand its strengths and weaknesses.
  2. Pros
    a. Pretty straightforward — I’ll include anything that might be helpful in understanding where the model performs well. Also, in every model, I’ll put the bias and variance of the model here or below.
  3. Cons
    a. A shocker after the last one, I know. Pro tip: think about your use case and identify a few things that make your problem uniquely hard or easy, and note down which models are able to benefit from those easy parts and can ignore those hard parts.
  4. Libraries
    a. Any common modules that contain an implementation of this model. A simple import of what I put here (like from sklearn.linear_model import LinearRegression) should be enough to get you rolling with that model.
  5. Relevant Resources
    a. Usually a link to the documentation and some good articles that will help you implement and understand the models better.

With that said, let’s dive in!

Linear Regression [R]

How It Works

With Linear Regression, we’re trying to find a line that cuts through the middle of our data. We determine how “middle” is middle enough by minimizing the sum of the squared errors between the line’s prediction and the actual value, for every point. As a result, Linear Regression is also commonly called the Ordinary Least Squares (OLS) model.
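
Here’s a minimal sketch of what that looks like in scikit-learn (the numbers below are made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: day number vs. cumulative COVID-19 case count
X = np.array([[1], [2], [3], [4], [5]])   # feature: days since the first case
y = np.array([3, 7, 12, 18, 25])          # target: number of cases

model = LinearRegression()
model.fit(X, y)                            # finds the line minimizing the squared errors
print(model.coef_, model.intercept_)       # slope and intercept of the fitted line
print(model.predict([[6]]))                # extrapolate to day 6
```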

Pros

  • Extremely effective for linear relationships and simple modeling
  • Low Variance
  • The fitted line stays very similar even if the training points change a little
  • Super easy and fast to train
  • Flexible model
  • Take the log of an exponential relationship to get a linear relationship
  • Take the square root of a quadratic relationship to get a linear relationship
  • Can generalize to data outside the range of the training data

Cons

  • High Bias
  • Very few relationships can be modeled in a linear fashion
  • Very sensitive to outliers
  • Because OLS minimizes the squared error, a few far-off points grow the error heavily and thus the model adjusts drastically to compensate
  • Not effective for non-polynomial relationships
  • Can’t capture interactions between variables unless you add interaction terms yourself
  • Very Simple

Libraries

sklearn.linear_model.LinearRegression

Relevant Resources

Documentation: sklearn.linear_model.LinearRegression — scikit-learn 0.23.1 documentation

Articles: A Beginner’s Guide to Linear Regression in Python with Scikit-Learn

Polynomial Regression [R]


How It Works

With Polynomial Regression, we’re doing something similar to Linear Regression. We’re still trying to find the curve that minimizes the squared error, except now it’s a polynomial whose degree you can set arbitrarily, rather than a straight line.
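
In scikit-learn, this is usually done by expanding the features and then running ordinary Linear Regression on them. A minimal sketch, with made-up numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1], [2], [3], [4], [5]])   # hypothetical single feature
y = np.array([2, 9, 20, 35, 54])          # roughly quadratic target

# degree=2 adds x^2 (and a bias term) as extra features before the linear fit
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
print(poly_model.predict([[6]]))
```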

Pros

  • Allows you to more easily understand polynomial relationships between your input and target data
  • Low Variance (assuming low maximum degree)
  • As long as the maximum degree of the polynomial you set is below ~3, you can expect consistent performance and outputs for not-too-small datasets
  • Easy to train
  • Generalizes to data outside the domain of the input data

Cons

  • Moderate-to-High Bias
  • For non-polynomial relationships, it’s still difficult for this model to internalize the nuances of the relationship
  • Very easy to overfit
  • Setting the maximum degree to more than 3 means you have at least 3 times as many features to train
  • Sensitive to outliers
  • Because points far from the prediction are penalized heavily, this picks up on outliers and considerably alters itself to compensate

Libraries

sklearn.linear_model.LinearRegression & sklearn.preprocessing.PolynomialFeatures

Resources

Documentation: sklearn.linear_model.LinearRegression — scikit-learn 0.23.1 documentation
sklearn.preprocessing.PolynomialFeatures — scikit-learn 0.23.1 documentation

Articles: Python | Implementation of Polynomial Regression
Machine Learning: Polynomial Regression with Python

Logistic [RC]

How It Works

Logistic regression takes in input data belonging to one of two classes and fits a logistic curve to maximize the probability of a correct prediction at any point. Then, it outputs the probability that a data point is in one of the classes (and, because there are only two classes, the probability that it’s in the other class is just one minus that).
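
A minimal sketch with scikit-learn, using made-up symptom data (the features and labels here are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [temperature (F), days of cough]; label: 1 = COVID-19, 0 = common cold
X = np.array([[98.6, 1], [100.4, 3], [99.1, 2], [102.0, 5], [98.7, 0], [101.2, 4]])
y = np.array([0, 1, 0, 1, 0, 1])

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict_proba([[100.8, 3]]))  # [P(cold), P(COVID)] -- set your own threshold on these
```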

Pros

  • Gives probabilities, not just classes
  • This allows you to determine your own threshold depending on your own situation
  • In medicine, a low threshold is used — if there’s even a 10% chance you have COVID, we want to know about that
  • Modest Variance
  • Pretty consistent model, not too affected by outliers
  • Modest Bias
  • Pretty good at being accurate so long as your data is close to linearly separable
  • Will suck if your data overlaps a lot (meaning you have very similar or overlapping inputs but opposite outputs)
  • Can be used for classification easily

Cons

  • Struggles with many features
  • Output space is restricted to input space
  • You can’t predict things you weren’t explicitly prepared for
  • Generally limited regression use-case
  • Weak for non-linearly separable inputs
  • Overall, dependent on a good starting point

Libraries

sklearn.linear_model.LogisticRegression

Resources

Documentation: sklearn.linear_model.LogisticRegression — scikit-learn 0.23.1 documentation

Articles: Building a Logistic Regression in Python
Logistic Regression vs Decision Trees vs SVM: Part II

kNN [RC]


How It Works

A k-Nearest Neighbors algorithm (kNN) uses its training data in an unusual way: it simply stores all of it. It then compares new data to the stored training data and finds the k stored data points most similar to the new data.

Classification: Then, it outputs the most common class of those k points (mode).

Regression: Then, it outputs the average of the values of those k points (mean).
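
A minimal sketch of the classification case with scikit-learn (toy numbers, purely illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical features: [temperature (F), days of cough]; label: 1 = COVID-19, 0 = common cold
X = np.array([[98.6, 1], [100.4, 3], [99.1, 2], [102.0, 5], [98.7, 0], [101.2, 4]])
y = np.array([0, 1, 0, 1, 0, 1])

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3 nearest stored points
knn.fit(X, y)                               # "training" is essentially just storing the data
print(knn.predict([[100.0, 2]]))            # the mode of the 3 nearest labels
```

For regression, sklearn.neighbors.KNeighborsRegressor works the same way but returns the mean of the k neighbors’ values.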

Pros

  • Almost 0 training time
  • Easy to interpret
  • Very insensitive to outliers
  • If you’re far away, you can’t be one of my nearest neighbors!
  • Modest Bias
  • Moderate Variance
  • No assumptions about the data
  • Great for nonlinear/non-polynomial data!
  • Works for both regression and classification problems

Cons

  • Impractical to scale because of its massive prediction (testing) time
  • You compare new data to every training data point
  • Large memory requirement
  • You need to store all the training data, not just some final computed values
  • Sensitive to noisy data

Libraries

sklearn.neighbors.KNeighborsClassifier
sklearn.neighbors.KNeighborsRegressor
sklearn.neighbors.NearestNeighbors

Resources

Documentation: sklearn.neighbors.NearestNeighbors — scikit-learn 0.23.1 documentation

Articles: The k-Nearest Neighbors Algorithm In Python

Support Vector Machines [RC]

How It Works

Support Vector Machines (SVM) use what’s called the “kernel trick” — here’s a quick recap:

  • The model takes your data and, through some clever linear algebra, implicitly transforms it into a space with more features/dimensions, where the data becomes much easier to separate.
  • For regression, the SVM fits a function in this transformed space and outputs a value.
  • For classification, it finds the boundary with the widest margin between the classes and outputs a class.
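
A minimal sketch of an SVM classifier with an RBF kernel in scikit-learn (the toy points below are arbitrary):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [1, 0], [0, 1], [2, 2], [2, 3]])
y = np.array([0, 0, 0, 0, 1, 1])

svm = SVC(kernel="rbf", C=1.0)   # the kernel implicitly maps the data to a higher-dimensional space
svm.fit(X, y)
print(svm.predict([[1.5, 2.0]]))
```

For regression, sklearn.svm.SVR offers the same idea with a continuous output.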

Pros

  • Works great for classification
  • Works for regression
  • Modest Bias
  • Modest Variance
  • Works for nonlinear data

Cons

  • Difficult to interpret
  • Does not scale well
  • Transforming data is an expensive procedure
  • Moderately sensitive to noisy data

Libraries

libsvm
sklearn.svm.SVC
sklearn.svm.NuSVC
sklearn.svm.LinearSVC

Resources

Documentation: sklearn.svm.SVC — scikit-learn 0.23.1 documentation
sklearn.svm.NuSVC — scikit-learn 0.23.1 documentation
sklearn.svm.LinearSVC — scikit-learn 0.23.1 documentation
Support Vector Machine (LibSVM)

Articles: Support Vector Machines — Introduction to Machine Learning Algorithms

Decision Tree [RC]

How It Works

A Decision Tree trains iteratively:

  • First, it finds the feature and value that most easily “splits” the data into two parts.
  • If a data point meets the criteria (say, coughing == True or temperature > 100), the Decision Tree decides to put that point in one subset; if not, it decides to put the point in the other one.
  • It repeats this finding and deciding process on each subset, until no more splits are possible (usually when there’s only one object left in each subset).

For classification, it finds the feature and value that lead to the best separation of classes.

For regression, it instead finds the feature and value that result in the lowest standard deviation (spread of target values) within each subset.

When testing, it simply follows its decision rules, based on those chosen features and values, and determines what to do with a data point.
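
A minimal sketch of a classification tree in scikit-learn, with the learned rules printed so you can see the interpretability for yourself (the symptom data is made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [coughing (0/1), temperature (F)]; label: 1 = infected
X = np.array([[1, 101.5], [0, 98.6], [1, 99.0], [0, 100.8], [1, 102.3], [0, 98.2]])
y = np.array([1, 0, 0, 1, 1, 0])

tree = DecisionTreeClassifier(max_depth=2)   # limiting depth is one simple form of pruning
tree.fit(X, y)
print(export_text(tree, feature_names=["coughing", "temperature"]))
print(tree.predict([[1, 100.2]]))
```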

Pros

  • Easy to interpret
  • You can easily tell exactly how the algorithm is making its decisions
  • Works for both regression and classification
  • Great starting point for classification tasks
  • Great for non-linear features

Cons

  • Low Bias
  • Will memorize the training set entirely
  • High Variance
  • Changing a few points can drastically change the decision boundaries
  • Very prone to overfitting
  • Needs significant pruning and regularization in order to achieve great results
  • No confidence scores to understand how sure a prediction is

Libraries

sklearn.tree.DecisionTreeClassifier
sklearn.tree.DecisionTreeRegressor

Resources

Documentation: sklearn.tree.DecisionTreeClassifier — scikit-learn 0.23.1 documentation
sklearn.tree.DecisionTreeRegressor — scikit-learn 0.23.1 documentation

Articles: How Decision Tree Algorithm works
Logistic Regression vs Decision Trees vs SVM: Part II

Random Forest [RC]

How It Works

A RandomForest creates a bunch of DecisionTrees, each trained on a random sample of the data (and a random subset of the features), and outputs the most common (classification) or average (regression) prediction of all its member DecisionTrees. The naming is yet another shocker, I know.
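
A minimal sketch in scikit-learn, reusing the made-up symptom data from the DecisionTree example:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features: [coughing (0/1), temperature (F)]; label: 1 = infected
X = np.array([[1, 101.5], [0, 98.6], [1, 99.0], [0, 100.8], [1, 102.3], [0, 98.2]])
y = np.array([1, 0, 0, 1, 1, 0])

forest = RandomForestClassifier(n_estimators=100)   # 100 trees, each on a bootstrap sample of the data
forest.fit(X, y)
print(forest.predict([[1, 100.2]]))                 # the most common prediction across the trees
```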

Pros

  • Modest Bias
  • Moderate Variance
  • By averaging over a bunch of different DecisionTrees, it smooths out the high variance of any single tree
  • Much more effective than a DecisionTree
  • Great for classification, good for regression
  • Great for non-linear data

Cons

  • Hard to visualize and understand
  • It doesn’t get much more accurate with more data
  • Limited generalizability for data outside the input space
  • Moderate training time

Libraries

sklearn.ensemble.RandomForestRegressor

sklearn.ensemble.RandomForestClassifier

Resources

Documentation: sklearn.ensemble.RandomForestRegressor — scikit-learn 0.23.1 documentation
sklearn.ensemble.RandomForestClassifier — scikit-learn 0.23.1 documentation

Articles: Understanding Random Forest Models
What Are The Advantages And Disadvantages For A Random Forest Algorithm?

How To Fine-Tune Your Models


Regularization

Regularization is a technique for avoiding overfitting and helping your model generalize effectively. If you’re using regularization, you’re changing the way you evaluate your model. Normally, you simply evaluate a regression model by how close it is to the answer, and a classification model by whether it predicts the right class or not (with more complexity than I’m mentioning here). With regularization, however, you add a penalty to your model for its weights; in other words, you evaluate the model both by how well it performs and how complex it is.

A complex model is likely to have larger weights and more weights; a simpler model has smaller weights and fewer weights. By adding regularization, you “encourage” your model to be simpler and focus on both getting it right and not memorizing the training set. By not memorizing the training set, the model is now able to pick out the trends more effectively and generalize better to new data.

There are two common regularization loss functions, L1 loss and L2 loss. Let’s start with L1 loss.

L1 Loss

L1 loss adds a penalty for the sum of the absolute values of the weights of your model. It’s closely related to Mean Absolute Error (MAE), which applies the same absolute-value penalty to prediction errors rather than weights. As a result, a weight that’s large is “punished” more than a weight that’s small; however, a small reduction in a small weight has the same effect on the loss function as a small reduction in a large weight. That’s why L1 loss often results in a few large weights and a bunch of zero weights. Dropping those smaller weights gives a noticeable reduction in loss without a significant change in accuracy, so the model begins to prefer the large weights that make a big difference in the accuracy of its predictions. Another implication of using L1 loss is that you get a little more speed out of your model during inference. Multiplying by 0 is pretty easy and fast compared to multiplying by almost anything else, so your model spends less time going through computations, especially as you increase the complexity (as measured by the number of weights!) of your model.

L2 Loss

L2 loss adds a penalty for the sum of the squares of the weights of your model. It’s closely related to Mean Squared Error (MSE), a term you’re probably familiar with, which applies the same squared penalty to prediction errors rather than weights. With L2 loss, a large weight is punished much, much more than a smaller weight. Reducing a small weight by a small amount reduces the loss by a very small amount; reducing a large weight by a small amount reduces the loss by a large amount. Thus, L2 loss doesn’t penalize a bunch of small weights as much as it punishes even one large weight. This means L2 loss often encourages your model to have many small weights and very few large weights; it doesn’t care about those small weights nearly as much as it cares about the larger ones.

These are not the only ones; there are many others, like Huber Loss and Log-Cosh. Take a look at this great article on loss functions for more information.

L1 Loss: Retains only the most important weights

L2 Loss: Reduces all weights
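
As a quick illustration, scikit-learn’s Lasso and Ridge models are linear regressions with L1 and L2 penalties on the weights, respectively. A minimal sketch on synthetic data where only two of ten features actually matter:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))              # 10 features; only the first 2 influence the target
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)          # L1 penalty: most weights driven to exactly 0
ridge = Ridge(alpha=0.1).fit(X, y)          # L2 penalty: all weights shrunk, few exact zeros
print(lasso.coef_)
print(ridge.coef_)
```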

Cross-Validation

You’re always ensuring your model performs well on your validation set, but what if your validation set is a little biased? Or worse, your training set is biased? To counter these possibilities, you can use cross validation.

Let’s run through a quick example of how to do cross validation:

  • Split your data into 3 subsets.
  • Train on the first two, run validation on the last.
  • Then, train on the first and last, validate on the middle.
  • Then, train on the last two, validate on the first.

What you’ve just done is 3-fold cross validation. You can increase from 3 to any integer k, which is how we get k-fold cross validation.

Here’s the essence of what happens when you’re doing cross validation. You create a few models that are pretty similar to your final model, and you evaluate them on different subsets of your data. By looking at how these models perform on the different validation sets, you get a good sense of how your overall model will perform on your test set, and more broadly how well this kind of model works for your problem.

This technique is especially powerful when data is scarce. It’s almost like pretending you have more data by reusing it in slightly different contexts, and getting a sense of how you’re doing before you dive in. Because every point gets used for validation at some stage, the performance estimate you end up with is both less biased and more stable, so you get a better sense of how your final model will perform.

However, it does slow your process down, since you have k times as many training and validation sets to run through.

An easy way to do cross-validation in Python: sklearn.model_selection.cross_val_score
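
A minimal sketch of that helper in action, on synthetic data (the model and fold count are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 4))                       # hypothetical features
y = (X[:, 0] + X[:, 1] > 0).astype(int)            # hypothetical binary labels

scores = cross_val_score(LogisticRegression(), X, y, cv=3)   # 3-fold CV, as in the example above
print(scores)          # one accuracy score per fold
print(scores.mean())   # a more reliable estimate of out-of-sample performance
```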

Ensembling

Ensembling is a creative technique that I love exploring. When you’re ensembling, you create, say, 10 different, relatively simple models on subsets of the data. Then, you put them together: use some way of aggregating their predictions and call that your final answer. This could be taking the mean or median, or for classification maybe it’s the mode. What you’re doing is letting their combined predictions cancel out the individual models’ mistakes, cutting down variance (and often bias) and thus making the ensemble more accurate. (Note that you can ensemble with any number of models, not just 10!)

You might recognize this concept. See if you can find which model you read about earlier utilizes this technique. The answer is at the bottom of this section.

There are some drawbacks to using an ensemble model, however. Since you have more models, you’re going to need more time to both train and evaluate your model. Plus, it’s harder to get an understanding of what’s going wrong if something is going wrong, as you’re looking at a bunch of different models at once instead of one.
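
One simple way to ensemble by majority vote is scikit-learn’s VotingClassifier; a minimal sketch with three arbitrary member models:

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))                # hypothetical features
y = (X[:, 0] - X[:, 2] > 0).astype(int)     # hypothetical binary labels

ensemble = VotingClassifier(estimators=[
    ("logreg", LogisticRegression()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("tree", DecisionTreeClassifier(max_depth=3)),
])                                          # default voting="hard": take the mode of the three predictions
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```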

More Data!!

For almost every ML model, the more practice it gets, the better it learns. So if you can get more data, your model is more capable of understanding the relationships you’re looking for it to identify.

Fitting a million parameter neural network to a million data points is akin to fitting a curve to every single point — the real learning just isn’t happening. Instead, fit a model of 10k parameters to 10 million data points and allow it to really learn the patterns and relationships that matter!

As usual, adding more data means more time spent training. Data collection may also be difficult, expensive, or otherwise resource-intensive, so this is often the hardest thing to do.

Hyperparameter Optimization


Hyperparameter optimization only applies to models that have settings you choose yourself. The number of DecisionTrees in a RandomForest; the k in k-NearestNeighbors; the learning rate in neural networks. These are all values that you have to set manually (the model doesn’t learn them from data), and they’re thus called hyperparameters. Finding the value of each hyperparameter that gets you the best result can make a large positive impact on your model.

Some common ways to do this:

  • Test models with a set of parameters on small subsets of your data
  • Run full-on k-fold cross validation on a series of similar models, only changing the hyperparameters each time.

By doing these, you’re able to get a sense of what hyperparameters would work best for your task.
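
The second approach is exactly what scikit-learn’s GridSearchCV does. A minimal sketch, with an arbitrary model and parameter grid chosen for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))               # hypothetical features
y = (X[:, 0] * X[:, 1] > 0).astype(int)     # hypothetical nonlinear labels

param_grid = {"n_estimators": [50, 100], "max_depth": [2, 4, None]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)   # k-fold CV for every combination
search.fit(X, y)
print(search.best_params_, search.best_score_)
```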

This again takes more time to complete, especially if you have a lot of data or a lot of hyperparameters to test. To mitigate both of these issues, there are some clever approaches people have developed, but those are outside the scope of this article.

Ensembling Answer: RandomForests! They take a bunch of DecisionTrees and output something based on each of the trees’ predictions.

Quick Recap

Here are the key concepts we covered:

Choosing Your ML Model

  • Linear Regression [R]
  • Polynomial Regression [R]
  • Logistic [RC]
  • k-Nearest Neighbors [RC]
  • Support Vector Machines [RC]
  • DecisionTrees [RC]
  • RandomForests [RC]

Fine-Tuning Techniques

  • Regularization
  • Cross-Validation
  • Ensembling
  • More Data
  • Hyper-Parameter Optimization

Conclusion

Well, this was a long read (and trust me, a much much longer write), but I hope you got something valuable out of this article! Everything has its benefits and drawbacks, and the best choice for you depends on your specific use-case. So when it comes to choosing what model to run or how to best rein it in, keep your problem in mind. The things that make your problem uniquely challenging or simple will define how you use these ideas.

Now, it’s your turn. Go do something great — you’ve got the entire Big Data at Berkeley team cheering you on!

As usual, we love hearing your thoughts and feedback. If you enjoyed this article, please drop a clap below! (We’re not in a theater, but I promise your claps are heard!) Feel free to comment about any models you want us to discuss in the future, let me know if I missed something important, or just share how this article helped you! (Or just tell me I suck if you hate my writing, the choice is yours.) All feedback is welcome!

Special shoutout to Kendall Kikkawa, Ronak Laddha, and Riya Master for helping with this article!
