Machine Learning: Model Selection and Optimization

Emre Can Yesilyurt
Machine Learning Turkiye

--

I’m going to share some technical information about model selection and optimization, so let’s get started.

First of all, I need to say that machine learning is a pattern-finding task and this task is accomplished with statistical and mathematical methods.

For example, suppose we have data that we need to model and we need to choose the most suitable model for it. What should we do? Should we choose the model with the highest metric score or the one that best fits the distribution?

I should add a caveat here: if there is a model that is specific to the work to be done, or one that must continue to be used, model selection loses its importance.

All data in nature necessarily follow a distribution. Examples of this are the chirping interval of a bird or the frequency with which users click on an ad.

The important thing for us is to choose a model suitable for the distribution and to optimize the model we have chosen.

It is impossible for me to go into the subject of statistics in this article (even if I do, you won’t read it :) )

That’s why I’m attaching a video here that you can get an idea from.

I think you understand that model selection is not an automatic process and often has to be handled instinctively.

The metrics we use when choosing a model should reflect the distribution. In supervised methods, however, we usually divide the data into train and test portions at a fixed ratio. But what if the distribution of the train portion differs from the distribution of the test portion? Can metrics mislead us? The answer is, of course, yes. If you want metrics to look at the data holistically, you need to validate all of the data.

Sure, validating 100 megabytes of data is easy, but how do we validate 100 terabytes? By choosing a sample. There is a big difference between 1000 randomly selected rows of training data and a 1000-row sample that summarizes the data. 1000 rows of training data may not say anything about the distribution of the entire dataset, whereas a correctly selected sample can carry accurate information about that distribution.

From this we understand that when evaluating a model with metrics, we need to look at the train and test portions holistically. If you are working with a sample, the part we consider holistically is the sample itself, since the sample is what we split into train and test.

The metric with which we evaluate the chosen model must give correct scores. K-Fold Cross Validation, a method that works through the train and test sets in parts, will let us look at the data holistically so that the metric can give accurate scores.

What is K-Fold Cross Validation?

As I explained above, K-Fold Cross Validation is a method that validates the entire dataset, or the entire sample selected from it, and ensures that the metrics produce scores that reflect the distribution of all of the data.

So how is this process performed? K-Fold Cross Validation is an iterative algorithm: in each iteration a different fold is held out as the test set, so all of the data eventually passes through both the train and the test sets.

Figure: the steps of K-Fold Cross Validation.
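To make the iteration concrete, here is a minimal sketch (my own illustration, not code from the original article) that prints which indices land in the train and test folds of a toy array, using scikit-learn's KFold splitter.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy observations, just to show how the folds rotate.
data = np.arange(10)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for i, (train_idx, test_idx) in enumerate(kf.split(data), start=1):
    # In every iteration a different fold is held out as the test set.
    print(f"Iteration {i}: train={train_idx}, test={test_idx}")
```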

How do I use K-Fold Cross Validation?

I will use our sacred dataset, the iris dataset, to demonstrate how K-Fold Cross Validation works. I’ll keep this technical part as simple as I can.

The routine part: loading the dataset and separating the train and test sets.
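A minimal sketch of that routine part with scikit-learn; the variable names and the test ratio are my own choices and may differ from the originals.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset and split it into train and test sets.
x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=0
)
```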

Now let’s create the model.
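The original code is not reproduced here, and the comparison further below involves two models, so a logistic regression and a KNN classifier stand in as illustrative choices.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Two candidate classifiers; the concrete choice here is illustrative.
log_reg = LogisticRegression(max_iter=200)
knn = KNeighborsClassifier(n_neighbors=5)
```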

We are ready to use K-Fold Cross Validation.

You need to import as follows.
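Assuming the scikit-learn helper in question is cross_val_score, which matches the estimator, X/y, and cv arguments described below, the import looks like this:

```python
from sklearn.model_selection import cross_val_score
```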

Let’s create an object; the arguments are listed below, followed by a sketch of the call.

estimator: the model to be tested.
X and y: the data (features and labels) that will be split into folds.
cv: the number of folds, i.e. how many iterations to perform.
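Putting those arguments together, a sketch of the call (continuing from the snippets above; the full x and y are passed so the folds cover all of the data, and cv=10 is an arbitrary choice):

```python
# Cross-validated scores for one model; the full x and y are passed
# so every fold draws from all of the data.
scores = cross_val_score(estimator=knn, X=x, y=y, cv=10)
```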

Now we can get the accuracy and standard deviation values from the scores object we created, using the .mean() and .std() methods.

Now let’s do this for both models.
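Continuing the sketch for both illustrative models:

```python
for name, model in [("Logistic Regression", log_reg), ("KNN", knn)]:
    scores = cross_val_score(estimator=model, X=x, y=y, cv=10)
    # Mean accuracy and its standard deviation across the 10 folds.
    print(f"{name}: accuracy={scores.mean():.3f}, std={scores.std():.3f}")
```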

Let’s take a look at the outputs.

Please note that these values apply to the entire data, not just the train set or just the test set. Also, although I only took the mean and standard deviation, you can compute any other statistic you like from the scores here.

What is model optimization?

When we create a model, we cannot assume that its predefined parameters are the right parameters for our data. Let’s keep it simple: how can we find out whether the K value we chose for a KNN algorithm is the one with the highest metric score? Of course, you don’t need to try the values manually one by one. Grid Search will do this for you and evaluate each option with the metric you specify.

Let’s start by importing the grid search.
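Assuming scikit-learn again, the class is GridSearchCV:

```python
from sklearn.model_selection import GridSearchCV
```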

Then we define the parameters that grid search will compare as a list of dictionaries.
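A sketch of such a list of dictionaries for the illustrative KNN model above; the concrete values are arbitrary:

```python
# Each dictionary describes one grid of parameter combinations to try;
# these KNN values are placeholders for whatever you want to compare.
param_grid = [
    {"n_neighbors": [3, 5, 7, 9, 11], "weights": ["uniform", "distance"]},
]
```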

We can now define the grid search object; its main arguments are listed below, followed by a sketch.

estimator: the model to be evaluated.
param_grid: the list of parameter dictionaries.
scoring: the success metric.
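Putting those arguments together (continuing from the snippets above; cv and the accuracy metric are my own choices):

```python
gs = GridSearchCV(
    estimator=knn,          # the model to be evaluated
    param_grid=param_grid,  # the parameter combinations to try
    scoring="accuracy",     # the success metric
    cv=10,
)
```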

After the definitions, we need to train the grid search object with our data.
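Continuing the sketch; fitting on the training portion here is one possible choice:

```python
# Grid search runs cross validation for every parameter combination.
gs.fit(x_train, y_train)
```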

Let’s assign the best score and the best parameters to variables.
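With GridSearchCV these values are exposed as the best_score_ and best_params_ attributes:

```python
best_score = gs.best_score_    # highest mean cross-validated accuracy
best_params = gs.best_params_  # the parameter combination that achieved it
print(best_score, best_params)
```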

Here are the outputs

Among the parameters we defined, we can now see the combination that yields the highest accuracy.

Is the highest accuracy the best result?

I can sum this up by saying that memorizing and learning are two different things. If the model you created achieved its high accuracy by memorizing, it will likely return incorrect results for new data. You can research this situation under “overfitting in machine learning algorithms”. Maybe one day I’ll write an article on this topic.

Thank you for reading, and you can reach me from the links below.
