Origin of wine part 5

Nelson Punch
Software-Dev-Explore
2 min read · Nov 2, 2023
Photo by Sven Wilhelm on Unsplash

Introduction

There are many models in Scikit-Learn, but which one is the best for the classification problem in this case?

Guided by the Scikit-Learn map, there are a few models I can use for this classification problem. Instead of trying each model one after another and then comparing their performance, I can adopt a model selection technique using Cross-Validation from Scikit-Learn.

Code

Notebook with code

Cross-Validation

The goal here is to evaluate each potential model’s cross-validation score and find out which model stands out among them.

The first thing I need to do is collect the potential models that I am going to evaluate, as sketched below.
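A minimal sketch of that collection, assuming the wine data is loaded with sklearn.datasets.load_wine and picking a few candidates suggested by the Scikit-Learn map; the exact models and loading code in the notebook may differ:

```python
from sklearn.datasets import load_wine
from sklearn.svm import LinearSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the wine data used throughout this series (assumption)
X, y = load_wine(return_X_y=True)

# Candidate classifiers to evaluate (an assumed selection from the map)
models = {
    "LinearSVC": LinearSVC(max_iter=10000),
    "SVC": SVC(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
}
```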

Next comes cross-validation. Finally, I present the results in a human-readable form.

Cross-Validation splits both X (the input data) and y (the labels) into a number of folds. The cv parameter of cross_val_score indicates the number of folds.

After the data is split into folds, the model is trained and evaluated once per fold, each time holding out a different fold for scoring. The returned scores are therefore an array with one score per fold; with cv=5, I get back an array of 5 scores.
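As a minimal sketch, reusing the assumed models dictionary from above (LinearSVC may emit convergence warnings on unscaled data, which does not affect the idea):

```python
from sklearn.model_selection import cross_val_score

# cv=5 splits X and y into 5 folds and returns one accuracy score per fold
scores = cross_val_score(models["LinearSVC"], X, y, cv=5)
print(scores)        # array of 5 scores
print(scores.shape)  # (5,)
```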

Following this idea, I can calculate the average score from the list of scores using NumPy’s mean.
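For example, with the scores array from the sketch above:

```python
import numpy as np

mean_score = np.mean(scores)  # average score across the 5 folds
print(mean_score)
```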

A DataFrame from Pandas helps me present the results in a readable form.
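A sketch of how such a table could be built; the column names and sorting here are my own choices and may not match the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score

# Mean cross-validation score for every candidate model
results = {
    name: np.mean(cross_val_score(model, X, y, cv=5))
    for name, model in models.items()
}

# Tabulate and sort so the strongest model sits at the top
results_df = pd.DataFrame(
    {"model": list(results.keys()), "mean_cv_score": list(results.values())}
).sort_values("mean_cv_score", ascending=False)

print(results_df)
```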

It looks like LinearSVC is my go-to model.

According to the Scikit-Learn documentation about cv:

Determines the cross-validation splitting strategy. Possible inputs for cv are:

None, to use the default 5-fold cross validation,

int, to specify the number of folds in a (Stratified)KFold,

CV splitter,

An iterable that generates (train, test) splits as arrays of indices.

I specified an integer for cv, so the splitting strategy is (Stratified)KFold.
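As a sketch of that equivalence, passing an explicit StratifiedKFold splitter should give the same splits as cv=5 for this classifier (the explicit splitter is my own addition, not from the notebook):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

# For a classifier, an integer cv=5 maps to this explicit splitter
skf = StratifiedKFold(n_splits=5)
scores_stratified = cross_val_score(models["LinearSVC"], X, y, cv=skf)
```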

From the Scikit-Learn documentation about StratifiedKFold:

The folds are made by preserving the percentage of samples for each class.

Conclusion

The model selection technique enables me to find the best model for the problem. By using Cross-Validation from Scikit-Learn, I can reduce the complexity of the search for the best model.

Next

Each model involves a number of parameters that can be adjusted to fit the data and produce the best performance. Hyperparameter tuning is a technique for finding the best parameters.

part 6
