Algorithm Selection: Mystery Solved

A guide for choosing the best among a collection of algorithms (Theory & Code).

Published in

The Startup

7 min readJul 20, 2020

Let’s imagine that you have to buy a car. The first thing would be to look at your options; I guess they will be more than one. After thorough thinking, you probably create a list of 5 cars. What is the next step? How would you choose which one to buy? You want to be careful here, you will use it for years. I think we all know the answer… it is time for a test-drive! This is exactly what happens when you have a task in Machine Learning (ML) and a couple of algorithms that could possibly do the job. To choose the best one, you need to take them on a test-drive.

In other words, given a task (classification or regression), a set of algorithms for solving this problem (SVM, Logistic Regression, Neural Networks), and a specific dataset, decide which of the algorithms can be expected to perform best on that dataset.

Model vs Algorithm Selection

For those who are new to the terms, let me help you by briefly presenting those two techniques. Model Selection answers the question:

Which hyper-parameters set gives me the best solution given an algorithm?

We all have gone through the process of Model Selection, which is often also known as Hyper-parameter Optimisation, if not then read this. However, before this optimisation step, you must first select the algorithm. Algorithm Selection answers the question:

Which algorithm gives me the best predictive and computational performance?

😕 I know right? What does that suppose to mean…🤔 That is why we need to read the next section.

The How and Why

The perfect solution cannot be found in books or blogs about ML. Probably you are also familiar with the “No Free Lunch” theorem, which states that there is no one algorithm that works best for every problem. And yes you should expect a marathon of trial & error 😩. However, any conclusion will be based solely on data. This is where it all starts (and ends…)

Most of us do not have access to an abundance of data. It is a luxury to have a large sized sample set; even nowadays. Therefore, the best way to choose among a pool of algorithms is by using Nested Cross-Validation (Nested CV). Before jumping into the details of the suggested approach, let me briefly explain CV as a process.
k-fold CV is when you partition your dataset into k folds, run the analysis on each fold, and then average the overall error estimate. You could read more about this here.

Why do we need CV and not Hold-out set?

Imagine that you have performed hyper-parameter optimisation using a train / test set and you have now a trained model. But, how does your selected algorithm compare to other different algorithms? Is your neural network performing better than, let’s say, a random forest trained with the same combination of train / test set? You cannot compare based on the validation set, because that validation set was part of the fitting of your model. You used it to select the hyper-parameter values. But I am sure you are all thinking about the test set. Why not use this to evaluate which algorithm is the best one, right? Nope!

In principle we should not touch our test set until after we have chosen all our hyper-parameters and algorithm. Were we to use the test data in the algorithm selection process, there is a risk that we might overfit the test data. Then we would be in serious trouble. If we overfit our training data, there is always the evaluation on test data to keep us honest. But if we overfit the test data, how would we ever know? In other words, if you use the test set multiple times, you “leak” information and end up with an optimistically biased result.
Note: One possible solution would be Multiple Split Tests; split multiple times the data set into train/val/test sets and calculate the average error across the test sets. However, there is one issue with this method. There is a possibility that some samples were never included in the training or testing set.

Thus, we should never rely on the test data for algorithm selection.

Why do we need Nested CV instead of CV?

Because we are performing two procedures:

Optimise the hyper-parameters
Estimate an unbiased generalisation performance

By selecting the best hyper-parameter set and estimating the generalisation performance based on the average k-fold performance, we introduce a bias into the procedure.

Nested CV for Algorithm Selection

According to research: "A nested CV procedure provides an almost unbiased estimate of the true error". This is how it looks like:

A step-by-step guide for each of the N classifiers you are interested in:

Define a parameter grid of the hyper-parameters
Perform a separate nested CV. Note: The selection of hyper-parameters in the inner loop will vary from fold to fold. Therefore, this is not a way to select any hyper-parameters.
Use the average score of the outer loops to get an unbiased estimate of the classifier under consideration.

After performing nested CV for finding the best algorithm across a collection of possible algorithms for your problem, you calculate the generalisation error using the Outer CV, which is just the average of the scores on the test sets of the Outer CV. So, each of the test folds is an Algorithm X but with different hyper-parameters — that is why the generalisation error is actually a pretty good estimate of the true error because the algorithm is trained on different data each fold AND with different hyper-parameters!

tip: additionally to the avg of the Outer CV, also have a look at the std. If it is high then this means instability and thus, you cannot trust it.

Enough Talk. Show me the Code.

For this purpose I use the data from Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media by Chakraborty et al (2016) and their accompanying Github repo. It contains ~30,000 article titles that have been labeled as clickbait and Non-clickbait (almost balanced).

from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV, train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import accuracy_scoreimport numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords

As a text preparation step, we convert text to lowercase and remove punctuation, special characters and stopwords.

WORD = re.compile(r'\w+')
STOPWORDS = set(stopwords.words('english'))
STOPWORDS = [s for s in STOPWORDS if s not in ('not', 'no')]def prepare_text(text: str):
    """
    Text preparation: lowercase, remove of punctuation, special characters and stopwords.
    Args:
        text: input sample from data set
        
    Returns: 
        modified initial text
    """
    text = text.lower()
    text = re.sub(r'[/(){}\[\]\|@,;]', r' ', text)
    text = re.sub(r'[^0-9a-z #+_]', r'', text)
    text = re.sub(' +', ' ', text) # remove extra spaces
    text = ' '.join(word for word in WORD.findall(text) if word not in STOPWORDS) # remove stopwords
    return text

We apply the function to the corresponding input text column of the dataset and split the dataset into train / test set. At this point, it is important to use stratification when splitting the data. This stems from the assumption that both the training and test data are drawn independently from identical distributions (commonly called the i.i.d. assumption).

data['prep_text'] = data.title.map(prepare_text)X = data.prep_text.tolist()lb = LabelBinarizer()
y = lb.fit_transform(data.label)X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=.2,
                                                    random_state=42,
                                                    stratify=y)

Next, we would need to include a hyper-parameter grid for each of the algorithms. The grids do not have to be the ones that you would use for Model Selection, but way smaller. Remember that we just want to test our algorithms on different hyper-parameters and data; rather than finding the optimum hyper-parameter set. We also have to create pipelines for each algorithm we wish to test.

# Pipelines
pipe1 = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('SVC', SVC(random_state=42))])pipe2 = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('RF', RandomForestClassifier(random_state=42))])pipe3 = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('LR', LogisticRegression(penalty='l2', random_state=42))])# Setting parameter gridsparam_grid1 = {'SVC__kernel': ['rbf'], 'SVC__gamma': [1e-3, 1e-4], 'SVC__C': [1, 10, 100, 1000]}param_grid2 = {'RF__n_estimators': range(200, 1200, 200), 'RF__max_depth': np.linspace(1, 32, 32, endpoint=True)}param_grid3 = {'LR__C': [0.001,0.01,0.1,1,10,100]}

Now, it is time for 5x2 Nested CV:

# INNER
rcvs = {}
inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)for pgrid, pipe, algo in zip((param_grid1, param_grid2, param_grid3),
                            (pipe1, pipe2, pipe3),
                            ('SVC', 'RF', 'LR')):
    rcv = RandomizedSearchCV(estimator=pipe,
                             param_distributions=pgrid,
                             scoring='accuracy',
                             n_iter=6,
                             n_jobs=3,
                             cv=inner_cv,
                             verbose=0,
                             refit=True)
    rcvs[algo] = rcv# OUTER
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)for algo, rcv in sorted(rcvs.items()):
    score = cross_val_score(rcv, 
                            X=X_train, 
                            y=y_train, 
                            cv=outer_cv,
                            n_jobs=1)
    print('%s | outer ACC %.2f%% +/- %.2f' % 
          (algo, score.mean() * 100, score.std() * 100))

Results of Nested CV

🎉Hooray🎉 We can now conclude that Logistic Regression is the best choice among the algorithms we’ve tested. The next step would be Model Selection on that algorithm, which is out of the scope of this article.

Conclusion

Machine Learning almost always starts experimentally. You have to be resilient and meticulous all the way towards your ultimate goal. Overall, it is difficult to make a decision, but always remember that data has your back 😉. With a little bit of help from computational statistics you can get the unbiased estimate that you were after.

Thank you for reading ☺️