Approach to selecting the best algorithm for classification problems — A case study!

Oct 30, 2018

One question that comes up a lot is which algorithm should be used for a particular problem. The short answer is: it depends.

Scikit-learn has a good article that illustrates the different kinds of algorithms to consider for problems such as regression, classification, clustering and dimensionality reduction. In some cases there are established industry practices, such as using Convolutional Neural Networks (CNNs) for image recognition problems. However, selecting the best algorithm for a particular problem is more an art than a science, and there is no straightforward answer. So in this article, let’s look at an approach to selecting the best algorithm for a classification problem.

Selecting the right algorithm for a classification problem: a case study

Let’s go through a use case and find out how to select the best algorithm for a classification problem. The business problem that we are going to tackle is as follows: categorize the emails coming into a university system from users and automatically allocate them to the appropriate customer departments based on their contents, leveraging AI and NLP techniques. From the problem statement, we can see that this is a multi-class classification problem.

Here is a pretty picture explaining the problem:

To simulate this problem, the unstructured data from the 20 Newsgroups data set (http://qwone.com/~jason/20Newsgroups/) was used.

Steps to select the right classification algorithm

The following flow explains how the emails are extracted from the system, how the data is cleansed, tokenized and scored, and how an appropriate classification algorithm is used to classify the emails. You can choose to apply dimensionality reduction techniques and regularization depending on the accuracy of the model and the algorithms that you are using.

**Approach to NLP and selecting the right classification algorithm**
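To make the cleanse, tokenize and score steps concrete, here is a minimal sketch in plain Python. It assumes that "scored" means a smoothed TF-IDF weighting, which the flow above does not state explicitly. The stop-word list, the sample emails and the function names are all hypothetical, and the classification step itself is left out.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"the", "a", "an", "to", "my", "is", "for", "i", "of", "and"}

def tokenize(text):
    """Cleanse and tokenize: lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Score: return one {term: weight} dict per document (smoothed TF-IDF)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

# Made-up emails standing in for the real inbox.
emails = [
    "I forgot my password for the student portal",
    "Question about tuition fees and the payment deadline",
    "The portal login page is not loading",
]
vectors = tfidf_vectors(emails)
```

Terms that appear in many emails (such as "portal" here) receive a lower weight than terms that appear in only one, which is what lets a classifier focus on the distinctive words of each department.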

Measuring the efficacy of different classification models

The different models were evaluated using accuracy, precision, recall, F1-score, the confusion matrix and the time taken to execute the model. The best model was selected based on a combination of these factors. The diagram below shows some of these metrics.

**Measuring different classification algorithms**
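As a rough illustration of how these metrics relate to each other, the sketch below derives accuracy and per-class precision, recall and F1 from the confusion counts, using only the standard library. The labels and predictions are made up for the example and are not results from the case study.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true_label, predicted_label) pair occurs."""
    return Counter(zip(y_true, y_pred))

def per_class_metrics(y_true, y_pred):
    """Return overall accuracy and {label: (precision, recall, f1)}."""
    counts = confusion_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(counts[(label, label)] for label in labels) / len(y_true)
    report = {}
    for label in labels:
        tp = counts[(label, label)]
        predicted = sum(counts[(other, label)] for other in labels)  # column total
        actual = sum(counts[(label, other)] for other in labels)     # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = (precision, recall, f1)
    return accuracy, report

# Made-up labels: one "comp" email is wrongly predicted as "religion".
y_true = ["comp", "comp", "comp", "religion", "religion", "sport"]
y_pred = ["comp", "comp", "religion", "religion", "religion", "sport"]
accuracy, report = per_class_metrics(y_true, y_pred)
```

In this toy data the single misclassification lowers recall for "comp" and precision for "religion", which is the kind of detail that overall accuracy alone hides.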

Note: Not all algorithms were considered in this evaluation (e.g. Neural Networks and Logistic Regression were left out). This evaluation is representative, and you can add as many classification algorithms as you like.

Linear SVC was the clear winner, followed by K Nearest Neighbor and Naïve Bayes. This was expected, as linear SVMs tend to perform well on high-dimensional, sparse text data. Since the data was not massive, the execution time was not a major factor. The bar graph below illustrates the accuracy and time taken for each model.

**Comparison of different models — Accuracy and Time Taken**
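A comparison like this can be produced with a small harness that trains each candidate, times it and records its accuracy. The sketch below is only an illustration: the two "models" are toy stand-ins (a majority-class baseline and a word-overlap nearest neighbor), not the classifiers evaluated in this article, and the sample data and function names are hypothetical.

```python
import time
from collections import Counter

def fit_majority(texts, labels):
    """Baseline: always predict the most common training label."""
    most_common = Counter(labels).most_common(1)[0][0]
    return lambda text: most_common

def fit_nearest_neighbor(texts, labels):
    """Toy 1-nearest-neighbor using word-overlap (Jaccard) similarity."""
    train = [(set(t.lower().split()), y) for t, y in zip(texts, labels)]

    def predict(text):
        words = set(text.lower().split())
        best = max(train, key=lambda item: len(words & item[0]) / len(words | item[0]))
        return best[1]

    return predict

def benchmark(models, train_texts, train_labels, test_texts, test_labels):
    """Return {name: (accuracy, seconds)} for each model factory."""
    results = {}
    for name, fit in models.items():
        start = time.perf_counter()
        predict = fit(train_texts, train_labels)
        predictions = [predict(t) for t in test_texts]
        elapsed = time.perf_counter() - start
        correct = sum(p == y for p, y in zip(predictions, test_labels))
        results[name] = (correct / len(test_labels), elapsed)
    return results

# Made-up training and test data.
train_texts = ["reset my portal password", "portal login fails",
               "wifi password expired", "tuition fee payment due",
               "refund of tuition fee"]
train_labels = ["it", "it", "it", "finance", "finance"]
test_texts = ["cannot login to portal", "tuition payment question"]
test_labels = ["it", "finance"]

models = {"majority": fit_majority, "nearest_neighbor": fit_nearest_neighbor}
results = benchmark(models, train_texts, train_labels, test_texts, test_labels)
```

Including a trivial baseline such as the majority-class model is a useful design choice: any real candidate should beat it comfortably, otherwise its accuracy figure is not telling you much.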

Now let’s also look at the things to consider when selecting an algorithm. This topic is extensively covered in many research papers and publications, but it’s worth reviewing as it is fundamental to selecting the right algorithm. Microsoft, Scikit-learn and dataschool.io have some excellent documents, which I have included in the references below.

Factors to consider when selecting an algorithm

A lot of assumptions go into selecting an algorithm, such as the availability of enough training data to achieve good enough performance, imbalanced data sets leading to the accuracy paradox, issues related to highly correlated features, etc. With these caveats out of the way, here are my simplified top 5 things to consider.

1. Metrics such as accuracy, precision, recall, AUC and F1-score: These metrics are key drivers when deciding on an algorithm. It is also important to display the confusion matrix, identify the data that was misclassified and be able to provide a reasonable rationale for the misclassifications. For example, in my case an email was classified under religion instead of computers, because the training data contained words that confused the model.

2. Training time: If we are looking at massive data sets, training time will be an important factor. In my scenario, Random Forest took 10 times longer than Naïve Bayes.

3. Linearity: Most of the time, linear algorithms are the easiest and fastest, and the preference is to use them if the problem is straightforward. A few examples of linear algorithms are Logistic Regression and Linear SVM.

4. Number of parameters: Parameters are the knobs that data scientists use to fine-tune the models. In the case of Neural Networks, you will have to do a lot of parameter tuning to get to the appropriate model, whereas in linear models the parameter tuning is minimal. Algorithms with more parameters require a lot of trial and error to get to the best outcome.

5. Number of features: If the number of features is high, you may want to look at dimensionality reduction techniques to improve accuracy. Consider removing features (zero-importance features, collinear features, etc.) and using only the features that are relevant. Here is an approach: Feature Selection Tool for Machine Learning in Python.
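To illustrate the feature reduction idea in item 5, here is a minimal sketch that drops constant columns and columns that are nearly collinear with one already kept. It is an assumption-laden stand-in: it uses zero variance instead of model-based feature importance, the 0.95 correlation threshold is arbitrary, and the feature names and values are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def prune_features(columns, threshold=0.95):
    """Keep columns that vary and are not near-duplicates of a kept column.

    columns: {feature_name: list of numeric values}.
    threshold: illustrative cut-off for |correlation| (an assumption).
    """
    kept = []
    for name, values in columns.items():
        if len(set(values)) <= 1:
            continue  # constant column carries no information
        if any(abs(pearson(values, columns[k])) >= threshold for k in kept):
            continue  # collinear with a feature we already kept
        kept.append(name)
    return kept

# Made-up feature table: one value per email for each feature.
columns = {
    "word_count": [10, 20, 30, 40],
    "char_count": [52, 101, 149, 203],  # nearly proportional to word_count
    "has_greeting": [1, 1, 1, 1],       # constant across all emails
    "num_links": [0, 3, 1, 0],
}
kept = prune_features(columns)
```

Because columns are checked in order, the first feature of a correlated pair is the one that survives, so it is worth listing the features you trust most first.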

So let’s summarize what we have done so far:

Reviewed the top 5 things to consider when selecting an algorithm
Reviewed the approach to finding the best algorithm for a classification problem by looking at a use case (a multi-class classification problem)
Reviewed the steps leading up to selecting the best classifier

The code is available on my GitHub. I am looking forward to your feedback.

Email — yasimk@theunio.com

Twitter — yasimk

LinkedIn — https://www.linkedin.com/in/yasim/

References for selecting best classifiers

https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice