Classification Algorithms — what will work?

Shashank Shekhar
9 min read · Apr 10, 2019


If you are working on problems like text categorization (e.g., spam filtering), fraud detection, optical character recognition, machine vision (e.g., face detection), natural-language processing (e.g., spoken-language understanding), market segmentation (e.g., predicting whether a customer will respond to a promotion) or bioinformatics (e.g., classifying proteins according to their function), you are working on an ML classification problem. The obvious question that comes to mind is “which algorithm should I use for the model”, and the answer is “it depends”. It depends on the size, completeness and nature of the data, the machine resources available to you, the time at your disposal, and what you want to do with the answer. That last point stems from newer demands on data scientists, such as augmented analytics (for example, a crop picture that gets geo-coded with attached weather data to share with the community), explainable analytics, or predictions that feed into a workflow.

First and foremost, a data-scientist should become intimately familiar with the data.

A general rule of thumb is that when the data has a very large number of features, linear algorithms may perform better. Andrew Ng gives a nice rule-of-thumb explanation of when to use a linear kernel in an SVM in this video, starting at 14:46, though the whole video is worth watching.

Key Points

  • Use a linear kernel when the number of features is larger than the number of observations.
  • Use a Gaussian kernel when the number of observations is larger than the number of features.
  • If the number of observations is larger than 50,000, speed could be an issue with a Gaussian kernel; hence, one might want to use a linear kernel (the sketch below turns this heuristic into code).
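As a rough illustration, the heuristic above can be expressed in a few lines of scikit-learn. This is only a sketch: the helper name is made up, and the thresholds simply mirror the bullet points.

from sklearn.svm import SVC

def pick_svm_kernel(X_train):
    # Choose a kernel from the shape of the training matrix, per the rule of thumb above
    n_samples, n_features = X_train.shape
    if n_features > n_samples:
        return SVC(kernel='linear')   # more features than observations
    if n_samples > 50000:
        return SVC(kernel='linear')   # Gaussian (RBF) kernel would be slow at this scale
    return SVC(kernel='rbf')          # otherwise the Gaussian (RBF) kernel is a fine default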

But before we delve too deeply into the algorithms themselves, I need to cover two things. One is a continuation from Part 1 of this blog, namely a few other performance metrics that come in handy when comparing algorithmic models; the other is to highlight some of the ways I like to go over the data before even thinking about modelling.

Performance metrics

Logarithmic Loss

Logarithmic loss (or log loss) is a performance metric for evaluating the predicted probabilities of membership to a given class. The scalar probability between 0 and 1 can be seen as a measure of confidence in a prediction made by an algorithm. Correct and incorrect predictions are rewarded or punished in proportion to the confidence of the prediction. The smaller the log loss, the better the model’s performance.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# X and Y are the prepared feature matrix and labels; shuffle so random_state takes effect
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression()
scoring = 'neg_log_loss'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("Logloss: %.3f (%.3f)" % (results.mean(), results.std()))

Data exploration

Structured data

Perform statistical analysis. By the end of this you will know all the columns: their mean and median (the central tendency of the data), the range of each column or feature, whether there is a correlation between two or more columns, which columns have missing data and how you can fill those gaps. DO NOT initialize the missing data to 0 or some such constant. Know what the feature represents and how you could fill the gap. For example, if the feature represents the surface temperature of geocoded locations in a time series, take the trouble of finding out what the temperature of that place could have been in the given week, or in the previous and next weeks around the missing data. If the missing data is 10 values or fewer per feature, always initialize them manually. A good analysis done in the beginning will save you from agony at the end.
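A minimal pandas sketch of that first pass, assuming the data is already in a DataFrame; the file name and the week/surface_temp columns are hypothetical, used only to illustrate filling a gap from neighbouring weeks rather than a constant.

import pandas as pd

df = pd.read_csv('training_data.csv')   # hypothetical file name

print(df.describe())                    # mean, median (50%), range of every numeric column
print(df.corr(numeric_only=True))       # correlation between numeric columns
print(df.isnull().sum())                # which columns have gaps, and how many

# Example: fill a surface-temperature gap from neighbouring weeks rather than a constant
df = df.sort_values('week')
df['surface_temp'] = df['surface_temp'].interpolate(method='linear')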

Visual analysis. A simple box plot on the columns of interest will surface all the outliers. Outliers are like missing data: pay attention to what you want to do with them. Simple histogram and density plots will show you the spread of the data, and scatter plots can be used to identify bivariate relationships. Remember that at this stage your customer will be very responsive about the nature of the data, will validate your assumptions about data initialization, and will answer your questions on bivariate relationships. Do not lose sight of the objective of this phase: “you should become intimately familiar with the data and how it got generated”. Dashboards and beautiful charts can wait for the end of the project.
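A few lines of pandas and matplotlib cover most of this visual pass. This continues with the df from the previous sketch and assumes nine or fewer numeric columns for the subplot layout.

import pandas as pd
import matplotlib.pyplot as plt

# df is the same DataFrame explored in the previous sketch
df.plot(kind='box', subplots=True, layout=(3, 3), sharex=False, sharey=False)   # outliers
plt.show()

df.hist()                        # spread of each feature
plt.show()

pd.plotting.scatter_matrix(df)   # bivariate relationships
plt.show()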

Image data

One of the first steps is to know the statistics of the data itself: how many classes, types or layouts are in the training data, and whether the label is just the class name or a textual explanation. Are the numbers of images balanced across the classes? Write a simple program to randomly show a few images. Look at their similarity in size, texture, orientation, colour and objects across these images. Image analysis can include tasks such as finding shapes, detecting edges, removing noise, counting objects, and measuring region and image properties of an object.
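A minimal sketch of that first pass, assuming (hypothetically) that the images live in one sub-folder per class under a train directory.

import os
import random
import matplotlib.pyplot as plt
from matplotlib.image import imread

data_dir = 'train'   # hypothetical layout: one sub-folder per class

# How many images per class? Are the classes balanced?
for label in sorted(os.listdir(data_dir)):
    print(label, len(os.listdir(os.path.join(data_dir, label))))

# Randomly show a few images; note size, orientation, colour and the objects in them
label = random.choice(os.listdir(data_dir))
for i, fname in enumerate(random.sample(os.listdir(os.path.join(data_dir, label)), 4)):
    img = imread(os.path.join(data_dir, label, fname))
    plt.subplot(1, 4, i + 1)
    plt.imshow(img)
    plt.title(f"{label} {img.shape}")
plt.show()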

This exploration will lead to the identification of pre-processing tasks. For example, you might have to enhance some of the images for poor contrast, enhance the multispectral colour, correct non-uniform illumination of objects in the image or improve low-light exposure. Sometimes you may have to resize the images because the training set has variations. At this stage, remember not to resize blindly: squeezing a 200x100 image into 100x100 will distort the object of interest. In such cases, cropping should be preferred over a blind resize. Some time back I worked on land-cover identification, for which I had to use many of these techniques, and I will publish that blog to walk you through the approach I took in code.
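Here is one way to prefer cropping over a blind resize, sketched with Pillow; the function name and the 100x100 target size are assumptions for the example.

from PIL import Image

def center_crop_then_resize(path, size=100):
    # Crop the central square first, then resize, so the object is not squashed
    img = Image.open(path)
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    return img.crop((left, top, left + side, top + side)).resize((size, size))

# A blind img.resize((100, 100)) on a 200x100 picture would distort the object of interest;
# cropping the central 100x100 region first preserves its aspect ratio.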

Text data

Some of the tasks involved here are mostly mandatory and simple: removing stop words, removing special characters, lowercasing, and possibly removing numbers, dates and the like. Additionally, one might have to stem the words or move to n-grams. There should be some uniformity across the classes being trained for, in terms of the number of training examples and the length of the documents.
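A minimal cleaning sketch along those lines, using scikit-learn’s built-in English stop-word list; stemming or n-grams would be layered on top of this.

import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def clean(doc):
    # Lowercase, strip special characters, numbers and dates, then drop stop words
    doc = re.sub(r'[^a-z\s]', ' ', doc.lower())
    return ' '.join(t for t in doc.split() if t not in ENGLISH_STOP_WORDS)

print(clean("The Q2 report, dated 04/10/2019, shows 12% growth!"))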

Now that the pre-processing task is over, let’s get to the classification algorithms. Microsoft has published a cheat sheet here for algorithm shortlisting, but I will not stick my neck out on this one, as I have seen too many departures from it. Instead, my suggestion is to become aware of the nuances of many of these (classification) algorithms and to create your own shortlist from them. Then spot-check everything you shortlisted on the same data before finalizing one.

Classification Algorithms

This is not an exhaustive list. My attempt is only to mention the algorithms that get used most often for real-life classification problems. That said, you may want to evaluate the left-out bunch, like kNN, CART, Naïve Bayes, bagging (other than Random Forest) and boosting (other than XGBoost), depending on your dataset size and the outcome of its exploration.

Logistic regression

Logistic regression assumes a Gaussian distribution for the numeric input variables and can model binary classification problems. One can add weights to predict three or more classes too. Logistic regression models the natural log of the odds of an event occurring (y=1), often written logR and expressed as log(p/(1-p)). Odds is the probability of an event occurring divided by the probability of it not occurring, and probability is the number of occurrences of an event divided by the total number of events. Hence probability ranges from 0 to 1, odds range from 0 to infinity, and logR ranges from -infinity to +infinity. After ascertaining the logR value of an event, one can deduce its probability as e (the famous irrational number, roughly 2.718) raised to the power of logR, divided by 1 plus e raised to the power of logR.
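In code, recovering the probability from the log-odds is just that formula (the sigmoid); a tiny sketch:

import math

def probability_from_log_odds(log_odds):
    # p = e^logR / (1 + e^logR)
    return math.exp(log_odds) / (1 + math.exp(log_odds))

print(probability_from_log_odds(0))   # 0.5 -- even odds
print(probability_from_log_odds(2))   # ~0.88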

The library does this conversion for you, as we will see in the code at the bottom of this article.

Linear Discriminant Analysis

When the dataset has multiple (three or more) very distinct classes and the size of the training data is limited, logistic regression will yield ordinary classification results. This is where LDA comes in handy. It assumes that the classes are very distinct, that the data follows a Gaussian distribution, and that each attribute has roughly the same variance, i.e. that on average each value deviates from its mean by the same amount. You can read more about it here.
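A minimal scikit-learn sketch of LDA in the multi-class setting described above; the synthetic dataset is only there to make the snippet self-contained.

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Small three-class dataset -- the setting where LDA tends to beat logistic regression
X_demo, y_demo = make_classification(n_samples=150, n_features=5, n_informative=3,
                                     n_classes=3, n_clusters_per_class=1, random_state=7)
lda = LinearDiscriminantAnalysis()
lda.fit(X_demo, y_demo)
print(lda.predict_proba(X_demo[:3]))   # per-class membership probabilities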

Random Forest

Random Forest is a model made up of many decision trees. Its beauty lies not merely in averaging the predictions of the trees for a given unseen input, but in the approach it takes to build the model: one, each tree is trained on a random sample of the training data and two, a random subset of features is considered for each split. Samples of the training dataset are taken with replacement, and the trees are constructed in a way that reduces the correlation between the individual classifiers: rather than greedily choosing the best split point over all features when constructing each tree, only a random subset of features is considered for each split.

To understand these two points better, let’s take a real-life example. Say you want to buy or sell Infosys stock, and to help with the decision you have hired a bunch of analysts. These analysts have no personal investment in Infosys, so they have no bias; all they have to go on is the reports and news about the company, which is the data you have. If you simply distribute the reports randomly to the analysts, the problem you introduce is that any bias built into the reports is passed on to the analysts, influencing their decisions. To overcome that, you not only distribute the reports randomly but also randomly hide portions of them while sharing them with the analysts. The bias built into the data thus gets negated when you average the predictions across all the analysts.
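The two sources of randomness map directly onto scikit-learn’s RandomForestClassifier parameters. The values below are illustrative rather than recommendations, and X_train, y_train and X_test are assumed to come from the earlier preparation steps.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,      # number of "analysts" (trees)
    bootstrap=True,        # one: each tree sees a random sample of rows, drawn with replacement
    max_features='sqrt',   # two: only a random subset of features is considered at each split
    random_state=7)
rf.fit(X_train, y_train)             # X_train, y_train assumed from the earlier preparation
print(rf.predict(X_test[:5]))        # majority vote across the trees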

eXtreme Gradient Boosting

There are times when Random Forest (RF) is not suitable for classifying a dataset. My experience shows that, in general, if you have to rank the data then RF is not the answer. The other case is where a few features in your dataset are much more important than the rest; again, RF is not the right choice. Gradient boosting, or GBM, works only with numeric data. It is an ensemble of trees. It starts by selecting a partial set of the input data and its labels, and builds a baseline rule to classify them in the first pass. It then comes back and picks another set of data, the one on which the previous rule made the largest number of misclassifications. From that partial dataset it forms another rule that complements the previous one, and it keeps building trees with each iteration on the remaining errors. Because of the number of parameters to tune, the algorithm is hard to train, though in roughly 70% of cases the defaults are good enough. It is also more susceptible to overfitting than RF. However, once trained properly, it returns very accurate results, and it is much faster than RF at run-time because it has fewer trees.
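A minimal XGBoost sketch; the parameter values are illustrative only (as noted above, the defaults are often good enough), and the train/test variables are assumed from the earlier steps.

from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=100,     # boosting rounds: trees added one after another
    learning_rate=0.1,    # how strongly each new tree corrects the previous ones
    max_depth=3,          # shallow trees; each only has to fix the remaining errors
    random_state=7)
xgb.fit(X_train, y_train)            # X_train, y_train assumed from earlier steps
print(xgb.predict_proba(X_test[:5]))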

Support Vector Machine

A Support Vector Machine (SVM) seeks to identify a line or a hyperplane that best separates the training set into two classes. The data instances closest to that separating line are called support vectors; they define the margin and therefore influence where the line is placed. SVM has been extended to support multiple classes. Of particular importance is the use of different kernel functions via the kernel parameter; a powerful Radial Basis Function is used by default. SVM is used for text classification tasks such as category assignment, spam filtering and sentiment analysis. It is also commonly used for image recognition challenges, performing particularly well in aspect-based recognition and colour-based classification. The advantage of this algorithm is its high accuracy, but it is slow.
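For the text-classification use case, a common pattern is TF-IDF features feeding a linear-kernel SVM. This is only a sketch; docs and labels are hypothetical lists of raw strings and their classes.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# docs is a list of raw strings and labels their classes -- both hypothetical here
text_clf = make_pipeline(
    TfidfVectorizer(stop_words='english'),   # sparse, high-dimensional bag-of-words features
    SVC(kernel='linear'))                    # which suit a linear kernel, per the rule of thumb
text_clf.fit(docs, labels)
print(text_clf.predict(["free prize, click now to claim"]))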

Of course there are many deep-learning techniques that could be used for classification; we will discuss them in a separate article. If time permits, run three or more of these models on the same data with exactly the same train/test splits, to ascertain which two to evaluate further for performance improvement.

Top-pick evaluation

from matplotlib import pyplot
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC
from xgboost import XGBClassifier

# prepare models
models = []
models.append(('XGB', XGBClassifier()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('RF', RandomForestClassifier()))
models.append(('LR', LogisticRegression()))
models.append(('SVM', SVC()))

# evaluate each model in turn on the same folds (X and Y prepared earlier)
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))

# boxplot algorithm comparison
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
ax.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()

You can always connect with me on Twitter and LinkedIn.
