Ratings based on human assessment — A regression or a classification problem?

Ricardo N Santos
Published in DataLab Log
8 min read · Aug 12, 2019

Supervised machine learning problems are typically separated into two distinct tasks: regression and classification. The choice of method depends on the type of problem we need to solve. If a model needs to predict a target numerical value on a continuous scale, such as the price of a product or the probability of an event happening, we face a regression problem. From another point of view, if we need to predict which category an instance best fits based on its attributes (for example, a dog breed), we are dealing with a classification task.

Despite this customary dichotomous view of supervised machine learning problems, some situations cannot be clearly defined as a regression or a classification task and can raise doubts about which strategy to use. Most of these situations arise when the target to be predicted is based on human preferences. For example, suppose we need to build a model to classify chili peppers by their perceived strength (or spiciness) based on physical and chemical attributes. At first, a tempting solution would be to divide their reported strength into several classes (e.g. weak, mild, hot, very hot) and train a model on these categories. A crucial limitation of this strategy, which the reader should have promptly noticed (I hope so), is that the ordering of the categories is totally ignored, due to the nature of machine learning classifiers. In a few words, all classification mistakes are equally bad. This omission can lead to unstable classifiers, with situations where the prediction jumps between non-adjacent classes under minimal changes of feature values (i.e. distant classes sharing the same decision boundary in feature space). As a consequence, a small change in chili pepper features could lead our classifier to occasionally mistake very hot peppers for weak ones.

Leaving aside the classification approach, a data scientist might instead represent the range of pepper strength on a numerical scale (such as a rating given by integers from 1 to 4, in analogy to the previous "weak" to "very hot" scale) and train regression-based models. An important and not so trivial limitation of this approach (sadly often overlooked in such studies) is that the trained model then receives the information that rating values are always equidistant and, accordingly, that the criteria defined by humans to characterize the target (e.g. the perceived response of the senses to stimuli) are linear. Valid use of regression models for such problems only holds if the premise of an evenly spaced scale of preference is reasonably true. Otherwise, regression models may provide poor predictions. As a matter of fact, this premise does not hold for our problem of rating chili pepper strength when we consider the well-known Scoville heat unit, which accounts for the chemical level of capsaicin molecules as a scale to compare spiciness among peppers and other foods.

Ordinal Regression

Prediction of ratings based on human preferences is a ubiquitous task in the most diverse areas, such as inference of customer satisfaction, political bias, recommendation systems, and social surveys. These situations fall outside the scope of traditional classification and regression methods insofar as the subjectivity of the predicted target (e.g. rating movies or tasty dishes) obliterates the true scale of the target variable and leaves the relative order of values as the only meaningful information. Accordingly, we need machine learning methods that are able to predict discrete and ordered target variables. A statistical learning method designed for the prediction of ordinal variables is ordinal regression (also known as ordinal classification).

Ordinal regression is based on the application of generalized linear models (GLM). These models are general in the sense that they fit the data with a set of coefficients w = (w₁, …, wd) together with a set of ordered thresholds (θ₁ < θ₂ < … < θK−1). Therefore, if we consider a target on a scale of K ordered values, the K−1 thresholds divide the continuous response into K disjoint segments that represent progressive response levels. For the sake of completeness, we can set θ₀ to −∞ and θK to +∞.
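To make the threshold mechanism concrete, here is a tiny sketch (the threshold values are made up for illustration) of how a continuous response z is mapped to one of K = 4 ordered levels:

```python
import numpy as np

# Hypothetical thresholds splitting a continuous response z into K = 4
# ordered levels; theta_0 = -inf and theta_K = +inf are implicit.
thresholds = np.array([-1.0, 0.5, 2.0])  # theta_1 < theta_2 < theta_3

def level(z, thresholds):
    """Map a continuous response z to an ordinal level 1..K."""
    # searchsorted counts how many thresholds lie below z
    return int(np.searchsorted(thresholds, z)) + 1

print(level(-2.3, thresholds))  # below theta_1 -> level 1
print(level(1.1, thresholds))   # between theta_2 and theta_3 -> level 3
```

The coefficients w produce z from the features; the thresholds then turn z into a discrete, ordered label.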

Based on the generalization described above, we can apply statistical constructions to estimate the optimal thresholds so as to reduce prediction errors (i.e. minimize a loss function) in ordinal regression models. Two possible implementations are the immediate-threshold and the all-threshold methods, which fit models based on sums of loss penalties over the segments of the target. In the first case, we can set a loss function that increases continuously in both directions outside the range of the correct prediction (whenever a threshold is crossed).

loss(z; y) = f(z − θy−1) + f(θy − z)

loss for the immediate-threshold method, where f is a margin penalty function (e.g. the hinge or logistic loss)

where z is the response predicted by the model, i.e. z(x). A depiction of the behavior of the immediate-threshold loss function for ordinal regression (with y=4 as the reference target) is shown in the left portion of the figure below.

J. D. M. Rennie and N. Srebro. Loss functions for preference levels. 2005

As observed in the graphic above, a limitation of the immediate-threshold loss is that it only penalizes the two thresholds adjacent to the correct label, ignoring how many additional thresholds are crossed. A more complete construction, which accumulates penalties according to the number of threshold violations, is the all-threshold method.

loss(z; y) = Σl f(s(l, y)·(θl − z)), summed over l = 1, …, K−1

loss for the all-threshold method

where the sum runs over the thresholds l and s(l, y) is positive if l ≥ y and negative otherwise. The profile of the all-threshold loss function with y=4 as the reference target is shown on the right side of the figure above. We can notice an important property in this case: the slope of the loss function increases each time a threshold is crossed. Therefore, more than reducing the distance of the predicted value from the correct label, this method drives the model to minimize the number of thresholds crossed away from the correct response.
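The two losses can be sketched in a few lines of plain Python. This is an illustrative implementation with a hinge margin penalty f(s) = max(0, 1 − s) and made-up threshold values, not the exact formulation used by any particular library:

```python
# Sketch of both losses for K = 4 levels and hypothetical thresholds.

def hinge(s):
    # margin penalty f(s) = max(0, 1 - s), one common choice
    return max(0.0, 1.0 - s)

THETAS = [-1.0, 0.5, 2.0]  # theta_1 < theta_2 < theta_3 (example values)

def immediate_threshold_loss(z, y, thetas=THETAS):
    """Penalize only the two thresholds bounding the correct segment."""
    loss = 0.0
    if y >= 2:                    # theta_{y-1} exists
        loss += hinge(z - thetas[y - 2])
    if y <= len(thetas):          # theta_y exists
        loss += hinge(thetas[y - 1] - z)
    return loss

def all_threshold_loss(z, y, thetas=THETAS):
    """Sum penalties over every threshold, signed by its side of y."""
    loss = 0.0
    for l, theta in enumerate(thetas, start=1):
        s = 1.0 if l >= y else -1.0
        loss += hinge(s * (theta - z))
    return loss
```

With y = 4 and a prediction far below θ₁, the all-threshold loss accumulates a penalty for every crossed threshold, while the immediate-threshold loss only grows with respect to θ₃, which is exactly the difference between the two panels of the figure.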

Just like logistic regression can be fitted to data for classification tasks through the posterior probability P(y = j | xᵢ), we can employ the same conditional-probability fitting with logistic functions to perform ordinal regression. We can compute the cumulative probability of a target ordinal value with the following logistic function:

P(y ≤ j | x) = 1 / (1 + exp(wᵀx − θj))

where w is the coefficient vector and θj is the threshold delimiting a specific segment of the target range.
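A short numerical sketch of this cumulative construction (the values of w, x, and the thresholds are arbitrary): per-class probabilities follow from differences of consecutive cumulative probabilities.

```python
import numpy as np

def cumulative_prob(x, w, thetas):
    """P(y <= j | x) = 1 / (1 + exp(w.x - theta_j)) for each threshold j."""
    z = float(np.dot(w, x))
    return 1.0 / (1.0 + np.exp(z - np.asarray(thetas)))

def class_probs(x, w, thetas):
    """P(y = j) = P(y <= j) - P(y <= j-1), with P(y<=0)=0 and P(y<=K)=1."""
    cum = np.concatenate(([0.0], cumulative_prob(x, w, thetas), [1.0]))
    return np.diff(cum)

w = np.array([1.0, -0.5])
x = np.array([0.3, 1.0])
print(class_probs(x, w, [-1.0, 0.5, 2.0]))  # K = 4 class probabilities
```

Because the thresholds are ordered, the cumulative probabilities are increasing, so the per-class probabilities are always non-negative and sum to one.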

MORD library

Now that we have set out the basic idea of ordinal regression, a key step in making this knowledge useful is to move from theory to practical examples. To do that, we will use a Python implementation of a collection of ordinal regression methods called MORD (Multi-class classifier for ORDinal regression). This implementation follows the same API as the well-known scikit-learn, which means it is very user-friendly and a no-brainer extension for scikit-learn users.

Assuming you already have the Anaconda Python distribution for machine learning installed (which already includes pip and scikit-learn), the MORD module can be installed with the following command in a terminal:

pip install mord

Alternatives for installation can be found at https://pypi.org/project/mord/

Predicting Preference in Automobiles

In this example, we will compare the performance of (i) regression, (ii) classification, and (iii) ordinal regression/classification models to predict consumers' preference for cars based on their attributes, such as safety, price, maintenance cost, and luggage boot size. The selected dataset includes 1728 customer evaluations and corresponds to the UCI Car Evaluation Dataset after a pre-processing step where the target (consumer acceptance level) is label-encoded.

Firstly we should import the libraries to be used in this study:

Next, we can load the dataset and take a look at its features:
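For illustration, here is a minimal frame with the six UCI Car Evaluation attributes; the target column name rating is an assumption of this sketch, and in practice you would load the pre-processed file with pd.read_csv:

```python
import pandas as pd

# Illustrative rows with the six UCI Car Evaluation attributes; "rating"
# (1-4, label-encoded acceptance level) is an assumed target column name.
df = pd.DataFrame({
    "buying":   ["vhigh", "high", "med", "low"],
    "maint":    ["vhigh", "med", "low", "low"],
    "doors":    ["2", "4", "4", "5more"],
    "persons":  ["2", "4", "more", "more"],
    "lug_boot": ["small", "med", "big", "big"],
    "safety":   ["low", "med", "high", "high"],
    "rating":   [1, 2, 3, 4],
})
print(df.head())
```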

outcome of df.head() command

As we can observe, some of the features used to predict the rating of cars are categorical. Therefore, we need to define a function to encode these categorical features as binary numerical columns (a.k.a. one-hot encoding),

and select numerical and categorical features to be passed:

Then, we apply this function to encode categorical info.
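A minimal sketch of such an encoding step using pandas' get_dummies (the column names follow the UCI attribute list; the function name is illustrative):

```python
import pandas as pd

CATEGORICAL = ["buying", "maint", "doors", "persons", "lug_boot", "safety"]

def one_hot_encode(df, categorical_cols=CATEGORICAL):
    """Expand each categorical column into binary indicator columns."""
    present = [c for c in categorical_cols if c in df.columns]
    return pd.get_dummies(df, columns=present)

# tiny illustration on a two-row frame
toy = pd.DataFrame({"buying": ["low", "high"],
                    "safety": ["med", "high"],
                    "rating": [3, 1]})
encoded = one_hot_encode(toy)
print(encoded.columns.tolist())
```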

Now that we have the dataset prepared for application in machine learning, we can create a pipeline within a function to train and evaluate distinct models and make life easier:
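A minimal sketch of what such a pipeline function (called oraculo later in this post) might look like; the rounding and clipping step is an assumption of this sketch, needed so that continuous regression outputs can be scored against the discrete 1 to 4 scale:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_absolute_error

def oraculo(model, X, y, test_size=0.25, seed=42):
    """Fit the model on a training split and report accuracy and MAE
    on the held-out data (minimal sketch of the evaluation pipeline)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    model.fit(X_tr, y_tr)
    # round and clip so continuous predictions map onto the rating scale
    pred = np.clip(np.rint(model.predict(X_te)), np.min(y), np.max(y))
    return accuracy_score(y_te, pred), mean_absolute_error(y_te, pred)
```

Reporting both metrics lets us compare regression-style models (where MAE is natural) and classification-style models (where accuracy is natural) side by side.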

Herein we will compare prediction models built with the blameworthy approximations (as already discussed) of pure regression or classification methods, (I) linear regression and (II) logistic regression, against the ordinal regression methods (III) immediate-threshold and (IV) all-threshold. Hence, we need to define a list of these models and their respective parameters (keeping them as comparable as possible for a legitimate benchmark).

The presence of the two seemingly incompatible metrics of accuracy and mean absolute error (MAE) in the defined oraculo function may feel clumsy to some readers. Since the problems discussed herein blend classification and regression tasks, considering both metrics is good practice.

Finally, we can run our models and compare the results.

Those readers who persevered up to this point will be met with the following outcome:

(I) Linear Regression model:

outcome of predictive model based on Linear Regression

(II) Logistic Regression (classification) model:

(III) Immediate-Threshold Ordinal Regression model:

(IV) All-Threshold Ordinal Regression model:

The obtained results clearly show a substantial superiority of ordinal regression in external-validation prediction performance, whether we consider the accuracy or the MAE metric. Furthermore, we can also observe a better balance among the accuracies for the distinct ratings (1 to 4) of car preference. A clear comparison of the performance of the proposed models is provided in the figure below.

I leave for the reader the opportunity to compare the performance of these models after parameter optimization (you can take advantage of the ready-to-use GridSearchCV in our pipeline).

The database and notebook to reproduce all steps of this post (including plots) can be found here.
