Turning a repetitive business task into a self-improving process

Simple and practical end-to-end case study of managing the bias vs variance trade-off on a small text data set

Romain Guion
VorTECHsa
11 min read · Aug 2, 2019


Credit: World History Archive

It comes as little surprise that spam detection at Google or tracking 200,000 vessels at Vortexa involves machine learning and data engineering. Humans cannot process billions of data points, and aren't as reactive and consistent as computers.

For many analysts, though, this can feel quite far from their day-to-day work. In many cases, creating a new ad hoc Excel spreadsheet or SQL query can yield good results pretty fast. However, companies that also apply this kind of technology in non-core areas gain a competitive advantage, both in scalability and in data-driven decision making. Failing that, routine processes will fill up your weeks and give you little headroom to spot exceptional opportunities.

In this blog post, we share an example of how a repetitive business task can be transformed into a self-improving process in only a couple of hours. This first prototype is less than 10 lines of code, and achieves 90% accuracy with a training sample of only around 1,000 examples.

Who should read this? This article is for those who already know a few data analytics concepts and want to learn how to bring them together to solve a concrete business problem.

After reading this post, you will

  • be able to improve and automate your manual text processing tasks
  • know in practical terms how to measure and control under/overfitting
  • understand how machine learning can yield benefits in your day to day, in only a few hours, without having to beg your engineering or IT teams for resources
  • expand your intuition on what machine learning actually looks like, beyond the buzzwords

Business context

As part of their benchmarking activities, our energy market analysts need to identify what piece of reference data from customs data or national statistics (e.g. JODI or EIA) corresponds to the data produced by Vortexa for its clients. Each data source has its own definition of what the product is, and this blog post will focus on interpreting a description string. Below are examples of product descriptions:

This is the type of data we are trying to classify into a Vortexa product category

From these descriptions, our market analysts match each reference data type to the relevant product in Vortexa’s taxonomy:

DESCRIPTION is our text feature we want to classify (X), and PRODUCT is the classification / supervision made by our market analysts (y)

Problem formulation: text classification

Our market analysts noticed a repetitive pattern, and asked the R&D team how they could multiply their efforts with some automated processing.

As formulated, the problem looks like an instance of classification based on text features, a classic task in natural language processing (NLP).

Reality check: manually generated supervision sets tend to be quite small

Although they have tens of thousands of data types to classify for benchmarking, our market analysts had manually labelled only 1,400 samples at the time of writing. In contrast, our data set contains 650 different words: that's only about 2 samples per word on average.

What’s worse is the words cannot be straightforwardly associated with a result: e.g. in our example #2 above, the description includes “excl. blends of biodiesel”. This means words will have to be interpreted in their context: we cannot simply infer that the presence of the word “biodiesel” means it is biodiesel (quite the opposite!).

A first simple proxy to capture the context of words is to track neighbouring words — a concept called n-gram.

Example of 1-gram, 2-gram, 3-gram and 4-gram in the context of our data set

If we consider all combinations of 1, 2, 3 and 4 neighbouring words (1, 2, 3, 4-grams), this leads to 6600 features.
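To make this concrete, here is a quick sketch of how the feature count explodes with n-grams; the descriptions below are illustrative stand-ins for the real labelled data:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative descriptions, standing in for the real reference data
descriptions = [
    "Biodiesel and mixtures, without petroleum oils",
    "Gas oil, excl. blends of biodiesel",
    "Motor gasoline, excl. aviation gasoline",
]

# Treat every 1-, 2-, 3- and 4-gram as a candidate feature
vectoriser = CountVectorizer(ngram_range=(1, 4))
X = vectoriser.fit_transform(descriptions)

print(X.shape)                      # (n_samples, n_features)
print(len(vectoriser.vocabulary_))  # number of distinct n-gram features
```

Even with three short descriptions, the number of n-gram features is already a multiple of the number of words; on the full data set this is how we reach thousands of features.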

This is roughly six times more features than training samples. If we are not careful, we could build a model that learns all the examples by heart but fails to draw generalisable lessons from them. Such a model would make zero mistakes on the training set, but would not be able to predict new cases effectively. In statistics jargon this is called overfitting, or high variance. Example below:

Illustration of overfitting and underfitting, in a simple example with ‘Time’ as a feature X, and ‘Values’ as the variables to predict y. On the left a 1 degree polynomial underfits the data set (too simplistic), which would display poor performance on the training data. In the middle this looks like a balanced model. On the right, the very high degree polynomial overfits the data set, which would show high performance on the training data, but wouldn’t generalise well on the test data (source for the illustration)

Choosing a robust & easy to use algorithm for a first prototype

State of the art NLP typically uses complex neural networks, and it can be tempting as a data scientist to jump on those tools at any opportunity. The context in which those tools are used typically involves both very large data sets and highly complex inference tasks (e.g. chatbots or some of Vortexa’s core models). However, they tend to have many parameters to tune and can make the overfitting problem worse.

Instead, we chose another class of algorithm: Random Forest classifiers. Why we chose them is beyond the scope of this post. In short, this class of model works well without much optimisation and is particularly robust to overfitting.

Extracting a tiny bit more information than a bag-of-words

Previously we saw that text patterns can take different forms (single words, n-grams, etc.). Those patterns of language, also called tokens, are linked together by grammar, and the absolute and relative sequence of tokens can be quite complex in human language. This section is about how much of this complexity we want to encode to solve the task at hand, and how.

Simply identifying the presence of specific patterns / words in the description can actually go a long way. This is often called a Bag-of-Words (BoW) representation and essentially encodes words as dummy variables (each pattern / word becomes a new column that is equal to 1 when the word is present in the row, 0 otherwise).

A simple method to extract a bit more information from a group of documents is to remove or penalise words used non-specifically in many descriptions (documents), and highlight words that appear only in some specific cases and are perhaps used with insistence in that specific context.

Scikit-learn implements a number of strategies along these lines in a single class: removing stop words ("and", "then", etc.), setting bounds on term frequency, scoring terms by Term Frequency-Inverse Document Frequency (TF-IDF), and storing the result in a memory-efficient "sparse" matrix:

Implementation of Term Frequency Inverse Document Frequency (TF-IDF), in Python with scikit-learn
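The original implementation is shown above as a screenshot; as a hedged sketch, a vectoriser along those lines might look like this (the parameter values are illustrative assumptions, and the docs list is a stand-in for the real descriptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative descriptions, standing in for the real labelled data
docs = [
    "Gas oil, excl. blends of biodiesel",
    "Biodiesel and mixtures thereof",
    "Motor gasoline, excl. aviation gasoline",
]

# One class combines several of the strategies above:
# stop-word removal, term-frequency bounds, n-grams and TF-IDF weighting
tfidf = TfidfVectorizer(
    stop_words="english",  # drop non-specific words ("and", "then", ...)
    ngram_range=(1, 4),    # single words plus 2-, 3- and 4-grams
    max_df=0.9,            # ignore terms present in more than 90% of descriptions
)

X = tfidf.fit_transform(docs)  # memory-efficient sparse matrix
print(type(X), X.shape)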

Further step to control overfitting: dimensionality reduction

Applying dimensionality reduction algorithms to our product descriptions will identify which groups of words best separate our samples. The idea here is to remove groups of words that are redundant or overlapping (collinearity) and identify which combination seems the most complementary. Here is an illustration of the concept of dimensionality reduction:

Illustration of the concept of dimensionality reduction, using the example of an algorithm called PCA (Principal Component Analysis) (illustration source). This describes a data cloud in a 2D space. PCA identifies the directions along which the data is best separated, and expresses the data in this new basis. The axes are ordered by decreasing discriminatory power: as you can see, the PCA1 axis separates the data cloud more than PCA2. From there, dimensionality reduction consists in keeping only the top N dimensions. In this example, we could choose to keep only PCA1, and we would have reduced the dimensionality from 2D to 1D.
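As a minimal sketch of this idea, the snippet below builds an elongated 2D data cloud and keeps only its most discriminative direction (the data and parameters are purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# An elongated 2D data cloud, standing in for the illustration above
rng = np.random.default_rng(0)
x = rng.normal(size=500)
data = np.column_stack([x, 0.3 * x + 0.1 * rng.normal(size=500)])

# Keep only the single most discriminative direction (2D -> 1D)
pca = PCA(n_components=1)
reduced = pca.fit_transform(data)

print(pca.explained_variance_ratio_)  # share of variance captured by PCA1
print(reduced.shape)                  # (500, 1)
```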

Reducing the number of features this way can improve the effectiveness of our classifier downstream for a fixed computational time. If we were to tune the downstream model perfectly, and didn’t care about processing time, chances are this wouldn’t improve the model’s performance, and would even decrease it given the information lost in the dimensionality reduction.

However, for a quick job I'd argue this can help performance by reducing the risk of overfitting (fewer features) and by making the feature selection more effective (orthogonal features, i.e. features that aren't collinear, complement each other better). Given the high number of features here and the small number of samples, I judged it likely that dimensionality reduction would help get good results fast. Training our classifier for longer, or selecting features in different ways, may lead to better results in the long term though.

There are a range of dimensionality reduction strategies and algorithms. I semi-arbitrarily chose truncated Singular Value Decomposition (SVD) to preserve the sparse matrix structure of our pipeline. This is for memory optimization, but the benefits are limited while the data set is small.
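A small sketch of that choice: unlike PCA, TruncatedSVD accepts a sparse matrix directly, so the TF-IDF output never has to be densified in memory (the random sparse matrix below simply stands in for the real TF-IDF features):

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# A random sparse matrix standing in for a TF-IDF output:
# ~1,000 documents x 6,600 n-gram features, mostly zeros
X_sparse = sp.random(1000, 6600, density=0.01, format="csr", random_state=42)

# TruncatedSVD works directly on sparse input, keeping memory usage low
svd = TruncatedSVD(n_components=100, random_state=42)
X_reduced = svd.fit_transform(X_sparse)

print(X_reduced.shape)  # (1000, 100)
```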

Let’s try it!

Python implementation of text processing (TF-IDF vectoriser), dimensionality reduction (SVD) and classification (Random Forest) using a scikit-learn Pipeline object

Our simple prototype fits in only a few lines of code. We wrap it up in a pipeline so it is easier to tune (see later).
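The exact code lives in the screenshot above; a minimal reconstruction of such a pipeline could look like the sketch below, assuming X_text holds the product descriptions and y the analyst labels, and with illustrative parameter values:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# X_text: product descriptions (list/Series of strings)
# y: product labels assigned by the market analysts
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 4))),
    ("svd", TruncatedSVD(n_components=100, random_state=42)),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X_text, y, test_size=0.3, random_state=42
)

pipeline.fit(X_train, y_train)
print(accuracy_score(y_test, pipeline.predict(X_test)))
```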

How should we assess what "good" looks like? The choice of metric depends on the business problem. Here we want a high fraction of correct predictions, because the incorrect ones are what market analysts will have to fix by hand: this metric is called accuracy. With this quick prototype we get 90% accuracy, not bad for such a small data set and simple model!

In your specific case you need to choose your metric carefully: accuracy may not be the best metric for you. In particular, it is important to monitor imbalanced classes (e.g. much more gasoline than naphtha here). To do so, looking at predictions separately within each class is a good idea, and other aggregate metrics will come in handy. For example, we noticed that some classes with few labelled examples (e.g. naphtha) have poor scores. This means we should not use results for those categories yet, until our analysts have labelled more of them. For simplicity of communication we'll stick to accuracy henceforth.
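As a sketch of that per-class check, reusing the fitted pipeline and test split assumed above, scikit-learn's classification_report gives precision, recall and F1 for each product class:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Reusing `pipeline`, `X_test` and `y_test` from the earlier sketch
y_pred = pipeline.predict(X_test)

# Per-class precision, recall and F1: classes with little supervision
# (e.g. naphtha in our case) stand out immediately
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```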

A word on fine-tuning — hyperparameter optimisation

Even after a careful design process, there are still case-dependent optimisations that are hard to anticipate from a theoretical standpoint. One way to further improve an algorithm is pretty basic: try lots of combinations of different parameters — this is called Gridsearch.

To optimise the parameters of both the preprocessing steps and the classifier step, we wrapped them into a Pipeline object. Below is an illustration of what a simple grid search looks like, with a first set of parameters to explore, based on my experience:

Illustration of how Gridsearch works in Python with scikit-learn
Gridsearch results for param_grid chosen above
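Again the original grid search is shown as a screenshot; a hedged reconstruction, with an illustrative param_grid keyed on the pipeline step names assumed earlier, might look like this:

```python
from sklearn.model_selection import GridSearchCV

# Parameter names refer to the pipeline steps defined earlier:
# step name, double underscore, parameter name
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 4), (1, 6)],
    "svd__n_components": [50, 100, 200],
    "clf__n_estimators": [100, 300],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated accuracy of the best combination
```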

A note of caution: if you use only a train set and a test set, after a large hyperparameter optimisation you may have indirectly fitted the hyperparameters to the specificities of your test set. Instead, you probably want a training set, a cross-validation (CV) set, and a test set. More sophisticated methods, e.g. k-fold cross-validation, make better use of the training and CV data, but they are beyond the scope of this post.
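In code, that means the grid search above only ever touches the training data (GridSearchCV handles the cross-validation split internally), and the held-out test set is used exactly once at the very end:

```python
# Continuing the sketch above: the test set stays untouched until the end,
# so this score is an unbiased estimate of the chosen configuration
best_model = search.best_estimator_
print(best_model.score(X_test, y_test))
```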

What actually had an impact?

So far I gave you my perspective on how this problem should be approached. Let's try our base model with all parameters held constant, varying only the n-gram range and whether or not we apply dimensionality reduction.

Impact of 2 parameters on model accuracy: the n-gram range, and whether dimensionality reduction is used (SVD is the dimensionality reduction algorithm we picked here)

Here SVD seems to have helped the model learn faster (given more trees, and therefore more computational time, the no-SVD variant may have reached a similar level). Adding more n-grams seems to have only a marginal impact.

Let's also take a step back on what we have achieved: by using up to 6-grams with SVD we can predict the right product 9 times out of 10. At this stage you may decide that (i) this is good enough, (ii) this can be corrected manually with far less effort, or (iii) this needs improving. Case (i) doesn't need more comments, and for case (ii) you would want to review low-probability predictions first (see the sketch below). For the rest of this post we assume you are in situation (iii).
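For case (ii), here is a small sketch of how one might prioritise the manual review, assuming new_descriptions is a list of still-unlabelled description strings (a hypothetical variable name):

```python
import numpy as np

# Reusing the fitted `pipeline`; new_descriptions is the unlabelled data
# the analysts would otherwise classify entirely by hand
probas = pipeline.predict_proba(new_descriptions)
confidence = probas.max(axis=1)        # probability of the predicted class
review_order = np.argsort(confidence)  # least confident predictions first

for idx in review_order[:20]:
    print(round(confidence[idx], 2), new_descriptions[idx])
```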

Would more data help?

A key question an organisation needs to answer is whether more of the same data would help improve the results, or whether more effort should be spent on creating new features, choosing a better model, or obtaining better (deeper or more precise) data.

The learning curves below show model performance as the training set grows, always evaluated against a fixed-size evaluation set (a sketch of how to compute such curves follows the figure).

→ The x-axis shows the training set size, increasing from left to right.

→ The y-axis shows accuracy:

  • The red curve represents the accuracy on the training set: the model is trained and evaluated on the same data. This represents how closely the model manages to fit the data.
  • The blue curve represents the cross-validation accuracy, i.e. on data the model has never seen yet. This represents how well the model can generalise.
Top-left: no SVD, 1-gram ; top-right: with SVD, 1-gram ; bottom-left: no SVD, 6-gram ; bottom-right: with SVD, 6-gram
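Curves like these can be computed with scikit-learn's learning_curve utility; a minimal sketch, reusing the pipeline and labelled data assumed earlier (the train_sizes and cv values are illustrative):

```python
import numpy as np
from sklearn.model_selection import learning_curve

# Reusing `pipeline`, `X_text` and `y` from the earlier sketches
train_sizes, train_scores, cv_scores = learning_curve(
    pipeline, X_text, y,
    train_sizes=np.linspace(0.1, 1.0, 8),  # 10% to 100% of the training data
    cv=5, scoring="accuracy", n_jobs=-1,
)

# Average over the cross-validation folds to get one point per training size
for size, tr_acc, cv_acc in zip(
    train_sizes, train_scores.mean(axis=1), cv_scores.mean(axis=1)
):
    print(f"{int(size):4d} samples: train={tr_acc:.2f}, cv={cv_acc:.2f}")
```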

Interpreting training performance (red curve)

It is expected that the training accuracy will decrease as the dataset size increases: the number of degrees of freedom of the model remains the same, while a broader dataset displays more variability and is therefore harder to fit.

More interestingly, on the graph above the training accuracy is pretty low when there is no SVD (on the left side): the model may not be able to capture or encode the variability shown in the data. However, with SVD the training accuracy is pretty good with the same underlying features. So what is probably happening is that without SVD, our classifier does not manage to identify which features matter and fails to fit the data, whereas with SVD we have effectively done an unsupervised feature pre-selection.

Interpreting cross-validation performance (blue curve)

It is expected that the CV accuracy will increase with more data, as the model learns to distinguish noise (that would be overfitted) from genuine generalisable signal.

In the case with no SVD (left column), adding more data brings only slow improvements to the model. That's probably because more data also brings more new words, and so more features. For the classifier as defined, signal and noise seem to be growing at a similar speed, and the signal-to-noise ratio only barely improves.

In contrast, for the model with dimensionality reduction (right column), adding more data clearly helps the model make better predictions.

Interpreting the difference between training and CV performance

The gap between training and CV performance is an indication of how much could be gained from more data.

If both are really close, more data of the same kind won't help, and more features, a better model, or perhaps deeper / more precise data need to be explored. Or perhaps we have reached the maximum amount of information that can be extracted from the data, and the remaining error is true noise that no person or algorithm could reduce, e.g. there could be mistakes in the data being analysed.

If the gap between training and validation is large, the model is managing to fit the data, but doesn’t generalise well. Often the difference reduces with more data, as is the case here, and this is an indication that more data is needed to improve the model performance, and reduce the overfit.

Conclusion

With this benchmarking example, half a day of work allowed the team at Vortexa to reduce the workload on market analysts by 90%.

Through this article you have learned how to approach a practical machine learning problem, and how to measure its performance and interpret how to improve it.

In this journey we used a few specific tools:

  • N-grams and TF-IDF to extract simple information from text,
  • SVD for dimensionality reduction,
  • Random Forest as a classifier,
  • Gridsearch and pipelines to tune our model,
  • Learning curves to understand whether to improve the model or get more data

For Vortexa, the next step here is to increase the supervised set through labelling more of our raw data. This will be prioritised to focus where the business needs higher performance, and where the classifier makes the most meaningful mistakes. In parallel, we have also improved our results with other features, outside the scope of this post.

If that solution proves valuable, it may need to be wrapped into a bit more engineering to be automated, and deployed in a scalable way throughout the organisation. This is the topic of an upcoming blog post. However, this should by no means prevent you from improving and automating your work and that of your close colleagues, both for proof of concept and for immediate impact.
