Photo by Rozum Robotics.

Our Data Science robot intern

Automating the boring parts of machine learning

Gustavo Sena Mafra
Published in Legiti
11 min read · Aug 5, 2020


Machine learning practitioners know how overwhelming the number of possibilities can be when building a model. It's like going to a restaurant with a menu the size of a book, without ever having tried any of the dishes. Which models do we test? How do we configure their parameters? Which features do we use? Those who try to answer these questions through ad-hoc manual experiments end up having their time consumed by menial tasks and their work constantly interrupted to check results and launch new experiments. This has motivated the rise of the field of automated machine learning.

The main tasks of automated ML are hyperparameter optimization and feature selection, and in this article we will present Legiti's solution to both (no, we didn't hire an intern for that). We developed a simple algorithm that addresses both challenges and is designed for rapidly changing environments, such as the ones data scientists face in real-world business problems. We discuss some of its convenient properties and limitations, as well as possible extensions, and end by showing results obtained by applying it to one of our models.

Introduction and motivation to the problem

At Legiti, we build machine learning models to fight credit card transactional fraud; that is, to predict whether an online purchase was made by (or with the consent of) the person who owns that credit card. We only learn that a fraud occurred some time later, when the owner of the card requests a refund, a process known as a chargeback. These confirmed fraud cases are translated into our modeling as labels for a supervised learning algorithm.

If you already have a good knowledge of common tools for feature selection and hyperparameter optimization you can skip to the “An outline of (the first version of) our algorithm” section.

Machine learning models usually take many parameters that can be adjusted

In our case, we're currently using XGBoost. XGBoost builds a tree ensemble (a set of decision trees whose predictions are averaged) with a technique called gradient boosting. This algorithm usually works quite well "out of the box", but its performance can be maximized by tuning some characteristics of the model, known as "hyperparameters". Examples include the depth of the trees in the ensemble and the number of trees.
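To make this concrete, here is a minimal sketch of an XGBoost classifier with a few of these hyperparameters set explicitly. The synthetic dataset and the specific values are purely illustrative, not our production setup.

```python
# Illustrative only: a small XGBoost model with a handful of hyperparameters.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

model = xgb.XGBClassifier(
    max_depth=6,         # depth of each tree in the ensemble
    n_estimators=300,    # number of trees
    learning_rate=0.05,  # shrinkage applied at each boosting step
    subsample=0.8,       # fraction of rows sampled for each tree
)
model.fit(X_train, y_train)
print(roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]))
```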

Hyperparameters are not exclusive to XGBoost. Other ensemble algorithms share a similar set of hyperparameters, and different ones exist for neural networks, such as the number of layers and the number of nodes in each layer. Even linear models can have hyperparameters; think of the L1 and L2 regularization parameters in Lasso/Ridge regression. Any applied machine learning practitioner who wants to maximize the performance of their model will sooner or later be engaged in the task of optimizing these parameters, usually by trial and error.

We also usually have many features at our disposal

In solutions that use a lot of structured data to make predictions (in contrast with, for example, image or text processing with deep learning), much of the work is not on the model itself but on its inputs; that is, on the features of the model. A problem similar to hyperparameter selection arises here: feature selection.

Feature selection consists of choosing a configuration of features that maximizes performance. Usually, performance means predictive capability (some accuracy metric on out-of-time validation sets), but other factors can count as well, such as training speed and model interpretability. At Legiti, we have more than 1000 features at our disposal for one of our models. However, simply throwing all 1000 features at a model doesn't necessarily lead to better accuracy than, for example, using only 50 carefully selected ones. On top of that, using many features that add little predictive power can make training much slower.

The number of features we are choosing from is higher than the variety of products on this shelf. Photo by NeONBRAND on Unsplash.

No one wants data scientists wasting their time with manual tasks

Unfortunately, when training a model, the model won't tell you what the optimal hyperparameter and feature sets are. What data scientists do is observe out-of-sample performance to avoid the effects of overfitting (this can be done in multiple ways, but we won't go into details here). The simplest way to address the problem is to experiment manually with different sets of hyperparameters and features, eventually developing an intuitive understanding of which kinds of modifications matter. However, this is a very time-consuming practice, and we do not want to waste our data scientists' time on such manual tasks. What happens then is that most machine learning "power users" develop and/or adopt algorithms to automate the process.

Some existing algorithms

Several feature selection and hyperparameter optimization algorithms exist, and lots of them are packaged in open-source libraries. Some basic examples of feature selection algorithms are backward selection and forward selection.
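As an off-the-shelf illustration, scikit-learn implements both of these greedy strategies in SequentialFeatureSelector; the estimator, dataset, and numbers below are arbitrary choices rather than a recommendation.

```python
# Forward/backward selection with scikit-learn's SequentialFeatureSelector.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=30, n_informative=8, random_state=0)

# direction="forward" starts from zero features and greedily adds the most helpful one;
# direction="backward" starts from all features and greedily removes the least helpful one.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=10,
    direction="forward",
    cv=5,
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected features
```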

Another alternative is the Lasso, an L1-regularized linear regression model with sparsity properties, meaning that many features end up with a weight of exactly 0. Some people use the Lasso only to select the features that were assigned non-zero weights and discard the regression model itself.
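A minimal sketch of that idea, assuming synthetic data and an arbitrary regularization strength:

```python
# Using the Lasso purely as a feature filter: fit an L1-regularized model and
# keep only the columns that received a non-zero coefficient.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=2000, n_features=100, n_informative=10, random_state=0)
X = StandardScaler().fit_transform(X)  # L1 penalties are sensitive to feature scale

lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # indices of features with non-zero weight
print(f"{len(selected)} of {X.shape[1]} features kept")
# The Lasso itself can now be discarded; the selected columns are fed to
# whatever model you actually want to train.
```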

Hyperparameter optimization is usually done either with very simple algorithms (random or grid selection) or with complex Bayesian optimization surrogate models, such as the ones that are implemented in the Hyperopt and Spearmint packages.
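As a small illustration of the second style, here is a Hyperopt sketch that runs TPE over two XGBoost hyperparameters; the search space and the trial budget are arbitrary.

```python
# Bayesian-style hyperparameter search with Hyperopt's TPE algorithm.
import xgboost as xgb
from hyperopt import fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=3000, n_features=40, random_state=0)

def objective(params):
    model = xgb.XGBClassifier(
        max_depth=int(params["max_depth"]),
        n_estimators=int(params["n_estimators"]),
    )
    # Hyperopt minimizes, so return the negative cross-validated AUC.
    return -cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()

space = {
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
    "n_estimators": hp.quniform("n_estimators", 50, 500, 50),
}
best = fmin(objective, space, algo=tpe.suggest, max_evals=25)
print(best)
```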

However, despite all this available tooling, as far as we know there is no universal tool that handles both hyperparameter optimization and feature selection well. Another common difficulty is that the optimal hyperparameter set depends on the set of features being used, and vice versa. Finally, most of these algorithms were designed for laboratory environments such as research programs or Kaggle competitions, where usually neither the data nor the problem changes over time.

Given all of that, we decided to build our own.

An outline of (the first version of) our algorithm

When we decided to build our own thing instead of relying on existing methods, it wasn't because we thought we could do something "superior" to what was already available. It was a pragmatic decision. We had nothing in place at all (we were at the stage of manual tweaks to features and hyperparameters) and we wanted to build a minimum viable solution. None of the algorithms we found would have worked without significant adaptation, so we decided to code a very simple algorithm and replace it with more elaborate solutions later. However, this algorithm exceeded our expectations and we have been using it ever since. Some numbers are available in the results section below.

Description of the algorithm

The strategy we took was to use our best solution at the time as the initial anchor and apply small random modifications to it. And so, we built an algorithm that takes the following general steps:

1) Choose randomly between a change in features or hyperparameters

2.1) If we are changing a feature, choose a random feature from our feature pool and swap it in the model (that is, remove the feature if it is already in the model, or add it if it is not)

2.2) If we are changing a hyperparameter, choose a random hyperparameter out of the ones we’re using and then randomly increase it or decrease it by a certain configurable value

3) Test the model with these small changes in our cross-validation procedure. If the out-of-sample accuracy metric is higher, accept the change to the model; if it doesn’t improve, reject this change.
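In code, the loop looks roughly like the sketch below. This is a minimal illustration, not our production implementation: the evaluate function is a placeholder for our cross-validation procedure, and the even split between the two kinds of changes and the step sizes are assumptions.

```python
import random

def hill_climb(evaluate, best, feature_pool, step_sizes):
    """Random local search over features and hyperparameters.

    `evaluate` is supplied by the caller and must return an out-of-sample
    metric (e.g. cross-validated AUC) for a candidate configuration.
    `best` is a dict with a "features" set and a "hyperparameters" dict;
    `step_sizes` maps each hyperparameter to its configurable step.
    """
    best_score = evaluate(best)
    while True:  # never-ending, by design
        candidate = {
            "features": set(best["features"]),
            "hyperparameters": dict(best["hyperparameters"]),
        }
        # 1) choose randomly between a feature change and a hyperparameter change
        if random.random() < 0.5:
            # 2.1) swap a random feature in or out of the model
            feature = random.choice(feature_pool)
            candidate["features"] ^= {feature}  # remove if present, add if absent
        else:
            # 2.2) nudge a random hyperparameter up or down by its configured step
            name = random.choice(list(candidate["hyperparameters"]))
            candidate["hyperparameters"][name] += random.choice([-1, 1]) * step_sizes[name]
        # 3) accept the change only if the out-of-sample metric improves
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
```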

Some readers may note similarities between this and the Metropolis-Hastings algorithm or simulated annealing. It can in fact be seen as a simplification of simulated annealing in which there is no variable "temperature" and the random acceptance rule becomes a deterministic one.

Some big advantages

A few nice properties of this algorithm are:

  • It works for both hyperparameter and feature selection. There's no need to implement and maintain two different processes/algorithms, both from the mathematical/conceptual point of view (there are fewer concepts to think about) and from the technological point of view (there's less code to write and maintain, so fewer sources of bugs).
  • It can quickly find a solution that is better than our current best. Unlike many algorithms, there's no need to execute a very long-running task (such as backward/forward selection). We can interrupt it at any time and relaunch it right afterward. This is particularly useful in our circumstances, where we are constantly receiving new data and changing our evaluation procedure to accommodate it. What we do then is simply interrupt the process, evaluate the current best model on the new data, and start the process again. If instead we were using Hyperopt, for example, the entire history of samples gathered to build the surrogate model would suddenly lose its value.
  • It is a never-ending process. This is interesting for a practical reason: no idle time for our machines.
  • The setup is very easy. The only "state" the system needs is the current best model, which we already track in version control. There's no need for a database or files to keep a history of runs, for example.
  • It is easy to understand what happens under the hood, so when things are not working well we don't need to reason about Gaussian processes (the surrogate model used by many Bayesian optimization tools) or anything like that.

Limitations

Of course, this algorithm is not without challenges. However, for all of the most relevant difficulties, there are ways to overcome them with relatively small extensions to the algorithm. Some of the limitations we found are:

  • It can get stuck at local optima, since all tested alternatives are derived from the current best model. A solution is to make multiple random changes at once: instead of swapping one feature, we swap two, or we change a hyperparameter and swap a feature, and then test the new model. This increases the number of possible modifications the algorithm can make, enlarging its search space and decreasing the chances of getting stuck.
Image by peach turns pro on Wordpress, with modifications.
  • It can select highly correlated features. If interpretability is a big problem for you, this algorithm might not be the best bet. Eliminating features based on the correlation they have with other ones might be a good choice even if it decreases out-of-sample accuracy. We, however, are clearly choosing accuracy in the accuracy vs. interpretability trade-off.
  • New features take too long to be tested when the feature pool is large. We are already running into this problem, since we have more than 1300 features in our pool. A new feature has to wait, on average, hundreds or even thousands of iterations of the algorithm before it is tried, where one iteration consists of computing the out-of-sample accuracy of the proposed model with, for example, a cross-validation procedure. To put that in numbers: if cross-validation takes 10 minutes and a given feature has to wait more than a thousand iterations before it comes up, testing it can take about 10 days. A solution is to build a feature queue: for every feature we store the last time it was swapped and, instead of choosing features at random, we pick the feature that hasn't been tested for the longest time, or one that was never tested at all (a sketch of such a queue follows this list).
  • The running time is directly tied to the model evaluation time. If the evaluation process (usually cross-validation) takes a long time to run, this algorithm will be slow as well. A possible solution is to introduce a pre-selection step: before fully testing a candidate model, we run only one of the cross-validation folds and move on to the full test only if this partial metric beats the current best model's partial metric.
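As promised above, here is a minimal sketch of the feature-queue idea from the third limitation; the feature names and the timestamp bookkeeping are hypothetical.

```python
import time

# Hypothetical feature queue: remember when each feature was last swapped and
# always propose the one that has waited the longest (never-tried features,
# with timestamp 0.0, come first).
feature_pool = ["feat_a", "feat_b", "feat_c", "feat_d"]  # placeholder names
last_swapped = {name: 0.0 for name in feature_pool}

def next_feature_to_swap():
    return min(last_swapped, key=last_swapped.get)

def mark_swapped(name):
    last_swapped[name] = time.time()
```

In step 2.1 of the algorithm, next_feature_to_swap() would replace the purely random choice, and mark_swapped() would be called once the candidate has been evaluated.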

Results

Most importantly, none of this would matter if we couldn’t reap the benefits in terms of better fraud identification for our customers. Fortunately, Optimus Prime (as we decided to call our robot intern) is a results-improvement machine.

You can see some results below. In this specific case, Optimus Prime found multiple ways to improve our model, and we can see in the pull request the difference in our metrics from the old model to the new one.

Note that this PR was not opened by a human. With the help of Python packages for Git and GitHub, we can automate the whole process. All we do now is wait until we get notified of a new pull request with an improvement from Optimus, check the results, and accept the pull request.
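As an illustration of how such a pull request can be opened from Python, here is a sketch using GitPython and PyGithub; the repository name, branch, file, and token handling below are placeholders rather than our actual setup.

```python
import os
import git                  # GitPython
from github import Github  # PyGithub

branch = "optimus/improve-model"  # hypothetical branch name

# Commit the updated model configuration on a new branch and push it.
repo = git.Repo(".")
repo.git.checkout("-b", branch)
repo.index.add(["model_config.json"])  # hypothetical file with features/hyperparameters
repo.index.commit("Optimus Prime: improved out-of-sample metric")
repo.remote("origin").push(branch)

# Open the pull request, with the metric comparison (and words of wisdom) in the body.
gh = Github(os.environ["GITHUB_TOKEN"])
gh_repo = gh.get_repo("example-org/model-repo")  # placeholder repository
gh_repo.create_pull(
    title="Optimus Prime: model improvement",
    body="Old vs. new cross-validation metrics, plus a randomly selected quote.",
    head=branch,
    base="main",
)
```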

A very important feature of the GitHub API is that we can add text to pull request descriptions; without it, we wouldn’t be able to put randomly selected words of wisdom by Optimus Prime in them. This way we can always be reminded of our place in the universe and keep strengthening our company culture.

Optimus Prime fighting filthy fraudsters. Art by Naihaan at DeviantArt.

Next steps

Some of our main challenges were described in the limitations section. The ones we think are the most relevant for us at the moment are:

1. our lack of certainty about whether the algorithm can escape local optima, and

2. the long time it takes for our algorithm to test new features.

The first extension we're testing is to alternate between a feature optimization mode and a hyperparameter optimization mode. The feature optimization mode uses the exact same algorithm described above, while for hyperparameter optimization we're trying out Hyperopt with a limited number of trials.

Conclusion

In this article we gave a brief introduction to the field of automated machine learning, presenting some well-known tools as well as the solution we built at Legiti. As shown in the results section, we now see continuous improvements in predictive capability, with almost no ongoing work from our data scientists, who can focus on the development of new features and on research in general.

Even with the success we're having, we don't see our work on this front as finished. We now see this process as a core part of our product and, like any other part, it should undergo constant improvement.

If you have similar experiences trying to build an automated machine learning system, feel welcome to contribute to the discussion in the comments below!
