Uncertainties on ML Predictions — Genetic Regex — WBAA at PyData Amsterdam

Nikoletta Bozika
inganalytics.com/inganalytics
Aug 9, 2019

The data science and developer communities are well aware of the global PyData network and its impact on knowledge sharing. PyData brings together the data science and scientific computing communities to learn about and discuss the best methods, novel practices, and emerging trends in big data analytics, processing, and visualisation.

During this year’s PyData in Amsterdam, ING WB Advanced Analytics shared knowledge with the international open source community on two different topics: “Uncertainties on ML Predictions” and “Genetic Regex”.

“Uncertainties on Machine Learning Predictions”

To make intelligent decisions, it is essential to provide uncertainties on machine learning (ML) predictions. However, none of the common ML libraries estimates uncertainties on individual predictions.

Inspired by this gap, ING WB Advanced Analytics set out to fill the shortage of tooling by implementing a maximum-likelihood-based uncertainty estimation technique in Python for machine learning algorithms such as logistic regression and neural networks. Our senior data scientist Fabian Jansen, together with the talented data scientist Eva van Weel, presented their efforts to create a Python library that helps data scientists estimate uncertainties on individual machine learning predictions.

Fabian explained that the idea was sparked while working on one of our projects. The team was trying to construct probability curves for winning deals as a function of the offered price, but the curves did not look sensible. They wondered whether the curves simply carried a large uncertainty, i.e. wide bands, but there was no tooling to determine this. That was when they decided to build the uncertainty tooling themselves.

“We are doing some really cool things at ING WBAA; we don’t only use data science tools, we research and build them too. We presented our first efforts on building a library for data scientists that they can use to easily estimate uncertainties on their predictions. There are plenty of libraries around that you can use to make predictive models. But there are surprisingly few that also offer easy functionality to estimate uncertainties on predictions. Estimating uncertainties is just still not a big thing in data science and definitely not in business. But it should be” — Fabian

Estimating uncertainties on ML predictions is pivotal when trying to solve a problem and reach an optimal decision:

“Imagine you give a dinner party and you predict ten guests. Now it’s a completely different matter if it could very well be one more guest showing up, or five more guests showing up. One more you could accommodate, five more you’d have to plan for. See, the uncertainty matters, it influences your decision making” — Fabian

Nevertheless, common practice focuses on expressing uncertainty through performance estimates such as precision, recall and AUC, rather than through errors, noise, intervals, limits and outliers, as is customary in physics. Fabian and Eva shared the mathematics of making statistical uncertainty estimates on individual predictions for models that minimise a loss function.
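To make the idea concrete, here is a minimal sketch of one standard way to attach an uncertainty to an individual prediction of a model fitted by maximum likelihood. It is not the WBAA package or its API: the function names, the synthetic data, and the choice of logistic regression fitted with Newton's method are all assumptions made for illustration. The covariance of the fitted weights is approximated by the inverse Hessian of the negative log-likelihood, and the delta method propagates it to each predicted probability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_data(n, seed=0):
    """Synthetic data (an assumption): intercept column plus one feature."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    w_true = np.array([-0.5, 1.5])
    y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)
    return X, y

def neg_log_lik_hessian(X, w):
    """Hessian of the negative log-likelihood: X^T diag(p(1-p)) X."""
    p = sigmoid(X @ w)
    return (X * (p * (1.0 - p))[:, None]).T @ X

def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood fit via Newton's method; returns weights and covariance."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ w) - y)  # gradient of the negative log-likelihood
        w = w - np.linalg.solve(neg_log_lik_hessian(X, w), grad)
    cov = np.linalg.inv(neg_log_lik_hessian(X, w))  # approximate weight covariance
    return w, cov

def predict_with_uncertainty(X_new, w, cov):
    """Predicted probability and its standard error (delta method)."""
    p = sigmoid(X_new @ w)
    jac = (p * (1.0 - p))[:, None] * X_new  # dp/dw for each row
    sigma = np.sqrt(np.einsum("ij,jk,ik->i", jac, cov, jac))
    return p, sigma

# Two query points: x = 0 (well covered by the data) and x = 3 (in the tail).
X_new = np.array([[1.0, 0.0], [1.0, 3.0]])
results = {}
for n in (200, 5000):
    w, cov = fit_logistic(*make_data(n))
    results[n] = predict_with_uncertainty(X_new, w, cov)
    print(n, results[n])
```

The error on each prediction shrinks as the training set grows, which is the kind of band that shows whether a predicted curve can be trusted. The same recipe applies in principle to other models that minimise a loss function, although the Hessian is then usually more expensive to obtain.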

“I’m really proud of the work that Fabian and I have done. Presenting it during such an event as PyData was a great opportunity to share it with the open source community. We focused on two different aspects of the project. Firstly, what is the approach that we have taken to calculate the uncertainties over the machine learning predictions and secondly, how can you use our package in real world data science applications by providing practical examples” — Eva

What is more, during their presentation Fabian and Eva:

  • Showed common approximations that can be made in the above-mentioned steps and that allow fast or even analytical calculations.
  • Gave examples of how a data scientist can use their package to incorporate uncertainty estimates in daily machine learning practice.
  • Explained how these uncertainties turn out for several common machine learning algorithms and how they visualise the resulting uncertainty estimates.

The audience was very enthusiastic. Many attendees wanted to start using the package right away and asked Fabian and Eva to elaborate on certain points after their presentation.

“At the moment the package is still under development. The aim is that the package is written in such a way that it can be used for different projects” — Fabian

You can access the Python package for the uncertainty tool and watch Fabian and Eva’s full presentation below:

Genetic Regular Expressions (Regex)

Regular expressions (regex) in Python are a mechanism for detecting patterns in data, for example in text or document classification tasks. This can be valuable for any business, for instance to detect unwanted content such as spam emails, or to classify whether the name on a bank account belongs to a company or a private individual.
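As a small illustration of how regular expressions can serve as classification features, the sketch below turns an account name into a vector of 0/1 flags, one per pattern. The patterns and the function name are invented for this example and are not taken from the work presented at PyData.

```python
import re

# Hypothetical patterns hinting that an account name belongs to a company.
COMPANY_PATTERNS = [
    r"\b(b\.?v\.?|n\.?v\.?|ltd|inc|gmbh)\b",  # legal-form suffixes
    r"\b(holding|group|stichting)\b",         # organisation keywords
    r"&",                                     # "X & Sons" style names
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in COMPANY_PATTERNS]

def regex_features(name):
    """Return one binary feature per pattern: 1 if it matches, else 0."""
    return [1 if rx.search(name) else 0 for rx in COMPILED]

print(regex_features("Jansen Holding B.V."))  # [1, 1, 0]
print(regex_features("J. de Vries"))          # [0, 0, 0]
print(regex_features("Smith & Sons Ltd"))     # [1, 0, 1]
```

Feature vectors like these can be fed to any standard classifier. Writing good patterns by hand is the laborious part, which is what motivates learning them automatically.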

ING WB Advanced Analytics data scientist Ahmet Erdem, together with his former thesis student Robin Bakker, presented their research on using genetic programming to automatically learn regular expressions for text classification tasks. They built the machinery to do this and showed that the approach is feasible.

Robin was enthusiastic to share what he had learned with the audience. He explained how the end-to-end machine learner he built together with Ahmet during his internship at ING WBAA combines genetic programming, which builds the regular expressions, with a classifier based on which regular expressions match.
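The general idea of evolving regular expressions can be sketched with a toy evolutionary search. This is not Ahmet and Robin's implementation: it is a simple genetic algorithm over alternations of fixed fragments rather than full tree-based genetic programming, and the data, fragments, and function names are all made up for illustration. Each candidate is a list of fragments joined with `|`, and its fitness is the accuracy of the rule "a match means spam".

```python
import random
import re

# Toy labelled data (invented): 1 = spam, 0 = not spam.
DATA = [
    ("win free money now", 1),
    ("free prize, click here", 1),
    ("you won $1000", 1),
    ("claim your prize today", 1),
    ("meeting moved to 3pm", 0),
    ("lunch tomorrow?", 0),
    ("please review the report", 0),
    ("see you at the office", 0),
]

# Building blocks that may be combined into an alternation a|b|c.
FRAGMENTS = ["free", "win", "won", "prize", r"\$\d+", "click",
             "meeting", "report", r"\d+pm", "you", "the"]

def fitness(genome):
    """Accuracy of the rule 'text matches the regex => spam'."""
    rx = re.compile("|".join(genome))
    hits = sum((rx.search(text) is not None) == bool(label)
               for text, label in DATA)
    return hits / len(DATA)

def mutate(genome, rng):
    """Randomly drop, replace, or add one fragment."""
    genome = list(genome)
    choice = rng.random()
    if choice < 0.4 and len(genome) > 1:
        genome.pop(rng.randrange(len(genome)))
    elif choice < 0.7:
        genome[rng.randrange(len(genome))] = rng.choice(FRAGMENTS)
    else:
        genome.append(rng.choice(FRAGMENTS))
    return genome

def crossover(a, b, rng):
    """Keep each fragment of either parent with probability 0.5."""
    child = [f for f in a + b if rng.random() < 0.5]
    return child or [rng.choice(a + b)]

def evolve(generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [[rng.choice(FRAGMENTS)] for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        elite = population[: pop_size // 4]  # survivors carried over unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = elite + children
    best = max(population, key=fitness)
    return best, fitness(best), history

best, best_fit, history = evolve()
print("|".join(best), best_fit)
```

Because the best candidates are carried over unchanged, the best fitness never decreases from one generation to the next. A real system would also have to guard against overfitting, for example by scoring candidates on held-out data.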

“This was my first time presenting at a conference. The presentation was related to the work I did during my internship at ING WBAA for my master’s thesis and was structured similarly to my defence. In the first section, I explained how regular expressions can be used as features for problems like spam detection (or PPI in the case of ING). In the next section, the process of the genetic algorithm finding suitable regular expressions was highlighted. Finally, in the last section, the resulting regular expression features and performance on the dataset was shown”.

The feedback Ahmet and Robin received from the audience was very positive. Many good questions were asked during Q&A, mostly regarding implementation details.

Watch their full presentation below:
