Machine Learning: a starter kit

Lewis Mills
WorldRemit
5 min read · Feb 13, 2018

This post is for you if:

1. You’ve identified a business problem.

2. You think Machine Learning could help you solve it.

3. You’re not a data scientist and you’ve no idea what to do next.

If you’ve not worked with Machine Learning before, you might imagine it’s all deep maths and giant brains pulsating in the cloud. (And it is. A bit.)

But way before the data scientist enters the room, the success or failure of a Machine Learning (ML) project can be determined by the focus and direction that you, as the Product Owner / Project Manager / General Busybody, bring to the table.

In this post, we’ll look at the key steps you can take towards getting your ML project off the ground before a single line of code is written.

Is your problem Machine Learning compatible?

Machine Learning can be a great solution to problems that have i) a static goal (or “output”), e.g. returning the best content to customer X, or predicting how much money to lend customer Y, but ii) ever-changing data (or “input”) that informs how you reach that goal, e.g. customer behavioural trends or fluctuations in the economy.

At WorldRemit, we’ve recently launched a Machine Learning model for fraud prevention. Fraud patterns change over time, some subtly and over long periods, others drastically in a matter of hours as new fraudsters enter the system and existing fraudsters change their approach.

Fraud, then, is a perfect use case for ML. Fraudsters may change their tactics (our varying input), but we always want to catch them in the act (our static output).

Having humans monitor for these minute changes in data patterns is costly and time-consuming, and it isn’t an effective or scalable approach for a rapidly growing business.
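To make that “static output, changing input” idea concrete, here’s a minimal sketch in Python. The column names and the 90-day window are purely illustrative; the point is that the target never changes, while the model is regularly refit on whatever the data looks like now.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def retrain_on_latest(transactions: pd.DataFrame) -> RandomForestClassifier:
    # Keep only the most recent 90 days so the model tracks current behaviour.
    cutoff = transactions["created_at"].max() - pd.Timedelta(days=90)
    recent = transactions[transactions["created_at"] >= cutoff]

    # The input data drifts over time, but the goal is always the same: catch fraud.
    X = recent[["amount", "account_age_days", "num_prior_transactions"]]
    y = recent["is_fraud"]
    return RandomForestClassifier(n_estimators=200).fit(X, y)
```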

Defining your terms

With WorldRemit’s fraud model, we wanted to ask the question “what is the likelihood that this transaction is fraudulent?” and receive a response in the form of a score (the higher the score, the more likely the transaction is fraudulent). Simple enough, right?
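In practice, that score is usually just a model’s predicted probability of fraud. A minimal sketch, assuming a scikit-learn-style classifier and toy data standing in for real transactions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy stand-in for historical transactions: X holds the features,
# y the labels (1 = confirmed fraud, 0 = legitimate, heavily imbalanced).
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.97], random_state=0)

model = GradientBoostingClassifier().fit(X, y)

# predict_proba returns one probability per class; the second column is the
# probability of fraud, which we can treat as the transaction's "score".
scores = model.predict_proba(X)[:, 1]
```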

But as we started to boil this question down it became increasingly complicated, until we were sitting around asking unhelpfully vague metaphysical-ish questions… “what is fraud anyway?”

This drilling down into your desired output is critical: the best model in the world will be useless if it’s not trained to target the result you actually want. It’s like training your model to hit the bullseye on the wrong dartboard.

In an early iteration of our model, we thought we had it cracked. We seemed to be detecting almost all fraud in our training set with only a tiny number of errors. But when we ran through our results with other stakeholders, the feedback was that our model had performed poorly.

The discrepancy was in how we defined (or “labelled”) fraud. We had trained the model on the assumption that fraudulent transactions were flagged in one way in our dataset, whereas in practice they were flagged in multiple different ways depending on the source of the fraud notification. Following internal protocols, our Loss Prevention Analysts entered certain types of fraud into the system in different ways; their methodology makes perfect sense for their needs, but we were oblivious to these nuances in their data entry when we started our project.

There really is nothing for it but to get your hands dirty and make sure i) your business understanding marries up with ii) your data labelling. Our original question — what is the likelihood that this transaction is fraudulent? — ended up being closer to what is the likelihood that this transaction will result in one of these seven data points being subsequently set to “true” in our database? as this behaviour reflected how our business was truly identifying transactions as confirmed fraud. It’s messy, but anything else would have been useless.
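As a hypothetical sketch of what that messier label looks like in code (the flag names below are invented stand-ins for the real data points):

```python
import pandas as pd

# Illustrative only: in reality there were seven such flags, recorded in
# different places depending on how the fraud was reported.
FRAUD_FLAGS = [
    "chargeback_confirmed",
    "analyst_marked_fraud",
    "partner_reported_fraud",
    # ...one column per way fraud gets recorded in practice
]

def build_label(transactions: pd.DataFrame) -> pd.Series:
    # A transaction counts as confirmed fraud if any of the flags
    # was subsequently set to true for that row.
    return transactions[FRAUD_FLAGS].any(axis=1).astype(int)
```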

Know what success looks like

In order to understand how a model would be applied in your context, you need to understand what good results look like. Again, in our fraud model example the success indicators seem intuitive: we want less fraud.

But it’s not so simple — you can stop all fraud if you decline all your transactions. So success for our model required working with the business to understand benchmarks of existing performance. For us, this meant understanding:

1. What percentage of fraud do we currently prevent successfully?

2. How much does this current baseline cost in overheads?

3. How much does it cost when we wrongly identify legitimate transactions as fraud?

4. What is our financial loss via fraud for different revenue channels?

It was through understanding these figures in detail, and applying real-world calculations of hours and expenses, that we could define our success metrics. These were robust but easy to share with stakeholders: we aim to increase fraud prevention by X, decrease false positives (i.e. wrongly identifying legitimate customers as fraudulent) by Y, and reduce overall financial loss through fraud by Z.
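As a rough sketch of how the X and Y parts of those metrics might be measured (the Z part would also need transaction amounts and operational costs):

```python
from sklearn.metrics import confusion_matrix

def fraud_metrics(y_true, y_pred):
    # y_true: 1 = confirmed fraud, 0 = legitimate; y_pred: the model's decision.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "fraud_caught_rate": tp / (tp + fn),    # share of fraud we prevent
        "false_positive_rate": fp / (fp + tn),  # legitimate transactions wrongly flagged
    }
```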

These metrics are invaluable when it comes to working with data scientists to help train and calibrate your model.

Existing business knowledge

Unless your problem is brand new, the chances are you’re not the first one to think about solving it. It’s likely that your colleagues have developed numerous manual, imperfect (“heuristic”) processes to combat the problem. These various methods and approaches are a great starting point to build up a series of data points (or “features”) to start training your model.

In our fraud model example, we were able to benefit greatly from the experience of our transaction monitoring specialists, whose heads are full of intimate knowledge of our customers and their behaviour patterns, knowledge they apply every day in the trenches of our operations.

We set up workshops with these domain experts, questioned them, chatted, live-streamed scenarios and took copious notes on their thoughts and approaches. We then refined these notes into a series of data points or behaviours (“features”), such as the actions a customer performs on the website before completing their purchase. These served as a starting point for training our prospective model.
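As a hypothetical example, one such heuristic-led feature set (“what did the customer do on the site before paying?”) might be computed like this; the table and column names are invented for illustration:

```python
import pandas as pd

def session_features(events: pd.DataFrame) -> pd.DataFrame:
    # events: one row per website action, with columns
    # ["transaction_id", "action", "timestamp"] (timestamp as a datetime).
    grouped = events.sort_values("timestamp").groupby("transaction_id")
    return pd.DataFrame({
        "n_actions_before_purchase": grouped.size(),
        "n_payment_method_changes": grouped["action"].apply(lambda a: int((a == "change_payment_method").sum())),
        "session_duration_seconds": grouped["timestamp"].apply(lambda t: (t.max() - t.min()).total_seconds()),
    })
```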

Ready to get algebraic

When you’ve completed the above, you’ll have a starter kit for your machine learning project:

1. A problem that is ripe for a Machine Learning solution.

2. A clear understanding of what you want to look for in your data (your “labels”).

3. A business-synced understanding of what success looks like (your “metrics”).

4. A solid starting point for a first iteration (your heuristic-led “features”).

So now you’re ready to bring in the data scientist, or whatever kind of genius you have to hand, and start training your model. Now that you have an informed starting point on what you want to achieve and some fleshed-out assumptions of how you think the model should do it, you’ll be able to work intelligently with a data scientist to begin to sculpt your model.


Lewis Mills is Head of Product at WorldRemit for Fraud, Friction, Compliance and Payments.