No man is an island

How to use machine learning for fraud (or outlier) detection

In the telecoms industry, abuse of service has existed for ages. In 2017, it accounted for US$6bn in annual losses [source: CFCA]. This abuse, also known as “superimposed fraud” (or International Revenue Share Fraud), is closely related to hacking. Here is how it happens: a valid user account is impersonated (via phishing, a weak password, SIM card cloning, etc.) and used remotely by a third party. Unfortunately, the real user will only notice they’ve been hacked — and panic — when they receive the bill. Fortunately, as with credit card fraud, hacked users aren’t liable. Banks and telecommunication service providers have a responsibility to ensure the security of their users.


Patterns change all the time

Patterns change all the time, working around the existing security layers and defenses. Static detection rules can hardly cope with the changes, let alone anticipate them. So static defenses have been reactive rather than proactive — and they are still not effective.

To the rescue: enter machine learning — unsupervised! Years of research, from AT&T, INRIA and the international research community, are finally finding their way into production software, and as of 2018 the technology to detect outliers has become accessible. It’s now easy to spot suspicious, and therefore possibly fraudulent, behavior. You can do it too.

Patterns on user behavior are more deterministic than one would imagine. We’re not robots yet, but we’re nonetheless driven by our regular activities and habits every day. These activities, such as browsing web pages, buying something online and checking social media, are all measurable as patterns.

Let’s see how machine learning helps in detecting patterns. Spoiler alert: no programming required at all!


First things first: what is user behavior?

Think of three (or any number of) quantities that are measurable for each user. We will hold these quantities in a vector, and later assign it to an opaque user ID.

If we go back to the previous telecommunications example, let’s think about five quantities:

  • Total service usage (or cost), for regular national calls, summed over the past seven days
  • Total service usage (or cost), for international calls, summed over the past seven days
  • Total service usage (or cost), for premium (VAS) calls, summed over the past seven days
  • SMS service usage summed over the past seven days
  • Data plan usage summed over the past seven days

At any given point in time, it’s possible to aggregate this information and apply machine learning. We will do it with the Loud ML fingerprints module, which is based on clustering techniques.

Data for this tutorial will be stored in the TICK stack from InfluxData, but could be stored in MongoDB or Elasticsearch with the same output. We will use classic Python functions to plot the output.
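To make the fingerprint idea concrete, here is a minimal sketch in plain Python (with hypothetical user IDs and amounts, not real billing data) of how raw billing events could be summed into one five-dimensional vector per user:

```python
from collections import defaultdict

# Hypothetical raw billing events: (userID, class, cost), already
# filtered to the past seven days.
# class: 1=national, 2=international, 3=premium, 4=SMS, 5=data plan.
events = [
    ("user-42", 1, 2.50),
    ("user-42", 2, 11.00),
    ("user-42", 4, 0.10),
    ("user-77", 1, 1.20),
    ("user-77", 5, 4.80),
]

def build_fingerprints(events, n_classes=5):
    """Sum cost per (user, class) into a fixed-length vector per user."""
    fingerprints = defaultdict(lambda: [0.0] * n_classes)
    for user_id, cls, cost in events:
        fingerprints[user_id][cls - 1] += cost
    return dict(fingerprints)

print(build_fingerprints(events))
```

Each vector is one fingerprint; Loud ML builds the equivalent aggregation for us directly from the database.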


Preparing the data-set — structured data helps a lot!

There’s nothing to do really, if the data is already stored in a database. Let’s assume the database name is my_really_awesome_database, and has:

  • A key (or tag) userID to split the data relevant to each subscriber
  • A timestamp field: when the billable event occurred
  • A cost field that holds the billing information relevant to service usage
  • A class field that contains a number: 1, 2, 3, 4 or 5. To keep it simple, we will assume 1=national, 2=international, 3=premium, 4=SMS usage, 5=data plan usage. Note: other data partitions can be used too!
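If the data lives in InfluxDB, each billable event maps naturally onto one line-protocol point. A sketch, assuming a hypothetical measurement named billing (in InfluxDB the timestamp field is simply the point’s own timestamp):

```
billing,userID=abc123 cost=0.42,class=1 1517443200000000000
billing,userID=abc123 cost=1.90,class=2 1517443260000000000
billing,userID=def456 cost=0.05,class=4 1517443320000000000
```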

Using Loud ML for painless machine learning

Let’s use Loud ML, Enterprise edition. The Enterprise edition provides the fingerprints toolset we need in this tutorial. Loud ML installation instructions are available here.

Creating the machine learning model

We can use the Linux CLI to create the model:

loudml create-model behavior.json

The interesting part is contained in behavior.json

Let’s take a deeper look at each section and its meaning:

  • “type”: it is a clustering model
  • “span”: each fingerprint will span 7 days
  • “key”: the primary key, used to split the different fingerprints
  • “default_datasource”: the database that contains up-to-date billing information that will be crunched by this model
  • “aggregations”: to define the quantities contained in each fingerprint
  • “measurement”: the table of interest, in the database
  • “field”: the field of interest, in the table
  • “metric”: our first quantity, named “cost-1”, is defined as sum(cost) where class=1
  • “anomaly_type”: we want to highlight outliers with unexpectedly high service usage
  • “match_all”: filtering relevant data
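Putting the bullets together, behavior.json might look roughly like the sketch below. This is a reconstruction from the descriptions above, not the verbatim file; the measurement name, tag values and exact schema are assumptions, so check the Loud ML documentation for the authoritative format:

```json
{
  "name": "behavior-model",
  "type": "fingerprints",
  "key": "userID",
  "span": "7d",
  "interval": "60s",
  "default_datasource": "my_really_awesome_database",
  "aggregations": [
    {
      "measurement": "billing",
      "field": "cost",
      "metric": "sum",
      "name": "cost-1",
      "anomaly_type": "high",
      "match_all": [{"tag": "class", "value": 1}]
    }
  ]
}
```

The remaining four quantities (cost-2 through cost-5) would repeat the aggregation block with their own class filter.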

Training the model

We will use the CLI to train the model using one month of historical data already available in the database:

loudml train behavior-model --from "2018-01-01T00:00:00.000Z" --to "2018-02-01T00:00:00.000Z" -l 1000

This selects one thousand random users (-l 1000), automatically clusters the different profiles using deep neural network clustering, and saves the training output to a model file.

Data visualization

We can use the CLI to compare the training state with the clustering state at any given point in time — past or present. Here we’re using one week of data, from March 2017:

loudml predict behavior-model --from "2017-03-01T00:00:00.000Z" --to "2017-03-08T00:00:00.000Z"

If we were to show this info in a graph, it would look like this (tip: this plot was created using Seaborn):

But wait, our data had five dimensions, right? cost-1 through cost-5, and it could have had more. Yet we’re viewing the data in three dimensions (x axis, y axis, and z for gradients showing user clusters). This is called dimensionality reduction. Dimensionality reduction is super convenient since we, as humans, still can’t process data in 100 dimensions! Machine learning can, and it can also deliver the results in a format that is more meaningful to us.
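The projection itself is only a few lines of linear algebra. Here is a minimal PCA sketch in plain NumPy (standing in for whatever reduction produced the plot, and using synthetic data rather than real billing records):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 1000 users x 5 quantities (cost-1 ... cost-5).
X = rng.normal(loc=10.0, scale=2.0, size=(1000, 5))

def pca_2d(X):
    """Project rows of X onto their top two principal components."""
    Xc = X - X.mean(axis=0)                       # center each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                          # (n_users, 2) coordinates

coords = pca_2d(X)
print(coords.shape)  # (1000, 2)
```

Feeding coords to a 2-D scatter plot (seaborn.scatterplot, for instance) reproduces the kind of view described above, with density standing in for the dropped dimensions.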

No man is an island. The graph reveals high-density areas where user behaviors are the same. An outlier (plotted as a red triangle in the above graph) is a data point that lives on its own island or in a low-density area. It is easy to spot and take action! The nice thing is we did not have to define an actual numerical value for “high”, or “low”, at all! ML learns this info autonomously — saving time and effort.
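To make the “low-density island” idea concrete, here is a toy nearest-neighbour sketch (an illustration of the principle, not Loud ML’s actual algorithm): a point whose distance to its k-th nearest neighbour is far above the typical value has no dense neighbourhood around it.

```python
import numpy as np

def kth_neighbour_distance(X, k=5):
    """Distance from each point to its k-th nearest neighbour."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d_sorted = np.sort(d, axis=1)
    return d_sorted[:, k]  # column 0 is the point itself (distance 0)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))            # a dense cluster of "normal" users
X = np.vstack([X, [[25.0] * 5]])         # one user far from everyone else

dist = kth_neighbour_distance(X)
threshold = np.median(dist) * 3          # crude threshold, illustration only
outliers = np.where(dist > threshold)[0]
print(outliers)  # includes index 200, the appended point
```

Notice that the threshold is relative to the population itself, which is the same spirit as above: no hand-picked numerical definition of “high” or “low” usage.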

Adding the most important dimension: time!

It’s time to take action. Patterns for abnormal use of services are not static: they change! In comparison, “normal” patterns are unlikely to change over time, which means we can benefit from dimensionality reduction to take the focus away from the normal patterns and concentrate on the abnormalities, whenever they may occur. Unsupervised learning saves resources as a result. To detect abuses and unexpected changes in patterns before it’s too late, it is important to take action in near real time. Let’s do it, with one extra step!

Live data hits the database in near real time. A deployment of the above ML model with Loud ML will repeat the above operations continuously on a schedule defined by the “interval” parameter in behavior.json and detect suspicious activities for further investigations.

To summarize all this information, we have put together a slide deck tutorial.

We hope you enjoyed reading this article. Let us know your questions, comments and ideas to enhance Loud ML with new exciting capabilities!

Next article: can you guess what happens if we only change “high” to “low” in the above example? What will it help spot? Please post your ideas on Twitter and receive cool Loud ML goodies!