Anomaly detection in the OpenGaming System platform using Machine Learning

The Light & Wonder Tech Blog
Nov 18, 2022


By Albin Abel, Advanced Senior Software Engineer, OGS Development

Introduction

An “anomaly” can be considered any change in pattern, such as the frequency at which an event occurs. Detecting anomalies plays a critical role in a high-volume distributed software platform such as the Light & Wonder OpenGaming System (OGS), a casino aggregation system. An anomaly in a single location can have a cascading impact on many integrated platforms, so early detection is vital if Light & Wonder is to offer the level of service customers expect.

Many observability platforms offer built-in machine-learning solutions for anomaly detection. An observability platform provides insight into the health and performance of an application through its logs and metrics. To better understand these offerings, Light & Wonder set up its own Long Short-Term Memory (LSTM) autoencoder model to see whether it could predict anomalies based on a feed of OGS gaming data.

Why is anomaly detection important?

OGS is middleware that facilitates gaming transactions between game developers and licensed gaming operators. Light & Wonder has a large number of external systems integrated on both sides of the network, which makes detecting and tracking anomalies in individual parts both noisy and complex.

Pre-determined rules (e.g., gameplay for system X dropped by Y%) can help, but it is hard to strike the right balance between falsely detecting anomalies and missing issues completely. With machine learning, Light & Wonder can use a history of previous data and events to identify when anomalies are starting to occur and raise the alarm earlier and more accurately.

Experimental Machine Learning Architecture

Light & Wonder chose to use a custom LSTM model in this experiment. An LSTM is an artificial neural network used in the field of deep learning that is particularly effective at making predictions based on time-series data. There are many resources describing LSTM models and their workings, so details are not provided here.

Rather than using a single source of data to detect anomalies, we trained the model with several streams of OGS data, combined into a single multivariate feature set (as sketched after the list below):

· Transactional volumes.
· Service response time latency.
· Percentage Service Errors.
· Percentage Transactional rollbacks (cancelled gameplays, usually as a result of a problem).
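How these streams are combined is not spelled out in the post, so here is a minimal sketch of one way to align them into a single per-timestamp feature frame. The column names and the use of pandas time-indexed series are illustrative assumptions, not the actual OGS schema.

import pandas as pd

def build_feature_frame(wagers, latency, errors, rollbacks) -> pd.DataFrame:
    """Align four time-indexed series (assumed datetime index) into one frame."""
    df = pd.DataFrame({
        "wager_volume": wagers,      # transactional volumes
        "latency_ms": latency,       # service response time latency
        "error_pct": errors,         # percentage service errors
        "rollback_pct": rollbacks,   # percentage transactional rollbacks
    })
    # Fill gaps created by aligning streams sampled at slightly different times
    return df.sort_index().interpolate().ffill().bfill()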

The data was detrended to remove the expected daily fluctuations in traffic on Light & Wonder’s system (e.g., peaks in the evenings, troughs overnight/early morning) and focus on the local trends.
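The post does not specify the exact detrending method, so the following is a sketch of one common approach (subtracting a rolling daily mean); the window size is an assumption.

import pandas as pd

def detrend(series: pd.Series, window: str = "1D") -> pd.Series:
    """Subtract the slow daily trend so local deviations stand out.
    Assumes a monotonic datetime-indexed series."""
    trend = series.rolling(window, min_periods=1).mean()
    return series - trend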

Light & Wonder used an LSTM model with the following set-up (a minimal sketch follows the list):

· 2 layers of memory units (16 memory units in the 1st, 8 in the 2nd) for both the encoder and decoder.
· A repeated vector layer between the encoder/decoder
· A time distributed dense layer
· Dropout rate of 0.2
· ReLU activation function
· The Adam optimizer used during training
· A validation split of 0.3
· Predict the next 2 seconds based on the past 30 seconds (time steps)
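The sketch below shows how this set-up might look in Keras. It is not the production model: the feature count, dropout placement, MAE loss, and training hyper-parameters beyond those listed above are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

LOOKBACK = 30    # past 30 time steps (seconds) used as input
HORIZON = 2      # predict the next 2 seconds
N_FEATURES = 4   # wagers, latency, errors, rollbacks

model = keras.Sequential([
    keras.Input(shape=(LOOKBACK, N_FEATURES)),
    # Encoder: 2 LSTM layers with 16 and 8 memory units
    layers.LSTM(16, activation="relu", return_sequences=True),
    layers.Dropout(0.2),
    layers.LSTM(8, activation="relu"),
    # Repeated vector layer bridges the encoder and decoder
    layers.RepeatVector(HORIZON),
    # Decoder (mirrored here; the exact ordering is an assumption)
    layers.LSTM(8, activation="relu", return_sequences=True),
    layers.Dropout(0.2),
    layers.LSTM(16, activation="relu", return_sequences=True),
    # Time-distributed dense layer produces one output vector per time step
    layers.TimeDistributed(layers.Dense(N_FEATURES)),
])

model.compile(optimizer="adam", loss="mae")

# X: (samples, LOOKBACK, N_FEATURES) windows of detrended, scaled data
# y: (samples, HORIZON, N_FEATURES) the 2 time steps that follow each window
# history = model.fit(X, y, epochs=50, batch_size=64, validation_split=0.3)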

During training, Light & Wonder could see that the model's predictions for both the training and validation data became increasingly accurate over time. This suggested that the model was successfully learning the expected patterns in the input data.

Light & Wonder’s target was to see if this newly trained model could accurately predict gameplay anomalies within a few minutes of them first appearing.

To do this, Light & Wonder needed to decide on a threshold for the mean absolute error (MAE) between predicted and observed values that would trigger an alert. Light & Wonder's training/validation data contained no anomalies (a semi-supervised approach), just natural variation. When Light & Wonder inspected the “Mean Error Distribution” for this data, it could see that this value rarely exceeded 1.4, so that was set as an initial threshold.
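A minimal sketch of that alerting step is shown below: compute the MAE between predicted and observed windows and flag any window whose error exceeds the 1.4 threshold. The function and variable names are illustrative.

import numpy as np

THRESHOLD = 1.4  # taken from the mean error distribution of anomaly-free data

def detect_anomalies(model, X, y_true, threshold=THRESHOLD):
    """Return a boolean mask marking windows whose prediction error exceeds the threshold."""
    y_pred = model.predict(X)
    mae = np.mean(np.abs(y_pred - y_true), axis=(1, 2))  # one error value per window
    return mae > threshold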

Model Results

Light & Wonder fed streams of production data from periods where incidents had occurred into the model to see how effective it was at identifying issues; two of these are shown below. For each test data set, we include graphs that show:

· How far the wager, latency, and error data deviated from the values expected by the model. The red line shows the threshold at which the deviation exceeds the acceptable level and an alert should be triggered.
· The wagering activity over that period. Red dots on this graph show when the model would identify an anomaly and trigger alerts.
· The subsequent graphs show the scaled latency, error, and rollback volumes for the same period.

As you can see, the model was able to accurately identify the time periods where there were gameplay anomalies and did not raise issues when no anomalies were occurring. This demonstrates that standard machine-learning techniques can effectively identify problems on Light & Wonder's systems and could be used as an alternative to hand-crafted rules.

Stay tuned for a follow-up blog post showing how a classification model can be used when anomalies occur to automatically identify the root cause, potentially automate the resolution, and simplify the lives of the 24x7 support teams monitoring the platform.

The opinions expressed in this blog post are strictly those of the author in their personal capacity. They do not purport to reflect the opinions or views of Light & Wonder or of its employees.

WE’RE HIRING

Find out more about life at Light & Wonder and the roles we are looking to fill: https://igaming.lnw.com/careers/
