Building an Automated Detection of Anomalous RFID Behaviour for a Smart Intercom
Monitoring multiple custom-built intercoms can become a tedious process: a daily routine of checking their state and anxiously waiting for a Slack notification saying that one has crashed and is blocking access to a building.
However, the real problem occurs when the monitoring system fails to detect a malfunction in the device. The time needed to notice the error then grows drastically. The intercoms have two main methods of permitting access: calling a person or using an RFID card. We decided to start with detecting when the RFID system is down, because our initial analysis of the problem showed that it has a higher tendency to malfunction.
After having the devices running for a couple of months we had access to a decent amount of data.
Let’s find something in that data!
The gathered data was a list of timestamped events. As this form is not ideal for analysis, we decided to group the events into 15-minute-wide bins and count the number of occurrences in each. This provided us with the first feature found in the data and allowed us to plot the time series shown in Diagram 1.
Looking at the visualisation put us on track to find the second feature. We noticed that there was a significant difference in activity between nighttime and daytime, as one would expect. However, the question remained of how long a period the model should cover: daily, weekly, monthly or maybe yearly?
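The binning step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code we run in production; the function names and the sample timestamps are hypothetical, and the rounding assumes the bin width divides 60 minutes evenly.

```python
from collections import Counter
from datetime import datetime, timedelta

BIN_MINUTES = 15  # bin width used in the article


def bin_start(ts: datetime, minutes: int = BIN_MINUTES) -> datetime:
    """Round a timestamp down to the start of its bin (minutes must divide 60)."""
    floored_minute = ts.minute - ts.minute % minutes
    return ts.replace(minute=floored_minute, second=0, microsecond=0)


def count_per_bin(timestamps, minutes: int = BIN_MINUTES):
    """Return a list of (bin_start, count) pairs, including empty bins."""
    counts = Counter(bin_start(ts, minutes) for ts in timestamps)
    if not counts:
        return []
    series = []
    current, last = min(counts), max(counts)
    while current <= last:
        # Bins without any event still appear, with a count of 0.
        series.append((current, counts.get(current, 0)))
        current += timedelta(minutes=minutes)
    return series


# Hypothetical RFID event timestamps, for illustration only.
events = [
    datetime(2019, 3, 4, 8, 1),
    datetime(2019, 3, 4, 8, 7),
    datetime(2019, 3, 4, 8, 44),
]
series = count_per_bin(events)
```

Keeping the empty bins matters here, because the bins with zero events are exactly the ones the anomaly detection cares about.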
- A daily model would not capture the differences between weekdays and weekends.
- A weekly model manages to capture weekends, but it will not know about holidays.
- A monthly model does not, in fact, provide us with much extra information over the weekly model.
- A yearly model could potentially handle most variations, but it requires a lot of data and could easily overfit. Also, at the time of writing this article we only had a few months of data.
Based on the pros and cons of each method we decided to stick with the weekly model.
Coming up with the first prototype
Having these two features:
- The number of events per 15 minutes
- The day of the week
we started thinking about a predictive model, which would fit the data.
By definition, the Poisson distribution describes the probability of observing n events in a fixed period of time. This sounds a lot like what we are doing when we count the number of events that occurred over 15 minutes. So we decided to give it a shot.
The Poisson distribution takes a single parameter, λ, which is equal to the expected value of the distribution. This gave us an analytical way to fit the model: simply calculate the average count for each 15-minute bin of the week. As a result, we obtained an array of rates that gave us a probability mass function for observing n events at any point in time.
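Fitting the weekly model then amounts to averaging the counts per weekly slot. The sketch below assumes the input is a list of (bin start, count) pairs; the helper names are hypothetical and the sample counts are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

BIN_MINUTES = 15
BINS_PER_DAY = 24 * 60 // BIN_MINUTES  # 96 bins per day
SLOTS_PER_WEEK = 7 * BINS_PER_DAY      # 672 slots per week


def weekly_slot(bin_start: datetime) -> int:
    """Index of a bin within the week (Monday 00:00 is slot 0)."""
    minute_of_day = bin_start.hour * 60 + bin_start.minute
    return bin_start.weekday() * BINS_PER_DAY + minute_of_day // BIN_MINUTES


def fit_weekly_rates(series):
    """Average the counts seen in each weekly slot; the mean is the Poisson rate."""
    totals = defaultdict(int)
    seen = defaultdict(int)
    for bin_start, count in series:
        slot = weekly_slot(bin_start)
        totals[slot] += count
        seen[slot] += 1
    # Slots with no observations are left as None rather than guessed.
    return [totals[s] / seen[s] if seen[s] else None
            for s in range(SLOTS_PER_WEEK)]


# Two hypothetical Mondays at 08:00 with 4 and 6 events.
history = [
    (datetime(2019, 3, 4, 8, 0), 4),
    (datetime(2019, 3, 11, 8, 0), 6),
]
rates = fit_weekly_rates(history)
```

The sample mean is the maximum-likelihood estimate of the Poisson rate, which is why no iterative training is needed.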
To turn an observation of n events into a score we used the survival function, which is 1 minus the cumulative distribution function (CDF), i.e. the probability of observing more than n events. The CDF grows with the number of events until it reaches 1, so the survival function is largest when the observed count is low. We use the survival function because we are more interested in a lack of events than in observing too many.
Having a model, we couldn’t wait to see if it worked, so we let it run on the remaining data. The points in time that we expected to be anomalies lit up like a Christmas tree! However, there were also multiple points where the probability was high although the device was clearly functioning normally. We decided that this would require further improvement.
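For reference, the Poisson survival function can be computed directly from the probability mass function. The sketch below uses only the standard library; the function names are ours, and a statistics package would normally provide the same quantity.

```python
import math


def poisson_cdf(n: int, lam: float) -> float:
    """P(X <= n) for X ~ Poisson(lam)."""
    term = math.exp(-lam)  # P(X = 0)
    total = term
    for k in range(1, n + 1):
        term *= lam / k    # P(X = k) computed from P(X = k - 1)
        total += term
    return min(1.0, total)


def poisson_sf(n: int, lam: float) -> float:
    """Survival function P(X > n) = 1 - CDF(n)."""
    return 1.0 - poisson_cdf(n, lam)


# With 5 events expected, seeing none is far more suspicious than seeing 3.
score_for_zero = poisson_sf(0, 5.0)
score_for_three = poisson_sf(3, 5.0)
```

Note that a slot with an expected rate of 0 always scores 0, so quiet periods such as the middle of the night never raise the score on their own.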
Back to the drawing board
Having a specific type of anomaly in mind, we decided that the next step would be to characterise the anomalous activity. A malfunctioning RFID system should not create any events, since it has stopped working. Hence, we should only be interested in cases where the activity was 0.
This condition excludes most cases where the activity simply drops below the usual value. However, what if a 0 was also only a momentary drop? It could be caused, for example, by a late bus that all the users take. After some time the bus will arrive and activity will resume. Looking at the data as independent points in time won’t let us determine that. To solve the problem, let’s classify using more than one point.
But waiting for another 15 minutes would increase the time of detection to 30 minutes! To solve this we decided to shorten the bin length to 5 minutes and, if we detect an anomalous drop in activity, to look at the next two time periods. This allows us to dismiss false alarms significantly faster. Let’s consider the example with the bus. If it was late by 15 minutes, then within the first 15 minutes there would be 0 activity, but 5 minutes later some activity would be observable. With 15-minute bins it would take 30 minutes to realise that everything is working properly; with 5-minute bins only 20 minutes would be required.
Great! But we had one more dilemma. If we are looking into the “future”, does the first value truly need to equal zero? After all, the device could have been broken for 4 minutes of a 5-minute bin and still have caught an event while it was working. So the idea was to allow the first value to be non-zero.
This resulted in a recurrent architecture where the prediction value is influenced by the value of the previous predictions.
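One way such a rule could look is sketched below. The exact combination is an assumption made for illustration: each bin is scored from a window of three 5-minute bins, every bin except the oldest must be empty, and the per-bin survival values are multiplied. The names `window_scores` and `poisson_sf` and the sample numbers are hypothetical.

```python
import math


def poisson_sf(n: int, lam: float) -> float:
    """P(X > n) for X ~ Poisson(lam), computed as 1 - CDF(n)."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for k in range(1, n + 1):
        term *= lam / k    # P(X = k) from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def window_scores(counts, rates, window: int = 3):
    """Assumed rule: score bin t from the last `window` bins.

    Every bin except the oldest one in the window must be empty; the oldest
    may hold events. The score is the product of the per-bin survival values.
    """
    per_bin = [poisson_sf(n, lam) for n, lam in zip(counts, rates)]
    scores = []
    for t in range(len(per_bin)):
        start = t - window + 1
        if start < 0 or any(counts[i] != 0 for i in range(start + 1, t + 1)):
            scores.append(0.0)  # not enough history, or recent activity seen
        else:
            scores.append(math.prod(per_bin[start:t + 1]))
    return scores


# The late-bus example with hypothetical numbers: 2 events expected per bin.
counts = [3, 0, 0, 0, 4]
rates = [2.0] * len(counts)
scores = window_scores(counts, rates)
```

In this sketch the score grows while the silence continues and drops back to 0 as soon as the bus arrives and activity resumes.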
Let’s test it!
Having a testing environment is a key factor when we want to determine the quality of the algorithm. However, testing unsupervised algorithms can become tricky as it renders most classical validation methods useless.
For this reason, we designed a test scenario specific to our data. Using the Poisson distribution we generated a random weekly activity series. This series was then treated as anomaly-free. On the other hand, we created a perfectly anomalous case, an array of 0s.
We ran the classification algorithm on both series and obtained a distribution of predictions for each case. Thinking about what the result should be, we came up with the following predictions:
- The anomaly-free series will mostly be assigned a 0 probability of an anomaly. The number of predictions higher than 0 will be small.
- The anomalous series will have some cases predicted as non-anomalous (for example at 2 am, when no one ever uses the device); however, the number of non-zero predictions will be significantly higher than in the anomaly-free series.
That is exactly what happened! Although these were perfect case scenarios they allowed us to create a metric with which we can check if the model is working on a particular device.
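The test scenario can be reproduced in miniature as follows. Everything here is an illustrative assumption rather than our production code: the weekly rate profile is invented, the scoring rule is the three-bin window variant (only the oldest bin may hold events), and the sampler is Knuth's multiplication method.

```python
import math
import random

BINS_PER_DAY = 24 * 60 // 5  # 5-minute bins


def poisson_sf(n, lam):
    """P(X > n) for X ~ Poisson(lam)."""
    term = math.exp(-lam)
    cdf = term
    for k in range(1, n + 1):
        term *= lam / k
        cdf += term
    return max(0.0, 1.0 - cdf)


def window_scores(counts, rates, window=3):
    """Assumed rule: all bins but the oldest in the window must be empty."""
    per_bin = [poisson_sf(n, lam) for n, lam in zip(counts, rates)]
    scores = []
    for t in range(len(per_bin)):
        start = t - window + 1
        if start < 0 or any(counts[i] != 0 for i in range(start + 1, t + 1)):
            scores.append(0.0)
        else:
            scores.append(math.prod(per_bin[start:t + 1]))
    return scores


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def nonzero_fraction(scores):
    return sum(1 for s in scores if s > 0.0) / len(scores)


# Hypothetical weekly profile: no activity before 06:00, 3 events per bin after.
rates = [0.0 if (slot % BINS_PER_DAY) < 6 * 12 else 3.0
         for slot in range(7 * BINS_PER_DAY)]

rng = random.Random(42)  # fixed seed so the run is repeatable
normal_counts = [sample_poisson(lam, rng) for lam in rates]  # anomaly-free
anomalous_counts = [0] * len(rates)                          # device is dead

normal_fraction = nonzero_fraction(window_scores(normal_counts, rates))
anomalous_fraction = nonzero_fraction(window_scores(anomalous_counts, rates))
```

Comparing `normal_fraction` with `anomalous_fraction` gives a quick sanity check: if the two values are close for a particular device, the model cannot tell a dead reader from a healthy one there.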
Finally, using the anomalous distribution, we could derive the thresholds at which to send an alert about anomalous behaviour. When calculating the thresholds, the predictions equal to 0 were omitted. We decided to use two notification levels: warning and critical. The warning level was defined at the 20th percentile and the critical level at the 80th percentile.
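Deriving the two thresholds can be sketched with the standard library's `statistics.quantiles`. The helper names and the sample scores are hypothetical, and the sketch assumes that a higher score means a more suspicious observation.

```python
import statistics


def alert_thresholds(anomalous_scores):
    """Warning/critical thresholds from the non-zero anomalous scores."""
    nonzero = sorted(s for s in anomalous_scores if s > 0.0)
    # quantiles(..., n=100) returns the 1st..99th percentile cut points.
    cuts = statistics.quantiles(nonzero, n=100, method="inclusive")
    return cuts[19], cuts[79]  # 20th and 80th percentiles


def alert_level(score, warning, critical):
    """Map a score to a notification level (assumes higher = more anomalous)."""
    if score >= critical:
        return "critical"
    if score >= warning:
        return "warning"
    return "ok"


# Hypothetical scores from the all-zeros series, for illustration only.
scores = [0.0, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
warning, critical = alert_thresholds(scores)
level = alert_level(0.5, warning, critical)
```

Dropping the zeros first keeps the quiet night-time slots from dragging both thresholds down to 0.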
Let’s put that bad boy to work
For this purpose, we designed a pipeline for detecting anomalies.
Using this structure we will be able to add further models to the project, such as one for detecting anomalies in card usage.
We are launching the monitoring system and will keep fine-tuning it over time. We hope that the maintenance of many devices will become significantly simpler, less time-consuming and far less costly. Provided that it gives us the desired effects, we will look into further automation of the monitoring process. So stay tuned for more insight into how we engineer our products.