Testing Behaviour-Based Authentication

Stephan Schultz
Apr 6, 2018 · 6 min read
Pre-release build of the BAuth app

Insights into how we measure the performance of BAuth, the new authentication method in development at neXenio in cooperation with the Bundesdruckerei and the Hasso Plattner Institute. If you haven’t heard of BAuth before, you may want to start with this overview post.

As you might have guessed, the performance of BAuth is key. If it can’t distinguish users, it can’t serve as an authentication factor. We need to make sure that our classifiers operate at the lowest possible time and space complexity while keeping the error rate close to zero.


As mentioned in the last post, there are strict privacy requirements that BAuth has to fulfill. We enforce that no personal data ever leaves the user’s device, for obvious reasons.

BAuth’s opt-in preference to share data, available in pre-release builds

Privacy as a requirement adds complexity for machine learning. We can only use data from one user for training and cannot compare it to data from other users, a setting that is referred to as one-class classification. The fact that we don’t have access to our users’ data also makes it hard to evaluate the performance of our classifiers:

  • How well are they trained?
  • How well can they detect the real user (true positives)?
  • How well can they detect fraud (true negatives)?
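
To make the one-class constraint concrete, here is a minimal sketch in Python. It is not BAuth’s classifier: the class name, the two-dimensional feature vectors and the distance-based threshold are all assumptions made for illustration. The point is only that the model is fitted on samples from a single user and has to decide on its own what "close enough" means.

```python
import math
from statistics import mean


class OneClassCentroidClassifier:
    """Toy one-class classifier: accepts samples near the training centroid.

    Hypothetical illustration only, not the classifier used in BAuth.
    """

    def __init__(self, tolerance=1.5):
        # tolerance scales the largest training distance into a threshold
        self.tolerance = tolerance
        self.centroid = None
        self.threshold = None

    def fit(self, samples):
        # samples: list of equal-length feature vectors from ONE user
        dims = len(samples[0])
        self.centroid = [mean(s[d] for s in samples) for d in range(dims)]
        distances = [math.dist(s, self.centroid) for s in samples]
        self.threshold = self.tolerance * max(distances)
        return self

    def accepts(self, sample):
        # True if the sample lies within the learned distance threshold
        return math.dist(sample, self.centroid) <= self.threshold


# Made-up feature vectors recorded from the device owner only
owner_samples = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [1.1, 2.2]]
clf = OneClassCentroidClassifier().fit(owner_samples)

owner_accepted = clf.accepts([1.05, 2.0])    # similar to the training data
stranger_accepted = clf.accepts([3.0, 0.5])  # far away from the training data
```

Note that no negative examples are involved at any point, which is exactly why the three questions above are hard to answer on the device itself.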

Test Environment

Of course we have unit tests to make sure that our algorithms work as intended. But unit testing single methods doesn’t help us verify how the whole system works together. This would require simulating user behaviour over time in order to see how BAuth operates. So we’ve spent a whole lot of time setting up a quite fancy way to do just that:

BAuth aggregates sensor data from the device’s hardware sensors in the background. This data is used for feature extraction. We have development and pre-release builds of the app that allow us to record and export this raw data if some given conditions are met. We thus have recordings from different users that contain samples of them performing some given activities unsupervised.

A recording of raw sensor data while a user was walking

We can use these recordings in our test environment. We trick the system into thinking that the previously recorded sensor data is currently being aggregated, simulating the exact same behaviour that was performed during recording.

We remap the timestamps of the recorded data and use Robolectric to mock the system clock, allowing us to replay minutes of recording in just a few milliseconds.
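
The idea of remapping timestamps and driving a mocked clock can be sketched in a few lines of Python. This is not our Robolectric setup; `FakeClock`, `remap_timestamps` and `replay` are hypothetical names, and the recording is made-up data. It only shows why a replay takes no real time: the clock jumps to each sample’s timestamp instead of waiting for it.

```python
class FakeClock:
    """Stand-in for a mocked system clock (hypothetical helper)."""

    def __init__(self, start_ms=0):
        self.now_ms = start_ms

    def advance_to(self, timestamp_ms):
        # Jump straight to the target time instead of sleeping.
        self.now_ms = max(self.now_ms, timestamp_ms)


def remap_timestamps(samples, new_start_ms):
    """Shift (timestamp_ms, value) samples so the first one starts at new_start_ms."""
    if not samples:
        return []
    offset = new_start_ms - samples[0][0]
    return [(ts + offset, value) for ts, value in samples]


def replay(samples, clock, on_sample):
    """Feed recorded samples to on_sample as if they were arriving live."""
    for ts, value in samples:
        clock.advance_to(ts)
        on_sample(clock.now_ms, value)


# Two minutes of made-up sensor values, recorded at some point in the past
recording = [(1_522_000_000_000 + i * 20_000, 9.81 + i * 0.01) for i in range(7)]

clock = FakeClock(start_ms=1_000)
received = []
replay(
    remap_timestamps(recording, clock.now_ms),
    clock,
    lambda now, value: received.append((now, value)),
)
```

After the replay, the fake clock has moved forward by the full two minutes of the recording, while the relative spacing between samples is unchanged.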

Now that we have some real sensor data in our test system, we let it operate just as it would on a real device. Features are being extracted, classifiers are being run, data is being persisted. All the good stuff that we want to see happening.

When the recalculation is done (when there’s no more sensor data in the recording that we’re currently replaying), we can again export all the aggregated data into a recording. This time it not only contains the raw sensor data, but also the features and classifications.

You might be wondering why we don’t record the features and classifications while we record the raw sensor data. That is because if we change one minor thing (e.g. a feature), the whole system might produce completely different data (e.g. classifications). We want to be able to test the current state, not the state that was present during the initial recording.


We are now able to visualize the state of the system at any given time. We use Jupyter Notebooks and Matplotlib to generate different kinds of charts. These charts reduce the complexity that naturally comes with the thousands of multidimensional data points that the system deals with.

Walking classification of a user over time

To get an idea of what activities were being performed, we can plot classifications over time. In this recording, a user was walking through our office. The four drops are doors that the user had to open, thus interrupting the walk.

Different features based on the accelerometer while a user is classified as walking

We often check how different features look compared to each other. Here we see some features that are based on the raw acceleration sensor data. This type of plot allows us to evaluate which features might be useful for our classifiers and which are not required.

Accelerometer Absolute Difference while different users are classified as walking

If we load multiple recordings from different users, we can also check how the same features produce different clusters, depending on the user. Features that create unique clusters are super important for our specific classifications, as they allow us to distinguish users.

We can also animate these charts to see the values changing over time.


Now we have some understanding of what the system is doing and what the values look like. However, we need some metrics to figure out if our system has actually improved or not. For that we have tests that calculate a classification matrix.

Let’s say we have recorded sensor data from multiple users while they were walking. We replay some of the data from one user to train the classifiers and then challenge the classifiers with data from another user. Doing that over and over again for all possible user combinations results in a nice matrix.
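
The procedure described above could be sketched like this in Python. Everything here is an assumption for illustration: the function names, the split of each recording into a training part and a challenge part, the single made-up feature per sample, and the trivial stand-in classifier. Our actual tests replay full recordings through the system instead.

```python
def build_classification_matrix(recordings, train, accepts):
    """Train on each user's data, then challenge with every user's data.

    recordings: dict user -> (training_samples, challenge_samples)
    train:      callable(training_samples) -> model
    accepts:    callable(model, sample) -> bool
    Returns matrix[trained_user][challenger] = share of accepted samples.
    """
    matrix = {}
    for trained_user, (training, _) in recordings.items():
        model = train(training)
        matrix[trained_user] = {}
        for challenger, (_, challenge) in recordings.items():
            accepted = sum(1 for sample in challenge if accepts(model, sample))
            matrix[trained_user][challenger] = accepted / len(challenge)
    return matrix


# Toy stand-in for a classifier: a model is (mean value, allowed deviation)
def train(samples):
    return (sum(samples) / len(samples), 0.5)


def accepts(model, sample):
    center, allowed = model
    return abs(sample - center) <= allowed


# Made-up single-feature data, split into training and challenge parts
recordings = {
    "alice": ([1.0, 1.1, 0.9], [1.05, 0.95]),
    "bob": ([3.0, 3.2, 2.8], [3.1, 2.9]),
    "carol": ([5.0, 5.1, 4.9], [5.2, 4.8]),
}
matrix = build_classification_matrix(recordings, train, accepts)
```

With this toy data, the diagonal entries are 1.0 (each user is recognised by their own model) and all other entries are 0.0, which is the ideal outcome for such a matrix.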

Classification matrix for specific walking classifications

What you see above are classifications based on data from some of the team members. Bold values indicate that the system classified the data as belonging to the same user. That’s why all the values in the diagonal are bold — they are true positives. All other values should not be bold, otherwise they are false positives.

We expect that the system recognizes a user if the classifiers have been trained for that user. We also try to minimize the false positives by making the classifiers stricter.

Extracting some commonly used metrics from that matrix (namely the F1 score, precision, recall and accuracy) helps us instantly see how code changes affect the overall performance. We always keep an eye on the F1 score and make sure that it stays close to 1.
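
As a rough sketch of how these metrics fall out of such a matrix, the Python snippet below counts true and false positives and negatives and applies the standard formulas. The function name, the boolean matrix layout and the example users are assumptions for illustration, not our test code. A cell is `True` when the classifiers trained for one user accepted the data of the challenging user.

```python
def metrics_from_matrix(accepted):
    """Compute precision, recall, F1 and accuracy from a boolean matrix.

    accepted[trained_user][challenger] is True when the classifiers trained
    for trained_user accepted data from challenger.
    """
    tp = fp = tn = fn = 0
    for trained_user, row in accepted.items():
        for challenger, was_accepted in row.items():
            same_user = trained_user == challenger
            if same_user and was_accepted:
                tp += 1  # real user recognised
            elif same_user:
                fn += 1  # real user rejected
            elif was_accepted:
                fp += 1  # other user wrongly accepted
            else:
                tn += 1  # other user correctly rejected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Made-up result with a single false positive (bob's model accepts carol)
accepted = {
    "alice": {"alice": True, "bob": False, "carol": False},
    "bob": {"alice": False, "bob": True, "carol": True},
    "carol": {"alice": False, "bob": False, "carol": True},
}
scores = metrics_from_matrix(accepted)
```

In this example a single false positive already pulls precision down to 0.75 and the F1 score to 6/7 (about 0.86), even though recall stays at 1.0.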

By the way, the matrix shown above is actually markdown that our tests output when running. You can read more about that here:


Obviously a lot more data is required to infer the real-world performance of BAuth. To get data from more users in different situations, we hand out pre-release versions to selected people. They can opt in to sharing their usage data, which allows us to calculate our classification matrices for people that don’t work at neXenio.

We’re confident that BAuth is capable of providing a level of security that lets it compete with existing authentication methods while being much less cumbersome or intrusive.

If you’re interested in this project, feel free to get in touch!

