Axolotl: A Keylogger for iPhone and Android

A Guide to Reasoning About Unintuitive Machine-Learning Problems

Note: This post was co-authored by Greg Foster (Medium won’t let us add co-authors); definitely check out his profile!

Green is the ground-truth touch on an iPhone screen; red is where our algorithm predicts the touch occurred, based solely on accelerometer and gyroscope data (which can be read by apps in the background).

TL;DR This post motivates and describes an attack where accelerometer/gyroscope readings and machine learning are used to develop a keylogger for mobile devices. While previous research has been conducted in this space, we hope that our narrative is useful for someone tackling an unintuitive machine learning problem (also the results and graphs are just really darn cool).

Hypothesis Formation

After a bit of brainstorming, we formed our 4-part hypothesis:

(1) When you tap your phone, it moves.
(2) This movement can be picked up by the gyroscope and accelerometer.
(3) This movement is distinct from other movement and can be identified by machine learning and statistical techniques.
(4) The movement is sufficiently unique to identify where on-screen it occurred.

While we were fairly certain about (1) and (2), we were not sure about either (3) or (4). Previous work in this space quickly led us to believe it was possible, but we weren’t sure how robust these methods were or if we’d be able to recreate them. Generally, we’ve learned to be wary of academic descriptions of attack vectors, as sometimes they only work in a lab setting, expect certain conditions, or simply aren’t as practical as their authors make them out to be (of course, this does not describe all papers, but there are certainly a good number for which this is the case).

Evil: Predicting Touches

With any deep learning problem, the first question you need to ask yourselves is what data you care about. Neural nets are very specific devices that learn mappings from some well-defined input to some well-defined output. They do this by learning statistical correlations between labeled inputs and outputs. Many people, especially those newer to machine learning, seem to think that stellar algorithms are where people really differentiate themselves. While that can be true, the large majority of modern machine learning problems are solved with fairly canonical algorithms and a really awesome data set.

On the iPhone, one accesses accelerometer and gyroscope readings by listening to events emitted by iOS at seemingly random intervals (accelerometer and gyroscope readings are different events, emitted separately). We developed an app to continuously log these events along with touch data, and used it to collect 8 data sets to experiment on. Each row in a data set represents one sensor event and includes the Unix timestamp at which it was collected, whether it was an accelerometer or gyroscope event, and the x/y/z values of the reading (both sensors are 3-dimensional: the accelerometer measures acceleration along the x/y/z axes, and the gyroscope measures rotation with 3 degrees of freedom). We also recorded whether the user was touching the screen at the time of each event and, if so, where the touch had started (the beginning of a touch most likely coincides with the user tapping the phone).

Source. Note: We normalized locations to be [-1, 1]; -2 is a special value reserved to indicate that no touch was occurring at that time.
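
To make the row layout concrete, here is a minimal Python sketch of one logged event. The class name, field names, the "accel"/"gyro" labels, and the use of seconds for the timestamp are all assumptions for illustration; only the [-1, 1] normalization and the -2 sentinel come from the description above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

NO_TOUCH = -2.0  # sentinel described in the post: no touch at this time


@dataclass(frozen=True)
class SensorEvent:
    """One logged row (hypothetical field names)."""
    timestamp: float            # Unix time of the reading (seconds assumed)
    kind: str                   # "accel" or "gyro" (labels are assumptions)
    x: float                    # raw sensor reading, x component
    y: float                    # raw sensor reading, y component
    z: float                    # raw sensor reading, z component
    touch_x: float = NO_TOUCH   # touch start, normalized to [-1, 1]
    touch_y: float = NO_TOUCH   # touch start, normalized to [-1, 1]

    @property
    def touch_start(self) -> Optional[Tuple[float, float]]:
        """Return the normalized touch start, or None if no touch was active."""
        if self.touch_x == NO_TOUCH or self.touch_y == NO_TOUCH:
            return None
        return (self.touch_x, self.touch_y)


idle = SensorEvent(0.0, "accel", 0.1, 0.2, 0.3)
tapping = SensorEvent(1.0, "gyro", 0.0, 0.0, 0.0, touch_x=0.5, touch_y=-0.25)
```

Keeping the sentinel outside the valid [-1, 1] range means a single numeric column can carry both "no touch" and the touch position.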

We then graphed the data by time.

Source: visualize.py, option #1. Red denotes the duration that the user was actually touching the screen; as you can see from the short touch durations, this is a sequence of taps. The x-axis is time and the y-axis is the raw reading from the sensor.

This was super encouraging because it showed us that taps began with a distinct spike in readings. When doing machine learning on sources where you’re not sure how much signal there is, it is usually a good idea to start with small predictions and get progressively more complex. For this reason, we decided to see whether a neural net could predict whether a randomly sampled 200ms window was a tap or not (from visual inspection, it seemed all the spikes indicating a tap fell within a 200ms span).

Source: visualize.py, option #2–3. Left: raw data. Blue windows indicate tap samples, and green windows indicate not-tap samples. Center: Tap samples. Right: Not-tap samples. As you can see, there is a clear visual difference between tap samples and non-tap samples.

Traditional neural networks require fixed-size inputs, and our windows contained variable numbers of data events (emitted at the whim of the operating system). To overcome this, we decided to interpolate N evenly spaced points within each window. Then we could feed our neural net an input with 6N values (3 gyroscope and 3 accelerometer values for each of the N points). We tried various values for N and ultimately settled on N=20 (the trade-off is more information vs slower training times).
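
The resampling step can be sketched roughly as follows. This is an illustration rather than the code from our repository: it assumes linear interpolation via NumPy’s `np.interp`, timestamps in seconds, and hypothetical function names. Note that `np.interp` holds the first or last reading constant for grid points outside the recorded range.

```python
import numpy as np


def resample_window(t, values, start, duration=0.2, n=20):
    """Linearly interpolate one sensor's x/y/z readings at n evenly spaced times.

    t:      increasing 1-D array of event timestamps (seconds assumed)
    values: array of shape (len(t), 3) with the x/y/z readings
    """
    t = np.asarray(t, dtype=float)
    values = np.asarray(values, dtype=float)
    grid = np.linspace(start, start + duration, n)
    # Interpolate each axis independently onto the shared time grid.
    return np.stack([np.interp(grid, t, values[:, k]) for k in range(3)], axis=1)


def window_features(accel_t, accel_xyz, gyro_t, gyro_xyz, start, duration=0.2, n=20):
    """Build the flat 6N-value input: 3 accelerometer + 3 gyroscope values per point."""
    accel = resample_window(accel_t, accel_xyz, start, duration, n)
    gyro = resample_window(gyro_t, gyro_xyz, start, duration, n)
    return np.concatenate([accel, gyro], axis=1).ravel()


# Tiny made-up example: three irregular events resampled onto five points.
t_demo = np.array([0.0, 0.1, 0.2])
v_demo = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
resampled = resample_window(t_demo, v_demo, start=0.0, duration=0.2, n=5)
features = window_features(t_demo, v_demo, t_demo, v_demo, start=0.0)
```

Because the two sensors emit events separately, each is interpolated onto the same grid before the values are concatenated, so every window yields exactly 6N numbers regardless of how many events the OS delivered.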

Source: visualize.py, option #4–6. Left: N=10. Center: N=20. Right: N=30. The one on the left seems to lose valuable information about the curve, while the middle and right seem fairly similar.

After a little bit of experimenting, we settled on a 5-layer fully-connected neural network with ReLU activations as our binary classifier. This method led to the following prediction accuracies on held-out testing sets:

Sample 0: 97.5% correct
Sample 1: 94.7% correct
Sample 2: 81.8% correct
Sample 3: 96.7% correct
Sample 4: 91.2% correct
Sample 5: 87.7% correct
Sample 6: 83.2% correct
Sample 7: 92.6% correct
Source: learn_touches.py
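
For readers who want to see the shape of such a classifier, here is a forward-pass-only sketch in NumPy. It is not the model from learn_touches.py: the layer widths, the initialization, the sigmoid output, and the reading of “5-layer” as five weight layers are all assumptions, and the weights here are untrained.

```python
import numpy as np


def init_mlp(sizes, seed=0):
    """Create (weights, bias) pairs for consecutive layer sizes (He-style init)."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out)), np.zeros(fan_out))
        for fan_in, fan_out in zip(sizes[:-1], sizes[1:])
    ]


def tap_probability(params, x):
    """Forward pass: ReLU on hidden layers, sigmoid on the single output unit."""
    h = np.asarray(x, dtype=float)
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)  # ReLU activation
    w, b = params[-1]
    logits = (h @ w + b)[..., 0]
    return 1.0 / (1.0 + np.exp(-logits))


# 6N = 120 inputs for N=20; the hidden widths below are hypothetical.
sizes = [120, 64, 64, 32, 16, 1]
params = init_mlp(sizes)
probs = tap_probability(params, np.zeros((4, 120)))  # batch of 4 all-zero windows
```

In practice the weights would be fit with a binary cross-entropy loss in a standard framework; the sketch only shows how a 6N-value window maps to a single tap probability.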

Note: another, possibly “more correct,” way to do this would use an RNN to analyze the sequence data. We chose not to do that primarily because densely connected neural nets tend to be less expensive to train than RNNs over long streams of data.

Eviler: Predicting Location

Sample 0: 0.366 inches
Sample 1: 0.373 inches
Sample 2: 0.4077 inches
Sample 3: 0.4190 inches
Sample 4: 0.4552 inches
Sample 5: 0.3935 inches
Sample 6: 0.3923 inches
Sample 7: 0.3946 inches
Source: learn_location.py

However, as is usually the case with aggregate statistics, the number is only part of the story. When we graph the distribution of errors, the majority of predictions are fairly close to the true location (with some outliers skewing the mean).

Source: learn_location.py. Left: Normalized histogram of error values; as you can see, many of the errors are actually very small. Right: Green is the true location of the touch, and red is our model’s predicted location.
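
A small sketch of why the mean alone can mislead. The data below is made up purely for illustration (it is not from our experiments), and the screen half-width and half-height in inches are hypothetical parameters needed to convert normalized [-1, 1] coordinates into inches.

```python
import numpy as np


def error_summary(pred, true, half_width_in, half_height_in):
    """Summarize Euclidean touch-location errors in inches.

    pred, true: arrays of shape (n, 2) in normalized [-1, 1] coordinates.
    One normalized unit equals half the screen width (or height) in inches.
    """
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    scale = np.array([half_width_in, half_height_in])
    err = np.linalg.norm((pred - true) * scale, axis=1)
    return {
        "mean": float(err.mean()),
        "median": float(np.median(err)),
        "p90": float(np.percentile(err, 90)),
    }


# Made-up predictions: four near misses and one large outlier.
true_xy = np.zeros((5, 2))
pred_xy = np.array([[0.1, 0.0], [0.0, 0.05], [-0.1, 0.0], [0.0, -0.05], [1.0, 0.0]])
summary = error_summary(pred_xy, true_xy, half_width_in=1.0, half_height_in=2.0)
```

In this toy data four of the five errors are 0.1 inches, yet the single 1.0-inch outlier pulls the mean to 0.28 inches while the median stays at 0.1, which is why reporting the median or plotting the histogram alongside the mean gives a fairer picture.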

Evilest: A Unified Pipeline

An unsuspecting user downloads “Evil Flappy”, an app where they have to tap on the screen mindlessly to advance some objective. During this tapping, the app uses transfer learning to tailor the model to the user and test its own predictive capacity. Once the app detects it has sufficient accuracy, it lets the user win and prompts them to log into Facebook (or any other social network) to share their high score. At this point, the app will be sent to the background, where it can begin keylogging a known process: typing in a password and pressing log-in.

In order to do this, the app collects accelerometer and gyroscope data, and then either analyzes it locally or sends it to a server for analysis. Whether it’s happening locally or remotely, a sliding window can be run over the data to detect when taps happen and, whenever one is detected, where that tap occurred.

Source: predict_touches_sequence.py. Top: Raw sensor data. Bottom: The neural net’s prediction of whether there is currently a touch. Red bars indicate the true times of touches.
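
The sliding-window idea is a generic event-detection pattern, sketched below under stated assumptions. The function name, the step size, the 0.5 threshold, and the rising-edge rule for merging overlapping detections are illustrative choices, not the logic of predict_touches_sequence.py; the feature extractor and classifier are passed in as plain callables.

```python
from typing import Callable, List, Sequence


def detect_tap_starts(
    features_at: Callable[[float], Sequence[float]],
    tap_probability: Callable[[Sequence[float]], float],
    t_start: float,
    t_end: float,
    step: float = 0.05,
    window: float = 0.2,
    threshold: float = 0.5,
) -> List[float]:
    """Slide a fixed-length window over [t_start, t_end] and report tap onsets.

    A tap is reported once, at the first window whose score crosses the
    threshold (rising edge), so overlapping windows are not double counted.
    """
    if t_end - window < t_start:
        return []  # the stream is shorter than one window
    n_windows = int((t_end - window - t_start) / step + 1e-9) + 1
    starts: List[float] = []
    previous_is_tap = False
    for i in range(n_windows):
        t = t_start + i * step
        is_tap = tap_probability(features_at(t)) >= threshold
        if is_tap and not previous_is_tap:
            starts.append(t)
        previous_is_tap = is_tap
    return starts


# Stand-in scorer for demonstration only: "taps" at 1.0-1.5 s and 2.0-2.25 s.
def fake_probability(features: Sequence[float]) -> float:
    t = features[0]
    return 1.0 if (1.0 <= t < 1.5) or (2.0 <= t < 2.25) else 0.0


starts = detect_tap_starts(lambda t: [t], fake_probability, 0.0, 3.0, step=0.25, window=0.25)
```

The step size trades latency and compute against the risk of missing short taps; a step smaller than the window means consecutive windows overlap, which is why the rising-edge rule is needed.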

The accuracy of this attack could be increased in several ways: using a Markov model over which letters tend to follow other letters in users’ passwords; training the neural network to predict letters instead of locations (giving us a probability for each individual letter and allowing us to create some kind of Kalman filter over all potential passwords); or asking the user to input their password multiple times and building a statistical model to integrate the multiple inputs.

Some potential fixes that smartphone operating system vendors could implement are: requiring permissions to access the accelerometer / gyroscope, fuzzing the data from those sensors while the user is typing into password fields, or simply reducing the rate at which apps can read those sensors while running in the background.
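
As a rough illustration of the last two mitigations, the sketch below rate-limits a stream of readings and optionally adds Gaussian noise. It is a conceptual model only, not an OS API: the function name, the 0.5 s interval, and the noise level are hypothetical, and real values would need to be tuned against legitimate uses such as step counting.

```python
import numpy as np


def protect_readings(t, values, min_interval=0.5, noise_std=0.0, seed=None):
    """Rate-limit and optionally fuzz sensor readings before apps see them.

    Keeps at most one reading per min_interval seconds, then adds zero-mean
    Gaussian noise with standard deviation noise_std to each kept value.
    """
    t = np.asarray(t, dtype=float)
    values = np.asarray(values, dtype=float)
    keep = []
    last_kept = -np.inf
    for i, ti in enumerate(t):
        if ti - last_kept >= min_interval:
            keep.append(i)
            last_kept = ti
    out = values[keep]
    if noise_std > 0:
        rng = np.random.default_rng(seed)
        out = out + rng.normal(0.0, noise_std, size=out.shape)
    return t[keep], out


# Made-up stream: five readings 0.25 s apart, limited to one every 0.5 s.
t_in = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
v_in = np.arange(15, dtype=float).reshape(5, 3)
t_out, v_out = protect_readings(t_in, v_in, min_interval=0.5)
t_fuzz, v_fuzz = protect_readings(t_in, v_in, min_interval=0.5, noise_std=0.1, seed=0)
```

Since the tap signature described earlier fits inside a 200ms span, delivering readings less often than that to background apps would remove most of the spike’s shape, while noise degrades whatever remains.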

General Lessons about Data Science and Machine Learning

(1) Start small, scale up.

(2) Visualize regularly.

(3) Aggregate statistics don’t necessarily tell the whole story; make sure you’re choosing the right ones.
