Axolotl: A Keylogger for iPhone and Android

A Guide to Reasoning About Unintuitive Machine-Learning Problems

Note: This post was co-authored by Greg Foster (Medium won’t let us add co-authors). Definitely check out his profile!

Green is the ground truth touch on an iPhone screen, red is where our algorithm predicts the touch occurred based solely on accelerometer and gyroscope data (which can be read by apps in the background).

TL;DR This post motivates and describes an attack where accelerometer/gyroscope readings and machine learning are used to develop a keylogger for mobile devices. While previous research has been conducted in this space, we hope that our narrative is useful for someone tackling an unintuitive machine learning problem (also the results and graphs are just really darn cool).

Hypothesis Formation

In Fall 2016, we were tasked with creating a final project for CS263 (Harvard’s Systems Security class): implementing some attack. Mostly to spite our hype-hating professor, we committed to integrating the greatest buzzword of all into our project: machine learning. We had been intrigued by previous attacks that used machine learning on audio data of keyboards and pin pads to extract passwords. This type of attack is called a side-channel attack, and it takes advantage of the physical properties of the real-world implementation of a security system. Side-channel attacks seem like good candidates for machine learning because they often involve picking up on subtle, non-evident patterns. We decided to devise a side-channel attack for mobile devices, which are chock-full of incredibly accurate, permissionless sensors. Specifically, we wanted to develop an attack that exploited the accelerometer (measures movement) and gyroscope (measures rotation), as they don’t require permissions and can be read by apps running in the background. On iPhone, apps can run for up to 10 minutes in the background after they’ve been closed.

After a bit of brainstorming, we formed our 4-part hypothesis:

(1) When you tap your phone, it moves.
(2) This movement can be picked up by the gyroscope and accelerometer.
(3) This movement is distinct from other movement and can be identified by machine learning and statistical techniques.
(4) The movement is sufficiently unique to identify where on-screen it occurred.

While we were fairly certain about (1) and (2), we were not sure about either (3) or (4). Previous work in this space quickly led us to believe it was possible, but we weren’t sure how robust these methods were or if we’d be able to recreate them. Generally, we’ve learned to be wary of academic descriptions of attack vectors, as sometimes they only work in a lab setting, expect certain conditions, or simply aren’t as practical as their authors make them out to be (of course, this does not describe all papers, but there are certainly a good number for which this is the case).

Evil: Predicting Touches

Repository here:

With any deep learning problem, the first question you need to ask yourself is what data you care about. Neural nets are very specific devices that learn mappings from some well-defined input to some well-defined output. They do this by learning statistical correlations between labeled inputs and outputs. Many people, especially those newer to machine learning, seem to think that stellar algorithms are where people really differentiate themselves. While that’s sometimes true, a large majority of modern machine learning problems are solved with fairly canonical algorithms and a really awesome data set.

On the iPhone, one accesses accelerometer and gyroscope readings by listening to events that iOS emits at seemingly random intervals (accelerometer and gyroscope readings are different events, emitted separately). We developed an app to continuously log these events along with touch data, and used it to collect 8 data sets to experiment on. Each row in a data set represents one sensor event and records the Unix timestamp at which it was collected, whether it was an accelerometer or gyroscope event, and the x/y/z values of the reading (both sensors are 3-dimensional: accelerometers measure acceleration along the x/y/z axes, and gyroscopes have 3 degrees of rotational freedom). Each row also records whether the user was touching the screen at the time of the event and, if so, where the touch had started (the beginning of a touch is most likely the moment the user tapped the phone).

Source. Note: We normalized locations to be [-1, 1]; -2 is a special value reserved to indicate that no touch was occurring at that time.
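
To make the row format concrete, here is a minimal Python sketch of one logged event. The class name, field names, sensor labels, and the `normalize` helper are all hypothetical (our actual logger may lay things out differently); only the [-1, 1] range and the -2 sentinel come from the note above.

```python
from dataclasses import dataclass

NO_TOUCH = -2.0  # sentinel from the note above: no touch was occurring


@dataclass
class SensorEvent:
    """One logged row. Field names and sensor labels are illustrative."""
    timestamp: float            # Unix time of the event (seconds assumed)
    sensor: str                 # "accel" or "gyro" (labels assumed)
    x: float
    y: float
    z: float
    touch_x: float = NO_TOUCH   # where the current touch started, in [-1, 1]
    touch_y: float = NO_TOUCH

    @property
    def touching(self) -> bool:
        # A row counts as "during a touch" only if both coordinates are set.
        return self.touch_x != NO_TOUCH and self.touch_y != NO_TOUCH


def normalize(pixel: float, extent: float) -> float:
    """Map a pixel coordinate in [0, extent] onto [-1, 1]."""
    return 2.0 * pixel / extent - 1.0
```

Using a sentinel outside the valid range keeps every row the same shape, which is convenient when the whole log is later loaded as one numeric array.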

We then graphed the data by time.

Source:, option #1. Red denotes the duration that the user was actually touching the screen; as you can see from the short touch duration, these are a sequence of taps. The x-axis is time and the y-axis is raw reading from the sensor.

This was super encouraging because it showed us that taps began with a distinct spike in readings. When doing machine learning on sources where you’re not sure how much signal there is, it is usually a good idea to start with small predictions and get progressively more complex. For this reason, we decided to see whether a neural net could predict whether a randomly sampled 200ms window was a tap or not (from visual inspection, it seemed all the spikes indicating a tap fell within a 200ms span).

Source:, option #2–3. Left: raw data. Blue windows indicate tap samples, and green windows indicate not-tap samples. Center: Tap samples. Right: Not-tap samples. As you can see, there is a clear visual difference between tap samples and non-tap samples.

Traditional neural networks require fixed-sized inputs, and our windows contained variable amounts of data events (emitted at the whim of the operating system). In order to overcome this, we decided to interpolate N points at equal distances in windows. Then we could feed our neural net an input with 6N values (3 gyroscope and 3 accelerometer values for each of the N points). We tried various values for N and ultimately settled on N=20 (the trade-off is more information vs slower training times).

Source:, option #4–6. Left: N=10. Center: N=20. Right: N=30. The one on the left seems to lose valuable information about the curve, while the middle and right seem fairly similar.
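
The resampling step can be sketched roughly as follows. This is an illustration rather than the code from our repository: it assumes linear interpolation (via NumPy's `np.interp`), and the function name and the `(t, x, y, z)` array layout are made up for the example.

```python
import numpy as np


def resample_window(accel, gyro, t_start, width=0.2, n=20):
    """Turn irregularly timed sensor events into a fixed-size input vector.

    accel, gyro: arrays of shape (k, 4) with columns (t, x, y, z), sorted by t.
    Returns a flat vector of 6 * n values: for each of the n evenly spaced
    time points, 3 accelerometer values followed by 3 gyroscope values.
    """
    grid = np.linspace(t_start, t_start + width, n)
    channels = []
    for events in (accel, gyro):
        events = np.asarray(events, dtype=float)
        for axis in (1, 2, 3):
            # np.interp interpolates linearly and holds the first/last reading
            # constant for grid points outside the observed time range.
            channels.append(np.interp(grid, events[:, 0], events[:, axis]))
    return np.stack(channels, axis=1).ravel()
```

With the defaults (a 200ms window and N=20), the result has 120 values regardless of how many events the operating system happened to emit inside the window.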

After a little bit of experimenting, we settled on a 5-layer fully-connected neural network with relu activations as our binary classifier. This method leads to the following prediction accuracies on held-out testing sets:

Sample 0: 97.5% correct
Sample 1: 94.7% correct
Sample 2: 81.8% correct
Sample 3: 96.7% correct
Sample 4: 91.2% correct
Sample 5: 87.7% correct
Sample 6: 83.2% correct
Sample 7: 92.6% correct
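
For readers who want to see the shape of such a classifier, below is a rough NumPy sketch of the forward pass only. The hidden-layer widths, the initialization, and the sigmoid output are assumptions made for illustration, and training (loss, optimizer, batching) is omitted entirely.

```python
import numpy as np

# 120 inputs (6 values x N=20 points), then 5 weight layers down to 1 output.
# The hidden widths are assumed for this sketch.
SIZES = [120, 64, 64, 32, 16, 1]


def init_params(sizes, seed=0):
    """Random weights (He-style scaling) and zero biases for each layer."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def predict_tap_probability(params, x):
    """Forward pass: ReLU on hidden layers, sigmoid on the single output."""
    h = np.asarray(x, dtype=float)
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)   # ReLU
    w, b = params[-1]
    logits = h @ w + b
    return 1.0 / (1.0 + np.exp(-logits))  # probability that the window is a tap
```

A batch of windows with shape `(batch, 120)` yields probabilities with shape `(batch, 1)`; thresholding at 0.5 gives the tap / not-tap decision.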

Note: another, possibly “more correct,” way to do this would use an RNN to analyze the sequence data. We chose not to do that primarily because densely connected neural nets tend to be less expensive to train than RNNs over long streams of data.

Eviler: Predicting Location

Being able to detect taps reliably gave us faith that we might be able to detect the location of the taps on the screen. We modified our neural net to output an X, Y coordinate for the predicted location and retrained it. We got the following mean errors (the average distance between our predicted touch location and the true location of the touch):

Sample 0: 0.366 inches
Sample 1: 0.373 inches
Sample 2: 0.4077 inches
Sample 3: 0.4190 inches
Sample 4: 0.4552 inches
Sample 5: 0.3935 inches
Sample 6: 0.3923 inches
Sample 7: 0.3946 inches

However, as is usually the case with aggregate statistics, the number is only part of the story. When we graph the distribution of errors, the majority are fairly close (with some outliers skewing the mean).

Source: Left: Normalized histogram of error values, as you can see many of the errors are actually very small. Right: Green is the true location of the touch, and red is our model’s predicted location.
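
A quick way to see how a few outliers can drag the mean is to compute the per-tap distances and compare the mean with the median. The helper below is a small sketch; its name and the numbers in the usage example are invented for illustration and are not our measurements.

```python
import numpy as np


def error_summary(predicted, actual):
    """Per-tap Euclidean distance between predicted and true (x, y) points."""
    diffs = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    errors = np.linalg.norm(diffs, axis=1)
    return {"mean": float(errors.mean()), "median": float(np.median(errors))}


# Made-up data: nine predictions that are close, plus one that is far off.
predicted = [[0.1, 0.0]] * 9 + [[2.0, 0.0]]
actual = [[0.0, 0.0]] * 10
summary = error_summary(predicted, actual)
```

In this toy case the single bad prediction nearly triples the mean relative to the median, which is why looking at the whole distribution is more informative than a single number.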

Evilest: A Unified Pipeline

Having these two building blocks, we can now assemble a full app that could key-log. A potential attack vector might look something like this:

An unsuspecting user downloads “Evil Flappy”, an app where they have to tap on the screen mindlessly to advance some objective. During this tapping, the app uses transfer learning to tailor the model to the user and test its own predictive capacity. Once the app detects it has sufficient accuracy, it lets the user win and prompts them to log into Facebook (or any other social network) to share their high score. At this point, the app will be sent to the background, where it can begin key logging a known process: type in a password and press log-in.

In order to do this, the app collects accelerometer and gyroscope data, and then either analyzes it locally or sends it to a server for analysis. Whether it’s happening locally or remotely, a sliding window can be run over the data to detect when taps happen and, when one does, where it occurred.

Source: .Top: Raw sensor data. Bottom: The neural net’s prediction of whether there is currently a touch. Red bars indicate the true times of touches.
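
The sliding-window pass could look something like the sketch below. The two callables stand in for the tap classifier and the location model (each would resample the window and run its network); their signatures, the step size, the threshold, and the skip-ahead rule after a detection are all assumptions for this example.

```python
def detect_taps(t_begin, t_end, tap_probability, tap_location,
                width=0.2, step=0.05, threshold=0.5):
    """Slide a fixed-width window over a recording and collect detected taps.

    tap_probability(t, width) -> float in [0, 1]   (hypothetical classifier)
    tap_location(t, width)    -> (x, y)            (hypothetical location model)
    Returns a list of (window_start, (x, y)) pairs.
    """
    taps = []
    t = t_begin
    while t + width <= t_end:
        if tap_probability(t, width) >= threshold:
            taps.append((t, tap_location(t, width)))
            t += width   # jump past this window so one tap is not reported twice
        else:
            t += step
    return taps
```

Skipping a full window after each detection is the simplest way to avoid duplicate reports from overlapping windows; it does assume that two taps never land within the same 200ms span.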

The accuracy of this attack could be increased in a few ways: using a Markov model over what letters might follow other letters in users’ passwords; replacing the neural network with one that predicts letters instead of locations (thus giving us the probability of each individual letter and allowing us to create some kind of Kalman filter over all possible potential passwords); or asking the user to input their password multiple times and building a statistical model to integrate the multiple inputs.
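
As one illustration of how per-tap letter probabilities and a letter-to-letter Markov model could be combined, here is a sketch of Viterbi decoding. To be clear, this is a substitute technique chosen for the example, not something we implemented and not the Kalman-style filter mentioned above. It also treats the classifier's per-letter probabilities as if they were emission likelihoods, which is an approximation.

```python
import numpy as np


def viterbi(emission_probs, transition_probs, initial_probs):
    """Most likely letter sequence given per-tap and letter-pair probabilities.

    emission_probs:   shape (T, K), probability of each of K letters per tap
    transition_probs: shape (K, K), P(next letter = j | previous letter = i)
    initial_probs:    shape (K,),   probability of each letter coming first
    Returns a list of T letter indices.
    """
    eps = 1e-12  # avoids log(0)
    log_e = np.log(np.asarray(emission_probs, dtype=float) + eps)
    log_t = np.log(np.asarray(transition_probs, dtype=float) + eps)
    score = np.log(np.asarray(initial_probs, dtype=float) + eps) + log_e[0]
    back = []
    for t in range(1, len(log_e)):
        cand = score[:, None] + log_t        # cand[i, j]: best path ending in i, then i -> j
        back.append(cand.argmax(axis=0))     # best predecessor for each letter j
        score = cand.max(axis=0) + log_e[t]
    path = [int(score.argmax())]
    for ptr in reversed(back):               # walk the back-pointers to the start
        path.append(int(ptr[path[-1]]))
    return path[::-1]
```

The benefit shows up when a tap is ambiguous: a letter that is slightly less likely on its own can still win if it fits the surrounding letters much better.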

Some potential fixes that smartphone operating system vendors could implement are: requiring permissions to access the accelerometer / gyroscope, fuzzing the data from those sensors while the user is typing into password fields, or simply downsampling the rate at which apps can read those sensors while running in the background.

General Lessons about Data Science and Machine Learning

Throughout this project, we found ourselves grappling with a problem that’s not super intuitive and with too much data to analyze by hand. Lessons learned are:

(1) Start small, scale up.

(2) Visualize regularly.

(3) Aggregate statistics don’t necessarily tell the whole story; make sure you’re choosing the right ones.