Acoustic Eavesdropping: Predicting Keystrokes With Google AutoML Vision

Charith De Silva
Published in The Startup · Jul 22, 2020
Bolling Air Field horn amplifiers via Wikimedia Commons (Unknown author / Public domain)

Eavesdropping has nothing to do with the Garden of Eden, or Eve, or... well, you get the picture. Now that that bad joke is out of the way, let’s focus on acoustic eavesdropping.

Acoustic eavesdropping is the practice of gathering information or intelligence from sound, and it has been used in various forms since at least WWI, or even earlier. The picture above shows a device used to listen for the sound of enemy aircraft during WWI, in the pre-radar era, as an early warning of an air raid. Gathering intelligence from acoustics is widely used and nothing new, so I thought I’d see whether I could predict keystrokes from the sound of a computer keyboard. If you listen closely, you’ll notice that each key makes a slightly different sound. A quick Google search revealed a ton of prior research, which was promising, so I went about doing it.

The basic idea is as follows:

  • Capture the sound of keystrokes multiple times and build a data set. Since my laziness is one of my many virtues, I only bothered to record 8 keys (a-h) fifteen times each. Why 8 and 15? ¯\_(ツ)_/¯
  • Create a histogram of each audio clip using Python (librosa¹ & matplotlib²). This is a visual representation of the audio file.
Python code to convert wave file to a histogram
Histogram of “C”
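
For reference, here is a minimal sketch of that conversion step, assuming librosa¹ and matplotlib². The file names, plot type, and figure settings are illustrative choices, not necessarily what the original gist used:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def audio_to_image(wav_path, out_path):
    """Render an audio clip as an image that AutoML Vision can classify.
    The post calls the rendered image a "histogram"; a mel spectrogram is
    used here as one reasonable choice."""
    y, sr = librosa.load(wav_path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    fig, ax = plt.subplots(figsize=(3, 3))
    librosa.display.specshow(librosa.power_to_db(mel, ref=np.max), sr=sr, ax=ax)
    ax.axis("off")  # keep only the plot, no axes or labels
    fig.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)

# e.g. audio_to_image("c_01.wav", "c_01.png")  # hypothetical file names
```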
  • At this point, you have the required information to start training your model. I decided to use Google AutoML Vision, simply because it’s easy. You can follow the AutoML documentation³ to get this part done; the steps, in summary, are:
  1. Create a single-label classification dataset, since each image contains one object (one character, in this case).
  2. Upload your images from a storage bucket or directly from your computer.
  3. Upload a CSV file listing the images, whether each image is to be used for training, validation, or testing, and the label for each image. The label in this case is the character associated with the sound (one way to generate this file is sketched after the list).
  4. Train! I suggest using at least 16 node hours, or the precision will suffer. This will take a couple of hours to complete.
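
For step 3, a small script can generate the import CSV. The bucket name and file layout below are made up for illustration; check the AutoML documentation³ for the exact CSV format expected by your dataset type:

```python
import csv
import random

KEYS = "abcdefgh"
SAMPLES_PER_KEY = 15
BUCKET = "gs://keystroke-images"  # hypothetical bucket name

with open("import.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for key in KEYS:
        for n in range(1, SAMPLES_PER_KEY + 1):
            # Rows are: split, Cloud Storage URI, label. Using UNASSIGNED
            # instead of an explicit split lets AutoML divide the data itself.
            split = random.choices(["TRAIN", "VALIDATION", "TEST"], weights=[8, 1, 1])[0]
            writer.writerow([split, f"{BUCKET}/{key}/{key}_{n:02d}.png", key])
```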

Once the training is complete, Google is good enough to email you about it and also give you the code needed to test the model. You can also test it online using the web UI. To make sure your model is trained well enough, record another keystroke, convert it to a histogram, and test it through the UI before using the code. You’ll be surprised; at least I was.
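The snippet Google provides is specific to your project and model; with the google-cloud-automl Python client it looks roughly like the following. The project ID, model ID, and file name are placeholders, and the exact class names can differ between client library versions:

```python
from google.cloud import automl

PROJECT_ID = "my-project"       # placeholder: use your own project ID
MODEL_ID = "ICN1234567890"      # placeholder: use your own model ID
IMAGE_PATH = "test_keystroke.png"

client = automl.PredictionServiceClient()
model_name = automl.AutoMlClient.model_path(PROJECT_ID, "us-central1", MODEL_ID)

# Send the histogram image of a fresh keystroke to the trained model
with open(IMAGE_PATH, "rb") as f:
    payload = automl.ExamplePayload(image=automl.Image(image_bytes=f.read()))

response = client.predict(name=model_name, payload=payload)
for result in response.payload:
    print(f"{result.display_name}: {result.classification.score:.2f}")
```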

Now to put everything together and give it a test. I have uploaded the full code on GitHub⁴ if you want to try it out. This was the result:

As far as an experiment and an initial PoC go, the results were promising. It only takes a couple of hours to hack together something that can guess what you are typing by listening. OK, so it wasn’t perfect: it didn’t recognize some of the characters, but the idea works. If you are guessing a password with a brute-force attack, knowing part of the password significantly improves the chance of identifying it.
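To put rough numbers on that (hypothetical figures, purely to illustrate the scale of the reduction):

```python
# Hypothetical example: an 8-character, lowercase-only password.
full_space = 26 ** 8      # ~2.1e11 candidates to brute-force blindly
# If acoustics reveal 4 of the 8 characters, only 4 positions remain unknown.
reduced_space = 26 ** 4   # 456,976 candidates
print(f"Search space shrinks by a factor of {full_space // reduced_space:,}")
```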

There are also some things I could do to make this more accurate: run noise reduction on the recordings to isolate the sound of each keystroke, and build a much larger dataset. Google suggests having at least 1,000 images per label for better accuracy (not going to happen, son!). You could also take it to the next level by analyzing the typing pattern of each user. Next time you type something, be aware of how you type: I’m sure there are some keys you press harder or softer, and the duration between key combinations differs, i.e. the time between a-e and a-i would be different.

Now, the more observant of you would say this is not practical: in a normal work environment there’s so much noise that you’ll never capture the sound of a single keystroke. True, I did this experiment in a room with no interference. But if you read enough, you’ll find that audio can be scrubbed of noise; there are massive datasets that can be used to detect and remove ambient noise like A/C, fans, traffic, etc. So even in a noisy environment, isolating keystrokes is not impossible. The next time you are typing something sensitive, make sure no one is looking at your screen, and also that no one is listening.
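As a starting point for both isolating individual keystrokes and measuring a user’s typing rhythm, an onset detector can segment a longer recording into key presses. This is a rough sketch with an illustrative file name and default parameters, not code from the project:

```python
import librosa

# Load a longer recording that contains several keystrokes (hypothetical file)
y, sr = librosa.load("typing_session.wav", sr=None)

# Each key press is a short, sharp transient, so onset detection finds them
onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time", backtrack=True)

# Inter-key intervals form part of a user's typing pattern
intervals = [t2 - t1 for t1, t2 in zip(onsets, onsets[1:])]
print("Keystrokes detected:", len(onsets))
print("Inter-key intervals (s):", [round(i, 3) for i in intervals])

# Each onset could then be sliced out (e.g. a window around the onset time)
# and converted to an image for classification, like the training clips.
```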

But here’s the moral of the story. Technology being the double-edged sword it is, its morality falls to the wielder of the sword. And privacy is becoming the square root of minus one. If you look at recent trends in cybercrime, you’ll notice the attention moving towards social engineering; the recent Twitter hack is a good example. When every aspect of you becomes personally identifiable information (PII), from your fingerprint, your voice, your heartbeat, and the way you type, to your posture and behavior itself, your privacy is persona non grata. Finis!

P.S. Please note that Python is not my first language, so do excuse the rough quality of the code.
