A noob’s attempt at reverse engineering Google’s Cash/Tez mode — part 1

G-pay complaining when replaying an 8 digit token.

A week ago, I happened to install Google Pay (a payments app that used to be called ‘Tez’) and it asked me for permission to access my phone’s microphone and speaker. My initial reaction was …. WHAT …. WHY…. you’re a payments app, why do you need access to my phone’s speaker/microphone? It’s not like I’m expecting a jingle after every successful transaction.

So, I started digging a bit and learned from Google’s help section that the app uses ultrasound to discover and pair parties in a ‘cash’ mode transaction. Ooooh, that’s cool. But what the heck is ultrasound .. really!

It’s just ‘sound’ at a higher frequency, which happens to be inaudible to (at least adult) humans. So ‘cash mode’ is a mechanism that lets G-pay avoid relying on RF-based technologies (Bluetooth, Wi-Fi, NFC etc.) for data transfer by leveraging ‘sound’, which has the advantage that it doesn’t pass through walls.

In G-pay’s case, it tries to establish physical co-presence by transmitting a short 8 digit token as inaudible sound.

  • Quick note — The pairing code or the 8 digit token is the only piece of data that uses this ultrasonic data channel. Cash mode still needs an internet connection to exchange payee information.

This whole thing was intriguing enough that I ended up spending a week unraveling how it works (Uhhh well, so much for my vacation).

For starters, I learnt that data over sound isn’t really new: a few companies are working on this stuff, with some releasing apps (CUE Audio, Shopkick, Signal 360 etc.) and frameworks (LISNR, Chirp). Google, for its part, is in love with this tech and has been using it in a number of its products.

  1. Google Nearby platform — to discover and send messages between nearby devices. BTW, this is built into Android.
  2. Chromecast’s guest mode — to authenticate a guest’s mobile device automatically
  3. Google Play Games — to find nearby players
  4. Audio QR in Google pay (or erstwhile Tez) — to discover and pair users in a ‘cash’ mode transaction.

With the above context, let’s dive into the reverse engineering part of the title.

The goal — see if we can sniff out the 8 digit tokens in-flight (i.e. during a broadcast). In G-pay’s case, it continuously broadcasts 8 digit tokens when activated.

Ok, so what do we know? G-pay uses ultrasound to transmit data when activated (i.e. placed in receive mode). Shit! That’s all I have?? What am I going to do with just that piece of information? So, the first rule of blind reverse engineering: test what you know.

I used Audacity (an open-source audio recording and editing tool) to see if I could record/capture ultrasonic audio and kept looking for something in the ‘ultrasound part of the audio spectrum’ i.e. 20 kHz and above. After an entire afternoon of hurling abuses at the tool, I didn’t get much. Turns out conventional hardware like phones, laptops etc. can’t record/play audio beyond ~24 kHz (half of the typical 48 kHz sample rate). But it didn’t add up, as all the marketing literature pointed to the use of an ordinary phone’s speaker to transmit data. Until I realized I had missed a tiny detail in the help section: cash mode uses near-ultrasound and not actual ultrasound. Adjusting for this little detail and aiming my phone’s speaker directly at the microphone in my laptop produced a pretty strong (but kind of noisy) signal in the 16.5–18.5 kHz frequency band (i.e. a signal/channel with a bandwidth of about 2 kHz).

2 kHz wide signal in the near-ultrasound (16.5–18.5 kHz) spectrum
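If you'd rather check for the band numerically than squint at a spectrogram, here is a minimal sketch. It assumes a 48 kHz sample rate and uses a synthetic stand-in for the recording (a hypothetical 17.5 kHz test tone plus noise), since no real capture is included here; `band_energy_fraction` is a helper name I made up for illustration.

```python
import numpy as np

def band_energy_fraction(samples, sample_rate, lo_hz, hi_hz):
    """Fraction of the total signal energy that falls between lo_hz and hi_hz."""
    power = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    in_band = (freqs >= lo_hz) & (freqs <= hi_hz)
    return power[in_band].sum() / power.sum()

fs = 48_000                      # assumed sample rate of the recording
nyquist_hz = fs / 2              # highest frequency this sample rate can represent

# Synthetic stand-in for a capture: a 17.5 kHz tone buried in a bit of noise.
t = np.arange(fs) / fs           # one second of audio
rng = np.random.default_rng(0)
recording = np.sin(2 * np.pi * 17_500 * t) + 0.1 * rng.standard_normal(fs)

near_ultrasound = band_energy_fraction(recording, fs, 16_500, 18_500)
audible = band_energy_fraction(recording, fs, 0, 16_000)
print(f"Nyquist limit: {nyquist_hz:.0f} Hz")
print(f"energy in 16.5-18.5 kHz: {near_ultrasound:.1%}, below 16 kHz: {audible:.1%}")
```

With a real recording you would load the samples from the exported .wav file instead of synthesizing them.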

The first thing I did was replay the captured signal just to see if that would work. It was worth a try, but G-pay complained about the replayed token, so I can confirm the use of random (or seemingly random) one-time tokens every time it’s activated. It’d be interesting to see just how random these tokens are, but that’s a problem for another time.

If you do this a couple of times, you’ll see that the band keeps flipping around the 18.5 kHz mark, i.e. sometimes your spectrogram has a band at 18.5–20.5 kHz instead.

Single sideband modulation

This is sort of an indication that the signal may be SSB (single sideband) filtered, i.e. only one sideband of the signal (the lower or the upper) is being transmitted, where one is probably a mirror copy of the other. If our assumption is true, it means 18.5 kHz is our center/carrier frequency.

  1. One way to confirm this is to export a copy of the signal as a .wav file and analyze it in GNU Radio, a tool that’s usually used for ‘software defined radio’ but is really more of a broad-based signal analysis platform.
  2. Sure enough, plotting the signal in the frequency domain via a GNU Radio flowgraph gives you the frequency components in the signal and also validates our earlier observation.
  3. Next step was isolating my signal:

I multiplied the received signal (complex valued) with another signal at the carrier frequency, which has the effect of downshifting the 18.5 kHz carrier to 0 Hz (in other words, you’ve just demodulated/removed the carrier), and then applied a bandpass filter to extract just the part of the spectrum we’re interested in.

GNU Radio flowgraph to extract our baseband
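For anyone without GNU Radio handy, the same downshift-and-filter idea can be sketched in plain NumPy. This is not my actual flowgraph, just an illustration under a few assumptions: a 48 kHz sample rate, a hypothetical 17.5 kHz test tone standing in for the received signal, and a crude FFT-domain brick-wall filter in place of a proper FIR bandpass. The helper names are made up.

```python
import numpy as np

def downshift(samples, sample_rate, carrier_hz):
    """Multiply by a complex exponential so that carrier_hz moves to 0 Hz."""
    t = np.arange(len(samples)) / sample_rate
    return samples * np.exp(-2j * np.pi * carrier_hz * t)

def keep_band(samples, sample_rate, lo_hz, hi_hz):
    """Crude brick-wall bandpass: zero every FFT bin outside [lo_hz, hi_hz]."""
    spectrum = np.fft.fft(samples)
    freqs = np.fft.fftfreq(len(samples), d=1.0 / sample_rate)
    spectrum[(freqs < lo_hz) | (freqs > hi_hz)] = 0
    return np.fft.ifft(spectrum)

fs = 48_000
carrier_hz = 18_500
t = np.arange(fs) / fs

# Stand-in for the received audio: a tone 1 kHz below the carrier.
received = np.cos(2 * np.pi * 17_500 * t)

# After the shift, the lower sideband sits between -2 kHz and 0 Hz.
baseband = keep_band(downshift(received, fs, carrier_hz), fs, -2_000, 0)

freqs = np.fft.fftfreq(len(baseband), d=1.0 / fs)
peak_hz = freqs[np.argmax(np.abs(np.fft.fft(baseband)))]
print(f"test tone now sits at about {peak_hz:.0f} Hz")
```

The result is complex valued because a real signal multiplied by a complex exponential no longer has a symmetric spectrum, which is exactly what lets us keep one sideband and drop the other.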

Now we have our actual complex valued signal (this is our baseband). Next step was to figure out the type of modulation. This took me a while.

  • Some hard-earned Gyan: always operate on the premise that your signal(s) will never look like the ones from a textbook (or in my case online tutorial videos) due to a number of reasons (noise, Doppler, time delay etc.). Welcome to digital signal processing!

Top: received signal. Bottom: filtered complex-valued signal.

After a day or two of bugging Google for help, I found a way to add more fine-grained resolution to my FFT plot. The tool SDR-sharp includes a hobbyist spectrum analyzer that provides an easy way to visualize your spectrum with a lot more detail. If you look closely at this plot, you’ll see a series of distinct bumps between -18 kHz and -16.5 kHz. Zooming in reveals multiple frequencies spread across a 1.5 kHz wide band. I ran this a couple of times and counted the peaks; turns out there are 63 of them, i.e. 63 distinct frequencies. Now that’s interesting. This could be MFSK, or multiple frequency-shift keying, modulation. But how do you know?

Possible MFSK modulation

Only one way to find out: test the theory. What do we know? MFSK is a modulation scheme that uses a set of several (not just 2) frequencies to transmit data. The frequencies/tones have constant spacing, i.e. each tone is equidistant from the next one and the one before it. In other words, if we have a signal with 63 tones, each tone represents a symbol, and transmitting one of these tones means we’ve transmitted 1 data symbol. In plain binary FSK a symbol carries a single ‘1’ or ‘0’ bit, but with MFSK you can pack more bits into each tone, i.e. a short burst of one of the 63 frequencies could represent 1 symbol worth several bits of data (log2(63), so just under 6). Applying this knowledge to our waveform analysis, things start to make sense. Our waveform is a 63-ary frequency shift keying (FSK) modulation:

  • with 63 orthogonal tones (or frequencies)
  • spaced 23.6 Hz apart (with the first tone at 16510.99 Hz),
  • The tone spacing in MFSK is usually equal to the baud rate or symbol rate (or an integer multiple of it), i.e. the baud rate here is most likely 23.6 symbols per second.
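To make those numbers concrete, here is a small sketch that builds the tone table from the three measured values above (63 tones, 23.6 Hz spacing, first tone at 16510.99 Hz). The one assumption baked in is that the symbol rate equals the tone spacing; the variable names are mine.

```python
import math

NUM_TONES = 63
FIRST_TONE_HZ = 16510.99
TONE_SPACING_HZ = 23.6

# Frequency of each of the 63 equally spaced tones.
tone_freqs_hz = [FIRST_TONE_HZ + k * TONE_SPACING_HZ for k in range(NUM_TONES)]

# Span from the first to the last tone: this should match the ~1.5 kHz wide band.
occupied_hz = tone_freqs_hz[-1] - tone_freqs_hz[0]

# Upper bound on the bits one tone can carry if all 63 tones are equally likely.
bits_per_symbol = math.log2(NUM_TONES)

# Assumption: symbol rate == tone spacing, so one symbol lasts 1 / 23.6 seconds.
symbol_duration_ms = 1000 / TONE_SPACING_HZ

print(f"last tone: {tone_freqs_hz[-1]:.2f} Hz, span: {occupied_hz:.1f} Hz")
print(f"{bits_per_symbol:.2f} bits/symbol, {symbol_duration_ms:.1f} ms per symbol")
```

The span works out to roughly 1463 Hz, which lines up with the 1.5 kHz wide band seen in the spectrum plot.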

Now how do you demodulate a random MFSK signal? A traditional MFSK receiver employs a bank of matched filters with center frequencies tuned to each of the N tones. In my case, I already have a digitized signal, so we can make use of DSP techniques:

  • Perform an FFT on the 1.5 kHz slice.
  • Threshold the magnitudes of the FFT frequency bins associated with the 63 tones. I was expecting to see which tones exceed a prescribed threshold at a given instant and use that info to predict the corresponding symbol(s).
  • At this point, I was pretty excited and did manage to get something, but the data didn’t seem to make any sense (I wasn’t sure what to make of it).

Oh, and if you’re wondering if this is normal: yes, scratching your head is part of the process, and in fact that’s the only part you’re sure about.

Direct MFSK demodulation doesn’t yield much
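For reference, the FFT-based tone detection idea looks roughly like the sketch below. It is a simplification of what I tried, with some loud assumptions: the audio is analyzed at the original 48 kHz rate rather than on the downshifted slice, the symbol boundaries are already known (on the real signal they are not), and the strongest of the 63 bins is picked instead of applying a threshold. The symbol being decoded is synthetic.

```python
import numpy as np

FS = 48_000
NUM_TONES = 63
FIRST_TONE_HZ = 16510.99
TONE_SPACING_HZ = 23.6
TONE_FREQS_HZ = FIRST_TONE_HZ + TONE_SPACING_HZ * np.arange(NUM_TONES)
SAMPLES_PER_SYMBOL = round(FS / TONE_SPACING_HZ)   # assumes symbol rate == tone spacing

def detect_tone(symbol_samples, zero_pad=8):
    """Return the index of the strongest of the 63 candidate tones."""
    nfft = zero_pad * len(symbol_samples)           # zero-padding gives a finer bin grid
    spectrum = np.abs(np.fft.rfft(symbol_samples, n=nfft))
    bins = np.rint(TONE_FREQS_HZ * nfft / FS).astype(int)  # nearest bin for each tone
    return int(np.argmax(spectrum[bins]))

# Synthetic test symbol: tone number 17 plus noise (not real captured data).
rng = np.random.default_rng(1)
t = np.arange(SAMPLES_PER_SYMBOL) / FS
sent = 17
symbol = np.sin(2 * np.pi * TONE_FREQS_HZ[sent] * t) + 0.5 * rng.standard_normal(len(t))

detected = detect_tone(symbol)
print(f"sent tone {sent}, detected tone {detected}")
```

This works nicely on a clean, aligned, made-up symbol. On the real capture, the same approach produced data that made no sense, which was the first hint that plain MFSK wasn't the whole story.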

Another day goes by and I still can’t seem to find a pattern or make sense of the data. So, I went back to acquire some more OSINT (open source intelligence). If you think about it, intelligence gathering is essentially what we’re doing, and we’ll have to do this over and over as we acquire new pieces of information.

So, OSINT is your only friend when nothing works

Just when I felt like this was going to take a lot longer than I expected, I found a tiny talk on YouTube about ‘Google Nearby’, where 8 minutes into the talk, the presenter mentions the word DSSS. Oh boy … what’s that now… never heard of it. This little piece of information, plus the fact that I was getting nowhere, was reason enough for me to take a detour, and it changed the course of my investigation.

DSSS, or direct sequence spread spectrum, is a technique that spreads your signal over a wider bandwidth. In simple terms, instead of encoding a ‘bit’ of information directly in the transmitted signal, you send many redundant bits that represent the same bit, inflating the transmission bandwidth. The purpose of doing this: it makes your signal resilient to noise and interference.

Makes sense as G-pay uses sound (and not RF) and will be used in noisy environments, containing random pockets of interference.

After spending a couple of hours scouring the internet for random bits of information on the topic, I figured out the basics, and boy was I relieved to learn that I won’t have to throw out everything I did up until this point and start from scratch. Turns out DSSS is just an additional step. Adding this piece of new information to our analysis gives us a more complete picture of the signal. The received signal is actually a composition of 3 distinct signals:

  1. Data signal: let’s call this d(t)
  2. Code signal: let’s call this c(t)
  3. High-frequency carrier: sin(2πFt) (this is just a sinusoid with a frequency F of 18.5 kHz)

So essentially the received signal looks something like this:

  1. Received signal: y(t) = sin(2πFt) * c(t) * d(t), where F = carrier frequency, c(t) = code signal, d(t) = data signal
  2. Extracted signal: b(t) = c(t) * d(t), after downshifting and filtering

DSSS — just the basics

The new component here is the code signal which contains a spreading code.

  • What’s a spreading code?
  • Every bit in our data signal is multiplied with a pre-arranged code (a sequence of bits, but they are called chips so that we can distinguish them from actual data bits), resulting in a signal with a higher bitrate.
  • So in effect every bit (1 or 0) of data is represented by a bunch of chips.

DSSS chips represent data (image taken from Michael Ossmann’s talk at RECON 2017)
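Here is a toy sketch of spreading and de-spreading to show why the redundancy helps. Everything in it is made up for illustration: the 7-chip code is hypothetical (the real one is much longer and unknown to me at this point), bits are mapped to +1/-1, and the receiver is assumed to already be perfectly aligned with the chips.

```python
import numpy as np

# Hypothetical 7-chip spreading code in +1/-1 form (NOT the code G-pay uses).
code = np.array([1, 1, 1, -1, -1, 1, -1])

def spread(bits, code):
    """Replace every data bit with the whole chip sequence, sign-flipped for a 0."""
    symbols = 2 * np.asarray(bits) - 1               # bit 1 -> +1, bit 0 -> -1
    return np.repeat(symbols, len(code)) * np.tile(code, len(bits))

def despread(chips, code):
    """Correlate each chip-length block with the code and take the sign."""
    blocks = chips.reshape(-1, len(code))
    return (blocks @ code > 0).astype(int)

bits = [1, 0, 1, 1, 0]
tx = spread(bits, code)                               # 5 bits become 35 chips

# Add noise to every chip; individual chips get corrupted, the block sums survive.
rng = np.random.default_rng(2)
rx = tx + 0.5 * rng.standard_normal(len(tx))

recovered = despread(rx, code)
print("sent     :", bits)
print("recovered:", recovered.tolist())
```

Each bit is decided from the sum of 7 noisy chips rather than a single sample, so the noise largely averages out. A longer code buys more of that averaging at the cost of more bandwidth.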

If you haven’t already figured it out, the code signal runs at a higher rate than the data signal. Multiplying a high-rate code signal with a low-rate data signal results in a signal at a similarly high rate. In our case, it’s a 3 kHz signal. Recollect that we observed a 1.5 kHz wide band of 63 frequencies in our FFT plot; considering this is an SSB filtered signal (i.e. we only see one half of the full band), we can make the assumption that we are dealing with a 3 kHz bandwidth. So our baseband b(t) is actually a 3 kHz signal. Finally, we need to figure out the number of chips or chip length of our spreading code (also called a PN sequence) and the actual sequence in use. Summing up, we know:

  • DSSS signals usually use what are called maximal length PN sequences, whose length is of the form 2^k - 1 (where k is an integer) and which are generated via an LFSR, or linear feedback shift register. The specifics don’t really matter that much, except that we are looking for a particular type of PN sequence. This narrows it down a bit.
  • PN sequences are supposed to look like random strings of 1’s and 0’s but are NOT, and they possess some useful properties (like balance, run length and autocorrelation). Correlation is a way to test how well a given signal matches another, i.e. the degree of similarity.
  • Correlation, or code acquisition and tracking as it’s called in DSSS, is kind of the most important step in synchronizing with a transmitted signal at the receiving end before you can de-spread data and start reading your symbols/bits, because the time at which the signal begins is unknown to us (the receiver).
  • So we’re going to have to correlate our code signal with the received signal until we find a good match. When that happens, it means we’ve aligned with (acquired) the position of the individual chip sequences within the received signal and we can now start de-spreading our data.
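The points above can be sketched in a few lines: generate a maximal length sequence with an LFSR, check its autocorrelation, and use correlation to find an unknown start offset. Big caveat: the feedback taps below (corresponding to the polynomial x^7 + x + 1, which gives a 127-chip sequence) are purely my assumption for illustration. The sequence G-pay actually uses is still unknown, and the "received" signal here is synthetic.

```python
import numpy as np

def m_sequence(degree=7, taps=(0, 1)):
    """Fibonacci LFSR: next bit = XOR of the tapped state positions.

    With degree=7 and taps (0, 1) the recurrence is a[n+7] = a[n] ^ a[n+1],
    i.e. the polynomial x^7 + x + 1 (an assumed choice, not Google's).
    """
    state = [1] * degree                    # any non-zero seed works
    bits = []
    for _ in range(2 ** degree - 1):        # one full period: 2^k - 1 chips
        bits.append(state[0])
        feedback = 0
        for tap in taps:
            feedback ^= state[tap]
        state = state[1:] + [feedback]
    return np.array(bits)

def circular_correlation(received, code):
    """Correlate the received chips against every cyclic shift of the code."""
    return np.array([np.dot(received, np.roll(code, lag)) for lag in range(len(code))])

code_bits = m_sequence()
chips = 1 - 2 * code_bits                   # map bit 0 -> +1, bit 1 -> -1

# Autocorrelation property: one sharp peak at lag 0, flat everywhere else.
autocorr = circular_correlation(chips, chips)

# Code acquisition: the receiver doesn't know where the sequence starts.
rng = np.random.default_rng(3)
true_offset = 40
received = np.roll(chips, true_offset) + 0.5 * rng.standard_normal(len(chips))
found_offset = int(np.argmax(circular_correlation(received, chips)))

print(f"length {len(chips)}, ones {int(code_bits.sum())}, zeros {int((1 - code_bits).sum())}")
print(f"autocorrelation peak {autocorr[0]}, off-peak values {set(autocorr[1:].tolist())}")
print(f"true offset {true_offset}, found offset {found_offset}")
```

That single sharp autocorrelation peak is the whole reason these sequences are used: only the correctly aligned shift stands out, even with noise on every chip.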

With this information, I went back to do some more OSINT and this time I hit the jackpot: a patent document in Google’s patent repository that contains pretty much everything I needed, including the chip length (and it validates all of our findings so far). All that’s missing is the chip sequence. https://patents.google.com/patent/US9319096

We now have the final (hopefully) possible description for our signal. It’s a product of 2 types of modulation:

  1. MFSK modulation of a DSSS code signal with a 127-chip spreading code or chip sequence (127 = 2^7 - 1, so it fits the maximal length form), i.e. b(t)
  2. which in turn single-sideband modulates a sinusoidal carrier, i.e. y(t)

The numbers also seem to add up now: bandwidth / chip length, i.e. 3 kHz / 127, gives us 23.62, which matches the symbol rate we discovered by measuring the spacing between frequencies.
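A quick sanity check of that arithmetic, using only numbers already mentioned in this post (the 3 kHz bandwidth is itself an inference from the SSB assumption, so treat it as such):

```python
bandwidth_hz = 3000            # assumed full bandwidth (2 x the 1.5 kHz sideband)
chip_length = 127              # chips per symbol, from the patent
measured_spacing_hz = 23.6     # tone spacing measured from the spectrum plot

symbol_rate = bandwidth_hz / chip_length
relative_error = abs(symbol_rate - measured_spacing_hz) / measured_spacing_hz

print(f"predicted symbol rate: {symbol_rate:.2f} per second")
print(f"difference from measured spacing: {relative_error:.2%}")
```

The prediction and the measurement agree to within a fraction of a percent, which is about as good as I could hope for from counting peaks on a plot.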

This means we can move on to the next step of retrieving the chip sequence, performing the correlation operation to check for similarity, and doing some clock recovery to get our pairing code bits (remember the 8 digit token? That’s the whole point of all of this).

Needless to say, this turned out to be one hell of a journey (from false starts to tears of joy). I split this story into 2 parts. Although I don’t see any obvious security concerns here, I want to be sure that I’m not giving out information that’s deemed ‘sensitive’ (or whatever the EULA verbiage is) by Google. Stay tuned for part 2.

PS: Anyone who wants to help with this little project or throw some light on better ways of doing this, leave a comment or hit me up on Twitter @npashi.

Also, if you’ve managed to read through the entire post (yeah I know, it’s a long one), it should be clear that there aren’t any obvious security concerns here, considering G-pay only uses this mode of ultrasonic communication to discover and pair parties, while exchanging payee information still requires an internet connection. But it does confirm one thing: Google has access to and stores personal information in the cloud (at the very least a verified phone number for you).