Get Secret Message from an Audio File

CurlS
Published in Analytics Vidhya
7 min read, Oct 12, 2019

Audio challenges are quite common in CTFs. I recently solved one that hid a secret text in a WAV file, and I want to summarize my steps and learnings in this post, also as a reference for myself for future challenges. I had never invested time in audio topics before, so I hope the following can be of interest to others, too.

Photo by Alphacolor on Unsplash

When I googled for flags or secret texts in audio files, I mostly found recommendations such as: (1) use Audacity or Sonic Visualiser, (2) check the waveform and spectrum for hints, (3) LSB might have been used to hide text, or (4) maybe the infra- or ultrasound range is used to transmit a secret text. Woohoo, that sounds very easy, but when doing it for the first time, it can get very interesting.

Before I dive into the challenge itself, let’s see what the WAV format looks like.

WAV Files

I will not go much into detail, but out of curiosity I wanted to understand more about the wav file format and found a great source.

A WAV file starts with a header that holds the information needed to play the audio (such as frames per second, bits per sample, number of channels, etc.), followed by a sequence of data bytes. To extract the metadata one can use various commands, such as exiftool, file, mediainfo or others:

# file audio.wav
audio.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, stereo 48000 Hz
# exiftool audio.wav
ExifTool Version Number : 11.65
File Name : audio.wav
File Size : 1183 kB
File Type : WAV
MIME Type : audio/x-wav
Encoding : Microsoft PCM
Num Channels : 2
Sample Rate : 48000
Avg Bytes Per Sec : 192000
Bits Per Sample : 16
Duration : 6.31 s

If someone wants to have a look at the hexdump:

What one can see from the hexdump below (the byte order is little-endian) is that RIFF (Resource Interchange File Format) acts as a wrapper for the WAV format. RIFF can store many kinds of data, foremost multimedia data like audio and video. As we will see below, it is based on chunks and subchunks.

0x52 0x49 0x46 0x46 stands for RIFF
0x04 0x7a 0x12 0x00 refers to the Chunk Size (1210884)
0x57 0x41 0x56 0x45 stands for WAVE
0x66 0x6d 0x74 0x20 refers to the fmt subchunk ("fmt" followed by a space)
0x10 0x00 0x00 0x00 Subchunk Size = 16
0x01 0x00 AudioFormat=1 -> PCM (Pulse Code Modulation)
0x02 0x00 NumChannels = 2

0x64 0x61 0x74 0x61 stands for data, the identifier of the sound data subchunk
0xe0 0x79 0x12 0x00 stands for the Subchunk Size (1210848)

The data identifier indicates that the data chunk is coming next.
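To make the byte layout concrete, here is a minimal sketch that decodes the fields listed above with Python's struct module. It assumes the canonical 44-byte PCM header with no optional chunks between the fmt and data subchunks. The header is rebuilt from the hex values above; the sample rate, byte rate and bit depth are taken from the exiftool output, and the block align of 4 is derived (2 channels times 2 bytes). The name parse_wav_header is a hypothetical helper, not part of any library.

```python
import struct

# Header bytes as listed above (RIFF, chunk size, WAVE, "fmt ", 16, PCM, 2 channels).
header = bytes.fromhex(
    "52 49 46 46 04 7a 12 00 57 41 56 45 "
    "66 6d 74 20 10 00 00 00 01 00 02 00"
)
# Assumption: remaining fmt fields filled in from the exiftool output
# (sample rate, byte rate, block align = 2 channels * 2 bytes, bits per sample).
header += struct.pack("<IIHH", 48000, 192000, 4, 16)
# "data" identifier and data subchunk size as listed above.
header += bytes.fromhex("64 61 74 61 e0 79 12 00")


def parse_wav_header(raw):
    """Decode a canonical 44-byte PCM WAV header (hypothetical helper)."""
    fields = struct.unpack("<4sI4s4sIHHIIHH4sI", raw[:44])
    names = (
        "riff_id", "chunk_size", "wave_id", "fmt_id", "fmt_size",
        "audio_format", "channels", "sample_rate", "byte_rate",
        "block_align", "bits_per_sample", "data_id", "data_size",
    )
    return dict(zip(names, fields))


info = parse_wav_header(header)
for name, value in info.items():
    print(f"{name:16} {value}")
```

The "<" prefix tells struct to read every multi-byte field as little-endian, which is why the byte sequence 04 7a 12 00 turns into a single integer. Real files may carry extra chunks, so a robust parser would walk the chunk list instead of assuming fixed offsets.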

So we know that a WAV file consists of several chunks of data. Each chunk tells something about the data in the file. There is the format chunk where the metadata resides, and the data chunk with the actual audio data. For a clear view of the data chunk, see the image below. The data chunk consists of samples, split into right and left channels:

a left and right channel form a sample frame

There might be optional chunk types included in a wave format. Please refer to one of the references below for a deeper insight.

The above sample has two channels, which stands for “Stereo”, meaning that it consists of two different sound waves that are played at the same time. One sound wave goes to the left speaker, the other to the right speaker. The sample rate tells how many sample frames exist for each second of audio. The file has a sample rate of 48000, which means that 48,000 frames are used to create 1 second of sound.
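The header values are related by simple arithmetic, which is a handy sanity check when a file looks suspicious. The short sketch below recomputes the byte rate, frame count and duration from the channel count, sample width, sample rate and the data subchunk size bytes (e0 79 12 00, read as little-endian); the variable names are chosen for illustration only.

```python
channels = 2            # stereo
sample_width = 2        # bytes per sample (16 bit)
frame_rate = 48000      # sample frames per second
data_bytes = 0x001279E0 # data subchunk size, bytes e0 79 12 00 in little-endian

bytes_per_frame = channels * sample_width   # one left + one right sample
byte_rate = frame_rate * bytes_per_frame    # matches "Avg Bytes Per Sec"
n_frames = data_bytes // bytes_per_frame    # number of sample frames
duration = n_frames / frame_rate            # seconds of audio

print(byte_rate, n_frames, round(duration, 2))  # 192000 302712 6.31
```

The results agree with the exiftool output (192000 bytes per second, 6.31 s), so the header is internally consistent.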

Solving the challenge

As with image files, steganography might be used to embed a secret message/flag in the (meta)data, so a quick win is to use tools like exiftool, strings, mediainfo, file and binwalk. Other tools include Audacity or Sonic Visualiser, which might reveal text encoded in the audio waveform or spectrogram.

Based on the above hexdump, it is clear that the file format is correct and that the file type has not been manipulated.

Step 1: The basic quick win commands

  1. exiftool -> see the output above
  2. strings -> strings audio.wav | awk 'length($0)>8' -> nothing interesting
  3. mediainfo -> same as exiftool (use one or the other)
  4. binwalk -> no interesting information
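For readers who prefer to stay in Python, the strings and awk combination from the list above can be approximated with a regular expression. This is only a rough sketch: long_strings is a hypothetical helper, it looks at printable ASCII only, and its results may differ slightly from GNU strings. The sample bytes are synthetic; for the challenge one would read audio.wav in binary mode instead.

```python
import re


def long_strings(data, min_len=9):
    """Return printable ASCII runs of at least min_len characters.

    min_len=9 mirrors the awk filter 'length($0)>8'.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


# Synthetic example data (not taken from the challenge file).
sample = b"\x00\x01RIFF\xff\xf4hello hidden text\x00abc\x03"
print(long_strings(sample))

# For a real file (assumes audio.wav is in the current directory):
# with open("audio.wav", "rb") as handle:
#     print(long_strings(handle.read()))
```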

Step 2: Usage of known Sound Visualization Tools

Sonic Visualiser
I opened the audio sample in Sonic Visualiser and analyzed the frequency spectrum and waveform for a hidden text/flag with the common approaches (tweaking the brightness/contrast, etc.). As the sample has two audio channels, we also see two waveforms. I am not very familiar with drawing conclusions from a waveform, and since I did not find a hint here, the search went on.

waveform of audio.wav
Spectrogram function of audio.wav

Step 3: LSB Analysis

OK, the challenge said that one should listen with care. Maybe this is a hint not to use the classic techniques such as spectrum analysis, as these methods do not introduce noise into the signal. I also should have noticed the strange pattern in the hexdump:

hexdump of audio file

Let’s check whether the least significant bits (LSB) are used to hide a flag or secret text. The LSB algorithm is a classic steganography method.

LSB algorithm replaces the LSB of each Byte

If one replaces the LSB of each byte in the data, it is possible to embed a secret message. So the next approach is to extract the data and read the LSB of each byte. Let’s check whether a secret text can be reconstructed that way. If not successful, maybe only every 2nd or 3rd byte is used to hide a bit of the secret message.
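To illustrate the idea, here is a minimal round-trip sketch of LSB embedding and extraction on a list of integer carrier values (bytes or 16-bit samples). It is not the method used by the challenge author, which is unknown; embed_lsb and extract_lsb are hypothetical helpers, the carrier values are synthetic, and the bit order (most significant bit of each character first) is an assumption.

```python
def embed_lsb(samples, message):
    """Hide the bits of message (bytes) in the LSBs of samples (ints)."""
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(samples):
        raise ValueError("message does not fit into the carrier")
    stego = list(samples)
    for index, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the message bit.
        stego[index] = (stego[index] & ~1) | bit
    return stego


def extract_lsb(samples, n_bytes):
    """Read n_bytes back from the LSBs of samples."""
    bits = "".join(str(value & 1) for value in samples[:n_bytes * 8])
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


# Synthetic carrier: 32 unsigned 16-bit values.
carrier = [0xFFF4, 0xFFF1, 0x0003, 0xFFFD] * 8
stego = embed_lsb(carrier, b"CTF")
print(extract_lsb(stego, 3))
# Each value changes by at most 1, which is why the change is hard to hear.
print(max(abs(a - b) for a, b in zip(carrier, stego)))
```

Because only the lowest bit of each value is touched, the audio differs by at most one quantization step per carrier value.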

For that approach I used the Python wave library. There are other libraries such as PySoundFile, scipy.io.wavfile, etc. I might try out another library next time.

Let’s get the metadata first:

#!/usr/bin/python
import wave
wav= wave.open("audio.wav", mode='rb')
print (wav.getparams())
Output
_wave_params(nchannels=2, sampwidth=2, framerate=48000, nframes=302712, comptype='NONE', compname='not compressed')

Nothing new here (number of channels = 2, sample width in bytes = 2, sampling frequency = 48000, number of audio frames = 302712, no compression). As a next step, let’s get the first frames:

# Read frames into a byte array
frame_bytes = bytearray(wav.readframes(wav.getnframes()))
print(frame_bytes[:100])
Output
bytearray(b'\xf4\xff\xf1\xff\x03\x00\xfd\xff\xea\xff\xf5\xff\x00\x00\x00\x00\xf9\xff\xfd\xff\xf1\xff\xf1\xff\xfc\xff\x00\x00\x00\x00\xfc\xff\xf8\xff\xf9\xff\xf9\xff\xf5\xff\xf5\xff\xf9\xff\xf1\xff\xf4\xff\x01\x00')

If we compare this data with the hexdump above, we can see that the bytes are the same; this is the chunk data that may have been manipulated. Each sample has a width of 16 bits. The next step is to extract the LSB of each 16-bit sample (the script unpacks the bytes into 16-bit values first). For better understanding, see my comments directly in the code below.

import wave
import struct

# Read all frames of the audio file as raw bytes
with wave.open("audio.wav", mode='rb') as wav:
    frame_bytes = wav.readframes(wav.getnframes())

# Interpret the bytes as little-endian unsigned 16-bit samples
shorts = struct.unpack('<' + 'H' * (len(frame_bytes) // 2), frame_bytes)

# Get all LSBs as one string of '0' and '1'
extractedLSB = "".join(str(sample & 1) for sample in shorts)

# Divide the bit string into blocks of eight bits,
# convert each block to a character and join them back into a string
string_blocks = (extractedLSB[i:i+8] for i in range(0, len(extractedLSB), 8))
decoded = ''.join(chr(int(block, 2)) for block in string_blocks)
print(decoded[:500])

Unfortunately, this gave me gibberish output:

tð~ÿl~7|÷Nd~çf_o{7>÷nb|2|ý~ö>ÿ?n.&_)Z§6nf~cz÷~s_rlòN>o|ýZ¼=Mx5|M=~{sNlf|g>v|ã{b>ç{o>O~§~º^?nb~S~ö~ÃvlöNfo~W~6l$>V~ÿjF~szç=Wó>¿�r."{T^ux=bÿYJ,fXÇ<ü~m~çxv^<R}W|þvN&}wV.f~öze^J|ÿj~wnF~w=vndzt^û~ô~ÿJ^Sn$>×>G{^Þ>Gn&%:ö|çye7~eNþNf3w?Vl&~7|Ü^ç>³Jb~A6nf>÷~Ç~º~§^Õ&_>~s~¾~å^#~ón¶nf{1~ç{onf|þ~ÿo}Vn?w~R
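When trying several extraction patterns, it can help to score each result instead of eyeballing it. The sketch below computes the share of printable ASCII characters in a decoded string; printable_ratio is a hypothetical helper, and the heuristic assumes the hidden message is plain ASCII text. A score of exactly 1.0 over a long prefix is a strong hint that the pattern is right.

```python
import string

_PRINTABLE = set(string.printable)


def printable_ratio(text):
    """Return the fraction of characters in text that are printable ASCII."""
    if not text:
        return 0.0
    return sum(ch in _PRINTABLE for ch in text) / len(text)


print(printable_ratio("Lorem ipsum dolor sit amet"))  # plain text
print(printable_ratio("tð~ÿl~7|÷Nd"))                 # start of the output above
```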

I played around a bit with the script and tried every 2nd and 3rd frame; the version below worked (it alternates between the left and right channel).

import wave
import struct

with wave.open("audio.wav", mode='rb') as wav:
    frame_bytes = wav.readframes(wav.getnframes())
shorts = struct.unpack('<' + 'H' * (len(frame_bytes) // 2), frame_bytes)

# Samples are interleaved: even indices are left, odd indices are right
extracted_left = shorts[::2]
extracted_right = shorts[1::2]

# Take one LSB per frame, alternating between left and right channel
extractedLSB = ""
for i in range(len(extracted_left)):
    sample = extracted_left[i] if i % 2 == 0 else extracted_right[i]
    extractedLSB += str(sample & 1)

string_blocks = (extractedLSB[i:i+8] for i in range(0, len(extractedLSB), 8))
decoded = ''.join(chr(int(block, 2)) for block in string_blocks)
print(decoded[0:500])

And we got the secret text (yes, there is some gibberish data at the end; we would have to adapt the code a little bit for that…)

python3 audio_stego.py
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.¿wÛñ}û<ÿÿ¿¿÷¾þÿÏþ{¯ï¿÷Û^×û¿óÚÿ¯ÿýºøí·~w½÷w¿n·ûÿÏÿÿ¿ÿ
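One possible way to drop the trailing gibberish is to cut the decoded string at the first character that is not printable ASCII. This is a sketch under the assumption that the secret text is plain ASCII (a message containing other characters would be cut too early); printable_prefix is a hypothetical helper name.

```python
import string

_PRINTABLE = set(string.printable)


def printable_prefix(text):
    """Return text up to (not including) the first non-printable character."""
    for index, ch in enumerate(text):
        if ch not in _PRINTABLE:
            return text[:index]
    return text


# Tail of the decoded output above, used as a small example.
print(printable_prefix("mollit anim id est laborum.¿wÛñ}û<ÿÿ"))
```

In the script this would replace the final print with print(printable_prefix(decoded)).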

If that had not worked, the next step would have been to look at frequency modulation, as this method does not introduce noise into the signal either.

Yes, some of this is out of scope for solving the challenge. But in the end I learned something new, gained some knowledge about the WAV file format and used a new library to solve the challenge.

Input, comments or feedback are very much appreciated.

Working in Infosec. Interested in many things, from technical perspective -> security, ctfs, coding, reverse engineering,… and in general -> love life. She.