Get Secret Message from an Audio File
Audio challenges are quite common in CTFs. I recently solved one that hid a secret text in a WAV file, and I want to summarize my steps and learnings in this post, partly as a reference for myself for future challenges. I had never invested time in audio before, so I hope the following is of interest to others, too.
When I googled for flags or secret texts in audio files, I mostly found recommendations such as: (1) use Audacity or Sonic Visualiser, (2) check the waveform and spectrum for hints, (3) LSB might have been used to hide text, or (4) maybe the infra- or ultrasound range is used to transmit a secret text. Woohoo, that sounds very easy, but when you do it for the first time, it can get very interesting.
Before I dive into the challenge itself, let’s see what the WAV format looks like.
WAV Files
I will not go into much detail, but out of curiosity I wanted to understand more about the WAV file format and found a great source.
A WAV file starts with a header containing the information needed to play the audio (such as frames per second, bits per sample, number of channels, etc.), followed by a sequence of data bytes. To extract the metadata, one can use various commands, such as exiftool, file, mediainfo or others:
# file audio.wav
audio.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, stereo 48000 Hz
# exiftool audio.wav
ExifTool Version Number : 11.65
File Name : audio.wav
File Size : 1183 kB
File Type : WAV
MIME Type : audio/x-wav
Encoding : Microsoft PCM
Num Channels : 2
Sample Rate : 48000
Avg Bytes Per Sec : 192000
Bits Per Sample : 16
Duration : 6.31 s
If someone wants to have a look at the hexdump:
What one can see from the hexdump below (the default byte order is little-endian) is that RIFF (Resource Interchange File Format) acts as a wrapper for the WAV format. RIFF can store many kinds of data, foremost multimedia data such as audio and video. As we will see below, it is based on chunks and subchunks.
- 0x52 0x49 0x46 0x46 stands for RIFF
- 0x04 0x7a 0x12 0x00 refers to the Chunk Size (1210884)
- 0x57 0x41 0x56 0x45 stands for WAVE
- 0x66 0x6d 0x74 0x20 refers to the fmt subchunk
- 0x10 0x00 0x00 0x00 Subchunk Size = 16
- 0x01 0x00 AudioFormat = 1 -> PCM (Pulse Code Modulation)
- 0x02 0x00 NumChannels = 2
- …
- 0x64 0x61 0x74 0x61 stands for data (the sound data)
- 0xe0 0x79 0x12 0x00 stands for SubchunkSize (1210848)
The data identifier indicates that the data chunk is coming next.
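To make the chunk layout a bit more tangible, here is a minimal sketch that reads the same header fields with Python's struct module. The function name parse_wav_header is my own, and the demo builds a tiny silent stereo file in memory instead of using audio.wav, so treat it as an illustration under those assumptions rather than a full RIFF parser.

```python
import io
import struct
import wave


def parse_wav_header(data: bytes) -> dict:
    """Read the RIFF header, then walk the subchunks until 'data' is found."""
    riff_id, riff_size, wave_id = struct.unpack("<4sI4s", data[:12])
    if riff_id != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    info = {"riff_size": riff_size}
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, chunk_size = struct.unpack("<4sI", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + chunk_size]
        if chunk_id == b"fmt ":
            (info["audio_format"], info["channels"], info["sample_rate"],
             info["byte_rate"], info["block_align"],
             info["bits_per_sample"]) = struct.unpack("<HHIIHH", body[:16])
        elif chunk_id == b"data":
            info["data_size"] = chunk_size
            break
        # Chunks are word-aligned: an odd size is followed by one pad byte.
        pos += 8 + chunk_size + (chunk_size & 1)
    return info


# Build a tiny stereo 16-bit file in memory so the sketch runs without audio.wav.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(2)
    w.setsampwidth(2)
    w.setframerate(48000)
    w.writeframes(b"\x00\x00" * 2 * 10)  # 10 silent stereo frames

header = parse_wav_header(buf.getvalue())
print(header)
```

For the challenge file, one would pass the bytes of audio.wav (for example open("audio.wav", "rb").read()) instead of the in-memory buffer.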
So we know that a WAV file consists of several chunks of data. Each chunk tells something about the data in the file. There is the format chunk, where the metadata resides, and the data chunk with the actual audio data. In the data chunk of a stereo file the samples are interleaved: one sample for the left channel, then one for the right channel, and so on.
A WAV file might also include optional chunk types. Please refer to one of the references below for a deeper insight.
The above sample has two channels, which stands for “stereo”, meaning that it consists of two different sound waves that are played at the same time. One sound wave goes to the left speaker, the other to the right speaker. The sample rate gives the number of samples (frames) that exist for each second of audio. This file has a sample rate of 48000, which means that 48000 frames are used to create 1 second of sound.
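The numbers in the exiftool output can be cross-checked with a little arithmetic. The sketch below uses the values shown above and decodes the data subchunk size from the four size bytes in the hexdump (0xe0 0x79 0x12 0x00); the variable names are my own.

```python
# Values from the exiftool output above.
sample_rate = 48000      # frames per second
num_channels = 2
bits_per_sample = 16

# Size of the data subchunk, decoded from its four little-endian size bytes.
data_size = int.from_bytes(bytes([0xE0, 0x79, 0x12, 0x00]), "little")

bytes_per_frame = num_channels * bits_per_sample // 8
byte_rate = sample_rate * bytes_per_frame
num_frames = data_size // bytes_per_frame
duration = num_frames / sample_rate

print(bytes_per_frame)     # 4
print(byte_rate)           # 192000, matches "Avg Bytes Per Sec"
print(num_frames)          # 302712
print(round(duration, 2))  # 6.31, matches "Duration"
```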
Solving the challenge
As with image files, steganography might be used to embed a secret message/flag in the (meta-)data, so a quick win is to use tools like exiftool, strings, mediainfo, file and binwalk. Other tools include Audacity or Sonic Visualiser, which might reveal text encoded in the audio waveform or spectrogram.
Based on the hexdump above, it is clear that the file format is correct and that the file type has not been manipulated.
Step 1: The basic quick win commands
- exiftool -> see the output above
- strings -> strings audio.wav | awk 'length($0)>8' -> nothing interesting
- mediainfo -> same as exiftool (use one or the other)
- binwalk -> no interesting information
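If the strings tool is not at hand, the same idea can be approximated in a few lines of Python. This is only a rough stand-in under my own assumptions (printable ASCII only, a hypothetical helper name printable_strings, and a made-up byte blob for the demo), not a reimplementation of strings.

```python
import re


def printable_strings(data: bytes, min_len: int = 9) -> list:
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.decode("ascii") for m in pattern.findall(data)]


# Made-up bytes for illustration; for the challenge one would read audio.wav.
blob = b"\x00\x01RIFF\xff\xfeflag{not_a_real_flag}\x00\x10abc\x00"
print(printable_strings(blob))  # ['flag{not_a_real_flag}']
```

The default of 9 mirrors the awk filter above, which keeps only lines longer than 8 characters.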
Step 2: Usage of known Sound Visualization Tools
Sonic Visualiser
I opened the audio sample in Sonic Visualiser and analyzed the frequency spectrum and the waveform for a hidden text/flag with the common approaches (tweaking the brightness/contrast, etc.). As the sample has two audio channels, we also see two waveforms. I am not very familiar with drawing conclusions from a waveform, and I did not find a hint here, so the search goes on.
Step 3: LSB Analysis
OK, the challenge said that one should listen with care. Maybe this is a hint not to use the classic techniques such as spectrum analysis, as these methods do not induce noise in the signal. Also, I should have noticed the strange pattern in the hexdump.
Let’s check whether the least significant bits (LSBs) are used to hide a flag or secret text. LSB embedding is a classic steganography method.
If one replaces the LSB of each byte in the data, it is possible to embed a secret message. So the next approach is to extract the data and read the LSB of each byte. Let’s check whether a secret text can be reconstructed that way. If that is not successful, maybe only every 2nd or 3rd byte is used to hide a bit of the secret message.
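To illustrate the idea, here is a small round-trip sketch that hides a few bytes in the LSBs of a list of sample values and reads them back. The function names, the step parameter (1 for every value, 2 for every 2nd, and so on) and the synthetic carrier are my own assumptions; I do not know how the challenge author actually embedded the text.

```python
def embed_lsb(samples, message: bytes, step: int = 1):
    """Return a copy of samples with the message bits written into every step-th LSB."""
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) * step > len(samples):
        raise ValueError("carrier too short for this message")
    out = list(samples)
    for n, bit in enumerate(bits):
        idx = n * step
        out[idx] = (out[idx] & ~1) | bit  # clear the LSB, then set it to the message bit
    return out


def extract_lsb(samples, num_bytes: int, step: int = 1) -> bytes:
    """Read num_bytes back from every step-th LSB, most significant bit first."""
    result = bytearray()
    for b in range(num_bytes):
        value = 0
        for k in range(8):
            value = (value << 1) | (samples[(b * 8 + k) * step] & 1)
        result.append(value)
    return bytes(result)


carrier = [(i * 37) % 65536 for i in range(200)]  # stand-in for 16-bit samples
stego = embed_lsb(carrier, b"CTF", step=2)
print(extract_lsb(stego, 3, step=2))  # b'CTF'
```

Each carrier value changes by at most 1, which is why this kind of embedding is so hard to hear.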
For this approach I used the Python wave library. There are other libraries such as PySoundFile, scipy.io.wavfile, etc. I might try out another library next time.
Let’s get the metadata first:
#!/usr/bin/python
import wave
wav = wave.open("audio.wav", mode='rb')
print(wav.getparams())
Output
_wave_params(nchannels=2, sampwidth=2, framerate=48000, nframes=302712, comptype='NONE', compname='not compressed')
Nothing new here (number of channels = 2, sample width in bytes = 2, sampling frequency = 48000, number of audio frames = 302712, no compression). As the next step, let’s get the first frames:
# Read frames into a byte array
frame_bytes = bytearray(wav.readframes(wav.getnframes()))
print(frame_bytes[:100])
Output
bytearray(b'\xf4\xff\xf1\xff\x03\x00\xfd\xff\xea\xff\xf5\xff\x00\x00\x00\x00\xf9\xff\xfd\xff\xf1\xff\xf1\xff\xfc\xff\x00\x00\x00\x00\xfc\xff\xf8\xff\xf9\xff\xf9\xff\xf5\xff\xf5\xff\xf9\xff\xf1\xff\xf4\xff\x01\x00')
If we compare this data with the hexdump above, we can see that they are the same; this is the chunk data that may have been manipulated. Each sample has a width of 16 bits. The next step is to extract the LSB of each 16-bit sample. For a better understanding, see my comments directly in the code below.
import wave
import struct

# Convert audio to byte array
wav = wave.open("audio.wav", mode='rb')
frame_bytes = bytearray(wav.readframes(wav.getnframes()))
wav.close()

# Interpret the bytes as little-endian unsigned 16-bit samples
shorts = struct.unpack('<' + 'H' * (len(frame_bytes) // 2), frame_bytes)

# Get the LSB of every sample
extractedLSB = "".join(str(sample & 1) for sample in shorts)

# Divide the bit string into blocks of eight bits,
# convert each block to a character and join them back into a string
string_blocks = (extractedLSB[i:i+8] for i in range(0, len(extractedLSB), 8))
decoded = ''.join(chr(int(char, 2)) for char in string_blocks)
print(decoded[:500])
Unfortunately, this gave me gibberish output:
tð~ÿl~7|÷Nd~çf_o{7>÷nb|2|ý~ö>ÿ?n.&_)Z§6nf~cz÷~s_rlòN>o|ýZ¼=Mx5|M=~{sNlf|g>v|ã{b>ç{o>O~§~º^?nb~S~ö~ÃvlöNfo~W~6l$>V~ÿjF~szç=Wó>¿�r."{T^ux=bÿYJ,fXÇ<ü~m~çxv^<R}W|þvN&}wV.f~öze^J|ÿj~wnF~w=vndzt^û~ô~ÿJ^Sn$>×>G{^Þ>Gn&%:ö|çye7~eNþNf3w?Vl&~7|Ü^ç>³Jb~A6nf>÷~Ç~º~§^Õ&_>~s~¾~å^#~ón¶nf{1~ç{onf|þ~ÿo}Vn?w~R
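Instead of judging output like this by eye, one could score a few candidate ways of picking samples by how much of the decoded prefix is printable ASCII. The sketch below is only an illustration: the helper names and the candidate patterns are my own, and the demo carrier is synthetic (every LSB set to 1) so that the result is deterministic. A high score is just a hint, since patterns that share bits with the right one can also score well.

```python
import string

PRINTABLE = set(string.printable.encode("ascii"))


def bits_to_bytes(bits) -> bytes:
    """Pack a bit sequence (MSB first) into bytes, dropping an incomplete tail."""
    usable = len(bits) - len(bits) % 8
    return bytes(
        int("".join(map(str, bits[i:i + 8])), 2) for i in range(0, usable, 8)
    )


def printable_ratio(data: bytes) -> float:
    """Fraction of bytes that are printable ASCII (0.0 for empty input)."""
    return sum(b in PRINTABLE for b in data) / len(data) if data else 0.0


def rank_patterns(samples, prefix_bytes: int = 64):
    """Score some hypothetical ways of picking samples from interleaved stereo data."""
    left, right = samples[::2], samples[1::2]
    candidates = {
        "every sample": samples,
        "left only": left,
        "right only": right,
        "alternate left/right per frame": [
            (left if i % 2 == 0 else right)[i]
            for i in range(min(len(left), len(right)))
        ],
    }
    scores = {}
    for name, picked in candidates.items():
        bits = [s & 1 for s in picked[:prefix_bytes * 8]]
        scores[name] = printable_ratio(bits_to_bytes(bits))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Synthetic demo: hide text in the alternating pattern of an all-ones carrier.
message = b"Lorem ipsum"
msg_bits = [(byte >> s) & 1 for byte in message for s in range(7, -1, -1)]
samples = [1] * (2 * len(msg_bits))
for frame, bit in enumerate(msg_bits):
    idx = 2 * frame + frame % 2  # left sample on even frames, right on odd frames
    samples[idx] = (samples[idx] & ~1) | bit

ranking = rank_patterns(samples, prefix_bytes=len(message))
for name, score in ranking:
    print(f"{score:.2f}  {name}")
```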
I played around with the script a bit and tried every 2nd and 3rd frame. The version below worked (it alternates between the left and right channel):
import wave
import struct

wav = wave.open("audio.wav", mode='rb')
frame_bytes = bytearray(wav.readframes(wav.getnframes()))
wav.close()
shorts = struct.unpack('<' + 'H' * (len(frame_bytes) // 2), frame_bytes)

# Stereo samples are interleaved: even indices are left, odd indices are right
extracted_left = shorts[::2]
extracted_right = shorts[1::2]

# Take the LSB of the left sample on even frames and of the right sample on odd frames
extractedLSB = ""
for i in range(0, len(extracted_left)):
    extractedLSB += str(extracted_left[i] & 1) if i % 2 == 0 else str(extracted_right[i] & 1)

string_blocks = (extractedLSB[i:i+8] for i in range(0, len(extractedLSB), 8))
decoded = ''.join(chr(int(char, 2)) for char in string_blocks)
print(decoded[0:500])
And we got the secret text (yes, with some gibberish at the end; we would have to adapt the code a little for that…)
python3 audio_stego.py
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.¿wÛñ}û<ÿÿ¿¿÷¾þÿÏþ{¯ï¿÷Û^×û¿óÚÿ¯ÿýºøí·~w½÷w¿n·ûÿÏÿÿ¿ÿ
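One simple way to get rid of the trailing gibberish would be to cut the decoded string at the first character that is not printable ASCII. This is a sketch under the assumption that the hidden message is plain ASCII text; the helper name trim_to_printable is my own, and leftover noise that happens to start with a printable character would still slip through.

```python
import string

ALLOWED = set(string.printable)


def trim_to_printable(decoded: str) -> str:
    """Cut the decoded text at the first character outside printable ASCII."""
    for pos, ch in enumerate(decoded):
        if ch not in ALLOWED:
            return decoded[:pos]
    return decoded


# Tail of the decoded output above, followed by a few of the noise characters.
sample = "sunt in culpa qui officia deserunt mollit anim id est laborum.\xbfw\xdb\xf1}"
print(trim_to_printable(sample))
```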
If that had not worked, the next step would have been to look into frequency modulation, as this method does not induce noise in the signal either.
Yes, some of this is out of scope for solving the challenge. But in the end I learned something new, gained some insight into the WAV file format and used a new library to solve the challenge.
Input, comments or feedback are very much appreciated.