Do you even need a binary to reverse engineer?

Network Traffic

The handout only contains a single audio_capture.pcap file with UDP packets. Every UDP packet is 70 bytes long and there is no obvious way to tell which audio codec was used.

Naturally, the first step is to extract the payloads of all UDP packets into a single .bin file so they can easily be processed. In Wireshark this can be done with “Follow UDP Stream”: show the data as “Raw” and use “Save as” to write it to a file.

Analyzing the Audio Codec

At this point it helps to know what various audio codecs look like, and this one in particular looks a lot like some kind of ADPCM.

The first 6 bytes in every packet look a lot like a header: three 16-bit little-endian values. The second and third header fields seem to be signed integers, because in the first few packets their values are very close to zero, which makes sense given all the zeros in the rest of the packet. The rest of the packet has to be the compressed audio data.

We took a wild guess that the first header field is a “scale” value and that the second and third header fields are the initial history for a predictor / filter. The rest is then a quantized error signal. How many bits does a sample take? At most 4 bits, because there are some 0x0C and 0xC0 bytes which have to decode to similar, small values in the first few packets. In general, the first few packets seem to contain silence, hence the initial history close to zero and the many zero values in the compressed data.
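To make the 4-bit guess concrete, here is a minimal sketch of how such a byte could be split into two signed samples. The two's complement encoding and the "high nibble first" order are assumptions on our side, and `sext4` and `unpack_nibbles` are hypothetical helper names. Under this reading, 0x0C and 0xC0 each contain one 0 and one -4, i.e. similar and small values.

```c
#include <stdint.h>
#include <stddef.h>

/* Sign-extend the low 4 bits of v (ASSUMPTION: two's complement):
 * 0x0..0x7 -> 0..7, 0x8..0xF -> -8..-1. */
static int sext4(uint8_t v) {
    v &= 0x0F;
    return (v & 0x08) ? (int)v - 16 : (int)v;
}

/* Unpack n bytes into 2*n signed samples.
 * ASSUMPTION: the high nibble is the earlier sample. */
static void unpack_nibbles(const uint8_t *in, size_t n, int *out) {
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = sext4(in[i] >> 4);
        out[2 * i + 1] = sext4(in[i] & 0x0F);
    }
}
```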

The packet format in C syntax:

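typedef struct {
    u16     scale;      /* quantizer scale */
    s16     yn1;        /* older history sample */
    s16     yn0;        /* most recent history sample */
    u8      data[64];   /* 70 - 6 = 64 bytes of compressed samples */
} packet_t;             /* type name is an arbitrary placeholder */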

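A small sketch of how the guessed header could be read from a raw 70-byte packet. Reading the fields byte by byte avoids depending on host endianness or struct padding. The names `packet_view`, `read_u16le` and `parse_packet` are hypothetical, and the field meanings are still only our guess at this point.

```c
#include <stdint.h>
#include <stddef.h>

#define PACKET_SIZE 70
#define HEADER_SIZE 6

/* Parsed view of one packet; field meanings are guesses from the write-up. */
typedef struct {
    uint16_t       scale;  /* guessed quantizer scale */
    int16_t        yn1;    /* guessed older history sample */
    int16_t        yn0;    /* guessed most recent history sample */
    const uint8_t *data;   /* points at the 64 bytes of compressed samples */
} packet_view;

/* Read an unsigned 16-bit little-endian value. */
static uint16_t read_u16le(const uint8_t *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Interpret the first 6 bytes of raw as the header; raw must hold PACKET_SIZE bytes. */
static packet_view parse_packet(const uint8_t *raw) {
    packet_view v;
    v.scale = read_u16le(raw);
    v.yn1   = (int16_t)read_u16le(raw + 2);
    v.yn0   = (int16_t)read_u16le(raw + 4);
    v.data  = raw + HEADER_SIZE;
    return v;
}
```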
Decoding the Audio Data

Now that we guessed it’s ADPCM, we can try to decode the signal. ADPCM works according to a simple formula:

y = compressed_data * scale + predictor

Since we have no idea how the predictor works, we can simply start by setting it to zero. This gives a kind of high-pass filtered signal which already resembles music, so now we know for sure this has to be some form of ADPCM. Unfortunately it’s quite distorted and the words are hard to make out, but one can clearly hear that there is some spoken text, which is probably our flag.

With a better predictor we can turn our error signal into a proper audio signal, so we used some trial and error and eventually came up with this formula:

y = compressed_data * scale + yn0 / 2 + yn1 / 3
yn1 = yn0
yn0 = y
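The update rule above could be sketched for one packet roughly as follows. This is not our exact program: the high-nibble-first order, the clamping to 16 bits and the use of C integer division (which truncates toward zero) are assumptions, and all function names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define HEADER_SIZE 6
#define DATA_SIZE   64

/* Sign-extend a 4-bit two's complement value (ASSUMPTION about the encoding). */
static int sext4(uint8_t v) {
    v &= 0x0F;
    return (v & 0x08) ? (int)v - 16 : (int)v;
}

/* Clamp to the signed 16-bit range (ASSUMPTION: the output is 16-bit PCM). */
static int clamp16(int v) {
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return v;
}

/* Decode one 70-byte packet into 128 samples with the trial-and-error
 * predictor y = d * scale + yn0 / 2 + yn1 / 3. */
static void decode_packet_4bit(const uint8_t *raw, int16_t *out) {
    int scale = raw[0] | (raw[1] << 8);
    int yn1 = (int16_t)(raw[2] | (raw[3] << 8));
    int yn0 = (int16_t)(raw[4] | (raw[5] << 8));

    for (size_t i = 0; i < DATA_SIZE; i++) {
        uint8_t b = raw[HEADER_SIZE + i];
        int d[2] = { sext4(b >> 4), sext4(b & 0x0F) };  /* high nibble first */
        for (int k = 0; k < 2; k++) {
            int y = clamp16(d[k] * scale + yn0 / 2 + yn1 / 3);
            yn1 = yn0;
            yn0 = y;
            out[2 * i + k] = (int16_t)y;
        }
    }
}
```

Setting the two predictor terms to zero in the same loop gives the earlier "predictor = 0" experiment.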

The result is still very distorted, but we can hear our flag now. After one of our team members listened to the distorted rickroll for a long time, we finally got the flag:



Submitting the Flag

One would assume that the flag can be submitted as is, but it turns out: it’s not upper case, it’s not lower case, it’s lower case and converted to the flag format:


Info from the Challenge Author

After the CTF ended, we heard that the codec was in fact 2-bit ADPCM. This makes sense given the 0x03 and 0x0C bytes in the “almost silence” packets: each contains a single 2-bit field with the value 3, which then decodes to -1.

With this extra info, we can decode a much cleaner version of the audio signal. To remove some noise, we chose filter coefficients from the SNES “BRR” codec for a low-pass filter:

y = compressed_2bit * scale + yn0 * 61 / 32 - yn1 * 15 / 16
yn1 = yn0
yn0 = y
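A minimal sketch of this 2-bit variant for one packet, which yields 256 samples per 70-byte packet. It is not the linked program: the order of the four 2-bit fields within a byte (most significant pair first), the clamping and the truncating integer division are assumptions, and the function names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define HEADER_SIZE 6
#define DATA_SIZE   64

/* Sign-extend a 2-bit two's complement value: 0, 1, 2, 3 -> 0, 1, -2, -1. */
static int sext2(uint8_t v) {
    v &= 0x03;
    return (v & 0x02) ? (int)v - 4 : (int)v;
}

/* Clamp to the signed 16-bit range (ASSUMPTION: the output is 16-bit PCM). */
static int clamp16(int v) {
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return v;
}

/* Decode one 70-byte packet into 256 samples using the BRR-style filter
 * y = d * scale + yn0 * 61 / 32 - yn1 * 15 / 16. */
static void decode_packet_2bit(const uint8_t *raw, int16_t *out) {
    int scale = raw[0] | (raw[1] << 8);
    int yn1 = (int16_t)(raw[2] | (raw[3] << 8));
    int yn0 = (int16_t)(raw[4] | (raw[5] << 8));
    size_t n = 0;

    for (size_t i = 0; i < DATA_SIZE; i++) {
        uint8_t b = raw[HEADER_SIZE + i];
        /* ASSUMPTION: bits 7..6 hold the earliest sample of the byte. */
        for (int shift = 6; shift >= 0; shift -= 2) {
            int d = sext2((uint8_t)(b >> shift));
            int y = clamp16(d * scale + yn0 * 61 / 32 - yn1 * 15 / 16);
            yn1 = yn0;
            yn0 = y;
            out[n++] = (int16_t)y;
        }
    }
}
```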

Program code can be found here; the resulting audio file sounds much cleaner.


Although this updated 2-bit decoder already sounds much cleaner, we still don’t know what the real predictor / filter coefficients should look like. We could probably brute-force them: we know the sample values at the beginning of every packet, so we could try different coefficients until the end of each decoded block gets close to the initial history of the next block. However, the challenge author will apparently release the full implementation of the decoder in the future, so we decided not to spend any more time on it.
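For illustration, the brute-force idea could look roughly like the sketch below. We did not run this against the capture. It assumes that packets are contiguous in time, that the predictor has the form (a * yn0 + b * yn1) / 64 with integer a and b, and the same 2-bit layout as guessed before; `final_history` and `search_coeffs` are hypothetical names.

```c
#include <stdint.h>
#include <stddef.h>
#include <limits.h>

#define PACKET_SIZE 70
#define HEADER_SIZE 6

static int sext2(uint8_t v) {
    v &= 0x03;
    return (v & 0x02) ? (int)v - 4 : (int)v;
}

static int clamp16(int v) {
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return v;
}

static int rd_s16(const uint8_t *p) {
    return (int16_t)(p[0] | (p[1] << 8));
}

/* Decode one packet with the candidate predictor (a * yn0 + b * yn1) / 64
 * and return the last two decoded samples. */
static void final_history(const uint8_t *pkt, int a, int b,
                          int *yn1_out, int *yn0_out) {
    int scale = pkt[0] | (pkt[1] << 8);
    int yn1 = rd_s16(pkt + 2);
    int yn0 = rd_s16(pkt + 4);
    for (size_t i = HEADER_SIZE; i < PACKET_SIZE; i++) {
        for (int shift = 6; shift >= 0; shift -= 2) {
            int d = sext2((uint8_t)(pkt[i] >> shift));
            int y = clamp16(d * scale + (a * yn0 + b * yn1) / 64);
            yn1 = yn0;
            yn0 = y;
        }
    }
    *yn1_out = yn1;
    *yn0_out = yn0;
}

/* Try every (a, b) in [-128, 128] and keep the pair whose decoded block ends
 * closest to the header history of the following packet. Returns the summed
 * squared error of the best pair; on ties the first pair found is kept. */
static long long search_coeffs(const uint8_t *stream, size_t n_packets,
                               int *best_a, int *best_b) {
    long long best = LLONG_MAX;
    for (int a = -128; a <= 128; a++) {
        for (int b = -128; b <= 128; b++) {
            long long err = 0;
            for (size_t k = 0; k + 1 < n_packets; k++) {
                const uint8_t *next = stream + (k + 1) * PACKET_SIZE;
                int y1, y0;
                final_history(stream + k * PACKET_SIZE, a, b, &y1, &y0);
                long long e1 = y1 - rd_s16(next + 2);
                long long e0 = y0 - rd_s16(next + 4);
                err += e1 * e1 + e0 * e0;
            }
            if (err < best) {
                best = err;
                *best_a = a;
                *best_b = b;
            }
        }
    }
    return best;
}
```

Silent packets match almost any coefficients, so a search like this would only be meaningful over packets that actually contain signal.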