[Cryptopals, 1.6] Cracking “Vigenère cipher”

This challenge asks you to decrypt a blob encrypted XOR’ing each byte with the corresponding byte of the key. That is, to encrypt the word “foo” using the key “bar”, the procedure to obtain the ciphertext is

(f ^ b), (o ^ a), (o ^ r)

As we can see, in general the key must be extended such to be as long as the plaintext. If we were to encrypt an HTTP response with the key “bar” again, we have to concatenate the key to itself until the length of the plain HTTP response is reached; thus the key must be “barbarbarbarbarbar…”

The vulnerability of this “scheme” is hinted by Matasano itself.

  1. If we know the size of the key |k|, we can compose |k| chunks, each encrypted using a simple substitution cipher — which we can break via frequency analysis.
  2. If we were to encrypt always the same byte, say “AAAAAAAA…”, if we pick a keysize of 5 and compute the average Hamming distance among chunks of the plaintext each keysize long, and by picking a keysize of 4 or 6 we get a higher Hamming distance, it can be that the right keysize is 5 — the Hamming distance of two bit patterns is the number of bits that are different. Now we won’t deal with a plaintext of “AAAAAA…” but intuitively the same principle applies.
    I didn’t tried to prove it but I believe one could forge a specific (plaintext,key) pair such that this analysis is hijacked to wrong conclusions.

So, the idea for solving this challenge is to guess the most likely keysize, collect a keysize number of chunks of characters taken from the ciphertexts then try to solve each of them by bruteforcing the (small) keyspace and perform frequency analysis to distinguish valid (i.e. likely to be in English) plaintexts.

Yay!, how do you distinguish valid plaintext’s from non valid ones? The good news is that we know which are the frequencies of a random English text. The even better news is that someone uploaded this information encoded as json.

Note that I had to add the space character — which we know is as frequent as an ‘e’ — to make the below code working. The above json doesn’t do consider spaces instead.

The part that I missed was how to properly treat bytes that are not in [a-zA-Z]? Should I just say the text is not English just because I’ve seen a \xff? What about ‘#’ then?, ‘this a #hashtag’ should be classified as non English then?

After tinkering a bit and researching a lot (maybe too much, solution was really easy) I’ve found out that punctuation, digits and other “rubbish”, as a whole, should appear for the 8.5% of the text on average.

Also, to compare two frequencies I’ve used a simple difference of two values — taking the absolute value of each and summing up the total.

I know that there’s the famous chi-squared index to use, but I wanted to keep things simple, really straightforward, and under control.

The following code computes a valid score for a plaintext to be English.

That was the hardest part!

For the rest we can just switch to code monkey mode. :3

break_single_byte_XOR() just makes a call to compute_letter_frequencies_match()


% python3 six.py 
b"I'm back and I'm ringin' the bell \nA rockin' on the mike while the fly girls yell \nIn ecstasy in the back of me \nWell that's my DJ Deshay cuttin' all them Z's \nHittin' hard and the girlies goin' crazy \nVanilla's on the mike, man I'm not lazy. \n\nI'm lettin' my drug kick in \nIt controls my mouth and I begin \nTo just let it flow, let my concepts go \nMy posse's to the side yellin', Go Vanilla Go! \n\nSmooth 'cause that's the way I will be \nAnd if you don't give a damn, then \nWhy you starin' at me \nSo get off 'cause I control the stage \nThere's no dissin' allowed \nI'm in my own phase \nThe girlies sa y they love me and that is ok \nAnd I can dance better than any kid n' play \n\nStage 2 -- Yea the one ya' wanna listen to \nIt's off my head so let the beat play through \nSo I can funk it up and make it sound good \n1-2-3 Yo -- Knock on some woo"