Attacking a polyalphabetic cipher with coincidence testing

If random letters are strung out in two rows, one beneath the other, the probability that any letter in the top row will be the same as the corresponding letter in the bottom row is 1/26. When the same thing is done with English text the probabilities will be quite different because not all letters appear with equal probability in normal text.

For example, here are two strings of 100 letters of English text taken at random from Mark Twain's Captain Stormfield's visit to Heaven, with all the blanks and punctuation squeezed out:


wellwhenihadbeendeadaboutthirtyyearsibeguntogetalittleanxiousmindyouihadbeenwhizzingthroughspaceallth
flushacometoccasionallythatwassomethinglikewehaventgotanysuchcometsoursdontbeginonenightiwasswinginga
....................................^.................^^...............^......^............^.........

If the letters were random we would expect to see an average of 3.85 coincidences per 100 letters. Instead we find that there are 6 coincident letters in these two samples. Longer samples would give us number we could be more confident about, but this sample is good enough to illustrate the point.

Now suppose that a message has been enciphered using a polyalphabetic cipher with a period length of 4; for example, a Vigenere cipher with the keyword "HOME". If we were to line up two sections of the ciphertext and count the number of letters that coincided in our sample then we would expect to see lower numbers when the keys for the two sections do not coincide. For example, suppose the keys in our two sections are aligned like this:


     HOMEHOMEHOMEHOMEHOME...
     MEHOMEHOMEHOMEHOMEHO...

If the keys do not coincide then the alphabets used to the encipher the top row are different from those used to encipher the same letter position in the bottom row and we should see random coincidences only.

Now suppose the portions of the message are shifted so that the keys are aligned:


     HOMEHOMEHOMEHOMEHOME...
     HOMEHOMEHOMEHOMEHOME...

Now each letter on the top row is enciphered by the same alphabet as the corresponding letter on the bottom row and we can expect to see the number of coincidences jump up to the kind of numbers we see in normal English text. So if we take two portions of a message encoded with a polyalphabetc cipher, as we shift them over a letter at a time and count the number of coincidences, we should see a jump in the number of coincidences when the key cycles are matched up.

As an example, let's take a look at message in a typical, but long-period polyalphabetic cipher. (This is the ciphertext from the "Bored Guru" Cipher Challenge.) Here are the first 100 letters, with spaces and punctuation squeezed out, followed by the same message starting 10 letters further to the right:


WNXWBCOIDVPQOVEJSKIIVEPEXHBMZFWWMVOZUUBKSBYSWBGGYEZTBEMCTKHUWILLSIYLLQZBKQNIBLZYXUEVNHQCPDFPLTYRPTO
PQOVEJSKIIVEPEXHBMZFWWMVOZUUBKSBYSWBGGYEZTBEMCTKHUWILLSIYLLQZBKQNIBLZYXUEVNHQCPDFPLTYRPTOQAVYXNPXTT
.................................................................^........^......................^.

We count 3 coincidences which means the cycles are probably not lined up. Next we shift the bottom portion of the message one letter to the left and try again:


WNXWBCOIDVPQOVEJSKIIVEPEXHBMZFWWMVOZUUBKSBYSWBGGYEZTBEMCTKHUWILLSIYLLQZBKQNIBLZYXUEVNHQCPDFPLTYRPTO
QOVEJSKIIVEPEXHBMZFWWMVOZUUBKSBYSWBGGYEZTBEMCTKHUWILLSIYLLQZBKQNIBLZYXUEVNHQCPDFPLTYRPTOQAVYXNPXTTS
.......^.^...............................^.......................................................^.

Here we count 4 coincidences, still very close to the expected random average of 3.85, so we try again. Eventually, after shifting the bottom alphabet many more times we arrive at this arrangement:


WNXWBCOIDVPQOVEJSKIIVEPEXHBMZFWWMVOZUUBKSBYSWBGGYEZTBEMCTKHUWILLSIYLLQZBKQNIBLZYXUEVNHQCPDFPLTYRPTO
BLZYXUEVNHQCPDFPLTYRPTOQAVYXNPXTTSOVCNDRXZEVONVNJDMMGMHLWZXKJPKEKVAKORLJRRMGOTUAKODAZZNCQOBAOKOQPMW
..................................^....................................................^...........

Which has 2 coincidences and is followed after one more shift of the bottom portion by:


WNXWBCOIDVPQOVEJSKIIVEPEXHBMZFWWMVOZUUBKSBYSWBGGYEZTBEMCTKHUWILLSIYLLQZBKQNIBLZYXUEVNHQCPDFPLTYRPTO
LZYXUEVNHQCPDFPLTYRPTOQAVYXNPXTTSOVCNDRXZEVONVNJDMMGMHLWZXKJPKEKVAKORLJRRMGOTUAKODAZZNCQOBAOKOQPMWI
...................................................................................................

Which has 0 coincidences and is followed after one more shift of the bottom portion by:


WNXWBCOIDVPQOVEJSKIIVEPEXHBMZFWWMVOZUUBKSBYSWBGGYEZTBEMCTKHUWILLSIYLLQZBKQNIBLZYXUEVNHQCPDFPLTYRPTO
ZYXUEVNHQCPDFPLTYRPTOQAVYXNPXTTSOVCNDRXZEVONVNJDMMGMHLWZXKJPKEKVAKORLJRRMGOTUAKODAZZNCQOBAOKOQPMWIG
..^.......^......................^.......................^..........^...............^.^............

Which has 7 coincidences! Bingo. We have hit a large enough coincidence count that the letters cannot be random. Each letter must be enciphered with the same alphabet that the letter beneath it was enciphered with. This happened when the 79th letter of the message was aligned under the first letter of the message. From this we know that the cipher is a polyalphabetic periodic cipher with a repeating period of 79.

< Back Home