Coincidence Testing a Polyalphabetic Cipher

Gary J. Shannon
Created 2002
Updated June 26, 2010

What is Coincidence Testing?

If random letters are strung out in two rows, one beneath the other, the probability that any letter in the top row will be the same as the corresponding letter in the bottom row is 1/26. When the same thing is done with English text the probabilities will be quite different because not all letters appear with equal probability in normal text.

For example, here are two strings of 100 letters of English text taken at random from Mark Twain's Captain Stormfield's visit to Heaven, with all the blanks and punctuation squeezed out:

    wellwhenihadbeendeadaboutthirtyyearsibeguntogetalittleanxiousmindyouihadbeenwhizzingthroughspaceallth
    flushacometoccasionallythatwassomethinglikewehaventgotanysuchcometsoursdontbeginonenightiwasswinginga
                                        ^                 ^^               ^      ^            ^
      

If the letters were random we would expect to see an average of 100/26 = 3.85 coincidences per 100 letters. Instead we find that there are 6 coincident letters in these two samples. Longer samples would give us an average of 6.6 or so coincident letters per hundred letters of text.

Now suppose that a message has been enciphered using a polyalphabetic cipher with a period length of 4; for example, a Vigenere cipher with the keyword "HOME". If we were to line up two sections of the ciphertext and count the number of letters that coincided in our sample then we would expect to see lower numbers when the keys for the two sections do not coincide. For example, suppose the keys in our two sections are aligned like this:

      HOMEHOMEHOMEHOMEHOME...
      MEHOMEHOMEHOMEHOMEHO...
    

If the keys do not coincide then the alphabets used to the encipher the top row are different from those used to encipher the same letter position in the bottom row and we should see random coincidences only.

Now suppose the portions of the message are shifted so that the keys are aligned:

     HOMEHOMEHOMEHOMEHOME...
     HOMEHOMEHOMEHOMEHOME...
    

Now each letter on the top row is enciphered by the same alphabet as the corresponding letter on the bottom row and we can expect to see the number of coincidences jump up to the kind of numbers we see in normal English text. So if we take two portions of a message encoded with a polyalphabetc cipher, as we shift them over a letter at a time and count the number of coincidences, we should see a jump in the number of coincidences when the key cycles are matched up.

The Online Coincidence Tester

If you want to do a coincidence test on a possilbe polyalphabetic cipher you can skip ahead to the Online Coincidence Tester, or you can stick around a take a look at exactly how this particular version of coincidence testing works. You can also download a zipped copy of the PHP code for this tester. Like all my software, this PHP program is in the public domain. Use it any way you like.

This coincidence tester uses a method to compute what is called the auto correlation coefficient, which while different from the traditional index of coincidence, serves the same purpose. We will be computing, for each possible key length, the ratio between the number of coincidences per hundred and the expected number of coincidences if the text were just random letters instead of sentences made up of English words. The ratio for random letters would be 3.85/3.85 = 1.0, while the ratio for English text would be 6.6/3.85 = 1.7. We are looking, therefore, for possible key lengths that have auto correlation coeficients closer to 1.7 than to 1.0

This tester counts the number of coincidences between the ciphertext and that same ciphertext as shifted over by some number of letters equal to the key length being tested. But with a short polyalphabetic cryptogram to solve there is not a whole lot of data on which to base decisions about key length. Note, however, that if the key length is 4, for example, that we will also see increased coincidences with shift displacements of 8, 12, 16, and every other multiple of 4. This is because shifting a 4-character key by 8 has the same effect on the key alignment as shifting it by 4. For a given candidate key length, K, therefore, this coincidence tester averages the rate of coincidence for offsets of K, 2K, 3K, and so on, up to the length of the available ciphertext, using a much larger sample, and consequently giving a much more stable estimate of the auto correlation coeficient for a given K.

If you would like more details on how the program works, feel free to download the PHP code and take a look at how it's done. If you read up on "Index of Coincidence" in the literature, you will see that this auto correlation method is easier to perform, and easier still to understand, yet gives the same results. Besides, I've always prefered explanations like:

Slide 'em over and count 'em up.

to the Greek letter explanations like:

< Back Home