Solving Cryptograms Automatically

Gary J. Shannon
Created July 1, 2010
Updated July 2, 2010

Go Climb a Tree

The are a number of kinds of ciphers, but for this project I limit myself to simple substitution cryptograms where the word divisions are preserved. Later projects will tackle ciphertext with no word divisions, and polyalphabetic ciphers.

The method is based on a tree search for the correct plaintext alphabet. Hence the subtitle. (Which also serves to contrast this method with the more commonly used hill climbing method.) This program works by trying out a plaintext letter for each ciphertext letter, one by one. Each time a new plaintext letter is tried out the plaintext is searched for letter patterns that can't make words. For example, suppose I have the partially decrypted text:

        XFF MKKL PDQ
        -ll ---- m-n
      

If I try to substitute plaintext 'i' for ciphertext 'K' (K:i) that gives me:

        XFF MKKL PDQ
        -ll -ii- m-n
      

With a possible word -ii-. But there is no English word that has that pattern, so the K:i trial fails and we can go on to try something different, perhaps K:o giving -oo- which looks much more promising.

Knowing which words and partial words are valid and which are not requires a dictionary file. The dictionary is a simple text file so that we can add any words we discover to be missing. For example, in an early test run I discovered a cryptogram that wouldn't solve. It turned out that the word "mountainside", which was in the cryptogram, was not in the dictionary. I added that word to the dictionary and the cryptogram was solved in 58 tries, and less than 1/10th of a second.

The process begins by counting how many occurances there are of each letter in the cryptogram. Then when we begin trying out letters we will try the most frequent ciphertext letters first, and we will try them with the most frequent English letters. That makes it more likely that we will find a solution sooner than if we tried the letters in straight alphabetical order. Here's an example from an actual run to show you how these substitutions are made and tested for failure:

ct Order:  BEKJ PSNR UWFH IMTA CDGL OQVX YZ
pt Order:  etai onsr hldc umfp gwyb vkxj qz

BIBNUJPB SKE SME RKU KPR EJHB RKUE FKEW FJPTBN WSKP JWSBNE
-------- --- --- --- --- ---- ---- ---- ------ ---- ------      0
e-e----e --- --- --- --- ---e ---- ---- ----e- ---- ---e--      1 B:e OK
e-e----e --t --t --- --- t--e ---t --t- ----e- ---- ---e-t      2 E:t OK
e-e----e -at --t -a- a-- t--e -a-t -at- ----e- --a- ---e-t      3 K:a OK
e-e--i-e -at --t -a- a-- ti-e -a-t -at- -i--e- --a- i--e-t      4 J:i OK
e-e--ioe -at --t -a- ao- ti-e -a-t -at- -io-e- --ao i--e-t      5 P:o FAIL on e-e--ioe
e-e--ine -at --t -a- an- ti-e -a-t -at- -in-e- --an i--e-t      6 P:n FAIL on e-e--ine
e-e--ise -at --t -a- as- ti-e -a-t -at- -is-e- --as i--e-t      7 P:s OK
e-e--ise oat o-t -a- as- ti-e -a-t -at- -is-e- -oas i-oe-t      8 S:o FAIL on i-oe-t
e-e--ise nat n-t -a- as- ti-e -a-t -at- -is-e- -nas i-ne-t      9 S:n FAIL on nat
e-e--ise rat r-t -a- as- ti-e -a-t -at- -is-e- -ras i-re-t     10 S:r FAIL on -ras
e-e--ise hat h-t -a- as- ti-e -a-t -at- -is-e- -has i-he-t     11 S:h FAIL on -has
e-e--ise lat l-t -a- as- ti-e -a-t -at- -is-e- -las i-le-t     12 S:l FAIL on lat
e-e--ise dat d-t -a- as- ti-e -a-t -at- -is-e- -das i-de-t     13 S:d FAIL on dat
e-e--ise cat c-t -a- as- ti-e -a-t -at- -is-e- -cas i-ce-t     14 S:c FAIL on -cas

... SNIP ...

e-e--z-e -at --t -a- a-- tz-e -a-t -at- -z--e- --a- z--e-t    112 J:z FAIL on e-e--z-e
e-e----e -it --t -i- i-- t--e -i-t -it- ----e- --i- ---e-t    113 K:i OK
e-e--a-e -it --t -i- i-- ta-e -i-t -it- -a--e- --i- a--e-t    114 J:a FAIL on e-e--a-e
e-e--o-e -it --t -i- i-- to-e -i-t -it- -o--e- --i- o--e-t    115 J:o OK
e-e--oae -it --t -i- ia- to-e -i-t -it- -oa-e- --ia o--e-t    116 P:a FAIL on e-e--oae
e-e--one -it --t -i- in- to-e -i-t -it- -on-e- --in o--e-t    117 P:n OK
e-e--one ait a-t -i- in- to-e -i-t -it- -on-e- -ain o-ae-t    118 S:a FAIL on ait
e-e--one sit s-t -i- in- to-e -i-t -it- -on-e- -sin o-se-t    119 S:s FAIL on -sin
e-e--one rit r-t -i- in- to-e -i-t -it- -on-e- -rin o-re-t    120 S:r FAIL on rit
e-e--one hit h-t -i- in- to-e -i-t -it- -on-e- -hin o-he-t    121 S:h FAIL on o-he-t
e-e--one lit l-t -i- in- to-e -i-t -it- -on-e- -lin o-le-t    122 S:l FAIL on -lin
e-e--one dit d-t -i- in- to-e -i-t -it- -on-e- -din o-de-t    123 S:d FAIL on dit
e-e--one cit c-t -i- in- to-e -i-t -it- -on-e- -cin o-ce-t    124 S:c FAIL on cit
e-e--one uit u-t -i- in- to-e -i-t -it- -on-e- -uin o-ue-t    125 S:u FAIL on uit

...SNIP...

e-e----e -zn --n -z- z-- n--e -z-n -zn- ----e- --z- ---e-n    591 K:z FAIL on -zn
e-e----e --s --s --- --- s--e ---s --s- ----e- ---- ---e-s    592 E:s OK
e-e----e -ts --s -t- t-- s--e -t-s -ts- ----e- --t- ---e-s    593 K:t FAIL on -t-s
e-e----e -as --s -a- a-- s--e -a-s -as- ----e- --a- ---e-s    594 K:a OK
e-e--t-e -as --s -a- a-- st-e -a-s -as- -t--e- --a- t--e-s    595 J:t FAIL on e-e--t-e
e-e--i-e -as --s -a- a-- si-e -a-s -as- -i--e- --a- i--e-s    596 J:i FAIL on i--e-s
e-e--o-e -as --s -a- a-- so-e -a-s -as- -o--e- --a- o--e-s    597 J:o OK
e-e--ote -as --s -a- at- so-e -a-s -as- -ot-e- --at o--e-s    598 P:t FAIL on e-e--ote
e-e--oie -as --s -a- ai- so-e -a-s -as- -oi-e- --ai o--e-s    599 P:i FAIL on e-e--oie
e-e--one -as --s -a- an- so-e -a-s -as- -on-e- --an o--e-s    600 P:n OK
e-e--one tas t-s -a- an- so-e -a-s -as- -on-e- -tan o-te-s    601 S:t FAIL on tas
e-e--one ias i-s -a- an- so-e -a-s -as- -on-e- -ian o-ie-s    602 S:i FAIL on ias
e-e--one ras r-s -a- an- so-e -a-s -as- -on-e- -ran o-re-s    603 S:r FAIL on ras
e-e--one has h-s -a- an- so-e -a-s -as- -on-e- -han o-he-s    604 S:h OK
e-et-one has h-s -a- an- so-e -a-s -as- -on-et -han o-hets    605 N:t FAIL on e-et-one
e-ei-one has h-s -a- an- so-e -a-s -as- -on-ei -han o-heis    606 N:i FAIL on e-ei-one
e-er-one has h-s -a- an- so-e -a-s -as- -on-er -han o-hers    607 N:r OK
e-er-one has h-s ta- ant so-e ta-s -as- -on-er -han o-hers    608 R:t FAIL on ant
e-er-one has h-s ia- ani so-e ia-s -as- -on-er -han o-hers    609 R:i FAIL on ia-
e-er-one has h-s la- anl so-e la-s -as- -on-er -han o-hers    610 R:l FAIL on anl
e-er-one has h-s da- and so-e da-s -as- -on-er -han o-hers    611 R:d OK
e-ertone has h-s dat and so-e dats -as- -on-er -han o-hers    612 U:t FAIL on e-ertone

...SNIP...

eferyone has h-s day and some days last lon-er than others    633 I:f FAIL on eferyone
eperyone has h-s day and some days last lon-er than others    634 I:p FAIL on eperyone
egeryone has h-s day and some days last lon-er than others    635 I:g FAIL on egeryone
eweryone has h-s day and some days last lon-er than others    636 I:w FAIL on eweryone
eberyone has h-s day and some days last lon-er than others    637 I:b FAIL on eberyone
everyone has h-s day and some days last lon-er than others    638 I:v OK
everyone has his day and some days last lon-er than others    639 M:i OK
everyone has his day and some days last loncer than others    640 T:c FAIL on loncer
everyone has his day and some days last lonuer than others    641 T:u FAIL on lonuer
everyone has his day and some days last lonfer than others    642 T:f FAIL on lonfer
everyone has his day and some days last lonper than others    643 T:p FAIL on lonper
everyone has his day and some days last longer than others    644 T:g OK
everyone has his day and some days last longer than others    644 T:g SOLVED

This whole solution process took less than a second on a Pentium 3.2GHz PC.

Very short cryptograms a likely to have errors in their decryption because there are too many possible alternate ways to decrypt the message. For example how would you decrypt the message: GPLX? It could be any one 933 four-letter words in the dictionary file. Up to a point, as a cryptogram gets longer it takes more steps to solve because there are more possibilities to consider. However...

As a general rule the longer the cryptogram the faster it gets solved. This is because when there are more words to test it becomes more likely that a wrongly placed letter will be discovered sooner, cutting short many long and pointless dead end searches. Here are some figures from a cryptogram that I kept adding words to, to see how it affected solution time.

All of the solutions were correct except for the very first version that only had 5 words to work with.

        5 words: 1317 steps to get a wrong solution!
        10 words: 89 steps
        15 words: 99 steps
        20 words: 265 steps
        25 words: 124 steps
        30 words: 69 steps
        35 words: 74 steps
        40 words: 71 steps
        45 words: 68 steps
        50 words: 64 steps
        55 words: 61 steps
        60 words: 59 steps
        65 words: 58 steps
    

Get the Code Here

The program is written in generic C++, all in one cpp file. It's a quick-and-dirty with no OOP and lots of global variables. Just the way I like my code! The ciphertext is just hard coded into a variable, but you can add fancy input routines if you like. As always my code is placed in the public domain. use it any way you like. The zip file includes the complete 9300 word dictionary file. Download the package here (34.9KB)

Things That Need Fixing

There are a few things on my To-Do list that need to be done before I'll be happy with the program. Look for an updated version of the source code and dictionary soon. The list so far is:




< Back Home