Solving Cryptograms Automatically
Gary J. Shannon
Created July 1, 2010
Updated July 2, 2010
Go Climb a Tree
The are a number of kinds of ciphers, but for this project I limit myself to simple substitution cryptograms where the word divisions are preserved. Later projects will tackle ciphertext with no word divisions, and polyalphabetic ciphers.
The method is based on a tree search for the correct plaintext alphabet. Hence the subtitle. (Which also serves to contrast this method with the more commonly used hill climbing method.) This program works by trying out a plaintext letter for each ciphertext letter, one by one. Each time a new plaintext letter is tried out the plaintext is searched for letter patterns that can't make words. For example, suppose I have the partially decrypted text:
XFF MKKL PDQ
-ll ---- m-n
If I try to substitute plaintext 'i' for ciphertext 'K' (K:i) that gives me:
XFF MKKL PDQ
-ll -ii- m-n
With a possible word -ii-. But there is no English word that has that pattern, so the K:i trial fails and we can go on to try something different, perhaps K:o giving -oo- which looks much more promising.
Knowing which words and partial words are valid and which are not requires a dictionary file. The dictionary is a simple text file so that we can add any words we discover to be missing. For example, in an early test run I discovered a cryptogram that wouldn't solve. It turned out that the word "mountainside", which was in the cryptogram, was not in the dictionary. I added that word to the dictionary and the cryptogram was solved in 58 tries, and less than 1/10th of a second.
The process begins by counting how many occurances there are of each letter in the cryptogram. Then when we begin trying out letters we will try the most frequent ciphertext letters first, and we will try them with the most frequent English letters. That makes it more likely that we will find a solution sooner than if we tried the letters in straight alphabetical order. Here's an example from an actual run to show you how these substitutions are made and tested for failure:
ct Order: BEKJ PSNR UWFH IMTA CDGL OQVX YZ pt Order: etai onsr hldc umfp gwyb vkxj qz BIBNUJPB SKE SME RKU KPR EJHB RKUE FKEW FJPTBN WSKP JWSBNE -------- --- --- --- --- ---- ---- ---- ------ ---- ------ 0 e-e----e --- --- --- --- ---e ---- ---- ----e- ---- ---e-- 1 B:e OK e-e----e --t --t --- --- t--e ---t --t- ----e- ---- ---e-t 2 E:t OK e-e----e -at --t -a- a-- t--e -a-t -at- ----e- --a- ---e-t 3 K:a OK e-e--i-e -at --t -a- a-- ti-e -a-t -at- -i--e- --a- i--e-t 4 J:i OK e-e--ioe -at --t -a- ao- ti-e -a-t -at- -io-e- --ao i--e-t 5 P:o FAIL on e-e--ioe e-e--ine -at --t -a- an- ti-e -a-t -at- -in-e- --an i--e-t 6 P:n FAIL on e-e--ine e-e--ise -at --t -a- as- ti-e -a-t -at- -is-e- --as i--e-t 7 P:s OK e-e--ise oat o-t -a- as- ti-e -a-t -at- -is-e- -oas i-oe-t 8 S:o FAIL on i-oe-t e-e--ise nat n-t -a- as- ti-e -a-t -at- -is-e- -nas i-ne-t 9 S:n FAIL on nat e-e--ise rat r-t -a- as- ti-e -a-t -at- -is-e- -ras i-re-t 10 S:r FAIL on -ras e-e--ise hat h-t -a- as- ti-e -a-t -at- -is-e- -has i-he-t 11 S:h FAIL on -has e-e--ise lat l-t -a- as- ti-e -a-t -at- -is-e- -las i-le-t 12 S:l FAIL on lat e-e--ise dat d-t -a- as- ti-e -a-t -at- -is-e- -das i-de-t 13 S:d FAIL on dat e-e--ise cat c-t -a- as- ti-e -a-t -at- -is-e- -cas i-ce-t 14 S:c FAIL on -cas ... SNIP ... e-e--z-e -at --t -a- a-- tz-e -a-t -at- -z--e- --a- z--e-t 112 J:z FAIL on e-e--z-e e-e----e -it --t -i- i-- t--e -i-t -it- ----e- --i- ---e-t 113 K:i OK e-e--a-e -it --t -i- i-- ta-e -i-t -it- -a--e- --i- a--e-t 114 J:a FAIL on e-e--a-e e-e--o-e -it --t -i- i-- to-e -i-t -it- -o--e- --i- o--e-t 115 J:o OK e-e--oae -it --t -i- ia- to-e -i-t -it- -oa-e- --ia o--e-t 116 P:a FAIL on e-e--oae e-e--one -it --t -i- in- to-e -i-t -it- -on-e- --in o--e-t 117 P:n OK e-e--one ait a-t -i- in- to-e -i-t -it- -on-e- -ain o-ae-t 118 S:a FAIL on ait e-e--one sit s-t -i- in- to-e -i-t -it- -on-e- -sin o-se-t 119 S:s FAIL on -sin e-e--one rit r-t -i- in- to-e -i-t -it- -on-e- -rin o-re-t 120 S:r FAIL on rit e-e--one hit h-t -i- in- to-e -i-t -it- -on-e- -hin o-he-t 121 S:h FAIL on o-he-t e-e--one lit l-t -i- in- to-e -i-t -it- -on-e- -lin o-le-t 122 S:l FAIL on -lin e-e--one dit d-t -i- in- to-e -i-t -it- -on-e- -din o-de-t 123 S:d FAIL on dit e-e--one cit c-t -i- in- to-e -i-t -it- -on-e- -cin o-ce-t 124 S:c FAIL on cit e-e--one uit u-t -i- in- to-e -i-t -it- -on-e- -uin o-ue-t 125 S:u FAIL on uit ...SNIP... e-e----e -zn --n -z- z-- n--e -z-n -zn- ----e- --z- ---e-n 591 K:z FAIL on -zn e-e----e --s --s --- --- s--e ---s --s- ----e- ---- ---e-s 592 E:s OK e-e----e -ts --s -t- t-- s--e -t-s -ts- ----e- --t- ---e-s 593 K:t FAIL on -t-s e-e----e -as --s -a- a-- s--e -a-s -as- ----e- --a- ---e-s 594 K:a OK e-e--t-e -as --s -a- a-- st-e -a-s -as- -t--e- --a- t--e-s 595 J:t FAIL on e-e--t-e e-e--i-e -as --s -a- a-- si-e -a-s -as- -i--e- --a- i--e-s 596 J:i FAIL on i--e-s e-e--o-e -as --s -a- a-- so-e -a-s -as- -o--e- --a- o--e-s 597 J:o OK e-e--ote -as --s -a- at- so-e -a-s -as- -ot-e- --at o--e-s 598 P:t FAIL on e-e--ote e-e--oie -as --s -a- ai- so-e -a-s -as- -oi-e- --ai o--e-s 599 P:i FAIL on e-e--oie e-e--one -as --s -a- an- so-e -a-s -as- -on-e- --an o--e-s 600 P:n OK e-e--one tas t-s -a- an- so-e -a-s -as- -on-e- -tan o-te-s 601 S:t FAIL on tas e-e--one ias i-s -a- an- so-e -a-s -as- -on-e- -ian o-ie-s 602 S:i FAIL on ias e-e--one ras r-s -a- an- so-e -a-s -as- -on-e- -ran o-re-s 603 S:r FAIL on ras e-e--one has h-s -a- an- so-e -a-s -as- -on-e- -han o-he-s 604 S:h OK e-et-one has h-s -a- an- so-e -a-s -as- -on-et -han o-hets 605 N:t FAIL on e-et-one e-ei-one has h-s -a- an- so-e -a-s -as- -on-ei -han o-heis 606 N:i FAIL on e-ei-one e-er-one has h-s -a- an- so-e -a-s -as- -on-er -han o-hers 607 N:r OK e-er-one has h-s ta- ant so-e ta-s -as- -on-er -han o-hers 608 R:t FAIL on ant e-er-one has h-s ia- ani so-e ia-s -as- -on-er -han o-hers 609 R:i FAIL on ia- e-er-one has h-s la- anl so-e la-s -as- -on-er -han o-hers 610 R:l FAIL on anl e-er-one has h-s da- and so-e da-s -as- -on-er -han o-hers 611 R:d OK e-ertone has h-s dat and so-e dats -as- -on-er -han o-hers 612 U:t FAIL on e-ertone ...SNIP... eferyone has h-s day and some days last lon-er than others 633 I:f FAIL on eferyone eperyone has h-s day and some days last lon-er than others 634 I:p FAIL on eperyone egeryone has h-s day and some days last lon-er than others 635 I:g FAIL on egeryone eweryone has h-s day and some days last lon-er than others 636 I:w FAIL on eweryone eberyone has h-s day and some days last lon-er than others 637 I:b FAIL on eberyone everyone has h-s day and some days last lon-er than others 638 I:v OK everyone has his day and some days last lon-er than others 639 M:i OK everyone has his day and some days last loncer than others 640 T:c FAIL on loncer everyone has his day and some days last lonuer than others 641 T:u FAIL on lonuer everyone has his day and some days last lonfer than others 642 T:f FAIL on lonfer everyone has his day and some days last lonper than others 643 T:p FAIL on lonper everyone has his day and some days last longer than others 644 T:g OK everyone has his day and some days last longer than others 644 T:g SOLVED
This whole solution process took less than a second on a Pentium 3.2GHz PC.
Very short cryptograms a likely to have errors in their decryption because there are too many possible alternate ways to decrypt the message. For example how would you decrypt the message: GPLX? It could be any one 933 four-letter words in the dictionary file. Up to a point, as a cryptogram gets longer it takes more steps to solve because there are more possibilities to consider. However...
As a general rule the longer the cryptogram the faster it gets solved. This is because when there are more words to test it becomes more likely that a wrongly placed letter will be discovered sooner, cutting short many long and pointless dead end searches. Here are some figures from a cryptogram that I kept adding words to, to see how it affected solution time.
All of the solutions were correct except for the very first version that only had 5 words to work with.
5 words: 1317 steps to get a wrong solution!
10 words: 89 steps
15 words: 99 steps
20 words: 265 steps
25 words: 124 steps
30 words: 69 steps
35 words: 74 steps
40 words: 71 steps
45 words: 68 steps
50 words: 64 steps
55 words: 61 steps
60 words: 59 steps
65 words: 58 steps
Get the Code Here
The program is written in generic C++, all in one cpp file. It's a quick-and-dirty with no OOP and lots of global variables. Just the way I like my code! The ciphertext is just hard coded into a variable, but you can add fancy input routines if you like. As always my code is placed in the public domain. use it any way you like. The zip file includes the complete 9300 word dictionary file. Download the package here (34.9KB)
Things That Need Fixing
There are a few things on my To-Do list that need to be done before I'll be happy with the program. Look for an updated version of the source code and dictionary soon. The list so far is: