Huffcode - A New/Old Kind of Code

Gary J. Shannon
Created July 19, 2010
Updated July 20, 2010

What is a Code?

In common usage the word "code" is often used to mean any method of altering a message to make it unreadable by anyone except the intended recipient. But in the technical jargon of cryptography the word "code" refers to a very specific way of encrypting a message. While a cipher disguises a message by changing it letter by letter, a code changes the message word by word. Unlike a cipher, which only requires the alphabetic key to unlock, a code requires a large book of codes and the words that they stand for. A cryptographer enciphering the word "apple" takes it letter by letter and might end up with A P P L E becoming R T T M Q. To encode the same word the cryptographer opens his code book and looks up the word "apple" and finds something like 07241, which has nothing whatsoever to do with the letters A P P L E and everything to do with the whole word APPLE.

One example of an early published code book is The Secret Corresponding Vocabulary by Francis O. J. Smith, published in 1845. It uses a five-place code group, one letter followed by four digits, as the code for each word in the dictionary. One problem with such code books is that the code groups have to be long enough to leave room for the maximum number of words you might want to encode. This usually means that all the code groups are 5 characters long, even for short English words like "a" and "of", resulting in encoded messages that are longer than the original plaintext.

A New/Old Kind of Code

The code presented here is a new kind of code, yet it is based on some old ideas. Hence the description of a "New/Old Code". The old ideas are the , and the Huffman code which uses binary bits to encode and compress digital data in the form of bytes in a computer file. The new idea is to use letters of the normal Roman alphabet instead of binary bits, to encode and compress English words instead of computer bytes, providing not only security, but space savings in the process. Because most of the words used in normal English writing will be represented by two-letter codes, and the rest by three or four-letter codes, encoded message will end up much shorter than the original plaintext.

In a Huffman code the most common letters get the shortest binary codes. In this version, which I, rather unimaginatively, call "HuffCode", the most commonly used English words get the shortest alphabetical codes. Here's how it works: There are 676 two-letter codes { AA, AB, AC, ... ZA, ZB, ZC, ... ZW, ZX, ZZ }, but we reserve the 156 codes ending in { -U, -V, -W, -X, -Y, -Z } to indicate that the code group will be 3 or more letters long. That leaves us 520 two-letter codes for the most common 520 English words.

To those 156 reserved codes we can add any one of 26 different third letters, giving us 4056 three-letter codes, { AUA, AUB, ... ZZY, ZZZ }, but again, we reserve the codes ending in { --U, --V, --W, --X, --Y, --Z } to mark four-letter codes. This time there are 936 three-letter codes ending in one of those six letters.

Finally, there are 24,336 four letter codes { AUUA, AUUB, ... ZZZZ }. Of these four-letter codes, the 936 ending in { ---Z } are reserved for special purpose codes that an individual user might want to define for his own private use. For example, a salesman might want to send the message to the main warehouse:

     "Ship   2   cases of item     #7422-1   to the Pittsburgh warehouse.
    

That might coded as the user code AUXZ 2 #7422-1, where AUXZ is the code for "Ship to the Pittsburgh warehouse this many cases of this item...". User code groups are all four letters in length. If more than 936 user code groups are needed the user can use the code ZZZZ to mean that the next three letters are not in regular code, but are a special-purpose user code group. This allows for up to 17,576 additional user code groups. Thus ZZZZ KUS would mean that KUS does not have its usual meaning (careful) but has a special meaning assigned by the user.

Certain common prefixes and suffixes also have code groups assigned to them so that words not in the dictionary might be constructed by putting a prefix or suffix on an existing word such as "familiarize" = "familiar" + "ize", or "disassembling" = "dis" + "assemble" + "ing". The most frequent use of these will be in making plurals with the "-s" and for possessives ("congressman's") with the "-'s" prefixes. These are all two-letter codes and are found at the front of the dictionary.

The most common punctuation marks, such as period, comma, exclamation point, question mark, and so on, also have codes assigned to them, but their use is optional. These are all four-letter codes and are found at the end of the dictionary.

Unusual Words and Proper Names

In addition to punctuation are two special codes GZVD (START-NAME) and GZVE (END-NAME) which mark a section of the text that is spelled out either in plaintext or in some enciphered form agreed upon by the users. This covers any special names or words that are not in either the regular code dictionary or the extended proper names dictionary. For most proper names, including names of people, places, companies, geographical features, and so on, use the special code ZT which indicates that the following 4 letters are a proper name code which will be found in a separate proper names code dictionary. See note below under the heading "Extensions and Modifications".

How Much Space is Saved?

Since the top 520 words make up an average of 73% of English text, and are represented by two-letter codes, they will contribute an average of 2 * 0.73 = 1.46 letters per words to the text. The next most common 3,120 words, encoded as three-letter codes, make up 19% of all text, contributing an average of 0.19 * 3 = 0.57 letters per word. Virtually all of the remaining 8% is made up of four-letter code groups, contributing an average of 0.08 * 4 = 0.32 letters per words, for a grand total average of 1.46 + 0.57 + 0.32 = 2.35 letters per English plaintext word after encoding, which, since English averages 5 letters per word not counting the space character, is less than half the number of letters of the original plaintext.

Decoding the Message

Because the code groups are of variable length, unless the "enemy" knows the code, there is no way for someone intercepting the message to divide the text into individual words. But because of the way the code groups are organized the recipient always knows how to divide up the incoming stream of letters. For example, the stream KQNRPZYRTDPRCXM can only be divided one way: KQ NR PZYR TD PR CXM because whenever you see U, V, W, X, Y, or Z at the end of a two or three-letter code group you know to grab the next letter too and make it a part of the code group. Obviously, for greater security the code groups could be enciphered using any suitable letter-by-letter enciphering method to further frustrate the enemy. Then, even having the code book would be of no use to the enemy who did not know your enciphering scheme.

How the Codes Were Assigned

In assigning code groups to words, the words were first sorted into descending order of frequency so the most frequently occurring English words get the shortest code groups. Initially, all code groups ending with T have been left unassigned and set aside to provide code space for the inevitable corrections and additions that will have to be made to the dictionary without disturbing any of the pre-existing words and codes. This insures that a message encoded with the earliest edition of the code dictionary can always be successfully decoded with a later, updated edition of the code dictionary.

The original source file of English words sorted by frequency can be downloaded here (44K zip file with 9322 words including their average frequency of occurrence per one million words of typical English text.)

An Example

Here is a sample of the code used to encrypt this very sentence: IHCLC GOYMC DCCWU GLACF GXZUD EHOUX S
Notice how much shorter the code is than the original. (31 letters vs 51) Thirty one letters are used to encode 13 words for an average of 2.4 letters per word, very close to the theoretical 2.35 letters per word we computed above. Dividing it up, word for word, it looks like this:

      Here is a  sample of the code used to encrypt this very sentence.
      IH   CL CG OYM    CD CC  WUG  LA   CF GXZU    DE   HO   UXS 
    

The Dictionary Files

There are two dictionary files. The first is eng_to_huffcode.txt and is arranged in alphabetical order by the English words. The second is huffcode_to_eng.txt and is arranged in alphabetical order by code groups according to their lengths. All two-letter code groups come first, then three-letter groups, and finally four-letter groups.

The Online Coder/Decoder

My next project will be an online coder/decoder that will include a user interface that will allow the user to add words to the code dictionary when he or she finds that a needed word is missing. Eventually the goal is to have such added words automatically Incorporated into the downloadable dictionary files so that the latest version, with all the newest words, is always available right here.

If you have any ideas or suggestions, please drop me an email.

A Challenge

Just for fun, here is a short selection coded using this system. See if you can decode it.

  CCBYW VAJGD GVDHF HDUQG ZUHEB CJDLB YWVGZ UHEIC JENAV VIAGA XVECD GDGVD GZUHC
  ICKEX IDSCP SBCHJ WKGZU GCKGL DGBXF KJCHC CRIFE CPMXE GZULA WKGLC KBVBC FJIVO
  CFCCL ZIDMC CDVZX GZUHB UMCNC CQYLF NDYYS ECCNV ZMAEC PGDGC DGZUG CKDCC GTFLZ
  KCNIB ZOCDC CHNGZ ULCEC RCDGC FKGUS DMBYW VGZUH DNCLC VXXCF KFGZU HGZUP CKCLD
  ZPCHG UHGZU HGZUP CMCJM HEECD ECGZU HGZUP CCBYW VCLDZ PCHCP GVZRG ZUGGZ UPGUF
      

Extensions and Modifications

All extensions and modifications will be made is such a way as to leave undisturbed the original word/code assignments. In this way newer editions of the code dictionaries will always be "upward compatible", in that text encoded with an older edition can always be decoded using a newer edition.

Any extensions or modifications will be listed here, with the most recent listed first.




< Back Home