Auto Lex Word Generator
Created July 25, 2010
Last Updated July 26, 2010
Preface
This word generator has the virtue that most of the words look and sound just fine, and there is room in the universe of generated words for the largest imaginable conlang dictionary. Rather than trying to give the computer enough intelligence to pick out the good-sounding words, I fell back on the technique of human-computer partnership. I did the part requiring judgement, and delegated the rote bookkeeping to the computer. In fact, I didn't even have to write a program to implement this system. The whole thing was done in the freeware NoteTab text editor.
The Method
After a little bit of trial and error, I settled on an alphabet that gives consistently good results. The alphabet consists of these consonants:
14 initials: b d f j k l m n p s t v x z (x = "sh")
23 medials: b d f j k l m n p s t v x z mb mp nd nj nk ns nt nx nz
2 terminals: n or nothing
And the vowels: a e i o u
Syllables
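A syllable may consist of a single vowel, or a consonant followed by a single vowel, in either case with an optional n terminal consonant. Here are all 140 possible consonant-initial syllables (the ten vowel-only initials a, an, e, en, i, in, o, on, u, un are not listed):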
ba da fa ja ka la ma na pa sa ta va xa za
be de fe je ke le me ne pe se te ve xe ze
bi di fi ji ki li mi ni pi si ti vi xi zi
bo do fo jo ko lo mo no po so to vo xo zo
bu du fu ju ku lu mu nu pu su tu vu xu zu
ban dan fan jan kan lan man nan pan san tan van xan zan
ben den fen jen ken len men nen pen sen ten ven xen zen
bin din fin jin kin lin min nin pin sin tin vin xin zin
bon don fon jon kon lon mon non pon son ton von xon zon
bun dun fun jun kun lun mun nun pun sun tun vun xun zun
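The list above is simple enough to reproduce mechanically. Here is a minimal Python sketch of one way to do it; the names INITIALS, VOWELS and initial_syllables are my own illustration and not part of the NoteTab procedure. It also produces the ten vowel-only syllables that the definition allows.

```python
# Illustrative sketch only; these names are assumptions, not the original NoteTab workflow.
INITIALS = "b d f j k l m n p s t v x z".split()
VOWELS = "aeiou"

def initial_syllables(include_vowel_only=True):
    """Build initial syllables: optional consonant + vowel + optional terminal n."""
    onsets = ([""] if include_vowel_only else []) + INITIALS
    return [c + v + n for c in onsets for v in VOWELS for n in ("", "n")]

print(len(initial_syllables(include_vowel_only=False)))  # 140
print(len(initial_syllables()))                          # 150
```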
And all the 230 possible medial or terminal syllables:
-ba -ban -be -ben -bi -bin -bo -bon -bu -bun -da -dan -de -den -di -din -do -don -du -dun
-fa -fan -fe -fen -fi -fin -fo -fon -fu -fun -ja -jan -je -jen -ji -jin -jo -jon -ju -jun
-ka -kan -ke -ken -ki -kin -ko -kon -ku -kun -la -lan -le -len -li -lin -lo -lon -lu -lun
-ma -man -me -men -mi -min -mo -mon -mu -mun -na -nan -ne -nen -ni -nin -no -non -nu -nun
-pa -pan -pe -pen -pi -pin -po -pon -pu -pun -sa -san -se -sen -si -sin -so -son -su -sun
-ta -tan -te -ten -ti -tin -to -ton -tu -tun -va -van -ve -ven -vi -vin -vo -von -vu -vun
-xa -xan -xe -xen -xi -xin -xo -xon -xu -xun -za -zan -ze -zen -zi -zin -zo -zon -zu -zun
-mba -mban -mbe -mben -mbi -mbin -mbo -mbon -mbu -mbun
-mpa -mpan -mpe -mpen -mpi -mpin -mpo -mpon -mpu -mpun
-nda -ndan -nde -nden -ndi -ndin -ndo -ndon -ndu -ndun
-nja -njan -nje -njen -nji -njin -njo -njon -nju -njun
-nka -nkan -nke -nken -nki -nkin -nko -nkon -nku -nkun
-nsa -nsan -nse -nsen -nsi -nsin -nso -nson -nsu -nsun
-nta -ntan -nte -nten -nti -ntin -nto -nton -ntu -ntun
-nxa -nxan -nxe -nxen -nxi -nxin -nxo -nxon -nxu -nxun
-nza -nzan -nze -nzen -nzi -nzin -nzo -nzon -nzu -nzun
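The two groups of medial/terminal syllables can be generated the same way. This is a small sketch under my own naming (SIMPLE, CLUSTERS, medial_syllables are assumptions for illustration):

```python
# Illustrative sketch; the names are hypothetical.
SIMPLE = "b d f j k l m n p s t v x z".split()        # first group: single consonants
CLUSTERS = "mb mp nd nj nk ns nt nx nz".split()       # second group: two-letter clusters
VOWELS = "aeiou"

def medial_syllables(onsets):
    """Return '-CV' and '-CVn' for every onset and vowel."""
    return ["-" + c + v + n for c in onsets for v in VOWELS for n in ("", "n")]

group1 = medial_syllables(SIMPLE)
group2 = medial_syllables(CLUSTERS)
print(len(group1), len(group2), len(group1) + len(group2))  # 140 90 230
```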
Notice that the medial/final syllable set has been divided into two groups. And here is why:
From that list, two-syllable, three-syllable and longer words can be built by appending new medial syllables to the starting initial seed syllable. For example, given the starting seed bu, we can choose any syllable from the first group, the medials that do not begin with one of the consonant clusters mb mp nd nj nk ns nt nx nz. The cluster medials of the second group can only be appended to a syllable that ends in a terminal n, and bu does not. So we choose, for example, the syllable dan, giving us the word (so far) budan.
Now we need to add a syllable to budan. For the purposes of appending, a final n matches any of the two-letter medials because the n in budan is actually the last letter of the second syllable, not the first letter of the third. As the last letter of the second syllable it can compound with b p d j k s t x z (remembering that mb and mp are really nb and np in disguise, for ease of pronunciation).
With those requirements, we have these 90 possible three-syllable words:
budamba budamban budambe budamben budambi budambin budambo budambon budambu budambun
budampa budampan budampe budampen budampi budampin budampo budampon budampu budampun
budanda budandan budande budanden budandi budandin budando budandon budandu budandun
budanja budanjan budanje budanjen budanji budanjin budanjo budanjon budanju budanjun
budanka budankan budanke budanken budanki budankin budanko budankon budanku budankun
budansa budansan budanse budansen budansi budansin budanso budanson budansu budansun
budanta budantan budante budanten budanti budantin budanto budanton budantu budantun
budanxa budanxan budanxe budanxen budanxi budanxin budanxo budanxon budanxu budanxun
budanza budanzan budanze budanzen budanzi budanzin budanzo budanzon budanzu budanzun
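The appending rule can be written as a short function. The sketch below is my own illustration (extend, SIMPLE and AFTER_N are hypothetical names): a word ending in n takes one of the nine cluster consonants, with the n written as m before b and p, while a word ending in a vowel takes one of the fourteen simple medials.

```python
# Illustrative sketch of the appending rule; names are assumptions.
SIMPLE = "b d f j k l m n p s t v x z".split()
AFTER_N = "b p d j k s t x z".split()   # consonants that may follow a syllable-final n
VOWELS = "aeiou"

def extend(word):
    """Return every word formed by appending one more syllable to `word`."""
    if word.endswith("n"):
        stem = word[:-1]
        # The final n merges into a cluster; it is spelled m before b and p.
        onsets = [("m" if c in "bp" else "n") + c for c in AFTER_N]
    else:
        stem = word
        onsets = SIMPLE
    return [stem + c + v + n for c in onsets for v in VOWELS for n in ("", "n")]

print(len(extend("budan")))   # 90
print(extend("budan")[:3])    # ['budamba', 'budamban', 'budambe']
```

Calling extend on a vowel-final word such as budandi returns the simple-medial continuations instead.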
For a four-syllable word we choose one of the three-syllable words, say budandi, and repeat the process. Because budandi ends in a vowel rather than n, this time we again cannot use medials beginning with compound consonants. We find there are 140 possible words:
budandiba budandiban budandibe budandiben budandibi budandibin budandibo budandibon budandibu budandibun
budandida budandidan budandide budandiden budandidi budandidin budandido budandidon budandidu budandidun
budandifa budandifan budandife budandifen budandifi budandifin budandifo budandifon budandifu budandifun
budandija budandijan budandije budandijen budandiji budandijin budandijo budandijon budandiju budandijun
budandika budandikan budandike budandiken budandiki budandikin budandiko budandikon budandiku budandikun
budandila budandilan budandile budandilen budandili budandilin budandilo budandilon budandilu budandilun
budandima budandiman budandime budandimen budandimi budandimin budandimo budandimon budandimu budandimun
budandina budandinan budandine budandinen budandini budandinin budandino budandinon budandinu budandinun
budandipa budandipan budandipe budandipen budandipi budandipin budandipo budandipon budandipu budandipun
budandisa budandisan budandise budandisen budandisi budandisin budandiso budandison budandisu budandisun
budandita budanditan budandite budanditen budanditi budanditin budandito budanditon budanditu budanditun
budandiva budandivan budandive budandiven budandivi budandivin budandivo budandivon budandivu budandivun
budandixa budandixan budandixe budandixen budandixi budandixin budandixo budandixon budandixu budandixun
budandiza budandizan budandize budandizen budandizi budandizin budandizo budandizon budandizu budandizun
From this list we select one word, say budandima, to be our new four-syllable word. While there are exceptions for very short function words and particles, which are usually only one or two syllables in length, as a general rule, once a two-syllable starting seed has been selected that seed should never be used for any other unrelated words. Since we can make 17,250 two-syllable words, these can represent 17,250 families of related words. In each family there can be at least 13,225 derived words that share the same first two syllables. That adds up to 228 million possible words, so there should not be a problem with lexicon size.
Put another way, for each of the one thousand categories in the original Roget's Thesaurus there could be 17 two-syllable roots assigned to each category, and out of the 13 thousand possible derived words per root there should be no problem finding a hundred or so of them that are unique enough to be serviceable, allowing for at least 1700 words for each and every one of Roget's one thousand categories.
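For anyone who wants to check the arithmetic, here is a sketch of how these totals can be reached. The 17,250 figure follows directly from the syllable rules (counting the vowel-only initials). The 13,225 figure equals 115 squared, which is what you get if each added syllable offers 115 choices on average, midway between 140 and 90; that reading is my assumption, since the text does not spell out the derivation.

```python
# Sketch of the counting; variable names are illustrative.
VOWELS = 5
INITIAL_ONSETS = 15        # 14 consonants plus "no consonant"
SIMPLE_MEDIALS = 14
CLUSTER_MEDIALS = 9
ENDINGS = 2                # terminal n or nothing

open_initials = INITIAL_ONSETS * VOWELS             # 75 initials ending in a vowel
closed_initials = INITIAL_ONSETS * VOWELS           # 75 initials ending in n
after_vowel = SIMPLE_MEDIALS * VOWELS * ENDINGS     # 140 continuations
after_n = CLUSTER_MEDIALS * VOWELS * ENDINGS        # 90 continuations

two_syllable = open_initials * after_vowel + closed_initials * after_n
average_choices = (after_vowel + after_n) // 2      # assumed average per added syllable
derived_per_family = average_choices ** 2           # two more syllables per family
total = two_syllable * derived_per_family

print(two_syllable, derived_per_family, total)      # 17250 13225 228131250
```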
Accents and Compound Vowels
The realistic possible lexicon size is smaller than the theoretical 228 million because, except for the accented vowel, most of the time, most of the other vowels can be reduced to schwa. So if we accent the second syllable of budandima then it might actually be thought of as being of the pattern b-dand-m-, in which case we would want to avoid using any other words that reduced to the same pattern such as bidandumu. If we put a secondary accent on another vowel in the word we can regain some of the lost ground. For example, by also accenting the final a we have the pattern b-dand-ma, in which case, bidandumu, with a secondary accent on the final u becomes usable. Also, words which put the accent in a different place could be used in spite of being of similar shape. For example, badandema, if accented on the penultimate syllable has the pattern b-d-ndem- which distinguishes it from b-dand-m-.
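Checking a candidate word against existing ones for this kind of collision is easy to automate. The following is a minimal sketch with a hypothetical function name, reduced_pattern. It assumes every syllable has exactly one vowel, so it does not handle compound vowels.

```python
# Illustrative sketch; assumes one vowel per syllable (no compound vowels).
VOWELS = set("aeiou")

def reduced_pattern(word, accented):
    """Replace each vowel outside the accented syllables (1-based) with '-'."""
    out = []
    syllable = 0
    for ch in word:
        if ch in VOWELS:
            syllable += 1                      # each vowel starts the next syllable count
            out.append(ch if syllable in accented else "-")
        else:
            out.append(ch)
    return "".join(out)

print(reduced_pattern("budandima", {2}))      # b-dand-m-
print(reduced_pattern("bidandumu", {2}))      # b-dand-m-  (collides with the line above)
print(reduced_pattern("budandima", {2, 4}))   # b-dand-ma
print(reduced_pattern("badandema", {3}))      # b-d-ndem-
```

Two words are too close for comfort when their patterns come out equal.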
Additional distinctions can be introduced by letting a derived word use a compound or two-syllable vowel in the accented syllable. For example: budandiamo with the vowel cluster ia in the accented penultimate, is distinct in sound, and one more syllable in length than the word with the similar shape.
Using This System with Roget's Thesaurus
Consider the online version of Roget's Thesaurus available from Project Gutenberg. Each word category has a number from 1 to 1000. To get words in the range aba to zunzun, the complete set of 2-syllable words, we would need a range of 1 to 17250, since that last number is the total number of 2-syllable words we can make.
We could multiply the category number by 17.25 giving us the top of the range of two-syllable root words for each category. Category 1 gets everything up to 1 * 17.25, or the first 17 roots, aba through ado. Category 1000 gets everything from the end of category 999 (17232) to 17250, or zunxen to zunzun. (Or zunshen if you prefer to spell out sh.)
Now suppose we want words for category #137 ("Infrequency.-- N. infrequency, rareness, rarity; fewness &c."). We have our choice of words in the range 136 * 17.25 + 1 to 137 * 17.25, or 2346 to 2363, or the 17 words from dakin to dama. (For convenience, a table of all possible two-syllable words can be downloaded here.)
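The lookup can be sketched in Python as follows. The ordering used here (vowel-only initials first, then consonants alphabetically, open syllables before n-final ones) is my assumption; it reproduces aba through ado for category 1 and ends at zunzun, but I have not checked it against the downloadable table. The function names are hypothetical.

```python
import math

# Illustrative sketch; ordering and rounding are assumptions.
CONSONANTS = "b d f j k l m n p s t v x z".split()
AFTER_N = "b p d j k s t x z".split()
VOWELS = "aeiou"

def two_syllable_words():
    """Enumerate all two-syllable words in one plausible table order."""
    words = []
    for c1 in [""] + CONSONANTS:
        for v1 in VOWELS:
            for coda in ("", "n"):
                if coda:   # final n merges into a cluster (m before b and p)
                    onsets = [("m" if c in "bp" else "n") + c for c in AFTER_N]
                else:
                    onsets = CONSONANTS
                for c2 in onsets:
                    for v2 in VOWELS:
                        for end in ("", "n"):
                            words.append(c1 + v1 + c2 + v2 + end)
    return words

def category_roots(category, table, per_category=17.25):
    """Return the roots for a Roget category number (1-based positions)."""
    start = math.floor((category - 1) * per_category) + 1
    end = math.floor(category * per_category)
    return table[start - 1:end]

table = two_syllable_words()
print(len(table))                    # 17250
print(category_roots(1, table))      # aba ... ado
print(table[2346 - 1])               # dakin
```

Category boundaries depend on how the fractional positions are rounded and on the exact table order, so this sketch can differ by a word or two from the ranges quoted in the text.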
For paired-opposite categories like #136 (frequency) and #137 (infrequency) the word roots could be pooled for both categories and a negative prefix or suffix, or vowel infix added to get the opposite meaning. Those two categories between them have these roots to work with:
dafu dafun daja dajan daje dajen daji dajin dajo dajon
daju dajun daka dakan dake daken daki dakin dako dakon
daku dakun dala dalan dale dalen dali dalin dalo dalon
dalu dalun dama
Thus dali might mean frequently and udali, dalia, or duali for infrequently (using prefix, suffix, vowel infix in that order). In the same category, daja could mean common, and udaja, uncommon or rare, while rarely might be udajani, or dajiani. Related, but less frequently used words like seldom could extend the range by using another syllable or two, like udajunde (with negative prefix), or damianati (with negative vowel infix).
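The three ways of forming an opposite can be sketched as a tiny helper. This is only an illustration with a hypothetical name, negate. Placing the infix immediately before the first vowel is my assumption; it reproduces duali, but the longer examples above show that the infix position can vary.

```python
# Illustrative sketch; the infix position is an assumption.
def negate(root, method, affix):
    """Attach a negative affix to a root as a prefix, suffix, or vowel infix."""
    if method == "prefix":
        return affix + root
    if method == "suffix":
        return root + affix
    if method == "infix":
        for i, ch in enumerate(root):
            if ch in "aeiou":                 # insert just before the first vowel
                return root[:i] + affix + root[i:]
        raise ValueError("root has no vowel")
    raise ValueError("unknown method: " + method)

print(negate("dali", "prefix", "u"))   # udali
print(negate("dali", "suffix", "a"))   # dalia
print(negate("dali", "infix", "u"))    # duali
```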
Alternatives
The online version of Roget's thesaurus (hopelessly outdated, but somewhat useful just the same) has 1000 conceptual categories. Many of those categories come in pairs of opposites that can be combined, like #325 Elasticity and #326 Inelasticity. Others form related sets that can be combined, like #618 Good and #648 Goodness or #890 Friend and #888 Friendship.
After combining pairs of opposites, and related concepts that can be derived from one another, I ended up with 663 categories, although your mileage may vary. Some pairs of opposites, however, are so fundamental that I left them separate, like "good" and "evil".
Next I took a list of some 50,000 English words and extracted every initial consonant cluster and ended up with a list of 63 (after removing sound-alikes like "wr-" and "r-" or "ph-" and "f-"). Then, using the same list, I extracted clusters of consonants found between two vowels. Again removing sound-alikes, I ended up with a list of 763 usable medial consonant clusters.
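The extraction step can be approximated with regular expressions. The sketch below is not how I did it, just one way it could be done; the function names and the tiny sample word list are made up for illustration. It works on spelling only, so sound-alikes such as "ph-" and "f-" still have to be merged by hand.

```python
import re

# Illustrative sketch; works on spelling, not pronunciation.
CONSONANT_RUN = r"[b-df-hj-np-tv-z]+"   # letters other than a e i o u

def initial_clusters(words):
    """Collect the consonant run at the start of each word."""
    found = set()
    for w in words:
        m = re.match(CONSONANT_RUN, w.lower())
        if m:
            found.add(m.group())
    return found

def medial_clusters(words):
    """Collect consonant runs that sit between two vowels."""
    pattern = r"(?<=[aeiou])" + CONSONANT_RUN + r"(?=[aeiou])"
    found = set()
    for w in words:
        found.update(re.findall(pattern, w.lower()))
    return found

sample = ["string", "tree", "apple", "phone", "extra", "banana"]  # made-up sample
print(sorted(initial_clusters(sample)))
print(sorted(medial_clusters(sample)))
```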
By picking the "best" initials and the "best" medials I came up with a list of 1200 CVC roots (downloadable here) with the vowel not specified. That makes one root for every one of the 663 Roget's categories with enough left over that many denser categories can get two roots. From each root longer words, of 2, 3, or more syllables, can be coined by adding vowel-initial prefixes, and/or syllables to the end of the root. Since no internal vowel is specified, one has complete freedom to use whatever vowels or diphthongs work for a given root, and to use vowels for inflections or derivations.
Unlike the Semitic triconsonantal root system, there are only two consonants (or clusters), and the location of the internal vowel is fixed. Unlike a philosophical system, within each category lexical chaos may be allowed to reign, although subdividing the category could be done in a somewhat systematic manner as well.
For example, suppose a category gets the roots pr-k and pr-ks. Those roots could form the basis of words like prokano, praksis, prokseniot, upriko, pruksi, priaksuma, apraksia, etc., all having assigned meanings related to the general category assigned to those roots.
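Coining a word from such a root amounts to filling the vowel slot and attaching optional material at either end. Here is a minimal sketch of that idea; the function name coin and the "-" slot convention in code are my own illustration.

```python
# Illustrative sketch; the '-' marks the unspecified internal vowel.
def coin(root, vowel, prefix="", suffix=""):
    """Fill the root's vowel slot and add an optional prefix and suffix."""
    if root.count("-") != 1:
        raise ValueError("root must contain exactly one '-' vowel slot")
    return prefix + root.replace("-", vowel) + suffix

print(coin("pr-k", "o", suffix="ano"))               # prokano
print(coin("pr-ks", "a", suffix="is"))               # praksis
print(coin("pr-k", "i", prefix="u", suffix="o"))     # upriko
print(coin("pr-ks", "ia", suffix="uma"))             # priaksuma
print(coin("pr-ks", "a", prefix="a", suffix="ia"))   # apraksia
```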
Using such a system one could take a copy of the online Roget's and insert at the head of each category the assigned roots. Then when it came time to coin a specific word one could look up that word in Roget's, glance at the top of the category for the word to be coined and know which roots could be used to coin the new word.