Implementing Natural Language Machine Translation
of Text From High-Level Encoding of Meaning

by Gary J. Shannon

---------- Table of Contents ----------

1. What's New as of Jan. 17, 2004

The latest changes and updates to this web page.

2. Introduction  (Dec. 30, 2003)

The basic ideas behind Chomp.

3. The Language Generator (Dec. 30, 2003)

An overview of the function of the language generator module, or LGM.

4. Metacode Overview (Dec. 30, 2003 - revised Jan. 17, 2004)

The basic syntax and formatting rules of Chomp metacode, and the design philosophy that guided these choices.

5. Noun Arguments (Dec. 30, 2003 - revised Jan. 17, 2004)

The arguments that need to be made available to nouns including how to format those arguments and why each one is needed.

6. Pronoun Arguments (Dec. 31, 2003 - revised Jan. 17, 2004)

Why pronouns need arguments, and how to format them.

7. Verb Arguments (Jan. 4, 2004 - revised Jan. 17, 2004)

The verb is the heart and soul of the Chomp LGM.  Verbs take the largest number of possible arguments of any word class in Chomp.  This brief introduction explains why and shows how those arguments are formatted.

8. Conditional and Structural Forms (Jan. 17, 2004)

These are the forms that glue the components together into a complete concept.

9. Word, Structure, Concept (Dec. 31, 2003 - revised Jan. 4, 2004)

An overview of how the pieces interact to represent a complete concept.

10. The Next Step (Dec. 31, 2003 - revised Jan. 4, 2004)

Introduces the internal workings of the Language Generator Module and the concept of pattern templates.

11. Template Repertoire (Dec. 31, 2003 - revised Jan. 4, 2004)

A preliminary look at the LGM template repertoire, and some issues affecting its eventual size.

12. Some Implementation Options (Dec. 31, 2003 - revised Jan. 17, 2004)

Mentions a few different options for implementing Chomp and the LGMs.

13. The Pilot Project (Dec. 31, 2003 - revised Jan. 4, 2004)

Introduces the pilot project to machine-translate a small selection from Moby Dick by Herman Melville into three different languages from the same Chomp encoding.

14. Some Pilot Project Assumptions (Jan. 2, 2004 - revised Jan. 2, 2004)

Some assumptions and presuppositions prior to beginning actual work on the project.  These are provisional assumptions and are expected to be revised or discarded as the project proceeds.

15. The Ishmael Text in Chomp (Jan. 2, 2004 - revised Jan. 6, 2004)

The text of the portion of Moby Dick to be used in the pilot study is translated into Chomp.

16. LGM Templates (Jan. 3, 2004 - revised Jan. 5, 2004)

This takes a much closer look at the operation of LGM phrase templates.

17. Sentence Template Generality (Jan. 4, 2004 - revised  Jan. 3, 2004)

A key question is whether a selection of LGM templates adequate to translate one text will work with a different text.  If LGM templates are not general enough then a new set would have to be created for each work to be translated, defeating the whole purpose of Chomp which is fast and accurate machine translation of a wide variety of text sources into any human language for which an LGM has been written.  This section explores the generality of the LGM template repertoire.

18. Categorizing Verbs (Jan. 6, 2004)

Linguistic scientists put a great deal of stock in categorizing the types of verbs and their argument structures.  Here we take a look at why all this theory can pretty much be dispensed with using the present method.

19. The Implementation - The English LGM (Jan. 6, 2004)

This is where we enumerate all the templates and verbal functions necessary to implement the English language LGM for the pilot text.

19.1 The Implementation of the Verb "Be" (Jan. 7, 2004) Code for the first detailed verb implementation.
19.2 Global Arguments (Jan. 7, 2004) Some parenthetical remarks on global arguments that might govern the translation.
19.3 Going Live (Jan. 7, 2004) The first live computer test of a verb generator function.

20. The Latin LGM (coming soon)

21. The German LGM (coming soon)


What's New - Jan. 17, 2004

Jan. 17, 2004. Made some corrections to sections 4, 5, and 6.

Jan. 7, 2004. Added the first relatively complete implementation of a verb generator along with its templates in section 19.  This implementation will undergo live testing shortly.  Also explored a few ideas about global arguments.

Jan. 6, 2004. Added section 18 discussing how verbs are implemented in the Chomp language generation modules.  Made a few minor corrections to the Ishmael text in Chomp.  Added section 19 examining the expected translation and enumerating the templates necessary to achieve that translation.

Jan. 3 to 5, 2004, saw a complete rewrite of all the sections having anything to do with Chomp syntax and formatting since both of these underwent major revision and improvement to readability.  Many thanks to my sister, Kathleen Spracklen, for her valuable input in helping me to clarify some syntax and formatting issues.


Introduction

This is a brief overview of a method of implementing accurate machine translation of text into a variety of target languages.  Rather than base such translations on a source text in any particular natural language, the method described here bases its translations on a source document encoded in a meta-language which is designed to encode the underlying meaning of the text rather than the specifics of how that meaning is conveyed in any particular language.

Documents in the meta-language, called Chomp, can be created in one of two ways.  The author can write the document directly in Chomp, or an original document written in some natural language can be interactively translated into Chomp in an iterative process (see figure 1) whereby successive refinements of the meta-code are passed through the target language generator and the result compared to the original.  When the meta-code has been thus "debugged" the meaning of the original will have been correctly encoded and all translations to other target languages should then be correct as well.


Figure 1. Iterative Refinement, or debugging
of the Meta-Code source document.

The process of creating a Chomp document from a natural language document requires a great deal of human intervention to ensure accuracy of the results.  Once such a source document has been created, however, translation of that source to other target natural languages is completely automated.  Thus for an expenditure of labor comparable to that of doing a single human translation, an unlimited number of translations can be obtained.  In other words, for the cost of a single English-to-French translation the document can be translated to French, German, Italian, Spanish, Japanese, Korean, Chinese, Swedish, Swahili, Arabic, Urdu, Latin, Greek, Tok Pisin, and so on.


The Language Generator

In order to translate to a particular target language a language generator module, or LGM, has to be written for that target language.  This is a matter of providing a specified set of sentence templates and recursive template evaluation rules along with a lexicon, which includes such vital information as word inflections, genders, and so on.  The task of writing a language generator is a highly specialized one requiring a firm grasp of the grammar of the target language and a good understanding of the inner workings of the language generation engine.  Once this job is complete, however, the same generator may be used whenever it is required to make a translation into that target language.

This pilot project includes developing basic language generators for English, Latin and German.

The language generator makes no attempt to include every subtle feature of a given language.  It will, in other words, write clear, correct, unambiguous and highly readable sentences, but it will not create great literature.  Stylistically an English translation, for example, will more closely resemble Basic English than Charles Dickens.  However, it is entirely possible to write a language generator for Elizabethan English, or Chaucerian English if those translations were needed, and a clever designer could no doubt write a language generator in the style of Shakespeare.  Such embellishments are possible because the job of writing the language generator, while not trivial, is fairly straightforward.


Meta-Code Overview

The Chomp statement is built around forms.  A form consists of a keyword, a possible value associated with that keyword, and an optional collection of arguments.  The arguments of a form are themselves forms, and the top-level form that represents an entire sentence is the parent form or p-form of every other form within the sentence.  Any form that is an argument of a parent form is referred to as a descendant form, or d-form.  The structure of forms and their descendants resembles the branching structure of a tree and the top-level parent form is called the root form.  A simple example would be the English language assertion "The rock broke the window."  In Chomp that would be encoded:


     Verb = break
        Agent = rock
        Object = window
        PAST

So far this doesn't look the least bit like a tree, but bear with us for a little while and the buds will begin to sprout.  Notice that the verb has been given the value "break".  This name is completely arbitrary and is not to be confused with the English word "break".  The verb could just as easily have been named "glipnorp", just as the agent or actor of the verb, "rock", could have been named "kugel", and the object, "window", could have been named "zipity_do_dah".  The whole form might then have been written:


     Verb = glipnorp
        Agent = kugel
        Object = zipity_do_dah
        PAST

Clearly, mnemonic names like "break" are to be preferred to names like "glipnorp", but it must never be forgotten that these are arbitrary labels and not actual words in any language.  When a form is invoked in the LGM it will be responsible for providing the proper words and sentence structures to convey its meaning in the target language.  Furthermore, once the Chomp meta-code is ready for translation it could be compiled into a binary format, and names like "rock" and "break" would not even exist in the binary meta-code file.  In this way, even though translators working in different languages might use different names for their objects and assertions ("stein" or "calculus" instead of "rock", for example), the compiled meta-code would be identical regardless of the language used in naming the objects.
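
The idea that labels disappear on compilation can be illustrated with a minimal Python sketch.  Everything in it is an assumption made for illustration only: the "Form" class, the "compile_form" function, and the notion of a shared table that maps each translator's private labels to numeric concept IDs.  The actual binary format is not specified here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Form:
    """One Chomp form: a keyword, an optional value label, and child forms."""
    keyword: str
    value: Optional[str] = None
    args: List["Form"] = field(default_factory=list)


# Hypothetical label tables: two translators use different private labels
# for the same three concepts, which share the numeric IDs 1, 2 and 3.
MNEMONIC_LABELS: Dict[str, int] = {"break": 1, "rock": 2, "window": 3}
WHIMSICAL_LABELS: Dict[str, int] = {"glipnorp": 1, "kugel": 2, "zipity_do_dah": 3}


def compile_form(form: Form, labels: Dict[str, int]):
    """Replace every value label with its concept ID; keywords are kept."""
    value_id = labels[form.value] if form.value is not None else None
    children = tuple(compile_form(child, labels) for child in form.args)
    return (form.keyword, value_id, children)


mnemonic = Form("Verb", "break",
                [Form("Agent", "rock"), Form("Object", "window"), Form("PAST")])
whimsical = Form("Verb", "glipnorp",
                 [Form("Agent", "kugel"), Form("Object", "zipity_do_dah"), Form("PAST")])

# Both sources compile to the same structure, so the labels no longer matter.
print(compile_form(mnemonic, MNEMONIC_LABELS) == compile_form(whimsical, WHIMSICAL_LABELS))
```

In this sketch the boolean argument "PAST" is simply a form with no value, which is one plausible way to treat it.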

The agent of a verb form is usually a noun, or something that performs the function of a noun, and which is responsible for the action or state being asserted.  In the example above "rock" is the agent of the verb "break" since it is the proximate cause of the breakage. The agent may also be an entire sub-tree containing numerous d-forms beneath it.  This occurs when the agent performing the action is more than just a single word as in "The clown with the big red nose and baggy yellow pants broke my balloon."

The verb form may be parent to one or more arguments.  These arguments, when they are present, provide information about what is being asserted and how it is being asserted.  In the example the verb is parent to the argument "Object = window" in order to express what it is that experienced the action of "break".

D-forms are shown as indented below their parent p-form, but there is another way to write such a form.  If the number of arguments to a form is small it may be desirable to put all of the arguments on a single line.  For example, in the above form we failed to specify whether "rock" and "window" were singular or plural.  This would be entered into the form with what is called a boolean argument.  A boolean argument is a word from a pre-defined list of words or tags that is either present or absent in the argument list.  Verb tense, mood and voice can be specified by boolean arguments such as "PAST", "IMPERATIVE", "PERFECT", "FUTURE", and so on.  Noun number, gender, article and demonstrative pronouns, among other attributes, are also coded with boolean arguments.  Boolean arguments are normally written in all upper case letters to distinguish them from value parameters.

To specify that the rock that broke the window is singular, and use the article "the" (in languages that use articles), the arguments "S" and "T" would be added under "rock" like this:


     Agent = rock
        S
        T

But to save space, and make the code more readable, we can put both arguments on the same line with "rock" provided that we place the arguments inside parentheses and separate them with commas in lieu of placing them each on a new line.


     Agent = rock( S , T )

Note that spaces are optional within the parentheses.  Spaces or tabs in front of the lines of the form, however, are not optional.  To be considered as being on the same level of the tree each form must be at the same level of indentation.  Unlike many computer programming languages where indentation is optional, in Chomp indentation is both mandatory and significant to the LGM that reads the Chomp source encoding.
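
To make the role of indentation concrete, here is a minimal Python sketch of how a reader of Chomp source might recover the tree from indentation alone.  The function name "parse_indented" and the (text, children) representation are assumptions for illustration, and the sketch assumes that leading whitespace is not a mixture of tabs and spaces.

```python
def parse_indented(lines):
    """Build a nested list of (text, children) nodes from indentation depth."""
    roots = []
    # Each stack entry pairs an indentation width with that node's child list.
    stack = [(-1, roots)]
    for raw in lines:
        if not raw.strip():
            continue  # blank lines carry no structure
        indent = len(raw) - len(raw.lstrip())
        node = (raw.strip(), [])
        # Pop back to the nearest line that is indented less than this one.
        while stack[-1][0] >= indent:
            stack.pop()
        stack[-1][1].append(node)
        stack.append((indent, node[1]))
    return roots


source = [
    "     Verb = break",
    "        Agent = rock",
    "           S",
    "           T",
    "        Object = window",
    "        PAST",
]
tree = parse_indented(source)
print(tree[0][0])                            # the root form
print([child[0] for child in tree[0][1]])    # its direct d-forms
```

Here "S" and "T" become children of "Agent = rock" purely because they are indented more deeply than it is.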

A more complete version of the broken window scenario described above would look like this:


     Verb = break
        Agent = rock( S, T )
        Object = window( S, A )
        P-PERFECT
        Qualifier = might

     The rock might have broken a window.

The boolean "P-PERFECT" specifies present perfect tense, and the keyword "Qualifier" is followed by some value that will modify the manner in which the action is expressed by the verb.  The combination of stated tense and the modifier "might" (again, not to be confused with the English word "might") informs the LGM to apply whatever form is required to express the assertion in the target language: "The rock might have broken a window."

Note that the order of the argument d-forms is not significant.  Since each is identified by its own keyword they can be placed in any order within their parent form.

About parens and commas: A left parenthesis is equivalent to a new line indented one level deeper.  A right parenthesis is equivalent to a new line un-indented by one level.  A comma is equivalent to a new line at the same level of indentation as the previous argument.  That means that two arguments can be placed on the same line if they are separated by commas, and that once placed within parentheses all arguments separated by commas will be at the same level until a deeper nesting of parentheses occurs on that line.  Deep nesting of parentheses on a single line is discouraged, however, since it creates readability problems as anyone familiar with the programming language LISP can attest to.

Compare the above example with:

     Verb=break(Agent=rock(S,T),Object=window(S,A),P-PERFECT,Qualifier=might)

Now imagine that this tree was not one layer deep, but six or eight layers deep and you will appreciate why it is best to use more lines and fewer parentheses.  A source encoding with no parentheses or commas is probably over-indented, but there is a happy medium which all coders will discover for themselves.  It is good practise, however, to put certain boolean arguments in parentheses with the word they modify.  Verb tenses, voices and moods are a good example as in "Verb = run( FUTURE )".
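
The equivalence between parentheses, commas and indented lines can be sketched in a few lines of Python.  The function name "expand_line" and the three-space indentation unit are assumptions for illustration, and this sketch does not handle quoted strings that themselves contain commas or parentheses.

```python
def expand_line(line, indent_unit="   "):
    """Rewrite a one-line form as indented lines, one argument per line."""
    depth = 0
    lines = []
    token = []

    def flush():
        # Emit the text collected so far at the current depth.
        text = "".join(token).strip()
        if text:
            lines.append(indent_unit * depth + text)
        token.clear()

    for ch in line:
        if ch == "(":        # new line, one level deeper
            flush()
            depth += 1
        elif ch == ")":      # new line, one level shallower
            flush()
            depth -= 1
        elif ch == ",":      # new line at the same level
            flush()
        else:
            token.append(ch)
    flush()
    return lines


one_line = "Verb=break(Agent=rock(S,T),Object=window(S,A),P-PERFECT,Qualifier=might)"
for text in expand_line(one_line):
    print(text)
```

Applied to the one-line form above, the sketch yields the root "Verb=break" followed by its four arguments one level deeper, with the noun arguments one level deeper again.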

Just to provide a hint of the flavor of more complex forms consider this example from Edgar Allan Poe's "The Pit and the Pendulum":

     The agony of suspense grew at length intolerable,
     and I cautiously moved forward, with my arms extended,
     and my eyes straining from their sockets, 
     in the hope of catching some faint ray of light.

     Join
     |  Verb = Copula( PAST, grow_increase )
     |     Onset = progressive
     |     Object = agony(ST)
     |        Of = suspense
     |     Complement = intolerable
     >AND
        Verb = move_self( PAST )
           Agent = Pronoun( NARRATOR, MALE )
           Intent = forward
           Qualifier = cautiously
           While
           |  Verb = extend( PAST-PART )
           |     Object = arm(P)
           |        Belong = ( NARRATOR, BODY-PART )
           >AND
              Verb = strain( PRES-PART )
                 Agent = eye(P)
                    Belong = ( NARRATOR, BODY-PART )
                 From = socket(P)
                    Belong = ( THIS, BODY-PART )
                    Goal
                       Type = hope
                       Verb = catch_see( PRES-PART )
                          Agent = *NARRATOR
                          Object = ray_beam(SM)
                             Of = light
                             Qualifier = faint_dim

A note on formatting:  The vertical bar character "|" is ignored by the LGM and is treated as if it were just another blank space on the line.  This makes it possible for the translator to use this character to visually align and connect arguments for greater clarity.  Its use, while optional, is highly recommended.


Noun Arguments

Here are some boolean arguments that might be used with a noun:


	S - Singular
	P - Plural
	D - Dual
	T - Definite article "the"
	A - Indefinite article "a/an/some"
	M - "some/any" used with singular or plural 
	    as in "searching for some/any sign of water"
	I - Demonstrative pronoun "this/these"
	O - Demonstrative pronoun "that/those"
	E - Plural pronoun "they" as in "they both..." or "both of them"
	U - Universal "U"="All x", "PU"="All the x" or "All of the x",
	    "Every X", "Every one of the X"
      

These single-letter arguments can be combined into a single argument as in this example:


     Verb = break( PAST )
        Agent = rock(PI)
        Object = window(SA)

Which translates: "These rocks broke a window."  Note that we must never specify the plural implicitly as in


     * Agent = rocks
     * Agent = children

Instead, it must be explicitly coded in the noun arguments as


    Agent = rock(P)
    Agent = child(P)

The reason is that different languages have different ways of forming the plural, and the language generator must be explicitly informed of the plurality of the noun. Some languages also have what is called a dual in addition to a plural form.  This form of the word is used only when there are exactly two of the objects in question.  Languages that do not have a dual form will express an explicit dual as a plural, but for compatibility with all languages, duals should be encoded.
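
As a rough illustration of why number must be coded explicitly, here is a Python sketch of how an English generator might turn a noun and its single-letter arguments into a phrase.  The function names and the tiny lexicon are hypothetical, and only the S, P, D, T, A, I and O arguments are handled.

```python
import re

# Hypothetical English lexicon: label -> (singular form, plural form).
ENGLISH_NOUNS = {
    "rock": ("rock", "rocks"),
    "window": ("window", "windows"),
    "child": ("child", "children"),
}


def parse_noun(text):
    """Split 'rock(PI)' into the label 'rock' and the flag set {'P', 'I'}."""
    match = re.fullmatch(r"\s*(\w+)\s*(?:\(\s*([A-Z]*)\s*\))?\s*", text)
    if match is None:
        raise ValueError("not a noun form: " + text)
    label, letters = match.groups()
    return label, set(letters or "")


def english_noun_phrase(text):
    """Render a noun form as an English determiner plus noun."""
    label, flags = parse_noun(text)
    singular, plural = ENGLISH_NOUNS[label]
    # English has no dual, so an explicit dual (D) is expressed as a plural.
    is_plural = bool(flags & {"P", "D"})
    noun = plural if is_plural else singular
    if "I" in flags:
        determiner = "these" if is_plural else "this"
    elif "O" in flags:
        determiner = "those" if is_plural else "that"
    elif "T" in flags:
        determiner = "the"
    elif "A" in flags:
        determiner = "some" if is_plural else ("an" if noun[0] in "aeiou" else "a")
    else:
        determiner = ""
    return (determiner + " " + noun).strip()


print(english_noun_phrase("rock(PI)"))    # these rocks
print(english_noun_phrase("child(P)"))    # children
```

The irregular plural "children" comes from the lexicon, not from the Chomp source, which is why the source says "child(P)" and never "children".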

Even English has dual forms of some words.  We would say "All of my kids like candy" only if there were more than two children involved.  If there were exactly two children under discussion we would say "Both of my kids like candy", where "both" is the dual form of "all" or "many".  Now imagine that there was a dual and plural form of "the" as well, as in "theone", "theboth", "themany", and that will give you an idea of just one of the possibilities.  Now imagine that the word "the" was replaced by ten different words, each referring to a different class of nouns so that one version of "the" is used with inanimate objects, and another with domestic animals, and yet another with gods and other celestial beings.  Clearly there is a lot more that needs to be supplied in the Chomp code than just the presence or absence of the English word "the".

There are numerous other types of arguments that can be used with nouns, and these will be explored in detail later.  Two brief examples would be the keyword argument "That" which qualifies the noun with a phrase that has an adjectival function, and the keyword argument "Belong" which specifies ownership.  For example:


     Bring me all the toys that this careless boy broke.

     Verb = bring_fetch( IMPERATIVE )
        To = NARRATOR
        Object = toy(PU)
           That
              Verb = break( PAST )
                 Agent = boy(SI)
                    Qualifier = careless


     Give me some of my books.

     Verb = give_transfer( IMPERATIVE )
        To = NARRATOR
        Object = book(PA)
           Belong = ( NARRATOR, OWN )

Since we use the keyword "Belong" to indicate ownership you might wonder why the additional keyword "OWN" is given as an argument to the function.  The reason is that the possessive is formed in a number of different ways in different languages, and some languages (members of the Malayo-Polynesian group, for example) use a different possessive for things which are owned ( John brought his books ), things which are possessed but not owned ( John brought his library books ), things which are attached parts ( John looked at his fingers ), people who are related upwards ( John visited his mother ), downwards ( John brought his son ), or sideways ( John and his brother came home ), and people with whom we have some other kind of relationship ( John saw his secretary walk in ).  In English we use the word "his" in every instance, but this is not the case in many languages, and so the possessive function "Belong" needs to know the nature of the relationship in order to form the proper sentence in those languages.
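
The following Python sketch shows, under stated assumptions, why "Belong" carries the relationship as an argument.  Only "OWN" and "BODY-PART" appear in the Chomp examples; the other relation tags are hypothetical names for the relationships just described, and the second generator produces placeholder markers for an invented language, not real words from any language.

```python
# "OWN" and "BODY-PART" come from the examples; the rest are hypothetical tags.
RELATIONS = ["OWN", "POSSESS", "BODY-PART", "KIN-UP", "KIN-DOWN", "KIN-SIDE", "ASSOCIATE"]


def english_possessive(owner_gender, relation):
    """English ignores the relation and looks only at the owner."""
    if relation not in RELATIONS:
        raise ValueError("unknown relation: " + relation)
    return {"male": "his", "female": "her"}[owner_gender]


def invented_possessive(owner_gender, relation):
    """An invented language that marks every relation differently."""
    if relation not in RELATIONS:
        raise ValueError("unknown relation: " + relation)
    return "poss-" + relation.lower()


english_forms = {english_possessive("male", r) for r in RELATIONS}
invented_forms = {invented_possessive("male", r) for r in RELATIONS}
print(len(english_forms), len(invented_forms))
```

The English generator collapses all seven relations into one word, while the invented one needs all seven, so the distinction has to be present in the source encoding.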


Pronoun Arguments

In specifying pronouns the English speaker might believe that it is sufficient to specify whether it is first person singular or third person plural, and so on.  Take for example the sentence "John threw a rock and it broke the window."  It is not enough just to specify that the pronoun "it" is the agent of the verb "break".  In English the word rock uses the neuter third person pronoun "it" and that's the end of the story.  However, the German word "stein" and the Latin words "calculus" and "lapis" would take the masculine pronoun, while other languages might treat their equivalent word as neuter or even feminine, and languages of the Bantu family might have as many as ten or fifteen different noun classes that each get treated differently.  Some languages have two forms of the pronoun "we", one of which includes the person spoken to and one which does not.  In the first case "We went to the store" means "you and I went to the store." and in the second case it means "We went to the store without you."

For these reasons it is never sufficient to specify the person and number of a pronoun.  Instead, the noun to which the pronoun applies, along with whatever additional parameters are required, must be passed to the function "Pronoun" which will return the appropriate form depending on the instructions found in the specific LGM.  
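
A minimal Python sketch of such a "Pronoun" function follows.  The dictionary layout and the function name "pronoun" are assumptions for illustration, and only third person singular subject pronouns are covered.  Each LGM supplies its own gender for the noun and its own pronoun table.

```python
# Hypothetical LGM data: the grammatical gender each language assigns to the
# concept "rock", plus that language's third person singular subject pronouns.
ENGLISH_LGM = {
    "gender": {"rock": "neuter"},
    "pronoun": {"masculine": "he", "feminine": "she", "neuter": "it"},
}
GERMAN_LGM = {
    "gender": {"rock": "masculine"},
    "pronoun": {"masculine": "er", "feminine": "sie", "neuter": "es"},
}


def pronoun(noun_label, lgm):
    """Choose the pronoun from the noun's gender in the target language."""
    gender = lgm["gender"][noun_label]
    return lgm["pronoun"][gender]


print(pronoun("rock", ENGLISH_LGM))   # it
print(pronoun("rock", GERMAN_LGM))    # er
```

Because the noun itself is passed in, the same Chomp source yields "it" in English and the masculine pronoun in German.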

Likewise, proper names may be treated differently in different languages (see, for example, Harrius Potter et Philosophi Lapis in Latin), and must be explicitly identified as such.  Thus the simple assertion "John threw a rock and it broke the window" is encoded:


   Join
   |  Verb = throw( PAST )
   |     Agent = Proper_Person
   |        Given = "John", male
   |     Object = rock(SA)
   >AND
      Verb = break( PAST )
         Agent = Pronoun( rock(SA) )
         Object = window(ST)

   John threw a rock and it broke the window.

There are a couple of details to notice about the pronoun example.  The first is that the name "John" is enclosed in quotation marks.  This is because it is an actual word as opposed to a symbolic tag.  Enclosing it in quotations ensures that it will be compiled and passed to the language generator module properly, and not as a symbolic binary tag.  Just remember that any word not in quotation marks may never be seen by the LGM as anything more than a symbolic binary tag.  

The second detail is the use of the form "Join". This form is used to connect separate forms into a compound form.  We will explore this type of form in more detail a bit further on.

As an aside, it is possible to predefine certain argument forms that will be used often in a given document.  For example, if we were going to be talking about John a lot in our Chomp document we could define the new word, John, that could be used where he is to be mentioned.  For example:


     Define "John" as "Proper_Person( Given = "John", male )"

After which this form:


     Verb = throw( PAST )
        Agent = Proper_Person
           Given = "John", male
        Object = rock(SA)

Can be written this way:


     Verb = throw( PAST )
        Agent = John
        Object = rock(SA)

Now the word "John" no longer needs to be placed in quotes because it is no longer an actual word, but is a symbolic tag for the form that defines it.

Remember that when we defined the symbol "John" we could just as easily have defined the symbol as "flocinapot" like this:


     Define "flocinapot" as "Proper_Person( Given = "John", male )"

After which this form:


     Verb = throw( PAST )
        Agent = flocinapot
        Object = rock(SA)

Would still be talking about a person with the masculine proper name "John", and would still cause the English LGM to create the sentence "John threw a rock."  We choose the symbol "John" only to make it easy to remember that this symbol refers to a person with the first name of John.  This point cannot be emphasized too often.  Symbols like "rock" and "throw" that are seen in these forms are not words, but are only symbolic references to objects within the LGM; symbolic objects that will enable the LGM to choose words to put into the sentences it builds.
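
The substitution performed by "Define" can be sketched in Python as follows.  The tuple representation of a form as (keyword, value, arguments) and the function name "expand_definitions" are assumptions for illustration only.

```python
# The form that both symbols stand for: Proper_Person( Given = "John", male ).
JOHN_FORM = ("Proper_Person", None, (("Given", '"John"', ()), ("male", None, ())))

# Two symbols defined for the same form.
DEFINITIONS = {"John": JOHN_FORM, "flocinapot": JOHN_FORM}


def expand_definitions(form, definitions):
    """Replace any value that is a defined symbol with the form it names."""
    keyword, value, args = form
    if isinstance(value, str) and value in definitions:
        value = definitions[value]
    return (keyword, value, tuple(expand_definitions(a, definitions) for a in args))


def throw_form(agent_symbol):
    """Build the 'throw' example with the given symbol as agent."""
    return ("Verb", "throw", (
        ("PAST", None, ()),
        ("Agent", agent_symbol, ()),
        ("Object", "rock", (("SA", None, ()),)),
    ))


with_john = expand_definitions(throw_form("John"), DEFINITIONS)
with_flocinapot = expand_definitions(throw_form("flocinapot"), DEFINITIONS)
print(with_john == with_flocinapot)
```

After expansion the two encodings are identical, so the choice of symbol has no effect on what the LGM receives.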


Verb Arguments

When we get into the details of how the LGM works we will see that the verb template is the pivot around which the whole language generation process revolves.  Verbs come in a variety of types including active, passive, transitive, intransitive, and on and on.  In Chomp, the way the verb is used in the sentence is governed by which arguments are present in the verb's argument list.  Recall that boolean arguments are either present or absent but will not be followed by an equal sign or a value.  Some of these arguments include:


Agent:  The actor, person or thing, that is performing the action, or is primarily responsible for the action taking place. In "John threw the rock" John is the agent, while in "The rock broke the window" the rock is the agent.

Object:  The person or thing that is directly acted upon or affected by the action, whether to its benefit or detriment. In "John threw the rock" the rock is what is being affected, and is the object.  In "The rock broke the window" it is the window that is being acted upon and "window" is the object.

Target:  The person or thing toward which, about which, or on whose behalf the action is performed. In "John threw the rock at the window" the window is the target.  In "John gave the book to Mary", Mary is the target.

NOT:  This boolean argument specifies that the verb is to be expressed as not having been performed.  For example, "Verb = see( NOT )" represents "did not see".

IMPERATIVE: (or IMPER)  This boolean argument indicates that the verb is in the imperative mood.  This is a verb that is giving a command as in "Bring me the book", or "Sit! Roll over!"

INDICATIVE: (or INDIC)  This argument is optional since if no mood is stated the indicative mood is assumed.  This is the verb mood used to make straightforward statements or ask questions.

SUBJUNCTIVE: (or SUBJ)  This argument indicates the verb is in the subjunctive mood.  

PASSIVE: (or PASS)  This argument indicates that the verb is to be expressed in the passive voice. Thus "John gave the book to Mary" becomes the passive "The book was given to Mary by John."

PRESENT: (or PRES)  This argument indicates that the action of the verb is taking place at the present moment.  In some languages this may be expressed in only one way, and in others, such as English, it may be expressed in different ways: "I run." and "I am running" for example.

PAST:  This argument indicates that the action of the verb took place in the past and is now complete. "John read the book."

PERFECT: (or PERF)  This argument indicates that the action of the verb took place in the past and was complete in the past. "John had read the book."

IMPERFECT: (or IMPF)  This argument indicates that the action of the verb was taking place in the past in a continuous, and possibly not complete, manner.  "John was reading the book."

FUTURE: (or FUTR)  This argument indicates that the action of the verb has not yet taken place, but will take place in the future.  "John will read the book."

There are numerous other tenses, voices, moods, and modifiers of all sorts that can be used with verbs. These will be explained and demonstrated as they are encountered.
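
As a small illustration, the abbreviations and the default mood listed above could be handled as in the following Python sketch.  The function name "normalize_verb_flags" is hypothetical; the abbreviation table is taken directly from the list above.

```python
# Short forms and the full boolean arguments they stand for.
ABBREVIATIONS = {
    "IMPER": "IMPERATIVE",
    "INDIC": "INDICATIVE",
    "SUBJ": "SUBJUNCTIVE",
    "PASS": "PASSIVE",
    "PRES": "PRESENT",
    "PERF": "PERFECT",
    "IMPF": "IMPERFECT",
    "FUTR": "FUTURE",
}
MOODS = {"IMPERATIVE", "INDICATIVE", "SUBJUNCTIVE"}


def normalize_verb_flags(flags):
    """Expand abbreviations and assume the indicative if no mood is stated."""
    full = {ABBREVIATIONS.get(flag, flag) for flag in flags}
    if not full & MOODS:
        full.add("INDICATIVE")
    return full


print(sorted(normalize_verb_flags(["NOT", "PRES"])))
print(sorted(normalize_verb_flags(["IMPER"])))
```

With this normalization "Verb = see( NOT, PRES )" and "Verb = see( NOT, PRESENT, INDICATIVE )" would present the same set of arguments to the verb template.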


Conditional and Structural Forms

Take a look at this assertion from chapter one of the Chinese classic Tao Te Ching: "The Tao(path) that can be described is not the eternal Tao."


     If a path can be described then it is not the eternal path.

     IF ( Is_Possible )
     |  Verb = describe
     |     Object = way_path(SA)
     >THEN
        Verb = Copula( NOT, PRES )
           Object = Pronoun( way_path(S) )
           Complement = way_path(ST)
              Qualifier = eternal

There are other ways to encode this assertion but this particular encoding ("If a path can be described then it is not the eternal path.") demonstrates the use of the "If" form. The first subtree under the "If" describes some condition, and the second subtree describes what the result of that condition is.

The "Join" form joins its arguments together with a logical "and":


     The ball and bat are here.

     Verb = Copula( PRES )
        Location = here
        Object
           JOIN
           |   ball_toy(ST)
           >AND
               bat_stick(ST)

Notice that the two objects jointly share the location, as opposed to:


     The ball is here and the bat is over there.

     Join
     |  Verb = Copula( PRES )
     |     Location = here
     |     Object = ball_toy(ST)
     >AND
       Verb = Copula( PRES )
          Location = overthere
          Object = bat_stick(ST)

The "If" form, which we looked at above and which can also be thought of as "Implies", asserts that the first argument is a conditional assertion which, if satisfied, implies that the second argument is also true.  Thus:


     If the ball is not here then it must be lost.

     IF
     |  Verb = Copula( NOT, PRES )
     |     Object = ball_toy(ST)
     |     Location = here
     >THEN
        Verb = Copula( PRES )
           State = lost
           Modifier = must_necessarily
           Object = Pronoun( ball_toy(ST) )

The first argument is the conditional and the keyword argument ">THEN" is the consequence of that conditional being true.  A similar, but slightly different meaning is generated by these encodings:


     Since the ball is not here, it's probably lost.

     Since_Because
     |  Verb = Copula( NOT, PRES )
     |     Object = ball_toy(ST)
     |     Location = here
     >THUS
        Verb = Copula( PRES )
           State = lost
           Modifier = probably
           Object = Pronoun( ball_toy(ST) )


     The ball is not here so it seems to be lost.

     Assert_That
     |  Verb = Copula( NOT, PRES )
     |     Object = ball_toy(ST)
     |     Location = here
     >THUS
        Verb = Copula( PRES, mod = seem )
           State = lost
           Object = Pronoun( ball_toy(ST) )

In these cases the phrase "the ball is not here" is asserted to be an unconditionally true fact rather than possibly true, and the ">THUS" keyword clause asserts that its argument is true as a consequence of the truth of the first argument.
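
A rough Python sketch of how an English generator might realize these three structural forms is given here.  The template strings and the function name "render_structure" are assumptions for illustration, and the sketch takes the two clauses as ready-made text, whereas a real LGM would generate them by evaluating the sub-forms.

```python
# Hypothetical English templates, one per structural form.
ENGLISH_STRUCTURES = {
    "IF": "if {first} then {second}.",
    "Since_Because": "since {first}, {second}.",
    "Assert_That": "{first} so {second}.",
}


def render_structure(structure, first, second):
    """Fill the template for a structural form and capitalize the sentence."""
    sentence = ENGLISH_STRUCTURES[structure].format(first=first, second=second)
    return sentence[0].upper() + sentence[1:]


print(render_structure("IF", "the ball is not here", "it must be lost"))
print(render_structure("Since_Because", "the ball is not here", "it's probably lost"))
print(render_structure("Assert_That", "the ball is not here", "it seems to be lost"))
```

Another language's generator would supply its own three templates, which need not keep the clauses in the same order.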

You will also notice that the label "ball_toy" is used rather than "ball".  This is to distinguish it from "ball_bullet", "ball_party", "ball_call", and "ball_testicle", which refer respectively to ammunition for a muzzle-loading gun, a formal party, a call made by a baseball umpire, and the slang for a portion of the male anatomy.  Many other languages will use five separate words for these five senses of the English word "ball", and so they must be distinguished.  The verb "ball", as in to wad something up into a ball, would be treated separately since it is not a noun.

Notice also the verb "Copula".  This refers to what is called the copulative, or linking verb in English, although this concept might be rendered in a very different way in another language.  The copula represents the concept and not any of the English copulative verbs (is, seem, appear, become, etc.).  The copula may also be modified with a "mod =" argument which gives the alternate copulative verb or concept such as those mentioned above.  The copula may take a generic complement, or a complement of the type "Location" or "State" as well as a few others that we will encounter later.


Word, Structure, Concept

Some of the things encoded in Chomp are written as words, or symbolic tags that represent words. Others are written as concepts embodied in some structural form.  For example, in the sentence "If I see him I'll say hello" we treat the word "if" as a structure of the type "IF <condition> THEN <occurrence>".  Rather than try to express this concept in words, we leave it to each language generator to decide how best to realize this structure in that particular language.

In general, Chomp encodings are more about meaning than about words.  Ideas like "It's so dark I can't see a thing" get encoded in structures like "INTENSITY <something> IS_SUCH_THAT <event or state>", which capture the meaning without attempting to capture the exact wording.  Several examples of this type will be seen in the Ishmael Text below.

Where words are concerned it cannot be expected that one language will have the exact equivalent of every word in another language.  The universe is divided up differently from one language to the next, and a word that has several meanings in English, for example, might be translated with several different words in Spanish.  Consider the simple word "to".  The American Heritage Dictionary lists 15 prepositional definitions and 4 adverbial definitions.  A native English speaker with no exposure to other languages might assume that any other language would have some single word that duplicated all 19 of those meanings.  This is simply not the case.  That is why it is required that the meanings of verbs and nouns be clarified with additional words in many cases.  

Examples might include verbs like call_phone, call_summon, call_shout, call_visit, call_designate, call_invoke, and so on. The translator would need to be familiar with all the forms in order to select the proper ones for something like "I called him to say that the umpire made a bad call against him but that he shouldn't have called him an idiot to his face."  Here the same word expresses three different meanings and must be encoded accordingly since it will very possibly be translated into three different words in the target language.

It is also a good idea to use a selection of words that indicate a range of meanings such as "dismal_grim_gloomy".  Given a range of meanings the language generators can do a better job of choosing the appropriate word in any given language.
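
As a minimal sketch of how sense tags might select different words in different target languages, the following Python fragment keeps one small lexicon per language, keyed by tag.  The tags come from the text; the table layout, the function name "lookup", and the choice of Spanish as a second language are assumptions made only for illustration.

```python
# Hypothetical per-language lexicons keyed by Chomp sense tags.
# The tags are taken from the text; everything else is illustrative.
LEXICON = {
    "english": {"ball_toy": "ball", "ball_bullet": "ball", "ball_party": "ball"},
    "spanish": {"ball_toy": "pelota", "ball_bullet": "bala", "ball_party": "baile"},
}


def lookup(tag: str, language: str) -> str:
    """Return the target-language word for a single sense tag."""
    try:
        return LEXICON[language][tag]
    except KeyError:
        raise KeyError(f"no {language} entry for tag {tag!r}") from None


# One English word, three Spanish words: the tag carries the distinction.
for tag in ("ball_toy", "ball_bullet", "ball_party"):
    print(tag, "->", lookup(tag, "english"), "/", lookup(tag, "spanish"))
```

Because the English column collapses all three tags to one word, the distinction could not be recovered from the English text alone; it has to be recorded in the encoding.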

This is also why idiomatic phrases like "call it a day", or "he's a pushover" will never be found encoded in Chomp.  Instead, the underlying meaning of the idiom is encoded, i.e. "Cease current activity, go home from work, ...", "he is easily influenced_swayed_convinced,..."


The Next Step

Beyond these basics it is "merely" a matter of enumerating all the forms and structures and their possible argument values and detailing the behavior of each type of symbolic tag.  The assertions made in the meta-code are evaluated by the language generator by a process resembling the execution of a computer program.  The language generator consists of definitions for each of the verbs which, when called, return a value in the form of a string of characters representing the meaning of that assertion centered on that verb in some particular natural language.

We can follow the details of the process with a few examples to better understand the workings of the language generator module. Each form contains a template and is treated very much like a function is treated in a programming language.  In other words, the form is called upon to return some value to the LGM when some other values are passed to it.  This will become more clear as we examine some examples. Consider this Chomp code:


     I didn't eat dinner because I don't like eggs.

     Assert
     |  Verb = eat( NOT, PERF )
     |     Agent = Pronoun( NARRATOR )
     |     Object = dinner(S)
     >BECAUSE
        Verb = like_fancy( NOT, PRES )
           Agent = Pronoun( NARRATOR )
           Object = egg(P)

The Assert->BECAUSE form has an internal template that it seeks to fill with values passed to it as arguments.  In the Latin LGM, for example, that template might take the form:


     Assert-argument + "quod" + BECAUSE-argument

Substituting the arguments into this template we get:


     Assert-argument is replaced by "eat( NOT,PERF,Agent=Pronoun(NARRATOR),Object=dinner(S))"
     BECAUSE-argument is replaced by "like_fancy(NOT,PRES,Agent=Pronoun(NARRATOR),Object=egg(P))"

     eat( NOT,PERF,Agent=Pronoun(NARRATOR),Object=dinner(S)) +
     "quod" +
     like_fancy(NOT,PRES,Agent=Pronoun(NARRATOR),Object=egg(P))

Since there are still forms that have not been resolved into actual words the process continues recursively by passing parameters to the "eat" form and to the "like_fancy" form.  That "eat" form, which is also very much like a function that might be called in a programming language, has (among others) the internal template:


     Object.Form[ pIndex, ACCUS ] + "non" + v_Form[ pIndex, tense ]

Without worrying too much about the details, this template calls upon the object argument ("dinner") to state its form in the accusative case for the person and number appropriate to the agent (Pronoun) of the action.  This is a simple table lookup that returns the character string "cenam".  In the same spirit the table "v_Form" is consulted for the appropriate form of the verb "to eat" which returns the string "edebam".  These two evaluated arguments are slotted into the template which is concatenated to become:


     Object.Form[ pIndex, ACCUS ] is replaced by "cenam"
     v_Form[ pIndex, tense ] is replaced by "edebam"

     "cenam" + "non" + "edebam"
     "cenam non edebam"

This string is then passed back to the function that called for it, namely "Assert".  By replacing the first argument with this value the template being built up by "Assert" then becomes:


     eat( NOT,PERF,Agent=Pronoun(NARRATOR),Object=dinner(S)) is replaced by "cenam non edebam"

     "cenam non edebam" + "quod" +
     like_fancy(NOT,PRES,Agent=Pronoun(NARRATOR),Object=egg(P))

Next the "Assert" function calls upon the "like_fancy" function to evaluate its arguments.  "like_fancy" goes through the same kind of table lookup and template stuffing process based on gender and grammatical case of the nouns and tense, mood and voice of the verbs and generates the character string:


     "ova non amo"

which it returns to the "Assert" function.  Finally the "Assert" function makes this last substitution and returns the completed character string:


     like_fancy(NOT,PRES,Agent=Pronoun(NARRATOR),Object=egg(P)) is replaced by "ova non amo"

     "cenam non edebam" + "quod" + "ova non amo"
     "cenam non edebam quod ova non amo"

Which is proper Latin for "I didn't eat dinner because I don't like eggs."  More complex sentences are built up recursively in this same manner.
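
The walk-through above can be sketched in Python as a pair of functions that call one another recursively.  This is only an illustration under stated assumptions: the dictionary layout, the function names, and the nested-dictionary representation of the metacode are all invented here, the lookup tables contain only the four word forms quoted in the text, and the agent is ignored because the first person singular is already baked into those forms.

```python
# Lookup tables standing in for the Latin LGM's dictionary.  Only the word
# forms quoted in the text are present; the layout is an assumption.
NOUN_FORMS = {("dinner", "ACCUS"): "cenam", ("egg", "ACCUS"): "ova"}
VERB_FORMS = {("eat", "PERF"): "edebam", ("like_fancy", "PRES"): "amo"}


def verb_form(verb, tense, negated, obj):
    """Verb template: object (accusative) + "non" + verb form."""
    parts = [NOUN_FORMS[(obj, "ACCUS")]]
    if negated:
        parts.append("non")
    parts.append(VERB_FORMS[(verb, tense)])
    return " ".join(parts)


def evaluate(form):
    """Recursively resolve a nested form into a character string."""
    if form["form"] == "Assert>BECAUSE":
        # Template: Assert-argument + "quod" + BECAUSE-argument
        return " ".join(
            [evaluate(form["assert"]), "quod", evaluate(form["because"])]
        )
    if form["form"] == "Verb":
        return verb_form(form["verb"], form["tense"], form["not"], form["object"])
    raise ValueError(f"unknown form: {form['form']!r}")


SENTENCE = {
    "form": "Assert>BECAUSE",
    "assert": {"form": "Verb", "verb": "eat", "tense": "PERF",
               "not": True, "object": "dinner"},
    "because": {"form": "Verb", "verb": "like_fancy", "tense": "PRES",
                "not": True, "object": "egg"},
}

print(evaluate(SENTENCE))  # cenam non edebam quod ova non amo
```

Each form only knows its own template; the unresolved pieces are handed down to the forms that own them, and the resulting strings are handed back up.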


Template Repertoire

The "literary quality" of a translation will depend on the size of the template repertoire and the skill of the programmer who wrote the language generator.  Even with a very small repertoire of templates a technically accurate translation can be made, although it would be sadly lacking in literary style.  Even a handful of small sentence templates would suffice to convey an accurate rendition of a complex sentence by breaking it down into smaller pieces.  For example:


     John broke the red crayon that I gave him and now
     it lies here broken, my crayon so red, so forlorn.

Could be accurately translated with just a very few template types as:


     I had a crayon.
     It was red.
     I gave it to John.
     John broke it.
     It lies here now.
     It is broken.
     It belonged to me.
     It is so red.
     It is so forlorn.

In spite of being completely accurate, this translation leaves something to be desired in the style department.  However, with the addition of a few recursive templates the possible complexity of sentence structure grows dramatically.  The intent is not, however, to duplicate every possible kind of sentence structure in a given language.  That would clearly be  impractical if not impossible.  Instead, the intention is to provide a large enough variety of sentence structure templates to avoid the kind of choppy translation sentences shown above.

The manner in which a more complex sentence can be generated recursively can be demonstrated by recursively building such a sentence.  


     [preamble][event]

     preamble is replaced by "With [some description]"
     event is replaced by "[someone] throws [something][somewhere]"

     With [some description][someone] throws [something][somewhere]

     some description is replaced by "a [some kind of] flourish"
     someone is replaced by "Cato"

     With a [some kind of] flourish Cato throws [something][somewhere]

     some kind of is replaced by "philosophical"
     something is replaced by "himself"
     somewhere is replaced by "[preposition][thing]"

     With a philosophical flourish Cato throws himself [preposition][thing]

     preposition is replaced by "upon"
     thing is replaced by "[somebody's] sword"

     With a philosophical flourish Cato throws himself upon [somebody's] sword

     somebody's is replaced by "his"

     With a philosophical flourish Cato throws himself upon his sword

Each piece of the sentence is built up from smaller pieces, defined recursively, that can each be used in a very large variety of larger sentence structures.  Further examples of this principle in practice will be encountered in the pilot project.
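
The same build-up can be sketched as a small recursive slot-filling routine.  The slot names are those used in the demonstration above; the "SLOTS" table, the "expand" function, the explicit spaces between slots, and the placement of the article "a" inside the description slot are assumptions of this sketch rather than part of Chomp.

```python
import re

# Each slot name maps to a replacement that may itself contain further slots.
SLOTS = {
    "preamble": "With [some description]",
    "event": "[someone] throws [something] [somewhere]",
    "some description": "a [some kind of] flourish",
    "someone": "Cato",
    "some kind of": "philosophical",
    "something": "himself",
    "somewhere": "[preposition] [thing]",
    "preposition": "upon",
    "thing": "[somebody's] sword",
    "somebody's": "his",
}

SLOT_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def expand(template, slots):
    """Replace every [slot] in the template, recursing into replacements."""
    def fill(match):
        return expand(slots[match.group(1)], slots)
    return SLOT_PATTERN.sub(fill, template)


print(expand("[preamble] [event]", SLOTS))
```

Swapping a single entry, for example "someone" or "thing", yields a different sentence from the same set of templates, which is the reuse the paragraph above describes.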


Some Implementation Options

The Chomp meta-code might be read directly by the language generator module, or it might be compiled into a binary, or tokenized form before being processed by the generator module.

The generator modules, one for each target language, might be written in a standard programming language such as C++, or they might be implemented in a programming language designed specifically for language generators using syntax like that presented later on.

Language generators might introduce variety by selecting randomly from alternate phrasings.  For example, the generator might randomly choose between expressing "Enjoy( Run(...), PRESENT )" as "enjoy running" or as "like to run."  In a debugging mode the generator might be programmed to provide all the alternatives so that it can be verified that they are all correct.
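
A minimal Python sketch of that idea follows.  The table of alternates, the function name "realize", and the "debug" flag are hypothetical; only the two English phrasings come from the text.

```python
import random

# Alternate phrasings for one (verb, complement) pair, as in the example.
ALTERNATES = {("Enjoy", "Run"): ["enjoy running", "like to run"]}


def realize(verb, complement, rng=None, debug=False):
    """Pick one phrasing at random, or return all of them in debug mode."""
    options = ALTERNATES[(verb, complement)]
    if debug:
        return list(options)
    chooser = rng if rng is not None else random
    return chooser.choice(options)


print(realize("Enjoy", "Run"))              # one of the two phrasings
print(realize("Enjoy", "Run", debug=True))  # both, for verification
```

Passing a seeded "random.Random" instance as "rng" makes a run repeatable, which may be convenient when comparing two versions of a translation.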

Given the number of additional arguments that need to be passed to functions in order to cover what is required by any language, it could well become necessary to have a kind of auditing program which will read and scan the Chomp source file to be sure that all the necessary arguments are present for every function called.  For example, did we specify whether the first person plural pronoun is inclusive or exclusive, and did we specify which version of the possessive to use with each use of "Belong"?
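
Such an auditing pass might look something like the following Python sketch.  Every argument name here ("person", "number", "clusivity", "owner", "possessive_type") is a placeholder invented for the example, as is the representation of each call as a (reference number, form name, arguments) tuple; the actual argument lists would come from the Chomp definitions.

```python
# Arguments that must always be present, per form (hypothetical names).
REQUIRED_ARGS = {
    "Pronoun": ["person", "number"],
    "Belong": ["owner", "possessive_type"],
}


def missing_arguments(name, args):
    """List the required arguments that a single call leaves out."""
    missing = [a for a in REQUIRED_ARGS.get(name, []) if a not in args]
    # Conditional rule: a first person plural pronoun must also state
    # whether it is inclusive or exclusive.
    if (name == "Pronoun" and args.get("person") == 1
            and args.get("number") == "plural" and "clusivity" not in args):
        missing.append("clusivity")
    return missing


def audit(calls):
    """Return one message per missing argument, tagged by reference number."""
    report = []
    for ref, name, args in calls:
        for arg in missing_arguments(name, args):
            report.append(f"#{ref}: {name} is missing '{arg}'")
    return report


calls = [
    (2, "Pronoun", {"person": 1, "number": "plural"}),
    (2, "Belong", {"owner": "NARRATOR"}),
]
for line in audit(calls):
    print(line)
```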


The Pilot Project

The pilot project is a proof-of-concept demonstration that will include language generators for English, Latin, and German. These will not be complete generators in that they will include only the limited vocabulary necessary to translate the selected source material.  The source material selected for translation is this excerpt from Moby Dick by Herman Melville:

Call me Ishmael.

Some years ago -- never mind how long precisely -- having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.

It is a way I have of driving off the spleen and regulating the circulation.

Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off -- then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me.

The selected text is long enough to be convincing yet short enough to be encoded quickly and verified thoroughly.  It contains enough variety of sentence structure and a sufficiently large vocabulary to demonstrate the feasibility of this approach to machine translation.

With respect to the English translation it should be noted that the sample text translated into English from Chomp cannot be precisely the same as the original.  What is retained is the precise meaning, not the precise manner of expressing that meaning.  If it is found that the precise meaning has been conveyed in the English, Latin, and German translations then the demonstration will have succeeded.

It should be noted that we will not begin by assuming that the categories used by the scientific linguist will be of use to us. The working engineer tinkering in his workshop may use a different set of tools than the physicist in his laboratory even though engineering ultimately rests on a foundation of physics.  Likewise the language engineer tinkering with sentences may need a different set of tools from those used by the scientific linguist who studies the nature of sentences.  We may find it useful to lump together categories that the linguist distinguishes and to distinguish among things that the linguist lumps together.  Rather than start with preconceived notions of what will and will not be useful we will let the data guide us in determining the boundaries of our categories.


Some Pilot Project Assumptions

We will begin by assuming that every imaginable distinction will be of use: "verbish-nouns", "nouns of time", "nouns of state", "adverbish nouns", "adjectivish nouns" and other odd categories that probably never occur in serious linguistics.  Then, after decomposing all the sentences of the source text, we will see which distinctions really are of consequence and which can be dispensed with.  The determination will be strictly utilitarian in that we will keep only the tools we need to tinker together a good variety of sentences.

Another reason for discarding the standard categories is that whatever categories remain must be universal regardless of the target language.  We might, for example, distinguish nouns of relative location ("the top of...", "the inside of...") from nouns of absolute location ("Chicago", "home", "the closet") and from nouns describing material objects ("box", "car", "tree").  The reason is that while in English we might say "put it on the top of the box" ("put it on the <noun> of the <noun>") there is no guarantee that other languages will treat these two concepts in a noun-like manner.  It is conceivable that some language might express it as "Put it on the box topishly", in which case the "topness" is expressed adverbially.  Clearly the single category of "noun" would not suffice to distinguish such cases.

Likewise "My return to this place reminds me ..." uses the noun "return", but it is a verbish sort of noun and expresses a concept that might well be expressed in other ways in other languages.  In fact we can express it in English as "When I return to this place...", or "Having returned to this place..." both of which use "return" as a verb.  Another example of a verbish sort of noun is "dance" in "Having performed my dance, I rested." Very nearly the same meaning could be obtained with "Having danced, I rested." Some languages may, in fact, require one form or the other, and to tie down "dance" as definitely being a verb or a noun might cripple the ability of the language generator module to treat it in the way required of a particular language.  Consider the all noun phrase "during the course of my exploration ..." which might need to be expressed in some languages as "while I explored..." which is purely verbal.

"An idea of merit...", treats "merit" as a noun, as in "the merit of the idea", which is a thing that can be discussed.  But it is a very adjectivish sort of noun.  It could as well have been written as "A meritorious idea...".  Also, the phrase "The shape of the room..." uses the noun "shape", but it is a sort of adjectivish noun.  We can say "the shape of the room is round" or we can say "the round room...". That "shape" is somehow used in a descriptive manner with respect to "room" needs to be noted. In "His delayed arrival...", "arrival" is a noun and "delayed" is an adjective, but not only is "arrival" a verbish noun, but "delayed" is a sort of verbish adjective as well implying the act of delaying occurred sometime, somewhere.  All these various mutant and hybrid categories will be retained until their usefulness is ruled out.

In addition, sentences will not always be chopped up on word boundaries.  Certain phrases will be taken to be single compound words simply because they represent single concepts.  Many English words like "take-note-of", "became-aware-of", "make-haste-to", "take-place", and "take-leave-of" will certainly not be duplicated in other languages, word for word with words of similar individual meanings. In other cases a single English word might actually imply words that are not actually present. "The book, seemingly popular, was taken." "Seemingly" here really means "which  appears to be...".  In this sentence the word acts as if it were a verb, or at least somehow "contains" the verb "to appear".  Therefore, multi-word but single-meaning groupings are taken to be single words, and single words with multi-word meanings are taken to be multiple words in the analysis of sentence templates because the language generator module for each language will decide for itself when to split any given concept into a series of words.


The Ishmael Text in Chomp

The first step in the pilot project is to translate the source text into Chomp.  This will be done by hand.  In the future, tools may become available that will interactively assist the translator in this task, but for now such tools are not available.

As an aid to locating the meta-code associated with a given line of source text the translator may insert reference numbers preceded by a '#' sign.  The language generator has an option to generate the translation with or without those reference numbers present in the translated text.  Using these reference numbers the translator or proof reader can, when a problem is located in the translation, refer back to the Chomp source code in order to correct that problem.  Once the debugging of the Chomp source code is complete then the translation is generated without these reference numbers.

The translator may also insert comments in the Chomp source code.  Whenever the character string "//" appears on a line of source code, everything after that will be ignored by the language generator, and so the translator is free to include whatever notes or explanatory remarks he or she would like to place in the source code.
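
As a rough sketch of how a generator might handle these two conventions when reading a source file, the Python fragment below strips "//" comments and attaches the most recent "#" reference number to each remaining line.  The function names and the sample text are assumptions for illustration, and the sketch makes no attempt to protect a "//" that might appear inside a quoted string.

```python
def strip_comment(line):
    """Drop everything from the first "//" to the end of the line."""
    return line.split("//", 1)[0].rstrip()


def read_source(text):
    """Yield (reference_number, line) pairs for non-empty metacode lines."""
    current_ref = None
    for raw in text.splitlines():
        line = strip_comment(raw)
        stripped = line.strip()
        if not stripped:
            continue  # blank line or comment-only line
        if stripped.startswith("#") and stripped[1:].isdigit():
            current_ref = int(stripped[1:])  # remember the reference number
            continue
        yield current_ref, line


SAMPLE = """\
// Call me Ishmael.
#1
Verb = Call_Designate( IMPERATIVE )   // naming the narrator
   Target = NARRATOR
"""

for ref, line in read_source(SAMPLE):
    print(ref, line)
```

Leading indentation is preserved because it carries the nesting of the metacode; only trailing whitespace and comments are removed.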


     // User defined tags

     DEFINE "NARRATOR" as "Pronoun( Proper_Person( male ))"

     // Call me Ishmael.

     #1
     Verb = Call_Designate( IMPERATIVE )
        Target = NARRATOR
        Name
           GivenName = "Ishmael", Male

     // Some years ago -- never mind how long precisely -- having
     // little or no money in my purse, and nothing particular to
     // interest me on shore, I thought I would sail about a little
     // and see the watery part of the world.

     #2
     Context
        Past
        |  Unit = year
        |  Amount = some
        Inject
        |  Verb = ask( NOT )
        |     Voice = Imperative
        |     Query = duration
        |        Qualifier = exact
        Event
           Since_Because
           |  Verb = have_possess( PERF )
           |     Agent = NARRATOR
           |     Object = money
           |     |  Qualifier = little
           |     >OR
           |        Qualifier = no
           |     Locate = ( in, purse )
           |        Belong = NARRATOR
           >AND
           |  Verb = exist( NOT, PERF )
           |     Object = anything
           |        Locate = ( on, shore )
           |        Which
           |           Verb = interest
           |              Target = NARRATOR
           >THUS
              Verb = decide( PERF )
              |  Agent = NARRATOR
              |  Verb = Sail( SUBJ )
              |     Agent = NARRATOR
              |     Destination = NULL
              |     Duration = short
              >AND
                 Verb = See( SUBJ )
                    Agent = *
                    Target = portion(ST)
                       Of = world
                       That_Is = watery

     // It is a way I have of driving off the spleen and
     // regulating the circulation.

     #3
     Verb = Copula( PRES )
        Object = Pronoun( * )
        Complement = method
           Belong = NARRATOR
           Purpose
              Join
              |  Verb = evict_banish
              |     Object = irritability_melancholy
              >AND
                 Verb = regulate_moderate
                    Object = circulation(T)

     // Whenever I find myself growing grim about the mouth; whenever it is a 
     // damp, drizzly November in my soul; whenever I find myself involuntarily
     // pausing before coffin warehouses, and bringing up the rear of every
     // funeral I meet; and especially whenever my hypos get such an upper hand
     // of me, that it requires a strong moral principle to prevent me from
     // deliberately stepping into the street, and methodically knocking people's
     // hats off -- then, I account it high time to get to sea as soon as I can.

     #4

     Concatenate
     |  Conditional
     |     Condition = whenever
     |     Event
     |        Verb = find_realize( PRES )
     |           Agent = NARRATOR
     |           Object
     |              That
     |              Verb = become_manifest( PRES_PART )
     |                 Agent = NARRATOR
     |                 ToBe = dismal_grim_gloomy
     |                    Around_Near( mouth(ST) )
     >PLUS
     |  Conditional
     |     Condition = whenever
     |     Event
     |        Verb = feel_like
     |           Object = soul( Belong = NARRATOR )
     |           Target = november(SA)
     |              Qualifier = ( damp, drizzly )
     >PLUS
     |  Conditional
     |     Condition = whenever
     |     Event
     |        Verb = find_realize( PRES )
     |           Agent = Reflex( NARRATOR )
     |           Object
     |              Join
     |              |  Verb = pause( PRES_PART )
     |              |     Agent = Reflex( NARRATOR )
     |              |     Mode = involuntary
     |              |     Location = in_front_of( warehouse(P))
     |              |        Type = coffin
     |              >AND
     |                 Verb = follow_behind( PRES_PART )
     |                    Agent = *Reflex( NARRATOR )
     |                    Object = funeral(PU)
     |                       That
     |                       Verb = encounter
     |                          Agent = NARRATOR
     >PLUS
     |   Conditional
     |      Condition = whenever( EMPHASIS )
     |      Event
     |         Verb = overwhelm( PRES )
     |            Agent = depression(S)
     |               Belong = NARRATOR
     |            Object = NARRATOR
     |            To_Degree_That
     |               Require
     |               |  Object = principle(SA)
     |               |     Type = moral
     |               |        Qualifier = strong
     |               >TO
     |                  Verb = prevent
     |                     Agent = NARRATOR
     |                     From
     |                        Join
     |                        |  Verb = enter( PRES_PART )
     |                        |     Agent = NARRATOR
     |                        |     Qualifier = deliberately
     |                        |     Object = street(ST)
     |                        >AND
     |                           Verb = knock_off_dislodge( PRES_PART )
     |                              Agent = *NARRATOR
     |                              Qualifier = methodically
     |                              Object = hat(P)
     |                                 Belong = Person(P)
     >THEN
        Verb = conclude( PRES )
           Agent = NARRATOR
           That
              Verb = BE_time_for( PERF, EMPHASIS )
                 Verb = go( PRES )
                    Agent = NARRATOR
                    Target = To( sea )
                    Modifier = ASAP
                              
     // This is my substitute for pistol and ball.

     #5
     Verb = BE_equate( PRES )
        Object = Pronoun( *(I))
        ToBe = substitute
           For
              Join( pistol(SA), ball_bullet(SA) )

     // With a philosophical flourish Cato throws himself upon
     // his sword; I quietly take to the ship.

     #6
     Concatenate
     |   Preamble
     |      With = flourish(SA, Descriptor = philosophical )
     |      Event
     |         Verb = throw_cast( PRES )
     |            Agent = Proper_Person( "Cato", male )
     |            Object = SELF
     |            Target = Upon( sword( Belong = SELF ))
     >AND
        Verb = board_enter( PRES )
           Agent = NARRATOR
           Object = ship(ST)
           Qualifier = quietly

     // There is nothing surprising in this.

     #7
     Verb = BE_equate( NOT, PRES )
        Object = anything
           Descriptor = surprising
        Target = about( Pronoun( *(I)))

     // If they but knew it, almost all men in their degree, some
     // time or other, cherish very nearly the same feelings
     // towards the ocean with me.

     #8
     If
     |  Verb = know_realize( PERF, EMPHASIS )
     |     Agent = Pronoun( man(P))
     |     Object = Pronoun( NEXT )
     >THEN
        Verb = have( PRES )
           Agent = man(P)
              Qualifier = most_of
           Object = feeling(P)
              Target = Toward_About( sea(ST) )
              Qualifier = SameAs( *, Belong( NARRATOR ), nearly )
           Qualifier = sometime
           Qualifier = somewhat


Sentence Templates

Let's take a closer look at sentence templates since these are the heart and soul of the language generator modules. Unlike Chomp meta-code, sentence templates are strictly language dependent.  A sentence template written for English can only be used for English translations.  This is why a separate language generator module must be written for each target language.

Consider this sentence:

What is understood easily and rapidly by the reader may be the result of five or six painstaking revisions by the writer.
--E.D. Hirsch in The Philosophy of Composition.

The top-level template for this sentence is simply:


     p1 + qual + adv + "be" + p2

The meanings of the various parameters can be more clearly seen from the context in which the template is found.  Here is a portion of the definition of the assertion function "BE_State".  (Note: This is not written in Chomp, but in the programming language of the language generator module.)


     DEFINE BE_State( *Object:p1, State:p2, Adverb:adv, Qualifier:qual, NOT, ... )
     {
        if ( NOT )
           neg = "not";
        else
           neg = NULL;

        if( qual )
        {
           Template = p1 + qual + neg + adv + "be" + p2;
        }
        else
        {
           if ( Plural( p1 ))
           {
              Template = p1 + "are" + neg + adv + p2;
           }
           else
           {
              Template = p1 + "is" + neg + adv + p2;
           }
        }
     }

The keyword parameter "Object" is marked with an asterisk, which identifies it as the first positional argument; i.e., the keyword is optional when the function is called. Now if this assertion is called from some source document in Chomp in this manner:


     Verb = BE_State
        Object = apple(P)
        State = red
        Adverb = usually
        Qualifier = should

The result will be:


     p1 = "apple(s)"
     adv = "usually"
     qual = "should"
     p2 = "red"

Since "qual" exists this pattern is chosen:


     p1 + qual + neg + adv + "be" + p2
     "Apples" + "should" + NULL + "usually" + "be" + "red"
     "Apples should usually be red"

It should be noted that a number of minor but important details have been omitted here.  For example, it is not the actual arguments that are substituted into the sentence, but the word found in the English dictionary that corresponds to the plural of the symbolic tag "apple".  Remember, as was pointed out earlier, that this could just as easily have been "glipnorp".  This data indirection has been left out of these examples for clarity, but it should be remembered that if the same Chomp document is going to work for any language then clearly "apple" cannot actually be the word to be used, but only a symbolic reference into the lexicon of the particular language.
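
To make the indirection concrete, here is one way the "BE_State" definition might look if the generator were written in Python, with the noun resolved through an English lexicon rather than used directly.  The lexicon table, the (tag, number) tuple for the object, and the function signature are assumptions of this sketch; sentence-level capitalization is left out.

```python
# Hypothetical English lexicon: (symbolic tag, number) -> surface word.
ENGLISH_NOUNS = {("apple", "S"): "apple", ("apple", "P"): "apples"}


def be_state(obj, state, adverb=None, qualifier=None, negated=False):
    """English template for BE_State, following the definition in the text."""
    tag, number = obj
    p1 = ENGLISH_NOUNS[(tag, number)]  # indirection through the lexicon
    neg = "not" if negated else None
    if qualifier:
        # p1 + qual + neg + adv + "be" + p2
        parts = [p1, qualifier, neg, adverb, "be", state]
    else:
        # p1 + "are"/"is" + neg + adv + p2
        copula = "are" if number == "P" else "is"
        parts = [p1, copula, neg, adverb, state]
    return " ".join(part for part in parts if part)


print(be_state(("apple", "P"), "red", adverb="usually", qualifier="should"))
print(be_state(("apple", "P"), "red", adverb="usually"))
```

A generator for another language would keep the same call but supply its own lexicon and its own template ordering.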

Now let's look at another partial  (and not quite real) template definition, and an example of how it might be called from a Chomp source document. The output from this form has the qualities of a noun and can be the agent of subsequent verbs.


     DEFINE Relative( *Object:p1, Which_Is:p2 )
     {
        // Defines a relative clause of the type "that which is X ..."

        ...
  
        if ( Plural( p1 ) )
        {
           Template = "things that are" + p2;
        }
        else
        {
           Template = "a thing that is" + p2;
        }
     }

Which might be invoked from Chomp in this manner:


     Things that are red should usually not be eaten.

     Verb = eat( NOT, PASS )
        Object
           Relative
              Object = that_what(P)
              Which_Is = red
        Adverb = usually
        Qualifier = should 

Now when the verb "eat" is called it selects this template for the passive voice with "NOT":


     Object + Qualifier + Adverb + "not" + "be eaten"

     Object is replaced by "Relative(Object=that_what(P),Which_Is=red)"
     Qualifier is replaced by "should"
     Adverb is replaced by "usually"

     Relative(Object=that_what(P),Which_Is=red) + "should" + "usually" + "not" + "be eaten"

Since the evaluation is not complete at this point, the newly defined "Relative" form is called with its arguments and, finding that its object is plural, chooses the appropriate template and evaluates to:


     "things that are" + Which_Is

     Which_Is is replaced by "red"

     "things that are" + "red"
     "things that are red"

This value is then returned to the "eat" verb function which uses it to complete the sentence:


     Relative(Object=that_what(P),Which_Is=red) + "should" + "usually" + "not" + "be eaten"

     Relative(Object=that_what(P),Which_Is=red) is replaced by "things that are red"

     "things that are red" + "should" + "usually" + "not" + "be eaten"
     "things that are red should usually not be eaten"

Next we look at a very simple form called "Result_Of". It is treated like an object or noun and can take the usual noun modifiers such as article and singular/plural.  The arguments of the form are simply the object that is stipulated and that which is the cause of what is asserted.  For example:


     This fog is the result of moisture.

     Result_Of(ST)
        fog(SI)
     >CAUSE
        moisture

     From the template: Object + "is" + [article] + "result of" + Cause

However, it can also be used as part of a more complex form as in:


     What is understood easily and rapidly by the reader may be
     the result of five or six painstaking revisions by the writer.

     Result_Of(ST)
     |  Relative
     |     Object = that_what(S)
     |     Which_Is
     |        Verb = understand( PRES )
     |           Target = by( reader(ST) )
     |           Adverb = ( rapidly, easily )
     >CAUSE
        Verb = do( PERF )
           Object = revision(P)
              Quantity = ( Or( five, six ))
              Qualifier = painstaking
           Agent = writer(ST)

By applying the few simple templates shown above, this relatively complex sentence is easily and rapidly generated.

On a side note, notice that this sentence is missing a verb.  The verb "done" or "performed" is implied, but not stated: "What is understood easily and rapidly by the reader may be the result of five or six painstaking revisions (done) by the writer."  Literally speaking, a noun cannot be "by" someone, yet in English we commonly imply the missing verb, as in "A portrait by Picasso" or "A book by Melville", which really mean "A portrait (painted) by Picasso" and "A book (written) by Melville."  Because we cannot assume that every other language is able to leave out these verbs as casually as English does, we must put them back in to ensure that the form is comprehensible to the LGM for any specific language.


Sentence Template Generality

One serious concern that must be addressed is whether or not a set of sentence templates derived from one literary source will be sufficient to translate a different source text.  If the sentence templates found in Poe's The Pit and the Pendulum are only good for translating that specific work then the whole concept is worthless.  If that were the case then each new translation would require the creation of a new set of sentence templates specific to that work.

In order to put this question to the test, the next step in the project was to look at sentences found in one specific literary work, decompose them into templates, and then test those templates against our chosen Moby Dick text to see if templates derived from one source will apply to a second, unrelated source.  Eventually templates will have to be drawn from a variety of sources including classics of literature as well as today's news stories.  But for now, since we are dealing with a literary target text, we collected our templates from another literary source: the above-mentioned The Pit and the Pendulum.

The complete text of the story was downloaded from Project Gutenberg and examined sentence by sentence to find the same constructions that appear in the Ishmael text. (The details of the templates found and how they matched will be found in Appendix A, which is under construction.)

We began this analysis with nouns.  Nouns and their parameters appear in many forms in the Ishmael text. Since there are a limited number of ways in which nouns and their modifiers can be strung together, this part of the exercise is almost trivial.  We asked simply whether each of the forms found in Ishmael also appears in the Pendulum text.  Again, it is not the exact word order, but the underlying meanings that matter. It is not a matter of looking for identical sentence structures, but of looking for sentence structures capable of carrying the identical meaning.

Since the pilot project is not meant to be a complete language generator we will use some simplifications.  For example, something rendered in the original text as "noun of noun" might be rendered "adj noun" in Chomp.  Thus "slippery wall of stone" becomes "slippery stone wall".  This is not to be seen as a limitation of Chomp in general, but only as a simplification in this first implementation.  Future, and more complete versions, could easily include such constructions.

While we found that Poe's style is distinctly different from Melville's, we also found that for every way in which Melville used and modified nouns there was an equivalent usage in Poe.  Having thus eliminated nouns and their modifiers from the picture we moved on to examine verbs and their arguments. (The table of matching patterns will be found in Appendix A, which is under construction.)

Here again we used certain simplifications for the purpose of the pilot study.  To reiterate, these are not inherent limitations of Chomp, but simplifications used only to keep the labor of this study within reasonable bounds.  Typical of these simplifications are constructions like these, where the second form of the pair is the simplification:


     Having little or no money in my purse, ... I thought I would sail...
     Because I had little or no money in my purse ...(therefore) I thought I would sail ...

     Running down the street, I shouted for him to follow
     While I ran down the street (also) I shouted ...

As with the noun/adjective simplifications mentioned above, the more complete forms can be added once the method as a whole is proven to be practical.

What remains are some structural items having to do with concatenation, qualification, and context setting, as well as a few details such as modifiers to adjectives.  Again, certain minimal simplifications have been made regarding flexible constructions and word order.  For example, the constructions "I thought for a while that..." and "For a while I thought that..." are considered equivalent.

The result of this analysis, the detailed tabulations of which will be found in Appendix A (under construction), was that in every case but two the templates required by the Ishmael text could be found in the Poe text.  The two exceptions were the two uses of the imperative mood in Ishmael, while the Poe text does not use the imperative mood at all.  Given that this is the only exception to generality, and given that the imperative mood cannot be considered rare or unusual, it would appear that templates that work for one text will, in fact, generalize to additional, unrelated texts.


Categorizing Verbs

Linguists have spent a great deal of time creating and describing categories for verbs.  A good introduction to this subject can be found in Rick Morneau's The Lexical Semantics of a Machine Translation Interlingua. We might ask what the rationale might be for creating categories of any kind, whether they be biological genus and species or customer categories in accounting practice.  Two reasons come to mind.  The first is analytical: to study the items by category in hopes of finding generalizations that apply to all the members of a given category.  The goal is to be able to make general statements such as "All mammals have ...", and statements of distinction such as "The difference between mammals and reptiles is ...".

The second justification is practical: to find generalized methods and procedures that can be applied to all members of the category.  Thus, in tax accounting, for example, all expenses that belong to the category "deductible" are treated with the same generalized algorithm regardless of their differences in detail.

When it comes to verb classification schemes the intent is to be able to handle the translation of a large family of verbs with a single procedure.  The problem with this approach is that natural languages seem to consist of rules for which almost every case is an exception.  The expectation that all transitive verbs, for example, can be handled by the same set of rules leads to rules of ever growing complexity which service fewer and fewer actual verbs.  In the extreme case, every verb would have its own set of rules and there would be no useful generalities.

Now add to this the complicating fact that each language may handle each of these verbs differently.  There are even some verbs which are transitive in one language and intransitive in another.  Many English verbs can be either transitive or intransitive depending on how they are used.  These complications imply that a classification scheme that worked in one language would not work in another, and that generalized rules for handling verbs of a given category would depend on drawing the boundaries of that category in a way that applies only to that specific language.

The goal of this project is not to advance the understanding of the theory of verbs, but to put those verbs to work in a practical application.  The theoretician would prefer a smaller number of tidy categories, but the engineer must often be satisfied with a very large number of messy categories and special cases.  Our approach will be, for the most part, to treat each verb as a rule unto itself.  Verbs are implemented as functions, with a separate function being written for each verb.  The theoretician might prefer a single general purpose function that consults a large table of verbs and their arguments, but in practice such a function would, as has been demonstrated time and again in natural language processing applications, soon become overburdened with special coding for exceptions.  Since generality seems doomed to succumb to chaos anyway, we might as well acknowledge that chaos to begin with.

Projects which do not acknowledge that chaos either remain very small scale projects, or become hopelessly complicated and collapse under their own weight.  The Chomp project will make no attempt to generalize or categorize verbs, but will be built around the notion that every verb is unique and deserves, if not requires, its own function.  That, of course, raises the question of how many such functions would be required in a practical language generator module.
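
One way to picture "a separate function for each verb" is a simple registry that maps each verb name to its own function.  The sketch below is hypothetical throughout: the registry, the decorator, the argument names, and the "PAST" tense code are assumptions made for illustration, and only third person singular forms are handled.

```python
VERBS = {}  # verb name -> the function that renders that verb


def verb(name):
    """Register a function as the implementation of one verb."""
    def register(fn):
        VERBS[name] = fn
        return fn
    return register


@verb("eat")
def eat(tense, agent, obj):
    # Each verb carries its own irregular forms; nothing is shared or generalized.
    forms = {"PRES": "eats", "PAST": "ate"}
    return f"{agent} {forms[tense]} {obj}"


@verb("go")
def go(tense, agent, target):
    # "go" takes a Target rather than an Object, so its arguments differ from "eat".
    forms = {"PRES": "goes", "PAST": "went"}
    return f"{agent} {forms[tense]} to {target}"


def generate(name, **args):
    """Look up the verb's own function and let it build the clause."""
    return VERBS[name](**args)


print(generate("eat", tense="PAST", agent="John", obj="the apple"))
# John ate the apple
print(generate("go", tense="PRES", agent="Mary", target="the store"))
# Mary goes to the store
```

Because every verb owns its function, an exception in one verb never forces special-case code into another.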

Sampling the pages of the Harper Collins Latin Dictionary results in an estimate of between 3100 and 3300 Latin verbs.  Sampling the pages of The American Heritage Dictionary of the English Language results in an estimate of between 3100 and 3900 English verbs. Sampling the pages of the Harper Collins German Dictionary revealed an estimated 3000 to 3500 verbs.  All three of these estimates suggest that something in the neighborhood of 3000 to 4000 verbs might be representative of natural languages in general.

The next logical question is to ask to what extent these verbs are synonyms or near synonyms that could be implemented in the same function.  If a number of verbs express very nearly the same meaning then those verbs could be implemented in a single function that distinguished among those shades of meaning by some modifier to a base verb.  Thus verbs like "walk", "amble", "saunter", "stroll", "run", "jog", "trot", "dash", all convey similar activity but with differing degrees of energy.  These verbs may well be amenable to sharing a single function, "ambulate" with some modifier to modulate the energy level.  Thus a sentence like "John walked to the store" might be encoded "Verb = ambulate( Intensity = low )..." and "Mary ran to the store" would be encoded "Verb = ambulate( Intensity = high )...".  This factor will decrease the number of verbal functions required.
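
A minimal sketch of such a shared function, assuming a two-level intensity scale and past tense only, might look like this.  The table of surface forms and the argument names are hypothetical; a real English LGM would need more levels and more tenses.

```python
# Assumed past-tense surface forms selected by the Intensity modifier.
AMBULATE_PAST = {"low": "walked", "high": "ran"}


def ambulate(agent, target, intensity="low"):
    """Render the shared "ambulate" verb, choosing the English word by intensity."""
    return f"{agent} {AMBULATE_PAST[intensity]} to {target}"


print(ambulate("John", "the store", intensity="low"))   # John walked to the store
print(ambulate("Mary", "the store", intensity="high"))  # Mary ran to the store
```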

A second consideration is that many verbs have multiple meanings that will need to be implemented as several separate verbal functions. This factor will increase the number of verbal functions required.

Thirdly, some portion of these verbs are rare, unusual or specialized and can probably be dispensed with, if necessary, as long as appropriate circumlocutions can be encoded.  Thus "fluoridating the water" could become "putting fluoride in the water", and "to immolate" could be replaced with "to set oneself on fire".  Those two verbs could then be safely left out of the LGM.  As a matter of fact, in practice it could well be necessary that some such verbs be left out if corresponding verbs do not exist in other languages.  There is no point in encoding "immolate" in the Chomp document if a significant portion of the target languages have no verb with that specific meaning, but themselves resort to circumlocutions like "set oneself on fire".  On the other hand, if "immolate" is a part of the Chomp standard then every LGM for every language will be required to implement that function, and will need to deal with it, even if they choose to deal with it via circumlocution.

Finally, there is the fact that some verbs are expressed as multiple words only one of which is a verb.  For example, "run into", "run away", "run amok", "run over", "run out", and "run down". Each of these represents a fundamentally different kind of action.  Running out of gas is a far different situation than running away from gas or running over gas, and it's not even possible to run amok gas.  This situation must be addressed by implementing compound verbs as if they are a single verb.  Thus "run-away-from" is a single verb and "run-out-of" is a different single verb.  This factor will also increase the number of verbs that need to be implemented as separate functions.  Most verbs are not used like this, while a very few verbs are extensively used in this manner.  A quick survey of the dictionary suggests that perhaps as many as 5 to 10 percent more verbal functions might be required due to this phenomenon.

How do these factors balance against each other and what is the net effect on the number of verbal functions required?  Sampling a few thesauri suggests that the average verb has three synonyms close enough to be implemented in the same verbal function.  In addition, the average action concept seems to have two or three verbs which represent different degrees of the same action, such as "stroll", "walk", "trot", "run", or "whisper", "talk", "shout".  On the average that seems to suggest that there will be roughly one sixth to one ninth as many verbal functions as there are verbs, or somewhere between 350 and 600 verbal functions.  Adding the 5 to 10 percent increase due to multi-word verbs, and including a large margin for estimation error, we end up with somewhere in the neighborhood of 400 to 700 verbal functions for a complete and comprehensive implementation of a typical language, with an average of 550 verbal functions.  Writing 550 functions is a daunting task, but a very doable one if the result is a language generation module that can accurately translate any document encoded in Chomp.
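
The arithmetic behind this estimate can be checked with a short calculation.  The inputs are the figures quoted in the text; the variable names are ours, and applying the smaller percentage to the low bound and the larger to the high bound is an assumption.  The unrounded bounds come out somewhat wider than the rounded figures given above, which is in keeping with the rough nature of the estimate.

```python
verbs_low, verbs_high = 3000, 4000   # dictionary-based estimates of verbs per language
share_low, share_high = 6, 9         # verbs per function: 3 synonyms x 2 or 3 degrees

# Fewest functions: few verbs, many verbs sharing each function (and the reverse).
base_low = verbs_low / share_high
base_high = verbs_high / share_low

# Add 5 to 10 percent for multi-word verbs such as "run-out-of".
total_low = base_low * 1.05
total_high = base_high * 1.10

print(round(base_low), round(base_high))    # 333 667
print(round(total_low), round(total_high))  # 350 733
```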

The sample Ishmael text of the pilot project uses fewer than 30 verbs and will provide a good measure of the labor involved in implementing a single verb.  From that we will be able to estimate the number of programmer-hours that might be required to implement an LGM for a specific language.  Once that estimate is known it can be determined whether the return is worth the investment.

