Implementing Natural Language Machine Translation
of Text From High-Level Encoding of Meaning
Page Two
The Implementation - The English LGM
The next step is to define, in detail, each sentence template that will be used by the English language generator module, and to demonstrate that each and every sentence of the translation can be produced by those templates.
Before we can know which templates will be needed, we need to know what the translation we expect looks like. Working backwards from the Chomp version of the Ishmael text, we might end up with a translation that looks like this:
|
While it is not exactly the same as the original, the translation does convey the same information as the original. It is those eight sentences that we seek to create with the language generator templates.
There are a total of 53 functions and 53 templates that need to be defined. That sounds like a lot of templates to produce 8 sentences, and it is. However, those same 53 templates can produce millions upon millions of other sentences not included in this text. With as few as 500 to 1000 templates we will probably be able to produce translations of just about anything ever written.
The templates presented below are shown as they would evaluate given the specific set of arguments passed to the form. The function definitions themselves will use conditional statements to decide which templates to use for building their return value, so the actual definition of each form will be many more lines than the one line shown in this tabulation. The line shown is the only one necessary for this particular text.
Keeping in mind that each template will probably require a dozen or more lines to implement fully, the 1000 templates mentioned above represent something in the neighborhood of 12,000 lines of program code to fully implement the LGM templates.
Below are the verb, noun and form functions that must be implemented to translate the Ishmael text from Chomp into English. The arguments listed in <angle brackets> correspond to the argument tags used in the Chomp encoding. These may seem a bit confusing or cryptic at first glance, but their meaning should become clear as we take a closer look at some actual implementations of verb functions.
Form: >AND : "and"
Form: about : "about" + <argument>
Form: Around_Near : "around" + <argument>
Form: ASAP : language specific idiom or phrase for "as soon as possible"
Form: Belong : belong me returns: "my". belong people -> "people's", belong him -> "his", My * -> "mine", etc.
Form: Concatenate : <argument 1> + ";" + <argument 2> + ";" + ... + "and" + <last argument>
Form: duration - table lookup with argument gives : "for a little while"
Form: If : "if" + <argument> + <then>
Form: in_front_of : "in front of" + <of>
Form: join : <argument 1> + "and" + <argument 2>
Form: Locate : <argument 1> + <argument 2>
Form: Past : "some" + <unit> + "ago"
Form: portion : <article> + <that_is> + "part of" + <of>
Form: preamble : with the "with" argument returns "with" + <with> + "," + <event>
Form: Pronoun returns pronoun matching person, number and case of its argument
Form: Query : <qualifier> + "how long"
Form: Reflex : returns "myself" for "I", and so on.
Form: same_as : <qualifier> + "the same as"
Form: Since_because : "since" + <dform> + "," + <thus>
Form: so_much_that : "so much that" + <argument>
Form: substitute : <article/pronoun> + "substitute for" + <for>
Form: whenever : "whenever" + <argument> or "especially whenever" + <argument> if EMPHASIS
Form: Which : "that" + <argument>
Noun: returns <qualifier> + <noun> or "the" + <qualifier> + <noun> or "a" + <qualifier> + <noun>
Verb: ask - imperative : ["don't"] + "ask" + <query>
Verb: BE_equate : <object> + "is" + <ToBe>
Verb: Be_time_for : with emphasis returns: "it is definitely time" + <argument>
Verb: become_manifest : <agent> + "is/are/am" + "becoming" + <ToBe>
Verb: board_enter : <agent> + <qualifier> + "board" + <target>
Verb: call_designate - imperative : "call" + <target> + <name>
Verb: conclude : <agent> + "conclude" + <object>
Verb: decide : <agent> + "decided" + <argument>
Verb: encounter : missing <object> argument returns: <agent> + "encounter"
Verb: enter : pres participle form : <qualifier> + "going into" + <object>
Verb: evict_banish - no agent : "to banish" + <object>
Verb: exist : "there was" + Not(<Object>)
Verb: feel_like : <object> + "feels like" + <target>
Verb: find_realize : <agent> + "find" + <object>
Verb: follow_behind : ( pres participle form ) "following behind" + <object>
Verb: go : <agent> + "go" + <object> + <modifier>
Verb: have_emote : <agent> + <qualifier(s)> + "have" + <object>
Verb: have_posses : <agent> + "had" + <object> + "in" + <locate_in>
Verb: interest - Null agent form: "interested" + <target>
Verb: knock_off : with starred agent optional returns: <qualifier> + "knocking off" + <object>
Verb: know_realize : with EMPHASIS <agent> + "only realized" + <object>
Verb: overwhelm : <agent> + "overwhelms" + <object> + <to_degree_that>
Verb: pause : ( pres participle form ) "pausing" + <mode> + <location>
Verb: prevent : "prevent" + <agent> + "from" + <from>
Verb: regulate - no agent : "to regulate" + <object>
Verb: require : <agent> + "takes" + <object> + "to" + <TO>
Verb: sail with arguments given: <agent> + "would" + "sail" + "around" + <duration>
Verb: see - starred agent (optional) with arguments given : "see" + <argument>
Verb: throw_cast : <agent> + "casts" + <object> + <target>
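To make the tabulation concrete, here is a minimal C++ sketch of three of the simplest form templates. It is only an illustration, not the pilot project's code: the function names formAbout, formJoin and formPast and the helper joinWords are hypothetical, and each argument is assumed to have been evaluated to a plain string already.

```cpp
#include <initializer_list>
#include <string>

// Hypothetical helper: joins non-empty pieces with single spaces,
// mimicking the word-boundary concatenation the templates rely on.
std::string joinWords(std::initializer_list<std::string> parts) {
    std::string out;
    for (const std::string& p : parts) {
        if (p.empty()) continue;          // an absent argument leaves no gap
        if (!out.empty()) out += ' ';
        out += p;
    }
    return out;
}

// Form: about : "about" + <argument>
std::string formAbout(const std::string& argument) {
    return joinWords({"about", argument});
}

// Form: join : <argument 1> + "and" + <argument 2>
std::string formJoin(const std::string& first, const std::string& second) {
    return joinWords({first, "and", second});
}

// Form: Past : "some" + <unit> + "ago"
std::string formPast(const std::string& unit) {
    return joinWords({"some", unit, "ago"});
}
```

With these definitions, formPast("years") evaluates to "some years ago". A full implementation would wrap each template in the conditional logic that selects it.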
Most of these verbs appear to use the identical form of template and we might wonder to what extent the verb template can be generalized. As we mentioned above in the section discussing verb classification, that could become a trap if we decide to generalize prematurely. Only after we have verified that the argument structure and the methods of applying all modifiers and qualifiers are identical for two verbs or forms will we be able to safely merge them into a single shared function.
As a trivial example, consider the verbs "found" and "bought". We might see examples like "I found the ball" and "I bought the ball" and conclude that these two verbs can share a function. We look a little deeper and find the examples "I found this rock in the pond" and "I bought this book in the store" and feel that our conclusion was justified. Further examples seem to confirm our cleverness in lumping these two together, as in "I found a hat for Uncle Harry at the second hand store" and "I bought a coat for Aunt Harriet at the thrift shop". Once we've committed to lumping them into the same function and built the operation of the system around that assumption, we discover the example "I bought that book from Harry for six dollars." Suddenly our generalization doesn't work as we expected. "I found that book from Harry for six dollars" just doesn't work, and our general-purpose function begins to collect exceptions to the rules.
Before long we discover that we would have been better off leaving them separate in the first place. To avoid creating future complications by premature generalizations we will avoid all generalizations in the pilot project.
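The point can be illustrated with a small C++ sketch in which the two verbs keep separate functions and separate argument sets. Everything here is hypothetical (the struct names, field names and the fixed past tense are chosen only for the example); the sketch simply shows that "bought" needs a seller and a price which "found" has no place for.

```cpp
#include <string>

// Hypothetical argument sets; the field names are illustrative only.
struct FindArgs {
    std::string agent, object, location;   // location may be empty
};
struct BuyArgs {
    std::string agent, object, location;   // location may be empty
    std::string seller, price;             // only "buy" has these
};

// Appends " <lead> <value>" when the value is present.
static void appendPhrase(std::string& s, const std::string& lead,
                         const std::string& value) {
    if (!value.empty()) s += " " + lead + " " + value;
}

std::string verbFound(const FindArgs& a) {
    std::string s = a.agent + " found " + a.object;
    appendPhrase(s, "in", a.location);
    return s;
}

std::string verbBought(const BuyArgs& a) {
    std::string s = a.agent + " bought " + a.object;
    appendPhrase(s, "from", a.seller);     // no counterpart in verbFound
    appendPhrase(s, "for", a.price);       // no counterpart in verbFound
    appendPhrase(s, "in", a.location);
    return s;
}
```

Because the functions are separate, adding the seller and price to "bought" cannot produce "I found that book from Harry for six dollars" by accident.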
The Implementation of the Verb "Be"
The syntax of the implementation language is similar to the programming language C, but keywords have been capitalized as a reminder to C programmers that this is not C. The LGM implementation language is much more limited than C and has a few specialized requirements. The differences will be pointed out as they arise. This implementation language can be thought of as a programming language in its own right or as a pseudo-code for an implementation that will ultimately be written in C.
For the pilot project the latter applies, as the actual implementations will be written in C++.
It will be noticed that there are two different arguments to this verb that appear to represent the same thing. These are "Qualifier" and "Modifier". The difference between the two is that the qualifier expresses whether or not, or to what degree, the action of the verb actually happens or is actually true, while the modifier expresses the manner in which it happens or is true. Examples of qualifiers include "probably", "possibly", "not", and "never". Examples of modifiers include "quickly", "thoroughly", "impressively", and "crudely".
The easiest way to tell the difference is to place the word into an imperative sentence. If it works, it's probably a modifier. If it sounds like nonsense then it's probably a qualifier. Here's how that works: We take the imperative command "Run for cover!" and insert the word "quickly". The result is "Run quickly for cover!" which makes perfect sense, so "quickly" is a modifier. Next we try the word "probably". "Probably run for cover!" is not the kind of command we can imagine ever shouting out in an emergency, so "probably" is not a modifier, but a qualifier. We can, however, give the command "Never run for cover!" which might lead us to suspect that "never" is a modifier, but we can use "never" as either a modifier or a qualifier. As a general rule qualifiers will not be implemented in the imperative of the verb, so if you wish to have the word appear in the imperative use "Modifier = never" rather than "Qualifier = never".
Note that this implementation of "BE_equate" goes far beyond the minimum implementation required by the Ishmael text. This is because we need to begin to get familiar both with the issues that arise in full implementations, and with the labor involved so that we can accurately estimate the feasibility of a full implementation of 1000 or more verb functions.
Once this first verb has been physically implemented in C++ it will be tested with every possible combination of arguments to verify that it generates a correct English sentence in every case.
Verb: BE_equate
{
    // arguments
    //
    // Object - "the thing" that is ...: e.g. "my dog is ..."
    // Target - "what" the object is, e.g. "my dog is green"
    // Tense is set by the outer interpreter when this function is
    //     called and takes its value from the boolean keywords
    //     such as PRESENT, PERFECT, PAST_PART, and so on.
    //     These are equated to numerical values which serve as
    //     an index into the vForms table.
    // Voice = ACTIVE or PASSIVE. If not present, ACTIVE is assumed.
    //     NOTE: This verb does not have a passive voice version
    // NOT is a boolean keyword that indicates negation of the verb
    // Qualifier = any qualifier(s) to the verb such as "almost", "probably", etc.
    //     This verb ignores Qualifier in the imperative
    // Modifier = anything which alters the manner in which this verb is
    //     expressed such as "significantly", "radically"
    // Aux = Auxiliary used with subjunctive only, such as "should", "could", "might"
    //     If no Aux is given in subjunctive then "would" is assumed.
    // Verbal = a verb-type modifier such as "appears", "wishes", etc. This is used
    //     to create structures like "the sky appears to be blue." and
    //     "The boy constantly wishes he could suddenly be significantly taller."
    // Clausal = a word that is inserted before the verb if this is not to be a
    //     complete sentence, such as "that" in "The man _that_ appears to be happy"
    //     Clausals do not work with Imperative. If a Clausal is desired with a
    //     Verbal then place the Clausal under the verbal argument and not
    //     under the parent verb. E.g. "the boy _that_ wished..."
    // Boolean arguments can also be tested explicitly with a statement
    //     like "If ( IMPERATIVE ) ..."

    Data:
    Table vForms =
    {
        { "am ^", "are ^", "is ^" },                        // PRESENT
        { "was ^", "were ^", "was ^" },                     // PERFECT
        { "was ^ being", "were ^ being", "was ^ being" },   // IMPERFECT
        { "will ^ be", "will ^ be", "will ^ be" },          // FUTURE
        { "am ^ being", "are ^ being", "is ^ being" },      // PRES_PART
        { "was ^ being", "were ^ being", "was ^ being" },   // PAST_PART
        { "^ be", "^ be", "^ be" },                         // SUBJUNCTIVE
        { "be", "do not be", NULL }                         // IMPERATIVE (positive and negative)
    };

    Implementation:
    String extras = NULL;
    If ( NOT )              // if the "NOT" boolean keyword is present
        extras = "not";
    If ( IMPERATIVE )       // E.g. "be afraid" / "do not be afraid"
    {
        Index formIndex = NOT;  // Booleans evaluate to either zero or one and can be used as indexes
        Return vForms[ IMPERATIVE ][ formIndex ] + <Modifier> + <Target>;  // return imperative form
    }
    Else if ( Verbal )      // if any argument of type "Verbal = " E.g. "<Object> wants to be <Target>"
    {
        If ( SUBJUNCTIVE )
        {
            // forms like "<Object> wishes he could be consistently <Target>"
            // or "<Object> seems like it might be <Target>"
            extras = <Qualifier> + extras;
            If ( Aux )
                extras = <Aux> + extras;
            Else
                extras = "would" + extras;
            Return <Verbal> + Pronoun( <Object> ) + extras + "be" + <Modifier> + <Target>;
        }
        Else
        {
            // forms like "<Object> seems to be <Target>"
            Return <Verbal> + <Qualifier> + "to be" + <Modifier> + <Target>;
        }
    }
    Else
    {
        Index personIndex = GetPerson( Object );  // analyze object for person and number
        extras = <Qualifier> + extras;  // E.g. "not" might become "probably not"
        extras = extras + <Modifier>;   // E.g. "probably not" might become "probably not consistently"
        If ( SUBJUNCTIVE )
        {
            If ( Aux )
                extras = <Aux> + extras;
            Else
                extras = "would" + extras;
        }
        If ( personIndex > 2 )
            personIndex = 1;    // plurals of "to be" all use same as 2nd person singular
        Return <Object> + <Clausal> + vForms[ Tense ][ personIndex ] ^ extras + <Target>;  // return this clause or sentence
    }
}
Notes:
The templates are found on the lines that begin with the command "Return", although there are other embedded template fragments within the implementation such as "extras = <Qualifier> + extras;".
String concatenation respects word boundaries so that "hello" + "world" will evaluate to "hello world" with a space inserted.
NULL is a special string with no value. If NULL is substituted into a template nothing will appear in that position. For example, "hello" + NULL + "world" will evaluate to "hello world".
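The two concatenation rules just described can be sketched in C++ as follows. This is only an assumption about how such an operator might behave in a C++ port: the function name concat is hypothetical, and NULL is modelled as a null pointer.

```cpp
#include <string>
#include <vector>

// Joins the pieces with single spaces. A null pointer stands in for the
// LGM's NULL string and contributes nothing, so no doubled space is left.
std::string concat(const std::vector<const char*>& parts) {
    std::string out;
    for (const char* p : parts) {
        if (p == nullptr || *p == '\0') continue;  // NULL or empty: skip the slot
        if (!out.empty()) out += ' ';              // respect the word boundary
        out += p;
    }
    return out;
}
```

Both concat({"hello", "world"}) and concat({"hello", nullptr, "world"}) evaluate to "hello world".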
Enclosing an argument name in angle brackets tells the interpreter to fully evaluate that argument before returning it. If, for example, we had Object equal to "dog(S, Belong = NARRATOR )" and Target equal to "pink( qualifier = bright )" the above function would not return "dog(S, Belong = NARRATOR ) is pink( qualifier = bright )". Instead, it would use the noun evaluation function to further evaluate "dog" and the adjective evaluation function to further evaluate "pink". These two functions would then return "my dog", and "bright pink" respectively. Thus <Object> would take on the value "my dog" and <Target> would take on the value "bright pink." The "Return" command would then return "my dog is bright pink".
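The evaluate-before-substitute behaviour can be sketched in C++ as below. The Node type and the functions evalNoun, evalAdjective and beEquate are hypothetical stand-ins written for this example, the "S" flag on "dog" is left out for brevity, and the fallback article "the" is an assumption.

```cpp
#include <map>
#include <string>

// Hypothetical parse-tree node: a head word plus named arguments.
struct Node {
    std::string head;                          // e.g. "dog", "pink"
    std::map<std::string, std::string> args;   // e.g. Belong -> NARRATOR
};

// Stand-in for the noun evaluation function.
std::string evalNoun(const Node& n) {
    auto it = n.args.find("Belong");
    if (it != n.args.end() && it->second == "NARRATOR")
        return "my " + n.head;                 // Belong = NARRATOR -> "my"
    return "the " + n.head;                    // assumed default article
}

// Stand-in for the adjective evaluation function.
std::string evalAdjective(const Node& n) {
    auto it = n.args.find("qualifier");
    return it == n.args.end() ? n.head : it->second + " " + n.head;
}

// Cut-down BE_equate: both arguments are evaluated first, and only the
// resulting strings are slotted into the template.
std::string beEquate(const Node& object, const Node& target) {
    return evalNoun(object) + " is " + evalAdjective(target);
}
```

Given a "dog" node with Belong set to NARRATOR and a "pink" node with qualifier set to "bright", beEquate returns "my dog is bright pink".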
Notice that <Object> is missing from the template when the Verbal argument is used. This is because the Object argument will be passed along to the Verbal argument which will be responsible for slotting the Object argument into its own template.
Using the caret '^' within a string definition marks a place where an insertion may optionally take place. The value, if any, to be inserted is connected to the string or variable name using the caret instead of the plus sign. Thus "am ^ being" ^ "not" evaluates to "am not being". If there is no value inserted then the caret disappears from the string when it is evaluated.
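A minimal C++ sketch of the insertion operator, together with a cut-down version of the final Return template, might look like this. The names fillCaret, kPresent and bePresent are hypothetical, only the PRESENT row of the table is included, and the way surplus spaces are tidied is an assumption.

```cpp
#include <string>

// Replaces the first '^' in `form` with `insert`. When `insert` is empty
// the marker and one neighbouring space are dropped.
std::string fillCaret(std::string form, const std::string& insert) {
    std::string::size_type pos = form.find('^');
    if (pos == std::string::npos) return form;   // no insertion point
    if (!insert.empty()) {
        form.replace(pos, 1, insert);
        return form;
    }
    form.erase(pos, 1);                          // nothing to insert: drop the caret
    if (pos > 0 && form[pos - 1] == ' ') {
        form.erase(pos - 1, 1);                  // drop the space before it
    } else if (pos < form.size() && form[pos] == ' ') {
        form.erase(pos, 1);                      // caret led the string: drop the space after
    }
    return form;
}

// PRESENT row of the vForms table from the listing above.
const char* const kPresent[3] = {"am ^", "are ^", "is ^"};

// Cut-down plain branch: <Object> + vForms[PRESENT][person] ^ extras + <Target>.
// personIndex must be 0, 1 or 2.
std::string bePresent(const std::string& object, int personIndex,
                      const std::string& extras, const std::string& target) {
    return object + " " + fillCaret(kPresent[personIndex], extras) + " " + target;
}
```

Here fillCaret("am ^ being", "not") gives "am not being", and bePresent("my dog", 2, "probably not", "green") gives "my dog is probably not green".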
Although more complete than it needs to be for the Ishmael text, this implementation is not actually complete. Here are some of the missing forms:
I have been
you have been
he has been
I would have been
you might have been
he should have been
I had been
you had been
he had been
I wish I had been
you know you have been
John knows he has been
I wish I could have been ...
I used to be ...
Injected clauses like: I am, in spite of all I have been through, happy.
These additional forms will be added later after the pilot project is complete.
It is possible that some keyword arguments could be set globally for an entire document or section of the document instead of being passed for each verb or form. For example, a keyword boolean like "ARCHAIC" might be set if you wanted the function to return "be not afraid" instead of "do not be afraid", or "wouldst thou perchance" instead of "would you perhaps". Such arguments might also bracket selected sections of the text so that only the paragraph so marked would be translated in the archaic form of the target language if that was supported by the LGM for that language.
Another useful global argument pair would be CASUAL vs FORMAL. In casual mode the LGM might return "ain't" instead of "is not". If the "CONTRACTION" boolean were set then "is not" would be replaced by "isn't", and so on. Globals could also be implemented for turning SLANG usages off and on, and it might even be desirable to have a global boolean that caused sentences to be formed with intentionally bad grammar, such as in translating dialog in a novel where errors characteristic of poor usage for that language might be desired. _"Them boys ain't got no sense," he said as he picked up his hat and dusted it off._ could be translated into an equally unpolished version of French or Italian, thus retaining the rustic flavor of the dialog in the translation.
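One way such globals might be consulted by a verb function is sketched below in C++. The StyleFlags struct and both function names are hypothetical, and letting CASUAL take priority over CONTRACTION is an assumption made only for the example.

```cpp
#include <string>

// Hypothetical document-wide switches named after the keywords in the text.
struct StyleFlags {
    bool archaic = false;
    bool casual = false;
    bool contraction = false;
};

// Negated third-person present of "to be" under the current style.
std::string isNot(const StyleFlags& style) {
    if (style.casual) return "ain't";        // assumed to outrank CONTRACTION
    if (style.contraction) return "isn't";
    return "is not";
}

// Negative imperative of "to be", e.g. with the target "afraid".
std::string doNotBe(const StyleFlags& style, const std::string& target) {
    if (style.archaic) return "be not " + target;
    return "do not be " + target;
}
```

With every flag off, isNot returns "is not" and doNotBe returns "do not be afraid"; setting archaic changes the latter to "be not afraid".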
It is tempting to try to generalize here and include verbs like "look" and "seem" in this function. After all we can say "My crayon is blue" or "My crayon looks blue" or "My crayon seems blue", but until we have reached the point where we can be confident that the "BE_equate" function is truly complete we will resist the urge. In addition, verbs like "look" have their own idiosyncrasies. We can say "He looked for the ticket" but not "He was for the ticket". The sense of the verb "look" that might be in the same class as "is" actually means "appears to be", which is really a Verbal modifier on "to be" and not a verb in its own right. "My crayon looks blue" can be paraphrased as "My crayon appears to be blue" or "My crayon is apparently blue" where "apparently" is a pure adverb modifying "be".
A qualifier, or any other argument for that matter, could consist of a single word or of an entire subtree which would be evaluated before it was inserted into the template. For example, a Qualifier might be a nested form which evaluates to "under normal circumstances". Combined with a Modifier of "consistently" and an Aux of "should", it would cause the verb to evaluate in the subjunctive as "<object> should, under normal circumstances, be consistently <target>".
Below is the output from a test program that ran the actual C++ implementation of this verb against possible values of all the arguments.
The arguments were given the following values in every possible combination.
Object = "the boy", "all dogs"
Target = "clever", "the leader"
Tense = PRESENT, PERFECT, IMPERFECT, FUTURE, PRES_PART, PAST_PART, SUBJUNCTIVE, IMPERATIVE
Voice = ACTIVE
NOT: present and not present
Qualifier = (none), "probably", "usually"
Modifier = (none), "indisputably"
Aux = (none), "should", "might"
Verbal = (none), "wish(es)", "seem(s)"
Clausal = (none), "that"
Altogether this amounts to 6,912 combinations which will generate 6,912 different sentences. Each one of those sentences must then be proofread for correctness, and if they are all well-formed sentences which correctly express their parameter structure then the verb function will be considered provisionally correct. That status remains only until some future case arises which was not anticipated by the function. When that happens the function will be modified to address that new case and re-tested to verify that both the new case and the old cases all work properly.
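The figure of 6,912 is the product of the number of values listed for each argument (2 x 2 x 8 x 1 x 2 x 3 x 2 x 3 x 3 x 2). The C++ sketch below steps through the combinations the way a test driver might; it is not the actual test program, and the names kValueCounts and countCombinations are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Number of values each argument takes in the test run, counting "(none)"
// as a value where it is allowed. Order: Object, Target, Tense, Voice, NOT,
// Qualifier, Modifier, Aux, Verbal, Clausal.
const std::vector<std::size_t> kValueCounts = {2, 2, 8, 1, 2, 3, 2, 3, 3, 2};

// Steps an odometer-style index vector through every combination and
// returns how many were visited. Every count must be at least 1.
std::size_t countCombinations(const std::vector<std::size_t>& counts) {
    std::vector<std::size_t> index(counts.size(), 0);
    std::size_t visited = 0;
    while (true) {
        ++visited;   // a real driver would build and print one sentence here
        std::size_t pos = 0;
        while (pos < counts.size() && ++index[pos] == counts[pos]) {
            index[pos] = 0;   // this digit rolled over; carry into the next
            ++pos;
        }
        if (pos == counts.size()) return visited;   // all digits rolled over
    }
}
```

Calling countCombinations(kValueCounts) returns 6912, matching the total above.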
Further testing can be done by holding some arguments unchanged while cycling another argument through a larger number of possible values. For example, the tense could be left PRESENT, the aux, verbals, and modifiers left off, and 30 or 40 different qualifiers tried. If forty qualifiers were tried with all of the above combinations it would generate over 92,000 different sentences! Trying even as few as 10 values for verbals, auxes, modifiers and qualifiers with all of the other arguments would result in nearly two million different sentences to verify.
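The two estimates can be checked with a few lines of C++. The constant names are hypothetical. The second figure is reproduced here on the assumption that "(none)" is counted alongside the ten values for each argument; counting exactly ten values each would give 1,280,000 instead.

```cpp
// Combinations contributed by the arguments held to their original test
// values: Object 2, Target 2, Tense 8, Voice 1, NOT 2.
constexpr long kFixed = 2L * 2 * 8 * 1 * 2;                       // 64

// Forty qualifiers, with Modifier (2), Aux (3), Verbal (3) and Clausal (2)
// cycling through their original values.
constexpr long kFortyQualifiers = kFixed * 40 * 2 * 3 * 3 * 2;    // 92,160

// Ten values plus "(none)" for each of Qualifier, Modifier, Aux and Verbal,
// with Clausal (2) unchanged.
constexpr long kTenEach = kFixed * 2 * 11 * 11 * 11 * 11;         // 1,874,048

static_assert(kFortyQualifiers == 92160, "over 92,000 sentences");
static_assert(kTenEach == 1874048, "nearly two million sentences");
```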
Until the function has been tested against every possible combination of appropriate argument words in the English language we cannot be certain that it is completely correct. Since that task would likely take millions, if not billions of years, the function can never be completely tested. We need to be satisfied with being reasonably sure that the function behaves reasonably well, and address any malfunctions as they show up. Hopefully such malfunctions will become increasingly rare as the function gets more use.
to be continued
last update Jan 7, 2004