Parsing by Search and Replace

Gary J. Shannon
Created: Dec. 28, 2005
Last update: Dec. 31, 2005

NOTE: Much of this was very preliminary.
The most up-to-date information on actual implementation
will be found at
THIS LINK.

<< Back Home

Introduction

This is an attempt to create a parser for English language sentences that uses a method of look up and replacement of text patterns. A complete parse consists of a tree of nodes where each parent node has a number of descendant nodes which are arguments of the function of the parent node. Thus, from "I see John." we get a root node "see" with two child nodes, "I" and "John" which are the arguments of the function "see". Adjectives are treated as arguments of the noun they modify. Articles and demonstrative pronouns are treated as parameters on a par with adjectives, but are notated differently as will be seen below. Adverbs are treated as arguments of the verb they modify.

Trees can be represented graphically as a tree-like drawing, or in indented outline form, or in parenthetical function-like form. For example, the sentence "I see a big boy and a little girl." is parsed to the tree shown in figure 1.


Figure 1. Parse tree for
"I see a big boy and a little girl."

It can also be written in outline form as:

   see
      I
      and
         boy
            a
            big
         girl
            a
            little

and by replacing the levels of indentation with levels of parentheses, as:

   (see I (and (boy a big) (girl a little)))

which contains the same information, organized in the same manner, as both the tree form and the outline form.

The method of parsing to be designed here will use the parenthetical form because of its compactness. It will, in addition to the information shown in the above example, mark each set of parentheses with a part of speech, or pseudo part of speech identifying the role of the contents of the set of parentheses, and will include additional information such as verb tense, noun gender and number, type of article, if any, as so on. In addition, each individual word will be placed inside a pair of marked parentheses and additional information included, where necessary or useful. The word "and" becomes the conjunction function "Cn", which is also used to join together strings of adjectives or adverbs.

In practice, the above sentence would be parsed and marked like this:

     V( see,Prz N( Pr(1,s) ) N( Cn( N( boy,s,m,i Aj( big )) N( girl,s,f,i Aj( little ))))

Which can also be displayed in outline form as:

     V(
       see,Prz 
       N(
         Pr(1,s)
       )
       N(
         Cn(
           N(
             boy,s,m,i
             Aj( big )
           )
           N(
             girl,s,f,i
             Aj( little )
           )
         )
       )

Although that looks complex, in reality it is very simple. The outermost function has the form:

     V( see,Prz N(...) N(...))

which is one of only a very few verb-function templates. The first N-function has the argument:

     Cn( N(...) N(...))

Which simply joins together two nouns to form a compound noun-like entity. Individual nouns have the form:

     N( word,number,gender,article Aj(...))

Since the decomposition of the sentence is nothing more than a nested structure of simple templates it stands to reason that nothing more sophisticated than some form of simple template matching would be necessary to complete the parsing operation. This monograph seeks to demonstrate that this is, indeed, the case.

Preliminary Matching

This first pattern matching task to perform on any sentence is the standardization of numerical expressions. Thus statements like "a buck fifty-two", "two bits", "two hundred quid", and "half a dozen" are converted by a simple algorithm using a list of equivalents and replaced with something like:

   Mon(1.52 usdollar)
   Mon(0.25 usdollar)
   Mon(200 brpound)
   Num(6)

After that, a very long list of multi-word expressions is consulted and replacements made in the text. These are groups of words that are always taken as a single unit, and need not be analyzed separately, only to be reassembled later. It is much more efficient to recognize them as whole units as early as possible. Examples include:

Albert Einstein        apple juice            baby bottle(s)        baseball glove(s)      
Bill of Rights         chicken pox            coffee cup(s)         coffee mug(s)          
credit card(s)         cup of coffee          dime novel(s)         double feature(s)
drivers license(s)     drug store(s)          father(s) in law      file folder(s)
fish stick(s)          floppy disk(s)         gas station(s)        gravy boat(s)
Great Britain          hair brush(es)         happy new year        hard drive(s)          
hockey puck(s)         hockey stick           hotel room(s)         Lord of the Rings
milk carton(s)         Moby Dick              New York              paper clip(s)         
paper money            paper plate(s)         paper towel(s)        peanut butter          
phone number(s)        postage stamp(s)       rat race              root canal(s)          
slow lane              snow shoe(s)           space shuttle(s)      string quartet(s)      
super market(s)        taxi cab(s)            tea cozy(<ies)        tea cup(s)             
telephone number(s)    To Kill a Mockingbird  United Nations        vitamin C              

This list includes names of people, places and things, as well as book and movie titles. Each item in the table is replaced with a functional equivalent that might look something like this:

     N( Vip( einstein.a ))
     N( Bk( tkmockingbird ))
     N( Cit( newyork ))
     N( Bev( apljuice ))

This matching step provides as much information as possible about the nature of the thing named since eventually we will want to do something with the parsed sentence, and the more information that is available the easier it will be to accomplish whatever task is at hand, whether that be language translation or generating a plausible reply in a conversational computer program.

Idiomatic Phrases Pass

This pass checks the text against idiomatic phrases like "chew the fat", "jump off that bridge when (I;we;you;they) come to it.", and "sup dawg?" These are replaced by idiomatic pre-parsed versions. Since the number of such idioms, even though large, is manageable, this step eliminates a great deal of wasted labor later in the process. It also makes the job of updating the parser to understand the most current slang usages very quick and easy.

Associated Words Pass

This next pass looks to identify words that often occur together in a sentence, and tag them before they get tangled up in any possible ambiguities later on. Examples would be "bread and butter", "hammer and nails", "hammer and saw", and so on. Words likely to appear together in a given context are found in groups in a list, and if any two words in a group are found joined by "and" (or various other words or phrases) then their identity is marked. In this way we know that the "butter" in "bread and butter" is not the verb "to butter" and the word "saw" in "hammer and saw" is not the past tense of the verb "to see". Specific exceptions are few and can be enumerated, so that in the context "take some bread and butter it" the pattern "butter it" is seen as an exception, and correctly marked. Again, this is not a matter of sophistication of the algorithm, but simply a matter of sheer size of the data base.

Example of the results of this pass:

     The boy has a saw and hammer. -> The boy has a N( saw,s ) and N( hammer,s ).
     The boy will saw and hammer. -> The boy V( Cn( V( saw,Fut ) V( hammer,Fut )).

Adjective Pass

The next pass is to compare words in the text against a table of adjectives. The patterns to be matched would look something like this:

     big -> Aj( big )
     angry -> Aj( angry )
     happy -> Aj( happy )
     Aj(#1) Aj(#2) -> Aj( Cn( Aj(#1) Aj(#2)))

In each pattern the text to the left of the "->" is replaced by the text to the right. In the bottom pattern the replacement is parametrized so that anything found inside the parentheses of the "Aj" function is copied into the corresponding location in the replacement text. Thus the last pattern would match:

     Aj( big ) Aj( angry )

And turn it into the conjoined compound adjective:

     Aj( Cn( Aj( big ) Aj( angry )))

Here are some sample sentences after this match and replace pass:

     1. I see a boy. -> I see a boy.
     2. I see a big boy. -> I see a Aj( big ) boy.
     3. I see a big angry boy. -> I see a Aj( Cn( Aj( big ) Aj( angry ))) boy.
     4. I saw a happy boy. -> I saw a Aj( happy ) boy.
     5. I see a boy and a girl -> I see a boy and a girl
     6. I see a boy and he sees a girl.-> I see a boy and he sees a girl.
     7. The boy has a N( saw,s,n ) and N( hammer,s,n ). -> The boy has a N( saw,s,n ) and N( hammer,s,n ).
     8. The boy V( Cn( V( saw,Fut ) V( hammer,Fut )). -> The boy V( Cn( V( saw,Fut ) V( hammer,Fut )).

Noun Pass

The next pass consults a table of patterns to match nouns, pronouns and articles and replace them with the appropriate parenthetical structures. These rules might include:

     boy -> N( boy,m,s )
     boys -> N( boy,m,p )
     girl -> N( girl,f,s )
     girls -> N( girl,f,p )
     a saw -> N( saw,s,n,i )
     a N( #1 ) -> N( #1,i )
     a Aj( #1 ) N( #2 ) -> N( #2,i Aj( #1 ))
     the N( #1 ) -> N( #1,d )
     a Aj( #1 ) N( #2 ) -> N( #2,i Aj( #1 ))
     the Aj( #1 ) N( #2 ) -> N( #2,d Aj( #1 ))
     I -> N( Pr(1,s) )
     he -> N( Pr(3,s,m) )

And might result in these replacements:

     1. I see a boy. -> N( Pr(1,s) ) see N( boy,s,m,i ).
     2. I see a Aj(big) boy. -> N( Pr(1,s) ) see N( boy,s,mi Aj( big )).
     3. I see a Cn( Aj( big ) Aj( angry )) boy. -> N( Pr(1,s) ) see N( boy,s,n,i Aj( Cn( Aj( big ) Aj( angry )))).
     4. I saw a Aj( happy ) boy. -> N( Pr(1,s) ) saw N( boy,s,n,i Aj( happy )).
     5. I see a boy and a girl -> N( Pr(1,s) ) see N( boy,s,n,i ) and N( girl,s,f,i ).
     6. I see a boy and he sees a girl. -> N( Pr(1,s) ) see N( boy,s,m,i ) and N( Pr(3,s,m) ) sees N( girl,s,f,i ).
     7. The boy has a N( saw,s,n ) and N( hammer,s,n ). -> N( boy,s,n,d ) has N( saw,s,n,i ) and N( hammer,s,n ).
     8. The boy V( Cn( V( saw,Fut ) V( hammer,Fut )). -> N( boy,s,m,d ) V( Cn( V( saw,Fut ) V( hammer,Fut )).

Verb Pass

This pass looks for the simplest verb/noun patterns and tags the verbs found. Sample rules would include:

     N(#1) saw N(#2) -> V( see,pst N(#1) N(#2))
     N(#1) see N(#2) -> V( see,prz N(#1) N(#2))
     N(#1) sees N(#2) -> V( see,prz N(#1) N(#2))
     N(#1) has N(#2) -> V( have,prz N(#1) N(#2))

Which would result in these replacements:

     1. N( Pr(1,s) ) see N( boy,s,m,i ). 
          -> V( see,prz N( Pr(1,s) ) N( boy,s,m,i )).
     2. N( Pr(1,s) ) see N( boy,s,mi Aj( big )). 
          -> V( see,prz N( Pr(1,s) ) N( boy,s,m,i Aj( big ))).
     3. N( Pr(1,s) ) see N( boy,s,n,i Aj( Cn( Aj( big ) Aj( angry )))). 
          -> V( see,prz N( Pr(1,s) ) N( boy,s,m,i Aj( Cn( Aj(big) Aj(angry) )))).
     4. N( Pr(1,s) ) saw N( boy,s,n,i Aj( happy )). 
          -> V( see,pst N( Pr(1,s) ) N( boy,s,m,i Aj( happy ))).
     5. N( Pr(1,s) ) see N( boy,s,n,i ) and N( girl,s,f,i ). 
          -> V( see,prz N( Pr(1,s) ) N( boy,s,m,i)) and N( girl,s,f,i ).
     6. N( Pr(1,s) ) see N( boy,s,m,i ) and N( Pr(3,s,m) ) sees N( girl,s,f,i ). 
          -> V( see,prz N( Pr(1,s) ) N( boy,s,m,i )) and V( see,prz N( Pr(3,s,m) ),N( girl,s,f,i )).
     7. N( boy,s,n,d ) has N( saw,s,n,i ) and N( hammer,s,n ). 
          -> V( have N( boy,s,m,d ) N( saw,s,n,i )) and N( hammer,s,n ).
     8. N( boy,s,m,d ) V( Cn( V( saw,Fut ) V( hammer,Fut )). 
          -> N( boy,s,m,d ) V( Cn( V( saw,Fut ) V( hammer,Fut )).

Conjunction Pass

The next pass might look at joining together pairs of expressions by searching and replacing patterns based on conjunctions. For example:

     V(#1) and V(#2) -> Cn( V(#1) V(#2) )
     V(#1 N(#2) N(#3)) and N(#4) -> V(#1 N(#2) N( Cn( N(#3) N(#4))))

This would result in replacements to sentences 5 and 6 giving:

5. -> V( see,prz N( Pr(1,s) ) N( Cn( N( boy,s,m,i ) N( girl,s,f,i )))).
6. -> Cn( V( see,prz N( Pr(1,s) ) N( boy,s,m,i )) V( see,prz N( Pr(3,s,m) ) N( girl,s,f,i ))).

All the sample sentences are now completely parsed.

Prepositions and Locations

Any action that takes place may take place in relation  (in, from, toward, near, ...) to a specified location. Many verbal templates would include a slot for location of the action to be specified. A sentence like "John is sitting on top of the box" would treat "on top of" as a modifier to the verb, giving, in effect, "John is atop-the-box-sitting." In other cases a noun may be modified by specifying something about its relationship to some location. These specifications are treated like adjectives. Thus "I saw a boy behind the barn." becomes, essentially, "I saw a behind-the-barn boy." for purposes of parsing. Many other verbs simply subsume the preposition into the verb, like "go to -> goto" and "look for -> lookfor". These sentences would be parsed:

     V( sit,prz N( John,s,m ) Loc( N( box,s,n,d ) Rel( atop )))) 
     V( see,pst N( Pr(1,s) ) N( boy,s,m,i Loc( N( barn,s,n,d ) Rel( behind ))))

These pattern matching templates (for example) would be needed:

     N(#1) behind N(#2) -> N(#1 Loc( N(#2) Rel( behind ))
     V(#1) on top of N(#2) -> V(#1 Loc( N(#2) Rel( atop )))

A Few Slightly Harder Examples

Take this example:

     All of this money must be set aside for charitable purposes.

First the multi-word phrases are condensed out:

     Qn( all.of ) this money must be V( reserve.for ) N( charity,s,n, ).

The adjective pass has no effect, and the noun pass (with the addition of a few rules to handle "this", and quantifiers like "all of" and "most of", etc.) makes these changes:

     N( money,s,n,s Qn( all.of ) ) must be V( reserve.for ) N( charity,s,n, ).

Finally, the verb pass notes the "must" and passive situation and combines the other templates resulting in:

     V( reserve.for,prz,must N( NULL ) N( money,s,n,s Qn( all.of )) N( charity,s,n, )) .

Note the use of a "NULL" actor to indicate passive.

Aside from the addition of a few more pattern replacement templates nothing new was needed to accomplish the parsing of that sentence.

Completion Strategy

From this point out ,the strategy to be employed to complete the task of pattern discovery and evaluation will be to write a complete computer program to perform the kind of multi-pass parametrized pattern matching and replacement described above, and then to test the collected pattern sets against increasingly difficult sentences, expanding the pattern base as needed, until such time as unparsable counterexamples are found, or the parser has proven its ability to handle sentences of significant complexity, say the complete text of Moby Dick, or some other non-trivial test case.

The format described above is very preliminary, and the final format of the matching patterns will probably change during the process of development.

The specification for the actual parsing engine will be found on the following page. (Continue >>)

<< Back Home