Search-and-Replace Parsing Engine Specifications
Gary J. Shannon
Created: Dec. 30, 2005
Last Update: Dec 31, 2005
Overview
The purpose of this project is to create a computer program that will parse English language sentences by using nothing more than pattern search and replace operations on the text of the sentences, based on a library of match-and-replace patterns.
This parsing engine will parse for meaning, not for structure. As a result, the parse trees will be identical for these two sentences:
Old Mother Hubbard went to the cupboard.
It was to the cupboard that Old Mother Hubbard did go.
Parse Tree:
verb
(
go_to,past
name
(
old_mother_hubbard
)
noun
(
cupboard,s
art(d)
)
)
In operation, the program will read a corpus of English sentences from an input file and parse them all, writing the results to an output file. In the process of parsing it will use any number of libraries of match-and-replace patterns loaded from individual pattern files. A list of pattern files will be used by the parsing engine to determine which libraries to load in which order. The sentences will be repeatedly matched against a given pattern library until no more matches occur. At that time the current pattern library will be discarded and the next pattern library in the list will be loaded. When the last pattern library in the list has been used up the parse engine will write its final output and halt.
As each pattern library is used, the intermediate results for every sentence will be written to a trace file so that the effects of any given match-and-replace rule can be examined in detail for each step of the process.
Pattern Syntax
The match and replace patterns used by the parser engine are similar in concept to conventional regular expressions, but are designed to deal with word-oriented , rather than character-oriented patterns, and, for that reason, differ from regular expressions in a number of ways.
Tagged Words
First is the notion of tagged words. Once a word has been classified as to what function it plays in the sentence, it will be placed in parentheses and that set of parens will be tagged with a name. Each of these names can be a base category, or can be a category that is a subset of, or inherits properties from some other, more general category. For example, the category pers for words representing persons or personified actors, inherits from anim which is the tag for animate living things. anim, in turn, inherits from live, which inherits from noun. Likewise, color might inherit from the base category mod. Words that modify other words are included inside the parens that hold the modified word, but are still tagged individually. "walk slowly", for example, might be tagged verb(walk mod(slow)) with mod(slow) nesting within the verb(walk...) parens. If this sounds confusing, it will be a lot clearer once we start looking at some examples.
As an example, the fragment "red book" is tagged noun(book color(red)) and "happy boy" is tagged pers(boy mod(happy)). Since pers inherits ultimately from noun, and color from mod, both of these tagged words would match the pattern noun(* mod(*)) but only the second one would match the more specific pattern pers(* mod(*)), and only the first would match noun(* color(*)). (The asterisk is a wild card that matches anything.)
Or, more systematically::
Pattern: noun( * mod( * ))
Will match: noun(book color(red)) because "color" is a specific kind of "mod"
Will match: pers(boy mod(happy)) because "pers" is a specific kind of "noun"
Pattern: pers( * mod( * ))
Will match: pers(boy mod(happy))
Will NOT match: noun(book color(red)) because pattern "pers" is more specific than the general "noun".
Pattern: noun( * color( * ))
Will match: noun(book color(red))
Will NOT match: pers(boy mod(happy)) because pattern "color" is more specific than the general "mod".
Word Boundary Rules and Special Characters
Next, match and replacement patterns usually begin and end on a word boundary. Word boundaries are delimited by the beginning or end of a line, a space, tab, or a punctuation mark. Leading or trailing spaces, or extra spaces or tabs between words in the match or replacement patterns are always irrelevant and will be ignored. The apostrophe and the underscore are considered part of a word, so that "don't" and "run_away" are each considered to be single words. Thus the search pattern "at" will never match the "at" in "cat", but "a big dog " will match " a big dog" by ignoring redundant white space in both pattern and target text.
The underscore is used to form compound words that will be treated as single words by the parser.
Capitalization of words is always irrelevant and is ignored. The only exception is in category names as will be explained below.
The newline character \n can be used in the replacement text to create a separate line in the intermediate working document, as, for example, to break a run-on sentence into separate sentences.
Replacement Patterns
Replacement patterns are separated from the search pattern by the two-character arrow "->". For example:
New York -> city(new_york)
This pattern will find any occurrence of the two-word search pattern New York (Capitalized or not) and replace it with the tagged single-word equivalent city(new_york). (The tag name city might, for example, inherit from place which inherits from noun. Also note that new_york is a single compound word.)
Wild cards
The character * represents a wild card of any number of whole words, including none. The wild card character % represents zero or one words only. Either wild card may be numbered so that it can be copied into the replacement string. Use only * with it's associated number on the replacement side. For example:
Pattern: *1, take *2 outside %3. -> *3 verb(take_outside) *2, *1.
Result: John, take the dog outside please. -> please verb(take_outside) the dog, John.
Pattern: *1, take %2 outside *3. -> *3 verb(take_outside) *2, *1.
Result: John, take the dog outside please. -> (Does not match. "the dog" is more than one word.)
Note that the meaning of single word includes a tagged word and everything contained by that tagged word, so that noun(dog mod(big)) is a single word for matching purposes. On the other hand, the pattern noun(%1) where the singleton wild card is inside the parens, would not match noun(dog mod(big)) because from within the parens there are two words, dog and mod(big). The pattern noun(*), however, would match because that allows more than one word within the parens. In other words, viewed from the outside, everything inside parens is a single word, but viewed from inside the parens, each separate word counts as an individual word.
Copied Text
The angle brackets < and > enclose any substring that will be copied exactly into the replacement text where the up-arrow ^ is placed. The brackets do not have to fall on word boundaries. This is nothing more than a convenience to help keep the pattern files smaller and easier to type. For example:
Pattern: <big yellow school bus> -> noun(^)
Result: My big yellow school bus is late. -> My noun(big yellow school bus) is late.
Alternate Match Patterns
Match patterns enclosed in square brackets and separated by a vertical bar represent alternate matching patterns. These may be whole words or parts of a word. For example:
Pattern: cherr[y|ies] -> noun(cherry)
Result: One cherry or two cherries? -> One noun(cherry) or two noun(cherry).
Both cherry and cherries match because of the two alternative endings given by [y|ies].
When alternatives are listed in the matching pattern, alternative replacements may also be listed in the corresponding positions in square brackets on the replacement side. For example:
Pattern: r[u|a]n -> verb(run,[present|past])
Result: I run faster than he ran. -> I verb(run,present) faster than he verb(run,past).
Pattern: <cherr>[y|ies] -> noun(^y,[sing|plur])
Result: One cherry or two cherries? -> One noun(cherry,sing) or two noun(cherry,plur).
The list of alternatives may include a null (empty) alternative. For example:
Pattern: child[|ren] -> pers(child,[sing|plur])
Result: One child or two children? -> One pers(child,sing) or two pers(child,plur)?
Pattern: all [|of] the -> quan(all_of)
Result: all the king's horses -> quan(all_of) king's horses
Result: all of the people -> quan(all_of) people
Forbidden Words in a Wildcard Match
In a wild card match a word or set of words may be forbidden within the wild card text by placing those forbidden words in curly braces, separated by the vertical bar. Mutli-word text is permitted. The pattern *{xxx} acts exactly like a single * character and may be numbered for use in the replacement text. If the % wild card precedes the curly braces then only one word is checked for a match. The pattern *{*} forbids anything from appearing at that place in the pattern. Use only * on the replacement side. For example:
Pattern: %{he|she|I}1 saw *2 -> *2 was seen by *1
Result: He saw Mary. -> (No change. Does not match because "he" is forbidden.)
Result: Harold saw Mary. -> Mary was seen by Harold.
Pattern: *{pair of} glasses -> noun(glass,plur)
Result: I lost a pair of brand new glasses. -> (No change. Does not match because "pair of" is forbidden).
Result: Two glasses of milk -> Two noun(glass,plur) of milk
Pattern: *{*} parking lots -> noun(parking_lot,p)
Result: I am parking lots of cars. -> (No change. Does not match because everything is forbidden before "parking").
Exact Category Match
If it is ever required that the input text category match the pattern category exactly, and not accept a match to a more general category, then the category name in the pattern is written in upper case. For example:
Pattern: noun( * mod( * ))
Will match: noun(book color(red)) because "color" is a specific kind of "mod"
Will match: pers(boy mod(happy)) because "pers" is a specific kind of "noun"
Pattern: NOUN( * MOD( * ))
Will NOT match: noun(book color(red)) because "color" is too specific and does not match "MOD".
Will NOT match: pers(boy mod(happy)) because "pers" is too specific and does not match "NOUN".
Changing Categories in the Replacement Text
If a specific category is matched by a more general category in the pattern then the category in the text will carried over to the replacement text. For example:
Pattern: adj(*1) noun(*2) -> noun(*2 adj(*1))
Result: adj(big) pers(boy) -> pers(boy adj(big)) because replaced text gets most specific category of noun and pers.
Notice that pers in the target text was matched by the more general noun in the pattern. In the replacement text the more general noun is called out, but in the process of replacement, pers, the more specific category from the actual target text, is retained in the replaced text.
Pattern: adj(*1) noun(*2) -> NOUN(*2 adj(*1))
Result: adj(big) pers(boy) -> noun(boy adj(big)) because replacement text gets least specific category of noun and pers.
In the second example, the replacement text category is in uppercase letters. This tells the replacement process to ignore the category in the target text and literally use the category named in the replacement text. Notice that the category name, noun, in the replaced text is always put in lowercase by the replacement process.
The Parsing Strategy
Category Inheritance
All the categories that will be used must be predefined to the parsing engine. These category derivations are enumerated in a file that the parsing engine loads at the start of the parsing process. This file will be simple text and will contain a list of categories, each of which is followed by the category from which it inherits, in parentheses. After that any text on the line is ignored by the parser and may be used for comments to document category usage. The user is free to define any kinds of categories desired, and assign them any arbitrary names up to 16 characters in length.
For example:
noun // base category with no inheritance
alive(noun) // living thing
anim(alive) // animate living thing
pers(anim) // person or personified actor
plant(alive) // living non-animate
contr(noun) // object that is a container
name(noun) // name of a person, place or thing
pname(name,pers) // name of a person - inherits from both name and pers
story(noun) // title of a book, poem, short story, play, or movie
movie(story) // specifically a movie title
vip(pers) // well-known person
title(pers) // honorific, or other title applied to a person
mod // base category, modifiers
adj(mod) // adjectives in general
adv(mod) // adverbs in general
color(adj) // name of a color
Notice that pname inherits from both pers and from name. This means that this category is considered to be both a more specific kind of name and a more specific pers word. That means that it will be matched by either of its more general parent categories.
Numeric Preprocessor
Since numbers have rules of their own, and can come in many different forms, before any matching is done, the sentences in the input file are run through a numeric pre-processor that reduces numeric expressions to standard forms. Strings like six bits, a buck fifty six, 555-2135, and half a dozen will be reduced to standard tagged forms such as mon(0.75 usdollar), mon(1.56 usdollar), number(555-2135), and number(6).
Word Compounding
The first set of patterns run against the input file consists of multi-word expressions that represent a single concept and can be turned into a single compound word. Examples would include parking lot, root beer, chew the fat, as old as the hills, and The Lord of the Rings. Whenever possible, these words are tagged as well. Thus we might see patterns like:
parking lot[|s] %{of} -> noun(parking_lot,[s|p])
lots of -> quan(many)
the lord of the rings -> story(lord_of_rings)
Example:
I worked all night parking lots of cars in the parking lots at The Lord of the Rings.
-> I worked all night parking quan(many) cars in the noun(parking_lot,p) at story(lord_of_rings).
This preliminary compounding is vital to reduce later ambiguities. For example, we might encounter the word "door" and not know if it is a noun in its own right, or a modifier on another noun, such as in the phrase "door knob". Likewise, the word "handle" could be a noun or a verb, or even part of a compound like "door handle."
In the example that follows, some names will be included in the matching patterns for purposes of illustration. In practice, a separate pass would probably be made with a match-and-replace library devoted entirely to proper names. Here are some word compounding matches on a few sample sentences with potentially ambiguous words:
1. Old mother hubbard went to the cupboard.
2. Old mother anderson went to the cupboard door.
3. The old mother went to the big cupboard.
4. Mother went to the door handle.
5. A cupboard falls.
6. It was to the cupboard that old mother hubbard did go.
Replacement Rules:
cupboard door -> noun(cupboard_door,s)
door handle -> noun(door_handle,s)
old mother hubbard -> name(old_mother_hubbard)
anderson -> name(anderson)
Results:
1. name(old_mother_hubbard) went to the cupboard.
2. Old mother name(anderson) went to the noun(cupboard_door).
3. The old mother went to the big cupboard.
4. Mother went to the noun(door_handle).
5. A cupboard falls.
6. It was to the cupboard that name(old_mother_hubbard) did go.
Noun Pass
At this point most of the nouns can probably be identified, and those that cannot be tagged definitively can be provisionally identified by using category names that reflect our uncertainty at this point. For example, we might use the category name nt to tag a word like "mother" that could be either a title, name or noun, as in "Mother Hubbard", "Hello mother.", or "My mother is here."
cupboard -> noun(cupboard,s)
mother -> nt(mother)
1. name(old_mother_hubbard) went to the noun(cupboard,s).
2. Old nt(mother) name(anderson) went to the noun(cupboard_door,s).
3. The old nt(mother) went to the big noun(cupboard,s).
4. nt(mother) went to the noun(door_handle,s).
5. A noun(cupboard,s) falls.
6. It was to the noun(cupboard,s) that name(old_mother_hubbard) did go.
Adjective and Article Pass
The exact order of the passes is flexible, and may be altered with more experience. In fact, certain pattern libraries might even be split up into more than one library to be run at different points in the process. But for now, this order will do as a first approximation.
a[|n] -> art(i)
the -> art(d)
old -> adj(old)
big -> adj(big)
1. name(old_mother_hubbard) went to art(d) noun(cupboard,s).
2. adj(old) nt(mother) name(anderson) went to art(d) noun(cupboard_door,s).
3. art(d) adj(old) nt(mother) went to art(d) adj(big) noun(cupboard,s).
4. nt(mother) went to art(d) noun(door_handle,s).
5. art(i) noun(cupboard,s) falls.
6. It was to art(d) noun(cupboard,s) that name(old_mother_hubbard) did go.
Collection Pass
Nouns can now collect their articles and adjectives and nest them inside the noun tags as follows:
art(*1) noun(*2) -> noun(*2 art(*1))
adj(*1) noun(*2) -> noun(*2 adj(*1))
nt(*1) name(*2) -> name(*2 title(*1))
1. name(old_mother_hubbard) went to noun(cupboard,s art(d)).
2. name(anderson title(mother) adj(old)) went to noun(cupboard_door,s art(d)).
3. art(d) adj(old) nt(mother) went to noun(cupboard,s adj(big) art(d)).
4. nt(mother) went to noun(door_handle,s art(d)).
5. noun(cupboard,s art(i)) falls.
6. It was to noun(cupboard,s art(d)) that name(old_mother_hubbard) did go.
First Verb Pass
Now we can take a crack at categorizing some of the verbs:
went to -> v(go_to,past)
did go -> v(go,past)
noun(*) falls. -> verb(fall noun(*))
1. name(old_mother_hubbard) v(go_to,past) noun(cupboard,s art(d)).
2. name(anderson title(mother) adj(old)) v(go_to,past) noun(cupboard_door,s art(d)).
3. art(d) adj(old) nt(mother) v(go_to,past) noun(cupboard,s adj(big) art(d)).
4. nt(mother) v(go_to,past) noun(door_handle,s art(d)).
5. verb(fall noun(cupboard,s art(i)))
6. It was to noun(cupboard,s art(d)) that name(old_mother_hubbard) v(go,past).
Notice that most verbs used the category v while "falls" used the category verb. The category verb is reserved for verbs that have been completely parsed and bound to their arguments. the tag v is used when more work remains to be done on binding arguments.
Second Verb Pass
This pass looks for verbs that can be bound to their arguments.
noun(*1) v(*2) noun(*3). -> verb(*2 noun(*1) noun(*2))
it was to noun(*1) that noun(*2) v(*3,*4). -> v(*3_to,*4 noun(*2) noun(*1))
*{*} nt(*1) v(*2) noun(*3). -> verb(*2 NAME(*1) noun(*2))
1. verb(go_to,past name(old_mother_hubbard) noun(cupboard,s art(d)))
2. verb(go_to,past name(anderson title(mother) adj(old))noun(cupboard_door,s art(d)))
3. art(d) adj(old) nt(mother) v(go_to,past) noun(cupboard,s adj(big) art(d)).
4. verb(go_to,past name(mother) noun(door_handle,s art(d)))
5. verb(fall noun(cupboard,s art(i)))
6. verb(go_to,past Name(old_mother_hubbard) noun(cupboard,s art(d)))
Notice the use of *{*} to indicate that nothing must precede the first word of the match pattern. This says, in effect, that the wild card forbids literally anything, so if anything is present the match fails. Everything is parsed completely except sentence 3. But now that we have identified the verb it is possible to decide, by virtue of its context,that the nt word "mother" is, indeed, a noun:
adj(*1) nt(*2) v(*3) -> noun(*2 adj(*1)) v(*3)
3. art(d) noun(mother adj(old)) v(go_to,past) noun(cupboard,s adj(big) art(d)).
Now by going back and repeating some of the earlier passes this sentence will also be completely parsed.
In Reality
In reality, if and when the parser is ever successfully parsing sentences of reasonable complexity, the order of the passes, the identity of the categories, and the exact manner in which the substitutions are made may change significantly from this example. But the mechanism of search and replace illustrated here will not change.
It should be noted that if complete parsing is possible using this technique then complete un-parsing should also be possible, in which case the same parsing engine could be used with additional pattern libraries to un-parse the parsed sentences either into an English paraphrase, or into another language entirely. In other words, automated translation from English to the target language would be possible using only this simple text search and replace engine. Whether that is true remains to be seen.
Next Steps