Fusion Grammar: A New Approach to Parsing Natural Language
by Gary J. Shannon
Started Oct. 27, 2008
Last Updated Oct 27, 2008
Basic Concept
Fusion parsing is a process which transforms a string of symbols by succesive application of rules, such that the length of the string symbols either remains unchanged, or becomes one symbol shorter with each application of a rule. In time a string of any initial finite length necessarily must have its length reduced until no rule exists which can be applied to the string. This is the terminal string. The terminal string will either consist of one symbol, or of more than one symbol. If the terminal symbol consists of a single symbol then we say that the original string has been successfully parsed. If the terminal string has more than one symbol then the string has not been parsed.
The rules are of two types; fusion rules, and exchange rules. Fusion rules give a pair of adjacent target symbols and a third, fusion symbol. The two target symbols, when found adjacent to each other and in the order specified by the rule, are removed from the string and replaced by the single fusion symbol. For example, applying the rule BC->F to the string "ABCD" yields the string "AFD".
The second type of rule is the exchange rule. In this rule, two adjacent symbols are named which, when found adjacent to each other, will exchange positions in the string. For example, the rule B%C applied to the string "ABCD" yields the string "ACBD".
The intent is that the rule set, or parsing grammar, will parse every string which is a sentence in some specific language. While the rules themselves can be reversed and used as production rules to generate sentences, it is not guaranteed that strings generated by such production will be sentences in the language. In other words, it may be the case that the actual productive grammar of the language is unknown, and perhaps even unknowable, an yet a non-reversable parsing grammar which is complete enough for practical purposes remains possible.
Consider this example:
Rule set:
1. MJ->J
2. JN->N
3. VW->W
4. DN->R
5. RW->S
Initial string: DMJNVW
DMJNVW -> DJNVW (rule 1)
DJNVW -> DNVW (rule 2)
DNVW -> DNW (rule 3)
DNW -> RW (rule 4)
RW -> S (rule 5)
Now suppose that instead of discarding the symbols that have been fused, we enclose them in brackets and mark the brackets with the new fused symbol. The same string with the same rules applied becomes:
DMJNVW -> DJ[MJ]NVW (rule 1)
DJ[MJ]NVW -> DN[J[MJ]N]VW (rule 2)
DN[J[MJ]N]VW -> DN[J[MJ]N]W[VW] (rule 3)
DN[J[MJ]N]W[VW] -> R[DN[J[MJ]N]]W[VW] (rule 4)
R[DN[J[MJ]N]]W[VW] -> S[R[DN[J[MJ]N]]W[VW]] (rule 5)
The terminal string is longer than one symbol, but there is only one symbol remaining on the outside of all brackets, and that is the final symbol of the original terminal string. Now, since the string is a nested parenthetical expression, we can display it as a binary parse tree rooted on the terminal symbol:
+-- D +-- M
| |
+-- R --+ +-- J --+
| | | |
S --+ +-- N --+ +-- J
| |
| +-- V +-- N
| |
+-- W --+
|
+-- W
From Symbols to Words
Now suppose we let each symbol in the string be a category tag on a word of a natural language sentence, for example:
D = a
M = very
J = hard
N = rain
V = is
W = falling
Giving the tree:
+-- a +-- very
| |
+-- R --+ +-- J --+
| | | |
S --+ +-- N --+ +-- hard
| |
| +-- is +-- rain
| |
+-- W --+
|
+-- falling
And the "formula":
S[R[a N[J[very hard] rain]]W[is falling]]
Or in structured form:
S[
R[
a N[J[very hard] rain]
]
W[
is falling
]
]
Depending on what rules are chosen, and how the final result is structured, it should be a simple matter to extract the relevant information from known locations in the structure in order to "understand" the sentence.
The Goal of the Project
The goal of the fusion parser to to succesfully parse as many reasonable sentences as possible. Given a hypothetical language for which the complete production grammar is known, the parsing grammar should be loosely enough coupled to the production grammar so that sentences which are "not quite" in the language can still be succesfully parsed. For example the sentence "Them dogs is mean." is not quite a sentence in "proper" English, yet even a very young and inexperienced human parser can succesfully work out its meaning. A fusion parser should also be able to succesfully parse sentences which are not strictly "correct" with respect to the production grammar.
For this reason the fusion grammar is far less formal than a production grammar, and does not inforce rules of tense, number, gender, case, and so on.
The rules of fusion grammar are not invented by a processes of logic and reasoning, but are discovered by extracting those rules from examples of sentences which were hand-parsed based on an intuitive feel for what constitutes a way (but not necessarily the only way) to parse a given sentence.
Word categories are not enumerated a priori, but emerge by observing what words share rules with what other words. Words that share the same roles in the same set of rules are taken to be members of the same family. The families are discovered empirically, but are not given traditional names like "proper noun", "transitive verb", "subject", or "direct object", simply because those name could be misleading, and might prompt us to think, mistakenly, that we really know which categories will be useful for practical application of fusion parsing rules. Once we have extracted some useful categories or classes, we will not be so bold as to name those classes, or to pretend that we understand them in some abstract or generalized way. The only thing we know about such classes is that 1) they share certain rule groups, and 2) they seem to be pragmatically useful in parsing out the meaning of a sentence.
The same classes that apply to single words will also be applicable to phrases. For example, a certain class might be the class of all words and phrases that could properly complete the sentence "He went...", and would include such members as "to the park", "home", "to school", "to the concert", "into the dark night" etc. It includes any prepositions, expressed or implied, that make it a "destination" (although we will resist the temptation to name the class "The Destination Class" in case the same class shows up in other rules where "destination" is not an appropriate descriptor).
This example may, in fact, be a less than useful one if we consider that the verb "go" might take such a "destination" argument but "enter" does not. Since "enter" has, in effect, a "built-in" preposition "into", it takes a "location", without any included or implied preposition. Such a location, (or possibly "event" as in "enter the contest") might include: "the room", "a dark and gloomy cave", "that old haunted house" etc.
Again, it will be a matter of observing which classes emerge from the data, rather than imposing some a priori classification scheme on words and phrases. It may well be the case that certain prepositions are fused to the verb rather than to the noun phrase, as in [go into] [the house] as opposed to go [into [the house]], or that adverbs will fuse with entire assertions rather than with verbs as in [gently [falls [the rain]] as opposed to "[[gently falls] [the rain]]". The data itself, and the usefulness of the resulting parsed forms, will dictate the rules.
The Non-Adjacency Problem
A basic assumption of the fusion grammar is that words and phrases that should be fused into larger units will be located adjacent to each other. However, this is often not the case. Consider the sentence:
There is a hard rain falling.
We know that the group "[is falling]" eventually needs to be fused, yet the two words needed by this fusion rule are separated by a significant distance. This is the reason that provision is made for so-called exchange rules. Once we have fused the input sentence to a certain degree, we find that the there is only one element, "[a [hard rain]]" separating the two target words:
There is [a [hard rain]] falling.
We might also discover a rule that fuses "there is" into the class-equivalent of "is", "[there is]". Then we would have, in effect: "[there is] X falling". Now an exchange rule could be written of the form "AXB->XAB" which gives us:
[a [hard rain]] [there is] falling
Whereupon the original fusion rule can form:
[a [hard rain]] [[there is] falling]
We could then reduce "[there is]" to "is" giving the same final output string as the original sentence, "A hard rain is falling."
The justification for cancelling out "there" is that the only function served by the word was to distinguish between a statement and a question:
"Is a hard rain falling" vs "There is a hard rain falling"
Once "is" has been swapped into the proper location the fact that the sentence is a statement rather than a question is implied by the structure, so the "this-is-a-statement" flag is redundant and can be dropped.
It should also be noted that two structurally different sentences resulted in the same parse tree. This is not something that can happen with a conventional formal parser. The reason is that a conventional parser parses for structure, while the fusion parser parses for meaning. Thus we would expect to extract exactly the same meaning from these two structurally different sentences:
Quickly, I ran down to the river's edge.
It was down to the very edge of the river that I forsooth did quickly run.
Both should result in something like this:
[I [[ran quickly] [[down to] [river edge]]]]
Notice that the relatively "empty" words like "very" and "forsooth" were discarded as essentially irrelevant for chatbot purposes.
The Risks of Exchange Rules
The risk in using exchange rules is that false adjacencies might be created. The transposition must retain the required adjacency while not allowing any false fusions to take place. Take the phrase "was gently falling", for example. "was" needs to be adjacent to "falling" so that the fully tensed-out verb compound can be fused as "[was falling]". One way to do that would be to exchange "gently" to a new location: "gently was falling". The problem arises from the possibility that "gently" might now fuse with some word that comes before it in its new location. Consider:
He saw that more was gently falling than was drifting away.
After exchanging "gently" we have:
He saw that more gently was falling ...
Now the risk of falsely fusing "[more gently]" is apparent. To prevent that from happening, the exchange rule could optionally include a "barrier" to maintain the original adjacency, but to disallow any newly formed false adjacency. Thus, after the swap, we might have:
He saw that more | gently was falling ...
With the wall erected between "more" and "gently" they are prohibitted from fusing, while the intended "[gently [was falling]]" is still permitted.