This page explores a novel method of parsing a sentence for structure and meaning by recursively fusing together adjacent sentence elements. It should be noted that this web page is not the result of working out the details of this system. Instead, it is part of the very process of working out those details, and should be considered incomplete, provisional, and in a state of flux.
Fusion parsing is a process which transforms a string of symbols by succesive application of rules, such that the length of the string symbols becomes one symbol shorter with each application of a rule. In time a string of any initial finite length must have its length reduced until no rule exists which can be applied to the string. This leaves what we will call the terminal string. The terminal string will either consist of one symbol, or of more than one symbol. If the terminal string consists of a single symbol then we say that the original string has been successfully parsed. If the terminal string has more than one symbol then the string has not been parsed.
The fusion rules are of two types; direct fusion rules, and bookend rules. Direct fusion rules give a pair of adjacent target symbols and a third, fusion symbol. The two target symbols, when found adjacent to each other and in the order specified by the rule, are removed from the string and replaced by the single replacement symbol. For example, applying the rule BC->F to the string "ABCD" yields the string "AFD". The second type of rule is the bookend rule. In this rule, the two target symbols are found acting as bookends to another single symbol which lies between them. In this case the two outer bookend symbols are fused and placed after the inner symbol. For example, the rule B*D->F applied to the string "ABCD" yields the string "AFC".
The intent is that the rule set, or parsing grammar, will parse every string which is a sentence in some specific language. While the rules themselves can be reversed and used as production rules to generate sentences, it is not guaranteed that strings generated by such production will be sentences in the language. In other words, it may be the case that the actual productive grammar of the language is unknown, and perhaps even unknowable, an yet a non-reversable parsing grammar which is complete enough for practical purposes remains possible.
Consider this example of a fusion grammar:
|
Now suppose that instead of discarding the symbols that have been fused, we enclose them in brackets and mark the barackets with the new fused symbol. Now when we look for matching symbols in the rules we ignore any symbols that are within the brackets. The same string with the same rules applied becomes:
|
The terminal string is longer than one symbol, but there is only one symbol remaining on the outside of all brackets, and that is the final symbol of the original terminal string. Now, since the string is a nested parenthetical expression, we can display it as a binary parse tree rooted on the terminal symbol. This shows not only the final terminal string but every step of the process that brought us to that terminal string.
|
Now suppose we let each symbol in the string be a category tag on a word of a natural language sentence. The parse tree then begins to make some sense. For example:
|
The goal of the fusion parser to to succesfully parse as many "reasonable" sentences as possible. Given a hypothetical language for which the complete production grammar is known, the parsing grammar should be loosely enough coupled to the production grammar so that sentences which are "not quite" in the language can still be succesfully parsed. For example the sentence "Them dogs is mean." is not quite a sentence in "proper" English, yet even a very young and inexperienced human parser can succesfully work out its meaning. A fusion parser should also be able to succesfully parse sentences which are not strictly "correct" with respect to the production grammar. For this reason the fusion grammar is far less formal than a production grammar, and does not necessarily inforce rules of tense, number, gender, case, and so on.
The rules of fusion grammar are not invented by a processes of logic and reasoning, but are discovered by extracting those rules from examples of sentences which were hand-parsed based on an intuitive feel for what constitutes a way (but not necessarily the only way) to parse a given sentence.
Word categories are not enumerated a priori, but emerge by observing what words share rules with what other words. Words that share the same roles in the same set of rules are taken to be members of the same family. The families are discovered empirically, but are not given traditional names like "proper noun" or "transitive verb" simply because those name could be misleading, and might prompt us to think, mistakenly, that we really know which categories will be useful for practical application of fusion parsing rules. Thus we might discover that the words "this" and "that" have the same kind of rules, as do the seemingly related "book" and "kitten". We note instances of "this kitten" and "this book" as well as "that book" and "that kitten". This leads us to conclude that "this" and "that" might belong to the same family as might "book" and "kitten". But we will not be so bold as to name those families, or to pretend that we understand them in some abstract or generalized way. The only thing we know about such families is that they share certain rule groups.
Initially, category tags will simply be numbers to clearly distinguish them from text. Staring with a small corpus of sentences (below) we begin by noting that there are several words and fused elements that can be fused with the word "the", which we will assign the tag 000. This word, therefore, serves as our starting point for enumerating categories. We will tag those elements that can be fused to the right of "000:the" with 010, and the fused element that results will be tagged 020, which asserts that since "the the cat" is not acceptable, a 010 element once fused with "the" cannot be fused with "the" again, and consequently cannot be tagged with 010. Thus, our first rule is [000 010]->020. The tag numbers could be assigned sequentially, but it turns out in practise that leaving gaps in the number sequence comes in handy later on when revisions need to be made and we would like to insert a new category.
|
We observe that some stand-alone words may be tagged 010, and also be part of a group of words that is tagged 010. We notice, for example, both 010:[weary traveler] and 010:traveler. From this observation, and without assuming that we know anything about the nature of these words or categories, we hypothesize that there exist elements such as "weary", which, when combined with an element of category 010 create a fused element which still belongs to category 010. We further conclude that some of these words themselves do not belong to category 010. For example, we would not, intuitively, accept the grouping "[the weary] traveller" when the grouping "the [weary traveler]" is available. Yet we would, if nothing better were available, accept "no rest for [the weary]."
So how do we categorize a word like "weary"? It clearly does belongs to category 010 in "No rest for the weary." But does not belong to that category in "The weary traveler...". The solution is to allow that this word can be tagged in two distinctly different ways, creating two distinct parse trees. If the sentence is inherently unambiguous then only one of those parse trees will be capable of completion, and the incomplete tree will be discarded. Words that are fused with 010 words to create 010 elements will be tagged with 030. Those 030 words which can also function as 010 words will also be tagged 040 in the tagging dictionary, where 040 is a near synonym to 010. Thus "weary" will be found in the dictionary as both "030:weary" and as "040:weary" Implied by this are the rules "[030 010]->010", and and "[000 040]->020" our list of word tags, rules, and sentences now looks like this:
|
You will notice that we have not said "this word is a noun" or "that word is an adjective". Nor will we succumb to the temptation to pretend that we know more than what the sentences themselves reveal. Our only acceptable criteria for this process is the intuitive sense of what does and does not sound right to our ears. For example, we might wonder what category the word "my" belongs in. At first glance it looks like it is mutually exclusive with "the" and seems to occupy the same slots as that word. For example, we have "the red ball" and "my red ball". Does that mean that "my" is another member of category 000? In order to decide we ask ourselve if the two are always interchangable, and the answer is no. "No rest for the weary" sounds right, but "no rest for my weary" does not. So we will assign a new tag, 050 to "my", and let it inherit only one of the rules from "the", namely "[050 010]->020". With this rule, "my little white kitten" becomes "020:[050:my 010:[030:little 010:[030:white 010:kitten]]]"
That this rule should create an element of category 020 can be verified by seeing that such 020 elements do indeed sound right when replacing 020 elements generated with "the". For example: "No rest for the weary" vs "No rest for my little white kitten", and "The boys ran away" vs "My little white kitten ran away."
Our formal training might lead us into the mistake of assuming the word "a" belongs in the same category of the word "the". We were, after all, taught to call both of them "articles." But there are many cases where, for one reason or another, they are definitely not interchangable. Not only do we have differences due to singular vs plural as in "the boys" vs "a boys", but we have the pair "no rest for the weary" vs "no rest for a weary". Before we can categorize the word "a", however, we need to recognize that some 010 words work with "a" and some 010 words do not. In other words, we need to divide category 010 into two separate categories which we will call 010 and 011. Category 011 will catch all those category 010 words that do not work with "a". For now we will also tag the results differently, as 020 and 021. Later we may merge those two categories. The word "a" itself will be tagged 001.
|
Notice that the original rule [000 040]->020 has been changed to [000 040]->021 because the result of that rule (e.g. "the weary") is inherently plural just as those [000 011]->021 elements. For example, "many of the weary" and "many of the boys" vs the incorrect "many of a man". This kind of back and forth trial and error modification goes on until we have a relatively stable set of tags and rules based on a wide variety of model sentences.
(Dec. 6, 2006. To be continued)