Building a Context Network from a Narrative
An Approach to Chatbot Question Answering
Page Seven: Chunking and Knowledge Units

by Gary J. Shannon

Started Nov. 22, 2008
Last Updated Nov 28, 2008
Page One: Overview
Page Two: Details of Parsing
Page Three: World Models and Reasoning by Analogy
Page Four: Exploring Some Complications
Page Five: Nuts and Bolts 1: Part of Speech Tagging
Page Six: Nuts and Bolts 2: From Tags to Nodes
Page Seven: Chunking and Knowledge Units

Sentences and Knowledge Units

A sentence is an open-ended thing which may contain any number of clauses, and convey any number of discrete facts. A knowledge unit, on the other hand, conveys only one piece of information. The core of the knowledge unit is the verb that links the other content words together. Content words are linked to the verb through prepositions, either expressed, as in "TO the store", "FROM Mars", or implied as "Mary" in "John gave Mary the book."

In some cases it may be ambiguous as to whether a given connection is prepositional or verbal. Consider "Mary's book" where the possessive "'s" marker acts like a prepositional link, versus "book that belongs to Mary" where the "belongs" seems a more verb-style link. Should a possessive then be marked by a single link between the owner and the thing owned, or should the owner and the thing both be linked to a common "belongs" verb node?

For the purpose of simulating understanding, it should be remembered that "the book belongs to Mary" is a separate fact that needs to be noted and remembered independently of whatever other actions involve the book. Therefore, the sentence "John has Mary's book." actually conveys two distinct pieces of information, and should be chunked into two distinct knowledge units: "John has the book." and "The book belongs to Mary." This suggests (unlike some of the examples on previous pages) that there are two verbs involved in the sentence: "has" and "belongs".

Even as simple a statement as "I have a red crayon." actually contains two knowledge units: "I have a crayon." and "The crayon is red." The sentence "My red crayon is broken." has three knowledge units: "I have a crayon." "The crayon is red." and "The crayon is broken."
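
As a rough illustration of the chunking described above, the three knowledge units in "My red crayon is broken." could be stored as small verb-centered records. The KnowledgeUnit name and its fields are assumptions made for this sketch, and the units are written out by hand rather than produced by a parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeUnit:
    """One verb linking two content words (hypothetical structure)."""
    arg1: str
    verb: str
    arg2: str


# Hand-chunked units for "My red crayon is broken."
units = [
    KnowledgeUnit("I", "have", "crayon"),
    KnowledgeUnit("crayon", "is", "red"),
    KnowledgeUnit("crayon", "is", "broken"),
]


def facts_about(word, units):
    """Return every unit that mentions the given content word."""
    return [u for u in units if word in (u.arg1, u.arg2)]


for unit in facts_about("crayon", units):
    print(unit.arg1, unit.verb, unit.arg2)
```

Because each fact is its own record, "the crayon is red" stays retrievable no matter what else is later said about the crayon.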

Verb Valencies

Just about any verb can have about as many arguments as one cares to hang off that verb. Take the simple sentence "I ate." The verb "ate" takes one argument, "I". But we could add a second argument by writing "I ate pizza." The second argument is an object of the verb. But it doesn't end there. By using prepositions as hooks we can go right on hooking more and more arguments onto the verb: "I ate pizza for dinner." "I ate pizza for dinner with Mary." "I ate pizza for dinner with Mary at The Pizza Barn." "I ate pizza for dinner with Mary at The Pizza Barn in Valley Plaza Mall." "Before the movie I ate pizza for dinner with Mary at The Pizza Barn in Valley Plaza Mall." And so on.

Strictly speaking, all those extra bits of information hung onto the verb with prepositional hooks are not "arguments" for the purpose of determining a verb's valency. For my purposes (although this may not agree with a linguist's definition), the valency of a verb is the number of arguments that it can take without any prepositional hooks. Consider "John gave Mary the book." The verb "gave" has three positional arguments, and a valency of three. The role of each argument is strictly defined by its position relative to the verb. If the order is altered, then prepositional hooks must be introduced to override the positional roles: "John gave the book TO Mary."

In English, the highest valency of any verb is three. It has been argued by some linguists that some English verbs are quadrivalent, but in every such case, the fourth argument (and beyond) must be attached with a prepositional hook, and is, in fact, optional. Consider the classic "quadrivalent" verb "trade": "John traded Mary the book FOR the candy bar." We could have ended the sentence at "John traded Mary the book." and the knowledge unit is complete, as far as it goes. The average listener might be prompted to ask "What did he trade for?", but by the same line of reasoning, "I ate." might prompt the listener to ask "What did you eat?" In any conversational context there is always more information that might be requested beyond that which was supplied in the sentence.

But as far as valency is concerned, arguments attached by hooks are always considered optional. Some positional arguments may also be optional, but they are still counted toward the valency of the verb. Thus, in that context, every English verb has a valency of one, two, or three positional arguments. The normal positional locations are:

ARG1 verb. - John ran.
ARG1 verb ARG2. - John watched the game.
ARG1 verb ARG2 ARG3. - John gave Mary the book.
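
The distinction between positional and hooked arguments can be sketched in a few lines. This is only an illustration under stated assumptions: the HOOKS set is a small hypothetical preposition list, the argument phrases are chunked by hand, and the function counts the positional arguments actually used in one sentence, which may be fewer than the verb's full valency.

```python
# Small, hypothetical hook list; a real system would need a fuller one.
HOOKS = {"to", "from", "for", "with", "at", "in", "before"}


def split_arguments(phrases):
    """Separate positional arguments from those attached by a hook.

    `phrases` is a hand-chunked list of argument phrases in sentence
    order, with the verb itself left out.
    """
    positional, hooked = [], []
    for phrase in phrases:
        first_word = phrase.split()[0].lower()
        if first_word in HOOKS:
            hooked.append(phrase)
        else:
            positional.append(phrase)
    return positional, hooked


# "I ate pizza for dinner with Mary at The Pizza Barn."
positional, hooked = split_arguments(
    ["I", "pizza", "for dinner", "with Mary", "at The Pizza Barn"]
)
print("positional:", positional)
print("hooked:", hooked)
```

However many hooked phrases are added, the positional list for "ate" never grows past two entries, which is the point of counting only positional arguments toward valency.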

The above argument order is usual, but it is not the only possible order; other argument orders are generally used for "poetic" purposes: "Down, down came the rain." If the normal order is disturbed, then that disruption must be signaled in one of two ways. The first is when the verb has no possible ARG2 of the same type, allowing ARG1 to move into the ARG2 slot: "Comes the dawn." The second is to move ARG1 into the ARG2 slot, and move ARG2 somewhere else while simultaneously tagging it with a prepositional hook: "To Mary gave John the book." Clearly "John" cannot take the ARG2 role because that role has been explicitly cast to "Mary" by the preposition "to". And by casting "Mary" in the ARG2 role, the sentence is left with no ARG1. Therefore the only possible assumption is that "John", in spite of being in the ARG2 slot, must take on the ARG1 role.

In cases where the types of the arguments are not interchangeable, the arguments may also be shuffled around without risk of confusion: "Down comes the rain." vs "The rain comes down." Each argument slot has a specific type or syntactic/semantic category into which any arguments must fall in order to satisfy the needs of that slot. Since the class of the argument "the rain" is not the same as the class of the argument "down", the two can be exchanged. We might think of the definition of the verb as being "overloaded", in the sense that there is more than one defined argument pattern, and whichever one best fits the argument types can be used.

Thing:ARG1 came Direction:ARG2
Direction:ARG2 came Thing:ARG1

Thing:the rain
Direction:down

The rain came down.
Down came the rain.

Where the types of the two arguments are not distinct, the two arguments cannot be exchanged without risk of confusion:

Thing:ARG1 saw Thing:ARG2
Thing:ARG2 saw Thing:ARG1

Thing:John
Thing:the tiger

John saw the tiger.
The tiger saw John.
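
The two cases above can be sketched as a lookup over overloaded, typed templates. The TEMPLATES and TYPES tables below are hand-written assumptions covering only these examples; the sketch simply reports every role binding whose slot types match the phrases in surface order.

```python
# Each template lists (slot type, role) pairs in surface order.
# Both tables are hypothetical and cover only the examples above.
TEMPLATES = {
    "came": [
        [("Thing", "ARG1"), ("Direction", "ARG2")],
        [("Direction", "ARG2"), ("Thing", "ARG1")],
    ],
    "saw": [
        [("Thing", "ARG1"), ("Thing", "ARG2")],
        [("Thing", "ARG2"), ("Thing", "ARG1")],
    ],
}

TYPES = {
    "the rain": "Thing",
    "down": "Direction",
    "John": "Thing",
    "the tiger": "Thing",
}


def bind(verb, phrases):
    """Return every role binding whose slot types fit the phrases."""
    bindings = []
    for template in TEMPLATES[verb]:
        if len(template) != len(phrases):
            continue
        if all(TYPES[p] == slot_type
               for p, (slot_type, _) in zip(phrases, template)):
            bindings.append({role: p
                             for p, (_, role) in zip(phrases, template)})
    return bindings


print(bind("came", ["down", "the rain"]))    # one binding fits
print(bind("saw", ["John", "the tiger"]))    # two bindings fit
```

For "came", the types alone pick out a single binding, so the arguments may be exchanged freely. For "saw", both templates fit, so types cannot settle the roles and only position can.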

The arguments can, however, be rearranged if they are marked with hooks: "The coming OF the rain WAS down." "The seeing OF the tiger was by John." Although such a construction is unconventional at best, it is understandable given that the arguments are hooked. The original verb "come" has been recast as a noun here ("the coming"), and the arguments placed in their normal order for the new verb "was". So in a sense, even such a bizarre structure as this still closely follows the argument position rules: ARG1 was ARG2, where ARG1 is "the coming of the rain" and ARG2 is "down" (or ARG1 is "the seeing of the tiger" and ARG2 is "by John").

Verb Argument Templates

Since we are going to be using verb argument templates to extract arguments from units of knowledge, it would be good to know exactly how many different verb argument templates there are. We could simply say that the three listed above ("ARG1 verb", "ARG1 verb ARG2", "ARG1 verb ARG2 ARG3") are the whole set. In a sense that is correct, but it does not take argument types into account, and as we've seen above, knowing the argument types for each verb will be necessary to bind arguments to their correct template slots in cases where the verb template is overloaded. (As is "came" in "He came home. = Home came he. = Home he came.")

When a verb's argument template specifies the general type of the argument, then the template is called "typed". When the template specifies the detailed syntactic and semantic role of the argument, then the template is said to be "strongly typed". An example of a strongly typed template would be when the verb "pour" calls for an argument which is a mass noun naming a pourable substance, or a count noun in the plural. "He poured sand out of his shoe." "The giant poured bricks and chickens out of his hat." Such a strongly typed template would frown on trying to fill a slot with the wrong type of word:

John poured himself a steaming cup of hot philosophy. Then he sat down at the computer and wrote a long Montreal to his girlfriend, about how he wished he could visit coffee and study letter.

A human could quite easily unscramble that sentence because of the strongly typed slots for each verb. Each and every argument could be lifted out of the sentence and put back into its proper place based simply on satisfying the required parameter types of each verb. Tomorrow's project will be to take a list of a few thousand verbs and reconstruct the template for each, to see how many unique strongly typed templates show up.
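
The unscrambling can be sketched as a small search. The slot types ("Beverage", "Document", and so on) and the type assigned to each word are assumptions invented for this illustration, not templates reconstructed from a verb list; the sketch tries each ordering of the four misplaced words and keeps the one in which every word satisfies its slot's type.

```python
from itertools import permutations

# Slots from the scrambled passage, each with an assumed required type.
SLOTS = [
    ("poured a cup of", "Beverage"),
    ("wrote a long", "Document"),
    ("visit", "Place"),
    ("study", "Subject"),
]

# The four misplaced words, each with an assumed type.
WORD_TYPES = {
    "philosophy": "Subject",
    "Montreal": "Place",
    "coffee": "Beverage",
    "letter": "Document",
}


def unscramble(slots, word_types):
    """Find an assignment of words to slots that satisfies every type."""
    for order in permutations(word_types):
        if all(word_types[word] == slot_type
               for word, (_, slot_type) in zip(order, slots)):
            return {frame: word for word, (frame, _) in zip(order, slots)}
    return None  # no ordering satisfies all slot types


for frame, word in unscramble(SLOTS, WORD_TYPES).items():
    print(frame, "->", word)
```

With only four words a brute-force search over orderings is enough; a larger passage would call for matching each word to its slot directly by type.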

Noun Classes and Verb Templates

Upon delving more deeply into the subject of verb templates, it turns out that it is very helpful to know what class of nouns may be inserted into any slot in the template of a given verb. But rather than trying to classify nouns manually, or analytically, I've decided to take a corpus-based approach to noun classification. I will begin by assuming that every verb takes as its arguments nouns of a class unique to that verb. I will do this by using a set of model sentences and deciding, by human intervention, whether a given noun can or cannot fill a given slot in a given model sentence.

I will use sentences from a variety of different corpora I have available to me, and each sentence will be numbered, and each verb will be stripped of irrelevant adverbs, and each noun stripped of irrelevant adjectives. The skeleton sentence will then be repeated, once for each noun slot in the sentence, and each copy of the sentence numbered sequentially. Then each noun to be classified will be tried in each slot. If the noun can be used in that slot, both syntactically and semantically, then the number of the model sentence will be added to the list of classes for that noun. For example, the first few sentences from what I call "corpus 3", which is section three of the book Graded Sentences for Analysis by Rossman and Mills, would be as shown below, where the 257 sentences in that corpus are numbered 3001 through 3257, with the slot number appended to each sentence number to give the class number of any noun that satisfies the slot.

Taking the proper noun "Henry" as an example, we find that it will legitimately fill slots 30011, 30012, 30021, 30022, 30031, 30032, 30041, 30042, 30051, 30052, 30061, 30062, 30071, 30081, 30082, 30083, and 30091. "Henry" is, therefore, said to be a member of those named classes. On the other hand, the noun "(the) news" fits only in slots 30022, 30032, 30051, 30053, 30063, 30072, 30084, and 30092, and so those classes characterize that noun. Some of the class memberships may seem like a stretch when judged on the basis of common sense. Would we, for example, really throw Henry to the dog? But whether a given action is likely or not in practical, sensible, or ethical terms, the statements still make sense. Thus while we might never actually say "We bought the baby a gun." it remains grammatically and semantically permissible to say that we did so.
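
The numbering scheme can be sketched as follows. The helper names are assumptions made for this illustration, and the sketch assumes a single-digit slot number, as in the class numbers quoted above. The memberships are the ones listed for "Henry" and "(the) news".

```python
def class_number(sentence_no, slot_no):
    """Append a single-digit slot number to a sentence number."""
    return sentence_no * 10 + slot_no


def split_class(class_no):
    """Recover (sentence number, slot number) from a class number."""
    return divmod(class_no, 10)


# Class memberships as listed in the text above.
memberships = {
    "Henry": {30011, 30012, 30021, 30022, 30031, 30032, 30041, 30042,
              30051, 30052, 30061, 30062, 30071, 30081, 30082, 30083,
              30091},
    "news": {30022, 30032, 30051, 30053, 30063, 30072, 30084, 30092},
}

# Slots that accept both nouns.
shared = memberships["Henry"] & memberships["news"]
print(sorted(shared))
print(split_class(30083))
```

Storing each noun's classes as a set makes it cheap to compare two nouns, which is the comparison the later merging of classes depends on.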

Obviously there will initially be vastly more classes than will ultimately be needed. Only after a large number of classes are built for a large number of nouns will we begin to observe that many classes have identical membership. Thus it might turn out, for example, that every noun which is a member of class 30021 is also a member of class 30062, and vice versa. In that case we will realize that 30062 is simply an alternate name for the class 30021, and any mention of class 30062 can be discarded from the list of noun class memberships, and that sentence discarded from the collection of classification sentences.
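
A minimal sketch of that merging step, assuming class numbers are plain integers and memberships are kept as a noun-to-classes mapping. The toy data is invented for the illustration, and keeping the lowest class number as the surviving name is an arbitrary choice.

```python
from collections import defaultdict


def find_aliases(noun_classes):
    """Map each redundant class to the class it duplicates.

    Two classes are duplicates when exactly the same nouns belong to
    both. The lowest class number in each group is kept as the name.
    """
    # Invert noun -> classes into class -> set of member nouns.
    members = defaultdict(set)
    for noun, classes in noun_classes.items():
        for class_no in classes:
            members[class_no].add(noun)

    # Group classes by their exact membership.
    groups = defaultdict(list)
    for class_no, nouns in members.items():
        groups[frozenset(nouns)].append(class_no)

    aliases = {}
    for class_list in groups.values():
        keep = min(class_list)
        for class_no in class_list:
            if class_no != keep:
                aliases[class_no] = keep
    return aliases


# Invented toy memberships, far too small to justify a real merge.
toy = {
    "Henry": {30011, 30021, 30062},
    "dog": {30021, 30062},
    "news": {30022},
}
print(find_aliases(toy))
```

With only a handful of nouns many classes will coincide by accident, so a merge of this kind is only meaningful once a large number of nouns have been judged.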

Eventually, we will arrive at a set of prime classes which is the minimal set required to describe the syntactic and semantic behavior of each noun with respect to each verb. In theory, this could be accomplished by manually constructing an a priori list of noun classes, such as "animate objects", "sentient beings", "constructed objects", "movable objects", and so on, but it's likely that by letting the corpus dictate the classes, with no a priori assumptions as to their nature, we might learn something interesting about noun classes that we might otherwise not have noticed.

One additional condition for membership seems to be the use of an article or determiner with the noun. Thus "Henry" and "boy" are not interchangeable in sentence 30031, for example. However, "the boy" can take Henry's place in that sentence. Initially we will note that while "Henry" belongs to class 30031, "boy" belongs to class "30031a", where the suffix "a" indicates that membership in that class is dependent on the use of an article or determiner. Later, as the prime classes begin to emerge, we may decide to merge class 30031 with class 30031a, and mark the noun "boy" as being the carrier of the need for a determiner in some cases. That will, of course, depend solely on whether the data shows that such a generalization is possible.

I will automate the process as much as possible, by writing a computer program to present model sentences to me with nouns from a file of nouns to be classified. Noun classes will be tracked, and their eventual merging will be handled by computer, but the judging of each sentence will still need to be done with a single good/bad mouse click by a human judge after reading each candidate sentence. The project for the next few days will be to write the software for that noun classification task.
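
The core loop of such a program might look something like the sketch below. Everything in it is an assumption made for illustration: the class numbers 90011 to 90013 are invented, the skeleton is built from the "We bought the baby a gun." example rather than from the corpus, and the human's mouse click is replaced by a stand-in judge function backed by a fixed set of accepted sentences.

```python
def classify(nouns, models, judge):
    """Try every noun in every slot and record the accepted classes.

    `models` maps a class number to a skeleton sentence with "{}" in
    the slot under test. `judge` returns True for an acceptable
    sentence; in the real tool this would be the human's good/bad click.
    """
    memberships = {noun: [] for noun in nouns}
    for class_no, skeleton in sorted(models.items()):
        for noun in nouns:
            candidate = skeleton.format(noun)
            if judge(candidate):
                memberships[noun].append(class_no)
    return memberships


# Hypothetical model sentence, repeated once per noun slot.
models = {
    90011: "{} bought the baby a gun.",
    90012: "We bought {} a gun.",
    90013: "We bought the baby {}.",
}

# Stand-in for the human judge: a fixed set of accepted sentences.
accepted = {
    "Henry bought the baby a gun.",
    "We bought Henry a gun.",
    "We bought the baby the news.",
}

result = classify(["Henry", "the news"], models, accepted.__contains__)
print(result)
```

Passing the judge in as a function keeps the bookkeeping separate from the user interface, so the same loop could later be driven by a click handler instead of a lookup.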

UNDER CONSTRUCTION (as of Nov. 28, 2008)



