
Ignorant Chunking
or How noun and verb phrase chunking can be
done without reference to parts of speech

by Gary J. Shannon

Started Dec. 05, 2008
Last Updated Nov 05, 2008

The Idea

Is it possible to properly chunk a sentence without any reference to parts of speech? I believe it might be.

Suppose each word is classified according to what position it may occupy in a bracketed chunk of two, three, or four words (or nested set of previously chunked words). Each word is given a code number that represents both the size of the group, and its position in that group. The words might be numbered like this:

[21 22]
[31 32 33]
[41 42 43 44]

If a given word can find surrounding words that have the appropriate chunking codes, then that word and its surrounding words are chunked as indicated. For example, suppose we find the word group "New York" in the text. We look up "new" and find it coded "21". Since we want to know if we can group that word with the following word, we look up "York" and find that it is coded "22". The sequence "21 22" then becomes the chunked element: "[21 22]" or "[New York]".
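The lookup just described can be sketched in a few lines of Python. This is only an illustration: the names decode and starts_chunk and the toy lexicon are my own assumptions, and nothing here handles application order yet.

```python
def decode(code):
    """Split a two-digit chunking code into (group size, position in group)."""
    size, position = divmod(code, 10)
    return size, position


def starts_chunk(codes, i):
    """Return the group size if a complete group begins at index i, else None."""
    size, position = decode(codes[i])
    if position != 1 or i + size > len(codes):
        return None
    # A group of size n must read n1, n2, ..., nn in order.
    expected = [size * 10 + p for p in range(1, size + 1)]
    return size if codes[i:i + size] == expected else None


# Toy lexicon (hypothetical storage format) holding the two entries from the text.
lexicon = {"new": 21, "york": 22}
words = ["New", "York"]
codes = [lexicon[w.lower()] for w in words]

print(decode(21))              # (2, 1): a group of two, first position
print(starts_chunk(codes, 0))  # 2: "New York" forms a complete [21 22] group
```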

Likewise, we might find that "has" is coded "21" and "been" is coded "22", allowing us to chunk the pair "[has been]". This raises the obvious question of possible mis-chunking, as in the sentence: "The bird cage is open." Here, we look up "the" and find it coded "21", and we find "bird" coded "22", which would allow the chunk "[the bird]". However, we clearly need to group "bird" with "cage" before grouping anything with "the". This requires that each coding also be given an application order. The sentence is scanned multiple times, each time chunking groups of the next higher application order. Thus, the chunk [bird cage] comes about because "bird 21" and "cage 22" both have lower application orders than "the 21" and "bird 22".

We might give each code number a two-digit application order with "00" being the first in order and "99" being the last. Thus, with the dictionary:

bird 0021 1022
the 1021
cage 22 (Note that absence of an application order means the chunking code applies at all order levels.)

The first pass, using only order 00 codes, gives the chunk: "The [bird cage]", and the second pass, with order 10 codes, gives the chunk "[The [bird cage]]". The only unanswered question is, how did we decide that the chunk [bird cage] should, itself, be coded "22"?
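To make the role of the order prefix concrete, here is a small sketch that splits each dictionary entry into its application order and its chunking code, and then reports which codes a word offers on a given pass. The function names and the dictionary layout are assumptions made for illustration only.

```python
def parse_entry(token):
    """Parse '0021' into (0, 21). A bare '22' returns (None, 22): every pass."""
    if len(token) == 4:
        return int(token[:2]), int(token[2:])
    if len(token) == 2:
        return None, int(token)
    raise ValueError("unexpected entry: " + token)


# The three-word dictionary from the text, stored as lists of code strings.
dictionary = {
    "bird": ["0021", "1022"],
    "the": ["1021"],
    "cage": ["22"],
}


def active_codes(word, order):
    """Codes of a word usable on the pass with this application order."""
    result = []
    for token in dictionary.get(word.lower(), []):
        entry_order, code = parse_entry(token)
        if entry_order is None or entry_order == order:
            result.append(code)
    return result


for order in (0, 10):
    print(order, [(w, active_codes(w, order)) for w in ("The", "bird", "cage")])
```

On the order 00 pass only "bird 21" and "cage 22" are visible, so the only possible pair is [bird cage]. On the order 10 pass "the 21" becomes visible, but what code the finished chunk [bird cage] should carry is exactly the open question raised above.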

Taking a more complex example: The clown in the circus is a very funny fellow. We would like to see it chunked this way:

[[[The clown] in [the circus]] is [a [[very funny] fellow]]]

Here, "very" is coded "21" and "funny" is coded "22", yet the chunk "[very funny]" must be coded "21", not "22". In order to accomplish this, we introduce the notion of dominance. Each coded word is either dominant or recessive. The word "cage 22" is dominant, passing its code to the chunk it forms, while the word "very 21" is the dominant word in its pair, so that the chunk inherits that dominant code, "21". A dominant code is marked with "*".

It might also happen that one of the words is forced to be recessive, even in the absence of a dominant partner. To assert that a word is always recessive, its code is marked with "-".

We must also adjust the application orders so that "[very funny]" gets chunked before "[funny fellow]", giving the proper nesting "[a [[very funny] fellow]]". Thus, our dictionary becomes something like:

a 3021-
bird 1021 2022*
cage 22*
fellow 1022
funny 0022- 1021-
the 3021-
very 0021
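The dominance rule can be written down as a small function. This is a sketch of one reading of the rule, not a settled definition: I assume that a member marked "*" always wins, that otherwise the only member not marked "-" wins by default, and that the chunk inherits nothing when neither test picks out a single member (the text does not cover that case).

```python
def inherited_code(members):
    """members: (code, mark) pairs for one chunk; mark is '*', '-' or ''.

    Returns the code the chunk inherits, or None when no single member
    can be identified as dominant (an assumption; the text gives no rule).
    """
    starred = [code for code, mark in members if mark == "*"]
    if len(starred) == 1:
        return starred[0]
    if starred:
        return None
    # No explicit dominant member: fall back on whoever is not recessive.
    candidates = [code for code, mark in members if mark != "-"]
    return candidates[0] if len(candidates) == 1 else None


print(inherited_code([(21, ""), (22, "*")]))   # bird + cage*          -> 22
print(inherited_code([(21, ""), (22, "-")]))   # very + funny-         -> 21
print(inherited_code([(21, "-"), (22, "")]))   # the- + an unmarked 22 -> 22
```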

Now let's apply the codes to the sentence: The bird cage held a very funny bird. The order 00 pass chunks only these words: "very 0021" and "funny 0022-", with code "22" being recessive:

The bird cage held a [very funny](21) bird.

The order 10 pass sees [very funny] as if it were a single word with code 21. It uses only the order "10" codes: "bird 1021" and "cage 22*". The first occurrence of "bird 21" finds its pair in "cage 22", and forms a chunk. The second occurrence of "bird 21" finds no following word coded "22", and fails to form a chunk. Likewise, the chunk "[very funny] 21" fails to find a word following it coded "22". The result is thus:

The [bird cage](22) held a [very funny](21) bird.

The order 20 pass uses the word, "bird 2022*", and the chunks, "[bird cage] 22" and "[very funny] 21". The chunk "[very funny] 21" finds its mate in "bird 22*" and a new chunk is formed:

The [bird cage](22) held a [[very funny] bird](22).

The order 30 pass uses the words "a 3021-" and "the 3021-", along with the chunks "[bird cage] 22" and "[[very funny] bird] 22". Both "the" and "a" find their mates and the chunking is now:

[The [bird cage]](22) held [a [[very funny] bird]](22).

Finally, we might add "held 4032", and add the codes "4031" and "4033" to the dictionary entries for "bird" and "cage". Since the phrases inherit the codes of the dominant words at each level, that would give us the coding:

[The [bird cage]](4031) held(4032) [a [[very funny] bird]](4033).

Now the chunk "[The [bird cage]] 31" looks for two following words or chunks with codes "32" and "33", and finds that they are present, chunking the three groups into a complete sentence as:

[[The [bird cage]] held [a [[very funny] bird]]].

And the sentence is properly chunked.
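The whole walkthrough can be replayed with a short program. What follows is a sketch under several assumptions of mine, none of which is fixed by the text: both "bird" and "cage" receive both new codes "4031" and "4033"; a chunk keeps the code its dominant member used (valid on every later pass) plus that member's higher-order entries; group sizes are tried in the order 2, 3, 4; each pass is repeated until nothing changes; and a chunk with no single dominant member inherits nothing.

```python
import re

# Dictionary from the walkthrough, plus the order 40 codes (assumed placement).
LEXICON = {
    "a": "3021-",
    "bird": "1021 2022* 4031 4033",
    "cage": "22* 4031 4033",
    "fellow": "1022",
    "funny": "0022- 1021-",
    "held": "4032",
    "the": "3021-",
    "very": "0021",
}


def parse(token):
    """'2022*' -> (20, 22, '*'); a bare '22' has order None (every pass)."""
    m = re.fullmatch(r"(\d\d)?(\d\d)([*-]?)", token)
    order = int(m.group(1)) if m.group(1) else None
    return (order, int(m.group(2)), m.group(3))


def make_word(word):
    entries = [parse(t) for t in LEXICON.get(word.lower(), "").split()]
    return {"text": word, "entries": entries}


def find_entry(item, code, order):
    """Return the item's entry for this code if it is active on this pass."""
    for entry in item["entries"]:
        if entry[1] == code and entry[0] in (None, order):
            return entry
    return None


def try_chunk(items, i, size, order):
    """Build a chunk of the given size starting at i, or return None."""
    group = items[i:i + size]
    if len(group) < size:
        return None
    used = [find_entry(it, size * 10 + k + 1, order) for k, it in enumerate(group)]
    if None in used:
        return None
    # Dominance: a starred member wins; otherwise the only non-recessive one.
    starred = [k for k, e in enumerate(used) if e[2] == "*"]
    free = starred or [k for k, e in enumerate(used) if e[2] != "-"]
    entries = []
    if len(free) == 1:
        dom = free[0]
        entries = [(None, used[dom][1], "")] + [
            e for e in group[dom]["entries"]
            if e != used[dom] and (e[0] is None or e[0] > order)
        ]
    text = "[" + " ".join(it["text"] for it in group) + "]"
    return {"text": text, "entries": entries}


def chunk(sentence, orders=(0, 10, 20, 30, 40)):
    """Run one pass per application order; return the sentence after each pass."""
    items = [make_word(w) for w in sentence.split()]
    trace = []
    for order in orders:
        changed = True
        while changed:  # rescan this order until no new chunk appears
            changed = False
            i = 0
            while i < len(items):
                for size in (2, 3, 4):
                    new = try_chunk(items, i, size, order)
                    if new:
                        items[i:i + size] = [new]
                        changed = True
                        break
                i += 1
        trace.append(" ".join(it["text"] for it in items))
    return trace


for line in chunk("The bird cage held a very funny bird"):
    print(line)
```

Under these assumptions the five printed lines follow the five passes described above, ending with "[[The [bird cage]] held [a [[very funny] bird]]]". The sentence is passed in without its final period, since punctuation has no dictionary entry here.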

To determine the proper order number(s) and chunking code(s) for each word, a large number of sentences are hand-chunked according to this scheme, and a program is written to analyze the depth of nesting for each pair, with the deepest nesting becoming the lowest application order number. Some words will be coded differently at different nesting levels, as with "funny" being "22" at one order level and "21" at a later order level.
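A first step toward such a program might look like the sketch below. It reads one hand-chunked sentence, measures how deeply each chunk is nested, and assigns provisional application orders so that the deepest chunk gets order 0. The function names and the step of 10 are assumptions, and the sketch works on a single sentence only: it does not reconcile the orders demanded by different sentences, which is the harder part of the problem.

```python
def parse_brackets(text):
    """Turn '[a [b c]]' into nested lists: ['a', ['b', 'c']]."""
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    stack = [[]]
    for tok in tokens:
        if tok == "[":
            stack.append([])
        elif tok == "]":
            if len(stack) == 1:
                raise ValueError("unbalanced brackets: extra ']'")
            done = stack.pop()
            stack[-1].append(done)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced brackets: missing ']'")
    if len(stack[0]) != 1 or not isinstance(stack[0][0], list):
        raise ValueError("expected one fully bracketed sentence")
    return stack[0][0]


def render(node):
    """Inverse of parse_brackets, used to label chunks."""
    if isinstance(node, str):
        return node
    return "[" + " ".join(render(child) for child in node) + "]"


def chunk_depths(node, depth=1, out=None):
    """Collect (depth, chunk) for every chunk; the whole sentence is depth 1."""
    out = [] if out is None else out
    if isinstance(node, list):
        out.append((depth, node))
        for child in node:
            chunk_depths(child, depth + 1, out)
    return out


def provisional_orders(hand_chunked, step=10):
    """Map each chunk to an order: deepest nesting gets 0, each level up adds step."""
    found = chunk_depths(parse_brackets(hand_chunked))
    deepest = max(depth for depth, _ in found)
    return {render(node): (deepest - depth) * step for depth, node in found}


orders = provisional_orders("[[The [bird cage]] held [a [[very funny] bird]]]")
for chunk_text, order in sorted(orders.items(), key=lambda kv: kv[1]):
    print(order, chunk_text)
```

For this one sentence the sketch puts [bird cage] and [[very funny] bird] on the same level, whereas a dictionary that must also serve other sentences may need to separate them, so the numbers it prints are only a starting point.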

From a large enough sample of sentences it should be possible to compute the codes and order levels for all the words that occur in those sentences, keeping in mind that words that might be chunked at one order level might need to be chunked at a different order level in another sentence. However, it should be possible to find a set of order levels that satisfies the requirements of all the sentences in the corpus. For example:

I bought a bag.

vs

I bought a big bag of buttered popcorn.

The first sentence implies that "a" and "bag" can be chunked at the first level, whereas the second sentence requires that "a" is chunked with "[[big bag] of [buttered popcorn]]" much later. As long as the chunking order is arranged to account for the worst-case sentence, the shallower sentences will also chunk correctly, with several intermediate levels causing no new chunking to take place at those levels. (Note the use of the "32" word "of" which chunks with a "31" and "33" to give "[X of Y]".)

It might also be the case that repeated chunking must occur at the same level until no new chunking is produced. Thus "the big brown angry dog" might be chunked three times at the same level to yield the successive results:

the big brown angry dog
the big brown [angry dog]
the big [brown [angry dog]]
the [big [brown [angry dog]]]

Since the pairs "big(21) brown(21)" or "brown(21) angry(21)" will not satisfy any chunking requirements, they cannot be mis-chunked at any level.
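Repeating a pass until nothing changes is easy to sketch. Here I assume, for illustration only, that "big", "brown", and "angry" are all coded "21", that "dog" is coded "22", that the "22" member is dominant so each new chunk is again coded "22", and that "the" has no code at this level.

```python
def pass_once(items):
    """One left-to-right scan over (text, code) pairs, joining 21/22 neighbours."""
    out, i = [], 0
    while i < len(items):
        if i + 1 < len(items) and items[i][1] == 21 and items[i + 1][1] == 22:
            text = "[" + items[i][0] + " " + items[i + 1][0] + "]"
            out.append((text, 22))  # assumption: the "22" member is dominant
            i += 2
        else:
            out.append(items[i])
            i += 1
    return out


def chunk_to_fixpoint(items):
    """Repeat the same pass until it produces no new chunk; return every state."""
    history = [items]
    while True:
        nxt = pass_once(history[-1])
        if nxt == history[-1]:
            return history
        history.append(nxt)


phrase = [("the", None), ("big", 21), ("brown", 21), ("angry", 21), ("dog", 22)]
for state in chunk_to_fixpoint(phrase):
    print(" ".join(text for text, _ in state))
```

The four printed lines are the four states listed above. A fourth scan finds nothing to join, because "the" carries no code at this level, and the loop stops.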

Some typical parses might look like this:

[[Benches [were built]] along [the [side porch]]]
[[Between [[the cottage] and [the wall]]] grew [a [splendid [old oak]]]]
[[Books of [voyage and travel]] became [his passion]]
[[Caesar , [[the leader] of [the [Roman army]]] ,] was [a [mighty warrior]]]
[[Colder and louder] blew [[the wind] , [[a gale] from [the northeast]]]]
[[Each [nook and corner]] was [clean and orderly]]
[[[Evangeline , [[a tale] of Acadia] ,] [was written]] by [Longfellow , [an [American poet]]]]
[[Every [spring and fall]] [[our cousins] pay us [a [long visit]]]]
[[Everybody looks happy] on [Christmas day]]
[[Gertrude , [Queen of Denmark] ,] was [[the mother] of Hamlet]]
[[He is small] , but strong]
[[He played [a tune]] on [his [wonderful pipe]]]
[[He remained poor] [all [his life]]]
[[He stayed away] [a month]]
[[Her [ribbons and laces]] looked [fresh and new]]
[[Hiawatha lived] with [[his grandmother] , Nokomis]]
[[His carelessness] caused [his father] [much anxiety]]
[[His character] is [above reproach]]
    

UNDER CONSTRUCTION. More to follow as I experiment further with this concept.