Building a Context Network from a Narrative
An Approach to Chatbot Question Answering
Page Five: Nuts and Bolts 1: Part of Speech Tagging

by Gary J. Shannon

Started Nov. 19, 2008
Last Updated Nov 20, 2008
Page One: Overview
Page Two: Details of Parsing
Page Three: World Models and Reasoning by Analogy
Page Four: Exploring Some Complications
Page Five: Nuts and Bolts 1: Part of Speech Tagging
Page Six: Nuts and Bolts 2: From Tags to Nodes
Page Seven: Chunking and Knowledge Units

First Step

Before any kind of meaningful analysis can be done on an input sentence, it would be helpful to know which part of speech each word in the sentence represents, and what role the word plays in whatever phrase it belongs to. Since so many English words can act as different parts of speech in different contexts, tagging each word with the proper part of speech (POS) is not completely trivial. In fact, some of the best state-of-the-art taggers only manage to correctly tag something like 95% to 97% of the words in a complex corpus such as The Brown Corpus or The Wall Street Journal Corpus.

Of course, sentences from casual chatbot conversations are not likely to be as complex and convoluted as sentences in the pages of the Wall Street Journal, so many of the difficult situations that trip up most POS taggers will never be seen by a chatbot. With that in mind, I decided to take a straightforward approach to tagging based on the general idea behind a tagger created by Eric Brill in 1993 as his PhD thesis.

The Brill Tagger

The Brill Tagger looks up each word in a lexicon of words collected from a training corpus. After each word in the sentence is given a preliminary POS tag, the sentence is checked against a series of so-called context rules which may or may not change the tag on the word according to what other words or tags are found in the immediate vicinity of the word in question. The Brill tagger was written to learn these context rules by comparing its own tagging of a sentence to the human-edited standard tagging for that same sentence. The tags applied to the words in the sentence are those tags defined as the standard tag set for The Penn Treebank. (For example: JJ = adjective, NN = noun, VB = verb in base form, etc.)

The context rules take the form of a list of tests, and what changes to make if each particular test is passed. For example:

NN VB PREVTAG TO        - Change tag NN to VB if previous tag is TO
VBD VBN PREV1OR2TAG VBD - Change tag VBD to VBN if either of the two previous tags is VBD
IN RB WDAND2AFT as as   - Change tag IN to RB if word is "as" and word two after is "as"
RB JJ NEXTTAG NN        - Change tag RB to JJ if next tag is NN
JJ NN SURROUNDTAG DT IN - Change tag JJ to NN if this tag is surrounded by tags DT and IN
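
As a rough illustration of how rules of this kind operate, here is a minimal Python sketch. It is not Brill's code: the function name apply_rule, the tuple layout of a rule, and the choice to test conditions against the tags as they stood before the rule ran are all assumptions made for the example, and only three of the rule templates are covered.

```python
# Illustrative sketch only: a tiny interpreter for three Brill-style templates.
# The rule layout (old, new, template, args...) is an assumption, not Brill's format.

def apply_rule(tags, rule):
    """Apply one context rule to a list of tags and return a new list."""
    old, new, template, *args = rule
    result = list(tags)
    for i, tag in enumerate(tags):
        if tag != old:
            continue
        prev_tag = tags[i - 1] if i > 0 else None
        next_tag = tags[i + 1] if i + 1 < len(tags) else None
        if template == "PREVTAG" and prev_tag == args[0]:
            result[i] = new
        elif template == "NEXTTAG" and next_tag == args[0]:
            result[i] = new
        elif template == "SURROUNDTAG" and (prev_tag, next_tag) == (args[0], args[1]):
            result[i] = new
    return result


# "I want to book a flight": "book" starts out with its most common tag, NN.
tags = ["PRP", "VBP", "TO", "NN", "DT", "NN"]
tags = apply_rule(tags, ("NN", "VB", "PREVTAG", "TO"))
print(tags)  # ['PRP', 'VBP', 'TO', 'VB', 'DT', 'NN']
```

Only the NN that directly follows TO is changed; the final NN is left alone because its previous tag is DT.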
      

Building on Brill

While the basic idea of my own tagger is similar to that of the Brill tagger, the details of my implementation vary considerably. Briefly, the changes were:

The Lexicon

I began with the combined word list from the Brown and WSJ corpora. This lexicon consisted of 93,696 words. More than two thirds of those words turned out to be garbage, so the de-junked lexicon ended up containing 27,951 words. (The average English-speaking avid reader has a reading vocabulary of around 30,000 words.) As an example of the kinds of problems with the original lexicon, which was collected by a blind computer with no adult supervision, the word "the" had 102 entries, including "the...", "THE", "The", "the", "--the", "problem-the", "Rodney-the", "honey-in-the-sun", and so on. Even among the entries that contained only the word "the" in various mixes of upper and lower case, each entry was classified with a different set of POS tags. If, for example, the name of the fast-food restaurant "Jack In The Box" ever appeared in a WSJ article, then each word in that name was tagged, in capitalized form, as a proper noun. Hence the multiple entries for "In", "The", and "Box" as proper nouns. Multiply that by the number of company names, movie and book titles, and other proper nouns (e.g. separate multiple entries for each of "Lord", "Of", "The", "Rings", "General", "Motors", "Telephone", "And", "Telegraph", ...) and the magnitude of the problem becomes clear.

This reminds me of "training to the test", where school children are taught by rote to know the answers to specific questions that will appear on a forthcoming test. There was no attempt to actually make any sense out of the strings of characters collected into the lexicon, and their usefulness as lexicon entries could not, by any stretch of the imagination, serve any purpose beyond scoring well on a POS tagging test of that specific corpus. Since the goal of this project is for the program to understand input in the form of random, non-corpus based conversational sentences, that approach to lexicon building is too specialized to be useful.

As a further example of lexicon garbage, each cardinal number ever encountered in the corpus was entered into the lexicon and tagged as a cardinal number. Thus if "127" ever appeared in the training corpus then the tagger would know that "127" was a cardinal number. If, however, "126" and "128" did not happen to appear in the training corpus, the tagger would not know what to do with either of them. The fact that the lexicon could "know" what 127 is, but not have a clue about 126 or 128 is clear evidence that the tagger does not understand what a number is, in general. (Of course the Brill tagger was never intended to understand this generalization, so this is not a condemnation, but merely an observation.)

Plurals, Numbers, and Other Generalizations

To address these shortcomings, the new tagger was given a complete numeric preprocessor that is capable of understanding, and converting to pure numeric form, a wide variety of numeric and number-word phrases, including such slang terms as "a buck fifty two", "two bits", "a hundred and thirty two thousand six hundred and twelve point two seven nine", "a couple dozen", and quantities written with digits like "3.14159". The result of this step is that all words and phrases that can be interpreted numerically are given a standard-form numeric value such as "NUM=24" instead of "two dozen".
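
To give a feel for this step, here is a minimal Python sketch that handles only a small subset of whole-number phrases. It is not the actual preprocessor: the word tables, the function names words_to_number and normalize, and the grouping logic are assumptions made for illustration, and money slang and decimals ("point two seven nine") are left out.

```python
# Illustrative sketch only: converts a small subset of number-word phrases.
# Word tables and function names are assumptions, not the tagger's own.

UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
SMALL_MULT = {"dozen": 12, "hundred": 100}
BIG_MULT = {"thousand": 1000, "million": 1000000}


def words_to_number(phrase):
    """Return the integer value of a number phrase, or None if a word is unknown."""
    total = 0    # completed groups (thousands, millions)
    current = 0  # the group currently being built
    for word in phrase.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in ("a", "an"):
            current = current or 1      # "a hundred" -> 1 * 100
        elif word == "couple":
            current = 2                 # "a couple" -> 2
        elif word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word in SMALL_MULT:
            current = max(current, 1) * SMALL_MULT[word]
        elif word in BIG_MULT:
            total += max(current, 1) * BIG_MULT[word]
            current = 0
        else:
            return None
    return total + current


def normalize(phrase):
    """Return the standard form, e.g. 'NUM=24', or the phrase unchanged."""
    value = words_to_number(phrase)
    return phrase if value is None else "NUM=%d" % value


print(normalize("two dozen"))  # NUM=24
```

Because the value is computed rather than looked up, "126" and "128" are no harder than "127".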

Much like the case for numbers, if, for example, the original corpus contained the word "horses" but not the word "horse", then the tagger would know the plural of that noun without the possibility of recognizing its singular form. Likewise if a word were stored only in its singular form there's no chance it would be recognized in its plural form.

The new tagger was also given an extensive set of rules regarding the formation of plural nouns in English, and can recognize the plural form of any singular noun in the lexicon. As a result, nearly all the plural nouns in the lexicon could be discarded, further reducing its size. (Some plurals have no singular form such as "clothes", "pants", or "pliers", even though these can sometimes occur in singular form as adjectives rather than as nouns: "pant leg".)
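
The idea can be sketched in a few lines of Python. This is only an illustration under assumed names: the tiny LEXICON, the handful of suffix rules, and the functions singular_candidates and tag_noun are hypothetical stand-ins for the much more extensive rule set described above.

```python
# Illustrative sketch only: recognize a plural by undoing common suffix rules
# and checking each candidate singular against the lexicon.
# The lexicon contents and the rule list are made up for the example.

LEXICON = {"horse": "NN", "box": "NN", "city": "NN", "knife": "NN",
           "clothes": "NNS"}  # "clothes" has no singular, so it stays listed


def singular_candidates(word):
    """Yield possible singular forms for a word that may be a regular plural."""
    if word.endswith("ies") and len(word) > 4:
        yield word[:-3] + "y"    # cities -> city
    if word.endswith("ves"):
        yield word[:-3] + "f"    # wolves -> wolf
        yield word[:-3] + "fe"   # knives -> knife
    if word.endswith("es"):
        yield word[:-2]          # boxes -> box
    if word.endswith("s"):
        yield word[:-1]          # horses -> horse


def tag_noun(word):
    """Return 'NN' or 'NNS' if the word or its singular is known, else None."""
    if word in LEXICON:
        return LEXICON[word]
    for candidate in singular_candidates(word):
        if LEXICON.get(candidate) == "NN":
            return "NNS"
    return None


print(tag_noun("horses"), tag_noun("knives"), tag_noun("zebras"))  # NNS NNS None
```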

Verb forms were also generalized so that all regular verbs had their past, participle, and gerund forms removed from the lexicon. Given a regular verb in any form, the tagger determines, algorithmically, what the root verb is. There are also rules for forming nouns from other nouns and verbs, and vice versa, so that given the verb "haggle", for example, the noun "haggler" would be recognized and properly tagged even though it is not in the lexicon as "haggler". Similar rules are in place for verbal prefixes like "un-", "re-", "under-", "over-", and "de-".

The net result is that this tagger has a lexicon file less than one third as large as that of the Brill tagger, yet recognizes, morphologically, as many as five times more words than the lexicon actually contains. Given the single entry for "construct", for example, the tagger will recognize and correctly tag "constructs", "constructed", "constructing", "constructor", "constructer", "construction", "deconstruct", "deconstructs", "deconstructed", "deconstructing", "deconstructor", "deconstructer", "deconstruction", and so on.
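
A stripped-down Python sketch of this kind of morphological lookup follows. The actual tagger is a C++ program and its rules are not reproduced here; the two-entry LEXICON, the prefix and suffix tables, the tags assigned to each suffix, and the function name analyze are all assumptions chosen to keep the example short.

```python
# Illustrative sketch only: strip a known prefix and suffix to reach a lexicon root.
# Tables and tags are simplified assumptions (e.g. "-ed" is always tagged VBD here).

LEXICON = {"construct": "VB", "haggle": "VB"}
PREFIXES = ("under", "over", "un", "re", "de")
# suffix -> (endings to try when restoring the root, tag for the derived word)
SUFFIXES = {
    "ing": (("", "e"), "VBG"),
    "ion": (("", "e"), "NN"),
    "ed": (("", "e"), "VBD"),
    "er": (("", "e"), "NN"),
    "or": (("", "e"), "NN"),
    "s": (("",), "VBZ"),
}


def analyze(word):
    """Return (prefix, root, tag) if the word reduces to a lexicon entry, else None."""
    for prefix in ("",) + PREFIXES:
        if not word.startswith(prefix):
            continue
        stem = word[len(prefix):]
        if stem in LEXICON:
            return prefix, stem, LEXICON[stem]
        for suffix, (endings, tag) in SUFFIXES.items():
            if stem.endswith(suffix):
                base = stem[:-len(suffix)]
                for ending in endings:
                    if base + ending in LEXICON:
                        return prefix, base + ending, tag
    return None


print(analyze("deconstructing"))  # ('de', 'construct', 'VBG')
print(analyze("haggler"))         # ('', 'haggle', 'NN')
```

Note that "constructer" and "constructor" both reduce to the same root, which is why misspelled agent nouns are still recognized.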

This faculty also gives the tagger the ability to recognize incorrect forms like "teachor", "instructer", and "accounter", as well as "made-up" words built from other words, like "alphabeticalizer": from noun ("alphabet") to adjective ("-ical") to verb ("-ize") to noun ("-er"), all from the single lexicon entry for "alphabet". Such tolerance for misspellings and made-up words is vital to a robust chatbot that needs to handle random conversational input.

Another task of the preprocessor is to replace idioms with their standard English equivalents, and to simplify the vocabulary. Since the goal is understanding, each word in the vocabulary must not only be recognized, but it must be understood. It is beneficial, therefore, to reduce the total number of words that need to be understood. Charles Ogden's Basic English, for example, uses a vocabulary of 850 words, substituting longer phrases in place of words not in the basic list of 850. While Ogden's set may be too sparse for our purposes, it does point in generally the right direction. Another example is Longman's Defining Vocabulary which lists the 2197 words needed to define all the other words in the English language.

Since the chatbot does not need to understand subtle nuance, it is perfectly acceptable for the POS tagger to modify the sentence into a simpler form before proceeding. Thus we might replace "large" with "big", "huge" with "very big", "acquire" with "get", and "adjacent to" with "beside". This vocabulary compression leaves us with fewer word meanings to deal with when parsing out the meaning of the sentence.

Idiomatic phrases are also removed at this stage and simple English translations inserted. Thus "chat", "gab", "chew the fat", and "shoot the breeze" all become "talk", while expressions like "It couldn't hurt", "check things out", "come up with", "don't bother", "threw his arms around", "undying gratitude", "lift a finger", (and on and on) are replaced by their basic English translations. Since this all happens before tagging begins, there is no effort wasted on tagging words that don't need to be understood outside of their idiomatic context.
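
The substitution pass can be pictured with a short Python sketch. The table below uses only replacements mentioned in the text, but the table structure, the regular-expression approach, and the function name simplify are assumptions; the real thesaurus supports wildcards and is not shown here. Inflected forms such as "chatted" are deliberately not handled.

```python
import re

# Illustrative sketch only: phrase-level substitution applied before tagging.
# The replacement pairs come from the examples in the text; the mechanism is assumed.
REPLACEMENTS = {
    "shoot the breeze": "talk",
    "chew the fat": "talk",
    "adjacent to": "beside",
    "chat": "talk",
    "gab": "talk",
    "large": "big",
    "huge": "very big",
    "acquire": "get",
}

# Try the longest phrases first so multi-word idioms win over single words.
_PHRASES = sorted(REPLACEMENTS, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in _PHRASES) + r")\b",
    re.IGNORECASE,
)


def simplify(sentence):
    """Replace idioms and less common words with simpler equivalents."""
    return _PATTERN.sub(lambda m: REPLACEMENTS[m.group(1).lower()], sentence)


print(simplify("We chew the fat adjacent to a huge tree"))
# We talk beside a very big tree
```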

Context Rule Format

The format I've used for the context rules essentially shows the context of the change word visually. Instead of using a command like PREVTAG or NEXTTAG, I simply put the previous or next tag in its appropriate place in the command. The tag to be changed is written XX>YY, where XX is the existing tag and YY is what that tag should be changed to. The rules may include a mandatory wildcard *, an optional wildcard #, or both. For example:

For NN VB PREVTAG TO write: TO NN>VB
For RB JJ NEXTTAG NN write: RB>JJ NN
For JJ NN SURROUNDTAG DT IN write: DT JJ>NN IN
For VBD VBN PREV1OR2TAG VBD write: VBD # VBD>VBN
For IN RB WDAND2AFT as as write: IN=as>RB * as

This also allows rules of arbitrary complexity or specificity which cannot be written in the original Brill tagger context rule set. For example:

DT JJ NN VBD RB>RB FST

Notice that the above rule calls for changing tag RB to tag RB. It appears that the rule does nothing at all. However, keep in mind that in this tagger each word is initially tagged with all possible tags for that word. What this rule says is that IF the word MAY be tagged RB then mark it as DEFINITELY tagged RB. In other words, the word might have carried the full tag VB+RB+JJ+NN before that rule was applied, but will carry only the tag RB after that rule is applied. For that reason the rule set includes numerous rules such as DT JJ>JJ NN, to change a possible adjective to a definite adjective when surrounded by a determiner and a noun, or NN VB>VB IN, to change a possible verb to a definite verb in the context shown.
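
The positional format and the "possible to definite" narrowing can be sketched together in Python. This is a guess at the mechanics rather than the tagger's actual code. The assumptions are: a context tag matches when it is among a word's candidate tags, an all-lowercase token is a literal word, * stands for exactly one word, and # stands for zero or one word. The names expand_optional, token_matches, and apply_rule are hypothetical.

```python
# Illustrative sketch only: positional context rules over sets of candidate tags.
# Matching semantics are assumptions; see the lead-in paragraph.

def expand_optional(tokens):
    """Expand each '#' into two variants: slot absent, or slot filled by '*'."""
    variants = [[]]
    for tok in tokens:
        if tok == "#":
            variants = [v + extra for v in variants for extra in ([], ["*"])]
        else:
            variants = [v + [tok] for v in variants]
    return variants


def token_matches(tok, word, tags):
    """Test one rule token against one (word, candidate tags) pair."""
    if tok == "*":
        return True
    if tok.islower():                  # literal word, e.g. "as"
        return tok == word.lower()
    tok = tok.split(">")[0]            # for the change target, test the old tag
    tag, _, required_word = tok.partition("=")
    if required_word and required_word != word.lower():
        return False
    return tag in tags


def apply_rule(sentence, rule):
    """Return a copy of the sentence with the rule applied wherever it matches."""
    result = [(w, set(t)) for w, t in sentence]
    for pattern in expand_optional(rule.split()):
        target = next(i for i, tok in enumerate(pattern) if ">" in tok)
        new_tag = pattern[target].split(">")[1]
        for start in range(len(sentence) - len(pattern) + 1):
            window = sentence[start:start + len(pattern)]
            if all(token_matches(tok, w, t)
                   for tok, (w, t) in zip(pattern, window)):
                result[start + target] = (window[target][0], {new_tag})
    return result


sentence = [("the", {"DT"}), ("light", {"JJ", "NN", "VB"}), ("bulb", {"NN"})]
print(apply_rule(sentence, "DT JJ>JJ NN")[1])  # ('light', {'JJ'})
```

With tag sets, a rule like DT JJ>JJ NN does real work: it discards NN and VB from "light" and leaves only JJ.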

Next Step

I won't dwell too long on the further details of the tagger except to mention that the thesaurus uses expressions that allow wildcard and text replacement in its entries, so in addition to translating idioms and simplifying vocabulary, it also takes the place of Brill's "Lexical Rules". The thesaurus, the lexicon, and the context rules are works in progress and will be fine-tuned as time goes on. For now let me just add that the tagger is fully functional, and implemented as a console app in MSVC++ 6.0 under Windows XP.

The next step promises to be much more interesting: getting from a tagged sentence to the node structure discussed on earlier pages. That will be the subject of my next programming project, and the next page in this exploration.



