Humanising Language Teaching
"Multi-guity" and how computers (almost) cope with it
A koala bear goes into a restaurant. He orders a meal, eats it quietly, then asks for the bill. When the waiter brings the bill, the koala bear pulls out a gun and shoots him dead. He then calmly stands up, puts on his coat and walks out. The anguished restaurant manager comes running after him, shouting "Murderer! Why did you shoot my waiter?". The koala bear turns round and says: "But I was just doing what all koala bears do". "What do you mean?" the manager asks. "Well, look me up in a dictionary and you will find the answer", the koala bear replies. So the manager rushes back into his office, where luckily he has a dictionary lying on a shelf. Hurriedly turning to the entry for koala bear, he reads:
Koala bear: a tree-dwelling Australian marsupial which eats shoots and leaves.
Most of the words we use can have more than one meaning. Some - like shoot or leave, for example - may have well over 30 different meanings once you include things like phrasal verbs and idioms. Consequently, a simple sentence of five or six words could easily have hundreds of different potential interpretations. Yet in most forms of communication, whether we are having a conversation or reading a newspaper, genuine misunderstanding is rare. In the Longman Dictionary of Contemporary English, the word form tip has 20 separate meanings, but if I say "give the waiter a tip", no-one is going to ask me "Which sense of tip are we talking about here?" One of the mysteries of language is how well human beings deal with this issue of "multi-guity". (The more familiar term "ambiguity" seems inadequate because it implies a choice between just two possible meanings.) Computers, however, are much less intelligent than people, and have a very hard time working out what sentences mean.
This is an area of intense research activity, because the commercial rewards are potentially very great. When you visit an Internet search-engine like Altavista or AskJeeves, the system needs to be able to "understand" your requests in order to find the information you are looking for. In many cases, too, these systems will offer to translate the results of your query into another language, and you can't even begin to translate if you do not first understand the "source text" - which means successfully interpreting sentences that could have dozens of possible meanings. Dictation software, which allows you to bypass the keyboard and speak instructions to your computer, similarly depends on a degree of "machine understanding". How do computers cope with all this?
The short answer is that, at the moment, they do not cope particularly well. But progress in this field, though slow, is beginning to yield results. And one of the most successful technologies here is the automated part-of-speech tagger (or "POS-tagger"). This is a program that goes through a text and automatically assigns a part of speech to every word in it - effectively, saying "this word is a noun, that one is an adjective", and so on. The success rate for programs like this is now very high, with around 97% of tags assigned correctly. In fact, a good part-of-speech tagger is the first step in the process of automated understanding: taking the koala bear's definition, for example, a tagger would say "eats is a verb, but shoots is a plural noun in this sentence, and so is leaves". And this immediately rules out the more sinister interpretation that the joke is based on.
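The idea can be sketched with a toy lookup tagger in Python. Everything here is invented for illustration - the tiny lexicon, the tag names, and the single crude disambiguation rule - whereas real taggers use statistical models trained on large corpora. But it shows the basic move: assign each word a part of speech, using context to resolve ambiguous ones.

```python
# A minimal, illustrative part-of-speech tagger. Each word is looked up
# in a tiny hand-made lexicon; where a word is ambiguous (like "shoots"
# or "leaves") one crude contextual rule decides. Real POS-taggers use
# statistical models trained on large corpora; this is only a sketch.

LEXICON = {
    "eats":   ["VERB"],
    "shoots": ["NOUN", "VERB"],   # "bamboo shoots" vs "he shoots"
    "leaves": ["NOUN", "VERB"],   # "tea leaves" vs "she leaves"
    "and":    ["CONJ"],
    "it":     ["PRON"],
}

def tag(words):
    """Assign one tag per word, using the previous tag to disambiguate."""
    tags = []
    for i, word in enumerate(words):
        options = LEXICON.get(word.lower(), ["NOUN"])
        if len(options) == 1:
            tags.append(options[0])
        else:
            prev = tags[i - 1] if i > 0 else None
            # Invented rule: after a pronoun subject, read the word as a
            # verb ("it leaves"); otherwise, e.g. in a list of things the
            # koala eats, read it as a noun.
            tags.append("VERB" if prev == "PRON" else "NOUN")
    return list(zip(words, tags))

print(tag(["eats", "shoots", "and", "leaves"]))
```

Run on the koala's definition, the tagger reads "shoots" and "leaves" as nouns - the innocent interpretation - because nothing in the context signals a verb.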
For anyone using a corpus, there are great practical advantages in using POS-tagged text. If you are interested in looking at the behaviour of a word like tense, for example, it helps to be able to specify in advance that you only want the corpus to tell you about tense when it is a noun (or a verb, or an adjective). The most extreme example of this is the word see: this is one of the most common verbs in English, with well over 50 meanings. But see is a noun, too, a technical term in the Christian church, meaning a district that a bishop presides over. A lexicographer who wanted to investigate this rare noun would be swamped with irrelevant data if s/he simply asked the computer to provide examples of see without specifying the part of speech. This is a rather exceptional case, but a more typical use of the tagging facility is for narrowing down a search in order to focus on a specific pattern. For example, a general concordance for the noun dream might look like this:
e concrete road. He was in a dream, aware of the sh
The third and sixth lines here, referring to a "dream car" and a "dream home", reveal a pattern that is worth investigating further, and we can now make a more narrowly-focussed corpus search by specifying that we only want to see instances of "dream-as-a-noun immediately followed by another noun". The results (using a sample from the British National Corpus) look like this:
when he takes the dream car for a spin at Silversto
new Williams FW15 dream car was sitting pretty at i
(Martin) builds a dream house for the girl of his d
It wouldn't buy a dream kitchen but it was a start,
Why do you want a dream kitchen? Noreen wanted
ew bike and win a dream kitchen for his mother, and
planning to win a dream kitchen. She unpinned the
ldren is to get a dream kitchen. Phyllis, 59, ca
want to see if my dream man is among them.
ugh an impressive dream sequence, Westland traces t
WDNEY IS THIS the dream ticket America has been wai
. It really was a dream ticket for women. Where Cli
ink again of this dream woman, and added to the old
This gives us a lot more information to go on, though notice here that there are still limits to how smart the computer can be: the expression "dream sequence", for example, does not mean - as the others all do - "the type of sequence that is so good that you dream of having it". There is a degree of subtlety here that the machine is not yet capable of dealing with, but the results are nevertheless extremely useful.
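The narrowed search itself can be sketched in a few lines of Python. The tagged sample below is invented, and the word_TAG format and tag names are simplified stand-ins (the actual BNC uses the CLAWS tagset, with tags like NN1); the point is only to show how a "noun followed by a noun" query runs over tagged text and produces keyword-in-context lines like those above.

```python
# A sketch of the narrowed corpus search described above: given a
# POS-tagged text (here a tiny invented sample in a simplified word_TAG
# notation), find every "dream" tagged as a noun that is immediately
# followed by another noun, and build a short keyword-in-context (KWIC)
# line for each hit.

TAGGED_CORPUS = (
    "It_PRON would_VERB not_ADV buy_VERB a_DET dream_NOUN kitchen_NOUN "
    "but_CONJ it_PRON was_VERB a_DET start_NOUN ._PUN "
    "He_PRON was_VERB in_PREP a_DET dream_NOUN ,_PUN aware_ADJ of_PREP "
    "the_DET shouting_NOUN ._PUN "
    "I_PRON dream_VERB of_PREP a_DET dream_NOUN house_NOUN ._PUN"
)

def noun_noun_hits(tagged_text, keyword="dream", width=3):
    """Return KWIC lines for keyword-as-noun followed by another noun."""
    tokens = [t.rsplit("_", 1) for t in tagged_text.split()]
    hits = []
    for i, (word, pos) in enumerate(tokens):
        # keep only the keyword when it is tagged NOUN...
        if word.lower() == keyword and pos == "NOUN":
            # ...and only when the very next token is also a noun
            if i + 1 < len(tokens) and tokens[i + 1][1] == "NOUN":
                left = " ".join(w for w, _ in tokens[max(0, i - width):i])
                right = " ".join(w for w, _ in tokens[i + 1:i + 1 + width])
                hits.append(f"{left} [{word}] {right}")
    return hits

for line in noun_noun_hits(TAGGED_CORPUS):
    print(line)
```

Note how the tagging does the filtering for us: "in a dream, aware of..." is skipped because the next token is punctuation, and "I dream of..." is skipped because that dream is tagged as a verb - only "dream kitchen" and "dream house" survive.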
For learners and teachers of a language, a typical application of this technology might be to look at the way that prepositions (any prepositions) are used in conjunction with adjectives (any adjectives). A good place to start would be the BNC Sampler, a 2 million word sample of the full BNC, which comes with corpus-query software, and is available for just £30 sterling: go to http://info.ox.ac.uk/bnc for more information. The text on the Sampler is POS-tagged, and the reliability of the tags is in this case very high because they have been "manually" checked by human readers. This makes it an excellent resource for teachers and students who want to test out their own hypotheses against high-quality data. Then again, you could just make up some more ambiguous sentences to try out on your students.
Michael Rundell is a lexicographer, and has been using corpora since the early 1980s. As Managing Editor of Longman Dictionaries for ten years (1984-94) he edited the Longman Dictionary of Contemporary English (1987, 1995) and the Longman Language Activator (1993). He has been involved in the design and development of corpus materials of various types, including the BNC and the Longman Learner Corpus. He is now a freelance consultant, and (with the lexicographer Sue Atkins) runs the "Lexicography MasterClass", providing training courses in all aspects of dictionary development and dictionary use (see http://ds.dial.pipex.com/town/lane/ae345).