Humanising Language Teaching
Year 2; Issue 3; May 2000

Ideas from the Corpora

The biggest corpus of all

Mike Rundell 17 May 2000

For anyone who occasionally needs to check out a word or phrase but doesn't have access to corpus data and corpus software, a number of free alternatives are available on the web. For a start, you can run online "guest" searches on either the British National Corpus or on the Bank of English by visiting their respective websites (http://info.ox.ac.uk/bnc, http://titania.cobuild.collins.co.uk/direct_info.html). You can search for single words or for phrases of any (reasonable) length, and you can to some extent specify the text type you are interested in (for example, by asking the search-engine to look only at journalistic sources). As in most "demo versions", the output is deliberately limited in order to encourage you to opt for the full package: both systems offer a full, and very sophisticated, online service but you have to pay for it. But even with these limitations (you will only get back 40-50 corpus lines for any inquiry), both these sites are a good starting point for the occasional corpus query. Even 50 contextualized examples of "going to" would be a good basis for some useful hypothesis-testing.

Another interesting route is to think of the Internet itself as a giant corpus, and to use a standard search engine to find instances of a word or phrase. To do this, you go to Yahoo!, Altavista, or a similar site and type in a search word or phrase. If you were reckless enough to search for something like "going to", the system would tell you it has found 250 million occurrences for you to look at, and here are the first ten. Rule One, therefore, is to remember that the web contains more text than every corpus in the world put together, so it is only worth using if you are looking for something comparatively rare. I recently needed to look at the word Hitchcockian, for example, because the corpus resources I had available produced only a handful of instances (with none at all on the BNC). Searching the web proved a very useful exercise, yielding a couple of hundred fresh citations.

But search engines are not designed for the benefit of lexicographers, corpus linguists, or language teachers, so the output format is generally very unfriendly. In most cases, all you get back is the address of a webpage that contains your search item: to look at every context, you would need to open each of these pages and then search within them for the relevant item, which is not exactly a cutting-edge form of corpus access. One of the better search engines in this respect is www.google.com, which at least gives you some context when it returns the results of your query.

But a major breakthrough is at hand, in the form of a stunning new website that produces real "concordances". As with Altavista and others, http://webcorp.connect.org.uk/ searches the entire Internet for your query. But in this case the output is a proper concordance with an amount of surrounding context which the user (that's you) can specify in advance. The results, in other words, look very similar to what you might get from the BNC or COBUILD Direct, but in this case the "source data" is the vast store of text on the entire Internet. Here is just a small sample of hits from Webcorp for the expression feeding frenzy (saved as text and re-formatted):

tle out of control. We had a feeding frenzy by the media, and we had
e courts in the litigational feeding frenzy certain to ensue in a co
alifornia, where an Industry feeding frenzy continues regardless of 
s of dollars more using the "feeding frenzy effect" how to write ine
al regulations." Period. The feeding frenzy has begun. Big business 
aged by the sensationalistic feeding frenzy in the media. Before all 
, you'll understand what the feeding frenzy is all about. Oh, one mo
 happen - Peter Birk A media feeding frenzy- Joe Sabatini I side wit
alian beaches after offshore feeding frenzy - June 17, 1998 Shark si
he heat of the latest system feeding frenzy. Millennium Digital Medi
usiness (B2B)marketplaces, a feeding frenzy of optimistic reports is
t is certain that the unique feeding frenzy that occurs only at auct
e tip reef shark is during a feeding frenzy with other sharks. This 
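
For readers curious about what lies behind a display like this, the general keyword-in-context idea can be sketched in a few lines of Python. This is only an illustration and makes no claim about how Webcorp itself works: the function name `concordance`, the `width` parameter, and the sample text (invented, loosely echoing the lines above) are all assumptions. The sketch finds each occurrence of a phrase, keeps a fixed number of characters on either side, and sorts the lines on the text that follows the phrase.

```python
import re


def concordance(text, phrase, width=28):
    """Return keyword-in-context lines for each occurrence of phrase in text."""
    # Collapse line breaks and repeated spaces so the context window is
    # measured in characters of running text.
    flat = " ".join(text.split())
    hits = []
    for match in re.finditer(re.escape(phrase), flat, flags=re.IGNORECASE):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        hits.append((left, match.group(0), right))
    # Sort on the right-hand context so similar continuations sit together.
    hits.sort(key=lambda hit: hit[2].lower())
    # Right-justify the left context so the search phrase lines up in a column.
    return [f"{left:>{width}}{keyword}{right}" for left, keyword, right in hits]


# Invented sample text, used only to show the output format.
sample = (
    "We had a feeding frenzy by the media. "
    "The feeding frenzy has begun. "
    "Sharks join a feeding frenzy with other sharks."
)

for line in concordance(sample, "feeding frenzy", width=12):
    print(line)
```

Changing `width` gives more or less context on each side, which is the kind of choice a concordancer leaves to the user.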

The fact that the system searches every website there is has some interesting effects. For example, even though well over 80% of the web is in English, that still leaves hundreds of millions of words of other languages. So if your search word is used in more than one language, you may get a multilingual return. I tried a search on the word peste, and out of 50 occurrences (the search was still going at this point but I aborted it), 24 were French, 17 Spanish, 6 Italian, and the other 3 (I think) Catalan. It might be interesting, too, to see what comes up when you look for loanwords in English from other European languages (such as zeitgeist or chiaroscuro).

The big question here is about the actual value of the web as a corpus. In fact, of course, it is not a corpus at all according to any of the standard definitions: what it is is a huge ragbag of digital text, whose content and balance are largely unknown. It is, in the jargon, a highly "skewed" archive, in that some text-types are very well represented, and others are hardly present at all. Contemporary fiction, for instance, exists only in tiny amounts on the web, but any respectable general corpus would include a significant percentage of this important and influential text-type. So the first caveat is that the web should not be regarded as a representative sample of English (or any of its other languages), and cannot therefore be used as a basis for making reliable generalizations about linguistic behaviour. One aspect of its lack of balance emerged from a search for the expression eye candy. This is a fairly recent coinage for describing something that is visually appealing but doesn't have much content: for example, a recent Guardian article about US sitcoms like Friends used this term to describe the genre. But a search on Webcorp turned up dozens of examples showing eye candy as a name for software accessories like screensavers or desktop-graphics that put pretty pictures on the background of a computer screen. The point, of course, is that an awful lot of the material on the web is about computing, so that searches like this can end up being rather self-referential.

But the web has two significant advantages over conventional corpora: its size and its up-to-dateness. The huge volume of searchable data means that you are likely to find plenty of hits for even the rarest words and phrases. And no stable corpus, however regularly it is updated, can match the Net in terms of being as up to date as yesterday.


Michael Rundell is a lexicographer, and has been using corpora since the early 1980s. As Managing Editor of Longman Dictionaries for ten years (1984-94) he edited the Longman Dictionary of Contemporary English (1987, 1995) and the Longman Language Activator (1993). He has been involved in the design and development of corpus materials of various types, including the BNC and the Longman Learner Corpus. He is now a freelance consultant, and (with the lexicographer Sue Atkins and computational linguist Adam Kilgarriff) runs the "Lexicography MasterClass" (http://www.lexmasterclass.com), providing training courses in all aspects of dictionary development and dictionary use.
