Automating the Creation of Dictionaries: Are We Nearly There?
Michael Rundell is a lexicographer and corpus linguist. He has been a dictionary editor for over 40 years, and has had spells working for Pearson (as editor of the Longman Dictionary of Contemporary English and Longman Language Activator), and for Macmillan (as Editor-in-Chief of the Macmillan English Dictionary). He is now Chief Lexicographic Officer for Lexical Computing Ltd, the company which runs the Sketch Engine software. In this role, he also teaches lexicography and lexical computing at the annual Lexicom Workshops. He is the co-author (with Sue Atkins) of the Oxford Guide to Practical Lexicography. Email: email@example.com
The text was originally published in the proceedings of the 16th International Conference of the Asian Association for Lexicography (ASIALEX 2023) held from 22 to 24 June 2023 at Yonsei University, the oldest private university in Korea. The main theme of the conference was “Lexicography, Artificial Intelligence, and Dictionary Users”.
Just over a decade ago, a number of papers (notably Rundell & Kilgarriff 2011) reviewed developments in the application of language technologies to the compilation of dictionaries. They showed how the dictionary-making process had been to some degree automated, and they speculated on the prospects for further advances along the road towards full automation. Ten years on, it is time to assess what progress has been made. This paper starts with a brief overview of the state-of-the-art in 2011, then looks at developments in the period between then and now. Predictions made in earlier papers are reviewed: how far have they been realised? Several semi-automated projects are reported on, showing gradual progress towards a new approach to dictionary compilation. In this model — known as ‘post-editing lexicography’ — the role of human lexicographers is to post-edit (that is, evaluate and refine) the first draft of a dictionary which has been generated automatically and transferred into a dictionary writing and editing system. All of these developments have been called into question by the recent arrival of ChatGPT and similar large language models, which seem to offer the prospect of by-passing current technologies. Through a number of experiments using ChatGPT to generate dictionary text, the potential for these AI tools to replace the current state-of-the-art is investigated.
A little over a decade ago, several papers were published which discussed the prospects for automating the various processes involved in creating a dictionary. Two of these (Rundell & Kilgarriff 2011, Rundell 2012) gave an overview of the state-of-the-art in applying computational techniques to each stage in the production of a dictionary, from gathering language data to compiling dictionary entries and publishing them in multiple formats. A third paper (Kilgarriff & Rychlý 2010) described the automatic clustering of a word’s salient collocations, showing how it could provide the basis for a (fairly crude) form of word sense disambiguation — a model known at the time as “semi-automatic dictionary drafting”, or SADD. And a fourth paper (Kilgarriff, Kovář, & Rychlý 2010) focussed on an approach used in two major publishing projects for automatically identifying appropriate example sentences and transferring them from corpus to dictionary-writing system, along with their XML mark-up, with a single click (or tick — hence the name “tickbox lexicography”). Collectively, these papers showed how “several important aspects of dictionary creation have been gradually transferred from human editors to computers” (Rundell & Kilgarriff 2011, p. 258). (For a short survey of the application of technology to dictionary-making, see Rundell, Jakubíček, & Kovár 2020).
The present paper will start with an overview of the technologies available to dictionary-makers around the time these papers were published: where had we got to with automation, and how did we see the likely trajectory of this process? So our starting point is around 2011. I will then look at more recent developments — roughly in the period 2011-2022 — and assess their impact on the goal of automating lexicography. The last ten years have seen significant advances towards greater automation, in part thanks to the availability of much larger corpora. And in a dramatic new development, the last six months or so have seen the arrival of a new and potentially game-changing technology. Large language models (LLMs), notably the ChatGPT family released by OpenAI, first appeared as recently as November 2022, but they have already made a significant impact on a whole range of industries and research communities. No survey of the dictionary automation theme would be complete without an evaluation of these newest contenders and their potential for enhancing, or reinventing, the process of lexicography — or even making it irrelevant. It remains to be seen whether their impact will be truly disruptive or merely evolutionary, and in the concluding section we will discuss the implications of all this for the future of dictionaries, and indeed for the future of lexicographers.
The state-of-the-art in 2011
What needs to happen for a dictionary to be created and published? We can break this down into three consecutive stages:
- the ‘pre-lexicography’ stage (cf. Atkins & Rundell 2008, p. 15), during which language resources are collected; then linguistically annotated to optimise their usefulness (through tokenization, lemmatization, part-of-speech tagging, and so on); and then used as a data source from which a provisional headword list is extracted;
- the lexicographic heart of the project, during which corpus data is analysed, relevant linguistic facts are identified, and dictionary entries are created according to criteria established at the pre-lexicography stage;
- the publication stage, when the content produced in the previous phase is made available to the end-user in physical and/or digital form.
The last of these three stages need not detain us long. By 2011, the dictionary publication process was substantially automated. For well over half a century, dictionary text has been structured and stored in databases of increasing sophistication (e.g. Krishnamurthy 1987; Rundell, Jakubíček, & Kovár 2020, pp. 18-20). From around the turn of the 21st century, dedicated dictionary writing software became widely available, greatly simplifying the business of converting these dictionary databases into published products. In the process, lexicographers were relieved of many routine tasks which, though intellectually undemanding, were labour-intensive and error-prone — tasks such as ensuring the structural integrity of every dictionary entry, or checking that cross-references matched up. Not so long ago, it was part of the lexicographer’s job to ensure that different elements in the dictionary (such as examples, syntactic codes, or style labels) appeared in the correct typeface. Nowadays, dictionary content is created as plain text, and the question of how it is output for publication is independent of the compilation process. Other operations have been transferred from lexicographer to end-user: for example, the question of where in the dictionary an idiom should be placed (do I put ‘kick the bucket’ at kick or at bucket?) no longer troubles lexicographers because in an online dictionary the search algorithm will find it at the point of use.
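The cross-reference check mentioned above is easy to mechanise once dictionary content is held as structured data. The sketch below is a minimal illustration, assuming a hypothetical XML schema with `<entry hw="...">` and `<xref target="...">` elements; real dictionary writing systems run far richer integrity checks than this.

```python
import xml.etree.ElementTree as ET

def broken_crossrefs(dictionary_xml):
    """Report cross-references that point at no existing headword: one of the
    routine integrity checks dictionary-writing systems now run automatically.
    The element and attribute names here are illustrative, not a real DTD."""
    root = ET.fromstring(dictionary_xml)
    headwords = {e.get("hw") for e in root.iter("entry")}
    return sorted(
        x.get("target") for x in root.iter("xref")
        if x.get("target") not in headwords
    )
```

A system running a check like this after every edit is exactly what relieves lexicographers of the matching-up work described above.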
Automation, then, was already well advanced (and widely employed) at this end of the dictionary-making process. Here, it is a straightforward win-win, relieving lexicographers of time-consuming, tedious work (the ‘drudgery’ which Dr. Johnson complained of) and transferring it to machines, which generally do these jobs better and faster.
The pre-lexicography stage
This stage had also been substantially automated by the early 2010s (e.g. Rundell & Kilgarriff 2011, pp. 262-267). In particular, the development of very large corpora was now much easier (and much cheaper), and — though still by no means a trivial undertaking — was a world away from the heroic task it had been in the days of the first COBUILD corpus (Renouf 1987) or of the British National Corpus (http://www.natcorp.ox.ac.uk/). The advent of the Web made almost every variety of text available, in digital form and in vast quantities, and techniques for converting raw text into linguistically-useful data were already mature and reliable. Lexicography benefited from research carried out in the natural-language processing (NLP) community. Methodologies had developed for finding continuous bodies of text on the Web (which is full of lists, advertisements, links, and various forms of boilerplate), and for ‘cleaning’ Web-derived text, for example to remove the kinds of duplication which are pervasive in this medium (see e.g. Kilgarriff, Rundell & Uí Dhonnchadha 2006, section 3.3).
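The deduplication step can be sketched as follows. This toy version compares documents by their word ‘shingles’ (overlapping n-grams) using Jaccard similarity; production pipelines use scalable approximations such as MinHash, and the shingle size and threshold here are arbitrary assumptions.

```python
def shingles(text, n=5):
    """Split a document into overlapping word n-grams ('shingles')."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (0 = disjoint, 1 = identical)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def deduplicate(documents, threshold=0.8):
    """Keep each document unless it is a near-duplicate of one already kept."""
    kept, kept_shingles = [], []
    for doc in documents:
        sh = shingles(doc)
        if all(jaccard(sh, prev) < threshold for prev in kept_shingles):
            kept.append(doc)
            kept_shingles.append(sh)
    return kept
```

Run over a crawl, a filter of this general shape discards the boilerplate repetition that would otherwise distort frequency counts.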
Cleaned Web-sourced text is then processed to optimise its efficiency as a data resource for lexicography, in a step-by-step procedure outlined by Greg Grefenstette as long ago as 1998 (Grefenstette 1998). Text is tokenized (to identify word boundaries); then lemmatized (to group together the inflected forms of a word under a single ‘canonical’ form); and then part-of-speech tagged (with each word-form or lemma assigned to a grammatical class). This was done through the application of tools developed over many years by NLP researchers, independently of lexicography, and as Grefenstette noted, ‘The tools used for one level can be stretched to perform tasks on a higher level.’ (ibid., p. 24).
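The pipeline can be illustrated schematically. The lookup tables below are toy stand-ins for the trained lemmatizers and taggers the text describes; only the shape of the token-lemma-POS output (the ‘vertical’ format consumed by corpus tools) is the point.

```python
import re

# Toy lookups standing in for real trained lemmatizers and POS taggers.
LEMMAS = {"walked": "walk", "walking": "walk", "walks": "walk", "dogs": "dog"}
POS = {"walk": "VERB", "dog": "NOUN", "the": "DET"}

def tokenize(text):
    """Split text into word tokens (real tokenizers also handle clitics,
    hyphenation, abbreviations, and so on)."""
    return re.findall(r"[A-Za-z]+", text.lower())

def annotate(text):
    """Map each token to a (token, lemma, pos) triple."""
    out = []
    for tok in tokenize(text):
        lemma = LEMMAS.get(tok, tok)          # fall back to the surface form
        out.append((tok, lemma, POS.get(lemma, "X")))
    return out
```

Each stage consumes the output of the one before, which is exactly Grefenstette's point about tools at one level being stretched to serve the next.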
Armed with a large and well-processed corpus, we are well-placed to generate a headword list for any dictionary we are planning. A range of factors can influence headword selection, notably a ‘user profile’ which helps us determine the kinds of vocabulary our target user is likely to need. But broadly speaking, frequency is the key determinant, and, other things being equal, ‘If a dictionary is to have N words in it, they should be the N words from the top of the corpus frequency list’ (Rundell & Kilgarriff 2011, p. 264). A frequency-driven provisional list can then be refined by human editors. All of this applies especially to English, and to a large extent to other well-resourced languages. It will not yet be true for less well-resourced languages. But the methodologies are well established, and can be applied more widely when adequate resources become available.
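A frequency-driven provisional list is straightforward to sketch, assuming the corpus texts are already lemmatized; in practice a stop-list and the user profile would refine this raw ranking before human editors see it.

```python
from collections import Counter
import re

def headword_candidates(corpus_texts, n):
    """Return the n most frequent lemmas as a provisional headword list.
    Assumes texts are already lemmatized; a stop-list and editorial
    policy would normally refine the raw frequency ranking."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return [word for word, _ in counts.most_common(n)]
```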
The lexicographic process
With pre- and post-lexicography phases significantly automated by 2011, how far had the central task of dictionary-entry writing been transferred from humans to machines at that point?
From the early 2000s, the size of corpora (for English and other well-resourced languages) was beginning to be measured in billions of words. This was orders of magnitude greater than the 8-million-word COBUILD corpus of the 1980s, and the corpora available to dictionary developers continued their steady growth. With such an abundance of data, a working method based on reading concordances was becoming increasingly unviable: there were just too many concordance lines to read. Lexical profiling software, of which the Word Sketch is the best-known example, emerged to solve this problem. Word Sketches quickly became a central part of the lexicographer’s toolkit, because they ‘provided a neat summary of most of what a lexicographer was likely to find by the traditional means of scanning concordances’ (Rundell & Kilgarriff 2011, p. 269). Word Sketches were initially developed in response to a specific requirement: the need for a more systematic account of collocation. But it soon became clear that — since different collocates or different syntax patterns tend to be associated with different meanings — the Word Sketch was a useful guide to identifying dictionary senses. Consequently Word Sketches tended to replace concordances as the preferred starting point in the process of analysing polysemous words.
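A crude approximation of this kind of lexical profiling can be sketched with the logDice salience score used in Sketch Engine for ranking collocates. The window-based co-occurrence counting below is a simplification I am assuming for illustration: real Word Sketches work over grammatical relations identified by a sketch grammar, not surface proximity.

```python
import math
from collections import Counter

def logdice(f_xy, f_x, f_y):
    """logDice salience: 14 + log2(2*f(x,y) / (f(x) + f(y))).
    Maximum is 14; higher means a stronger collocation."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

def collocations(tokens, node, window=2):
    """Rank words co-occurring with `node` (within a surface window) by logDice."""
    unigrams = Counter(tokens)
    pair = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i and tokens[j] != node:
                    pair[tokens[j]] += 1
    scored = {w: logdice(c, unigrams[node], unigrams[w]) for w, c in pair.items()}
    return sorted(scored, key=scored.get, reverse=True)
```

The ranked list is, in miniature, the ‘neat summary’ a Word Sketch offers in place of raw concordance scanning.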
In a related endeavour, word sense disambiguation (WSD) —‘a challenge for language technology researchers since the earliest days of the field’ (Kilgarriff & Rychlý 2010, p. 303) — was beginning to yield to computational approaches. The underlying principle is that the different meanings of a polysemous word are typically associated with particular collocations and/or syntax patterns. Facilitated by Word Sketches, human lexicographers now identified ‘dictionary senses’ by clustering individual language events (concordance lines) on the basis of their shared collocational or syntactic behaviour. So the challenge for automation is to get machines to replicate this process. In conceptual terms at least, a good deal of progress had already been made by 2011 (see especially Kilgarriff & Rychlý 2010).
Another significant innovation from this period is the GDEX algorithm (the name stands for ‘good dictionary examples’). Its modus operandi is to trawl the corpus and identify sentences which illustrate some aspect of a word’s characteristic linguistic behaviour, such as a syntax pattern or collocation. A candidate list is presented to the lexicographer, who then chooses the most promising ones for use as dictionary examples, whether verbatim or lightly edited. Again, this replaces an earlier working model where the lexicographer would ‘manually’ scan numerous concordance lines in order to find a suitable example. The workings of the system are described in detail elsewhere (Kilgarriff et al. 2008), but the key point is that, even on its first outing in 2007, the system worked well enough to streamline this major component in building a dictionary entry.
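The flavour of GDEX-style scoring can be conveyed with a toy version. The heuristics and weights below are illustrative assumptions only; the real algorithm combines many more weighted features (see Kilgarriff et al. 2008).

```python
def gdex_score(sentence, common_words, min_len=6, max_len=20):
    """Toy GDEX-style scorer: prefer sentences of moderate length made of
    common words, and penalise openings that need an antecedent."""
    words = sentence.rstrip(".!?").lower().split()
    score = 1.0
    if not (min_len <= len(words) <= max_len):
        score -= 0.5                      # too short or too long for an example
    rare = sum(1 for w in words if w not in common_words)
    score -= 0.1 * rare                   # unfamiliar vocabulary distracts
    if words and words[0] in {"this", "it", "these", "they"}:
        score -= 0.3                      # anaphora with no visible antecedent
    return score

def best_examples(sentences, common_words, k=2):
    """Return the k highest-scoring candidate example sentences."""
    return sorted(sentences, key=lambda s: gdex_score(s, common_words),
                  reverse=True)[:k]
```

The lexicographer then picks from the top of this ranked list rather than scanning hundreds of concordance lines.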
At an operational level, protocols have been devised to link corpus data directly to a dictionary’s database. Thus a collocation or construction, along with examples selected by GDEX and approved by the lexicographer, can be copied directly — in a single move — into the relevant fields in the emerging dictionary database. All of these innovations, though initially introduced to meet the needs of specific projects, gradually became standard features in the lexicographer’s toolkit.
We now come to another entry component, the ‘labels’ used in dictionaries for signalling that an item deviates in some way from the unmarked case. Broadly speaking, labels may be grammatical or sociolinguistic. A grammatical label could be applied, for example, to indicate that a particular verb has a strong preference for occurring in the passive or for not occurring in progressive forms. Sociolinguistic labels are applied to words or meanings whose distribution across text types is in some way limited. While a word that is ‘unmarked’ can be found in all varieties of text, some words tend to be used predominantly in, say, legal or medical discourse, or in very informal registers, or in texts from a specific regional variety (such as the English spoken in India or the Spanish of Argentina).
In the case of grammatical preferences, the process for determining which words might merit a label was already well understood in 2011. A simple calculation can show the ‘normal’ incidence of passive forms across all verbs, and the degree to which any individual verb deviates from that norm. Where the deviation is significant, the software can indicate this to the lexicographer. The precise threshold at which a verb should attract a label such as ‘often used in the passive’ (is it 50% passives, 60%, or more?) will be a matter of editorial policy, but the principle is straightforward.
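The calculation described above can be sketched in a few lines. The doubled-norm condition is my own illustrative assumption standing in for whatever editorial policy a project adopts.

```python
def passive_ratio(verb_occurrences):
    """Proportion of a verb's corpus hits that are passive, given a per-hit
    flag (1 = passive, 0 = active) produced by the corpus annotation."""
    if not verb_occurrences:
        return 0.0
    return sum(verb_occurrences) / len(verb_occurrences)

def suggest_label(verb_ratio, corpus_norm, threshold=0.5):
    """Flag a verb for an 'often used in the passive' label when its passive
    ratio is high in absolute terms AND well above the all-verb norm.
    The exact threshold is editorial policy, not a technical given."""
    return verb_ratio >= threshold and verb_ratio > 2 * corpus_norm
```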
Likewise, for many sociolinguistic features, the mechanisms for automating labels (or at least, for prompting the lexicographer to apply a label) are not theoretically complex. Essentially, we would need to collect well-defined subcorpora (such as texts from a specific subject domain or representing a specific regional variety) and identify ‘key words’ — those items that occur significantly more often in the subcorpora than in a general-purpose ‘reference corpus’. None of this is technically difficult, but in practical terms it remains challenging: assembling large numbers of subcorpora in order to facilitate automatic labelling is not a trivial exercise. Even then, there are some classes of label which are less amenable to automation. Applying a label like ‘offensive’, for example, is more likely to be a matter of judgement than of statistical calculation. In general, it is fair to say that — though in most cases the solutions are well understood — progress towards automating the application of labels was modest in 2011.
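The subcorpus-versus-reference comparison can be sketched with a normalised frequency ratio. One widely used formulation (the ‘simple maths’ keyness associated with Sketch Engine) normalises per million tokens and smooths with an additive constant so that items absent from the reference corpus do not divide by zero; the smoothing value below is an assumption.

```python
def keyness(freq_sub, size_sub, freq_ref, size_ref, smoothing=1.0):
    """Keyness as a ratio of per-million frequencies, with additive smoothing.
    Items scoring far above 1 occur disproportionately in the subcorpus and
    are candidates for a domain or register label."""
    per_m_sub = freq_sub * 1_000_000 / size_sub
    per_m_ref = freq_ref * 1_000_000 / size_ref
    return (per_m_sub + smoothing) / (per_m_ref + smoothing)
```

A word occurring 100 times per million in a legal subcorpus but once per million in the reference corpus scores around 50, a strong prompt for a ‘legal’ label.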
One major dictionary component was still dependent on ‘manual’ lexicography in 2011: the production of definitions remained resistant to automation at this point. To be sure, abundant corpus data and sophisticated analysis software provided lexicographers with better raw materials, making it easier to identify the salient features of a word’s meaning. Consequently the quality of definitions improved. But producing them remained a demanding and labour-intensive operation: machines could not do the job on their own.
To summarise the position in 2011: significant progress had been made in transferring elements of the dictionary-making process from humans to machines. Corpus building, headword-list development, and much of the publication process were substantially automated. So too were some aspects of dictionary entry creation. The task of finding appropriate example sentences reflecting typical usage had been considerably streamlined. Word sense disambiguation had not yet been automated, but Word Sketches enabled lexicographers to do the job more efficiently (and with less dependence on subjective judgements), and it was already possible to see how automation might work.
All of this brought operational efficiencies and led to improvements in the quality of dictionaries. But much of the content was still, to a large extent, the product of human skill and effort. Nevertheless, a shift in working patterns was emerging. Where lexicographers previously scanned multiple concordance lines to extract relevant information, we could now foresee ‘a new paradigm where the software selects what it believes to be relevant, and … populates the appropriate fields in the dictionary database’ (Rundell & Kilgarriff 2011, p. 278). In this model, the lexicographer’s job is to evaluate a first draft of an entry produced by the computer, and to decide what to keep, what to eliminate, and what to add.
The next decade (2011-2022)
The next ten years saw steady progress towards greater automation, building on the methods and technologies outlined in the previous section. During this period, the corpora used by lexicographers grew by an order of magnitude or more. When Kilgarriff and Rychlý discussed their ideas for automatic word sense disambiguation in 2010, they were working with a corpus of 1.3 billion words. Five years later, 20-billion-word corpora had been developed for English and several other European languages, and by the end of the decade the largest available corpus for English had almost 40 billion words. This is important because of the well-known Zipfian distribution, not only of individual words but of specific meanings, multiword units, and the patterns associated with words. With much larger corpora, we get a more granular — and more reliable — picture of how words typically combine, and this in turn supports the automation agenda.
Software was continually improving too. For example, Sketch Engine’s functions now included a tool for identifying the keywords in a text or corpus — not only single words but multiword terms as well. In two projects for Slovene (a general lexical database and a specialised terminological dictionary), the respective headword lists were automatically extracted from corpus data, while example sentences were generated by two separate configurations of the GDEX algorithm. The approach saved a great deal of time, ‘by directly exporting all the data for each lemma and importing it into the dictionary-writing system’ (Kosem et al. 2014, p. 361), relieving lexicographers of routine tasks and allowing them to focus on sense division, definition-writing, and finalising entries.
The application of sociolinguistic labels (for marking register, domain, etc.) continued to resist easy automation. Multi-billion-word web-derived corpora had in general proved more useful for lexicography than the earlier, much smaller corpora assembled from print media, such as the British National Corpus. But the trade-off for acquiring such huge datasets was a loss of the detailed header information about the documents which made up more ‘traditional’ corpora. Experiments in classifying the genres in web corpora through supervised learning have so far had limited success (Suchomel 2021), but this is a promising line of research which might eventually underpin some level of automated labelling.
The migration of most dictionaries from print to digital media has put a higher premium on ‘currency’ — the requirement for the dictionary to be always up to date. This implies a need to identify new vocabulary items as they emerge. How far could this process be supported by an automatic approach? Cook et al. 2013 report on their application of a word sense induction system to two corpora (a ‘focus corpus’ and an older ‘reference corpus’) whose constituent texts are around 15 years apart — the goal being to identify lexical items in the newer texts which did not occur in the older ones. These items could be newly-emerging words, but also (much harder) novel senses of headwords already in the dictionary. Even in a small-scale experiment, a number of clear cases were detected, showing that the method had ‘the potential to aid in identifying dictionary entries that require updating’ (Cook et al. 2013, p. 63), where definitions and even examples may not reflect current usage.
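The first, easier part of this task (wholly new items, as opposed to novel senses) reduces to a corpus comparison, sketched below. The frequency floor is an arbitrary assumption to suppress noise; candidates would always go to a human editor for review.

```python
def new_item_candidates(focus_counts, reference_counts, min_focus=5):
    """Items reasonably frequent in the newer 'focus' corpus but absent from
    the older 'reference' corpus: candidate neologisms for editorial review.
    Novel senses of existing words need sense induction and are much harder."""
    return sorted(
        w for w, c in focus_counts.items()
        if c >= min_focus and reference_counts.get(w, 0) == 0
    )
```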
The state-of-the-art: post-editing lexicography
Towards the end of the period under review, the various strands of technical innovation introduced over the previous two decades came together in a set of projects which represent the state-of-the-art in semi-automated lexicography. Lexical Computing (the Sketch Engine company) was commissioned to produce three from-scratch trilingual dictionaries for the Naver Corporation, Korea’s leading ICT organisation. For each of the three dictionaries the target languages (TLs) were Korean and English, and the source languages (SLs) were, respectively, Lao, Tagalog, and Urdu. Audio pronunciations were recorded in the conventional way using human speakers, but apart from this single component, the corpora for the project and all parts of the dictionaries’ content were generated automatically and then post-edited by humans.
The projects are described in detail in Baisa et al. 2019 and Jakubíček et al. 2021, so a brief summary will suffice. Large web corpora of the three SLs were created and then annotated (lemmatized, POS-tagged, etc.) using tools available in Sketch Engine. Each corpus provided the source material for, first, a headword list, and then the main content of each entry in the dictionary.
As a first stage in building an entry, word sense division was achieved using a combination of Word Sketches and word embeddings. Collocation is central to this, and the algorithm’s output is a set of clusters with associated collocations. An important feature here (as will become clear when we discuss ChatGPT, below) is that each cluster is supported by a set of concordance lines, giving human editors a direct route back to the underlying corpus data. Once a sense inventory had been established, further salient collocations were added for each sense, along with corpus-derived examples, lists of related words (such as synonyms and antonyms), and TL translations obtained from commercial machine-translation services.
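The general shape of such clustering can be sketched greedily: represent each concordance line by the average embedding of its context words, and start a new ‘sense’ cluster whenever a line resembles no existing cluster. The two-dimensional vectors and the threshold below are toy assumptions; the production system combines real embeddings with Word Sketch collocations.

```python
import math

# Toy word vectors standing in for real embeddings (e.g. word2vec output).
VECS = {
    "music": (1.0, 0.0), "dancing": (0.9, 0.1), "cake": (0.8, 0.2),
    "election": (0.0, 1.0), "vote": (0.1, 0.9), "leader": (0.2, 0.8),
}

def context_vector(words):
    """Average the vectors of a concordance line's context words."""
    vs = [VECS[w] for w in words if w in VECS]
    return tuple(sum(d) / len(vs) for d in zip(*vs))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_lines(lines, threshold=0.9):
    """Greedy clustering: each line joins the first cluster whose founding
    centroid it resembles, otherwise it starts a new 'sense' cluster."""
    clusters = []  # list of (centroid, member_lines) pairs
    for line in lines:
        v = context_vector(line)
        for centroid, members in clusters:
            if cosine(v, centroid) >= threshold:
                members.append(line)
                break
        else:
            clusters.append((v, [line]))
    return [members for _, members in clusters]
```

Crucially, each cluster retains its member concordance lines, preserving the route back to the corpus evidence that the text identifies as essential for post-editing.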
All of this data is generated automatically and exported into the Lexonomy writing and editing tool. The human contribution is then performed in a series of discrete stages, each dependent on the one before, ‘where editors were always post-editing only particular entry parts’ (Baisa et al. 2019, p. 807). Their role could be seen as analogous to that of a senior editor on a conventional dictionary project, editing a first pass produced by a member of the lexicographic team. For example, in reviewing a sense cluster presented by the algorithm, an editor may decide to split it into two separate senses or to move the whole cluster into another sense. This and other editing tasks (such as validating translations or improving the example sentences suggested by the machine) were facilitated by customised widgets added to the Lexonomy system.
This was a first attempt at full-scale ‘post-editing lexicography’ on a major project, and initial impressions are favourable. Challenges remain in terms of data management and human resource management, and a great deal was learned which is already feeding into further iterations of the process on other projects. But the approach clearly worked as a proof of concept, and demonstrated the ‘viability, affordability, and performance benefits of this compilation model’ (Baisa et al. 2019, p. 817). Predictions made a decade earlier had been substantially borne out and — although in technical terms this represents an evolution over many years — in terms of lexicographic practice and methodology, it can be seen as revolutionary.
AI and large language models
ChatGPT and how it works
In the course of just a few months, much of the above discussion has been called into question by the advent of ChatGPT, which was released in November 2022. (It has been followed by numerous competitors, such as Google’s Bard, Microsoft’s Bing Chat, and Anthropic’s Claude, some of which may have been released before they were quite ready, in order to cash in on the enormous hype accompanying these tools. All the analysis reported here has been done using ChatGPT version 3.5.)
The system has a wide range of uses, including creating code for developers, providing medical diagnoses, writing poems, song lyrics or academic papers, producing translations — and possibly compiling dictionaries. Responses to ChatGPT’s arrival, in terms of its potential impact on the world in general, have ranged between the apocalyptic (‘the end of human civilization’) and the derisive (it is nothing more than ‘high-tech plagiarism’, in Noam Chomsky’s view, and ‘a way of avoiding learning’).
The question for our community is whether we can abandon all the technologies described above (which, over time, have brought us steadily closer to the goal of full automation) and simply hand over the entire job of dictionary-making to this disruptive new AI technology. Before we can answer this, it is important to have some understanding of how the system works.
ChatGPT is a chatbot built on a large language model (LLM). At the simplest level, what LLMs do is generate statistically likely sequences of words in response to a prompt. Now, it turns out that ‘a great many tasks that demand intelligence in humans can be reduced to next token prediction with a sufficiently performant model’ (Shanahan 2022, p. 1). ChatGPT’s performance is often so strikingly good that we may be deceived into thinking that these systems have the same kind of intelligence as we do. They do not. They ‘are simultaneously so very different from humans in their construction, yet … so human-like in their behaviour, that we need to pay careful attention to how they work before we speak of them in language suggestive of human capabilities and patterns of behaviour’ (Shanahan 2022, p. 3).
Using ChatGPT to generate a dictionary
The best way to evaluate the system’s capabilities is to get it to produce dictionary entries, and in the short time since ChatGPT’s release a number of experiments on these lines have already been made (e.g. de Schryver & Joffe 2023, Lew forthcoming, Jakubíček & Rundell forthcoming).
The starting point is always a ‘prompt’ — a question in natural language which prompts ChatGPT to generate a response. Typical prompts include ‘Could you define word W?’, or ‘Generate a dictionary entry for W’, or ‘Create a dictionary entry for W, showing all its meanings and its uses in different contexts’, or any number of other formulations. A good deal of trial-and-error is needed in order to settle on a form of words which induces the system to produce the results we are looking for. But all of this is doable. We can ask it to generate a whole batch of entries (for dozens or even hundreds of headwords). It can also be programmed to produce a fully structured entry, with XML mark-up, and transfer it seamlessly into the dictionary database in a writing system such as Tshwanelex or Lexonomy.
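The prompt-construction step can be sketched as a template function. Everything here is hypothetical: the element names stand in for a project's own schema, the wording would be refined by trial and error as described above, and the network call to the model (and subsequent XML validation and import into the writing system) is omitted.

```python
def build_entry_prompt(headword, schema_elements=("sense", "definition", "example")):
    """Assemble a prompt asking a chat model for a dictionary entry as XML.
    The schema element names are illustrative assumptions, not a real DTD;
    a real project would iterate on this wording extensively."""
    tags = ", ".join(f"<{e}>" for e in schema_elements)
    return (
        f"Create a dictionary entry for the word '{headword}', showing all its "
        f"meanings and its uses in different contexts. Return well-formed XML "
        f"using only these elements: {tags}."
    )
```

Batching over a headword list, validating the returned XML, and pushing it into Lexonomy or Tshwanelex would wrap around a function of this shape.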
Two sets of sample entries will be discussed here: twenty or so that were specially selected to test the system’s performance on specific entry components (Sample A); and a batch of 99 entries chosen to represent the full range of dictionary entry types, including all the main word classes, monosemous and polysemous words, concrete and abstract concepts, and so on (Sample B: this set is discussed more fully in Jakubíček & Rundell forthcoming). All sample entries are for a monolingual English dictionary.
Meanings and definitions
To start with a ‘simple’ entry, the technical term carbon cycle (Sample B) is well-defined in each of the three versions tried. Here is one of them:
The process that carbon goes through in nature, including its exchange between the atmosphere, oceans, and land biosphere through natural processes such as photosynthesis, respiration and decomposition.
On a less technical level, the adjective remiss (Sample A) is adequately defined as failing to fulfill a duty or obligation; careless or negligent in the performance of a task.
Remiss is one of those words with a strong preference for appearing in certain recurrent patterns (‘it would be remiss of somebody (not) to do something’, ‘remiss in something’), and ChatGPT’s example sentences reflect this well:
As a teacher, it would be remiss of me not to grade my students’ assignments promptly.
The company was found to be remiss in its safety protocols.
So far, so good. But both these entries are for single-sense words. Polysemy proves to be a bigger challenge — not surprisingly, since word sense disambiguation is one of the most difficult tasks in lexicography (and, as we saw above, one of the hardest to automate). When prompted to ‘create a dictionary entry for the word party, showing how it is used in different contexts’ (Sample A), the system performs poorly. It correctly identifies three clear meanings: (1) a social gathering, (2) a political organisation, and (3) a group of people engaged in a particular activity (‘a party of hikers’). But things then go downhill. First, it provides two further ‘senses’ which are simply variations on the ‘social gathering’ meaning: (4) ‘a group of people having a good time, typically involving music and dancing’, (5) ‘a group of people who have gathered to celebrate’. Then a further ‘sense’ for which there is no real evidence: (6) ‘the people present in a room, place, or gathering’ (this could either be a metonymic extension of the first meaning, or a variation on the third). It fails to account for the sense used in legal discourse of a person or group involved in an agreement or contract (‘all parties to the agreement…’) or for any of party’s verb uses.
This is unimpressive: party was chosen precisely because it is a simple type of polysemous word: its different meanings are distinct and easily identified, so it is the kind of entry which a human lexicographer would have no difficulty compiling. Things get worse with overwhelm. This is a more challenging polysemous item, which no two dictionaries will divide up in the same way, but even so, ChatGPT’s response (Sample A) is weak:
- to defeat completely; to bury or bury under a mass of something
- to overwhelm or overcome, especially with emotion
- to overwhelm someone or something with a superior force
- noun: a feeling of being overwhelmed
The first sense combines two quite different concepts; the third merely duplicates the first (and uses the word ‘overwhelm’ in the definition); and the noun use (though attested in corpus data) is rare. In a different sample (de Schryver & Joffe 2023), the word cat is given two verb senses:
- to behave in a sly, stealthy, or stealthily destructive manner
- to spend time idly or lazily, particularly while lounging or relaxing in a comfortable manner
Both look plausible enough, but they are pure inventions, unsupported by corpus data, and not recorded in mainstream dictionaries (such as ODE, Merriam-Webster, or dictionary.com). (There is some evidence for the expression ‘catting around’, e.g.: he had been out drinking and "catting around" at one of many brothels. But this is a somewhat different meaning, and too infrequent to make it into most dictionaries.)
In our own Sample A, when asked for a definition of presentation, ChatGPT says that ‘according to the Merriam-Webster dictionary’ it is ‘the act or process of presenting something to an audience’. In reality, this is not a Merriam-Webster definition, and the dated, formulaic ‘act or process of’ style is unhelpful. (We find something similar in the definition of closure in Sample B: ‘The act or process of closing or the state of being closed’.)
The issues raised by these entries are symptomatic of problems found in most of the polysemous headwords in both samples: some meanings are duplicated; others are invented; and important meanings are omitted (in Sample B, climate has five ‘senses’ of the weather-related meaning, but none for the common metaphorical use, as in ‘a climate of distrust’). On the basis of the two sample sets, it would be fair to conclude that ChatGPT performs best in handling single-sense words (especially technical terms), but is on shaky ground when confronted with even quite simple polysemous items of mainstream vocabulary.
Examples and grammar
Example sentences in contemporary dictionaries are typically drawn directly from corpus data and (whether selected ‘manually’ by the lexicographer, or proposed by GDEX) they will sometimes be post-edited to remove distracting or extraneous material. It is unclear how the examples in ChatGPT-generated entries are sourced, but the results are consistently bad.
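For readers unfamiliar with how GDEX selects candidate examples: it ranks corpus sentences using heuristics such as sentence length, word frequency, and anaphoric pronouns that need outside context (Kilgarriff et al. 2008). The following Python sketch illustrates the idea only; the weights, thresholds and the tiny common-word list are all invented for this illustration, and real GDEX uses far richer, corpus-derived features:

```python
# A minimal GDEX-style scorer: a sketch of the idea only. Real GDEX
# uses many more features and corpus-derived frequency lists; the
# weights and the common-word list here are invented for illustration.

COMMON_WORDS = {
    "the", "a", "an", "was", "is", "to", "be", "in", "of", "its",
    "he", "she", "it", "they", "did", "that", "because", "found",
    "company", "safety", "protocols",
}

def gdex_score(sentence: str) -> float:
    words = sentence.rstrip(".").lower().split()
    score = 1.0
    if not 10 <= len(words) <= 25:          # prefer mid-length sentences
        score -= 0.3
    rare = sum(1 for w in words if w not in COMMON_WORDS)
    score -= 0.05 * rare                    # penalise rare/unknown words
    if words and words[0] in {"he", "she", "it", "they", "this", "that"}:
        score -= 0.2                        # penalise anaphoric openings
    return max(score, 0.0)

candidates = [
    "It was remiss.",
    "He did that because of it.",
    "The company was found to be remiss in its safety protocols.",
]
best = max(candidates, key=gdex_score)
```

Even this crude version prefers the full, self-contained corpus sentence over short or pronoun-heavy fragments, which is the behaviour the real tool delivers at scale.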
The entry for fair (adjective) in Sample A has nine (sic) senses, of which two are labelled ‘obsolete’, two ‘archaic’, and one ‘dialect’. Each sense has one example, and every one of these follows the same formula: 3rd person singular subject, with sentence-initial ‘The’, followed by a verb in the simple past. To give an idea, the first four examples for fair read:
The referee made a fair decision by awarding a penalty.
The garden was filled with fair flowers.
The price of the item was fair, not too high or too low.
The fair-skinned woman had to wear a hat and sunscreen to protect her skin from the sun.
Ironically, though generated by a machine, these examples look as if they were invented by a not very competent human editor (and incidentally the third and fourth breach Grice’s maxim of quantity, making them look even more inauthentic). This is not an aberration. In the entry for party, 11 of the 12 examples follow a similar pattern, and the same tendency is repeated across every entry in both samples. In a separate experiment (Lew, forthcoming) we find these three examples for the main meaning of persuade:
The salesperson persuaded the customer to buy the product.
The speaker persuaded the audience to support the cause.
The friend persuaded the colleague to take a day off.
These are, if anything, even worse, with both the subject and object of the verb being a generic noun introduced by a definite article. This tendency reaches its nadir in this example for command (in Sample B), which almost looks like a sackable offence:
The commander commanded his troops to march forward.
While current technology (GDEX) presents lexicographers with candidate examples which they may need to edit, the examples shown above are irretrievable, and would simply have to be junked and replaced.
The system also has problems in dealing with grammatical categories. In Sample B, one meaning of aside (with the example: He pushed the plate aside to make room for the pie) is labelled as a preposition (it is an adverb here). An entry for the verb haunt (Sample B) starts well enough, with sense 1 describing what ghosts do. But sense 2 defines the verb first with an adjective phrase, then a noun:
Constantly present in one’s mind; an obsession
Notwithstanding the problems highlighted in this brief survey of sample entries, there are some grounds for optimism. On the development side, ChatGPT and similar systems are improving rapidly, partly (but not only) through huge increases in the volume of data they are trained on. There is also plenty of scope for fine-tuning the prompts we use, and the system responds well when we do: in Lew’s experiment, for example, it was (successfully) prompted to produce definitions following the full-sentence model used in the COBUILD dictionaries (Lew, forthcoming). Equally, one could presumably devise a prompt which would steer it away from over-using the ‘3rd person subject + simple past’ formula in example sentences. And although it has significant problems with word senses and grammatical categories, its definitions (even when wrong) are generally well-written and accessible. All things considered, this is a remarkable technological leap and, as a first shot in generative AI, it is extremely impressive — perhaps dangerously so. For the time being, though, one is reminded of Samuel Johnson’s observation on hearing about a woman giving a sermon (with apologies on behalf of Dr. Johnson for the 18th century misogyny): ‘Sir, a woman's preaching is like a dog's walking on his hind legs. It is not done well; but you are surprised to find it done at all’.
Discussion and conclusions: AI vs. current approaches
We saw earlier how an accretive process, involving a collaboration between the lexicographic and computational communities, has, over two decades or more, brought us closer to automating the production of dictionaries. The current state-of-the-art is a model where a complete first draft of a dictionary can be generated automatically and transferred to populate a dictionary database. This is then post-edited by humans (who are not necessarily lexicographers) to produce the finished dictionary. At a stroke, this approach is challenged by the recent arrival of AI technologies (in the form of ChatGPT and similar tools), which offer the possibility of by-passing these procedures and producing an almost finished dictionary in a single operation. To decide how realistic this possibility is, and how disruptive AI is likely to be for our community, we can start by asking three questions:
- Can ChatGPT directly answer users’ lexical queries (thereby making dictionaries redundant)?
- If not, can ChatGPT generate good dictionaries with minimal human input (thereby making lexicographers redundant)?
- If not, can ChatGPT produce a good-enough-quality draft dictionary ready for human post-editing (thereby making the tools we use now redundant)?
Can ChatGPT successfully answer users’ lexical queries?
In many use cases, individuals just need a quick answer so that they can continue with the task in hand: what does this word mean? what is its equivalent in Korean or German? what would be an example of how it is used? ChatGPT will often provide what users need. But existing resources do this too. Most of us use search engines (like Google) to find a quick monolingual definition, or translation services (like DeepL) for a bilingual translation. In other words, it is already normal for us to resolve some lexical queries without using a dictionary, and in most cases, the resources we use outperform ChatGPT in terms of simplicity and reliability.
For many other use cases, especially in educational or professional settings, people will often refer to a dictionary. In this context, there is a premium on trust (confidence that the information in the dictionary is accurate), and on ‘curation’ (knowing that the information in the dictionary has been selected to reflect what is most characteristic of the way a given word behaves). Can ChatGPT provide the more committed user with the same kind of service as a good corpus-based dictionary?
On the question of trust, research suggests that ChatGPT is not yet a reliable source. We have seen that it often gets things wrong, which means that even an apparently ‘good’ definition (as at carbon cycle) would need to be independently verified. Even more undermining of trust is the fact that the system is ‘non-deterministic’: it may give a different answer each time it is asked the same question. ‘Curation’ refers to what happens during the ‘synthesis’ stage of compiling a dictionary (Atkins & Rundell 2008, p. 386), when we distil what is lexicographically relevant from a large mass of corpus data: for example, listing the most typical syntax patterns, selecting the ‘best’ collocations and other recurrent phraseological patterns, producing definitions which describe the most important semantic features of a word, and providing examples of usage which reflect the most typical contexts found in the corpus data. This is what lexicographers do, and the computational resources developed in the last two decades have been designed to replicate this process, applying salience metrics to identify what is most typical about a word’s behaviour.
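The salience metric used to rank collocates in Sketch Engine’s Word Sketches is logDice (Rychlý 2008), defined as 14 + log2(2·f(x,y) / (f(x) + f(y))). A toy Python illustration of how such a score ranks collocates; all the frequencies below are invented for the example:

```python
import math

def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice salience: 14 + log2(2*f(x,y) / (f(x) + f(y)))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

# Invented frequencies for objects of the verb 'cause' in a
# hypothetical corpus (not real Sketch Engine figures).
f_cause = 500_000
object_collocates = [
    # (collocate, co-occurrence frequency, collocate frequency)
    ("damage",   40_000,   900_000),
    ("problem",  35_000, 2_500_000),
    ("increase",    800, 1_200_000),
]
ranked = sorted(
    ((w, log_dice(f_xy, f_cause, f_y)) for w, f_xy, f_y in object_collocates),
    key=lambda pair: pair[1],
    reverse=True,
)
```

Note how the score rewards collocates that co-occur with the node word far more often than their overall frequency would predict, which is exactly the ‘most typical behaviour’ that curation aims to surface.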
Assuming that reliable and well-curated dictionaries will still be needed in many situations of use (and certainly by language professionals and by serious learners and their teachers), we come to our second question.
Can ChatGPT generate good dictionaries with minimal human input?
This time, the answer is a straightforward ‘no’. The experiments described in the previous section suggest that ChatGPT can produce plausible-looking dictionary text, at least for headwords at the simpler end of the spectrum. But closer examination almost always reveals problems, whether of omission, invention, or inauthenticity. The system’s most enthusiastic supporters begin their talk by saying ‘We claim that machines can now take over the whole process [of dictionary-making]’ (de Schryver & Joffe 2023, slide 3). They also highlight the fact that ChatGPT can be fully integrated into a dictionary writing system (in their case, Tshwanelex), producing complete dictionary entries structured in XML or other data formats such as JSON. Towards the end of their talk, however, they appear to row back a little, concluding: ‘Let the machine do the bulk of the work, with human intervention only at the vetting stage’. But this is precisely the ‘post-editing’ model we use now, and seamless linking between language data (in a corpus) and structured dictionary text (in a dictionary writing system) is already an integral part of this approach. This leads us to our final question.
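To make the integration point concrete: an entry delivered by a language model as structured JSON can be checked automatically before it enters the post-editing workflow. The entry shape, field names and validation rules below are invented for illustration (they are not TLex’s actual schema):

```python
import json

# A minimal, invented JSON shape for a machine-generated entry, plus
# the kind of sanity check a post-editing pipeline might run before
# importing it into a dictionary writing system. The field names are
# illustrative only.

raw = '''{
  "headword": "party",
  "pos": "noun",
  "senses": [
    {"definition": "a social gathering of invited guests",
     "examples": ["We are having a party on Saturday."]},
    {"definition": "an organised political group",
     "examples": ["She joined the party in 1998."]}
  ]
}'''

def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry passes."""
    problems = []
    if not entry.get("headword"):
        problems.append("missing headword")
    for i, sense in enumerate(entry.get("senses", []), start=1):
        if not sense.get("definition"):
            problems.append(f"sense {i}: missing definition")
        if not sense.get("examples"):
            problems.append(f"sense {i}: no example sentence")
    return problems

entry = json.loads(raw)
assert validate_entry(entry) == []
```

Checks like these catch only structural defects, of course; the lexicographic problems discussed above (duplicated, invented or missing senses) still require human vetting.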
Can ChatGPT outperform existing technologies in creating a draft dictionary for post-editing?
The tools and methods available to us now can produce a good first draft of a dictionary. They have been tested in real dictionary projects, and they are improving with each iteration. So far, this post-editing model has been applied only to bilingual (and n-lingual) dictionaries, while the ChatGPT-generated samples we have discussed here are for a monolingual dictionary. But most of their entry components (sense division, grammatical information, example sentences) are common to both types of dictionary, so it is fair to say that ChatGPT cannot currently compete. There is some encouraging evidence that ChatGPT may be able to produce definitions which are good enough to be a basis for human post-editing. This is worth exploring further, as current approaches have not yet been very successful in automating definitions.
In both compilation models (the post-editing approach we use now, and a model based on ChatGPT), human intelligence still has an important role in interpreting automatically-generated language data. In one experiment (Sample A), ChatGPT was prompted several times to ‘explain the meaning and use’ of the verb to cause. (This wasn’t a request for a dictionary entry.) It gave a well-written, discursive response, but repeatedly fell short of an adequate explanation. This is because cause is (no pun intended) a cause célèbre in corpus linguistics. Pre-corpus dictionaries simply described the verb (as ChatGPT does) in terms of the relationship between an action and its consequences. But when Word Sketches became available, it was instantly apparent to human linguists that cause had what John Sinclair would call a ‘negative prosody’: it was overwhelmingly used where the outcome was ‘something bad’. In a Word Sketch from the English Web 2020 corpus in Sketch Engine, the top 12 salient object collocates are all negative, starting with: damage, problem, harm, death, injury, and disease. For a human reading this, the conclusion is inescapable, and definitions in contemporary dictionaries reflect this finding. But while ChatGPT’s responses did indeed give examples where the verb’s objects included ‘cancer’, ‘tension’, ‘damage’, and ‘confusion’, they failed to make the imaginative leap which any human editor would, and did not record the negative prosody in their explanations. The AI tool, in other words, lacks the intelligence to deduce the real meaning of cause.
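The ‘imaginative leap’ is easy once the collocational data is summarised: given the top object collocates, even a crude polarity count exposes the prosody. A toy Python sketch, using the collocates cited above and an invented polarity lexicon (a real study would use a sentiment lexicon or human judgements):

```python
# Toy check for negative prosody in the objects of 'cause'.
# top_objects are the collocates cited from the English Web 2020
# Word Sketch; the polarity lexicon is invented for illustration.

top_objects = ["damage", "problem", "harm", "death", "injury", "disease"]
NEGATIVE_LEXICON = {"damage", "problem", "harm", "death", "injury",
                    "disease", "pain", "confusion", "delay", "failure"}

negative_share = sum(obj in NEGATIVE_LEXICON for obj in top_objects) / len(top_objects)

# A share this high (here 6 of 6) is the signal a human editor reads
# as negative prosody.
print(f"negative share of top object collocates: {negative_share:.0%}")
```

The point is not that the arithmetic is hard, but that recognising its lexicographic significance, and encoding it in a definition, is the step the AI tool failed to take.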
By contrast, an editor reviewing a draft produced using the post-editing approach would — on seeing a set of negative objects — have been able to drill down into the underlying corpus data, and would immediately see how the word was normally used. This point is key to understanding a fundamental weakness in the ChatGPT model. In a post-editing system, there is a permanent link between the draft dictionary and the corpus from which it is generated. At any point, an editor can go back to the source data for clarification. This option is not available with ChatGPT, which is essentially a black box. It presents us with answers (and different answers every time we ask the same question), but we have no way of knowing how it generated them, and therefore no way of verifying their truthfulness. Whether this flaw can be overcome in future versions is not known. But it would be hazardous in the extreme to rely on any large language model which did not allow access to the underlying data on which its output is based.
The world of AI technologies is a highly competitive one, with massive resources at its disposal. It is likely that ChatGPT and similar tools will improve quite rapidly. Lew (forthcoming) found that version 4 of ChatGPT could be prompted to generate better (or less bad) example sentences than in his earlier experiments. So it would be foolish to conclude that systems like these could never replace, or at least substantially improve upon, the tools we use now for post-editing lexicography — even though these tools are constantly improving too. For the time being, we must conclude that ChatGPT does not herald ‘the end of lexicography’.
References
Atkins, B. T. S. & Rundell, M. (2008) The Oxford Guide to Practical Lexicography. Oxford: Oxford University Press.
Baisa, V., Blahuš, M., Cukr, M., Herman, O., Jakubíček, M., Kovář, V., Medveď, M., Měchura, M., Rychlý, P., Suchomel, V. (2019) Automating Dictionary Production: a Tagalog-English-Korean Dictionary from Scratch. In Electronic lexicography in the 21st century. Proceedings of the eLex 2019 conference, 1-3 October 2019, Sintra, Portugal, pp. 805-818.
Baroni, M., Kilgarriff, A., Pomikálek, J., & Rychlý, P. (2006) WebBootCaT: a Web Tool for Instant Corpora. In Euralex Proceedings 2006, Torino, Italy: Edizioni Dell’Orso, pp. 123-131.
Cook, P., Lau, J. H., Rundell, M., McCarthy, D., Baldwin, T. (2013) A lexicographic appraisal of an automatic approach for detecting new word senses. In Electronic lexicography in the 21st century: thinking outside the paper. Proceedings of eLex 2013. Ljubljana/Tallinn: Trojina, Institute for Applied Slovene Studies/Eesti Keele Instituut, pp. 49-65.
de Schryver, G-M. & Joffe, D. (2023) The end of lexicography: welcome to the machine. https://www.youtube.com/watch?v=mEorw0yefAs&list=PLXmFdQASofcdnRRs0PM1kCzpuoyRTFLmm&index=5 (last access: 20.05.23).
Grefenstette, G. (1998) The Future of Linguistics and Lexicographers: Will there be Lexicographers in the year 3000? In EURALEX 1998 Proceedings. Liège: University of Liège, pp. 25-42.
Jakubíček, M., Kovář, V., Rychlý, P. (2021) Million-Click Dictionary: Tools and Methods for Automatic Dictionary Drafting and Post-Editing. In Book of Abstracts of the 19th EURALEX International Congress, pp. 65-67.
Jakubíček, M. & Rundell, M. (forthcoming) Generating English dictionary entries using ChatGPT: advances, options and limitations. In Proceedings of eLex 2023, Brno, Czech Republic.
Kilgarriff, A., Rundell, M., & Uí Dhonnchadha (2006) Efficient corpus development for lexicography: building the New Corpus for Ireland. In Language Resources and Evaluation Journal 40 (2), pp. 127-152.
Kilgarriff, A., Husák, M., McAdam, K., Rundell, M., Rychlý, P. (2008) GDEX: Automatically Finding Good Dictionary Examples in a Corpus. In Proceedings of the XIII EURALEX International Congress, Barcelona: Universitat Pompeu Fabra, pp. 425–433.
Kilgarriff, A. and Rychlý, P. (2010) Semi-Automatic Dictionary Drafting. In de Schryver, G-M. (ed.) A Way With Words: A Festschrift for Patrick Hanks, pp. 299-312.
Kosem, I, Gantar, P., Logar, N., Krek, S. (2014) Automation of Lexicographic Work Using General and Specialized Corpora: Two Case Studies. In Euralex Proceedings 2014, Bolzano, Italy: Institute for Specialised Communication and Multilingualism, pp. 355-364.
Lew, R. (forthcoming). ChatGPT as a COBUILD lexicographer. Retrieved from osf.io/t9mbu
Renouf, A. (1987) Corpus Development, In J.M.Sinclair (ed.) Looking Up: An account of the COBUILD project in lexical computing. London: Collins ELT, pp. 1-40.
Rundell, M. & Kilgarriff, A. (2011) Automating the creation of dictionaries: where will it all end? In Meunier F., De Cock S., Gilquin G. & Paquot M. (eds), A Taste for Corpora. A tribute to Professor Sylviane Granger. Benjamins, pp. 257–281.
Rundell, M. (2012) The road to automated lexicography: an editor’s viewpoint. In Granger, S. & Paquot, M. (eds) Electronic Lexicography. Oxford: Oxford University Press, pp. 15-30.
Rundell, M., Jakubíček, M. & Kovář, V. (2020) Technology and English Dictionaries. In Ogilvie, S. (ed.) The Cambridge Companion to English Dictionaries, pp. 18-30.
Shanahan, M. (2022) Talking About Large Language Models. Retrieved from https://arxiv.org/abs/2212.03551 (last access: 10.05.23).
Suchomel, V. (2021) Genre Annotation of Web Corpora: Scheme and Issues. In Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1, pp. 738-754.