It's All about Data: Students of Science Meet Language as Data and Gain a Skill for Life
James Thomas, Czech Republic
James is the head of teacher training at the Department of English and American Studies, Faculty of Arts, Masaryk University in the Czech Republic. In addition to standard teacher training courses, he is active in e-learning and corpus work. In 2010 he and his co-author were awarded the ELTon for innovation in ELT publishing for their book, Global Issues in ELT. He is chairman of the corpus SIG of EUROCALL, and a committee member of the biennial conference, Teaching and Language Corpora. His research investigates the application of current British linguistics to the pedagogical use of corpora, and training future teachers to include corpus findings in their lesson preparation and directly with students. In March 2016, he published the second edition of his book, Discovering English with Sketch Engine.
Rules and patterns
Language and metalanguage
Answering your own language questions
In the Impact Project, Interdisciplinary Collaborative Language Course for Science Students (Němcová, Kovaříková) university science students were introduced to the use of corpora as an aid to their writing of academic English. This article outlines the importance of deepening students' language awareness in order to use corpora more profitably. It then demonstrates how students were guided towards answers to questions that inevitably arise in the process of producing language as non-native speakers. The article goes on to show how teachers use corpora to identify non-standard constructions while correcting students' written work, and how video is used to share this error correction with students. This use of video exemplifies a variety of uses of corpus which the students see in reference to their own texts.
The hundreds, if not thousands, of people I have introduced corpora to in the last fifteen years have quickly embraced the notion of language as data. These people include language students and teachers, non-philology science students and academics, and people simply curious. It is immediately clear to them that corpora are a rich resource.
Teaching and training are strong learning experiences. The issues that have to be addressed when training people to use corpora include matters related to access to corpora, to the complexities of forming searches and drawing conclusions from the data, to dealing with the idiomatic and rare language that appears in the search results, and to the inevitable deviations from standard English. It turns out that these are relatively insignificant when compared to the issue of one's conception of language itself: the distinction between rules and patterns is germane.
Grammar rules are invisible linguistic abstractions expressed in metalanguage. In standard teaching materials they are often exemplified in simplistic language that does not always represent their typical behaviour; see, for example, Hanks' deconstruction of the 'syntactically perfectly well-formed sentence', Matilda saw an ant sitting on a peacock (Hanks 2012).
Swan points out, "Grammar rules provide a (largely illusory) sense of security, standing out as signposts in the complicated landscape of language learning." Grammar rules are the very stuff of grammar books and course books, which are the primary exposure to language for many learners.
The connotations of rule manifest through the words that typically occur with it, its collocates, as can be seen in this word sketch: http://ske.li/bawe_rule_ws. Somebody makes, lays, sets and implements rules, while we obey, follow, apply rules. And we break them – for which there are consequences. Who made grammar rules and what did they base them on? Do they adequately describe the language? How do learners apply them? Can any word of the appropriate part of speech occupy any syntactic slot?
An alternative to rules is patterns, though for anyone learning a foreign language in an educational setting, it is fairer to say that patterns are a supplement to rules. Patterns operate at all language levels from morphology through words and phrases to discourse, pragmatics and the rest. Consider, for example, whether numbers are used to start sentences, whether the same things are described as disinterested and uninterested, whether in the beginning and at the beginning are used interchangeably, whether we harness data, whether women can be handsome, which phrasal verbs are acceptable in academic writing, whether to use to photograph or to take a photo, which collocates of target words are worth learning, whether the noun evidence is used similarly in research and law, and thousands of etceteras. Few of these questions can be answered using grammar books and dictionaries, yet the answers are important to whoever needs them, whenever they need them. The most useful answers express what is said rather than what can be said, and this distinction encapsulates the essential difference between rules and patterns.
Patterns emerge from the language used by millions of individual native speakers on a day-to-day basis in situ. Since people's idiolects have been formed by multiple encounters with the language across their whole lives, we could expect more reliable answers to our questions if we consulted the language of thousands of them and used the data to identify typical language behaviour. This can be done using corpora, which are databases of language that was produced for normal communicative purposes, and that has been sampled and stored. As Sinclair said, "The language looks a lot different when you look at a lot of it at once." (1991:100). We can take Sinclair's comment more literally than he probably intended it when we scroll up and down corpus data that has been sorted: the patterns are actually visible. For example, some of the patterns of the word priority emerge when the sampled language is sorted alphabetically to the left, as here: http://ske.li/bawe_priority. In the left panel of the priority corpus page, clicking on Right under Sort reveals other patterns of normal usage that priority works in.
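The effect of sorting can be sketched in a few lines of Python. This is only a minimal illustration and makes no claim about how Sketch Engine works internally: the sample sentences are invented, and the function names kwic, sort_left and sort_right are hypothetical. The sketch collects each occurrence of a node word with a few words of context, then sorts the lines by the words to the left or to the right.

```python
import re

def kwic(text, node, width=3):
    """Return (left, node, right) context tuples for each occurrence of `node`.

    Tokenisation is deliberately crude and ignores sentence boundaries.
    """
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines

def sort_left(lines):
    # Sort on the word immediately left of the node, then the one before it.
    return sorted(lines, key=lambda line: line[0][::-1])

def sort_right(lines):
    # Sort on the word immediately right of the node, then the next one.
    return sorted(lines, key=lambda line: line[2])

# Invented sample text, standing in for real corpus data.
SAMPLE = (
    "Safety is a high priority for the team. "
    "We give priority to patient care. "
    "Funding remains a top priority in the plan. "
    "The board gave priority to new research."
)

lines = kwic(SAMPLE, "priority")
for left, node, right in sort_left(lines):
    print(f"{' '.join(left):>25}  {node}  {' '.join(right)}")
```

Sorted to the left, the verbs and adjectives that precede priority line up under one another; sorted to the right, the prepositions that follow it do the same.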
The corpus links above are to data that is housed in and searched using Sketch Engine, which we use at Masaryk University. It is an online corpus tool with dozens of preloaded corpora in many languages. It permits a wide range of search options and some unique tools that process the data. And once data is processed, it is well on the way to becoming information.
Data → Information → Knowledge
An interesting analogy using Lego can be seen here.
In the Impact project, introducing the science students to corpora began with five general language questions that we felt they could answer using their intuition. For example, is whose only used for people? Even pondering them in groups, they were quite unsure. Without any introduction to corpora, they looked at corpus lines via the data projector and observed the patterns that quickly revealed themselves – with a little guidance. http://ske.li/brown_whose_25
The students were then subjected to a short presentation that covered some of the foundation concepts outlined above. Having no concept of language as data, yet having a strong concept of working with data as scientists, they did not need to be convinced of the value of patterns in the data manifesting typical language behaviour, or that this is what they needed to make their own language output more like that of native speakers.
That was the easy part.
Knowing what language questions to ask and then framing them is another matter altogether. The metalanguage they possess comes from their first language, in most cases Czech, and from the English course books that got them to where they are today. These are invariably rule-based, though some give a nod to a lexical view of language, mostly in the guise of collocation. For example, while most learners are familiar with the definite and indefinite article, they are not aware that one of the most frequent patterns in written English, Noun of Noun, is almost always preceded by the. Using this pattern obviates a very common problem for anyone whose first language does not have an article system like English.
It is one thing to tell students to use the before Noun of Noun; it is another to get them to observe the data and draw information from it. This guided discovery approach lies at the core of our attempt to help students to understand the patterned nature of language, and how to derive patterns of normal usage as they confront choice while writing texts.
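The kind of observation involved can be sketched as a small count. The following is a toy illustration only: the tagged tokens are invented and hand-tagged, the tag names are assumptions, and the function name words_before_noun_of_noun is hypothetical. A real corpus query would rely on the corpus tool's own part-of-speech tagging.

```python
from collections import Counter

# Invented, hand-tagged sample: (word, tag) pairs.
TAGGED = [
    ("the", "DET"), ("rate", "NOUN"), ("of", "PREP"), ("change", "NOUN"),
    ("increased", "VERB"), (".", "PUNCT"),
    ("the", "DET"), ("loss", "NOUN"), ("of", "PREP"), ("heat", "NOUN"),
    ("was", "VERB"), ("small", "ADJ"), (".", "PUNCT"),
    ("a", "DET"), ("lack", "NOUN"), ("of", "PREP"), ("oxygen", "NOUN"),
    ("reduced", "VERB"), ("the", "DET"), ("rate", "NOUN"), ("of", "PREP"),
    ("growth", "NOUN"), (".", "PUNCT"),
]

def words_before_noun_of_noun(tagged):
    """Count the word that immediately precedes each Noun-of-Noun sequence."""
    counts = Counter()
    for i in range(len(tagged) - 2):
        (_, t1), (w2, _), (_, t3) = tagged[i:i + 3]
        if t1 == "NOUN" and w2 == "of" and t3 == "NOUN":
            previous = tagged[i - 1][0] if i > 0 else "<start>"
            counts[previous] += 1
    return counts

counts = words_before_noun_of_noun(TAGGED)
print(counts.most_common())
```

With so small a sample the counts prove nothing; the point is that the question "what comes before Noun of Noun?" can be put to data rather than to a rule.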
Faced with choice, one needs criteria. The decision whether to use a phrasal verb or a more Latinate equivalent is a stylistic and register choice. Whether to use it in the active or passive voice is a matter of colligation, i.e. the grammatical company that words keep. This is determined by the patterns of normal usage of the genre as well as discourse and syntactic criteria. In our example, we compare the collocations, perform experiment, carry out experiment.
When writing a scientific article for the general public, the phrasal verb in the active voice is found to be standard. In a corpus of research articles, however, perform an experiment is almost twice as frequent as carry out, and they are used in active and passive voices.
Students performed the searches necessary to complete the table below with the numbers of occurrences. The numbers were interpreted and discussed. In the process, the concordance pages were sorted to reveal patterns, visible patterns, which are now available to them to use confidently in their own writing. Furthermore, observing the subject of the active constructions and the use of by in the passive constructions is also studying the colligation that the collocations keep. Knowing a collocation involves knowing how it is used, as is the case with learning vocabulary generally.
The results in this table are based on the ARC corpus, which is one of Sketch Engine's open corpora. No registration is needed to open the links below.
| | carry out 649 |
| we as subject of active | |
| with passive by | |
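The counting behind such a table can be approximated in a short script. This is a rough sketch under stated assumptions, not the searches the students ran in Sketch Engine: the sentences are invented, the regular expressions are simplifications that miss many real constructions, and the name count_patterns is hypothetical.

```python
import re

# Invented sentences standing in for lines from a corpus of research articles.
SAMPLE = [
    "We performed the experiment at room temperature.",
    "The experiment was performed by two independent teams.",
    "We carried out the experiments in triplicate.",
    "All experiments were carried out by the first author.",
    "The experiments were performed twice.",
]

# Simplified patterns for the inflected forms of each verb.
VERBS = {
    "perform": r"perform(?:s|ed|ing)?",
    "carry out": r"carr(?:y|ies|ied|ying) out",
}
BE = r"(?:is|are|was|were|been)"

def count_patterns(sentences, verb_re):
    """Count sentences with 'we' as active subject, passives, and passives with by."""
    patterns = {
        "we_active": re.compile(rf"\bwe\s+{verb_re}\s+(?:\w+\s+)?experiments?\b", re.I),
        "passive": re.compile(rf"\bexperiments?\s+{BE}\s+{verb_re}\b", re.I),
        "passive_by": re.compile(rf"\bexperiments?\s+{BE}\s+{verb_re}\s+by\b", re.I),
    }
    return {name: sum(1 for s in sentences if p.search(s))
            for name, p in patterns.items()}

results = {verb: count_patterns(SAMPLE, verb_re) for verb, verb_re in VERBS.items()}
for verb, counts in results.items():
    print(verb, counts)
```

A corpus tool does this far more reliably, since it works with lemmas and grammatical tags rather than surface strings, but the sketch shows what is being counted in each cell.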
Developing their search skills can only go hand in hand with developing their linguistic awareness, which crucially involves terminology related to vocabulary and semantics. Rule-based language teaching resources do not usually include chunk, hypernym, delexical verb, collocation, colligation, fuzzy, probable or the most pertinent, pattern. A grasp of these concepts is at least as empowering as articles, indirect speech, conditionals and the other standard rule-based fare.
During the one-semester course, worksheets containing typical language questions were prepared with screenshots and instructions, some of the questions deriving from errors in the students' writing. The PDF can be downloaded from here. We are now developing a set of instructional videos to accompany the book, Discovering English with Sketch Engine (Thomas 2016).
The students also received video feedback on their written work. The main task the students undertook was the writing of abstracts. These were submitted in MS Word and corrected with Track Changes. This allowed the students to see their original and the corrections and suggestions at the same time. Upon receiving their work back, they could compare the 'before and after' versions, and decide which of the teacher's suggestions to accept and reject to make their final copy. The process of commenting on their work was captured using JING, a screencast program which can be downloaded here.
Not only did the students see suggestions in Word, but they could also listen to the teacher commenting on language issues. Furthermore, the process of making these mini-videos permits pausing, during which the teacher found corpus examples, online dictionary pages, and demonstrated how these resources can be gainfully employed at specific points in their writing.
Watching a video of a teacher using corpora to demonstrate a language pattern is another form of training students in asking, searching and interpreting. Click here to see one such video. At about 1'45", there is a demonstration of the Noun of Noun in a scientific domain.
There is a general consensus that using corpora to observe language patterns that can be employed in one's writing is worthwhile. This consensus can be found among corpus linguists working in language education, among teachers and among students. Experience demonstrates that there are some obstacles that need to be overcome, such as access to corpora and learning to use the software that searches corpora and presents the data. Different software offers different views on the data and this impacts on the language features that are observed. However, understanding the nature of the language well enough to ask questions appropriately underpins the identification of useful, transferable linguistic patterns.
Science students are not learning English as an object of linguistic fascination – English permits them to participate in the international spheres of their academic field. The more fluent, accurate, sophisticated and idiomatic (FASI) their English, the higher the level of participation they can enjoy.
Corpus training is intended to provide them with a skill for life.
Hanks, P. (2012) How people use words to make meanings: semantic types meet valencies. In Input, Process and Product: Developments in Teaching and Language Corpora. Eds Thomas, J., Boulton, A.
Sinclair, J.M. (1991) Corpus, Concordance, Collocation. OUP
Swan, M. (2006) Teaching Grammar – does grammar teaching work? Published in Modern English Teacher 15/2, 2006, accessed at the author’s website:
Thomas, J. (2016) Discovering English with Sketch Engine. (2nd ed.) Versatile.