
Dec 2018 - Year 20 - Issue 6

ISSN 1755-9715

Meeting the Needs for Computer-based Testing at University Level

Carmen Argondizzo is Professor of English Linguistics at the University of Calabria (Italy) where she teaches students majoring in Economics and Political Science. Her research interests focus on discourse analysis in the field of Language for Academic Purposes and the related pedagogical implications, considered through a humanistic perspective. Her recent publications, which appear in the Peter Lang Linguistic Insights series, include volumes on Creativity in Language Education. Email:

Jean M. Jimenez is a researcher at the University of Calabria, Italy, where she teaches English for Academic Purposes and English for Specific Purposes. She holds a PhD in Applied Linguistics from Lancaster University, UK. Her research interests include Second Language Acquisition, Corrective Feedback in Computer Assisted Language Learning, and Testing. She has presented papers at international conferences in Europe and North America.

Ian M. Robinson is a researcher at the University of Calabria in Italy. He has been involved in testing and test writing for many years in Italy and Japan, and has written and presented on these and other subjects. His other academic interests include the use of oral presentations in English lessons as well as creativity in ESP. Email:

NOTE. Although the authors cooperated fully in preparing and writing this article, they individually devoted more time to the following sections: Carmen Argondizzo Introduction, The learners’ feedback, Concluding remarks and future perspectives; Jean Jimenez The context, The challenge; Ian Robinson The Team’s solution.



Language testing plays an important role in foreign language teaching and learning and, although at times problematic and quite challenging, tests “are here to stay” (Shohamy, 2001, p. 160). Indeed, there is a close connection between language learning and assessment throughout the entire teaching activity, and it is thus wise to adopt pedagogical techniques that balance these two related activities. Being a good evaluator implies being aware of what happens in class, in terms of the objectives learners need to achieve and the good practices teachers should apply. Testing should therefore consider both the on-going learning process learners go through during the different stages of a language course, i.e. formative assessment, and the end-of-term competences they are able to achieve, i.e. summative assessment (Argondizzo, 2010).

Moreover, testing involves compromises between what is desirable and what is feasible; what is valid and what is reliable. Throughout the years, the approaches to EFL testing have changed as new ideas have come along. One aspect that has brought in a radical change is the use of information and communication technologies (ICT) in testing. Many, if not all, of the international language tests use computers as their mode of delivery in order to offer tests which are easier to administer, save time, and guarantee a higher level of reliability. The challenge arises, though, when an individual university language centre with limited resources wants to change from pen and paper-based testing to a computer-based test. This was the case of the Language Centre at the University of Calabria, which decided to start implementing computer-based tests a few years ago. This article discusses the changes involved in the process from both a didactic as well as a technological perspective. It also outlines the challenges met by the instructors and the ad hoc Test Development Team, created in order to support the motivation and need for this evolution, and how these challenges were faced so as to make the most of the opportunities that computer-based testing affords. It will also briefly focus on the learners’ reaction to the new testing system.


The context

Most undergraduate programmes at Italian universities include EFL courses, the contents of which are often decided on by each individual department and/or by individual lecturers. This means that, not only is there a great variety in the levels and syllabi of these courses as well as in the number of credits awarded, there is also a difference in how language competence is assessed. In fact, test developing and test writing at universities often occur on a small scale, with lecturers in different departments working autonomously to either select or develop their end-of-course achievement tests (Jimenez & Rizzuti, 2009). This creates a non-homogeneous language course offer within a single institution and makes it difficult to determine what exactly is meant by, for example, an ‘English One’ course. With recent university reforms and calls for greater transparency as well as the need to facilitate student mobility, this situation has started to change.

In 2010, as part of language courses on offer (OLA - Offerta Linguistica di Ateneo), the Language Centre at the University of Calabria set out to organize an English for Basic Academic Skills course which would be suitable for students across all departments at the university. The aim was to harmonize the ‘English One’ courses and methods of evaluation, while preparing students for the more specific language courses that they are then often required to attend as part of their degree course. This means that most of the students are in their first year at university when they do the OLA course, although some are in their second year. The students come from all of the fourteen departments at the University of Calabria and from nearly all of the various undergraduate degree courses on offer.

In addition to creating a common course syllabus for all of the degree courses, it became necessary to develop a new test format to be used as an achievement test, the level of which was set at a B1 lower CEFR level to realistically reflect the context in which the university is situated (Robinson, 2015). This was initially done as a paper and pen exam, but after a couple of years the Language Centre decided to transform the exam into a computer-based one. The present article discusses this change.


The challenge

Creating effective tests is a complicated issue which involves compromises between what is desirable and what is feasible (Alderson, Clapham, & Wall, 1995). Much of the traditional canon of testing and assessment literature (e.g., Hughes, 1989; Bachman, 1990; Alderson et al., 1995; Fulcher & Davidson, 2007; Bachman & Palmer, 2010) concentrates on the construct and validity of tests and item writing, giving examples of task types (e.g., multiple choice, banked or unbanked gap filling, sentence transformations, etc.). There is a small but growing literature on computerizing tests, which includes an article by Shilova, Artamonova and Averina (2014) on the use of computer-based tests during a course. In their article, Shilova et al. (2014) state that “along with apparent advantages, implementation of computer-based testing also increases the teacher’s working load” because of the need to compile enough material “relevant to the learning objectives” (p. 436). As mentioned previously, a great deal of language testing is carried out by individuals, or small groups, preparing and administering their own tests. Some writers (e.g., Cunningham, 1998) feel the quality and reliability of items in teacher-produced tests is not always as high as that of standardized tests written by specialists. Yet this does not have to be the case: teacher-produced tests can achieve reliable and valid items, with the added advantage of being specifically targeted to the needs and context of the local test takers. How to achieve this, however, can be problematic. “The best solution is to have item-writing teams consisting of professional item writers and suitable experienced teachers” (Alderson et al., 1995, p. 41).

One aim of the project at the University Language Centre was to coordinate this practice and build upon it in a structured manner. With the aid of computer programming experts and freely available open source programs, it became possible to transfer the paper and pen-based tests to computer-based ones. The challenge was how to give more value to the best components of the old format by making use of what this technology could offer, taking into consideration the limited financial resources available. In order to accomplish this task as effectively as possible, a Test Development Team composed of language researchers, postgraduate specialists in testing, and language instructors was created.

Computer Assisted Language Learning (CALL) has shown some of the advantages of using ICT in language teaching (see, for example, Chapelle, 2009; Chapelle & Jameison, 2008; Hockly, 2015; Ng, 2015; Stanley, 2013; Stevens, 2010; Thomas, Reinders, & Warschauer, 2014); the challenge for the Test Development Team was how best to use this technology for language testing. Colborne (2011) states that interaction design is essential and that computer pages need to be simple and user-friendly. This applies to the pages of a computer-based language test as well: it is essential that candidates find the computerized exam as user-friendly as possible, otherwise computer skills become an integral part of the test, which may present a problem with validity. All of this had to be taken into account when devising the new test format.

As highlighted above, testing does not only involve instructors and candidates. There are also test writers, invigilators, and test scorers. Great care must be taken to develop tests which guarantee validity and reliability, especially when dealing with high stakes tests such as those at university. Validity refers to “the extent to which a test measures what it is intended to measure” (Alderson et al., 1995, p. 6). For example, if we are assessing language ability, the test should not require special knowledge of the topics dealt with in the test, nor should it require computer skills. Reliability, on the other hand, refers to the consistency of test results. Important decisions must be made about the content (i.e., tasks, topics, types of texts), the format, the level, the timing, and the scoring guidelines. The topics should be interesting and motivating, and geared to the needs of tertiary students. Once the items have been written, they need to be proofread, edited, piloted, and then edited again. It is also important that the tests reflect the teaching approach adopted in the classroom. The general approach among language instructors at the University of Calabria reflects the post-communicative language approach, which ‘maintains the position that the primary function of language is effective communication (...). But it allows a much larger role for procedures such as explicit teaching of grammar, vocabulary, pronunciation and spelling, including form-focused (...) exercises’ (Ur, 2012, p. 8). The Language Centre tests were designed to be consistent with the Council of Europe’s description of language competences: “Communicative language competence can be considered as comprising several components: linguistic, sociolinguistic and pragmatic” (2001, p. 13).
The first of these includes “lexical, phonological, syntactical knowledge and skills and other dimensions of language as a system”; sociolinguistic competences refer to such ideas as “rules of politeness, norms governing relations between generations, sexes, classes and social groups, linguistic codifications of certain fundamental rituals in the functioning of the community”; pragmatic competences “are concerned with the functional use of linguistic resources” (2001, p. 13). Constructing tests along these lines was also meant to have a positive washback effect in the classroom by further encouraging good teaching practices following a communicative approach.

A very important factor in testing is ensuring high interrater reliability, i.e., the scoring consistency between different raters. While problems arise more easily with open questions, in which candidates must produce written language that different examiners rate, closed questions may also present problems due to human error. Scoring a paper-based test can be an extremely time-consuming activity, especially with the high numbers of students often found in university settings. This might mean hours of going through tests, with the possibility that the raters grow tired and start making mistakes. Double-checking the scoring requires additional time and does not always guarantee that mistakes will be found. In addition to saving time, automated correction can reduce human error.

Test invigilating can also be a difficult task, especially when the invigilators have to keep watch over a large group of people taking the same test, making it more difficult to stop people from catching a glimpse of a neighbour’s replies. Another challenge was, therefore, to devise a way to minimize the potential for cheating.

All of these factors had to be taken into consideration in order to guarantee the quality of assessment techniques that the team wanted to apply. The following section will explore the strategies that were adopted.


The team’s solution

The test currently consists of five tasks, which are illustrated in the table below. Due to the university-wide, interdepartmental nature of the OLA course, the topics covered in class, and consequently those included in the test, reflect a range of general academic topics from the following sectors: Social Sciences, Science, and Humanities.


OLA test format

Task 1. Understanding a Graph
Target skills: Facts and figures
Text: Short text (80-100 words) and graph dealing with facts and figures.
Task type: True (T), False (F), Not Given (NG)
Students are asked to:

  • Analyse different types of graphs and their related texts to find specific information.
  • Recognize academic lexicon.
  • Distinguish between statements that are true, false or for which the veracity cannot be judged because the information is not available in the texts.

Task 2. Reading Comprehension 1
Target skills: Text cohesion and analysis of spoken discourse
Text: Five short student profiles (25-30 words) and six short dialogues (40-45 words).
Task type: Matching
Students are asked to:

  • Recognize both similarities related to lexicon as well as cohesiveness between themes that are expressed in two different text types.
  • Identify specific information.
  • Identify lexicon related to their field of study.

Task 3. Language in Context
Text: A text (100-120 words) from which words have been removed.
Task type: Multiple choice (A, B, C, D)
Students are asked to:

  • Identify and use the L2 grammar within a text.
  • Understand the general meaning of the paragraph at different levels: single word, single phrase and multiple phrase sections.

Task 4. Reading Comprehension 2
Target skills: Skimming, scanning and intensive reading
Text: A longer text (370-400 words).
Task type: Multiple choice (A, B, C, D)
Number of items: 5 + 10
Students are asked to:

  • Identify the main idea of a paragraph and link it to the appropriate heading.
  • Understand relations between different parts of a text.
  • Find and understand the main idea of a text.
  • Find and understand specific information.

Task 5. Text Completion
Text: A semi-formal or informal email in an academic context (150-160 words).
Task type: Unbanked gap-fill
Students are asked to:

  • Provide lexical, grammatical, and thematic items within the context of a longer text that requires understanding at a sectional level.
Many of the tasks in the computer-based test are the same as those used in the paper and pen format. However, the aim of the project was to use the computer to its advantage, which included being able to build up a data bank of test items for the different tasks. This would then allow the computer to generate different tests at random during the same test session, thereby reducing the problem of students cheating during exams by collaborating with each other. At the same time, it became essential to ensure that the various items in the test bank were all of the same level, since the tests would be individualized for each student. This was done at the item development stage by making sure that the whole team had a clear and internally consistent appreciation of the B1 level, through constant reference to the Common European Framework of Reference guidelines and through mini workshops. Once this process had been carried out, it became possible to randomize the test.
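Purely as an illustration, the per-candidate generation described above can be sketched as follows. The section names mirror the OLA format, but the bank contents, function names and one-variant-per-section logic are our simplification, not the Language Centre's actual software:

```python
# Hypothetical sketch of drawing an individualized test from an item bank.
# Section names follow the OLA format; the variants listed are invented.
import random

ITEM_BANK = {
    "Understanding a Graph": ["graph_task_01", "graph_task_02", "graph_task_03"],
    "Reading Comprehension 1": ["matching_01", "matching_02"],
    "Language in Context": ["cloze_01", "cloze_02", "cloze_03"],
    "Reading Comprehension 2": ["reading_01", "reading_02"],
    "Text Completion": ["email_01", "email_02"],
}

def generate_test(seed=None):
    """Pick one task variant per section, keeping the fixed section order."""
    rng = random.Random(seed)
    return {section: rng.choice(variants) for section, variants in ITEM_BANK.items()}

# Two candidates in the same session receive independently drawn tests
print(generate_test(seed=1))
print(generate_test(seed=2))
```

Because all variants in the bank have been calibrated to the same level, any draw yields a test of comparable difficulty.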

Specifically, this randomization means that there can be one format, keeping the same order in which the tasks are presented to the candidates so as not to alter the construct, while allowing for the insertion of different test content. In addition, within each task, items can also be randomized. Therefore, even if two candidates happen to have the same Graph Task, they will not find the items in the same order. For example, the sentence used as item 1 in one test might be number 8 in another. For the Language in Context Task, by contrast, it is not possible to randomize the order of the items, as they are part of a text and must appear in the same order. However, the options (A, B, C, D) can be randomized so that, for example, the option given as A in one test could be C in another. The same rationale is used for the items in the Reading Comprehension 2 Task. The Reading Comprehension 1 Task, with its student profiles and connected dialogues, can have the order of the profiles randomized as well as the order of the dialogues.
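The option shuffling described for the multiple-choice tasks can be illustrated with a short sketch. The item, options and function name are invented for illustration, but the principle is the one described above: the answer key must follow the correct option wherever it lands.

```python
# Illustrative sketch: shuffling multiple-choice options while tracking the key.
import random

def shuffle_options(options, correct_index, rng):
    """Return the shuffled options and the new position of the correct one."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # The key is wherever the originally-correct option ended up
    return shuffled, order.index(correct_index)

rng = random.Random(7)
options = ["looked", "looking", "looks", "to look"]  # invented cloze options
shuffled, key = shuffle_options(options, correct_index=1, rng=rng)
assert shuffled[key] == "looking"  # the key follows the option in every shuffle
```

Two candidates thus see the same four options in different orders, with the system holding a different key letter for each of them.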

It was also important to assess productive skills such as writing. Although it would have been possible to ask candidates to write a short essay or an email, marking these tasks would be very time consuming. In addition to the time and effort necessary to draw up clear marking criteria and scoring grids, assuring that the same standards are used by scorers requires long hours of training, returning to the aforementioned interrater reliability. Hence, at least for the time being, a decision was made not to include a free writing task, but rather a guided writing task in the form of an unbanked gap fill which requires candidates to write in the missing words. Whereas the computer system can automatically score the tasks previously illustrated since there is only one correct answer, this Text Completion Task requires human intervention in the scoring. Although great care is taken in trying to create an exhaustive list of possible correct responses to be included in the answer bank, students can often surprise us with their ingenuity. However hard we try, we can never match the creativity of language learners and one or two other correct answers might appear. For example, in this invented item ‘I am looking forward to __________ you next week’, we could have “seeing”, “meeting”, “visiting” or “calling” as correct answers. We could also imagine that candidates might write “see”, “meet”, “talking” or “watch” as incorrect answers. All of these can be uploaded onto the system to facilitate automatic correction. If any of these appears as an answer, the system knows how to score it. However, it is possible that a candidate could write “kissing” as the reply. In the context of this invented email, this could be correct, but the system would not recognize it as such. It is therefore necessary for a human scorer to enter the system and add such replies to the answer bank so that this candidate, as well as any future candidates, is awarded the corresponding point. 
The same is done when the candidate answers “elephant” or some such obscure word that for some reason seemed appropriate to them. Obviously, here no points are awarded, but the answer is added to the list of incorrect replies. The more complete the list of correct and incorrect answers is, the more efficient the automatic correction process becomes. Although this operation does take a little time, it was deemed important for the candidates to have to produce some language rather than always clicking on a pre-set menu. This allows the test to assess a candidate’s active production of the language, even if only to a limited extent.
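A minimal sketch of this answer-bank scoring, built around the invented "looking forward to ___ you" example above, might look as follows. The three-way outcome (correct, incorrect, flagged for human review) and all names are our assumptions, not details of the actual system:

```python
# Minimal sketch of answer-bank scoring for an unbanked gap-fill item.
# Bank contents come from the invented example in the text above.
ANSWER_BANK = {
    "correct": {"seeing", "meeting", "visiting", "calling"},
    "incorrect": {"see", "meet", "talking", "watch"},
}

def score_gap(answer):
    """Score automatically when the reply is known; otherwise flag it for a human."""
    answer = answer.strip().lower()
    if answer in ANSWER_BANK["correct"]:
        return 1
    if answer in ANSWER_BANK["incorrect"]:
        return 0
    return None  # unseen reply: a human rater decides and extends the bank

def add_to_bank(answer, is_correct):
    """A human rater's decision benefits this candidate and all future ones."""
    key = "correct" if is_correct else "incorrect"
    ANSWER_BANK[key].add(answer.strip().lower())

print(score_gap("seeing"))   # 1
print(score_gap("kissing"))  # None: flagged for human review
add_to_bank("kissing", is_correct=True)
print(score_gap("kissing"))  # 1 once the bank has been updated
```

As the text notes, the fuller the two lists become, the less often a human needs to intervene, so automatic correction grows more efficient with every session.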

All this was done bearing in mind Colborne’s (2011) suggestion to keep things as simple and uncluttered on the screen as possible. This means that ease of visualization is key; all elements of the task have to be clearly visible for the candidates with no distracting visual noise or problems of opening up extra windows for the texts or the need to scroll up and down the page in a task. For many of these tasks the item choices come as drop-down menus so that candidates just have to click on the appropriate choice, rendering the operation as user-friendly as possible. Each candidate’s replies are saved at the end of every task before moving on to the next one. All tasks can be revised by the candidates and changes made before finally saving and finishing the test.

Part of the task of developing computer-based tests was also to draw up a protocol for the practical considerations involved in the testing. This is particularly necessary in a university-wide scheme which involves staff from different departments as well as the use of different computer rooms. Hence the invigilators are provided with guidelines to ensure that appropriate procedures are followed and that the test procedure is replicated identically in every test session. The students log onto the Language Centre website, where the tests are generated for that particular session, and a technician is on hand at all times to deal with any technical problems that may arise. The session can only be opened with a one-time password revealed to the candidates before starting. The internet connection on the computers is disabled so that candidates cannot consult online sources. In addition, the use of telephones or other digital devices is strictly forbidden during the test. These and other simple invigilation protocols have been drawn up to ensure consistent and fair test settings for all candidates.
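Purely as an illustration, a one-time session password of the kind described might be generated as in the sketch below; the code length, alphabet and function name are assumptions, not details of the Language Centre's system:

```python
# Hypothetical sketch: generating a one-time code to open a test session.
import secrets
import string

def session_password(length=8):
    """A cryptographically random code, revealed to candidates at session start."""
    alphabet = string.ascii_uppercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))

code = session_password()
print(code)  # a fresh code on every call, so old codes never reopen a session
```

Using the `secrets` module rather than `random` matters here: session codes guard access, so they should not be predictable from earlier ones.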


The learners’ feedback

The shift from pen and paper to computer-based testing occurred through a carefully outlined process. The new system was initially introduced, in an experimental phase, with groups of students from degree courses with relatively low enrolment.

As expected, students did not show any inhibition towards dealing with a test presented on a computer screen rather than on paper. Only at the very beginning (a.y. 2012-2013) did a few learners note that their ability to concentrate had been reduced, since they preferred the traditional approach used in so many other subjects. Being able to reflect while facing a piece of paper, writing and erasing their answers when they were not correct, in some cases seemed reassuring for them. Yet, once the computer-based test was introduced to students from all degree courses, the team observed that, over the years, students adapted easily to the new system, new items and new protocols. From the learners’ perspective, the advantages that could and can currently be observed, from both a technical and a didactic point of view, are several. Among them, the following three aspects are worth highlighting:

  • Nowadays learners do not need to be trained in order to interact with the computer. They are ready, from a technical perspective, for any task they are asked to perform (e.g., gap filling, matching, drag and drop, multiple choice, etc.).
  • Students seem to appreciate tasks that refer to basic skills they need to master in their academic fields of study (e.g., analysing a graph, identifying basic academic language, comprehending and completing texts, such as emails related to their academic life).
  • Test results can be provided in a very short time, thus keeping students’ level of anxiety low.

However, learners’ performance did not show any particular change due to the new testing system in terms of the average level of competences achieved at the end of a language course. This suggests that learning achievements can be assessed validly and reliably through various testing formats, and that the computer-based format has proved as effective as the paper and pen one. Therefore, now that the new format has become consolidated practice, the team can affirm that, while still offering a well-balanced end-of-term assessment, they have created a system that offers many beneficial services, which will be considered in the final section.


Concluding remarks and future perspectives

Currently, twenty-eight degree courses are using the services of the Language Centre and the OLA testing system that it provides, which means that the students in all these courses use the same written test. Although we still do not have complete standardization given that University Credits for the different courses may vary and some courses also involve other forms of evaluation (e.g., a compulsory oral test), the team is working towards harmonizing these aspects as well. Nevertheless, there is greater transparency and understanding within the university regarding what an ‘English One’ course means and the OLA stakeholders have become familiar with all of the aspects of the English for Basic Academic Skills course offered by the University Language Centre.

There are various benefits to using this testing system. One of these is that it has freed up human resources which, in a time of severe financial cuts at Italian universities, have become increasingly valuable. The computerization of tests has meant that less time is spent invigilating and marking tests. Moreover, computerization has offered the opportunity to create a data bank of test items which can be randomized to create multiple individual tests for the candidates involved in a testing session, which has resulted in fewer tests having to be developed.  In the past, once a test had been used, it became redundant. Now the database of tests has a much longer life since even if a student were to re-sit the test, it would be highly improbable that she or he would have the same test items. The testing system has also led to a reduction in cheating as it is statistically unlikely for two students sitting next to each other to have the same test. Finally, by having the computer automatically correct the tests, not only is time saved, but, most importantly, the reliability of the scores increases since human error is, as far as possible, removed.

Based on these positive washback effects and on new departmental requests, the Test Development Team is currently updating the format. Specifically, the test will soon be upgraded to assess the full B1 level as an end-of-term level for Basic Academic English courses. This will involve the introduction of new test items to evaluate learners’ listening competence and to assess guided writing. The test will also include an optional speaking session for those students who want to challenge themselves further. Yet the challenge will grow for teachers as well, given the importance of a meaningful integration between teaching practices and assessment procedures. Teaching practices will need to be adjusted even more closely to the learning outcomes that learners are expected to achieve. Assessment procedures, although based on an admittedly rigid technological system, should still not neglect the importance of blending tests which assess the competences learners achieve at the end of a language course with tests which assess their gradual development of competences. This aspect of teaching/testing practice, which will also encourage learners to self-assess their language progress throughout the course (Argondizzo, 2009), will be part of a future experimental phase that the team will soon develop. This new project will aim to strike an appropriate balance between technology and the humanistic-oriented teaching that the Language Centre will continue to apply within the academic context.

In general, we feel that this has been a beneficial project and one worth sharing with other practitioners: the decisions we have had to make are ones that other individuals or Language Centres might also face, and our experience could help inform them.



Alderson, J.C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.

Argondizzo, C. (2009). Il Portfolio Europeo delle Lingue: tra oblio e voglia di ripresa. In F. Gori (Ed.) Il Portfolio Europeo delle Lingue nell’Università Italiana: studenti e autonomia (pp. 25-37). Trieste: EUT (Edizione Università di Trieste).

Argondizzo, C. (2010). Crescere attraverso un test. Integrare la valutazione sommativa con la valutazione formativa. In: N. Vasta, N. Komninos (Ed.) Il Testing linguistico. Metodi, procedure e sperimentazioni (pp. 50-72). Udine: Editrice Universitaria Udinese.

Bachman, L. (1990). Fundamental considerations in language testing.  Oxford: Oxford University Press.

Bachman, L., & Palmer, A. (2010). Language assessment in practice. Oxford: Oxford University Press.

Chapelle, C. (2009). The relationship between second language acquisition theory and computer-assisted language learning. The Modern Language Journal, 93 (focus issue), 741-753.

Chapelle, C., & Jameison, J. (2008). Tips for teaching with CALL: Practical approaches to computer assisted language learning. New York: Pearson Longman.

Colborne, G. (2011). Simple and usable: web, mobile, and interaction design. Berkeley: New Riders.

Council of Europe (2001). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge: Cambridge University Press.

Cunningham, G. K. (1998). Assessment in the classroom: constructing and interpreting tests. London: The Falmer Press.

Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource book. Abingdon: Routledge.

Hockly, N. (2015). Developments in online language learning. ELT Journal 69/3. Oxford: Oxford University Press.

Hughes, A. (1989). Testing for language teachers. Cambridge: Cambridge University Press.

Jimenez, J., & Rizzuti, D. (2009). An investigation into the factors affecting the content and design of achievement tests administered at university. In M.G. Sindoni (Ed.) Testing in university language centres. Quaderni di ricerca del Centro Linguistico d’Ateneo Messinese, Vol. 2/2008 (pp. 83-100). Soveria Mannelli: Rubbettino.

Ng, W. (2015). New digital technology in education: Conceptualizing professional learning for educators. London: Springer.

Robinson, I. (2015). Uniting to Improve: The OLA experience at UNICAL. In: Paolo E Balboni (Ed.) I "Territori" dei Centri Linguistici: le azioni di oggi, i progetti per il futuro (pp. 86-92). Torino: UTET università.

Shilova, T., Artamonova, L., & Averina, S. (2014). Computer-based tests as an integral component of an EFL course in Moodle for non-linguistic students. Procedia - Social and Behavioral Sciences, 154, 434-436. doi: 10.1016/j.sbspro.2014.10.187

Shohamy, E. (2001). The Power of tests: A critical perspective on the uses of language tests. Harlow: Pearson Education.

Stanley, G. (2013). Language learning with technology: Ideas for integrating technology in the classroom. Cambridge: Cambridge University Press.

Stevens, A. (2010). Study on the impact of information and communications technology (ICT) and new media on language learning. European Commission. Retrieved from. (accessed 29/04/2017).

Thomas, M., Reinders, H., & Warschauer, M. (2014). Contemporary computer-assisted language learning.  London: Bloomsbury 3PL.

Ur, P. (2012). A course in English language teaching. Cambridge: Cambridge University Press.

