
**Chapter 36** **Technology in Standardized Language Assessments** **by Micheline Chalhoub-Deville**

Advances in computer technology have changed many aspects of our lives, including education and everyday activities. The impact is also observable in second language assessment, where computer-based tests (CBT) and computer-adaptive tests (CAT) have become common (Provenzo, Brett, and McCloskey 1999). Researchers have categorized the changes introduced by technology in terms of two distinct outcomes: Type I or "sustaining" versus Type II or "disruptive" (Maddux 1986; Christensen 1997). Type I, or sustaining, technology refers to using the computer to administer learning drills. Type II, or disruptive, technology helps accomplish something that was not feasible previously, such as using computers as tools for new models of instruction (e.g., in distance learning). CBTs are more flexible and individualized: they track students' performance, provide immediate test feedback, introduce new item/task types, and enhance test security. The most attractive aspect of disruptive technology is the adaptive approach, in which CATs tailor item difficulty to test takers' performance. There is now a tendency to promote assessment that draws on the disruptive type of technology.
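To make the adaptive idea concrete, the sketch below illustrates one common way a Rasch-based CAT loop can be organized: administer the unadministered item whose difficulty is closest to the current ability estimate, score the response, and update the estimate. This is a minimal illustration, not the algorithm of any test discussed in this chapter; the item bank, step sizes, and simulated test taker are invented for the example.

```python
import math
import random

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def next_item(theta, item_bank, administered):
    """Pick the unadministered item whose difficulty is closest to theta
    (a simple information-maximizing rule for the Rasch model)."""
    candidates = [i for i in item_bank if i not in administered]
    return min(candidates, key=lambda i: abs(item_bank[i] - theta))

def run_cat(item_bank, answer_fn, max_items=10, theta=0.0, step=0.8):
    """Minimal adaptive loop: nudge theta up after a correct answer and
    down after an incorrect one, shrinking the step as evidence accumulates."""
    administered = []
    for _ in range(max_items):
        item = next_item(theta, item_bank, administered)
        administered.append(item)
        correct = answer_fn(item, item_bank[item])
        theta += step if correct else -step
        step *= 0.7
    return theta, administered

# Illustrative item bank: item id -> Rasch difficulty (logits)
bank = {f"item{i}": d for i, d in
        enumerate([-2.0, -1.2, -0.5, 0.0, 0.4, 0.9, 1.5, 2.1, 2.8, -0.9, 1.1, 0.2])}

# Simulated test taker with a true ability of 0.7 logits
simulee = lambda item, b: random.random() < rasch_prob(0.7, b)

theta_hat, sequence = run_cat(bank, simulee)
print(f"Estimated ability: {theta_hat:.2f}")
print("Items administered:", sequence)
```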

**Second Language CBT Instruments**

There are many tests that use the sustaining type of technology. The Brigham Young University (BYU) assessments are among the first CBTs developed in the L2 field (Larson 1987, 1989; Madsen 1991). The instruments include placement tests in French, German, Spanish, Russian, and English as a second language. They measure learners' abilities in grammar, reading, and vocabulary, and also include a listening comprehension component. The items are mainly of the selected-response type, such as multiple choice. The BYU instruments use an adaptive algorithm based on the Rasch item response theory (IRT) model.

Another French placement test is the Computer Adaptive Proficiency Test (CAPT), developed by Laurier (1991, 1999) at the University of Montreal. It includes multiple-choice items to assess test takers' reading and listening comprehension and lexical and grammatical knowledge, along with a self-assessment of oral skills. A three-parameter IRT model and a graded-response model are used for the adaptive algorithm.

Dunkel (1997, 1998, 1999) at Georgia State University has developed a Hausa and an ESL CAT to assess test takers' listening comprehension, using Rasch estimation procedures for the adaptive algorithm. The items are selected-response, e.g., multiple choice, matching, and identifying appropriate elements in a graphic.

Southern Illinois University also developed an ESL placement CAT (Shermis 1996; Young et al. 1996) designed to assess test takers' reading comprehension; it was used to place students as they moved from one course to another. Multiple-choice items were used, and the Rasch model was employed for the adaptive algorithm.

Since 1998 the Test of English as a Foreign Language (TOEFL) has been administered in two versions, CBT and paper and pencil (P&P). It measures the grammar, listening comprehension, and reading comprehension abilities, as well as writing in the case of the CBT TOEFL, of non-native speakers of English seeking admission to postsecondary institutions in North America. The items in both versions are selected-response, mainly multiple choice, and the test has been used for selection and placement purposes. Advances are still needed to move these kinds of assessments toward Type II, disruptive, technology.
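The instruments above rely on different IRT models for their adaptive algorithms. The short sketch below simply contrasts the Rasch (one-parameter) and three-parameter logistic response functions so the difference is visible; the parameter values are illustrative and are not drawn from any of these tests.

```python
import math

def rasch(theta, b):
    """Rasch / 1PL: success probability depends only on ability minus difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def three_pl(theta, a, b, c):
    """3PL: adds a discrimination parameter (a) and a lower asymptote (c) for guessing."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Illustrative comparison for one item (difficulty b = 0.5) at several ability levels;
# c = 0.2 corresponds to the chance floor of a five-option multiple-choice item.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    p1 = rasch(theta, b=0.5)
    p3 = three_pl(theta, a=1.3, b=0.5, c=0.2)
    print(f"theta={theta:+.1f}  Rasch={p1:.2f}  3PL={p3:.2f}")
```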

**Representation of the Construct**

Researchers argue persuasively that standardized tests are not based on well-articulated theories of the L2 construct, given that the skills are presented separately in those tests (Bernhardt 1991, 1999; Buck 1994; Grabe 1999). They contend that the L2 construct is multidimensional and involves various interacting components (including knowledge of the language system, knowledge of the world, and knowledge of the particular situation of language use) and processes (a variety of strategies and processing skills to access, plan, and execute communicative intents). Researchers also state that ability levels should be taken into account when designing tests, since the language abilities of more proficient test takers differ from those of less proficient ones. Test takers at different ability levels differ in their command of linguistic and nonlinguistic concepts and knowledge, in the metacognitive processes they employ, and in the degree of automaticity with which they govern these processes. For example, while phonographic features and word recognition are more critical in the early stages of reading development, in more advanced stages complex syntactic attributes of the text become more important. Moreover, computers can collect reading protocols (Bernhardt 1991) or summary tests (Taylor 1993). Computerized scoring templates can help in assessing writing by capturing outlines and drafts, uses of the spell checker, dictionary look-ups, references to grammatical help, and the time spent on each of these activities. Current CBTs emphasize the separation of language constructs, which is inconsistent with real language use: Widdowson (1978) observes that "conversations" involve listening and speaking, and "correspondence" involves reading and writing. In sum, a focus on the integration of skills would provide a more meaningful and adequate representation of language use.
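As one hedged illustration of the kind of writing-process data the paragraph has in mind, a computer-based writing task could log timestamped events and summarize how long a test taker spent in each activity. The event categories and structure below are assumptions made for this sketch, not part of any existing CBT.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class WritingEvent:
    """One logged action during a computer-based writing task (hypothetical schema)."""
    activity: str      # e.g. "outlining", "drafting", "spell_check", "dictionary"
    start: float       # seconds from task start
    end: float

def time_per_activity(events):
    """Aggregate total seconds spent on each activity category."""
    totals = defaultdict(float)
    for e in events:
        totals[e.activity] += e.end - e.start
    return dict(totals)

log = [
    WritingEvent("outlining", 0, 180),
    WritingEvent("drafting", 180, 900),
    WritingEvent("dictionary", 900, 940),
    WritingEvent("drafting", 940, 1500),
    WritingEvent("spell_check", 1500, 1560),
]
print(time_per_activity(log))
```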


**Item/Task Construction**

When developing a meaningful task, certain ability features should be engaged by the task. Bachman and Palmer suggest a framework of "distinctive task characteristics" to aid in developing L2 tasks. The framework includes the following features:


 * The setting
 * The characteristics of the input and the expected response
 * The relationship between input and expected response

Computer technology also plays an important role in the construction process. The regional generative model (RGM) is an important example: it helps developers create task prototypes. The RGM can be used for various types of task development, such as multiple-choice items or simulations (role plays, interviews, etc.). Simulation tasks are valuable because they give task developers a chance to create activities that more closely resemble real-life situations.
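As a hedged illustration of the generative idea, a prototype multiple-choice item might be defined once and instantiated many times by filling slots and sampling distractors. The template, settings, and option texts below are invented for this sketch and are not drawn from any RGM implementation.

```python
import random

# Hypothetical item prototype: a stem with slots plus keyed responses and distractors.
PROTOTYPE = {
    "stem": 'You are {setting}. Choose the most appropriate reply to: "{prompt}"',
    "settings": ["at a job interview", "ordering food in a cafe", "asking for directions"],
    "prompts": {
        "at a job interview": "Why do you want this position?",
        "ordering food in a cafe": "What can I get you today?",
        "asking for directions": "Where are you trying to go?",
    },
    "keys": {
        "at a job interview": "I admire your company's work and want to contribute to it.",
        "ordering food in a cafe": "I'd like a small coffee and a sandwich, please.",
        "asking for directions": "I'm looking for the central train station.",
    },
    "distractor_pool": [
        "The weather has been lovely lately.",
        "My cousin lives near the airport.",
        "Blue is my favourite colour.",
    ],
}

def generate_item(proto, n_distractors=3):
    """Instantiate one multiple-choice item variant from the prototype."""
    setting = random.choice(proto["settings"])
    stem = proto["stem"].format(setting=setting, prompt=proto["prompts"][setting])
    options = [proto["keys"][setting]] + random.sample(proto["distractor_pool"], n_distractors)
    random.shuffle(options)
    return {"stem": stem, "options": options, "key": proto["keys"][setting]}

print(generate_item(PROTOTYPE))
```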

**Rater Variability**

When speaking about language performance assessment, we should also touch upon the role of rater variability. The evidence in this area is mixed. Some studies report a high level of rater agreement (Dandonoli and Henning 1990), while others point to major divergence between raters' scores (Brindley 2000). This divergence may be accounted for by numerous factors, such as raters' prior experience, subconscious expectations, and subjective attitudes (Brown 1995). Lumley and McNamara (1995) note that rater behavior may change over time. McNamara (1996) believes that the problem of rater differences may be addressed through measurement technology (e.g., many-faceted Rasch analysis) and statistical compensation. Moss, by contrast, proposes a hermeneutic approach, in which judgments of performance are mediated by a social process.
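A minimal sketch of the measurement idea McNamara points to is given below. It models the probability of each score category as a function of examinee ability and rater severity, so that scores from a harsh rater can be interpreted on the same scale as scores from a lenient one. The numbers are invented; real many-facet Rasch analyses also model item difficulty and are estimated from full rating data.

```python
import math

def mfrm_category_probs(theta, severity, thresholds):
    """Category probabilities under a simple many-facet rating-scale Rasch model:
    examinee ability (theta) minus rater severity, with step thresholds tau_k."""
    logits = [0.0]            # category 0 has an empty cumulative sum
    running = 0.0
    for tau in thresholds:
        running += theta - severity - tau
        logits.append(running)
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative values: one examinee (theta = 1.0) rated on a 0-4 scale by two raters.
thresholds = [-1.5, -0.5, 0.5, 1.5]   # step difficulties between adjacent score categories
for name, severity in [("lenient rater", -0.5), ("severe rater", 0.8)]:
    probs = mfrm_category_probs(theta=1.0, severity=severity, thresholds=thresholds)
    expected = sum(k * p for k, p in enumerate(probs))
    print(f"{name}: expected raw score = {expected:.2f}")
```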

**Assessment in the Language Curriculum**

Communicative tests are gradually replacing standardized tests. In the past decade, outcomes-based approaches have become predominant in assessment. These approaches appear advantageous in comparison with standardized testing: they make the link between assessment and learning clearer, the reporting process more transparent, and the communication among stakeholders better. Still, implementing these approaches is not a smooth process; it is difficult to use outcomes statements simultaneously to meet external requirements for accountability and to supply diagnostic information (McCay 2000). Empirical research has also revealed a number of problematic issues pertaining to performance assessment, the main area of concern being the establishment of inter-rater reliability.
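Since establishing inter-rater reliability is named as the main area of concern, the short sketch below shows two common agreement indices: exact agreement and Cohen's kappa, which corrects for chance agreement. The ratings are invented purely for illustration.

```python
from collections import Counter

def exact_agreement(r1, r2):
    """Proportion of cases on which two raters award the same score."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(r1)
    p_obs = exact_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    p_chance = sum((c1[c] / n) * (c2[c] / n) for c in set(r1) | set(r2))
    return (p_obs - p_chance) / (1 - p_chance)

# Invented scores from two raters for ten performances on a 0-4 band scale
rater_a = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
rater_b = [3, 2, 3, 3, 1, 3, 3, 4, 2, 2]
print(f"exact agreement = {exact_agreement(rater_a, rater_b):.2f}")
print(f"Cohen's kappa   = {cohens_kappa(rater_a, rater_b):.2f}")
```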