Can computer-based testing achieve quality and efficiency in assessment?
Paul Bocij, Advanced Multimedia Ltd.;
|Recording of examination and assessment results|
|Recording of student perceptions of computer-based assessment|
|Recording of student background information|
|Recording of material related to student learning|
|Recording of practical issues related to computer-based assessment|
Selection Of Students
Students were selected by choosing specific modules from the range of undergraduate courses offered by the School of Business and the School of Mathematics & Computing. The modules chosen included elementary courses in Information Systems, Computer Science and Information Technology. In order to ensure that a reasonable sample of data could be collected, it was necessary to ensure a high level of access to target groups. This was achieved by limiting the choice of modules selected to those taught by the researchers. All of the modules selected for both stages of the study followed a common syllabus, sat comparable examinations and received the same number of contact hours.
The students who took part in the first stage of the project were all engaged upon modules that featured an end-of-year examination. It was felt that the choice of modules would allow a degree of generalisation when considering the results of the study. It was also felt that many of the skills provided by the chosen modules could be considered general in nature, providing an opportunity to assess the "core competencies" discussed by writers such as Atkins, Beattie and Dockrell (1993, pp. 31-32).
The modules selected for the second stage covered a range of academic levels and included students from a variety of backgrounds. The chosen modules also reflected a mix in terms of assessment practices including written assignments, multiple choice tests and examinations and allowed a comparison to be made with the subjects examined in the first stage of the study.
Two basic approaches were taken in order to gather data concerning student performance:
Construction Of Computer-Based Assessments
Each computer-based test was checked by the module, course or subject leader prior to use in an iterative process that continued until no further changes were necessary. Most changes concerned colour schemes, titles and instructions. In the case of summative assessments, for example, module leaders often asked for statements concerning the University's examination regulations to be displayed at the start of the program. Where possible, an attempt was made to select questions from sources considered to be reliable, such as previous examination papers. Where it became necessary to construct new questions, reference was made to established sources such as Gronlund (1968), Cunningham (1991), Gibbs, Habeshaw and Habeshaw (1988) and the Oxford Centre for Staff Development (1992). In terms of constructing new questions, generating computer-based tests and computer-based learning techniques, reference was made to sources such as Dempster (1994), Draper, Brown and Edgerton (1994), Field and Chandler (1993) and Craig (1993).
A variety of techniques were used to gather information concerning students' experiences of using the computer-based assessment software. These included pre- and post-test questionnaires, informal review groups that consisted of up to twelve student volunteers and a member of staff, participant observation and informal interviews.
Development Of Questionnaires
In addition to the assessment programs, a number of questionnaires were developed for the project. These were designed to collect basic background information on the students who took part in the study and consisted of pre-test attitudinal measurement, post-test attitudinal measurement and learning style analysis. All questionnaires were constructed according to established guidelines, such as those offered by Bell and Newby (1990), Berdie, Anderson and Niebuhr (1986), Converse and Presser (1986) and Oppenheim (1966). Two of the questionnaires used were adapted from existing sources in order to improve validity and reliability. The first was Honey and Mumford's (1982) "standard" Learning Styles Questionnaire (LSQ). The second questionnaire used an exercise to measure changes in learning and understanding. This involved the creation of pre- and post-test concept maps based around a central concept. The use of concept maps in order to measure various aspects of learning has become relatively popular in recent years. Buzan (1993), the originator of mind maps, argues that this technique provides a quick, efficient and simple way of assessing learning. A particularly effective account of the use of concept maps in relation to computer-based learning is provided by Small and Grabowski (1992), who used this method in order to measure "...depth, breadth and interrelatedness of learning" during a study of information-seeking behaviour. The sections of each questionnaire designed to elicit information concerning prior knowledge and experience of information technology were created in accordance with an "experience" questionnaire developed at the University of Glasgow on behalf of the national Teaching and Learning Technology Project (Arnold, Barr, Donnelly et al., 1994). All questionnaires were piloted using a sample of 45 students.
For the first stage of the study all students were informed that a computer-based test would take place at the beginning of each module. Students were reminded of the test date at regular intervals and were given several opportunities to discuss any concerns they might have. In the first stage, tests took place two weeks before the examination period. Each test took place immediately after a revision tutorial that was used to answer any questions students might have concerning the tutorial and explain the operation of the program. Students were also asked to complete a background questionnaire, pre-test questionnaire and LSQ during the tutorial session. Each group was then escorted to a computer room by the tutor. The tests were carried out in the University's computer rooms under formal examination conditions. In order to deal with the relatively large number of students that took part, several rooms were used simultaneously. A number of security measures were taken in order to prevent cheating during the test. These included staff supervision of assessments, password protection, randomised question order and the encryption of results files. As students completed the test, tutors were asked to audit the condition of each computer room, recording data such as the noise level, the number of faulty machines in the room and the behaviour of students. After finishing their tests, students were asked to complete the post-test questionnaires before leaving the room. A small, informal review group was formed one week following each series of tests. The group consisted of eight students and the meeting was chaired by a member of the project team. Students were asked to give accounts of any general or personal comments concerning the test.
For the second stage of the study, "mock" tests were held early in each module and students were informed that this would allow them to familiarise themselves with the test program before a final, assessed test at the end of the Semester. All of these mock tests were carried out in an informal atmosphere and students were allowed to talk freely or refer to notes. Staff supervised each session in order to deal with any technical problems that might arise and record any observations of interest. All assessed tests took place in the penultimate week of the Autumn Semester. Students were assigned to a computer room and required to attend at a specific time and place. All tests were supervised by two or more members of staff. All assessed tests were carried out under formal conditions and were time constrained. The security measures described earlier were taken to prevent cheating.
All of the data collected was prepared for use with spreadsheet and statistical packages. Data concerning student performance on the computer-based tests was treated using the techniques described by Gronlund (1968, pp. 92-104), who provides several techniques that can be used to gauge the validity and reliability of individual questions and entire tests. Generating a reliability co-efficient, for example, involves comparing the number of items in a test, the mean score for the test and the standard deviation of test scores. This allows a given test to be compared with a range of "standard" values; a conventional examination, for example, will typically have a reliability co-efficient of between 0.65 and 0.75. In general, the reliability of individual test items can be gauged by examining the frequency distribution of responses. As an example, one might identify a poorly-constructed and therefore unreliable question item by examining the number of times a given distractor (possible response) is selected by respondents. A relatively low value might indicate that the distractor is too obviously incorrect to serve as a plausible response. The methods used included frequency distributions (including stanine scores), reliability co-efficient (Kuder-Richardson Formula 21), standard error of measurement, correlation co-efficient (product-moment), and descriptive statistics, such as sample standard deviation and skewness. In addition, further calculations, such as an analysis of variance (ANOVA), were carried out on some sets of data.
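As an illustration of the reliability calculation described above, the following is a minimal sketch in Python of Kuder-Richardson Formula 21 and the standard error of measurement. The function names and the sample scores are hypothetical, and the use of the population standard deviation is an assumption; this is not the software or the data used in the study.

```python
import math
from statistics import mean, pstdev


def kr21(num_items, scores):
    """Estimate test reliability with Kuder-Richardson Formula 21.

    r = (k / (k - 1)) * (1 - M * (k - M) / (k * s^2))
    where k is the number of items, M the mean raw score and
    s^2 the variance of the raw scores.
    """
    k = num_items
    m = mean(scores)
    variance = pstdev(scores) ** 2  # assumption: population variance
    if k < 2 or variance == 0:
        raise ValueError("need at least two items and some spread in the scores")
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * variance))


def standard_error_of_measurement(scores, reliability):
    """Standard error of measurement: s * sqrt(1 - r)."""
    return pstdev(scores) * math.sqrt(1 - reliability)


# Hypothetical raw scores (number of correct answers) on a 20-item test.
scores = [8, 10, 12, 14, 16]
reliability = kr21(20, scores)
sem = standard_error_of_measurement(scores, reliability)
print(f"KR-21 = {reliability:.3f}, SEM = {sem:.2f}")
```

The resulting co-efficient could then be set against the "standard" range quoted above for conventional examinations.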
A group of 130 students, comprising 69 first year students and 61 second year students, completed questionnaires during the first stage of the study. This sample represented approximately one quarter of all first year students and half of all second year students involved on the modules selected. The second stage of the project yielded 338 questionnaires, taken from 218 first year students and 120 second year students. This sample represented approximately half of all students engaged upon introductory computing modules. In all, 468 students completed pre-test questionnaires and 428 students completed post-test questionnaires. The difference in the totals was due to absences, spoiled papers, and so on.
Experience of Information Technology
As might be expected, relatively few students professed to have either very high or very low levels of knowledge and experience of information technology. The distribution of student responses was relatively even, with approximately one third of the entire sample claiming a fair or average knowledge of computing. Extreme values at each end of the scale represented less than 6% of the total sample. In examining questionnaire responses by academic year, it was found that students from the first stage of the study tended to be less confident of their knowledge than students from the second. Additionally, these students made less use of common applications, such as word processing, and were less likely to have access to a computer outside of the university environment. Considering that the majority of students from this group belonged to computing courses, this seemed uncharacteristic of the sample.
Four groups of questions were used to examine student attitudes and perceptions toward computer-based assessment. The summary that follows is based upon all respondents from both stages of the study. For the purposes of this presentation a simple comparison of results both by relative proportion at each test stage and significant changes between test stages is discussed.
The first group of questions recorded student opinions regarding the use of computer-based testing as a method of assessment. Almost half of the sample (46.51%) held opinions neither in favour nor against computer-based assessment. However, more students (40.31%) were disposed toward computer-based assessment than conventional examinations (13.18%). Opinions concerning the "fairness" of computer-based assessment underwent significant changes. Almost a quarter of the sample (24.03%) felt that computer-based tests offered a completely fair way of assessing students. This was a dramatic increase from the 4.65% recorded via the pre-test questionnaire. Amongst those who had previously held neutral opinions, almost half now expressed a different opinion. From the values recorded, it appears that most changed their opinions in favour of computer-based assessment.
CBA as a means of testing
|Scale|Pre-Test (%)|Post-Test (%)|
Range of Skills
The second group of questions asked students to consider the range of knowledge and skills assessed by the examination process and compare this to their perception of computer-based testing. Discounting extreme values, it appeared that student opinion concerning the breadth of knowledge and skills tested by computer-based assessment was quite evenly distributed. However, the data collected did suggest a slight tendency toward the view that computer-based testing is somewhat more thorough and comprehensive than conventional examinations. Smaller changes were observed when students reconsidered the range of knowledge and skills tested by computer-based assessment. In general, the number of those who felt that computer-based assessment tested a larger range of skills than a conventional examination increased. However, the number of students who felt that computer-based assessment tested a small range of skills also increased slightly.
The range of skills assessed by CBA
|Scale|Pre-Test (%)|Post-Test (%)|
|1 (Large Range)|3.08|6.80|
|7 (Small Range)|1.54|0.24|
Method of Measuring Performance
The third group of questions asked students to consider how accurately computer-based assessment might measure knowledge and abilities in comparison to a conventional examination. A large group of students (40.77%) felt that computer-based assessment could be considered as accurate or more accurate than a conventional examination. However, a small number of respondents (1.54%) felt that computer-based assessment provided an extremely low level of accuracy. Marked changes were observed in student opinions concerning the ability of computer-based assessment to measure skills and knowledge accurately. In general, student attitudes shifted toward a belief that computer-based assessment could be considered more accurate than a conventional examination. Perhaps the most extreme change occurred within the group that felt computer-based assessment offered the most accurate means of measurement, where the proportion increased from 0.77% to 13.63%.
The value of CBA as a means of measuring performance
|Scale|Pre-Test (%)|Post-Test (%)|
|1 (Best Measure)|0.77|13.63|
|7 (Poorest Measure)|1.54|0.49|
The final group of questions examined the relatively common belief amongst students that computer-based assessment could be considered an impartial method of assessment. As one might expect, the majority of students felt that a computer-based assessment was likely to be more objective than a tutor-marked examination. In terms of the perceived impartiality of computer-based assessment, almost a quarter (24.82%) of the group came to feel that computer-based tests were completely impartial. Furthermore, the number of those who felt computer-based assessment to be somewhat biased fell quite significantly. Overall, there appeared to be a general shift in attitudes toward a belief that computer-based assessment could be considered more objective than a tutor-marked examination.
The perceived impartiality of CBA
|Scale|Pre-Test (%)|Post-Test (%)|
Additional groups of questions asked students to compare the difficulty of a computer-based assessment with a conventional examination. Approximately 46% of students involved during the first stage felt computer-based assessment to be easier than a conventional examination and less than 18% of the sample felt computer-based assessment to be more difficult. However, during the second stage of the project, the values recorded were 36.36% and 44.83% respectively.
Perceived difficulty of CBA compared to a conventional exam
Perceived Level of Performance
The final items in the questionnaire asked students to compare their performance in the computer-based assessment with how they felt they would perform in a conventional examination. The majority of students (81.39%) felt that their performance on the computer-based assessment was equal or superior to what they might achieve in an examination.
Perceived level of performance using CBA compared to a conventional exam
|1 (Much Better)|7.88|
|7 (Far Worse)|1.43|
In all, 573 individual examination results were recorded. A total of 182 results were recorded from the 1993/4 academic year (stage 1) and 391 were recorded from the 1994/5 academic year (stage 2).
Since the computer-based assessments used were largely composed of multiple choice questions, it was felt that a more accurate comparison of performance could be made by considering CBA scores in relation to the marks achieved in the multiple choice section of the examination paper. A calculation was performed to determine how performance on the multiple choice paper contributed to students' overall examination scores. This was done by calculating the score achieved on the multiple choice paper as a proportion of the examination mark. For the purposes of this paper, the value obtained via this calculation is referred to as the "contribution."
An average mark of 46.7% was recorded for students taking examinations during the 1993/4 academic year. The average mark for the multiple choice section of the examination paper was 54.46%. On average, the multiple choice section of the examination accounted for 29.66% of a student's final mark. In the 1994/5 examinations, the average mark achieved across all groups was 48.75%. The average mark for the multiple choice section of the examination paper was 61.25%. The contribution of the multiple choice paper to the overall mark awarded for the examination was 34.39%. Taking both years together, the average examination mark for the sample as a whole was 47.98%. In terms of the multiple choice section of the examination, the average mark for the sample was 58.71%. The average contribution of the multiple choice paper was somewhat lower than the figures given earlier for individual groups. However, it can be said that the average contribution of the multiple choice paper (32.62%) accounted for approximately one third of a student's final mark for the examination.
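The "contribution" described above can be sketched as follows. This is an illustration only: it assumes that the marks earned on the multiple choice section and the overall examination mark are expressed on the same scale, and the function names and sample figures are hypothetical rather than taken from the study.

```python
def contribution(mc_marks, exam_marks):
    """Return the multiple choice marks as a percentage of the overall exam mark.

    Assumption: both values are marks counted towards the same examination total.
    """
    if exam_marks <= 0:
        raise ValueError("the examination mark must be greater than zero")
    return 100.0 * mc_marks / exam_marks


def average_contribution(results):
    """Average the contribution over (mc_marks, exam_marks) pairs, one per student."""
    values = [contribution(mc, exam) for mc, exam in results]
    return sum(values) / len(values)


# Hypothetical results: (marks from the multiple choice section, overall exam mark).
results = [(15, 50), (12, 48), (20, 50)]
print(f"Average contribution: {average_contribution(results):.2f}%")
```

Averaging the individual contributions in this way will not, in general, give the same figure as dividing the average multiple choice mark by the average examination mark.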
Computer-Based Assessment Results
In the first stage of the study, all computer-based assessments were formative in nature. A total of 102 results were recorded for first year students and a total of 44 results were recorded for second year students. The average score achieved for the group as a whole was 60.43%. The students that took part during the second stage of the study were allowed to sit two different tests. The first was treated as a "mock" test with the intention of allowing students to familiarise themselves with the assessment program prior to sitting the second, assessed test. A total of 284 students sat the first, non-assessed test over a period of one week. The average score achieved was 69.99%.
Within each group, a small number of students took more than one attempt at an assessment. In some cases, students used both formal and "mock" tests for self-assessment purposes. In others, students failed to follow instructions correctly and repeated an attempt at an assessed test. A total of 44 students repeated one or more tests. The average mark achieved for the first test was 55.71%. This increased to 67.94% at the second attempt. The highest mark achieved in the first test was 88.00% and this too increased (to 96.67%) following the second attempt. In terms of the time taken to complete each test, a significant decrease was noted.
Comparison Of Examination and CBA Results
It was possible to cross-reference detailed data concerning performance upon examinations and computer-based assessments for a total of 83 students. The average examination mark for the whole of the sample was 42.93%. The multiple choice section of the paper accounted for an average of 24.93% of a student's final mark for the examination. In terms of the multiple choice section itself, the average mark was 33.65%. The average mark achieved on the computer-based test was considerably higher than for the multiple choice section of the examination, with students achieving an average score of 60.60%. It should also be noted that the lowest mark achieved in the computer-based test was 38%, meaning that no student "failed" the test.
It was possible to make a direct comparison between the number of correct answers recorded for the multiple choice examination paper and those recorded for the computer-based assessment. On average, students gave 13.5 more correct answers during the computer-based assessment than during the examination. If one applied this to the marks achieved for the examination, it would mean a typical increase of 6.75%, increasing the average mark for the examination to 49.68%. One effect of this would be to reduce the failure rate for the examination from approximately 25% to 18%. The graph below illustrates the distribution of marks for both assessments. Note that the CBA marks had a more even distribution with the majority of marks falling above the fiftieth percentile.
Figure 1. Comparison of scores achieved on computer-based assessment and the multiple-choice section of the examination paper.
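The arithmetic behind the adjusted examination mark quoted above can be reproduced as follows. The first three figures are taken from the text; the list of marks and the pass mark used to illustrate a change in failure rate are hypothetical, as is the function name.

```python
# Figures quoted in the text.
EXTRA_CORRECT_ANSWERS = 13.5  # mean additional correct answers on the CBA
MARK_INCREASE = 6.75          # corresponding increase in examination mark (%)
AVERAGE_EXAM_MARK = 42.93     # mean examination mark for the 83 students (%)

# Implied value of one additional correct answer, in percentage points.
mark_per_answer = MARK_INCREASE / EXTRA_CORRECT_ANSWERS

# Average examination mark after applying the increase.
adjusted_average = round(AVERAGE_EXAM_MARK + MARK_INCREASE, 2)


def failure_rate(marks, pass_mark):
    """Return the percentage of marks that fall below the pass mark."""
    failed = sum(1 for mark in marks if mark < pass_mark)
    return 100.0 * failed / len(marks)


# Hypothetical marks and pass mark, for illustration only.
marks = [28, 30, 33, 36, 41, 45, 52, 58]
PASS_MARK = 35
before = failure_rate(marks, PASS_MARK)
after = failure_rate([mark + MARK_INCREASE for mark in marks], PASS_MARK)
print(mark_per_answer, adjusted_average, before, after)
```

With the quoted figures, each additional correct answer is worth half a percentage point of the examination mark. The size of any change in failure rate depends on how many marks lie just below the pass mark.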
Several areas appear worthy of further discussion and investigation, including issues related to quality, factors that may influence performance and student perceptions regarding computer-based assessment. Comments made by students were recorded from questionnaires, informal interviews and review group sessions.
One possible explanation for the higher scores achieved in the computer-based tests carried out may be related to the observation that computer-based tests tend to be completed more quickly than conventional assessments. If performance increases and the time taken to complete the assessment decreases, then it can be argued that the assessment technique itself may be acting to enhance the ability of students to focus on questions and recall relevant information. One way in which we might describe this is by suggesting that computer-based assessment appears to improve task-focus. Several factors appear to be related to this, including the environment in which the assessment was taken and the nature of the software used.
In terms of the environment, it was found that many students felt more comfortable and relaxed when seated in a laboratory - as opposed to an examination hall. A typical comment was:
Not as much panic involved - to sit, read and click seems to take away that element of stress.
The majority of students also felt that a computer-based assessment was quicker to work through than a conventional multiple choice test. A large number of students suggested that removing the need to write down answers allowed them to complete the test more quickly. In turn, some felt that this helped to improve concentration and allowed them to focus more closely on the questions displayed on the screen:
...the computer allowed you to think rather than wasting time trying to write things down.
The programs used made students confirm their answers before recording them. It seems reasonable to suggest that the need for students to reconsider their answers may have helped to improve performance:
...because the computer checks you have decided on your answer, so you always re-read it to make sure your answer is the right one.
Although a detailed discussion of assessment in relation to teaching and learning is beyond the scope of this paper, it is important to consider whether or not computer-based assessment is a valid means of testing a good range of the cognitive skills higher education seeks to provide students with.
It can be argued that most current assessment packages are incapable of testing high-level cognitive skills. Jones (1990, p. 269) argues that programs with an essentially linear structure (such as computer-based assessment packages) tend to operate only at the knowledge level of Bloom's Taxonomy of Educational Objectives. Additionally, with reference to multiple-choice questions delivered via computer software, Laurillard (1993, p. 153) states that such programs provide "...extrinsic feedback with more teaching attached. It provides information, and will assist memorisation of a procedure, and this may well be sufficient in many cases, but it will not do much to develop conceptual understanding..." Thus, the majority of the assessment packages currently available appear to be incapable of testing or developing cognitive skills at the higher levels of analysis, synthesis and evaluation. Furthermore, it might be argued that other forms of assessment, such as essays, can be used to test more practical skills, such as the ability to structure a coherent argument.
However, it is suggested that some of the new packages now appearing on the market are capable of testing high-level skills. Whiting (1985, p. 101) indicates that computer-assisted learning software can allow students to achieve "mastery" of a given subject and that such techniques are particularly valuable when attempting to help students improve "...retention of knowledge, its application and evaluation..." (ibid). Such techniques have now begun to be applied to assessment, in order to measure knowledge and skills. A good example involves those programs that combine multimedia, hypermedia and adaptive testing techniques in order to improve the quality of the assessments created. There are several ways in which such programs can be used to test higher-level skills, such as the ability to construct a coherent argument. A good example involves the ability of some programs to deliver essay-style questions, assigning marks based on key words, phrases and statements recognised as forming part of an acceptable answer.
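The idea of assigning marks on the basis of recognised key words and phrases can be illustrated with a deliberately simple sketch. The marking scheme, the function name and the sample answer below are hypothetical; the packages referred to above may well work quite differently.

```python
import re


def mark_essay(answer, scheme, max_marks):
    """Award marks for each expected key word or phrase found in the answer.

    `scheme` maps a key word or phrase to the marks it is worth.
    Each phrase is counted once, and the total is capped at `max_marks`.
    """
    # Normalise case and whitespace so that phrases match across line breaks.
    text = re.sub(r"\s+", " ", answer.lower())
    total = 0
    for phrase, marks in scheme.items():
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        if re.search(pattern, text):
            total += marks
    return min(total, max_marks)


# Hypothetical marking scheme and student answer.
scheme = {"primary key": 2, "unique": 1, "foreign key": 2}
answer = "A primary key is a unique identifier for each record."
print(mark_essay(answer, scheme, max_marks=4))
```

A scheme of this kind records only whether the expected terms appear in the answer, so the choice of key words and phrases largely determines what is actually being assessed.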
Pre-test attitudinal measurements suggested that relatively large proportions of students held somewhat neutral opinions toward computer-based assessment. In general, these opinions appeared to change in favour of computer-based assessment following the tests themselves. From this, it appears that students find computer-based tests less threatening than conventional examinations. The perception that computer-based testing provides an impartial means of assessing learning and understanding is also seen as a genuine advantage by the majority of students. Additionally, students feel that CBA is more accurate and tests a wider range of skills than a conventional examination.
It can be argued that formative computer-based assessments may help to improve the long-term recall of key concepts. Whilst this must be considered to be only a tentative conclusion, some of the data collected seems to support this. Differences in the scores achieved by students repeating formative computer-based assessments showed an average increase of approximately 12%. Those who took formative computer-based assessments prior to an examination also achieved marginally higher scores than other groups. Together with data collected through observation, group discussions and questionnaire items, there appears to be sufficient evidence to endorse this view.
As mentioned previously, no significant differences in performance were recorded between those students with previous experience of information technology and those without. Although not directly related to computer-based assessment, there is evidence to suggest that those with a high aptitude in terms of information technology tend to perform at a higher level when using computer-based learning packages than those with lower aptitudes (Rasheed & Cohen, 1990, p. 36; Tsai, 1989, p. 11). Despite this, the findings of this study suggest that, in relation to computer-based assessment techniques, students with experience of computers appear to have no advantage over their peers.
Much of the material presented within this paper is intended to illustrate some of the benefits that might be associated with computer-based assessment techniques. However, it should also be clear that computer-based assessment must not be seen as a "quick fix" for problems such as rising student numbers. If one accepts that current systems test only a relatively narrow range of skills, then the hasty implementation of CBA systems will result in a distorted and inaccurate view of student performance. In turn, this may serve to reduce the overall quality of courses and - ultimately - detract from the student learning experience. On the other hand, if one adopts a considered and methodical approach to computer-based assessment, positive benefits might include both increased efficiency and quality. Thus, efficiency gains from an organisational perspective could be attained from the ability to deliver material to a large student cohort, with facilities such as automated marking of responses. Increased quality could be attained from the use of computerised formative assessment as a complement to traditional teaching methods and the considered use of computerised summative assessment.
Implications For Further Research
The data collected suggest that computer-based assessment methods result in students achieving higher scores than might be obtained via conventional examinations. From this study, it appears the introduction of such techniques might result in apparent performance improvements of approximately 6.75%. For the courses involved, one possible result of this effect is that the number of students failing the assessment might be reduced from approximately 25% to 18%. In terms of the computer-based test itself, it was noted that no students actually failed to achieve a pass mark for the assessment. In other words, the use of computer-based testing as the sole method of assessment might result in sets of marks that are uncharacteristically high. This might also suggest that computer-based assessment has few, if any, adverse effects on student performance.
Clearly, this view raises a number of implications in terms of the quality of the assessment process. One might suggest that a simple way of bringing computer-based assessment results in line with those achieved via other methods might be to increase the difficulty of the computer-based assessment. This might be achieved by increasing the length of the assessment or selecting more challenging questions. However, although methods exist for grading the difficulty of questions, this would require a substantial investment from institutions in terms of staff time and labour (Stephens, 1994, p. 12). Additionally, there seems to be little evidence to suggest that merely increasing the length of the assessment would have the desired effect. It is felt that this area is worthy of further investigation, particularly since there appear to be few publications that address the issue.
Arnold, S., Barr, N., & Donnelly, P. (1994). Constructing and implementing multimedia teaching packages. Glasgow: University of Glasgow (TLTP).
Atkins, M., Beattie, J., & Dockrell, W. (1993). Assessment issues in higher education. Employment Department.
Bell, C., & Newby, H. (1980). Doing sociological research (2nd ed.). London: George Allen & Unwin Ltd.
Berdie, D., Anderson, J., & Niebuhr, M. (1986). Questionnaires: Design and use (2nd ed.). New York: Scarecrow Press.
Buzan, T. (1993). The mind map book. London: BBC Publications.
Converse, J., & Presser, S. (1986). Survey questions: Handcrafting the standardized questionnaire. Thousand Oaks, CA: Sage.
Craig, J. (1993). A systematic approach to improving in-house computer literacy. Journal of Educational Technology Systems, 21(1), 51-70.
Cunningham, D. (1991). Assessing constructions and constructing assessments: A dialogue. Educational Technology, 31(5), 13-17.
Dempster, J. (1994). Question mark designer for windows. Active Learning, 1, 47-50.
Draper, S., Brown, M., & Edgerton, E. (1994). Observing and measuring the performance of educational technology. Glasgow: University of Glasgow (TLTP).
Field, B., & Chandler, E. (May, 1993). Portfolio assessment. Labyrinth.
Gibbs, G., Habeshaw, S., & Habeshaw, T. (1988). 53 interesting ways to assess your students. Bristol: TES.
Gronlund, N. (1968). Constructing achievement tests. Englewood Cliffs, NJ: Prentice Hall.
Honey, P., & Mumford, A. (1992). The manual of learning styles (3rd ed.). Maidenhead.
Jones, T. (1990). Towards a typology of educational uses of hypermedia. Lecture Notes in Computer Science, 438, 265-276.
Laurillard, D. (1993). Rethinking university teaching: A framework for the effective use of educational technology. London: Routledge.
Oppenheim, A. (1966). Questionnaire design and attitude measurement. London: Heinemann Educational Books.
Rasheed, K. S., & Cohen, P. A. (1990). An evaluation of computer-based instruction versus printed study guides in a dental material course. Journal of Dental Hygiene, 64(1), 36-39.
Small, R. V., & Grabowski, B. L. (1992). An exploratory study of information-seeking behaviours and learning with hypermedia information systems. Journal of Educational Multimedia and Hypermedia, 1(4), 445-464.
Stephens, D. (1994). Using computer assisted assessment: time saver or sophisticated distraction? Active Learning, 1, 11-13.
Tsai, C. (1989). Hypertext - technology, applications and research issues. Journal of Educational Technology Systems, 17(1), 3-14.
Whiting, J. (1985). The use of a computer tutorial as a replacement for human tuition in a mastery learning strategy. Computers and Education, 9(2), 101-109.
Copyright © 1999. All rights reserved.
Last Updated on 11 July 1999. Archived 5 May 2007.
For additional information, contact IJET@lists.ed.uiuc.edu