Current Issues in Education, Volume 4. Arizona State University College of Education




Citation Information

Valenti, S., Cucchiarelli, A. and Panti, M. (2001, May 17). A framework for the evaluation of test management systems. Current Issues in Education [On-line], 4 (6). Available: http://cie.ed.asu.edu/volume4/number6/.


A Framework for the Evaluation of Test Management Systems

Salvatore Valenti
Alessandro Cucchiarelli
Maurizio Panti
University of Ancona



Abstract

It is now a well-established and widely accepted concept that assessment plays a central role in the educational process. Although a large number of commercial and free applications exist dealing with computer assisted assessment, there seems to be a lack of metrics for educational teams wishing to select the most appropriate assessment tool for their environment. This paper tries to remedy this situation by suggesting some guidelines that may be used to evaluate a Test Management System, one of the building blocks of an automated assessment system.



Introduction

It is now a well-established and widely accepted concept that assessment plays a central role in the educational process. The search for assessment methods able to reach an objective judgment of students' knowledge is a crucial goal for both teachers and educational institutions. The teacher looks for homogeneous treatment of the students and for useful hints on educational activity in terms of clarity, completeness, and effectiveness. The educational institution tries to log the teacher's activity and the quality of the service offered to the students. Moreover, the growing mobility of the workforce requires educational institutions to comply with international standards for crediting courses. Finally, the diffusion of computer-based distance learning forces them to cope with the problems posed by the assessment process.

Dealing with large classes raises a number of problems both from the lecturer's and the students' point of view. Teaching large classes is often seen as a difficult and unwelcome assignment. The lecturer is able to know only a limited number of students and since the lectures follow one another in a short interval of time, only a few of the students with questions about the material can be helped. The same considerations are true for the support that may be given in office hours. Furthermore, exams cannot be taken by all students at the same time, due to lack of resources. This often results in exams being graded by different people with different grading styles.

Freshmen, for their part, often have problems adjusting from the teacher-led form of education used in high schools. Many of them need support as they make the transition to a learning style in which they have to take great responsibility for their own education. In particular, they are often worried by the way exams are carried out, since the only opportunity to verify the results of the learning process is one final examination at the end of the course. They appreciate frequent feedback on their progress and reassurance that any misconceptions may be identified and remedied. However, the decrease in resources available for tutoring and the increase in class sizes often lead to incomplete feedback to students.

In this scenario, Computer Assisted Assessment (CAA) seems to offer a promising alternative. It is able to automatically capture the information required by the actors of the educational process (teachers, students and institutions) for large and/or distributed classrooms. Interest in developing CAA systems has grown in recent years, thanks to the potential market for their applications. Many commercial products, as well as freeware and shareware tools, are the result of studies and research in this field by companies and public institutions. For an updated survey of course and test delivery/management systems for distance learning see Looms (2000). This site contains descriptions of a large number of products and is continuously updated with new items.

The question that originated our work was "Are there any criteria that may be useful to an educational team wishing to select the most appropriate assessment tool for their environment?" According to our findings, the answer seems to be negative. We started our research with a survey of the literature available on the web, searching with the most common search engines. Then we went through a very long list of sites maintaining links to educational resources, such as the Educational Resources Information Center (ERIC® Clearinghouse on Assessment and Evaluation, http://ericae.net/) and TECFA (the academic unit active in the field of educational technology of the School of Psychology and Education of the University of Geneva, Switzerland; http://agora.unige.ch/tecfa/edutech/welcome_frame.html). After a long and laborious search we came up with a list of only two papers devoted to such an important topic (Freemont & Jones, 1994; Gibson et al., 1995).

As we will show in the next section, neither of these two papers satisfies our need to identify a set of metrics for evaluating the quality of a CAA system. First of all, it should be pointed out that a CAA system is composed of at least two applications: a Test Management System (TMS) and a Test Delivery System (TDS). The former allows the instructor to create questions and tests and to evaluate the tests, while the latter enables the administration of exams and the collection of tests by interfacing with students. While the current market trend seems to point toward the development of integrated CAA systems, the number of companies distributing stand-alone TMSs is high (Course Test Manager, 2000; FastTest, 2000; LXR!TEST, 2000; MicroCat, 2000; The Question Bank, 2000). This is mainly due to the great number of institutions that still use traditional assessment approaches. In fact, although there is a genuine interest among most teaching staff in the benefits that CAA may bring, there is considerable effort involved in adapting from existing assessment methods and not all teaching staff are convinced that the benefits fully justify the effort (Raine, 1999). Therefore the adoption of a stand-alone TMS may be a good compromise for institutions wishing to introduce CAA procedures without making the large investments needed in both IT infrastructure and organizational support. An interesting example of this stand-alone adoption has recently been reported by UK academic institutions adopting TMSs to deliver exams via OMR sheets (Raine, 1999; Sutcliffe et al., 1999).

In this paper we will provide some criteria for the evaluation of stand-alone TMSs, while we are still working on the identification of a similar framework for analyzing existing TDS applications. Specifically, in section two we will start by discussing the criteria identified by Gibson et al. (1995), which are far more structured and general than those presented in the paper by Freemont and Jones (1994). These criteria have been developed considering an assessment application as a single entity, and are focused both on the assessment and educational capabilities of such a tool. In section three we will discuss a more detailed approach to the selection of a TMS based on current research in the field of commercial off-the-shelf systems evaluation (Hansen, 1999). Some directions for further research will be discussed in the last section of the paper.



An Existing Framework

Gibson et al. (1995) describe an approach for evaluating a Web-based testing and evaluation system. They identify six evaluation criteria: testing, tracking, grading, tutorial building, implementation and security issues. In this section we will briefly discuss each of these criteria. The interested reader should refer to the Gibson et al. (1995) paper for the results of their comparison of four tools: Mklesson, Eval, Tutorial Gateway, and the package built on Tutorial Gateway by the Open Learning Agency of Australia.



Testing

In order to evaluate the testing capabilities of a Web/Computer assisted assessment system, the following sub-criteria have been adopted: the classes of questions allowed, the existence of tools for providing help and hints to the student, the possibility of retrying the answer to a question, the feedback provided, and the use of multimedia as an integral part of the testing system. In the remainder of this section we will briefly examine these points.

Types of questions
Concern is focused on the classes of questions supported by the system under consideration. Some examples are multiple choice, true-false, simple numeric and simulations.

Help and Hints
This item concerns the capability of the system to provide directions about the completion of the test and hints that usually are related to the content of the questions. This may be considered a measure of the ease of use of the application from the student point of view.

Retries
A retry is another attempt to answer a question. The ability to retry may be of great importance for self-assessment, since it may help improve the student's knowledge and limit the need to provide feedback and/or tutoring.

Feedback
Feedback provides information to the student once the answer to a given question has been entered. Feedback is also important for self-assessment, since it may be used to correct misconceptions or to deliver additional material to extend the coverage of the topic assessed by the question.

Multimedia
The use of questions incorporating multimedia, for instance sound and video clips, or images, may improve the level of knowledge evaluation. This aspect may be of great importance, for example in language assessment, when comprehension can be assessed by referring to clips of a lecture or movie. The use of multimedia can raise issues related to portability and interoperability since it may require special hardware and software, both for the server delivering the questions and for the client used by the students. Furthermore, it may raise the costs for the adopted solution.



Tracking

Tracking is the ability of the system to remember where the student has traveled within a lesson and to record her performance on test questions and answers. This allows the instructor to follow the specific pattern of progress and performance of each student, and to fine-tune assessment to each student. Gibson et al. (1995) suggest in their paper that by tracking the student's progress, it is possible to provide dynamic guidance on how best to proceed through the lesson.



Grading

Any software for assessment should be able to compute student grades. Furthermore, grades must be delivered as feedback to the course coordinator, to the instructor, and to the students. Each of these categories of users needs a different kind of feedback on the grades associated with a test. For instance, a student needs to know where she stands with respect to other students and the class average, as well as her own individual and cumulative grades. This need for information raises obvious privacy concerns that may be addressed through the security facilities provided with the assessment tool.



Tutorial Building

This criterion refers to the existence of some facility for the automatic inclusion of tutorials in the testing system.



Implementation Issues

From the point of view of implementation, Gibson et al. (1995) consider only two main issues: ease of use and platform. Ease of use refers to how easy it is for the author of the courseware and for the instructor to use the testing system in implementing assessment. An important point to highlight is that Web-based assessment tools assume the lecturer possesses knowledge of HTML.

Among the issues related to Platform are server functionality, availability of viewers, ability of the hardware to support multimedia (like sound and video) and the requirements of the networking facilities.



Security

There is a wide range of security issues related to the use of both Computer and Web-based assessment systems. These issues include the security of the test material, of the HTML code that implements testing, and of the identification of the users (both instructors and students).



Extending the scope of Gibson's Framework

Although the literature on guidelines to support the selection of Test Management Systems seems to be very scarce, there is a lot of research in the field of Software Engineering on the definition of general criteria that may be used to evaluate software systems (Anderson, 1989; Ares Casal et al., 1998; Henderson et al., 1995; Nikoukaran et al., 1999; Vlahavas et al., 1999).

A relevant effort in this field is by the International Organization for Standardization (ISO), which defined the standard for "Information Technology - Software quality characteristics and sub-characteristics" (ISO, 1991). Among the metrics defined to evaluate software, ISO identifies Usability, Suitability, Security and Interoperability. Usability is defined as the effort needed for the use of software by a stated or implied set of users. Suitability is defined as the degree of presence of a set of functions for the specific task. Security is the degree to which the software is able to prevent unauthorized access, whether deliberate or accidental. Finally, Interoperability is the ability of the system to interact with other existing software (Stamelos et al., in press).

None of the metrics discussed above can be measured directly; each must be defined in terms of objective features that can be assessed by the user. These features should be identified taking into account the context for the evaluation: that is, a description of the target system and the environment into which it will be deployed. To buy a car, the context is the customer's situation, i.e. financial resources, driving patterns, aesthetic tastes, and so on. For an organization, the context includes the organization's mission, its structure, and its existing procedures for the tasks that will be affected by the target system. From the context the project personnel will adduce various, possibly ill-defined, qualities that the target system should exhibit (Hansen, 1999).

These ISO criteria can be applied to the analysis of TMSs. A TMS is a tool that should provide the instructor with an easy-to-use interface, the ability to create questions and to assemble them into tests, and the possibility of grading the tests and of making some statistical evaluations of the results. Therefore, we decided to adopt the Interface as an indirect measure of Usability, the Question and Test Management capabilities as an indirect measure of Suitability, and to discuss the measures of Security and Interoperability under the common umbrella of Implementation Issues. Furthermore, each of these measures has been defined in terms of criteria, so that the Question Management metric can be evaluated in terms of "Types of Questions" and "Question Structure", while "Test Management" can be analyzed in terms of the way in which exams can be prepared (Exam Preparation), of the tools available to the teacher to evaluate both the exam (Test evaluation tools) and the responses produced by the students (Analysis of responses) and of the existence of Test banks that may be used to simplify and better structure the task of building exams. The various metrics adopted are listed in Table 1.

Table 1 - Metrics for the evaluation of a TMS

Metric                  Criteria
Interface               Friendly GUI
                        Ease of Editing
Question Management     Types of questions
                        Question Structure
Test Management         Exam preparation
                        Test evaluation tools
                        Analysis of responses
                        Test banks
Implementation Issues   Security
                        Communication

With respect to the framework proposed by Gibson, we have omitted the "Tracking" and the "Tutorial Building" metrics, and the "Help and Hints" and "Retries" criteria of the "Testing" metric. As noted earlier in this paper, "Tracking" allows the instructor to follow the specific pattern of progress and performance of each student, and to fine-tune self-assessment by each student while allowing dynamic guidance on how best to proceed through the lesson. Therefore we believe that this metric may be used to select educational software falling in the wider range of products that go under the name of computer-based learning and teaching systems, such as WebCT or TopClass (Hazari, 1999). For this reason, tracking has not been taken into account in our framework. Furthermore, we believe that the abilities of the system to provide hints and retries may be used as criteria to select a TDS, which is outside the scope of this paper. This is the reason why we will not discuss these issues any further.



Interface

The evaluation of the interface is a qualifying aspect for the evaluation of a Computer Assisted Assessment system and obviously for a TMS, given that most teachers involved in the use of a TMS do not possess a degree in computer science, and are not interested in acquiring skills in this field.

There is a substantial body of literature on the criteria to be adopted in order to evaluate a Graphical User Interface (GUI) from the point of view of usability (see for instance Nielsen & Molich, 1990 and Gilham et al., 1995). As Nielsen & Molich (1990) simply proposed, the interface must be easy to learn, efficient to use, easy to remember, error free and subjectively pleasing. The set of criteria that may be adopted to evaluate the usability of a GUI is summarized in the following list:

  • use simple and natural dialogue
  • speak the users' language
  • be consistent
  • provide feedback
  • provide clearly marked exits
  • provide shortcuts
  • have good error messages
  • prevent errors.

Furthermore, a TMS should be designed so that questions and tests can be written in a simple and easy way. With the term "ease of editing" we refer to the ease with which a TMS allows the construction of questions and tests. This ability can be enhanced by a GUI that provides standard features such as a WYSIWYG editor, a clipboard, and cut-and-paste and undo operations. At the same time, the possibility to include text and graphic images for diagrams, and to properly display mathematical, chemical or other symbols, may be of great importance for the instructor. Moreover, spelling and grammar checking may greatly improve the ease of editing of a TMS by helping the instructor to build well-formed questions. Ad hoc dictionaries tailored to the domain to which the questions are related may further improve the ease of editing.

An online help system that provides some sort of tutorial on how to build questions and to prepare exams may greatly improve the ease of editing of the TMS. FastTest (2000) by Assessment System Co. represents a good example of a system showing such features. Finally, the "programming" abilities required of the instructor greatly affect usability. The usability of the system may be dramatically reduced whenever the tutor is required to possess HTML, XML, Perl/CGI, Java or JavaScript knowledge.



Question and Test Management

Testing represents the vital part of any assessment tool. We suggest adopting two main metrics to evaluate the characteristics of a TMS: Question Management and Test Management. The former relates to all aspects of authoring the questions, while the latter concerns the assembling of questions into exams and the evaluation of the results.

Question management

Types of questions. The most common types of questions, along with a simple definition for each class, are summarized in Table 2. An important point is tied to the learning objective that must be assessed through the questions. Each of the categories listed in the table may be used to evaluate different types of knowledge. Therefore the selection of a TMS may be driven by the ability to be assessed.

On the other hand, it must be noted that many universities are adopting the same tool across all courses in order to reduce costs and to allow students to interact in the same way in each phase of their evaluation process. This obviously imposes the requirement of selecting a TMS that provides the widest range of question types available, since different learning outcomes may be assessed within different courses.

Table 2 - Summary of question types

Question Type                     Meaning
Multiple choice                   Questions where the user is asked to select the correct/best answer from a list of alternatives.
Multiple response                 Questions where the user is asked to select a number of correct answers from a list.
True/false                        Questions where the user is asked to evaluate the truthfulness of a statement.
Selection/association             Questions where the user answers by matching items from two related lists.
Short answer                      Questions that may be answered by entering a word, a short phrase or a number.
Visual identification/hot spot    Questions that may be answered by moving a marker onto a part of the screen to identify a hot spot on an image.
Essay                             Questions that may be answered by a free text composed of any number of paragraphs.

While almost all commercial TMSs allow the construction of Multiple Choice Questions (MCQ), very few of them implement Hot Spot or Selection/Association type questions. A still smaller subset of TMSs claim to implement Essay type questions (C-Quest, 2000; InQsit, 2000). With respect to this last point, it should be noted that although there are some research efforts on the automatic scoring of essay type questions, mainly in the area of natural language understanding (Burstein & Chodorow, 1999; Foltz, Laham & Landauer, 1999), the scoring of this class of questions relies on the manual intervention of the teacher in the commercial products currently on the market.

Question structure. It must be noted, at this point, that there is a wide variety of information associated with a test question. We can distinguish among information that is specific to the question type, information that is tied to the educational task to be assessed through the question, and information that is related to the scoring policy adopted (and therefore dependent on the question type). Not all of this information is made available by existing TMSs. Therefore this is an interesting metric for identifying the system that best matches the educational needs to be assessed.

As an example of the information specific to question type, we will briefly discuss Multiple Choice Questions. This class of questions is organized into three parts: a stem, a key and some distractors. The problem to which the student should give an answer is known as the stem. The suggested solutions may include words, numbers, symbols or phrases and are called alternatives, choices or options. The student is asked to read the stem and to select the alternative that is believed to be correct. The correct alternative is the key, whilst the remaining choices are called distractors, since their intended function is to distract students from the correct one.

Therefore, to evaluate the question structure of a TMS, the number of different choices allowed and the format in which they are presented (radio vs. push buttons) must be taken into account. The maximum number of distractors allowed varies widely among TMSs, ranging from 3 to "no reasonable limit" for Perception (Perception, 2000). This could be used as a metric for the evaluation of a TMS, although many authors suggest that four-choice items are good enough to reduce the chance of guessing the result while containing the effort of devising an equivalent number of genuinely plausible distractors (usually the fourth distractor in five-choice questions tends to be difficult to devise and may be weaker than the others).

The educational task to be assessed represents another important type of information that should be associated with questions. In fact, if the test is used to evaluate the instructional process, fields should be provided to store the source of each question, the chapter to which it is related, the topic covered and the author of the question itself. Furthermore, a teacher may wish to assess a topic at different cognitive levels, such as those defined in Bloom's taxonomy (Bloom, 1956). Therefore, for each question an additional field for storing such information should be defined. It is worth noting that many commercial TMSs allow the creation of user-defined fields that may be used to store such information. Therefore, a good selection criterion is the ability to access these additional fields in order to perform test evaluation procedures.
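The information attached to a multiple choice question can be pictured as a simple record. The following is a minimal sketch only: the class name `MultipleChoiceQuestion` and all field names are hypothetical and do not correspond to the data model of any TMS cited in this paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MultipleChoiceQuestion:
    """Hypothetical record for one MCQ; all field names are illustrative."""
    stem: str                  # the problem posed to the student
    key: str                   # the correct alternative
    distractors: List[str]     # the incorrect alternatives
    # Information tied to the educational task being assessed.
    source: str = ""
    chapter: str = ""
    topic: str = ""
    author: str = ""
    cognitive_level: str = ""  # for instance, a level of Bloom's taxonomy
    # User-defined fields of the kind offered by many commercial TMSs.
    extra: Dict[str, str] = field(default_factory=dict)

    def alternatives(self) -> List[str]:
        """All choices shown to the student: the key plus the distractors."""
        return [self.key] + self.distractors


q = MultipleChoiceQuestion(
    stem="Which data structure gives first-in, first-out access?",
    key="Queue",
    distractors=["Stack", "Heap", "Hash table"],
    topic="Data structures",
    cognitive_level="Knowledge",
)
print(q.alternatives())
```

Keeping the educational fields on the question itself is what makes it possible to filter or group questions by topic or cognitive level when a test is later evaluated.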

Each class of available questions may support different scoring schemes. As an example, we will briefly discuss two marking philosophies for MCQs. The simplest way to score an MCQ is to assign 1 to the correct answer and 0 to every other option. Under this strategy, a student who makes blind guesses or gives random responses to all questions can expect a score equal to the number of questions divided by the number of alternatives per question: a student taking a test of 100 MCQs with 4 alternatives each can expect a score of about 25. Another approach, called negative marking, assigns 1 for the correct response, 0 for no response and -1/(n-1) for an incorrect response, where n is the number of alternatives. With this approach, a student who knows nothing, and therefore makes completely blind guesses, can expect the plausible score of "about" zero. Obviously, a TMS should allow both of these marking schemes.
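The arithmetic behind the two marking schemes can be checked with a short sketch. It takes n as the number of alternatives offered per question and assumes that a blind guess picks each alternative with equal probability; the function names are illustrative and not taken from any TMS.

```python
from fractions import Fraction


def simple_mark(response, key):
    """1 for the correct answer, 0 otherwise (including no response)."""
    return Fraction(1) if response == key else Fraction(0)


def negative_mark(response, key, n_options):
    """1 if correct, 0 if unanswered (None), -1/(n-1) if incorrect."""
    if response is None:
        return Fraction(0)
    if response == key:
        return Fraction(1)
    return Fraction(-1, n_options - 1)


def expected_blind_guess_score(n_questions, n_options, negative=False):
    """Expected total score when every answer is a uniform random guess."""
    p_correct = Fraction(1, n_options)
    penalty = Fraction(-1, n_options - 1) if negative else Fraction(0)
    per_question = p_correct * 1 + (1 - p_correct) * penalty
    return n_questions * per_question


print(expected_blind_guess_score(100, 4))                 # 25
print(expected_blind_guess_score(100, 4, negative=True))  # 0
```

With four alternatives the penalty is -1/3, so the expected gain of 1/4 per question is exactly cancelled by the expected loss of 3/4 × 1/3. Any individual student's score will still scatter around these expected values.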

For Short Answer Questions, the scoring scheme could either take into account or ignore letter case. Furthermore, it could prove useful to find a phrase inside an answer rather than considering the whole answer. The TMS should allow both features. For Hotspot Questions, it should be possible to associate different scores with different areas of the image containing the information to be identified. It would serve little purpose to discuss here all the possible marking schemes for all the available classes of questions, but we are preparing a checklist that will support the software evaluation phase that we are planning (see final remarks).
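The options described for Short Answer and Hotspot Questions can be sketched as follows. This is an illustration under stated assumptions, not the behaviour of any particular TMS: the parameter names `case_sensitive` and `match_phrase` are hypothetical, and hot spots are assumed to be rectangles given as (left, top, right, bottom) pixel coordinates.

```python
def score_short_answer(response, accepted, case_sensitive=False, match_phrase=False):
    """Return 1 if the response matches the accepted answer, else 0.

    case_sensitive: when False, letter case is ignored.
    match_phrase:   when True, the accepted text only has to appear
                    somewhere inside the response.
    """
    response = response.strip()
    accepted = accepted.strip()
    if not case_sensitive:
        response = response.lower()
        accepted = accepted.lower()
    if match_phrase:
        return 1 if accepted in response else 0
    return 1 if response == accepted else 0


def score_hotspot(x, y, regions):
    """Score a marker position against rectangular regions.

    regions is a list of ((left, top, right, bottom), score) pairs;
    the first region containing the point determines the score.
    """
    for (left, top, right, bottom), score in regions:
        if left <= x <= right and top <= y <= bottom:
            return score
    return 0


regions = [((10, 10, 20, 20), 2), ((0, 0, 100, 100), 1)]
print(score_short_answer("Paris", "paris"))
print(score_hotspot(15, 15, regions))
```

Listing the smaller, higher-scoring region first lets a precise identification earn more than an approximate one.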

The use of multimedia, including sound and video clips or animated images, may improve the level of comprehension of a question. As stated earlier in this paper, the use of multimedia may raise issues related to the portability and the interoperability of the application. These issues may not represent a problem whenever a Web-based assessment approach is selected, since the WWW is inherently a multimedia environment. In this case, the choice of standard plug-ins for the available browsers may reduce the risks to portability and interoperability. Since most plug-ins used to grant access to multimedia sources are currently free of charge, their use need not add to costs.

A question should provide feedback that may contain the score and/or comments concerning the users' performance. The feedback could be presented either after any single question (this solution being preferable for self-evaluation tests) or at the end of the test, based on the overall performance.

Test Management

This metric is concerned with the ability to build up a test from a set of questions and to deliver it. Among the criteria that may be used to qualify a TMS with respect to test management, we suggest that the preparation of exams from items, the creation of test banks, the ability to evaluate tests, and the method of student evaluation are essential. All of these points will be further discussed in the next paragraphs.

Preparation of exams. Once questions have been defined, they must be selected and organized into a test. Test preparation is a non-trivial task, since it may require the ability to "manually" choose the questions from their base, or to construct the exam automatically through a random selection approach. This last point means that the tool should allow tests to be compiled by selecting questions with respect to educational objectives, keywords, contents, statistical value and so on.
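Automatic construction by random selection under criteria can be sketched as below. The dictionary keys "id", "topic" and "keywords" and the function `assemble_test` are hypothetical names chosen for the illustration; a real TMS would expose its own fields and selection options.

```python
import random


def assemble_test(bank, n_items, topic=None, keyword=None, seed=None):
    """Randomly pick n_items questions that satisfy the given filters.

    bank is a list of dicts with hypothetical keys "id", "topic" and
    "keywords"; a filter left as None is not applied.
    """
    candidates = [
        q for q in bank
        if (topic is None or q["topic"] == topic)
        and (keyword is None or keyword in q["keywords"])
    ]
    if len(candidates) < n_items:
        raise ValueError("not enough questions match the selection criteria")
    rng = random.Random(seed)  # a seed makes the selection reproducible
    return rng.sample(candidates, n_items)


bank = [
    {"id": 1, "topic": "sorting", "keywords": ["quicksort"]},
    {"id": 2, "topic": "sorting", "keywords": ["mergesort"]},
    {"id": 3, "topic": "graphs", "keywords": ["dfs"]},
    {"id": 4, "topic": "sorting", "keywords": ["heapsort"]},
]
exam = assemble_test(bank, 2, topic="sorting", seed=1)
print([q["id"] for q in exam])
```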

Moreover, it should be possible to create multiple forms by rearranging questions, either by some instructor intervention or automatically, in order to discourage cheating. Tools that provide the ability to randomize the order of answers for a question may further discourage cheating.
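Producing multiple forms amounts to shuffling both the question order and the order of the alternatives within each question. The sketch below assumes each question is a dictionary with the hypothetical keys "stem", "key" and "options", and stores the key by value so that it stays valid after shuffling.

```python
import random


def make_form(questions, seed):
    """Return a shuffled copy of the test for one form.

    Both the question order and the option order are randomized;
    the original list and its option lists are left untouched.
    """
    rng = random.Random(seed)
    form = []
    for q in questions:
        options = list(q["options"])  # copy before shuffling
        rng.shuffle(options)
        form.append({"stem": q["stem"], "key": q["key"], "options": options})
    rng.shuffle(form)
    return form


questions = [
    {"stem": "2 + 2 = ?", "key": "4", "options": ["3", "4", "5", "6"]},
    {"stem": "3 * 3 = ?", "key": "9", "options": ["6", "9", "12", "8"]},
    {"stem": "10 / 2 = ?", "key": "5", "options": ["2", "4", "5", "20"]},
]
form_a = make_form(questions, seed=1)
form_b = make_form(questions, seed=2)
```

Using a distinct seed per form keeps each form reproducible, which matters when a form must be regenerated for grading.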

The availability of facilities for building adaptive tests may be a plus for the selection of the tool. Adaptive testing is used to allow the student to move forward or backwards in a test depending on what has happened so far. This is a very powerful feature, since it allows the test to react "intelligently" to what the student does. Very few commercial TMSs provide adaptive testing features (FastTest, 2000; Perception, 2000) and usually the construction of adaptive tests is not very simple from the instructor's point of view.
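One very simple way to make a test react to the student is a staircase rule: move to a harder level after a correct answer and to an easier one after a mistake. This rule is an assumption made for the sketch below and is not the method used by the products cited above; the five difficulty levels are likewise hypothetical.

```python
def next_level(level, was_correct, lowest=1, highest=5):
    """Move one difficulty level up after a correct answer, down otherwise."""
    step = 1 if was_correct else -1
    return max(lowest, min(highest, level + step))


def run_adaptive(answers, start=3):
    """Return the sequence of levels visited for a list of True/False outcomes."""
    levels = [start]
    for was_correct in answers:
        levels.append(next_level(levels[-1], was_correct))
    return levels


print(run_adaptive([True, True, True, False]))  # [3, 4, 5, 5, 4]
```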

Creation of test banks. Questions can be assembled directly in a test or in a bank that is then referenced by the test. Test banks may be very useful in a number of ways, since organizing questions related to the same topic in a bank may simplify both the random selection of questions and the evaluation of the understanding of the topic itself through statistical measures. Therefore, a TMS should provide the possibility to create multiple banks with an unlimited number of items in each bank, and the ability to import existing questions and corresponding data from existing banks. If the same bank can be shared by different tests, it is possible to reuse the same material, saving time and effort. Obviously, different instructors may share the same questions, thus obtaining synergies and homogenizing the way in which the same topic is assessed in different courses. Furthermore, building well-formed questions is a very difficult task. The possibility of accessing question banks provided by well-known scientists or by professional organizations represents a great value for the educational community. As an example, a number of Student Chapters of the Association for Computing Machinery are collecting test banks related to computer science (ACM-SC, 1999).

Ability to evaluate tests. Tests should be evaluated both before and after administration (Gronlund, 1985). Evaluating a test before administration means analyzing it in terms of the adequacy of the test plan, the test items, and the test format and directions. Analysis of the test plan involves asking:

  • Does the test plan adequately describe the instructional objectives, and the contents to be measured?
  • Does the test plan clearly indicate the relative emphasis to be given to each objective and each content area?

Analysis of test items before they are administered involves the evaluation of each item in terms of appropriateness, relevance, conciseness, ideal difficulty, correctness, technical soundness, cultural fairness, independence and sample adequacy. Finally, analysis of test format involves asking:

  • Are the test items of the same type grouped together in the test or within sections of the test?
  • Are the correct answers distributed in such a way that there is no detectable pattern?
  • Is the test material well-spaced, legible, and free of typographical errors?

Evaluating a test after administration helps to verify whether it functioned as intended: whether it adequately discriminated between low and high achievers, and whether the test items were of appropriate difficulty and free of irrelevant clues and other defects (so, for instance, whether all distractors behaved effectively in MCQs).

Method of student evaluation. Once questions have been designed and the test delivered, it is of fundamental importance to obtain an assessment of each student, both individually and with respect to the class. We have already discussed the importance of providing the instructor with tools for the assessment of the evaluation process. To attain such results, the TMS should provide at least the following information to the instructor:

  • test performance report for each individual examinee, with percentage of correct answers and relative ranks;
  • individual response summary by item, with an error report that lists wrong vs. correct responses;
  • class test performance with distributions, means and deviations;
  • item statistics and analysis with indicators that may be useful to evaluate the questions in terms of reliability, discrimination, difficulty and so on.

Although the system may provide some numerical results to measure the test, it is ultimately up to the instructor to evaluate the results and to identify strategies and policies for modifying the educational process in order to improve the understanding of the topics that students have misunderstood.
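
The individual and class reports listed above can be sketched with the Python standard library. The function name, the tie-handling rule for ranks and the sample scores are assumptions made for the example, not features of any particular TMS.

```python
from statistics import mean, pstdev


def class_report(scores, total_items):
    """Build per-examinee and class-level figures.

    scores:      dict mapping examinee id -> number of correct answers
    total_items: number of items in the test
    """
    ordered = sorted(scores.values(), reverse=True)
    individual = {}
    for examinee, correct in scores.items():
        individual[examinee] = {
            "percent_correct": 100.0 * correct / total_items,
            # Tied examinees share the best rank (assumed convention).
            "rank": ordered.index(correct) + 1,
        }
    summary = {
        "mean": mean(scores.values()),
        # Population standard deviation of the raw scores.
        "std_dev": pstdev(scores.values()),
    }
    return individual, summary


# Made-up scores on a ten-item test.
scores = {"ann": 8, "bob": 6, "cy": 8, "dee": 4}
individual, summary = class_report(scores, total_items=10)
```

Such figures only summarize the results; their interpretation remains with the instructor.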



Implementation Issues

Among the issues that may be taken into account to evaluate a TMS from an implementation point of view, we have identified Security and Communication with other software.

Security

There is a wide range of security issues related to the use of a computer assisted assessment system, including the security of the test material, of the HTML code that implements testing, and of the identification of the users. While commercial programs usually implement encryption to address concerns about the test material and its HTML code, it must be pointed out that most freeware applications do not. In fact, most freeware applications rely either on Perl/CGI or on JavaScript. The former approach is dangerous since it is basically the equivalent of letting the world run a program on the server side. On the other hand, since JavaScript code runs on the client side of the application, the assessment program cannot be completely hidden, and a "smart" student can access the source and discover the right answer associated with each question. In both cases, some sophisticated techniques may be identified to partially overcome the problems.
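
One commonly used mitigation for the client-side problem, offered here as an assumption and not necessarily among the techniques alluded to above, is to keep the answer key on the server and to send the browser only the question text and options. The Python sketch below illustrates the idea; the data layout and function names are hypothetical.

```python
# Full question records, held on the server only (made-up content).
QUESTIONS = {
    "q1": {
        "text": "Which structure gives first-in, first-out access?",
        "options": {"A": "stack", "B": "queue", "C": "tree"},
        "answer": "B",
    },
    "q2": {
        "text": "Which search requires sorted input?",
        "options": {"A": "linear", "B": "random", "C": "binary"},
        "answer": "C",
    },
}


def client_payload():
    """Return what is sent to the browser: no correct answers included."""
    return {
        qid: {"text": q["text"], "options": dict(q["options"])}
        for qid, q in QUESTIONS.items()
    }


def grade(submission):
    """Score a submission on the server, where the key is not visible."""
    return sum(
        1 for qid, q in QUESTIONS.items() if submission.get(qid) == q["answer"]
    )


payload = client_payload()
score = grade({"q1": "B", "q2": "A"})
```

With this arrangement, inspecting the page source reveals the questions but not the key. User identification and protection of the server-side program remain separate problems.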

Communication

Communication with other existing software may be useful both for exporting answers and for calling external applications. Exporting answers is usually performed through text files and data conversion utilities. This may be useful to customize the reports generated by the application, or when a more detailed analysis than the one allowed by the assessment tool is needed.
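
As a minimal sketch of such an export, the following Python fragment writes answers as comma-separated text that a spreadsheet or statistical package could read. The column names and sample rows are assumptions made for the example, not a format defined by any of the tools discussed.

```python
import csv
import io

# Hypothetical column layout for the exported answers.
FIELDS = ["examinee", "item", "response", "correct"]


def export_answers(rows, stream):
    """Write one line per answer, preceded by a header line."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


rows = [
    {"examinee": "ann", "item": "q1", "response": "B", "correct": 1},
    {"examinee": "ann", "item": "q2", "response": "A", "correct": 0},
]

# An in-memory buffer stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
export_answers(rows, buffer)
exported_lines = buffer.getvalue().splitlines()
```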

Furthermore, many available tools provide the ability to call a program as a block within a question. The called program returns a score in points that may be added to the score of the test. This may be useful for assessing abilities that cannot be evaluated through the basic question-answer paradigm of many assessment tools. Some tools allow a call to an external application to be activated at the very end of the test phase. The designers of Perception (2000), for instance, point out that "this may be useful for: printing certificates for all users who pass the test; electronically submitting the answer file to a central location for analysis and evaluation; storing the results in a file to be accessed by a user program". Finally, communication with other software is required in order to allow integration with TDSs distributed by different commercial producers.
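
The call to an external program that returns a score can be sketched as follows. The convention that the program prints a single integer on its standard output is an assumption made for the example; real tools define their own conventions. The Python interpreter itself is used as a stand-in for the external program.

```python
import subprocess
import sys


def external_score(command, timeout=10):
    """Run an external program and read its score from standard output.

    Assumption: the program prints a single integer (the points earned)
    and exits with status 0.
    """
    result = subprocess.run(
        command, capture_output=True, text=True, check=True, timeout=timeout
    )
    return int(result.stdout.strip())


# Stand-in for a real grading program: a one-line script that prints 3.
stand_in = [sys.executable, "-c", "print(3)"]

test_score = 7  # points earned on the ordinary questions (made-up value)
total = test_score + external_score(stand_in)
```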



Final Remarks

Test Management Systems are available both as commercial and as freeware applications. Commercial TMSs may be divided into two main categories: publisher systems and off-the-shelf systems. Publisher systems are licensed only for use with particular adopted textbooks and are therefore usually proprietary, which limits sharing among different courses and departments. On the other hand, off-the-shelf programs are available with faculty-wide site licenses that allow the same system to be adopted for all courses. This is an important aspect, since development, maintenance and expertise can be shared among instructors. At the same time, students need to know only one interface, which reduces the effort they spend learning something that is not directly related to their educational process. Furthermore, off-the-shelf programs match most of the criteria discussed in this paper, and are therefore preferable both to publisher systems and to freeware.

The main advantages of free applications are tied to cost considerations and to the availability of source code, which allows the web assessment to be tailored to special needs that may not be fulfilled by existing tools. In fact, although objective testing can be used to assess a wide range of learning outcomes, more complex patterns of achievement are very difficult to evaluate through this approach (Gronlund, 1985; Crabbe, Grainger & Steward, 1997; Ebel, 1979; Gagné & Briggs, 1979). For instance, the abilities to state and to recognize inferences and to recognize the limitations of data are very difficult to evaluate through question/answer mechanisms. In our opinion, freeware tools may be useful in improving the analytical abilities of the students. Therefore, when a model for the evaluation of complex patterns of interaction is needed, it may be useful to start from a freeware program that can be enhanced through integration with ad hoc facilities.

In this paper we have discussed a framework that may be useful to assist an educational team in the selection of a TMS. The framework has been obtained by modifying and extending existing work in the field (Freemont & Jones, 1994; Gibson et al., 1995). Four main metrics have been identified and discussed: Interface, Testing, Assessment and Implementation.

Our continuing research effort is aimed at identifying a similar framework for the evaluation of a Test Delivery System. The structure of a CAA system is very complex, as shown in Figure 1, which includes tools for the evaluation of the tests both from the point of view of the teacher (Test building support) and from the point of view of the class/student (Test Analysis). This complexity has been recognized by a number of commercial developers. For example, the "Better Testing" product developed by Question Mark Computing Ltd. is a support system for the creation of well-formed questions and tests, and "TestFact" by Assessment System Co. is a support for the statistical evaluation of exams.

The wide interest shown by educational communities all over the world in distance learning and on-line course delivery often implies the possibility of delivering tests over the web. This is the rationale behind the existence of the "Web enabler" module of Figure 1.


Figure 1 - The complete structure of a CAA tool

It is important to identify metrics that can be used to evaluate all the modules that belong to this general structure of a CAA system. Once all metrics have been identified, a set of measurement methods will have to be defined in order to evaluate the software products. This implies defining a method for assigning either numeric or nominal values (such as "good", "average", "bad", etc.) to every identified attribute. At this point, a review of the commercial and freeware applications referenced in Looms (2000) will be conducted by publishing a questionnaire, hosted on the website of our department, to be completed by interested users.
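
As a purely illustrative sketch of such a measurement method, the Python fragment below maps nominal values to numbers and combines them into a weighted score over the four metrics discussed in this paper. The numeric scale, the weights and the sample ratings are hypothetical; defining them properly is exactly the work that remains to be done.

```python
# Hypothetical mapping from nominal values to numbers.
NOMINAL_SCALE = {"bad": 0, "average": 1, "good": 2}

# Hypothetical relative weights for the four metrics of the framework.
WEIGHTS = {"Interface": 2, "Testing": 3, "Assessment": 3, "Implementation": 2}


def overall_score(ratings, weights=WEIGHTS, scale=NOMINAL_SCALE):
    """Return the weighted mean of the numeric values of the ratings."""
    total_weight = sum(weights.values())
    weighted = sum(weights[m] * scale[ratings[m]] for m in weights)
    return weighted / total_weight


# Made-up ratings for two fictitious tools.
tool_a = {"Interface": "good", "Testing": "average",
          "Assessment": "good", "Implementation": "bad"}
tool_b = {"Interface": "average", "Testing": "average",
          "Assessment": "average", "Implementation": "average"}

score_a = overall_score(tool_a)
score_b = overall_score(tool_b)
```

A weighted mean is only one possible aggregation; it hides a poor rating on one metric behind good ratings on the others, which an evaluation team may or may not find acceptable.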



Authors

Salvatore Valenti received his degree (Laurea) in Electronic Engineering from the University of Ancona, Italy, in 1983. Since 1990, he has been a researcher in the Computer Science Department at the University of Ancona. His main research interests are focused on computer based instruction and assessment, and on the elicitation of functional specifications in the area of requirements engineering. He has been a member of research groups working on several projects funded by the Ministry of Research, the National Research Council, and the European Union. Currently he is involved in MODASPECTRA (MOtor Disability Assessment SPEcialists' TRAining), a research and technology development project pertaining to the "Telematics Application Programme - Education and Training" sector of the 4th Framework Program for R&D of the European Union (Project reference number MM 1041). The project is aimed at developing a remote teaching course for preparing post-graduate specialists on Motor Disability Assessment. He is a board member of the Journal of Information Technology Education. He has co-authored a chapter titled "Web-based assessment of Student Learning", in A. K. Aggarwal (Ed.), (2000), Web-based Learning & Teaching Technologies: Opportunities and Challenges, Hershey, PA: Idea Group Publishing Co.

Alessandro Cucchiarelli received his degree (Laurea) in Electronic Engineering from the University of Ancona, Italy, in 1985. Since 1991, he has been a researcher in the Computer Science Department at the University of Ancona. His main research interests are focused on Automatic Evaluation of Software and Information Extraction. He has also been involved in research activity on Models and Tools for Cooperative Distributed Information Systems, Requirement Engineering and Robotics. He has been a member of research groups working on several projects funded by the Ministry of Research, the National Research Council, and the European Union (Cost13, ECRAN).

Maurizio Panti is the chair of the Computer Science Department at the University of Ancona. He teaches courses on the Foundations of Computer Science, Information Systems, Fundamentals of Computing, and Databases at the University of Ancona and at the Universities of Macerata and Camerino. He has been active in many projects on information extraction and web based teaching, and has been responsible for the research unit of the University of Ancona in the national project INTERDATA. His research interests are devoted to: (a) databases, with a particular focus on the problems of discovering the similarity of conceptual schemas; (b) multiagent techniques applied to open and Cooperative Information Systems; and (c) Distance Learning.

Contact Author:
Sal Valenti
Computer Science Dept.
University of Ancona - 60100 Ancona – Italy
tel.: +39 071 2204824
fax: +39 071 2204474
email: valenti@inform.unian.it
http://www.inform.unian.it/personale/valenti/valenti.html



References

Association for Computing Machinery – Student Chapters. (1999). Available: http://www.cs.uidaho.edu/~acm/test_bank.html, http://www.utdallas.edu/orgs/acm/testbank.html

Anderson, E. E. (1989). A heuristic for Software Evaluation and Selection. Software Practice and Experience, 19, 707-717.

Ares Casal, J. M., Dieste Tubio, O., Garcia Vasquez, R., Lopez Fernandez, M., & Rodriguez Yanez, S. (1998). Formalizing the software evaluation process. Proceedings of the 18th International Conference of the Chilean Society of Computer Science, IEEE, 15-24.

Bloom, B. (1956). Taxonomy of Educational Objectives, Handbook I, Cognitive Domain. New York: David McKay Co. Inc.

Burstein, J. & Chodorow, M. (1999). Automated essay scoring for nonnative English speakers, [On-line]. Available: http://www.ets.org/research/acl99rev.pdf

Course Test Manager (2000). Course Technology [Computer program]. Thomson Learning. Available: http://www.course.com/at/assessment/ctm.cfm

C-Quest (2000) [Computer program]. Cogent Computing Co. Available: http://www.cogentcorp.com/

Crabbe, J., Grainger, J., & Steward, R. (1997). Quality assessment of Computer Based Learning. Educational Computing, 8(3), 17-19.

Ebel, R.L. (1979). Essentials of Educational Measurement. Englewood Cliffs, NJ: Prentice Hall.

FastTest Pro. (2000). [Computer program]. Assessment System Corporation. Available: http://www.assess.com/FastTEST.html

Foltz, P. W., Laham, D., & Landauer, T. K. (1999). Automated Essay Scoring: Applications to Educational Technology. Proceedings of EdMedia'99. Available: http://www-psych.nmsu.edu/~pfoltz/reprints/Edmedia99.html

Freemont, D. J., & Jones, B. (1994). Testing Software: A review. New Currents 1.1. Available: http://www.ucalgary.ca/pubs/Newsletters/Currents/Vol1.1/TestingSoftware.html

Gagné, R. M., & Briggs, L. J. (1979). Principles of instructional design (2nd ed.), New York: Holt, Rinehart and Winston.

Gibson, E. J., Brewer, P. W., Dholakia, A., Vouk, M. A., & Bitzer, D. L. (1995). A comparative analysis of Web-based testing and evaluation systems. Proceedings of the 4th WWW conference, Boston. Available: http://renoir.csc.ncsu.edu/MRA/Reports/WebBasedTesting.html

Gillham, M., Kemp, B., & Buckner, K. (1995). Evaluating Interactive Multimedia Products for the Home. The New Review of Hypermedia and Multimedia, 1, 199-212.

Gronlund, N. E. (1985). Measurement and Evaluation in Teaching. New York: MacMillan Publishing Company.

Hansen, W. J. (1999). A Generic Process and Terminology for Evaluating COTS Software, Software Engineering Institute. Available: http://www.sei.cmu.edu/staff/wjh/Qesta.html

Hazari, S. I. (1999). Evaluation and selection of web course management tools. Available: http://sunil.umd.edu/webct

Henderson, R. D., Smith, M. C., Podd, J., & Varena-Alvarez, H. (1995). A comparison of the four prominent user-based methods for evaluating the usability of computer software. Ergonomics, 38, 2030-2044.

InQsit (2000) [Computer program], Ball State University. Available: http://www.bsu.edu/inqsit/

International Standard Organization (1991). International Standard - 9126 Information Technology – Software quality characteristics and sub-characteristics. ISO/IEC.

Looms, T. (2000). Survey of Course and Test Delivery / Management Systems for Distance Learning, [On-line]. Available: http://tangle.seas.gwu.edu/~tlooms/assess.html.

LXR!TEST (2000) [Computer program]. Logic eXtension Resources. Applied Measurement Professionals, Inc. Available: http://www.lxrtest.com/

MicroCat (2000) [Computer program]. Assessment System Corporation. [On-line]. Available: http://www.assess.com/microcat.html

Nikoukaran, J., Hlupic, V., & Paul, R. J. (1999). A hierarchical framework for evaluating simulation software. Simulation Practice and Theory, 7, 219-231.

Nielsen, J., & Molich, R. (1990). Heuristic evaluation of user interfaces. Proceedings of the International Conference on Human Factors in Computing, ACM, 249-256.

Perception (2000) [Computer program], Question Mark Computing Ltd. Available: http://www.qmark.com

Raine, J. (1999). Towards the Introduction of institution wide computer assisted assessment: a service department experience. In M. Danson and R. Sherratt (Eds.) Proceedings of the 3rd Annual Computer Assisted Assessment Conference (pp. 163-175). Loughborough: Learning & Teaching Development.

Stamelos, I., Refanidis, I., Katsaros, P., Tsoukias, A., Vlahavas, I., & Pombortis, A. (in press). An adaptable framework for educational software evaluation. Recent Developments and Applications in Decision Making. Amsterdam: Kluwer Academics.

Sutcliffe, R. G., Leonard, E. M., Tierney, A., Howe, C. W., Reid, I., Goodwin, S. T., & Mackenzie, D. M. (1999). Introduction of a range of computer-based objective tests in the examination of Genetics in first year Biology. In M. Danson and R. Sherratt (Eds.), Proceedings of the 3rd Annual Computer Assisted Assessment Conference (pp. 193-203). Loughborough: Learning & Teaching Development.

The Question Bank (2000) [Computer Program]. Advanced Teaching Resources, Inc. Available: http://www.teachingtech.com/qbank.htm

Vlahavas, I., Stamelos, I., Refanidis, I., & Tsoukias, A. (1999). ESSE: An expert system for software evaluation. Knowledge Based Systems, 4, 183-197.



All of our published authors control the copyrights to their published articles in Current Issues in Education. Others who wish to reprint anything they see in this journal should contact the original author directly for permission. When referencing any published articles from this online journal, please credit CIE as the original publisher and include the URL of the CIE publication in your credits and citations. Permission to copy any article is provided to all, provided CIE is credited and copies are not sold.