WEIGHTING OF ITEMS IN A TUTORIAL PERFORMANCE EVALUATION INSTRUMENT: STATISTICAL ANALYSIS AND RESULTS

ABSTRACT
Weighting of items in an evaluation instrument contributes to more meaningful and valid interpretations of student performance in respect of each learning outcome or item being assessed. It follows that the validity of instruments is important for meaningful inferences about students' learning performance, including their performance in tutorial groups. The Delphi technique was used to elicit experts' subjective judgement of the content validity of items in the tutorial performance evaluation instrument in rounds one and two. A sample of eight experts (n = 8) was selected by purposive, maximum variation sampling.
In round three of the Delphi technique a weighted score was determined for each of the instrument items, subitems and Likert scale points through pairwise comparison by the experts. Mathematical modelling of experts' weighting comparisons, recorded on visual analogue scales, resulted in proportional weights for each item; these weights are expressed as percentages.
The final instrument comprised weighted items measured on a rating scale with points that are not equidistant. A computerised tutorial performance evaluator (TPE) was developed for accurate, economical and efficient calculation of student scores. The purpose of this article is to report on the statistical analysis and results of the weighting of items in an instrument to assess and evaluate baccalaureate nursing students' performance in problem-based learning tutorials.

INTRODUCTION
Weighting of items in any assessment tool is difficult in the absence of scientific evidence and is a perennial challenge facing nurse educators. Assessing and evaluating student performance require scores that are accurate, valid and free from bias for meaningful interpretations and conclusions about students' learning performance. It follows that instruments used to assess performance in any learning environment must produce results from which valid and unbiased inferences can be drawn. In problem-based learning (PBL) students' group skills or group behaviours in small group tutorials are important indicators of learning. Tutorial behaviours generally assessed include self-directed learning, communication, small group interaction, reasoning and autonomy (Niemenin, Saure & Lonka 2006:65; Rideout 1999:216). The content of this paper is derived from a study that sought to determine, in part, the validity of a tutorial performance evaluation instrument to evaluate group skills in a PBL context; the validation processes are described in a previous article.
The study institution, a university department of nursing, uses a PBL curriculum for the preparation of its undergraduate baccalaureate nursing students. One of the main features associated with this learning approach is the tutorial, where small groups of students discuss and analyse clinical and community problems. These small group discussions are facilitated by a nurse educator (called a facilitator) whose primary role is to foster cooperation, stimulate thinking, promote enquiry and facilitate problem solving through the search for, application and integration of knowledge.
These processes require the individual learner to possess or, in the longer term, to develop a range of skills necessary for effective group functioning. Appropriate communication skills are required together with having to learn the 'new language' associated with health sciences. Growth of the student and of the group is encouraged and promoted to improve self-confidence and to motivate the individual to become a self-directed learner. Good problem-solving skills together with critical thinking skills are paramount and, if not present, need to be developed. These skills should assist the individual to arrive at a logical conclusion when analysing and seeking solutions to a problem. Most of these skills are abstract and difficult to measure and, in a learning programme, must be assessed to determine and provide feedback on students' learning progress. Originally, a 36-item instrument was developed to assess the development of group skills within PBL tutorials without evidence of its validity to assess student learning in tutorial groups. After determining its content validity this instrument was subjected to processes and statistical procedures to determine the value of each item in relation to other items in the instrument, ultimately to assign individual item weights. This paper reports on the statistical analysis and procedures for the weighting of items in a tutorial performance evaluation instrument.

Definition of key concepts
The following definitions applied to this study:
• Validity refers to the appropriateness, meaningfulness and usefulness of inferences drawn from instrument scores.
• Construct refers to a main variable in an evaluation instrument within which measurable criteria or subitems are located. Construct is used interchangeably with main item in this study.
• Weighting refers to the value assigned to an item and subitem based on its importance in a set of items in an evaluation instrument to enhance its internal structure.
• Tutorial performance refers to student behaviour in a small-group learning context, which facilitates individual learning, group learning and team work (Rideout 1999:233).

The tutorial performance evaluation instrument
Historically, students' performance in PBL tutorials had been assessed using an original tutorial performance evaluation instrument comprising seven main items or constructs and 36 subitems. These items were equally weighted and rated on an eight-point Likert scale with equidistant points. After content validity had been determined in an earlier part of the study this evaluation instrument was sent to experts for them to estimate the value of each item, subitem and Likert scale point through pairwise comparisons. Statistical procedures applied to the experts' estimated values resulted in relative weights being assigned to each item. In preparing the instrument for weighting procedures the seven main items (constructs) were labelled A-G and the subitems inside each construct as 1, 2, and so forth, according to the number of subitems present.

Methods
The purpose of this part of the research was to determine the relative weights of items in a tutorial performance evaluation instrument with a view to enhancing its internal structure and thus its validity.

Design and sample
Within an overarching quantitative design the Delphi technique was used to elicit experts' subjective judgement (Crawford & Williams 1985:3; Miranda 2001:87) regarding the validity of unweighted items in the tutorial performance evaluation tool.
To this end, experts estimated the relative weight of each item in relation to another by a process of pairwise comparison. A sample of eight experts (n = 8) from two South African universities was selected by purposive, maximum variation sampling (Patton 2002:234).

Data collection procedure
Three rounds of the Delphi technique were used for collecting the data. In rounds one and two content validity was established; the seven main items remained and the subitems were reduced from 36 to 34. The Likert scale was reduced from eight to four points with descriptors as determined and verified by the experts. Once content validity had been established the preeminent tutorial performance evaluator (TPE) had the distinct disadvantage of all items carrying the same weight.
In round three of the Delphi technique, a weighted score incorporating the weights for each of the constructs or main items (WC), subitems (WI) and Likert scale points (WL) was determined. This was achieved by the experts' subjective judgement, through pairwise comparison (David 1963:9) of the relative value of each of WC, WI and WL. These subjective ratings were recorded on visual analogue scales. The procedure for weighting through pairwise comparison was as follows:
Firstly, each expert was asked to rate the value or importance of one item relative to another for each possible pair of main items (constructs) on a 100-mm visual analogue scale; in other words, experts were asked to give a subjective judgement of the weight of one construct against a second construct in a pairwise fashion until all constructs had been rated.

Example:
Construct A versus Construct B (pair AB) on a 100-mm visual analogue scale running from 0 mm to 100 mm, with the expert's mark at 60 mm: Construct A = 60 mm; therefore, Construct B = 40 mm.
Secondly, and in a similar way, experts were required to judge and rate the relative weights of pairs of subitems within each construct until all possible combinations of subitems had been rated. These two steps in the data collection procedure produced 100 visual analogue scales per expert: 21 for main items and 79 for subitems. This step resulted in a total of 800 units of analysis.
Thirdly, experts were required to conduct a similar weighting assessment of the four-point Likert scale by marking the distance between the four points (0 to 3) on a 100-mm visual analogue scale. After pairwise comparisons had been completed all visual analogue scales (n = 808) were returned to the researcher. Visual analogue scales from pairwise comparison of main and subitems (n = 800) were accurately measured for the distance between 0 mm and the experts' marks. These measurements were entered onto an Excel spreadsheet for statistical analysis by a resident statistician. Visual analogue scales (n = 8) for determining the Likert scale weighting were accurately measured for the distance between each point of the scale. Similarly, all measures were entered onto a second Excel spreadsheet for analysis.
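The counts reported above can be checked with a short sketch. It pairs each of the seven constructs (A-G) with every other construct once, which gives the 21 main-item scales. The variable names are illustrative, and the figure of 79 subitem scales is taken as reported because the number of subitems in each construct is not listed here.

```python
from itertools import combinations

constructs = ["A", "B", "C", "D", "E", "F", "G"]      # the seven main items
main_item_pairs = list(combinations(constructs, 2))   # each unordered pair once

n_experts = 8
subitem_scales = 79  # as reported in the text; not derived in this sketch

scales_per_expert = len(main_item_pairs) + subitem_scales
pairwise_scales = scales_per_expert * n_experts
all_scales = pairwise_scales + n_experts  # plus one Likert-scale VAS per expert

print(len(main_item_pairs), scales_per_expert, pairwise_scales, all_scales)
# prints: 21 100 800 808
```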

Statistical analysis
A linear regression model was fitted to the logarithms of the relative weights obtained during pairwise comparison of main items (WC) and subitems (WI). Regression coefficients were exponentiated and standardised to add up to 100%. Each subitem within a main item (construct) was weighted and expressed as a percentage, the sum of which equals 100%. Each construct now had its own unique weighting represented as a percentage, the sum of which equals 100%. In estimating item weights a mathematical logarithm was used.

Health SA Gesondheid (http://www.hsag.co.za), Original Research, Article #408

Example:
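$a_{ijk}$ is the ratio between $w_i$ and $w_j$. For Item i versus Item j on a 100-mm visual analogue scale (0 mm to 100 mm), with the mark at 60 mm:
$$a_{ijk} = \frac{w_i}{w_j} = \frac{60}{40}$$
In the above example the ratio between Item i and Item j is 1.5 to 1.0.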
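The estimation step can be illustrated with a minimal sketch. The exact model specification is not given in this article, so the sketch assumes one common formulation: the logarithm of each observed ratio equals the difference between two log-weights plus error, and every expert rates every pair exactly once. Under that assumption the least-squares coefficients have a closed form (the signed mean of the log-ratios), which are then exponentiated and standardised to 100% as described above. The function name `estimate_weights` and the input layout are hypothetical.

```python
import math


def estimate_weights(expert_marks, items):
    """Estimate percentage weights from visual analogue scale marks.

    expert_marks: one dict per expert mapping (item_i, item_j) to the mark
        in mm on a 100-mm scale; item_i receives the mark and item_j the rest.
    items: list of all item labels.
    Assumption: every expert rated every pair exactly once (balanced design).
    """
    n_items = len(items)
    n_experts = len(expert_marks)
    signed_log_sum = {item: 0.0 for item in items}

    for marks in expert_marks:
        for (item_i, item_j), mm in marks.items():
            if not 0.0 < mm < 100.0:
                raise ValueError("mark must lie strictly between 0 and 100 mm")
            log_ratio = math.log(mm / (100.0 - mm))  # log(w_i / w_j)
            signed_log_sum[item_i] += log_ratio
            signed_log_sum[item_j] -= log_ratio

    # Least-squares fit of log a_ijk = b_i - b_j with the b values summing to 0
    coefficients = {
        item: total / (n_items * n_experts)
        for item, total in signed_log_sum.items()
    }

    # Exponentiate and standardise so that the weights add up to 100%
    raw = {item: math.exp(b) for item, b in coefficients.items()}
    scale = sum(raw.values())
    return {item: 100.0 * value / scale for item, value in raw.items()}


# The single 60-mm mark from the example, treated as the only comparison
example = estimate_weights([{("i", "j"): 60.0}], ["i", "j"])
```

With only the one 60-mm comparison from the example, the sketch returns weights of about 60% and 40%, which preserves the ratio of 1.5 to 1.0. An unbalanced set of comparisons would need a general least-squares solver in place of the closed form.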

RESULTS
Statistical analysis and mathematical modelling of instrument items now produced ordinal scale data for all main items (constructs) (n = 7) and subitems (n = 34) with points that are not equidistant. Thereafter, each main and subitem had its own proportional value. Once the subitems in a particular construct or main item had been calculated to an overall percentage, the latter was further calculated in accordance with the percentage specific to that construct.

Example:
(WI)(WC) / 100%
Results of mathematical modelling of experts' weighting of the four-point rating scale (0-3) were as follows: Assuming that 0 = 0% and 3 = 100%, a rating of 1 was weighted as 28% and a rating of 2 as 69%; thus the Likert scale points were no longer equidistant. The results of item weighting are shown in Figure 1. Every tick-box on the instrument, now called the TPE, has a unique weight equal to the product of the weight of the main item/construct (WC), the weight of the subitem (WI) and the weight of the rating scale point (WL); in other words, (WC)(WI)(WL)/100%.
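A worked example may make the arithmetic concrete. Suppose a construct carries WC = 20% and one of its subitems carries WI = 25%; both figures are hypothetical and chosen only for illustration. The rating-scale weight for a rating of 2 is the 69% reported above, read here as a proportion (0.69) so that the result stays a percentage; that reading is an interpretation of the formula and is not stated explicitly in the text.

```latex
\[
\frac{W_I \times W_C}{100\%} = \frac{25\% \times 20\%}{100\%} = 5\%,
\qquad
5\% \times W_L = 5\% \times 0.69 = 3.45\% \quad \text{(rating of 2)}
\]
```

Under these assumed values the tick-box for a rating of 2 on that subitem would contribute 3.45% to the total score, and a rating of 3 would contribute the full 5%.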

Constructing and implementing the TPE
Each of the seven (7) main items together with the specific subitems relative to that main item (construct) had their own given percentage. Having been assigned their relative weights, the main items were ranked from the highest to the lowest percentage (see Table 1). Once completed, the subitems within each main item were also ranked from the highest to the lowest percentage.
Using the TPE to score students' learning performance requires the nurse educator/facilitator to calculate the (WC)(WI)(WL)/100% for each subitem. This would be time consuming if done manually. Additionally, error may occur rendering the composite score inaccurate, unreliable and invalid. Practicality and ease of use of the TPE could not be overlooked for successful implementation. A computer-based TPE (see Figure 2) was designed to allow for these calculations to be done efficiently, accurately and quickly for meaningful interpretation of students' scores.
The TPE is used for formative assessment purposes by both the student (self-assessment) and the nurse educator (facilitator assessment). During this process, which is a paper assessment, each subitem on the TPE is given a rating of 0, 1, 2 or 3 by the individual carrying out the assessment according to descriptors of the rating scale (see Table 2). An agreed-upon rating between the student and facilitator is then entered onto the computerised TPE by clicking in the corresponding box. The calculations are computed automatically by identifying the value of a 0, 1, 2 or 3 rating and converting the rating to the relative percentage. The sum of the percentages is computed to produce the total percentage.
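The automatic calculation described above can be sketched as follows. This is an illustration of the arithmetic only, not the TPE software itself: the function name `tpe_score`, the data layout and the two-construct weights in the usage example are hypothetical, while the rating-scale weights (0%, 28%, 69%, 100%) are those reported in the results and are treated as proportions.

```python
# Rating-scale weights reported in the results, expressed as proportions
LIKERT_WEIGHT = {0: 0.00, 1: 0.28, 2: 0.69, 3: 1.00}


def tpe_score(construct_weights, subitem_weights, ratings):
    """Return (total percentage, percentage achieved within each construct).

    construct_weights: {construct: WC in %}, summing to 100
    subitem_weights:   {construct: {subitem: WI in %}}, each summing to 100
    ratings:           {construct: {subitem: agreed rating 0-3}}
    """
    total = 0.0
    per_construct = {}
    for construct, wc in construct_weights.items():
        achieved = 0.0
        for subitem, wi in subitem_weights[construct].items():
            wl = LIKERT_WEIGHT[ratings[construct][subitem]]
            achieved += wi * wl              # share of this construct, in %
        per_construct[construct] = achieved
        total += wc * achieved / 100.0       # (WC)(WI)(WL)/100% summed over subitems
    return total, per_construct


# Hypothetical two-construct instrument, for illustration only
construct_weights = {"A": 60.0, "B": 40.0}
subitem_weights = {"A": {1: 50.0, 2: 50.0}, "B": {1: 100.0}}
ratings = {"A": {1: 3, 2: 2}, "B": {1: 1}}

total, per_construct = tpe_score(construct_weights, subitem_weights, ratings)
```

In this hypothetical case the student achieves about 84.5% within construct A and 28% within construct B, for a total of about 61.9%. Returning the per-construct percentages alongside the total reflects the use of the scores to identify domains in which a student needs support.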

DISCUSSION
Emerging thoughts on validity suggest that validity is not a property of an evaluation instrument but of instrument scores and interpretation of scores (Beckman, Cook & Mandrekar 2005:1159; Cook & Beckman 2006:166.e7). In this regard validity has become a unitary concept to describe the degree to which a score can be interpreted as representing the activity being measured (Cook & Beckman 2006:166.e8), in this case PBL tutorial performance. Sources of validity evidence are many; noteworthy, and for the purpose of this study, is the internal structure of an instrument (Beckman et al. 2005:1160). Internal structure refers to the degree to which individual items fit the underlying construct and is usually determined using factor analysis (Beckman et al. 2005:1160). Many factors in the instrument itself may threaten its internal structure; equality in weighting or unweighted items has been described as one such factor. Subjective judgement as an alternative to factor analysis has been used in this study to improve the internal structure of a tutorial performance evaluation instrument by way of item weighting. However, the credibility of subjective judgement as a method has aroused increasing criticism (Crawford & Williams 1985:3). As a consequence, quantification of experts' subjective judgements has been posited as a valid and reliable method to assist with weighting and preferential ranking of instrument items. Statistical analysis of subjective data from experts is thus important for valid inferences about student learning: in this instance, learning performance in PBL groups.
Individualised weighting of each item and subitem in the TPE together with the four points on the rating scale provides useful, differentiated information about specific aspects of student learning within PBL groups. Scaling each set of subitems within a construct or main item to a value of 100% allows the facilitator to view the student score for each set of subitems or learning domain on the TPE. Remediation and/or support can be offered to the student in areas where a low percentage has been obtained. Additionally, the hierarchical arrangement of constructs and their subitems enables the facilitator to be selective and to prioritise when giving academic support to a student. Depending on the level or year of study the facilitator and student can initially concentrate on domains that are seen to be of greater importance than others.

CONCLUSION
Statistical procedures applied to quantify experts' pairwise comparison of the relative value of instrument items resulted in the weighting and preferential ranking of items. Paired comparisons for the weighting of items produce more valid estimates than no comparison at all or reliance on experts' intuition. It may be concluded that, based on its internal structure, the computerised TPE has validity by virtue of the relative weight of items obtained during pairwise comparisons. The TPE is accurate, economical and easy to use by both the student and the facilitator; entering ratings onto the computer is a quick process with automatic calculation and conversion of scores. Reducing the rating options from eight points to four points and providing concrete descriptors for each point on the rating scale make item ratings more objective and reliable. The validity of a tutorial performance evaluation instrument thus also enhances the reliability of the processes during which scores are produced.