Saturday, May 3, 2008

Chapter 11 – Improving Teacher-Developed Assessments

Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - By W. James Popham

Chapter 11 provides procedures for improving teacher-developed assessments. Two general improvement strategies are discussed: the first is judgmental item improvement, and the second is empirical item improvement. Teachers need to set aside adequate time, and be motivated, to apply either of these strategies.
The first improvement strategy is judgmentally based improvement procedures. The judgments come from three sources: yourself, your colleagues, and your students, and each of the three judges follows a different procedure. The first judge is you, reviewing your own assessments against five criteria: first, adherence to item-specific guidelines and general item-writing commandments; second, contribution to the score-based inference; third, accuracy of content; fourth, absence of content lacunae, or gaps; and fifth, fairness. The second judge is a colleague, who is given the same five criteria along with a brief description of each. The third judge is the students. Students are often overlooked as a source of test analysis, yet who is better positioned, or more experienced, than the test takers themselves? Five improvement questions can be asked of students after an assessment is administered. First, did any of the items seem confusing, and if so, which ones? Second, did any items have more than one correct answer? If so, which ones? Third, did any items have no correct answer? If so, which ones? Fourth, were there words in any items that confused you? If so, which ones? Fifth, were the directions for the test, or for particular subsections, unclear? If so, which ones? These questions can be altered slightly for constructed-response assessments, performance assessments, or portfolio assessments. Ultimately, the teacher needs to be the final decision maker for any needed changes.
The second improvement strategy is known as empirically based improvement procedures. This strategy relies on the empirical data supplied when students respond to teacher-developed assessments. A variety of well-honed techniques have been used over the years, all involving simple counts or formula calculations. The first technique is the difficulty index. A useful index of an item's quality is its difficulty, also referred to as its p-value. The formula is as follows:
Difficulty: p = R / T
R = the number of students responding correctly to the item
T = the total number of students responding to the item

A p-value can range from 0 to 1.00. Higher p-values indicate items that more students answered correctly; an item with a lower p-value is one most students missed. The p-value should be viewed in relation to the students' chance probability of a correct response. For example, on a four-option multiple-choice item, a p-value of .25 would be expected by chance alone. The actual difficulty of an item should also be tied to the instructional program: a high p-value does not necessarily mean the item was too easy; rather, it may mean the content was taught well.
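As a quick sketch, the p-value calculation can be expressed in a few lines of Python (the function name and the sample numbers are mine, not the chapter's):

```python
def p_value(num_correct, num_total):
    """Item difficulty: the proportion of students answering the item correctly."""
    return num_correct / num_total

# Suppose 24 of 30 students answered an item correctly.
print(p_value(24, 30))  # 0.8
```

A value of .8 here would suggest a relatively easy item, or well-taught content, per the paragraph above.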
The second empirically based technique is the item-discrimination index. An item-discrimination index typically tells how frequently an item is answered correctly by those who perform well on the total test. It reflects the relationship between students' scores on the total test and their responses to a particular item. The author uses a correlation coefficient between students' total scores and their performance on a particular item, also referred to as a point-biserial correlation. A positively discriminating item is answered correctly more often by those who score well on the total test than by those who score poorly. A negatively discriminating item is answered correctly more often by those who score poorly on the total test than by those who score well. A nondiscriminating item is one for which there is no appreciable difference in correct responses between those who score well and those who score poorly. Four steps can be applied to compute an item's discrimination index. First, order the test papers from high to low by total score. Second, divide the papers evenly into a high group and a low group. Third, calculate a p-value for each group. Fourth, subtract p_l from p_h to obtain each item's discrimination index (D). For example:
D = p_h – p_l. The following chart can also be created:
Positive discriminator: High scores > Low scores
Negative discriminator: High scores < Low scores
Nondiscriminator: High scores = Low scores
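The four steps above can be sketched in Python as follows (a minimal illustration; the function name and sample data are my own):

```python
def discrimination_index(item_scores, total_scores):
    """Compute D for one item, given each student's 0/1 score on the item
    and their total test score (both lists in the same student order)."""
    # Step 1: order the papers from high to low by total score
    order = sorted(range(len(total_scores)),
                   key=lambda i: total_scores[i], reverse=True)
    # Step 2: divide the papers evenly into a high group and a low group
    half = len(order) // 2
    high, low = order[:half], order[-half:]
    # Step 3: calculate a p-value for each group
    p_high = sum(item_scores[i] for i in high) / half
    p_low = sum(item_scores[i] for i in low) / half
    # Step 4: D = p_high - p_low
    return p_high - p_low

# Six students: the three high scorers all got the item right,
# only one of the three low scorers did, so D = 1.0 - 1/3, about .67.
d = discrimination_index([1, 1, 1, 0, 1, 0], [95, 90, 85, 60, 55, 50])
```

With an odd number of papers, practitioners often drop the middle paper or use top and bottom fractions; this sketch simply splits evenly as the chapter describes.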

Negatively discriminating items are a sign that something about the item is wrong, because the item is being missed more often by students who are performing well on the test overall, and answered correctly by students who are not. Ebel and Frisbie (1991) offer the following guidelines:
.40 and above = Very good items
.30 – .39 = Reasonably good items, but possibly subject to improvement
.20 – .29 = Marginal items, usually needing improvement
.19 and below = Poor items, to be rejected or revised
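Ebel and Frisbie's guidelines translate directly into a small lookup function (an illustrative sketch; the function name and labels are mine):

```python
def rate_item(d):
    """Classify a discrimination index by Ebel and Frisbie's (1991) guidelines."""
    if d >= 0.40:
        return "very good"
    if d >= 0.30:
        return "reasonably good"
    if d >= 0.20:
        return "marginal"
    return "poor"  # .19 and below: reject or revise

print(rate_item(0.45))  # very good
print(rate_item(0.12))  # poor
```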

The third technique offered for empirically based improvement procedures is distracter analysis. A distracter analysis examines how the high and low groups are responding to an item's distracters. This analysis is used to dig deeper into items whose p-values or discrimination indices suggest they need revision. A table such as the following can be set up:


Test Item #1 (p = .50, D = –.33)
Alternatives:  A    B*   C    D    Omit
Upper 15:      2    5    0    8    0
Lower 15:      4    10   0    0    1

In this example the correct response is B (marked with *). Notice that students doing well are choosing distracter D, while students not doing well are choosing the correct answer B; this is what produces the negative discrimination index. Also of interest, no one is choosing C. As a result of this analysis, C should be changed to make it more appealing, and B should be examined to see what is leading students who are doing poorly to choose it while students who are doing well avoid it.
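As an illustrative sketch, the distracter table above can be tallied in Python (the function and variable names are my own, and the student response lists are reconstructed from the table's counts):

```python
from collections import Counter

def distracter_table(upper_choices, lower_choices, options="ABCD"):
    """Count how many students in each group chose each alternative (or omitted)."""
    table = {}
    for group, choices in (("Upper", upper_choices), ("Lower", lower_choices)):
        counts = Counter(choices)
        table[group] = {opt: counts.get(opt, 0) for opt in list(options) + ["Omit"]}
    return table

# The chapter's example: 15 students per group, correct answer B
upper = ["A"] * 2 + ["B"] * 5 + ["D"] * 8
lower = ["A"] * 4 + ["B"] * 10 + ["Omit"] * 1
table = distracter_table(upper, lower)
```

Laying the counts out this way makes patterns like the unused distracter C, or the high group's attraction to D, easy to spot at a glance.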
Criterion-referenced measurements can also use a form of item analysis, since traditional item analysis often yields low discrimination indices for them. The first approach requires administering the assessment to the same group of students prior to and following instruction. One disadvantage is that the item analysis cannot occur until after both assessments have been given and instruction has occurred. Pretests may also be reactive, sensitizing students to certain items, so that posttest performance reflects the pretest plus instruction, not instruction alone. The formula for this procedure looks like the following:
D_ppd = p_post – p_pre
p_post = the proportion of students answering the item correctly on the posttest
p_pre = the proportion of students answering the item correctly on the pretest

The value of D_ppd (discrimination based on pretest-posttest differences) can range from –1.00 to +1.00, with high positive values indicating items that reflect effective instruction. The second approach to item analysis for criterion-referenced measurement is to locate two different groups of students, one that has received instruction and one that has not. By assessing both groups and applying the item analysis described earlier, you can gain information on item quality more quickly while avoiding reactive pretests. The disadvantage is that you are relying on human judgment that the two groups are comparable.
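The D_ppd formula reduces to a one-line calculation; here is a minimal sketch in Python (the sample figures are invented for illustration):

```python
def d_ppd(pre_correct, post_correct, num_students):
    """Discrimination based on pretest-posttest difference: D_ppd = p_post - p_pre."""
    return post_correct / num_students - pre_correct / num_students

# Suppose 6 of 30 students answer correctly before instruction and 24 of 30 after:
# D_ppd = .8 - .2 = .6, a high positive value suggesting an instruction-sensitive item.
d = d_ppd(6, 24, 30)
```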
Two strategies have been described: judgmental and empirical. Both can be applied to improve assessments. When using judgmental procedures, five criteria guide the assessment creator as well as the collegial judge (who receives descriptions of each), and a series of questions can be put to students about the assessment. When using the more empirical methods, several types of item analysis and discrimination practices can be applied through various formulas. Ultimately, the assessment creator needs to apply good techniques and be willing to spend some time improving their assessments.
