Tuesday, May 27, 2008

Chapter 14 – Appropriate and Inappropriate Test-Preparation Practices

Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - By W. James Popham

Chapter 14 deals with appropriate and inappropriate test-preparation practices. This is a newly recognized concern arising from public reporting on education and the No Child Left Behind (NCLB) legislation. The pressure to increase test scores has led to improper and unethical test-preparation practices. There are steps teachers can take to ensure good testing practices while enhancing students' learning.
Educational achievement tests are administered to help teachers make good inferences about students' knowledge and/or skills in relation to the curricular aims being addressed. An achievement test should sample the curricular aim being addressed so it can serve as an indication of mastery or non-mastery. A student's mastery of a curricular aim and that student's score on the achievement test should rise and fall together; when they do, we are seeing good test-preparation practice.
There are two guidelines that can be used to ensure appropriate test-preparation practices. The first is the professional ethics guideline, which says that no test-preparation practice should violate the ethical norms of the education profession. State-imposed security procedures regarding high-stakes tests should never be compromised. Breaching proper test practices erodes public confidence and can result in the loss of teaching credentials. The second is the educational defensibility guideline, which states that no test-preparation practice should increase students' test scores without simultaneously increasing students' mastery of the curricular aim tested. This guideline emphasizes the importance of engaging in instructional practices that focus on students and their best interests.
There are five common test-preparation practices. These practices generally occur in the classroom, though special instruction sometimes needs to occur outside regular class time.

The first practice, previous-form preparation, provides special instruction and practice based directly on an actual previously used form of the test. This practice violates educational defensibility, because test scores may be boosted without a rise in mastery of the curricular aims.

The second practice is current-form preparation, which provides special instruction and practice based directly on the form of the test currently being used. This practice violates both the educational defensibility and the professional ethics guidelines; any access to a current form before its official release is a form of stealing or cheating.

The third practice is generalized test-taking preparation, which provides special instruction covering test-taking skills for dealing with a variety of achievement-test formats. Students learn how to make calculated guesses and how to use testing time judiciously. This is an appropriate form of test preparation: students are able to cope with different test types, are apt to be less intimidated by the various test items they may encounter, and their scores become a more accurate reflection of their true knowledge and skill.

The fourth practice is same-format preparation, which provides regular classroom instruction dealing directly with the content covered on the test, using practice items in the same format as the actual test. The items are clones of the test items, and for many students they are indistinguishable from the actual test. While this practice may not be unethical, it is not educationally defensible: if students can only handle items in the test's format, they are not prepared to show what they have learned in other ways, so test scores may rise without a genuine demonstration of curricular aim mastery.

The fifth practice is varied-format preparation, which provides regular classroom instruction dealing with the content covered on the test, but with practice items representing a variety of test-item formats. This practice satisfies both the ethical and the educational defensibility guidelines; the content of the test, and the curricular aims it represents, is practiced in a variety of formats, so both test scores and curricular aim mastery should rise.
A popular expression used today is “teaching to the test.” The phrase can carry a negative connotation: applied negatively, the teacher directs instruction specifically toward the items on the test itself, which is a form of bad instruction. A positive way to “teach to the test” is to aim instruction toward the curricular aims the actual test represents. The author suggests avoiding the phrase “teaching to the test” altogether to prevent confusion about instructional practices, and instead using the phrase “teaching to the curricular aim represented by the test.”
Raising students' scores on high-stakes tests is a common theme throughout public education. If teachers and administrators are being pressured to do this, they should only do so if they are provided curricular aims that are aligned with the assessment. Test items should be accompanied by descriptions, suitable for instructional planning, of what the items represent. If such curricular aim descriptions are not provided, score boosting should not be expected.
This chapter presented appropriate and inappropriate test-preparation practices. Two guidelines were explained: professional ethics and educational defensibility. Five test-preparation practices were also described, along with how the guidelines apply to them; varied-format and generalized test-taking preparation were the two sound practices that satisfied both guidelines. The phrase “teaching to the test” was described both negatively and positively, and a new phrase was suggested: “teaching to the curricular aim.” The idea that high-stakes tests should supply curricular aim descriptions so that solid instructional decisions can occur was also discussed.

Wednesday, May 21, 2008

Chapter 13 – Making Sense Out of Standardized Test Scores

Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - By W. James Popham

Chapter 13 focuses on standardized test scores and making sense of them. There are a variety of ways to interpret standardized tests and their scores, depending on what the scores are intended to measure. Teachers need to understand how the various tests they administer are scored. By understanding this, teachers can inform their instruction as well as make sense of students' performance on these tests and what the scores mean.
There are two types of standardized tests: one is designed to yield norm-referenced inferences, the other criterion-referenced inferences. These tests are administered, scored, and interpreted in a standard, predetermined manner. National standardized achievement tests are typically used to provide norm-referenced interpretations focused on the measurement of aptitude or achievement. State-developed standardized achievement tests are used in many states for accountability purposes; some states use them to assess the basic skills of high school students and to allow or disallow a student from receiving a diploma, even if curriculum requirements have been met. Results are often publicized and treated as indicators of educators' effectiveness. State standardized tests are intended to produce criterion-referenced interpretations.
Test scores can be interpreted individually or as a group. Group-focused interpretations are necessary for looking at all your students or at groups of students. There are three ways described to do this. The first is computing central tendency, an index of the center of the group's scores, such as the mean or median. A raw score is the number of items a student answered correctly. The median raw score is the midpoint of a set of scores; the mean raw score is the average of a set of scores. Both show a center point for group scores. A second useful way to describe group scores is variability, or how spread out the scores are. A simple measure of the variability of a set of students' scores is the range, calculated by subtracting the lowest score from the highest score. A third way to look at group scores is the standard deviation, the average difference between the individual scores in a group and the mean of that set of scores. The larger the standard deviation, the more spread out the scores in the distribution. The formula for the standard deviation is as follows:

SD = √[ ∑(X – M)² / N ]
∑(X – M)² = the sum of the squared differences between each raw score (X) and the mean (M)
N = the number of scores in the distribution


The mean and standard deviation are the best ways to discuss and describe group scores. It may be easier to compute the median and range, but they are not as reliable a way to characterize a set of scores, so the mean and standard deviation are preferred.
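To make these group-score descriptors concrete, here is a minimal sketch in Python (not from the book) that computes the mean, median, range, and standard deviation for a set of hypothetical raw scores; the scores themselves are invented for illustration.

import math

def describe_scores(scores):
    # Central tendency: mean and median of the raw scores.
    n = len(scores)
    ordered = sorted(scores)
    mean = sum(scores) / n
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    # Variability: range (highest minus lowest) and standard deviation,
    # SD = sqrt( sum((X - M)^2) / N ), as in the formula above.
    score_range = ordered[-1] - ordered[0]
    sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / n)
    return mean, median, score_range, sd

# Hypothetical raw scores for a class of ten students.
print(describe_scores([12, 15, 18, 20, 22, 22, 25, 27, 30, 34]))  # mean 22.5, median 22.0, range 22, SD about 6.4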
Individual test interpretations are also necessary. There are two ways to interpret individual test scores: in absolute or in relative terms. An absolute inference is when we infer from a score what the student has or has not mastered of the skills and/or knowledge being assessed. A relative inference is when we infer from a score how the student stacks up against other students currently taking the test or who have already taken it. The terms “below average” and “above average” are typically used with relative inferences.
There are three interpretive schemes to be considered for relative score interpretations. The first is percentiles, or percentile ranks. A percentile compares a student's score with those of other students in a norm group, indicating the percent of students in the norm group the student outperformed. The norm group consists of the students who took the test before it was published, in order to establish norms and help identify appropriate test items. There are different types of norm groups, such as national norm groups and local norm groups. The second interpretive scheme is grade-equivalent scores. A grade equivalent is an indicator of student test performance based on grade level and month of the school year. The purpose of grade-equivalent scores is to convert scores on a standardized assessment into an index reflecting a student's grade-level progress in school. This is a developmental score and is indicated as follows:
grade.month of the school year, for example 5.6 (fifth grade, sixth month)

Grade-equivalent scores are typically seen in reading and math. They are determined by administering the same test to several grade levels and establishing a trend line that reflects how raw scores increase at each grade level; estimates at points along the trend line then indicate what the grade equivalent of a given raw score would be. Many assumptions are made in this scoring scheme, making it rather questionable, and it can mislead parents about what the score really translates to. The appropriate interpretation is that a grade-equivalent score is an estimate of how the average student at a certain grade level might score on this test.

The third scoring scheme is scale-score interpretations. Scale scores are converted raw scores that use a new, arbitrarily chosen scale to represent levels of achievement or ability. The most popular scale-score systems are based on item response theory (IRT). These differ from raw-score reporting systems in that IRT scales take into consideration the difficulty and other technical properties of every single item on the test. There is a different average scale score for each grade level. Scale scores are used heavily to describe group test performance at the state, district, and school levels; they permit longitudinal tracking of students' progress and direct comparisons of classes, schools, and districts. It is very important to remember that not all scale scores are alike, so scores from exams built on different scales cannot be compared consistently.

Standardized tests also report normal curve equivalents, or NCEs, which attempt to convert students' raw scores into percentile-like values based on the normal (bell) curve. If the students' scores were perfectly symmetrical, a bell curve would be formed; however, sometimes the distribution is not normal and the meaning of the NCE evaporates. NCEs are therefore not a solution for comparing different standardized tests. A stanine is like an NCE, but it divides a score distribution into nine segments that, though equal along the baseline of the set of scores, contain different proportions of the distribution's scores. Stanines are approximate scale scores.
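As a concrete illustration of the percentile-rank and stanine ideas above, here is a minimal Python sketch (not from the book). The norm-group scores are invented, and the stanine cut points used are the conventional percentile boundaries (4, 11, 23, 40, 60, 77, 89, 96).

def percentile_rank(score, norm_group):
    # Percent of norm-group scores the student outperformed.
    below = sum(1 for s in norm_group if s < score)
    return 100 * below / len(norm_group)

def stanine(pr):
    # Map a percentile rank onto the nine stanine segments.
    cuts = [4, 11, 23, 40, 60, 77, 89, 96]
    return 1 + sum(1 for c in cuts if pr >= c)

norm_group = [28, 31, 35, 40, 42, 44, 47, 50, 53, 57]  # hypothetical raw scores
pr = percentile_rank(46, norm_group)
print(pr, stanine(pr))  # prints 60.0 and 6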
There are two tests used to predict a high school student's academic success in college. The first is the SAT, or Scholastic Aptitude Test. Its original function was to assist admissions officials at a group of elite Northeastern universities in determining whom to admit, and it was designed to compare inherited verbal, quantitative, and spatial aptitudes. Today, however, it is divided into three sections: critical reading, writing, and mathematics. The SAT uses a score range from 200 to 800 for each of the three sections, so as of 2005 the highest score that can be earned is 2400, the total across all three sections. The test takes about three hours to administer, and there is also a PSAT to help students prepare for the SAT. The second test is the ACT, which is also an aptitude test. Unlike the SAT, the ACT, or American College Test, was created as a measure of a student's educational development for soldiers taking advantage of the GI Bill money awarded to them for college; the SAT was sometimes too difficult or inappropriate, so a new measure was needed. The ACT addresses four content areas: English, mathematics, reading, and science. Like the SAT, it takes about three hours to administer. The ACT is scored by giving one point for every correct answer, with no subtraction of points for wrong answers; an average is then computed across the four sections, unlike the SAT, where section scores are added together. One very important point about the SAT and ACT is that only about 25% of academic success in college is associated with a high school student's performance on these tests; the other 75% has to do with non-test factors. Therefore, students who do not do well on these tests should not be discouraged from attending college.
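The difference between the two scoring models (SAT section scores summed, ACT section scores averaged) can be shown in a short Python sketch; the section scores below are purely hypothetical.

# Hypothetical section scores illustrating the two scoring models.
sat_sections = {"critical reading": 640, "writing": 590, "mathematics": 700}
act_sections = {"english": 24, "mathematics": 27, "reading": 25, "science": 26}

sat_total = sum(sat_sections.values())                           # SAT: section scores are added
act_composite = sum(act_sections.values()) / len(act_sections)   # ACT: section scores are averaged

print(sat_total, act_composite)  # prints 1930 and 25.5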
Standardized tests are assessment instruments that are administered, scored, and interpreted in a predetermined, standard format. They are used to get a handle on students' achievement and aptitude. Group test scores are described in two ways: by central tendency (mean and median) and by variability (range and standard deviation). Interpreting results through percentiles, grade-equivalent scores, scale scores, stanines, and normal curve equivalents was also explained, along with the strengths and weaknesses of each. The SAT and ACT were also described and explained.

Saturday, May 10, 2008

Chapter 12 – Instructionally Oriented Assessment

Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - By W. James Popham

Chapter 12 discusses how classroom instruction can be guided by assessment. Two strategies are discussed: making instructional decisions in light of assessment results, and planning instruction to achieve the curricular aim(s) represented by an assessment.
The first strategy, making instructional decisions as a result of assessment results, is typically known by most teachers; it is just not widely used. Since teachers are typically in the practice of using assessments to assign grades, they don't always see them as tools for making instructional decisions. Teachers use tests to find out the level of knowledge learned. The curricular aims tested could be cognitive, affective, or psychomotor, each of these areas being quite substantial in scope. Sampling these curricular aims through assessment provides inferences about student status, and those inferences can then be used to make instructional decisions. Other types of tests can be used to help determine grades, but assessments written to determine levels of knowledge should be used to inform instruction only.
Assessments that are used to inform instruction should be based on the following three decision categories. The first is what to teach. This can be determined with a pre-assessment given before instruction for a specific objective. The second decision category is how long to keep teaching toward a particular objective. This can be assessed during instruction; the decision from this type of formative assessment can be used to determine whether to continue or cease instruction on an objective, for a single student or for the whole class. The third decision category is how effective an instructional lesson or unit was. This can be assessed by comparing students' pre-test and post-test results, and the comparison can determine whether to retain, discard, or modify a given lesson or unit, as in the brief sketch below.
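As a minimal sketch (not from the book) of that third decision category, the Python snippet below compares class pre- and post-test means to suggest whether to retain or revise a unit; the 20% gain threshold is an invented cutoff, not one prescribed by the chapter.

def unit_effectiveness(pre_scores, post_scores, retain_threshold=0.20):
    # Compare class means before and after instruction; the threshold is hypothetical.
    pre_mean = sum(pre_scores) / len(pre_scores)
    post_mean = sum(post_scores) / len(post_scores)
    gain = (post_mean - pre_mean) / pre_mean
    decision = "retain unit" if gain >= retain_threshold else "revise or discard unit"
    return gain, decision

# Hypothetical class scores (percent correct) before and after the unit.
print(unit_effectiveness([40, 55, 35, 50], [70, 80, 60, 75]))  # roughly 0.58 gain -> retain unit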
The second assessment-based strategy for improving classroom instruction is planning instruction to achieve the curricular aim(s) represented by an assessment. This strategy is a rethinking of educational assessment: for testing to influence teaching, tests should be constructed prior to instructional planning, which allows planned instructional units to coincide with the content the test represents. In this model the teacher starts with the instructional objectives set forth in the curriculum, then creates assessments based on those goals, and only afterward uses the pre-assessment to help plan instructional activities intended to promote student mastery of the knowledge, skills, and/or attitudes to be post-assessed. Curriculum should always be the starting point. The assessment then clarifies the instructional aims and whether the intended skills have been mastered; teachers should never teach toward the test items themselves.
There are three benefits to this strategy. The first is more accurate task analysis: since you have a clearer idea of the results you are after, you can better identify the knowledge and skills students need to acquire on the way to mastering what is taught. Second, more on-target practice activities can be used: with a better sense of your end-of-unit outcomes, you can choose guided-practice and independent-practice activities more aligned with the targeted outcomes. The third benefit is more lucid expositions: by understanding more clearly what needs to be assessed at the conclusion of instruction, you can give students clearer explanations about the content and where instruction is heading.
The idea of assessment for learning has also evolved over time. In the past, teachers used assessments mostly to assess what students know and don't know; this is known as assessment of learning. While assessments of learning are important and should be utilized, assessments for learning should also be utilized, and even more so. It has been shown that students who are given assessments for learning were able to achieve in six to seven months what it took others a full year to achieve. Five strategies are suggested for implementing an assessment-for-learning sequence: first, clarify and share learning intentions and criteria for success; second, engineer effective classroom discussions, questions, and learning tasks; third, provide feedback that moves learners forward; fourth, activate students as owners of their own learning; and fifth, activate students as instructional resources for one another. Again, this is a huge shift for many teachers, but the learning benefits for students are tremendous. Formative assessments can also be used to help students achieve higher scores, especially on summative assessments.
The ideas of this chapter centered on improving instruction and instructional decisions based on information gained from assessments. Two strategies were described: making instructional decisions in light of assessment results, and planning instruction to achieve the curricular aim(s) represented by an assessment.

Saturday, May 3, 2008

Chapter 11 – Improving Teacher-Developed Assessments

Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - By W. James Popham

Chapter 11 provides procedures for improving teacher-developed assessments. Two general improvement strategies are discussed: the first is judgmental item improvement and the second is empirical item improvement. Teachers need to devote adequate time and be motivated to apply either of these strategies.
The first improvement strategy is judgmentally based improvement procedures. These judgments are provided by you, your colleagues, and your students, and each of the three judges follows different procedures. The first judge is you, judging your own assessments. The five criteria for judging your own assessments are: first, adherence to item-specific guidelines and general item-writing commandments; second, contribution to score-based inferences; third, accuracy of content; fourth, absence of content lacunae, or gaps; and fifth, fairness. The second judge is a collegial judge, who is given the same five criteria along with a brief description of each. The third judge is the students. Students are often overlooked as a source of test analysis, but who is better positioned or more experienced than the test takers themselves? There are five improvement questions students can be asked after an assessment is administered: first, did any of the items seem confusing, and if so, which ones? Second, did any items have more than one correct answer? If so, which ones? Third, did any items have no correct answer? If so, which ones? Fourth, were there words in any items that confused you? If so, which ones? Fifth, were the directions for the test or for particular subsections unclear? If so, which ones? These questions can be altered slightly for constructed-response assessments, performance assessments, or portfolio assessments. Ultimately the teacher needs to be the final decision-maker about any needed changes.
The second improvement strategy is known as empirically based improvement procedures. This strategy is based on the empirical data supplied when students respond to teacher-developed assessments. There are a variety of well-tuned techniques, used over the years, that involve simple number or formula calculations. The first technique is the difficulty index: a useful index of item quality is its difficulty, also referred to as the p-value. The formula is as follows:
p = R / T
R = the number of students responding correctly to the item
T = the total number of students responding to the item

A p-value can range from 0 to 1.00. Higher p-values indicate items that more students answered correctly; an item with a lower p-value is one most students missed. The p-value should be viewed in relation to the students' chance probability of a correct response; for example, on a four-option multiple-choice item, a p-value of .25 would be expected by chance alone. The actual difficulty of an item should also be tied to the instructional program: a high p-value does not necessarily mean the item was too easy, but rather that the content may have been taught well.
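Here is a minimal Python sketch (not from the book) of the p = R / T calculation applied to every item on a test; the 0/1 response data are invented for illustration.

def item_difficulties(responses):
    # responses: one list of 0/1 item scores per student (1 = correct).
    # Returns the p-value (R / T) for each item.
    num_students = len(responses)
    num_items = len(responses[0])
    return [
        sum(student[i] for student in responses) / num_students
        for i in range(num_items)
    ]

# Hypothetical 0/1 scores for five students on four items.
responses = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
]
print(item_difficulties(responses))  # [0.8, 0.4, 0.6, 1.0]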
The second empirically based technique is item-discrimination indices. An item-discrimination index typically tells how frequently an item is answered correctly by those who perform well on the total test; it reflects the relationship between students' responses to a particular item and their scores on the total test. The author uses a correlation coefficient between students' total scores and their performance on a particular item, also referred to as a point-biserial. A positively discriminating item is answered correctly more often by those who score well on the total test than by those who score poorly. A negatively discriminating item is answered correctly more often by those who score poorly on the total test than by those who score well. A nondiscriminating item is one for which there is no appreciable difference in correct responses between those who score well and those who score poorly on the total test. Four steps can be applied to compute an item's discrimination index (a brief sketch of these steps appears after the chart below). First, order the test papers from high to low by total score. Second, divide the papers evenly into a high group and a low group. Third, calculate a p-value for the item within each group (ph and pl). Lastly, subtract pl from ph to obtain the item's discrimination index: D = ph – pl. The following chart can also be created:
Positive Discriminator High scores > Low scores
Negative Discriminator High scores < Low scores
Nondiscriminator High scores = Low scores
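A minimal Python sketch (not from the book) of those four steps, reusing the hypothetical 0/1 response data from the p-value example above; when the number of papers is odd, the middle paper is left out of both groups.

def discrimination_index(responses, item):
    # Order papers by total score, split into high and low halves,
    # compute the item's p-value in each half, then D = ph - pl.
    ranked = sorted(responses, key=sum, reverse=True)
    half = len(ranked) // 2
    high, low = ranked[:half], ranked[-half:]
    p_high = sum(student[item] for student in high) / half
    p_low = sum(student[item] for student in low) / half
    return p_high - p_low

responses = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
]
print(discrimination_index(responses, item=1))  # 0.5, a positive discriminator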

Negatively discriminating items are a sign that something about the item is not right, because the item is being missed more often by students who are performing well on the test overall than by students who are not doing well. Ebel and Frisbie (1991) offer the following guidelines.
.40 and above = Very good items
.30 – .39 = Reasonably good items, but possibly needing improvement
.20 – .29 = Marginal items, usually needing improvement
.19 and below = Poor items, to be removed or revised

The third technique offered for empirically based improvement is distracter analysis, in which we examine how the high and low groups are responding to an item's distracters. This analysis is used to dig deeper into items whose p-values or discrimination indices suggest they need revision. A table such as the following can be set up:


Test item (p = .50, D = –.33)
Alternative   A    B*   C    D    Omit
Upper 15      2    5    0    8    0
Lower 15      4   10    0    0    1

In this example the correct response is B (marked with an asterisk). Notice that students doing well are choosing D, while students not doing well are choosing the correct answer B; also of interest is that no one is choosing C. As a result of this analysis, C should be changed to make it more appealing, and the item should be examined to see what about B causes lower-scoring students to choose it and higher-scoring students to avoid it.
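A minimal Python sketch (not from the book) of how such a distracter table could be tallied from students' answer sheets; the papers and their two-item answer lists are hypothetical.

from collections import Counter

def distracter_analysis(papers, item):
    # papers: list of (total_score, answers) tuples, where answers is the
    # list of letters the student chose. Tally upper- and lower-group choices.
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)
    half = len(ranked) // 2
    upper = Counter(p[1][item] for p in ranked[:half])
    lower = Counter(p[1][item] for p in ranked[-half:])
    return {"upper": dict(upper), "lower": dict(lower)}

# Hypothetical papers: (total score, chosen alternatives for two items).
papers = [
    (38, ["D", "A"]), (35, ["B", "C"]), (33, ["D", "A"]),
    (22, ["B", "B"]), (20, ["B", "D"]), (18, ["A", "C"]),
]
print(distracter_analysis(papers, item=0))  # {'upper': {'D': 2, 'B': 1}, 'lower': {'B': 2, 'A': 1}}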
Criterion-referenced measurements can also use a form of item analysis that avoids the low discrimination indices traditional item analysis can produce. The first approach requires administering the assessment to the same group of students prior to and following instruction. One disadvantage is that the item analysis can't occur until after both assessments have been given and instruction has occurred. Pre-tests may also be reactive, sensitizing students to certain items, so post-test performance is a result of the pre-test plus instruction, not instruction alone. The formula for this procedure is as follows:
Dppd = Ppost – Ppre
Ppost = the proportion of students answering the item correctly on the post-test
Ppre = the proportion of students answering the item correctly on the pre-test

The value of Dppd (discrimination based on pretest-posttest differences) can range from –1.00 to +1.00, with high positive values indicating items sensitive to good instruction. The second approach to item analysis for criterion-referenced measurement is to locate two different groups of students, one having received instruction and the other not. By assessing both groups and applying the item analysis described earlier, you can gain information on item quality more quickly and avoid reactive pre-tests; the disadvantage is that you are relying on human judgment, for instance that the two groups are truly comparable.
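A minimal Python sketch (not from the book) of the Dppd computation, using invented counts.

def d_ppd(pre_correct, post_correct, n_students):
    # Dppd = Ppost - Ppre, the pretest-posttest difference index.
    return post_correct / n_students - pre_correct / n_students

# Hypothetical item: 8 of 25 students correct before instruction, 21 after.
print(d_ppd(pre_correct=8, post_correct=21, n_students=25))  # 0.52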
Two strategies have been described: judgmental and empirical methods. Both can be applied to improve assessments. When using judgmental procedures, five criteria should be applied by the assessment's creator and by collegial judges (with brief descriptions of each criterion), along with a series of questions for students about the assessment. When using the more empirical methods, several different types of item-analysis and discrimination calculations can be applied using the various formulas. Ultimately the assessment creator needs to apply good techniques and be willing to spend some time improving their assessments.