Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - By W. James Popham
Educational Assessment – Mentor: Lorraine Lander
Final Project: Aligning Classroom Assessments to Content Standards
Assessments in the classroom today should have one main focus: they should be aligned with the content standards or curricular aims students are intended to learn. Content standards, or curricular aims, are the knowledge and skills teachers and educators want students to learn. When teachers plan for instruction they need to plan their assessments deliberately, before instruction begins, so that the curricular aims are properly aligned with the instruction; this allows instruction to be adjusted when curricular aims are not being mastered. The assessments also need to be valid and reliable so that the inferences made about students are supported by the results of those assessments. Popham suggests that using various types of assessments helps teachers sample the knowledge and skills students have acquired.
One very important assessment used in New York State is the 4th Grade NYS Math Assessment. It is one of many grade-level exams used to comply with No Child Left Behind legislation; Popham refers to this type of test as a high-stakes exam. The NYS 4th Grade assessment is a standards-based assessment that, according to Popham's definition, would be considered instructionally sensitive. It attempts to measure a manageable number of curricular aims, it attempts to clearly describe the skills and/or knowledge being assessed, and it reports students' performance in a way that benefits teachers and students instructionally, which according to Popham is essential. Popham would also agree that an instructionally sensitive assessment is one way to measure a teacher's instructional effectiveness. Since the exam is not created in the classroom, however, it may need to be carefully evaluated for assessment bias as well as item bias. Many districts in New York State are beginning to take advantage of this aspect of the assessment. In order to align and inform their instruction, a group of teachers in New York State are using benchmark testing three times a year, along with unit assessments and other formative assessment measures, to help inform their instruction and get a handle on what their students are really learning.
The process being used by a current group of New York State teachers is as follows. The first step is to align their instruction to the assessment they are creating. For the purpose of this paper I will use Grade 5 as the base grade. The teachers use the scores, as well as the Performance Indicators, from the March 4th Grade NYS Assessment as their base assessment. If a new or untested student arrives in the fall, they use only the score from the September 5th Grade Benchmark exam for alignment purposes. A Fall Benchmark Assessment is then developed. To develop this benchmark assessment the teachers gather the performance indicators from the NYS 4th Grade Assessment. Then, with the help of the fourth grade teachers along with the current fifth grade teachers, they look at the key curricular aims in 4th grade as well as 5th grade. The key curricular aims that these students should have learned and mastered in 4th grade are taken from the NYS 4th Grade Assessment. They then align these curricular aims with the aims expected in 5th grade to see where there are matches or similar curricular aims. Popham also recommends this type of alignment, but stresses not using too many content standards when trying to get a handle on content mastery. The teachers lay the curriculum maps of the two grade levels side by side and select the curricular aims that appear across both. A form is used to collect this information and simplify the process. Below is a sample of this form.
Sample Form:
Targeted Grade Level: _________
Projected Benchmark Testing Dates: __________________________________
Test Format: __________________________________
Targeted Content Strand (Curricular Aim): ___Measurement____________________________
Performance Indicator / Sample Test Question
4M1 – Select tools and units (customary and metric) appropriate for the length being measured.
5M1 – Use a ruler to measure to the nearest inch, ½ inch, ¼ inch, and 1/8 inch. Sample question: Measure the following line:
________________________
6M1 – Measure capacity and calculate volume of a rectangular prism.
Bands within the Content Strands:
Measurement: Units of measurement
Tools and Methods
Units
Error and Magnitude
Estimation
Once they have decided on the appropriate curricular aims, the teachers begin to look at test questions that align with the curricular aims they have chosen. To do this they use the bank of questions from previous NYS tests, or software they have available that is aligned with the chosen curricular aims. This practice also aligns with sound assessment practices recommended by Popham. Because the teachers are pulling questions from multiple sources, they are modeling generalized test-taking preparation, which prepares students for many different types of test items according to the test-preparation guidelines explained by Popham. Popham also notes that the more questions used, the better you are able to gauge students' mastery of skills and knowledge. There are some key questions the teachers use to ensure that they are designing this Benchmark Assessment accordingly. These key questions also align with many of Popham's recommendations for classroom assessment creation.
1. What will be the format for your test questions? Popham recommends keeping test items consistent in sequence and size. He also suggests not letting a problem run onto another page or force pages to be flipped during an assessment.
2. For each curricular aim selected, how many questions will we have? Popham would suggest an even distribution, for example two questions per curricular aim.
3. How many total questions will be on the test? Popham would recommend 50 questions or more for reliable results.
4. How many of each type of question will be included? Again, an even distribution is recommended by Popham.
5. How will we ensure that there is a range in the level of difficulty for the questions selected for each curricular aim? An item difficulty index might be used to determine this; a sketch of one appears after this list.
6. Will there be a time limit? If so, what will the time limit be?
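As a minimal sketch of the item difficulty index mentioned in question 5, the snippet below computes, for each item, the proportion of students who answered it correctly (the p-value). The function name and the response data are hypothetical, not from the text.

```python
# Minimal sketch (hypothetical data): an item difficulty index is the
# proportion of students answering an item correctly, which helps teachers
# choose a range of easier and harder items for each curricular aim.

def item_difficulty(responses):
    """responses: dict mapping item id -> list of 0/1 scores (1 = correct)."""
    return {item: sum(scores) / len(scores) for item, scores in responses.items()}

# Hypothetical results for two Measurement items across ten students.
responses = {
    "5M1_q1": [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],   # p = 0.8 -> relatively easy
    "5M1_q2": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],   # p = 0.3 -> relatively hard
}

for item, p in item_difficulty(responses).items():
    print(f"{item}: difficulty index = {p:.2f}")
```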
For this assessment the teachers will incorporate multiple-choice as well as extended-response or constructed-response questions. The NYS Math exam uses three question formats: multiple choice, short response, and extended response. Generalized mastery should be promoted for whatever is being taught, so varied assessment practices should also be employed whenever possible (Popham, 2008). Performance assessments, portfolio assessments, and affective assessments can also be used throughout the year to gauge where students are in relation to the curricular aims being assessed. By using these strategies the teachers are promoting educational defensibility as well as professional ethics: they are teaching to the curricular aims represented by the assessment, and while the test is important, it is not the single defining piece of learning for the teacher. The curricular aims addressed are also important to the grade-level curriculum, which is how they ensure educational defensibility, and by using older exam questions aligned to the curricular aims they intend to measure, they also uphold professional ethics. If the teachers incorporate some of the other assessment techniques throughout the year, they will be using varied assessment practices and preparing students for a variety of testing formats. Since the 4th Grade NYS Assessment is largely comprised of Number Sense and Operations questions, the teachers have chosen many of their questions from that standard. A couple of questions from the Algebra and the Statistics and Probability standards are also included, but fall instruction in 5th grade is based largely on the Number Sense and Operations curriculum. The test will have a total of 20 questions: 16 multiple choice and 4 constructed response. The students will be given an hour to complete the assessment. To make the assessment more reliable, the teachers might have chosen more questions and used a couple of questions per curricular aim being assessed.
Some final considerations suggested by the teachers before administering the assessment are described next.
1. Use a cover page, and write the directions clearly and in understandable wording for each section of the test, a practice also outlined in Popham's five general item-writing commandments (Popham, 2008).
2. The teachers suggest that if you are modeling your assessment after the NYS Math assessment, the directions should be modeled as well; Popham would suggest more generalized modeling in order to prepare students for the various test items they may encounter.
3. Prepare bubble sheets for the multiple-choice section. Clearly labeled sheets, directions on how to bubble the answer sheet, and directions on how to correct mis-bubbles or mistakes are very important.
4. Determine ahead of time the number of copies needed of assessments and answer sheets (if this is a grade-wide assessment, determine who will be in charge of copying).
5. Use testing modifications and accommodations for Special Education students. This ensures that these students are assessed in the same way they are instructed, so that valid inferences can be made and assessment bias minimized.
6. Set the date and time that the grade level or class will be given the exam, to ensure that students are administered the same exam at the same time and to prevent a reactive effect.
7. Establish the scoring scale and rubric well before the assessment is administered and before instruction occurs; having students participate in building the rubric is also recommended.
8. Prepare grading sheets, and determine ahead of time who will score the test, when it will be scored, and how the data will be analyzed and reported.
The fifth grade students are then given the assessment in early September. The multiple-choice portion is graded separately from the constructed-response portion; a group of teachers meets during a mutual meeting time to grade the constructed-response portion. A grid is created in Excel for all portions of the test. Down the left side of the grid are the students' names; along the top are the curricular aims addressed and the corresponding test question numbers. Along the bottom of the grid is a tally of the number of students who missed each item. Each question that is missed gets a check under the question and the curricular aim it corresponds to. All the items are tallied at the bottom, and off to the side a total-correct tally is also created for both parts of the assessment, along with a total score. (See the attached example for clarification.)
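As a minimal sketch of the Excel-style grid described above, the snippet below builds a small students-by-questions table with a miss tally at the bottom. The student names, question numbers, and curricular aim codes are hypothetical.

```python
# Minimal sketch (hypothetical data) of the analysis grid described above:
# rows are students, columns are test questions grouped by curricular aim,
# with a tally of the number of students missing each item at the bottom.

students = ["Ava", "Ben", "Cara"]
# question id -> curricular aim it assesses
question_aims = {"Q1": "5N1", "Q2": "5N1", "Q3": "5A4", "Q4": "5S5"}
# student -> set of questions answered incorrectly
missed = {"Ava": {"Q3"}, "Ben": {"Q1", "Q3"}, "Cara": {"Q3", "Q4"}}

header = ["Student"] + [f"{q} ({question_aims[q]})" for q in question_aims]
print("\t".join(header))
for s in students:
    row = [s] + [("X" if q in missed[s] else "") for q in question_aims]
    print("\t".join(row))

# Tally of misses per question -- heavily missed items become instructional targets.
tally = {q: sum(q in missed[s] for s in students) for q in question_aims}
print("Missed\t" + "\t".join(str(tally[q]) for q in question_aims))
```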
The items missed most often are considered not to have been mastered by those students. Since there are a number of items the group as a whole is struggling with, the teacher will use these items as key targets in upcoming instruction. Looking at this analysis grid, it might seem that these items should be thrown out; however, since the assessment is being used to guide instruction and the teachers are using it to see what was and was not learned, they remain valid questions. On the December benchmark these items would hopefully be less problematic and instead show the learning that has occurred. The same can be said for the items scored well on: in some assessment situations these might be items to throw out, but in this case they are items that have been mastered and show learning that has occurred. The teachers in this group might want to create groups to target instruction in the areas of weakness and use formative assessments to show mastery along the way. For the students who performed well, newer curricular aims can be used to begin instruction, as they have shown mastery and are ready to move forward; a pre-assessment should be created and administered to them to guide that instruction.
In December these teachers will create another assessment. This time the assessment will contain problems from the areas that were not mastered by the majority on the September assessment, as well as the new curricular aims students have since been introduced to, in order to see whether mastery has occurred and which areas still need to be targeted for instruction. The teachers use grouping models for math and will re-group students according to areas of mastery and weakness.
The process described above is very thorough and seems to cover the areas suggested by the author for creating assessments that are not only valid but also reliable. I was especially encouraged by the fact that these teachers get together over the summer and during the year to re-evaluate their assessments and continually inform their instruction. This is a practice that I feel all teachers should strive for and be encouraged to participate in. Many of the practices described by Popham for sound assessment building and administration were followed by these teachers, and in the areas where they were not, the fact that they are trying suggests that over time they will align more tightly with the practices Popham recommends. It was very exciting to attend this session and hear about the thoughtfulness that is occurring in testing practices within New York State. While some teachers may not agree with the approach these teachers used, they should strive for more collaboration and sharing. The promotion of shared learning and teaching will only benefit student learning and mastery. The forms and processes described by these teachers aligned very nicely with the text used for this study and were very helpful in pulling it all together.
Friday, June 20, 2008
Sunday, June 8, 2008
Chapter 15 – Evaluating Teaching and Grading Students
Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - By W. James Popham
Chapter 15 addresses evaluating teaching and grading students. These two topics, while sometimes used interchangeably, are separate functions: evaluation is an activity focused on determining the effectiveness of teachers, while grading is an activity focused on informing students how they are performing. Pre-instruction and post-instruction assessment practices are discussed, as well as the split-and-switch design for informing instruction. The use of standardized achievement tests for evaluating students and instruction is also weighed. Three schemes for grading are described, along with a more commonly used practice.
There are two types of evaluation used in appraising the instructional efforts of teachers. The first is formative evaluation, the appraisal of the teacher's instructional program for purposes of improving the program. The second is summative evaluation. Summative evaluation is not improvement focused; it is an appraisal of a teacher's competence used to make more permanent decisions about teachers, typically about continuation of employment or awarding of tenure. Summative evaluation is usually made by an administrator or supervisor. Teachers will typically do their own formative evaluation in order to improve their own instruction, and summative data may be supplied to administrators or supervisors to show the effectiveness of teaching.
Instructional impact can be gauged by pre-instruction and post-instruction assessment. Assessing students prior to instruction is pre-assessment; assessing after instruction has occurred is post-assessment, and the difference is an indication of the learning that has occurred. This scenario, however, can be reactive: students are sensitized by the pre-assessment to what needs to be learned and then perform well on the post-assessment as a result. An alternative to this problem is the split-and-switch design. This data-gathering design works best with larger groups of students. In this model you split your class and administer two similar tests, one to each half, as pretests. You then instruct the whole group, switch the tests between the halves, and administer them as posttests. Blind scoring should then occur; blind scoring means someone else grades the tests, such as another teacher, a parent, or an assistant. The results are then pooled for each test form. Differential difficulty between the forms causes no problem in this design, and students will not have previously seen their post-test, so reactivity is not a concern. As a result, instructional impact should be visible.
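As a minimal sketch of pooling results in a split-and-switch design, the snippet below compares pre- and post-instruction means form by form. The scores and variable names are hypothetical, invented only to illustrate the idea described above.

```python
# Minimal sketch (hypothetical scores) of a split-and-switch comparison:
# Form A and Form B are each used as a pretest with one half of the class
# and as a posttest with the other half, then compared form by form.

half1_pre_formA  = [8, 10, 9, 7]     # half 1 takes Form A before instruction
half2_pre_formB  = [9, 8, 10, 8]     # half 2 takes Form B before instruction
half2_post_formA = [14, 13, 15, 12]  # after instruction, the forms are switched
half1_post_formB = [13, 15, 14, 13]

def mean(scores):
    return sum(scores) / len(scores)

# Each comparison is within a single form, so differential difficulty between
# the two forms does not distort the picture of instructional impact.
print("Form A: pre", mean(half1_pre_formA), "-> post", mean(half2_post_formA))
print("Form B: pre", mean(half2_pre_formB), "-> post", mean(half1_post_formB))
```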
A common way of evaluating teaching has been through students' performance on standardized achievement tests. For most achievement tests, this is a very inappropriate way to evaluate instructional quality. A standardized test is any exam administered and scored in a predetermined, standard manner. There are two major forms of standardized tests: aptitude tests and achievement tests. Schools' effectiveness is typically judged on standardized achievement tests. There are three types of standardized achievement tests: the traditional national standardized achievement test, the standards-based achievement test that is instructionally insensitive, and the standards-based achievement test that is instructionally sensitive.
The purpose of nationally standardized achievement tests is to allow valid inferences to be made about the knowledge and skills a student possesses in a certain content area. These inferences are made by comparing the student with a norm group of students of the same age and grade. The dilemma is that there is so much that would need to be tested that only a small sampling is possible, and the consequence is an assumption that the norm group is a genuine representation of the nation at large. Even so, these tests should not be used to evaluate the quality of education; that is not their purpose. The tests are not likely to be aligned rigorously with a state's curricular aims, and items covering important content emphasized by the classroom teacher may be eliminated in a quest for score spread. The final reason nationally standardized achievement tests should not be used to evaluate teachers' success is that many items are linked to students' SES (socioeconomic status) or their inherited academic aptitude. In essence they measure what students bring to school, not what they learn at school.
Standards-based tests sound like they would make much more sense. Two problems that have occurred with instructionally insensitive standards-based tests are the large number of content standards needing to be addressed and reporting of results that has limited instructional value. If properly designed, however, standards-based tests can be instructionally sensitive. Three attributes must be present for a standards-based test to be instructionally sensitive: it must measure a manageable number of curricular aims, the skills and/or bodies of knowledge must be clearly described so that what students are to master is very clear, and the test results must allow clear identification of each assessed skill or body of knowledge mastered by a student. A standards-based test not possessing all three of these attributes is not instructionally sensitive and is therefore useless for this purpose. Instructionally sensitive standards-based tests are the right kind of tests to use to evaluate schools.
Teachers also need to inform students of how well they are doing and how well they have done, as a demonstration of what students have learned and the extent of their achievement. Serious thought should be given to identifying the factors to consider when grading and how much those factors will count. There are three common grade-giving approaches. The first is absolute grading, in which a grade is given based on the teacher's idea of what level of student performance is necessary to earn each grade; this method is similar to a criterion-referenced approach to assessment. The second is relative grading, a grade based on how students perform in relation to one another. This type of grading requires flexibility from class to class because the make-up of classes changes; it is close to a norm-referenced approach. The third option is aptitude-based grading, in which a grade is assigned to each student based on how well the student performs in relation to his or her potential. This form of grading tends to "level the playing field" by grading according to ability and encouraging full potential. Given these three options, researchers have found that teachers really use a more "hodgepodge" form of grading based loosely on judgments of students' assessed achievement, effort, attitude, in-class conduct, and growth; the result is that low performance in any of these areas produces a low grade for a student. There is no scientific, quantitative model for clear-cut grades using the "hodgepodge" method. It is largely judgmental, but it is widely used and accepted by teachers and students.
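As a minimal sketch of the contrast between absolute and relative grading described above, the snippet below grades the same hypothetical scores twice: once against fixed, teacher-chosen cutoffs and once by rank within the class. The cutoffs, names, and scores are all invented for illustration.

```python
# Minimal sketch (hypothetical cutoffs and scores) contrasting absolute grading
# (fixed performance levels) with relative grading (standing within the class).

scores = {"Ava": 92, "Ben": 78, "Cara": 85, "Dan": 64}

def absolute_grade(score):
    # Teacher-chosen performance levels, criterion-referenced in spirit.
    for cutoff, grade in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if score >= cutoff:
            return grade
    return "F"

def relative_grades(scores):
    # Grade by rank within this particular class, norm-referenced in spirit.
    ranked = sorted(scores, key=scores.get, reverse=True)
    letters = "ABCD"
    return {name: letters[min(i, len(letters) - 1)] for i, name in enumerate(ranked)}

rel = relative_grades(scores)
for name, score in scores.items():
    print(name, score, "absolute:", absolute_grade(score), "relative:", rel[name])
```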
The final chapter has described the distinction between evaluating and grading: evaluating the quality of teachers' instruction and grading students. Also discussed was the inappropriateness of using national standardized achievement tests to evaluate teachers, and the difference between instructionally insensitive and instructionally sensitive standards-based achievement tests. Grading was then discussed, along with the importance of developing criteria and weighting ahead of actually dispensing grades. Three grading options were described, and the reality of "hodgepodge" grading was presented.
Tuesday, May 27, 2008
Chapter 14 – Appropriate and Inappropriate Test-Preparation Practices
Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - By W. James Popham
Chapter 14 deals with appropriate and inappropriate test-preparation practices. This is a newly recognized concern resulting from public reporting on education and No Child Left Behind (NCLB) legislation. The pressure to increase test scores has led to improper and unethical test-preparation practices. There are steps that can be taken to ensure good testing practices and enhance student learning.
Educational achievement tests are administered to help teachers make good inferences about students' knowledge and/or skills in relation to the curricular aims being addressed. The achievement test should sample the curricular aim being addressed so that it serves as an indication of mastery or non-mastery. A student's mastery of a curricular aim and that student's score on an achievement test should tell the same story; when they do, we are seeing good test-preparation practice.
There are two guidelines that can be used to ensure appropriate test-preparation practices. The first is the professional ethics guideline: no test-preparation practice should violate the ethical norms of the education profession. State-imposed security procedures regarding high-stakes tests should never be compromised; breaching proper test practices destroys public confidence and can result in the loss of teaching credentials. The second is the educational defensibility guideline: no test-preparation practice should increase students' test scores without simultaneously increasing students' mastery of the curricular aim tested. This guideline emphasizes the importance of engaging in instructional practices that focus on students and their best interests.
There are five common test-preparation practices. These practices generally occur in the classroom, though sometimes special instruction needs to occur outside regular class time. The first practice, previous-form preparation, provides special instruction and practice based directly on an actual previously used form of the test. This practice violates educational defensibility, as test scores may be boosted without a rise in mastery of the curricular aims. The second practice is current-form preparation, which provides special instruction and practice based directly on the form of the test currently being used. This practice violates both the educational defensibility and professional ethics guidelines; any access to a current form before its official release is a form of stealing or cheating. The third practice is generalized test-taking preparation, which provides special instruction covering test-taking skills for dealing with a variety of achievement-test formats. Students learn how to make calculated guesses and how to use testing time judiciously. This is an appropriate form of test preparation: students are able to cope with different test types, are apt to be less intimidated by the various test items they may encounter, and their scores become a more accurate reflection of their true knowledge and skill.

The fourth practice is same-format preparation, which provides regular classroom instruction dealing directly with the content covered on the test, using practice items in the format of the actual test. The items are clones of the test items, and for many students they are indistinguishable from the actual test. While this practice may not be unethical, it is not educationally defensible: if students can only deal with the test's format, they are not prepared to show what they have learned, so test scores may rise without a corresponding demonstration of curricular aim mastery. The fifth practice is varied-format preparation, which provides regular classroom instruction dealing with the content covered on the test, but with practice items representing a variety of test-item formats. This practice satisfies both the ethical and the educational defensibility guidelines; the content of the curricular aims is practiced in a variety of formats, so gains should be seen in both test scores and curricular aim mastery.
A popular expression used today is "teaching to the test," and it can have a negative connotation. Negatively applied, the teacher directs instruction specifically at the items on the test itself, which of course is a form of bad instruction. A positive way to "teach to the test" is to aim instruction toward the curricular aims represented by the actual test. To avoid confusion about instructional practices, the author suggests not using the phrase "teaching to the test" at all, and instead saying "teaching to the curricular aim represented by the test."
Raising students' test scores on high-stakes tests is a common theme throughout public education. If teachers and administrators are being pressured to do this, they should only be expected to do so if they are provided curricular aims that are aligned with the assessment. Test items should be accompanied by descriptions of what they represent that are suitable for instructional planning. If curricular aim descriptions are not provided for test items, score boosting should not be expected.
This chapter presented appropriate and inappropriate test-preparation practices. Two guidelines were explained: professional ethics and educational defensibility. Five test-preparation practices were described, along with how the guidelines apply to them; varied-format and generalized test-taking preparation were the two sound practices that satisfy both guidelines. The phrase "teaching to the test" was described, both negatively and positively, and a new phrase was suggested: "teaching to the curricular aim." The idea that high-stakes tests should supply curricular aim descriptions so solid instructional decisions can occur was also discussed.
Wednesday, May 21, 2008
Chapter 13 – Making Sense Out of Standardized Test Scores
Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - By W. James Popham
Chapter 13 focuses on standardized test scores and making sense of them. There are a variety of ways to interpret standardized tests and their scores; depending upon what the scores are intended to measure, different interpretations can be used. Teachers need to understand how the various tests they administer are scored. By understanding how these tests are scored, teachers can inform their instruction and make sense of students' performance and what their scores mean.
There are two types of standardized tests: one is designed to yield norm-referenced inferences, the other criterion-referenced inferences. These tests are administered, scored, and interpreted in a standard, predetermined manner. National standardized achievement tests are typically used to provide norm-referenced interpretations focused on the measurement of aptitude or achievement. State-developed standardized achievement tests are used in many states for accountability purposes; some states use them to assess the basic skills of high school students and to allow or disallow a student to receive a diploma, even if curriculum requirements are met. Results are often publicized as an indicator of educators' effectiveness. State standardized tests are intended to produce criterion-referenced interpretations.
Test scores can be interpreted individually or as a group. Group-focused interpretations are necessary for looking at all your students or at groups of students, and there are three ways described to do this. The first is by computing central tendency, an index of the group's scores such as the mean or median. A raw score is the number of items answered correctly by a student. The median raw score is the midpoint of a set of scores; the mean raw score is the average of a set of scores. The mean and median show a center point for group scores. A second useful way to describe group scores is variability, or how spread out the scores are. A simple measure of the variability of a set of students' scores is the range, calculated by subtracting the lowest score from the highest score. A third way to look at group scores is the standard deviation, the average difference between the individual scores in a group and the mean of that set of scores. The larger the standard deviation, the more spread out the scores in the distribution. The formula for the standard deviation is as follows:
SD = √( Σ(x − M)² / N )

where x = each raw score, M = the mean of the scores, and N = the number of scores in the distribution.
The mean and standard deviation are the best ways to discuss and describe group scores. The median and range may be easier to compute, but they are not as reliable a way to look at scores, so the mean and standard deviation are preferred.
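As a minimal sketch of the group-score statistics just described, the snippet below computes the mean, median, range, and standard deviation for a hypothetical set of raw scores using Python's standard library.

```python
# Minimal sketch (hypothetical scores) of the group-score statistics described
# above: mean, median, range, and standard deviation for a set of raw scores.
import statistics

raw_scores = [12, 15, 15, 18, 20, 9, 14, 17]

mean = statistics.mean(raw_scores)
median = statistics.median(raw_scores)
score_range = max(raw_scores) - min(raw_scores)
# Population standard deviation, matching SD = sqrt( sum((x - M)^2) / N ).
sd = statistics.pstdev(raw_scores)

print(f"mean = {mean:.2f}, median = {median}, range = {score_range}, SD = {sd:.2f}")
```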
Individual test interpretations are also necessary. There are two ways to interpret individual test scores: in absolute or relative terms. An absolute inference is when we infer from a score what the student has or has not mastered of the skills and/or knowledge being assessed. A relative inference is when we infer from a score how the student stacks up against other students currently taking, or who have already taken, the test. The terms "below average" and "above average" are typically used in a relative inference.
There are three interpretive schemes to be considered for relative score interpretations. The first is percentiles, or percentile ranks. A percentile compares a student's score with those of other students in a norm group and indicates the percent of students in the norm group the student outperformed. The norm group is the group of students who took the test before it was published, in order to establish norms and help identify appropriate test items; there can be national norm groups and local norm groups. The second interpretive scheme is the grade-equivalent score, an indicator of student test performance based on grade levels and months of the school year. The purpose of grade-equivalent scores is to convert scores on a standardized assessment to an index reflecting a student's grade-level progress in school. It is a developmental score and is written as follows:
grade.month of the school year, for example 5.6 (fifth grade, sixth month)
Grade-equivalent scores are typically seen in reading and math. They are determined by administering the same test to several grade levels and establishing a trend line that reflects the raw-score increases at each grade level; points along the trend line are then estimated to indicate what the grade equivalent of a raw score would be. Many assumptions are made in this scoring scheme, making it rather questionable, and it can be misleading to parents about what it really means. The appropriate interpretation is that a grade-equivalent score is an estimate of how the average student at a certain grade level might score on the test. The third scoring scheme is the scale score. Scale scores are converted raw scores that use a new, arbitrarily chosen scale to represent levels of achievement or ability. The most popular scale-score systems are based on item response theory (IRT). Unlike raw-score reporting, IRT scales take into consideration the difficulty and other technical properties of every single item on the test, and there is a different average scale score for each grade level. Scale scores are used heavily to describe group test performance at the state, district, and school levels; they permit longitudinal tracking of students' progress and direct comparisons of classes, schools, and districts. It is very important to remember that not all scale scores are similar, so scores from different scale-score exams cannot be compared consistently. Standardized tests also use the normal curve equivalent (NCE), which attempts to use students' raw scores to arrive at a percentile; if students' scores were perfectly symmetrical, a bell curve would be formed, but sometimes the normal curve does not form and the meaning of the NCE evaporates. Therefore NCEs were not a solution for comparing different standardized tests. A stanine is like an NCE, but it divides a score distribution into nine segments that, though equal along the baseline of a set of scores, contain different proportions of the distribution's scores. Stanines are approximate scale scores.
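As a minimal sketch of the percentile rank idea described above, the snippet below computes the percent of a hypothetical norm group that a given raw score outperforms. The norm-group scores and the function name are invented for illustration.

```python
# Minimal sketch (hypothetical norm group) of a percentile rank: the percent of
# students in the norm group whose raw scores the student outperformed.

norm_group = [22, 25, 27, 30, 31, 33, 35, 38, 40, 44]  # norm-group raw scores

def percentile_rank(raw_score, norm_group):
    below = sum(1 for s in norm_group if s < raw_score)
    return 100 * below / len(norm_group)

print(percentile_rank(36, norm_group))  # -> 70.0: outperformed 70% of the norm group
```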
There are two tests used to predict a high school student's academic success in college. The first is known as the SAT, or Scholastic Aptitude Test. Its original function was to assist admissions officials at a group of elite Northeastern universities in determining whom to admit, and the test was designed to compare inherited verbal, quantitative, and spatial aptitudes. Today, however, it is divided into three sections: critical reading, writing, and mathematics. The SAT uses a score range of 200 to 800 for each of the three sections, so as of 2005 the highest total score that can be earned is 2400, the sum of all three sections. The test takes about three hours to administer, and there is also a PSAT to help students prepare for the SAT. The second test is the ACT, also an aptitude test. Unlike the SAT, the ACT, or American College Test, was created as a measure of educational development for soldiers taking advantage of the GI Bill money awarded to them for college; the SAT was sometimes too difficult or inappropriate, so a new measure was needed. It addresses four content areas: English, Mathematics, Reading, and Science. Like the SAT, the ACT takes about three hours to administer. The ACT is scored by giving one point for every correct answer, with no subtraction of points for wrong answers; an average is then computed across the four sections, unlike the SAT, where section scores are added together. One very important point about the SAT and ACT is that only about 25% of academic success in college is associated with a high school student's performance on these tests; the other 75% has to do with non-test factors. Therefore, students who do not do well on these tests should not be discouraged from attending college.
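As a small worked example of the scoring contrast just described, the snippet below adds hypothetical SAT section scores into a total and averages hypothetical ACT section scores into a composite. The section scores are invented for illustration.

```python
# Minimal sketch (hypothetical section scores) of the scoring contrast above:
# SAT section scores are added together, ACT section scores are averaged.

sat_sections = {"critical reading": 610, "writing": 580, "mathematics": 640}
act_sections = {"English": 24, "Mathematics": 27, "Reading": 25, "Science": 26}

sat_total = sum(sat_sections.values())                       # out of 2400
act_composite = round(sum(act_sections.values()) / len(act_sections))

print("SAT total:", sat_total)           # 1830
print("ACT composite:", act_composite)   # 26
```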
Standardized tests are assessment instruments that are administered, scored, and interpreted in a predetermined, standard format. They are used to get a handle on students' achievement and aptitude. Group scores are described in two ways: central tendency, using the mean and median, and variability, using the range and standard deviation. Interpreting results using percentiles, grade-equivalent scores, scale scores, stanines, and normal curve equivalents was also explained, along with the strengths and weaknesses of each. The SAT and ACT were also described and explained.
Classroom Assessment – What Teachers Need to Know - By W. James Popham
Chapter 13 focuses on standardized test scores and making sense out of these scores. There are a variety of ways to interpret standardized tests and their scores. Depending upon what the scores are intending to measure a variety of interpretations can be used. Teachers need to understand how to score and assess the various tests administered. By understanding how these tests are scored, teachers can inform their instruction, as well as, making sense of students’ performance on these tests and what the scores mean.
There are two types of standardized tests; one is designed to yield norm-referenced inferences the other criterion-referenced inferences. These tests are administered, scored and interpreted in a standard pre-determined manner. National standardized achievement tests are typically used to provide normed-referenced interpretations focused on the measurement of aptitude or achievement. State developed standardized achievement tests are used in many states for accountability purposes. Some states use them to assess basic skills of high school students to advance or allow or disallow a student to receive a diploma, even if curriculum requirements are met. Results are often publicized and try to indicate educators’ effectiveness. State standardized tests are intended to produce criterion-referenced interpretations.
Test scores can be interpreted individually or as a group. Group-focused interpretations are necessary for looking at all your students or groups of students. There are three ways described to do this. The first is by computing central tendency. The central tendency is an index of the groups’ scores, such as mean or median. A raw score is used to show the number of items answered correctly by a student. The median raw score is the midpoint of a set of scores. The mean raw score is the average of a set of scores. The mean and median show a center point for group scores. A second useful way to describe group scores is variability. Variability is how spread out the scores are. A simple measure of variability of a set of students scores is a range. A range is calculated easily by subtracting the lowest score and the highest score. A third way to look at group scores is the standard deviation. The standard deviation is the average difference between the individual scores in a group of scores and the mean of that set of scores. The larger the size of the standard deviation the more spread out the scores in the distribution. The formula for standard deviation is as follows:
SD = √∑(x-M)2
N
∑ (x-M)2 = Sum of the squared raw scores (x) – M (mean)
N = number of scores in the distribution
The mean and standard deviation are the best ways to discuss and describe group scores. It may however, be easier to compute median and range. However, this is not a very reliable way to always look at scores so the standard deviation method is much more reliable.
Individual test interpretations are also necessary. There are two ways to interpret individual test scores. They are interpreted in absolute or relative terms. An absolute inference is when we infer from a score what the student has mastered or not mastered and the skills and/or knowledge being assessed. A relative inference is when we infer from a score, how the student stacks up against other students currently taking the test, or already taken the test. The terms below average and above average are typically used in a relative inference.
There are three interpretative schemes to be considered from relative score-interpretations. The first scheme is percentiles or percentile ranks. A percentile compares a student’s score with those of other students in a norm group. The percentile indicates the percent of students in the norm group the student outperformed. The norm group is the students who took the test before it was published, to establish a norm group and help identify the test items that are appropriate. There are also different types of norm groups. There can be national norm groups and local norm groups. The second interpretive scheme is called grade-equivalent scores. A grade-equivalent is an indicator of student test performance based on grade levels and months of the school year. The purpose of grade-equivalent scores is to convert scores on a standardized assessment to an index score reflecting a student’s grade level progress in school. This score is also a developmental score and is indicated as follows:
Grade.month of school year
5.6
Grade equivalent scores are typically seen in reading and math. Grade-equivalent scores are determined by administering the same test to several grade levels establishing a trend-line which reflects the raw score increases at each grade level. Estimates at points along the trend line are established indicating what the grade-equivalent of the raw score would be. There are many assumptions made in this scoring theme making it a rather questionable scoring theme. It also can be misleading to parents and what it really translates to. The appropriate assumption is to say the grade equivalent score is an estimate of how the average student taking the test at a certain grade might score. The third scoring scheme is scale score interpretations. Scale scores are converted raw scores that use a new arbitrary chosen scale to represent levels of achievement or ability. The most popular scale score system is an item-response theory (IRT). This is different from a raw score reporting system. The difference is that IRT scales take into consideration the difficulty and other technical properties of every single item on the test. There is a different average scale score for each grade level. Scale scores are used heavily to describe group test performances at the state, district, and school levels. Scale scores can be used to permit longitudinal tracking of students’ progress and making direct comparisons of classes, schools and districts. It is very important to remember that, not all scale scores are similar and therefore, can’t be compared consistently on different scales score exams. Standardized tests use normal curve equivalent or NCE to attempt to use students’ raw scores to arrive at a percentile for a raw score, if the students’ scores were perfectly symmetrical a bell curve would be formed, however; sometimes the normal curve does not form and the NCE evaporates. Therefore, NCE’s were not a solution for comparing different standardized tests stanine’s like an NCE but it divides a score distribution into nine segments that though equal along the baseline of a set of scores contain different proportions of the distribution scores. Stanines are approximate scale scores.
There are two tests used to predict a high school student’s academic success in college. The first is the SAT, or Scholastic Aptitude Test. Its original function was to help admissions officials at a group of elite Northeastern universities decide whom to admit, and it was designed to compare inherited verbal, quantitative, and spatial aptitudes. Today, however, it is divided into three sections: critical reading, writing, and mathematics. The SAT uses a score range of 200 to 800 for each section, so as of 2005 the highest possible score, totaled across all three sections, is 2400. The test takes about three hours to administer, and there is also a PSAT to help students prepare for it. The second test is the ACT, which is also an aptitude test. Unlike the SAT, the ACT, or American College Test, was created to measure the educational development of soldiers taking advantage of GI money being awarded to them for college, because the SAT was sometimes too difficult or inappropriate for them. The ACT addresses four content areas: English, mathematics, reading, and science. Like the SAT, it takes about three hours to administer. The ACT is scored by giving one point for every correct answer with no subtraction for wrong answers, and then an average is computed across the four sections, unlike the SAT, where section scores are added together. One very important caveat about the SAT and ACT is that only about 25% of academic success in college is associated with a high school student’s performance on these tests; the other 75% has to do with non-test factors. Therefore, students who do not do well on these tests should not be discouraged from attending college.
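As a rough illustration of the scoring difference described above, the sketch below (hypothetical section scores, not real test data) contrasts the SAT’s summing of section scores with the ACT’s averaging of them:

```python
# Illustrative sketch of the aggregation difference: SAT sections are summed,
# ACT sections are averaged. All scores below are hypothetical.

sat_sections = {"critical_reading": 640, "writing": 600, "mathematics": 700}
sat_total = sum(sat_sections.values())  # SAT: add the sections (200-800 each, 2400 maximum)

act_sections = {"english": 24, "mathematics": 27, "reading": 25, "science": 26}
act_composite = round(sum(act_sections.values()) / len(act_sections))  # ACT: average the sections

print(sat_total)      # 1940
print(act_composite)  # 26 (the mean of 24, 27, 25, 26 is 25.5, which rounds to 26)
```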
Standardized tests are assessment instruments that are administered, scored, and interpreted in a predetermined, standard format. They are used to get a handle on students’ achievement and aptitude. Test scores are described in two ways: central tendency, using the mean and median, and variability, using the range and standard deviation. Interpreting results through percentiles, grade-equivalent scores, scale scores, stanines, and normal curve equivalents was also explained, along with the strengths and weaknesses of each. The SAT and ACT were also described and explained.
Saturday, May 10, 2008
Chapter 12 – Instructionally Oriented Assessment
Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - By W. James Popham
Chapter 12 discusses how classroom instruction can be guided by assessment. Two strategies are discussed: making instructional decisions in light of assessment results, and planning instruction to achieve the curricular aim(s) represented by an assessment.
The first strategy, making instructional decisions in light of assessment results, is familiar to most teachers; it is just not widely used. Since teachers are typically in the practice of using assessments to assign grades, they don’t always see them as tools for making instructional decisions. Teachers use tests to find out the level of knowledge learned. The curricular aims tested could be cognitive, affective, or psychomotor, each of these areas being quite substantial in scope. Sampling these curricular aims through assessment can provide inferences about student status, and those inferences can then be used to make instructional decisions. Other types of tests can be used to help determine grades, but assessments written to determine levels of knowledge should be used to inform instruction only.
Assessments used to inform instruction should support three categories of decisions. The first is what to teach; this can take the form of a pre-assessment given before instruction for a specific objective. The second decision category is how long to keep teaching toward a particular objective; this can be assessed during instruction, and the resulting formative assessment can be used to decide whether to continue or cease instruction on an objective, for a student or for the whole class. The third decision category is how effective an instructional lesson or unit was; this can be assessed by comparing students’ pretest and posttest results, and the decision to retain, discard, or modify a given lesson or unit can be based on it.
The second assessment-based strategy for improving classroom instruction is planning instruction to achieve the curricular aim(s) represented by an assessment. This strategy is a rethinking of educational assessment. For testing to influence teaching, tests should be constructed prior to instructional planning; planned instructional units can then better coincide with the content of the test, and the test can inform instructional planning. In this model the teacher starts with the instructional objectives set forth in the curriculum, then creates assessments based on those goals, and afterward uses the pre-assessment to plan instructional activities intended to promote student mastery of the knowledge, skills, and/or attitudes to be post-assessed. Curriculum should always be the starting point. The assessment then clarifies the instructional aims and whether the intended skills have been mastered; teachers should therefore never teach toward the test items themselves.
There are three benefits to this strategy. The first is more accurate task analysis: since you have a clearer idea of the results you are after, you can better identify the enabling knowledge and skills students need before they can master what is taught. Second, more on-target practice activities can be used: with a better sense of your end-of-unit outcomes, you can choose guided-practice and independent-practice activities more aligned with the targeted outcomes. The third benefit is more lucid expositions: because you understand more clearly what needs to be assessed at the conclusion of instruction, you can give students clearer explanations about the content and where instruction is heading.
The idea of assessment for learning has also evolved over time. In the past, teachers used assessments mostly to determine what students know and don’t know; this is known as assessment of learning. While assessment of learning is important and should be utilized, assessment for learning should be utilized as well, and even more so. It has been shown that students who are given assessments for learning were able to achieve in six to seven months what it took other students a full year to achieve. Five strategies are suggested for implementing an assessment-for-learning sequence: first, clarify and share learning intentions and criteria for success; second, engineer effective classroom discussions, questions, and learning tasks; third, provide feedback that moves learners forward; fourth, activate students as owners of their own learning; and fifth, activate students as instructional resources for one another. Again, this is a huge shift for many teachers, but the learning benefits for students are tremendous. Formative assessments can also be used to help students achieve higher scores, especially on summative assessments.
The ideas of this chapter were really about improving instructional decisions and instruction based on information gained from assessments. Two strategies were described: making instructional decisions in light of assessment results and planning instruction to achieve the curricular aim(s) represented by an assessment.
Saturday, May 3, 2008
Chapter 11 – Improving Teacher-Developed Assessments
Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - By W. James Popham
Chapter 11 provides procedures to improve teacher-developed assessments. Two general improvement strategies are discussed: judgmental item improvement and empirical item improvement. Teachers need to set aside adequate time and be motivated to apply either of these strategies.
The first improvement strategy is judgmentally based improvement procedures. These judgments are provided by yourself, your colleagues, and your students, and each of the three judges follows different procedures. The first judge is you, judging your own assessments. The five criteria for judging your own assessments are: first, adherence to item-specific guidelines and general item-writing commandments; second, contribution to the score-based inference; third, accuracy of content; fourth, absence of content lacunae, or gaps; and fifth, fairness. The second judge is a collegial judge, who is given the same five criteria along with a brief description of each. The third judge is the students. Students are often overlooked as a source of test analysis, but who is better positioned or more experienced than the test takers? There are five improvement questions students can be asked after an assessment is administered: first, did any of the items seem confusing, and which ones were they? Second, did any items have more than one correct answer, and if so, which ones? Third, did any items have no correct answer, and if so, which ones? Fourth, were there words in any items that confused you, and if so, which ones? Fifth, were the directions for the test, or for particular subsections, unclear, and if so, which ones? These questions can be altered slightly for constructed-response assessments, performance assessments, or portfolio assessments. Ultimately the teacher needs to be the final decision maker on any needed changes.
The second improvement strategy is known as empirically based improvement procedures. This strategy is based on the empirical data supplied when students respond to teacher-developed assessments. A variety of well-honed techniques involving simple counts or formula calculations have been used over the years. The first technique is the difficulty index, also referred to as the p-value, a useful index of item quality. The formula looks as follows:
Difficulty: p = R / T, where R = the number of students responding correctly to the item and T = the total number of students responding to the item.
A p-value can range from 0 to 1.00. Higher p-values indicate items that more students answered correctly, while an item with a lower p-value is one most students missed. The p-value should be viewed in relation to the student’s chance probability of a correct response; for example, on a four-option multiple-choice item, a .25 p-value is expected by chance alone. The actual difficulty of an item should also be tied to the instructional program, so a high p-value does not necessarily mean the item was too easy; it may mean the content was taught well.
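A minimal sketch of the p = R / T calculation, using hypothetical item responses (1 = correct, 0 = incorrect):

```python
# Illustrative sketch of the difficulty (p-value) formula above: p = R / T.

responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # ten hypothetical students on one item

R = sum(responses)   # number of students answering correctly
T = len(responses)   # total number of students responding
p_value = R / T
print(p_value)       # 0.7 -> a relatively easy (or well-taught) item
```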
The second empirically based technique is item-discrimination indices. An item-discrimination index typically tells how frequently an item is answered correctly by those who perform well on the total test. It reflects the relationship between students’ performance on the total test and their responses to a particular item. The author uses a correlation coefficient between students’ total scores and their performance on a particular item, also referred to as a point-biserial correlation. A positively discriminating item is answered correctly more often by those who score well on the total test than by those who score poorly. A negatively discriminating item is answered correctly more often by those who score poorly on the total test than by those who score well. A nondiscriminating item is one for which there is no appreciable difference in correct responses between those who score well and those who score poorly on the total test. Four steps can be applied to compute an item’s discrimination index: first, order the test papers from high to low by total score; second, divide the papers evenly into a high group and a low group; third, calculate a p-value for each group; and lastly, subtract the low group’s p-value (pl) from the high group’s p-value (ph) to obtain the item’s discrimination index (D). For example:
D = ph – pl. The following chart can also be created:
Positive discriminator: high group > low group
Negative discriminator: high group < low group
Nondiscriminator: high group = low group
Negatively discriminating items are a sign that something about the item is wrong, because the item is being missed more often by students who are performing well on the test overall than by students who are not doing well. Ebel and Frisbie (1991) offer the following guidelines.
.40 and above = very good items
.30 to .39 = reasonably good items, possibly needing improvement
.20 to .29 = marginal items, usually needing improvement
.19 and below = poor items, to be removed or revised
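Putting the four steps above together, here is a minimal sketch of the discrimination computation, assuming hypothetical test papers given as (total score, item correct) pairs:

```python
# Illustrative sketch of the four-step discrimination procedure (hypothetical data).

def discrimination_index(papers):
    """papers: list of (total_score, item_correct) tuples; returns D = p_h - p_l."""
    ranked = sorted(papers, key=lambda x: x[0], reverse=True)  # step 1: order high to low
    half = len(ranked) // 2
    high, low = ranked[:half], ranked[half:]                   # step 2: split into high and low groups
    p_h = sum(c for _, c in high) / len(high)                  # step 3: p-value for each group
    p_l = sum(c for _, c in low) / len(low)
    return p_h - p_l                                           # step 4: D = p_h - p_l

papers = [(95, 1), (88, 1), (82, 1), (75, 0), (70, 1), (64, 0), (58, 0), (50, 0)]
print(discrimination_index(papers))  # 0.5: high group p = .75, low group p = .25
```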
The third technique offered for empirically based improvement is distracter analysis. A distracter analysis looks at how the high and low groups are responding to an item’s distracters, and it is used to dig deeper into items whose p-values or discrimination indices suggest revision is needed. A table like the following should be set up for this technique:
Test item (p = .50, D = –.33)
Alternatives:         A    B*   C    D    Omit
Upper group (n = 15): 2    5    0    8    0
Lower group (n = 15): 4    10   0    0    1
In this example the correct response is B (marked with an asterisk). Notice that students doing well on the test are choosing D, students not doing well are choosing B, and no one is choosing C at all. As a result of this analysis, alternative C should be changed to make it more appealing, and the item should be examined to see what about B is leading lower-scoring students to choose it while higher-scoring students do not.
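A distracter analysis table like the one above can be tallied with a short sketch such as this one (the response lists are hypothetical, chosen to reproduce the counts shown):

```python
# Illustrative sketch: counting each group's choices per alternative for one item.
from collections import Counter

upper_choices = ["D", "B", "A", "D", "B", "D", "B", "D", "B", "D", "A", "D", "B", "D", "D"]
lower_choices = ["B", "B", "A", "B", "B", "A", "B", "B", "A", "B", "B", "A", "B", "B", None]  # None = omit

upper = Counter(upper_choices)
lower = Counter(lower_choices)
for alt in ["A", "B", "C", "D", None]:
    label = "Omit" if alt is None else alt
    print(f"{label:>4}: upper {upper[alt]:>2}  lower {lower[alt]:>2}")
```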
Criterion-referenced measurements can also use a form of item analysis, to avoid the low discrimination indices that traditional item analysis often yields for them. The first approach requires administering the assessment to the same group of students before and after instruction. One disadvantage is that the item analysis cannot occur until both assessments have been given and instruction has occurred. Pretests may also be reactive, sensitizing students to certain items so that posttest performance reflects the pretest plus instruction, not instruction alone. The formula for this procedure looks like the following:
Dppd = Ppost – Ppre
where Ppost = the proportion of students answering the item correctly on the posttest, and Ppre = the proportion answering it correctly on the pretest.
The value of Dppd (discrimination based on pretest-posttest differences) can range from -1.00 to +1.00, with high positive values indicating effective instruction. The second approach to item analysis for criterion-referenced measurement is to locate two different groups of students, one that has received instruction and one that has not. By assessing both groups and applying the item analysis described earlier, you can gain information on item quality more quickly and avoid reactive pretests. The disadvantage is that you are relying on human judgment.
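A minimal sketch of the Dppd calculation, using hypothetical pretest and posttest responses for the same ten students:

```python
# Illustrative sketch of D_ppd = P_post - P_pre (hypothetical item responses).

def proportion_correct(responses):
    return sum(responses) / len(responses)

pre_responses  = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]  # same students before instruction
post_responses = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]  # and after instruction

d_ppd = proportion_correct(post_responses) - proportion_correct(pre_responses)
print(round(d_ppd, 2))  # 0.5 (= 0.8 - 0.3), a large positive value suggesting effective instruction
```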
Two strategies have been described: judgmental and empirical methods. Both can be applied to improve assessments. When using judgmental procedures, five criteria should be used by the assessment’s creator and by a collegial judge (with brief descriptions), along with a series of questions for students about the assessment. When using the more empirical methods, several types of item analysis and discrimination indices can be computed by applying various formulas. Ultimately the assessment creator needs to apply good techniques and be willing to spend some time improving their assessments.
Sunday, April 27, 2008
Chapter 10 – Affective Assessments
Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - By W. James Popham
Chapter 10 discusses affect, its relevance, and the importance of assessing it. Affective assessments could allow teachers to write goals and focus whole-group instruction on students' affect, especially when individual students are not shifting their affective status, and to better monitor and predict future behaviors. Likert inventories and multifocus affective inventories are described, along with a five-step process for creating multifocus inventories and suggestions for when and how to assess.
The first question asked is why assess affect at all. Many teachers feel that they only need to address students’ skills and knowledge, and assessing affect is often dismissed as unimportant or as something the classroom does not influence. The author stresses, however, that affective variables are at least as significant as cognitive variables, if not more so, and that this belief needs to change. Many times teachers are totally unaware of a student’s attitudes, interests, and values. If a teacher were to become more aware of these things, especially in the early years of education, there would be opportunities to better adjust a student’s affect toward instruction. By continually monitoring affect from year to year, students would be less likely to form negative attitudes and behaviors toward school, especially if those attitudes can be detected and influenced more positively.
The other side of this argument comes from vocal groups who want only traditional cognitive education offered in public schools, arguing that affect should be the job of family and church. The real problem here is focus: there needs to be broad agreement to focus affective efforts on specific areas, such as “the promotion of students’ positive attitudes towards learning as an affective aspiration.” One would hope that everyone could agree to support that kind of affective learning. It is therefore very important to have a clear focus on affect and its relevance to learning.
There are some specific variables that could be assessed to promote affect universally: attitudes, interests, and values. Students who feel good about learning are more likely to continue learning if that feeling is continually promoted. Some potential attitude targets are positive attitudes toward learning, toward self, toward self as a learner, and toward those who differ from us. Another target is students’ interests, such as subject-related interests, interest in reading, and interest in emerging technologies. The third target is values. While there are values some feel schools should have no part in, some are generally non-controversial, such as honesty, integrity, justice, and freedom. The goal is not to assess too many variables but to target a few important ones to promote positive future affect in students.
There are a few different ways affect can be assessed in a classroom, and the easier the assessment is to carry out, the more likely a teacher is to use it successfully. The best way to assess affect may be to ask students to complete a self-report inventory, such as a Likert inventory: a series of statements that students respond to by indicating agreement or disagreement. There are about eight steps in building a good Likert inventory. The first step is to choose the affective variable you want to assess, determining whether it is an attitude, an interest, or a value. Next, generate a series of favorable and unfavorable statements regarding that variable, trying to use equal numbers of positive and negative statements. The third step is to have several people classify each statement as positive or negative, throwing out any that aren’t agreed upon. The fourth step is to decide on the number and phrasing of the response options for each statement; typically a Likert scale uses five options: SD = strongly disagree, D = disagree, NS = not sure, A = agree, SA = strongly agree. Younger students may benefit from fewer choices. The fifth step is to prepare the self-report inventory, giving students directions on how to respond and stipulating that the inventory must be completed anonymously; clear directions and a sample statement are important to a good Likert assessment. The sixth step is to administer the inventory, either to your own students or, if possible, to another teacher’s class, so that improvements can be made before you administer it to your own students or before the next administration. The seventh step is to score the inventories. Scoring should be clearly addressed in the directions and should match the number of response options, with positive and negative statements weighted consistently; for example, with five options, responses in the affect-positive direction (such as SA on a positive statement or SD on a negative one) earn the most points, with progressively fewer points toward the opposite end of the scale, so that higher totals reflect more positive affect. The final, eighth step is to identify and eliminate statements that fail to function in accord with the other statements; this can be done by computing a correlation coefficient, removing statements whose responses are inconsistent, and re-scoring the inventory without them. This process is referred to as Likert’s criterion of internal consistency. Since there are many steps to a Likert inventory, which may be discouraging, a teacher could eliminate some of them.
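As an illustration of the scoring step, the sketch below scores a short Likert inventory under the common assumption of a 5-to-1 point weighting, with negative statements reverse-scored so that higher totals reflect more positive affect (the statements, polarities, and responses are hypothetical):

```python
# Illustrative sketch of Likert scoring with reverse-scoring of negative statements.

POINTS = {"SA": 5, "A": 4, "NS": 3, "D": 2, "SD": 1}  # assumed 5-to-1 weighting

def score_inventory(responses, polarity):
    """responses: list of SA/A/NS/D/SD strings; polarity: '+' or '-' for each statement."""
    total = 0
    for answer, sign in zip(responses, polarity):
        pts = POINTS[answer]
        total += pts if sign == "+" else 6 - pts  # reverse-score negatively worded statements
    return total

# One anonymous student's answers to a six-statement inventory (three positive, three negative)
print(score_inventory(["SA", "D", "A", "SD", "NS", "A"], ["+", "-", "+", "-", "+", "-"]))  # 23
```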
A second type of inventory, which collects information about a number of students’ affective dispositions at once, is the multifocus affective inventory. Whereas a Likert inventory devotes 10 to 20 items to a single affective area, a multifocus inventory uses fewer items per variable and so can cover more areas. There are five steps to creating a multifocus affective inventory. The first step is to select the affective variables to measure; again, you will need to identify the educationally significant variables. The second step is to determine how many items to allocate to each affective variable; it is important to include equal numbers of positive and negative statements, so each variable should get items in pairs, one positive and one negative, with any additional items added in equal increments. The third step is creating a series of positive and negative statements related to each affective variable; statements need to be designed to elicit differential responses from students, and statements for the same variable should be scattered through the inventory rather than grouped together. The fourth step is to determine the number and phrasing of students’ response options; traditional Likert response options can be used for multifocus assessments as well. The fifth and final step is to create clear directions for the inventory and an appropriate presentation format; it is important to include lucid directions about how to respond, at least one sample item, a request for anonymous, honest responses, and a reminder that there are no right or wrong answers. These inventories are scored just like Likert assessments. The purpose of a multifocus assessment is to gather inferential data on student affect with fewer statements per variable.
Affect can be assessed in systematic ways that allow a teacher to make instructional decisions about students’ current and future affect. Group-focused inferences are the best use of affective assessments; individual inferences should be avoided. Attitudes, interests, and values are variables that can be looked at universally when measuring affect. Self-report assessments such as Likert inventories or multifocus inventories can be created and used to assess affect.
Sunday, April 20, 2008
Chapter 9 – Portfolio Assessments
Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - W. James Popham
Chapter 9 discusses portfolio assessments. The chapter shows how portfolio assessments are more effective in the classroom than as a standardized, large-scale testing tool. Also discussed is the importance of self-assessment when using portfolios, along with seven steps a classroom teacher should implement and three functions a teacher should consider before adopting portfolio assessment. As with most assessment tools, there are also positives and negatives to weigh.
The definition of a portfolio is a “systematic collection of one’s work.” In the educational setting we can think of a portfolio as a collection of a student’s work. While portfolios have been used by many professions over the years, they are somewhat new to education and have been embraced by educators who are not keen on standardized tests. However, efforts to employ portfolios in large-scale accountability programs have not been very encouraging, largely because of the cost of trained scorers and centralized scoring. In states where classroom teachers do the scoring, reliability has become an issue, due to improper or nonexistent training and concerns about bias toward one’s own students. In general, accountability testing may not be the best use of this form of assessment.
Using portfolio assessments in the classroom, however, may be more realistic. The author suggests a seven-step sequence for making portfolio assessment successful in the classroom. The first step is to make sure your students “own” their portfolios; students need to understand that the portfolio is a collection of their work, not just a project to earn a grade. The second step is to decide what kinds of samples to collect; the teacher and student should decide collaboratively what work to collect, and a wide variety is recommended. The third step is to collect and store work samples; how and where work will be stored should be planned out with students. The fourth step is to select criteria by which to evaluate the portfolio work samples; again, the teacher and student need to carefully plan the evaluative criteria, and students need to clearly understand the rubric’s criteria. The fifth step is to require students to evaluate their own portfolio products continually, for example by providing written self-evaluations of their work on index cards on an ongoing basis. The sixth step is to schedule and conduct portfolio conferences; this takes time but is very important for both teacher and student, and it helps students make good self-evaluative comments. The seventh step is to involve parents in the portfolio assessment process; parents should receive information about expectations, should be encouraged to review their child’s portfolio from time to time, and can even be involved in the self-assessments and reviews. These activities encourage and demonstrate the value of portfolio assessment in the classroom.
Portfolio specialists identify several main purposes for portfolios. The first is documentation of student progress: the major function is to provide the teacher, student, and parents with evidence of growth. This is typically known as a working portfolio, and student self-evaluations are useful tools for this purpose. Student achievement levels should also influence instructional decisions, so the information should be collected and assessed as close to the marking terms as possible. The second purpose is to provide an opportunity for showcasing student accomplishments. Assessment author Robert Stiggins refers to these as celebration portfolios and encourages them especially in the early grades; showcase portfolios should contain a selection of the student’s best work, and a thoughtful reflection by the student on its quality is essential. The third purpose is evaluation of student status, determining how the student’s work measures up against previously established evaluative criteria. Standardization of how portfolios are appraised is important for this purpose, such as a pre-established rubric with clear examples for the student to follow. These three purposes show why it is important for a classroom teacher to first decide the primary purpose of the portfolios and then determine how they should look and be prepared by students. A portfolio assessment should have one priority or purpose; a single portfolio cannot satisfy all three functions at once.
There are pros and cons of portfolio assessment, as there are with all forms of assessment. Its greatest strength is its ability to be tailored to a student’s needs, interests, and abilities. Portfolios also show students’ growth and learning, providing a way to document and evaluate growth in the classroom that standardized or written tests cannot. Self-evaluation is also fostered, which guides student learning over time, and students experience personal ownership of their work and the progress they make. There are also some cons. The time factor sometimes makes it difficult to maintain consistent evaluations and to create appropriate scoring guides, and a great deal of time is needed to properly create and review portfolios. The biggest problem can be the lack of proper training in carrying out portfolio assessments.
Classroom teachers really need to understand that portfolio assessment is not a one-time measurement approach for addressing short-term objectives. Portfolio assessments should be used for a big goal addressed throughout the year, and self-evaluation should be nurtured along the way. Teachers should pick one core area in which to use portfolios rather than trying to implement them the same way for every subject.
There are many good uses for portfolio assessments. While they should not be used in place of standardized testing or in conjunction with large-scale accountability assessments, portfolio assessments do have a place in the classroom setting, where student progress over time can be addressed. The seven key ingredients for utilizing portfolio assessments were discussed, and three different functions were highlighted: documentation of progress, showcasing accomplishments, and evaluation of status. The pros and cons of this form of assessment were also discussed.
Classroom Assessment – What Teachers Need to Know - W. James Popham
Chapter 9 discusses Portfolio Assessments. The chapter shows how portfolio assessments in the classroom are more effective than using them as a standardized testing tool. Also discussed, is the importance of self-assessment when using portfolio assessment. Seven steps should also be implemented by the classroom teacher when using portfolio assessments. There are three functions of portfolio assessments that should be thought about by a teacher wanting to utilize portfolio assessments. As with most assessment tools there are also negatives and positives to be looked at as well.
The definition of a portfolio is a “systematic collection of one’s work”. In the educational setting we can think of a portfolio as the collection of a students work. While portfolio’s have been used by many professions over the years, they are somewhat new to education. They have been embraced by educators who are not keen on standardized tests. However, efforts to employ portfolios as large scale applications to accountability have not been very encouraging. One of the biggest reasons for this is cost for trained scorers and then centralized scoring. In some states where they have the classroom teachers do the scoring reliability has become an issue. This is due to the improper or no training and student bias concerns. In general using portfolio assessments in accountability testing may not be the best use of this form of assessment.
Using portfolio assessments in the classroom however, may be more realistic. The author suggests a seven stop sequence to making portfolio assessment successful in the classroom. The first step is to make sure your students “own” their portfolios. Students need to understand that the portfolio is a collection of their work not just a project to earn a grade. The second step would be to decide what kind of samples to collect. Collaboratively, the teacher and student should decide what work should be collected. A wide variety is also recommended. The third step is to collect and store work samples. This should be planned out with students as to how and where they will store and collect work. The fourth step is to select criteria by which to evaluate portfolio work samples. Again the teacher and student need to carefully plan out evaluative criteria for the work. Students need to clearly understand the rubric’s evaluative criteria. The fifth step is to require students to evaluate continually their own portfolio products. Using index cards students should provide a written self-evaluation of their work on an ongoing basis. The sixth step is to schedule and conduct portfolio conferences. This will take time but is very important for teacher and student. This step will also help the student make good self-evaluative comments. The seventh step is to involve parents in the portfolio assessment process. Parents should receive information as well about expectations and should also be encouraged to review student’s portfolios from time to time. They can even be involved in the self-assessment and reviews. These activities will encourage and promote the importance and show the value of portfolio assessment in the classroom.
There are several main purposes of portfolios identified by portfolio specialists. The first purpose is documentation of student progress. The major function would be to provide the teacher, student and parents with evidence of growth. Typically, this is known as a working portfolio and student self-evaluations are useful tools in this purpose. Student achievement levels should also influence instructional decisions. To do this information should be collected or assessed as close to the marking terms as possible. The second purpose of a portfolio is to provide an opportunity for showcasing student opportunities. The author Robert Stiggins refers to these as celebration portfolios and encourages them especially in early grades. Showcase portfolios should be a selection of best work and a thoughtful reflection of its quality provided by the student is essential. The third purpose of a portfolio assessment is evaluation of student status. This purpose would serve as determination of previously established evaluative criteria. Standardization of how portfolios are appraised is important in this purpose, such as a pre-established rubric provided with clear examples for the student to follow. These three purposes show why it is important for a classroom teacher to decide first the primary purpose of portfolios and then to determine how they should look and be prepared by students. Portfolio assessments should have one priority or purpose. One purpose can not satisfy multiple functions. The three purposes can not be provided in one function.
There are pros and cons of portfolio assessments as there are with all forms of assessments. The greatest strength of portfolio assessment is its ability to be tailored to a student’s needs, interests and abilities. Portfolios also show growth and learning of students. It provides a way to document and evaluate growth and learning in the classroom that standardized or written tests can not. Self-evaluation is also fostered which guides student learning over time. Personal ownership is also experienced by students in relation to their work and the progress they experience. There are also some cons of portfolios. The time factor sometimes makes it difficult to have consistent evaluations, as well as, creating appropriate scoring guides. The amount of time needed to properly carry out the task in properly creating and reviewing portfolios. The biggest problem can be proper training in carrying out portfolio assessments.
Classroom teachers really need to understand that portfolio assessments are not a one-time measurement approach for addressing short-term objectives. Portfolio assessments should be used for a big goal addressed throughout the year, and self-evaluation should be nurtured along the way. Teachers should pick one core area in which to use portfolios rather than trying to implement them the same way for every subject.
There are many good uses for portfolio assessments. While they should not be used in place of standardized testing or in conjunction with large-scale accountability assessments, portfolio assessments do have a place in the classroom setting. Student progress over time can be addressed using portfolio assessments. The seven steps for using portfolio assessments were also discussed, and three different functions were highlighted: documentation of progress, showcasing accomplishments, and evaluation of status. Also discussed were the pros and cons of this form of assessment.
Saturday, April 12, 2008
Chapter 8 – Performance Assessment
Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - W. James Popham
Chapter eight takes a look at performance assessments. Performance assessments try to create real-life situations and apply assessment inferences to the tasks being performed. Appropriate tasks for the skills being assessed are essential for performance assessments to be valid, and there are seven evaluative criteria for judging performance-assessment tasks. The skills to be assessed must also be significant. Evaluative criteria are one of the most important components of a rubric used to evaluate responses on performance assessments, and distinctions should be drawn among three rubric types, only one of which genuinely enhances instruction.
A performance assessment is defined as an approach to measuring a student's status based on the way the student completes a specified task. There are varied opinions about what a true performance assessment is. Some educators feel that short-answer and essay assessments constitute performance assessments. Other educators feel there are three criteria that a true performance assessment must possess. The first criterion is multiple evaluative criteria: performance must be judged using more than one criterion. The second is pre-specified quality standards: each evaluative criterion being judged is clearly explained in advance of any judgment of the quality of performance. The third is judgmental appraisal: human judgments are used to determine how acceptable a student's performance is. Still others feel performance assessments must be demanding and aligned with Bloom's taxonomy. Regardless of the criteria, performance assessments are very different from selected-response or constructed-response assessments.
Suitable tasks should be identified for performance assessments. Teachers will need to generate their own performance-test tasks or select tasks from other educators. Teachers will also need to make inferences about students and decisions based on those inferences, all grounded in the curricular aims established early on. One of the biggest drawbacks of performance assessments is that, because students respond to fewer tasks than on typical paper-and-pencil tests, it is more difficult to generalize accurately about the skills and knowledge a student has gained. There are several evaluative criteria to consider when evaluating performance-test tasks. The first is generalizability: is there a high likelihood that the student's performance on the task will generalize to comparable tasks? The second is authenticity: is the task true to life, as opposed to school-only? The third is multiple foci: does the task measure multiple instructional outcomes, not just one? The fourth is teachability: is the task one students can become more proficient in as a consequence of the teacher's instructional efforts? The fifth is fairness: is the task fair to all students? This criterion guards against a form of assessment bias. The sixth is feasibility: is the task realistically implementable? The seventh and final criterion is scoreability: is the task likely to elicit responses that can be reliably and accurately evaluated? The best case is to apply all of these criteria, but applying as many as possible will work as well. One last important factor to consider is the significance of the skill being evaluated: performance assessments should be reserved for the most significant skills because of the amount of time required to develop and score them.
A scoring rubric is typically used to score student responses on performance assessments. There are three important features of a scoring rubric. The first is the evaluative criteria, the factors used to determine the quality of a response; no more than three or four evaluative criteria should be used. The second is descriptions of qualitative differences for the evaluative criteria: a description must be supplied for each criterion so that qualitative distinctions in a response can be made, describing in words what a perfect response should be. The third is an indication of whether a holistic or analytic scoring approach is to be used; the rubric must indicate whether the evaluative criteria are to be applied collectively in the form of holistic scoring or on a criterion-by-criterion basis in the form of analytic scoring. A well-planned rubric will benefit instruction greatly.
There are a variety of rubrics seen today. Two types described by the author as sordid are task-specific and hyper-general rubrics; the one described as super is the skill-focused rubric. The first sordid rubric is the task-specific rubric, in which the evaluative criteria are linked only to the specific task embodied in a specific performance test. This rubric does not provide insight into instruction for the teacher, because students should be taught to perform well on a variety of tasks, not a single task. The second sordid rubric is the hyper-general rubric, in which the evaluative criteria are stated in terms so general and vague that they provide little guidance about what a quality response looks like. These rubrics might as well be scored with letter grades of A through F, as they provide no instructional value regarding student performance. The third rubric described is a rubric of value and one that should be used: the skill-focused rubric. These rubrics are developed around the skill being measured by the constructed-response assessment, as well as what is being pursued instructionally by the teacher. The key is to develop the scoring rubric before instructional planning begins. For example, if organization is an evaluative criterion, a skill-focused rubric might appraise two aspects of it: overall structure and sequence.
There are five rules that should be followed in creating a skill-focused rubric, and the rubric should be generated before you plan your instruction. The first rule is to make sure the skill to be assessed is significant. Skills assessed with a skill-focused rubric should be demanding accomplishments; if they are not, other assessment forms are more appropriate. The second rule is to make certain all of the rubric's evaluative criteria can be addressed instructionally: scrutinize every criterion to ensure you can teach students to master it. The third rule is to employ as few evaluative criteria as possible. Try to focus on three or four; if you are trying to achieve mastery on more than that, it will be difficult to use performance assessments properly. The fourth rule is to provide a succinct label for each evaluative criterion. One-word labels keep students focused on what is expected for mastery. The fifth rule is to match the length of the rubric to your own tolerance for detail. If rubrics longer than one page seem overwhelming to you, then keep them short; rubrics should be built to match the detail preference of the teacher.
Performance assessments provide an alternative to traditional paper-and-pencil assessments. They are also sometimes seen as more true to life and closer to what one would be expected to do in the real world. The tasks in performance assessments align more closely with high-level cognitive skills, allowing more accurate inferences to be derived about students and a more positive influence on instruction. These assessments do, however, require much more time and energy from students as well as teachers, and the development and scoring must be done correctly in order for the inferences to be valid and effective.
Tuesday, April 8, 2008
Chapter 7 – Constructed Response Tests
Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - W. James Popham
Constructed-response tests are the focus of chapter seven. Constructed-response tests are well suited to assessing whether a student has achieved mastery. The effort involved in creating and scoring constructed-response items is greater than for selected-response items, so they should be used when teachers want to know what has truly been mastered. Scoring criteria should be established ahead of time to preserve valid inferences. Two kinds of constructed-response assessments are discussed: short answer and essay.
The first type of constructed-response item discussed is the short-answer item. Short-answer items require a word, phrase, or sentence in response to a direct question or incomplete statement. Typically, short-answer items are suited to relatively simple learning outcomes. The advantage of short-answer items is that students must produce a correct response, not pick out a familiar choice. The major disadvantage is that responses are difficult to score; inaccurate scoring reduces reliability and in turn reduces the validity of the assessment-based inferences made about students.
There are five item-writing guidelines for short-answer items. The first guideline is to employ direct questions rather than incomplete statements, particularly for younger students, who will be less confused by direct questions. The direct-question format also helps ensure the item writer phrases the item with less ambiguity. The second short-answer guideline is to structure the item so that a response should be concise; test items should be created to elicit brief, clear responses. The third guideline is to place blanks in the margin for direct questions or near the end of incomplete statements. This aids scoring because the blanks are aligned, and placing fill-in blanks toward the end means students read the whole statement, ensuring more accurate responses. The fourth guideline is, for incomplete statements, to use only one or, at most, two blanks; more than two blanks leaves the item with holes galore and causes meaning to be lost. The fifth guideline is to make sure the blanks for all items are equal in length, which eliminates unintended clues to the responses.
The second kind of constructed-response item is the essay item. Essay items are the most commonly used form of constructed response. They are used to gauge a student's ability to synthesize, evaluate, and compose. One form of essay item is the writing sample, which is used heavily in performance assessments. A strength of essay items is the assessment of complex learning outcomes; their weaknesses are that they are difficult to write properly and that scoring responses reliably can be a challenge.
There are five item-writing guidelines for essay items. The first guideline is to convey to students a clear idea of the extensiveness of the response desired. Two forms are used for this: a restricted-response item, which limits the form and content of the response, and an extended-response item, which gives more latitude in the response. When using these forms you can specify an amount of space or a word limit. The second guideline is to construct items so the student's task is explicitly described; the nature of the assessment task has to be set forth clearly so the student knows exactly how to respond. The third guideline is to provide students with the approximate time to be spent on each item, as well as each item's value. Directions should state clearly how much time to spend and the point value of each essay item. The fourth guideline is to not employ optional items. Offering a menu of options in effect creates different exams altogether, and the consequence is the impossibility of scoring on a common scale. The fifth guideline is to judge an item's quality in advance by composing, mentally or in writing, a possible response to it, as a preview of your expectations for a response.
As mentioned earlier, the most difficult problem with constructed-response items is scoring them. There are five guidelines for scoring responses to essay items. The first is to score responses holistically and/or analytically. Holistic scoring focuses on the essay response as a whole, using the evaluative criteria together; analytic scoring is a specific point-allocation approach applied criterion by criterion. The second guideline is to prepare a tentative scoring key in advance of judging responses; deciding ahead of time how you will score avoids being influenced by a student's actions in class or by the quality of the first few responses. The third guideline is to make decisions regarding the importance of writing mechanics prior to scoring. It is important to decide up front how you will score mechanics; if the content matters more for the inferences you want to make, establish this before you begin scoring. The fourth guideline is to score all responses to one item before scoring responses to the next item. Scoring one item at a time across all papers increases the reliability of your scoring and avoids constantly shifting your focus among criteria. It seems like it would take longer, but it really won't. The fifth guideline is, as much as possible, to evaluate responses anonymously; have students sign their papers on the back and do not look at the names while scoring. This helps you avoid making judgments outside your scoring rubric.
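To make the holistic/analytic distinction concrete, here is a minimal sketch in Python of how the two approaches differ when scoring a single essay. The criteria, point allocations, and scores below are hypothetical and invented for illustration only; the book prescribes no particular numbers or tools.

    # Hypothetical analytic scoring key, prepared before any responses are judged.
    # Each evaluative criterion gets its own point allocation.
    analytic_key = {"content": 10, "organization": 5, "mechanics": 5}

    # One scorer's criterion-by-criterion judgments for a single essay response.
    essay_scores = {"content": 8, "organization": 4, "mechanics": 3}

    # Analytic scoring: award points criterion by criterion and sum them.
    analytic_total = sum(essay_scores[criterion] for criterion in analytic_key)
    print("Analytic score:", analytic_total, "out of", sum(analytic_key.values()))

    # Holistic scoring: the same criteria are kept in mind, but the scorer
    # records one overall quality judgment instead of separate point awards.
    holistic_score = 4  # for example, on a 1-to-6 overall quality scale
    print("Holistic score:", holistic_score, "out of 6")

Either way, the key point from the guidelines above is that the scoring scheme is settled before the first response is read.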
Assessing students using the constructed-response format allows for more in-depth awareness of learned skills. Two forms are used for this: short-answer items and essay items. There are guidelines for writing these items, as well as scoring guidelines, to ensure greater reliability and validity for score-based inferences.
Sunday, April 6, 2008
Chapter 6 – Selected Response Tests
Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - W. James Popham
Chapter six focuses on creating and evaluating appropriate selected-response assessments. There are five general item-writing commandments for selected- and constructed-response items. The author also discusses strengths and weaknesses of four types of selected-response items. Selected-response items, if written properly, can assess higher-level thinking skills. A best practice is to have a colleague with content knowledge review your assessments, as well as to use the formulas available for reviewing them.
The five general item-writing commandments are essential to remember when creating selected- and constructed-response tests. The first commandment is not to provide opaque directions to students regarding how to respond to your assessments. Teachers typically don't put serious thought into their directions, but when a student faces an unfamiliar testing format, wordy directions can be a distracter, causing unintentional incorrect responses that impair the inferences drawn from the assessment. The second commandment is not to employ ambiguous statements in your assessment items. If your questions are unclear and students are unsure of what you mean, the questions can be misinterpreted; this can produce a wrong response even when the student knows the answer but does not know what you are asking. The third commandment is not to provide unintentional clues to the correct response. In this case students arrive at a correct response because the wording of the item led them to it; the student may not really know the content, so the item does not assess it properly, and follow-up that would have benefited actual learning may not occur. An unintentional clue can be as simple as the use of a word like "an" and how it completes the sentence or answer. The fourth commandment is not to employ complex syntax in assessment items; the goal is to use very simple sentences, since too many clauses disrupt the flow of the item and obscure what it is asking. The fifth and final commandment is not to use vocabulary that is more advanced than required or understood by the student. To get a fix on a student's status you need to assess what was taught and learned, not introduce new material.
The first type of selected-response item is the binary-choice item, commonly seen as true-false and probably one of the oldest forms as well. There are five guidelines for writing binary-choice items. The first is to phrase items so that a superficial analysis will lead to a wrong answer. This encourages students to think about the test item and gives you a way to assess how much good thinking they can do. The second guideline is to rarely use negative statements and never use double negatives. It can be tempting to use the word "not" in a true statement, but this only confuses the item and should be avoided. The third guideline is to include only one concept in each statement. If a test item contains a first concept that is true and a second concept that is false, it is difficult for the student to respond correctly, which leads to false inferences about the student's true learning. The fourth guideline is to balance the two categories: keeping an equal number of true and false statements is important and should be easy to do. The fifth guideline is to keep item length similar for both categories being assessed. Like the fourth guideline, this discourages guessing: if students notice that, say, the longer-worded statements follow a common pattern, they will begin responding to that pattern, and again you will not get a true assessment of learning.
The second type of selected-response item is the multiple binary-choice item. A multiple binary-choice item presents a cluster of items that requires a binary response to each item in the cluster. These clusters look similar to multiple choice; however, they are statement clusters requiring a separate response to each statement rather than a single choice for the cluster. Two important guidelines should be used for multiple binary-choice items. The first is to separate item clusters clearly from one another. Since students are more familiar with multiple choice, clearly identify what is clustered together, for example by using stars or bolding the start of each new cluster. The second guideline is to make certain each item fits well with the cluster's stem. The part preceding the responses is the stem, and all items should be linked to the stem in a meaningful way. One large benefit of multiple binary-choice items is that if the stem contains new material and the binary choices depend on that material, the student must go beyond recall knowledge to answer, so more intellectually demanding thought is required than mere memorization.
The third type of selected-response item is multiple choice. This form has been widely used for achievement testing and is typically used to measure students' possession of knowledge as well as their ability to engage in higher levels of thinking. There are five guidelines for multiple-choice items. The first is that the stem should consist of a self-contained question or problem, so it is important to put as much content in the stem as necessary to understand what the item is getting at. The second guideline is to avoid negatively stated stems; using "not" may only confuse the item and is sometimes even overlooked, so if it must be used, italicize or bold the word "not." The third guideline is not to let the length of alternative responses supply unintended clues. Try to keep all responses about the same length, or at least two short and two long, and keep distracters similar in length to the correct response. The fourth guideline is to scatter your correct responses. If students notice a pattern in your answers, they may respond to the pattern instead of the item. A good rule of thumb is that each answer position, A, B, C, or D, should hold roughly an equal share of the correct answers, about 25 percent with four choices. The fifth guideline is to never use "all of the above," although "none of the above" can be used to increase item difficulty. "All of the above" is not a good idea because the test taker may look only at the first response, see that it is correct, and choose it without looking further. Using "none of the above" to increase difficulty works well for test-based inferences such as math problem solving: if a problem is displayed and the student must actually solve it, a student who merely guesses something close will be caught, and you will know the problem was not worked correctly.
The fourth selected-response type is the matching item. Matching items consist of two parallel lists of words or phrases, and students must match entries on the first list with appropriate entries on the second. One list holds the premises and the other the responses. There are six guidelines for well-constructed matching items. The first is to employ homogeneous lists: the entries within each list should belong to the same category; otherwise matching should not be used. The second guideline is to use relatively brief lists and to place the shorter words or phrases on the right. Using about ten or fewer premises cuts down on distracters when choosing the correct response. The third guideline is to use more responses than premises, which decreases the ability to answer by process of elimination. The fourth guideline is to order the responses logically, for example alphabetically, to avoid giving unintended clues. The fifth guideline is to describe the basis for matching and the number of times a response can be used; students need to clearly understand how to respond, and the more accurately they respond, the more valid your score-based inferences will be. The sixth and final guideline is to place all premises and responses for an item on one page. Page flipping only creates confusion, leads to wrong responses, and distracts other test takers.
This chapter addressed important rules for constructing selected- and constructed-response assessments, the four types of selected-response items commonly used, and guidelines for using each type. By practicing these concepts, the validity and reliability of selected-response assessments can be increased and assessment-based inferences can be drawn appropriately.
Wednesday, April 2, 2008
Chapter 5 – Deciding What to Assess and How to Assess It
Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - W. James Popham
Chapter 5 focuses on what a classroom teacher should be assessing as well as the procedures for properly assessing students. These questions should be guided by what information a teacher hopes to gather about students. Curricular standards should also play a role in assessment targets, and Bloom's Taxonomy is a helpful framework for deciding on the cognitive outcomes of assessments and instruction. Deciding how to assess focuses on norm-referenced and criterion-referenced approaches, and selected-response versus constructed-response formats are other considerations teachers should weigh when assessing.
The first focus of this chapter is what to assess. Decision-driven assessments help teachers gain information about their students. It is important to clarify, before an assessment is created, what decision or decisions will be influenced by a student's performance on the assessment. Many times the knowledge and skills of the student are not the only expected results; attitudes toward what is being taught, as well as the effectiveness of instruction or the need for further instruction, are essential outcomes. By determining these things before instruction and assessment, a teacher can better inform both.
Curricular objectives also play a role in what to assess. Considering what your instructional objectives are can help you get a fix on what you should assess. In the past there was a demand for behavioral objectives, but these objectives were sometimes so abundant and small-scoped that they overwhelmed teachers. Today the goal is to conceptualize curricular aims that are framed broadly yet are measurable, and to organize instruction around them. Measurability is the key to good objectives: even broad aims, if they are measurable, can be managed within your instruction.
There are three potential assessment targets. The first is cognitive assessment, which deals with students' intellectual operations, such as displaying acquired knowledge or demonstrating thinking skills. The second is affective assessment, which deals with attitudes, interests, and values, like self-esteem, risk taking, or attitude toward learning. The third is psychomotor assessment, which deals with a student's large-muscle or small-muscle skills, as demonstrated in keyboarding or shooting a basketball in physical education. These ideas were presented by Benjamin Bloom through a classification system for educational objectives known as the 1956 Taxonomy of Educational Objectives.
The next area of focus is how to assess. There are two widely accepted strategies. The first is norm-referenced measurement, in which educators interpret a student's performance in relation to the performance of students who have previously taken the same assessment; this previous group is known as the norm group. The second is the criterion-referenced strategy, or criterion-referenced interpretation. A criterion-referenced interpretation is absolute, as it hinges on the extent to which the curricular aim represented by the test has actually been mastered by the student. The biggest difference between these approaches is how results are interpreted. The norm-referenced strategy should really only be used when a group of students needs to be chosen for a specific educational experience. Otherwise, criterion-referenced interpretations provide a much better idea of what students can and cannot do, allowing teachers to make good instructional decisions.
Once teachers decide what to assess and how to assess it, the next consideration is how students will respond. There are really only two types of responses: selected response and constructed response. Selected responses can be multiple-choice or true-false selections; constructed responses can be essays, oral presentations, or products students create. In deciding which type of response works best, ease of scoring should not be a consideration. The assessment procedure should focus on the student's status with respect to the unobservable variable the teacher hopes to determine.
The more up-front thought a teacher gives to what to assess and how to assess it, the more likely the assessment is to be appropriate. Teachers need to be flexible and willing to change instruction based on assessments and the strategies used to assess. By understanding instructional objectives and making them measurable, assessments can be written to answer these questions for teachers and students.
Saturday, March 29, 2008
Chapter 4 – Absence of Bias
Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - W. James Popham
Chapter 4 discusses absence of bias, the last of the three essential criteria for evaluating educational assessments. Assessment bias is defined as qualities of an assessment instrument that offend or unfairly penalize a group of students because of the students' gender, race, ethnicity, socioeconomic status, religion, or other such group-defining characteristic. Assessment bias occurs when elements of an assessment distort a subgroup's performance on the assessment. Bias review panels should be created to review tests, especially high-stakes tests. Students who are disabled or who are English language learners can also experience assessment bias because of the nature of their needs. An understanding of assessment bias is essential for classroom teachers.
Assessment bias occurs when elements of an assessment distort a student's performance based on the student's personal characteristics. Assessment bias also interferes with test validity: if it distorts student performance, then score-based inferences cannot be made accurately. There are two forms of assessment bias. The first is offensiveness, which occurs when negative stereotypes of certain subgroups are presented in an assessment. Offensiveness can also act as a distracter, taking a student's focus off the question and causing the student to respond incorrectly or incompletely. The second form is unfair penalization, which occurs when a student's test performance is distorted by content that may not be offensive but disadvantages the student's subgroup. An example could be questions that assume a strong knowledge of football; girls may tend to be less familiar with the terms used and therefore do less well on such questions. Unfair penalization happens when it is not the student's ability that leads to poor performance but the student's characteristics or subgroup membership.
Disparate impact, however, does not by itself indicate assessment bias. If an assessment has a disparate impact on members of a certain ethnic, gender, or religious group, it should be scrutinized for bias, but often the cause is an educational factor that needs to be addressed. Content review of test items is essential, especially for high-stakes tests, and such reviews can help reveal disparate impact as well as assessment bias. One way to do this is with bias review panels. A bias review panel should consist of experts in the subject being reviewed as well as individuals from the subgroup that is, or could be, adversely impacted, with males and females equally represented. Once the panel is formed, assessment bias needs to be clearly defined and explained to the panel, as does the purpose of the assessment. The next step for the panel is a per-item absence-of-bias judgment: a question is developed that panelists ask of each item as they read through the assessment, answering yes or no. Once the responses are tallied, the percentage of "no" judgments is calculated for each item, and an absence-of-bias index can be computed per item and then for the entire test. In addition to this scoring, when a panelist's judgment indicates possible bias in an item, a written explanation is provided, and an item is often discarded on that basis. Individual item review should then be followed by an overall absence-of-bias review, in which an overall question is created and asked of the whole assessment. The same scoring process is applied, and items can be modified or corrected rather quickly.
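As a rough sketch of the tallying described above, with invented panelist judgments and assuming the panel question asks whether an item might offend or unfairly penalize (so a "no" answer indicates absence of bias), the per-item and whole-test indices could be computed like this:

    # Hypothetical panel data: each inner list is one panelist's yes/no
    # judgments for items 1..4 ("no" = the item does NOT appear biased).
    panel_judgments = [
        ["no", "no",  "yes", "no"],   # panelist 1
        ["no", "yes", "yes", "no"],   # panelist 2
        ["no", "no",  "no",  "no"],   # panelist 3
    ]

    num_panelists = len(panel_judgments)
    num_items = len(panel_judgments[0])

    # Per-item absence-of-bias index: percentage of panelists answering "no".
    per_item_indices = []
    for item in range(num_items):
        no_count = sum(1 for panelist in panel_judgments if panelist[item] == "no")
        index = 100 * no_count / num_panelists
        per_item_indices.append(index)
        print(f"Item {item + 1}: {index:.0f}% absence-of-bias")

    # Overall index for the whole test: the mean of the per-item indices.
    overall_index = sum(per_item_indices) / num_items
    print(f"Overall test index: {overall_index:.0f}%")

Items with a low index are the ones a panel would flag for written comment, revision, or removal.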
Overall bias in high-stakes tests can also be detected through bias review panels; however, this approach is not practical for classroom assessments. Bias in classroom assessments can instead be reduced by becoming sensitive to the existence of assessment bias and the need to eliminate it. Teachers should think seriously about the backgrounds and experiences of the various students in their class, then review each item on every assessment with an eye to whether it might offend or unfairly penalize any students. If any items show possible bias, eliminate them. If you are unsure and have access to other teachers with background or knowledge of a particular subgroup, it may also help to have them review your test or the item in question.
Assessing students with disabilities can also lead to assessment bias. Federal laws have required general education teachers to look at how they teach and assess students with disabilities. In 1975, Public Law 94-142 was enacted so that states that properly educated students with disabilities would receive federal funds. This law also established the use of IEPs, or Individualized Education Programs, for students with disabilities. IEPs are federally prescribed documents created by parents, teachers, and specialized service providers that describe how a child with disabilities should be educated. In 1997 the act was reauthorized and renamed the Individuals with Disabilities Education Act (IDEA). The reauthorization required states and districts to identify curricular expectations for special education students similar to the expectations for all students, and these students were also required to be included in assessment programs whose results are reported to the public. In January 2002, No Child Left Behind was enacted to improve the achievement levels of all students and to impose consequences on states and districts that fail to comply. The improvements must be demonstrated on state-chosen tests linked to the state's curricular aims. Over twelve years, all schools must make adequate yearly progress (AYP), with the required score levels increasing from year to year. No Child Left Behind also requires subgroups of students to meet AYP targets based on a minimum subgroup size chosen by each state. IEPs are used today to help students meet the content standards pursued by all students.
Assessment accommodations are used to address assessment bias for students with disabilities. Accommodations are procedures or practices that allow students with disabilities equitable access to instruction and assessment. Accommodations do not lower expectations for these students but rather provide the student with conditions similar to those used during instruction, and they must not alter the nature of the skills or knowledge being assessed. There are four typical accommodation categories: presentation, response, setting, and timing/scheduling. Students should take part in choosing the accommodations that work best for them to ensure their proper use.
English language learners are another diverse group. This group includes students whose first language is not English and who know little if any English, students who are beginning to learn English but could benefit from school instruction in it, and students who are proficient in English but need additional assistance in academic or social contexts. Another subgroup is limited English proficient (LEP) students, who use a language other than English proficiently at home. If ELL or ESL students are going to be assessed fairly, these classifications need to be made consistent within and across states. These populations are also sparse in some areas, which needs to be accounted for statistically, and the membership of the group is unstable over time. Mobility is very common, so instructional instability can also be a contributing factor. Lower baseline and cut scores should be considered, since language-based subjects tend to be more difficult for these students. The movement to isolate these subgroups makes sense; however, how the scores are analyzed needs to take into account students who have a poor understanding of English. When such students take a test written in English, preventing test bias should be a consideration.
Classroom teachers need to be aware of assessment bias, and they need to take measures to prevent bias on their assessments whenever possible. For high-stakes tests, review panels need to be formed for item analysis as well as whole-assessment analysis. Students with disabilities should have IEPs written to accommodate their testing needs and ensure equity during assessments. Students whose first language is not English should be assessed as a subgroup and not penalized for low scores on tests written in English.