Classroom Assessment: What Teachers Need to Know [CLASSROOM ASSESSMENT 5/E]
Friday, June 20, 2008
Aligning Classroom Assessments to Content Standards - Final Project
Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - By W. James Popham
Educational Assessment – Mentor: Lorraine Lander
Final Project: Aligning Classroom Assessments to Content Standards
Assessments in the classroom should have one main focus: alignment with the content, or curricular aims, that students are intended to learn. Content standards, or curricular aims, are the knowledge and skills teachers and educators want students to acquire. When teachers plan for instruction, they need to design their assessments before instruction begins so that the assessments are properly aligned with the curricular aims; this allows instruction to be adjusted when curricular aims are not being mastered. Assessments also need to be valid and reliable so that sound inferences about students can be drawn from their results. Popham suggests that using various types of assessments helps teachers sample the knowledge and skills students have acquired.
One very important assessment used in New York State is the 4th Grade NYS Math Assessment, one of many grade-level exams used to comply with No Child Left Behind legislation. Popham refers to this type of test as a high-stakes exam. The NYS 4th Grade assessment is a standards-based assessment that, according to Popham's definition, would be considered instructionally sensitive: it attempts to measure a manageable number of curricular aims, it attempts to clearly describe the skills and knowledge being assessed, and it reports students' performance in a way that benefits teachers and students instructionally, which Popham considers essential. Popham would also agree that an instructionally sensitive assessment can serve as a measure of a teacher's instructional effectiveness. Because the exam is not created in the classroom, however, it may need to be carefully evaluated for assessment bias as well as item bias. Many districts in New York State are beginning to take advantage of this aspect of the assessment. To align and inform their instruction, a group of teachers in New York State is using benchmark testing three times a year, along with unit assessments and other formative assessment measures, to get a handle on what their students are really learning.
The process being used by this group of New York State teachers is as follows. The first step is to align instruction to the assessment being created, starting from the grade level currently being taught; for the purposes of this paper I will use Grade 5 as the base grade. The teachers use the scores, as well as the performance indicators, from the March 4th Grade NYS Assessment as their baseline. If a new or untested student arrives in the fall, they use only that student's score from the September 5th Grade benchmark exam for alignment purposes. A fall benchmark assessment is then developed. To develop it, the teachers gather the performance indicators from the NYS 4th Grade Assessment. Then, with the help of the fourth-grade teachers along with the current fifth-grade teachers, they examine the key curricular aims in both 4th and 5th grade. The key curricular aims students should have mastered in 4th grade are drawn from the NYS 4th Grade Assessment and are then aligned with the aims expected in 5th grade to find matches or similar curricular aims. Popham also recommends this type of alignment, but stresses not using too many content standards when trying to get a handle on content mastery. The teachers lay the curriculum maps of the two grade levels side by side and select the curricular aims that appear across both. A form is used to collect this information and simplify the process. Below is a sample of this form.
Sample Form:
Targeted Grade Level: _________
Projected Benchmark Testing Dates: __________________________________
Test Format: __________________________________
Targeted Content Strand (Curricular Aim): ___Measurement____________________________
Performance Indicator | Sample Test Question
4M1 – Select tools and units (customary and metric) appropriate for the length being measured. |
5M1 – Use a ruler to measure to the nearest inch, ½ inch, ¼ inch, and 1/8 inch. | Measure the following line: ________________________
6M1 – Measure capacity and calculate the volume of a rectangular prism. |
Bands within the Content Strands:
Measurement: Units of measurement
Tools and Methods
Units
Error and Magnitude
Estimation
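To make the record-keeping concrete, the completed form might be represented in code roughly as follows. This is a minimal Python sketch; the class name, field names, and sample values are my own illustrative assumptions, not the teachers' actual format.

```python
from dataclasses import dataclass, field

@dataclass
class AlignmentForm:
    # Hypothetical fields mirroring the blanks on the sample form above.
    targeted_grade_level: int
    benchmark_testing_dates: list[str]
    test_format: str
    content_strand: str                       # e.g., "Measurement"
    # performance indicator code -> sample test question (if any)
    performance_indicators: dict[str, str] = field(default_factory=dict)

form = AlignmentForm(
    targeted_grade_level=5,
    benchmark_testing_dates=["September", "December", "March"],
    test_format="Multiple choice and constructed response",
    content_strand="Measurement",
    performance_indicators={
        "4M1": "",  # select appropriate tools and units for length
        "5M1": "Measure the following line to the nearest 1/8 inch.",
        "6M1": "",  # capacity and volume of a rectangular prism
    },
)
```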
Once they have decided on the appropriate curricular aims, the teachers begin to look at test questions aligned with those aims. To do this, they draw on the bank of questions from previous NYS tests, or on software they have available that is aligned with the chosen curricular aims. This practice also aligns with sound assessment practices recommended by Popham. Because the teachers are pulling questions from multiple sources, they are providing generalized test-taking preparation, which, according to the test-preparation guidelines Popham explains, prepares students for many different types of test items. Popham also notes that the more questions used, the better you are able to gauge students' mastery of skills and knowledge. The teachers use some key questions to ensure they are designing the benchmark assessment appropriately, and these questions align with many of Popham's recommendations for classroom assessment creation:
1. What will be the format for your test questions? Popham recommends keeping test items consistent in sequence and size. He also suggests not letting items run onto another page or forcing students to flip pages while taking the assessment.
2. For each curricular aim selected, how many questions will we have? Popham would suggest an even distribution, for example two questions per curricular aim.
3. How many total questions will be on the test? Popham would recommend 50 questions or more for reliable results.
4. How many of each type of question will be included? Again, an even distribution is recommended by Popham.
5. How will we ensure that there is a range in the level of difficulty among the questions selected for each curricular aim? Item difficulty indices might be used to determine this.
6. Will there be a time limit? If so, what will it be?
For this assessment the teachers will incorporate multiple-choice as well as extended-response and constructed-response questions. The NYS Math exam uses three question formats: multiple choice, short response, and extended response. Generalized mastery should be promoted in whatever is being taught, so varied assessment practices should be employed whenever possible (Popham, 2008). Performance assessments, portfolio assessments, and affect assessments can also be used throughout the year to gauge where students are in relation to the curricular aims being assessed. By using these strategies the teachers are promoting educational defensibility as well as professional ethics: they are teaching to the curricular aims represented by the assessment, and while the test is important, it is not the single defining piece of learning for the teacher. The curricular aims addressed are also important to the grade-level curriculum; that is how the teachers establish educational defensibility, and by using older exam questions aligned to the curricular aims they intend to measure, they also uphold professional ethics. If the teachers incorporate some of the other assessment techniques throughout the year, they will be using generalized assessment practices and preparing students for a variety of testing formats. Since the 4th Grade NYS Assessment is largely comprised of Number Sense and Operations questions, the teachers have chosen many of their questions from that standard. A few questions from the Algebra and Statistics and Probability standards will also be included, but fall instruction in 5th grade is based largely on the Number Sense and Operations curriculum. The test will have a total of 20 questions: 16 multiple choice and 4 constructed response. Students will be given an hour to complete it. To make the assessment more reliable, the teachers might have chosen more questions and used at least a couple of questions per curricular aim being assessed.
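As a rough illustration of this design step, the blueprint described above can be checked in a few lines of Python. The strand names come from the NYS standards mentioned in the text, but the per-strand item counts here are assumptions chosen only to total 16 multiple-choice and 4 constructed-response items.

```python
# Hypothetical item counts per strand; only the 16 + 4 = 20 total
# comes from the description above, the split across strands is assumed.
blueprint = {
    "Number Sense and Operations": {"multiple_choice": 12, "constructed_response": 3},
    "Algebra": {"multiple_choice": 2, "constructed_response": 1},
    "Statistics and Probability": {"multiple_choice": 2, "constructed_response": 0},
}

total_mc = sum(s["multiple_choice"] for s in blueprint.values())
total_cr = sum(s["constructed_response"] for s in blueprint.values())
print(f"{total_mc} multiple choice + {total_cr} constructed response "
      f"= {total_mc + total_cr} items")

# Flag any strand assessed by fewer than two items, since a single
# item gives a shaky basis for a per-aim mastery inference.
for strand, counts in blueprint.items():
    if sum(counts.values()) < 2:
        print(f"Warning: only {sum(counts.values())} item(s) for {strand}")
```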
Some final considerations suggested by the teachers before administering the assessment are described next. The teachers recommend a cover page, with clearly written, understandable directions for each section of the test, a practice also outlined in Popham's five general item-writing commandments (Popham, 2008). They suggest that if the assessment is modeled after the NYS Math assessment, the directions should be modeled on it as well; Popham would suggest more generalized modeling in order to prepare students for the various test items they may encounter. For multiple choice, use bubble sheets: clearly labeled sheets and directions on how to bubble the answer sheet, as well as how to correct mis-bubbles or mistakes, are very important. Determine ahead of time the number of copies of assessments and answer sheets needed (for a grade-wide assessment, decide who will be in charge of copying). Testing modifications and accommodations for special education students should be provided, ensuring these students are assessed the way they are instructed, so that valid inferences can be drawn and assessment bias is minimized. Set the date and time at which the grade level or class will take the exam, so that all students are administered the same exam at the same time and test reactivity is avoided. The scoring scale and rubric should be established well before the assessment is administered, and before instruction occurs; having students participate in building the rubric is also recommended. Finally, prepare grading sheets, determine ahead of time who will score the test and when, and decide how the data will be analyzed and reported.
The fifth-grade students are given the assessment in early September. The multiple-choice portion is graded separately from the constructed-response portion; a group of teachers meets at a mutual time to grade the constructed responses. A grid covering all portions of the test is created in Excel. Down the left side of the grid are the students' names; along the top are the curricular aims addressed and the corresponding test question numbers. Each question a student misses gets a check under the question and the curricular aim it corresponds to, and along the bottom of the grid is a tally of the number missed for each item. Off to the side, a total-correct tally is kept for both parts of the assessment, along with a total-score area. (See the attached example for clarification.)
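Since the Excel grid is really just a two-way table of students by questions, the same tallies can be sketched in Python. Everything here (student names, the question-to-aim mapping, and the miss patterns) is invented for illustration.

```python
from collections import defaultdict

# Hypothetical mapping: question number -> curricular aim it assesses.
question_aims = {1: "5N1", 2: "5N1", 3: "5N3", 4: "5N3", 5: "5A2"}

# Hypothetical results: student -> set of question numbers missed.
missed = {
    "Student A": {2, 3},
    "Student B": {3, 4},
    "Student C": set(),
    "Student D": {3, 4, 5},
}

# Tally misses per question (the checks along the bottom of the grid).
miss_tally = defaultdict(int)
for questions in missed.values():
    for q in questions:
        miss_tally[q] += 1

# Total correct per student (the tally off to the side of the grid).
total_items = len(question_aims)
for student, wrong in missed.items():
    print(f"{student}: {total_items - len(wrong)}/{total_items} correct")

# Roll the misses up by curricular aim to see what was not mastered.
aim_misses = defaultdict(int)
for q, count in miss_tally.items():
    aim_misses[question_aims[q]] += count
print(dict(aim_misses))   # e.g., {'5N1': 1, '5N3': 5, '5A2': 1}
```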
The items with the most misses are considered not to have been mastered by those students. Since there are a number of items the group as a whole is struggling with, the teacher will use these items as key targets in upcoming instruction. Looking at this analysis grid, it might seem that such items should be thrown out; however, since this assessment is being used to guide instruction, the teachers are using it to see what was and was not learned, so the questions remain valid. On the December benchmark these items would hopefully be less problematic and instead show the learning that has occurred. The same can be said for the items scored well on: in some assessment situations these might be items to throw out, but in this case they show mastery and the learning that has occurred. The teachers in this group might create groups to target instruction for students in their areas of weakness and use formative assessments to show mastery along the way. For the students who performed well, newer curricular aims can be used to begin instruction, as they have shown mastery and are ready to move forward; a pre-assessment should be created and administered to guide that instruction.
In December these teachers will create another assessment. This time it will contain problems from the areas that were not mastered by the majority on the September assessment, as well as from the new curricular aims students have since been introduced to, in order to see whether mastery has occurred and which areas still need to be targeted for instruction. The teachers use grouping models for the math students and will re-group students according to areas of mastery and weakness.
The process described above is very thorough and seems to cover the areas suggested by the author for creating assessments that are not only valid but also reliable. I was especially encouraged that these teachers get together over the summer and during the year to re-evaluate their assessments and continually inform their instruction. This is a practice all teachers should strive for and be encouraged to participate in. Many of the practices Popham describes for sound assessment building and administration were followed by these teachers, and in the areas where they were not, the fact that they are trying suggests that over time their practice will align more tightly with what Popham recommends. It was very exciting to attend this session and hear about the thoughtfulness occurring in testing practices within New York State. Even teachers who may not agree with the approach these teachers used should strive for more collaboration and sharing; promoting shared learning and teaching will only benefit student learning and mastery. The forms and processes described by these teachers aligned very nicely with the text used for this study and were very helpful in pulling it all together.
Sunday, June 8, 2008
Chapter 15 – Evaluating Teaching and Grading Students
Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - By W. James Popham
Chapter 15 addresses evaluating teaching and grading students. These two topics, while sometimes used interchangeably, are separate functions: evaluation focuses on determining the effectiveness of teachers, while grading focuses on informing students how they are performing. Pre-instruction and post-instruction assessment practices are discussed, as well as the split-and-switch design for informing instruction. The use of standardized achievement tests for evaluating students and instruction is also weighed. Three schemes for grading are described, along with a more commonly used practice.
There are two types of evaluation used in appraising teachers' instructional efforts. The first is formative evaluation: the appraisal of a teacher's instructional program for purposes of improving that program. The second is summative evaluation, which is not improvement-focused; it is an appraisal of a teacher's competence used to make more permanent decisions, typically about continued employment or the awarding of tenure. Summative evaluation is usually carried out by an administrator or supervisor. Teachers will typically do their own formative evaluation in order to improve their instruction, and summative data may be supplied to administrators or supervisors to demonstrate teaching effectiveness.
Instructional impact can be gauged by comparing pre-instruction and post-instruction assessment. Assessing students prior to instruction is pre-assessment; assessing after instruction has occurred is post-assessment and gives an indication of the learning that has taken place. This scenario, however, can be reactive: students are sensitized by the pre-assessment to what needs to be learned and then perform well on the post-assessment as a result. An alternative that avoids this problem is the split-and-switch design, a data-gathering design that works best with large groups of students. In this model you split your class and administer two similar test forms, one to each half, as pre-tests. After instructing the whole group, you switch the forms between the halves and post-test. Blind scoring, in which someone else grades the tests (another teacher, a parent, or an assistant), should then occur, and the results are pooled for each form. Differential difficulty between the forms causes no problem in this design, and because students have not previously seen their post-test form, reactivity is not a concern; as a result, the instructional impact should be visible.
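Here is a minimal sketch of the split-and-switch assignment logic under the assumptions stated in the chapter (two similar forms, a random split, forms swapped for the post-test); the roster and form labels are hypothetical.

```python
import random

def split_and_switch(roster):
    """Randomly split a class and assign swapped pre/post test forms."""
    students = list(roster)
    random.shuffle(students)
    half = len(students) // 2
    group1, group2 = students[:half], students[half:]
    # Group 1: Form A as pre-test, Form B as post-test; Group 2 swapped.
    assignments = {s: {"pre": "Form A", "post": "Form B"} for s in group1}
    assignments.update({s: {"pre": "Form B", "post": "Form A"} for s in group2})
    return assignments

plan = split_and_switch(["Ana", "Ben", "Cai", "Dev", "Eli", "Fay"])
# Because each form serves as both a pre-test and a post-test, differences
# in form difficulty cancel out when results are pooled, and no student
# sees the same form twice, so reactivity is avoided.
```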
A common way of evaluating teaching has been through students' performance on standardized achievement tests. For most achievement tests, this is a very inappropriate way to evaluate instructional quality. A standardized test is any exam administered and scored in a predetermined, standard manner. There are two major forms of standardized tests: aptitude tests and achievement tests. Schools' effectiveness is typically judged on standardized achievement tests, of which there are three types: traditional national standardized achievement tests, standards-based achievement tests that are instructionally insensitive, and standards-based achievement tests that are instructionally sensitive.
The purpose of nationally standardized achievement tests is to allow valid inferences about the knowledge and skills a student possesses in a certain content area, with performance compared against a norm group of students of the same age and grade. The dilemma is that there is so much that would need to be tested that only a small sampling is possible, which forces the assumption that the norm group genuinely represents the nation at large. Even if that assumption holds, these tests should not be used to evaluate the quality of education; that is not their purpose. The tests are also unlikely to be rigorously aligned with a state's curricular aims, and items covering important content emphasized by the classroom teacher may be eliminated in the quest for score spread. The final reason nationally standardized achievement tests should not be used to evaluate teachers' success is that many items are linked to students' socioeconomic status (SES) or their inherited academic aptitude; in essence, they measure what students bring to school, not what they learn at school.
Standards-based tests sound like they would make much more sense. Two problems have plagued instructionally insensitive standards-based tests: the large number of content standards they attempt to address, and reported results that have limited instructional value. If properly designed, however, standards-based tests can be instructionally sensitive. Three attributes must be present: the test must measure a manageable number of curricular aims; the skills and/or bodies of knowledge must be clearly described so students' mastery is clear; and the test results must allow clear identification of each assessed skill or body of knowledge a student has mastered. A standards-based test not possessing all three attributes is not instructionally sensitive and is therefore of little use. Instructionally sensitive standards-based tests are the right kind of tests to use to evaluate schools.
Teachers also need to inform students of how well they are doing and how well they have done, as a demonstration of what they have learned and the extent of their achievement. Serious thought should be given to identifying the factors to consider when grading and how much each factor will count. There are three common grade-giving approaches. The first is absolute grading, in which a grade is based on the teacher's idea of what level of student performance is necessary to earn each grade; this method resembles the criterion-referenced approach to assessment. The second is relative grading, in which a grade is based on how students perform in relation to one another; this requires flexibility from class to class because the make-up of classes changes, and it is close to a norm-referenced approach. The third option is aptitude-based grading, in which a grade is assigned based on how well the student performs in relation to the student's potential; this tends to "level the playing field" by grading according to ability and encouraging full potential. Given these three options, researchers have found that teachers really use a more "hodgepodge" form of grading, based loosely on judgments of students' assessed achievement, effort, attitude, in-class conduct, and growth, with low performance in any of these areas resulting in a lower grade. There is no scientific, quantitative model for clear-cut grades under the hodgepodge method; it is largely judgmental, but it is widely used and accepted by teachers and students.
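To make the contrast between the first two schemes concrete, here is a hedged sketch in Python; the letter-grade cutoffs and the class scores are invented, and aptitude-based grading is omitted because "potential" has no obvious numeric stand-in.

```python
# Hypothetical fixed performance levels for absolute grading.
CUTOFFS = {"A": 90, "B": 80, "C": 70, "D": 60}

def absolute_grade(score, cutoffs=CUTOFFS):
    """Grade against fixed performance levels (criterion-referenced)."""
    for grade, minimum in cutoffs.items():
        if score >= minimum:
            return grade
    return "F"

def relative_grade(score, class_scores):
    """Grade by standing within the class (norm-referenced)."""
    rank = sum(1 for s in class_scores if s <= score) / len(class_scores)
    if rank >= 0.8: return "A"
    if rank >= 0.6: return "B"
    if rank >= 0.4: return "C"
    if rank >= 0.2: return "D"
    return "F"

scores = [55, 62, 64, 68, 71, 75]
# In absolute terms a 71 earns a C; relatively, the same 71 is near the
# top of this low-scoring class and earns an A.
print(absolute_grade(71), relative_grade(71, scores))   # C A
```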
This final chapter has drawn the distinction between evaluating and grading: evaluating the quality of teachers' instruction versus grading students. Also discussed was the inappropriateness of using national standardized achievement tests to evaluate teachers, and the difference between instructionally insensitive and instructionally sensitive standards-based achievement tests. Grading was then discussed, along with the importance of developing criteria and weightings before grades are actually dispensed. Three grading options were described, and the reality of "hodgepodge" grading was presented.
Tuesday, May 27, 2008
Chapter 14 – Appropriate and Inappropriate Test-Preparation Practices
Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - By W. James Popham
Chapter 14 deals with appropriate and inappropriate test-preparation practices. This is a newly recognized concern resulting from public reporting on education and the No Child Left Behind (NCLB) legislation. The pressure to increase test scores has led to improper and unethical test-preparation practices, but there are steps that ensure good testing practice and enhance student learning.
Educational achievement tests are administered to help teachers make good inferences about students' knowledge and/or skills in relation to the curricular aims being addressed. An achievement test should sample the curricular aim being addressed so that it can serve as an indication of mastery or non-mastery. A student's score on the achievement test should correspond to the student's actual mastery of the curricular aim; when it does, we are seeing good test-preparation practice.
There are two guidelines that can be used to ensure appropriate test-preparation practices. The first is the professional ethics guideline: no test-preparation practice should violate the ethical norms of the education profession. State-imposed security procedures for high-stakes tests should never be compromised; breaching proper test practices erodes public confidence and can result in the loss of teaching credentials. The second is the educational defensibility guideline: no test preparation should increase students' test scores without simultaneously increasing their mastery of the curricular aim tested. This guideline emphasizes instructional practices that focus on students and their best interests.
There are five common test-preparation practices. These practices generally occur in the classroom, though special instruction sometimes needs to occur outside regular class time.
1. Previous-form preparation provides special instruction and practice based directly on an actual, previously used form of the test. This practice violates educational defensibility, as test scores may be boosted without a rise in mastery of curricular aims.
2. Current-form preparation provides special instruction and practice based directly on the form of the test currently being used. This practice violates both the educational defensibility and professional ethics guidelines; any access to a current form before its official release is a form of stealing or cheating.
3. Generalized test-taking preparation provides special instruction in test-taking skills for dealing with a variety of achievement-test formats. Students learn how to make calculated guesses and how to use testing time judiciously. This is an appropriate form of test preparation: students can cope with different test types, are apt to be less intimidated by the various items they encounter, and their scores give a more accurate reflection of true knowledge and skill.
4. Same-format preparation provides regular classroom instruction dealing directly with the content covered on the test, using practice items in the format of the actual test. The items are clones of the test items, and for many students indistinguishable from them. While this practice may not be unethical, it is not educationally defensible: if students can only handle the test's exact format, they are not prepared to show what they have learned elsewhere, so test scores may rise without a corresponding demonstration of curricular-aim mastery.
5. Varied-format preparation provides regular classroom instruction dealing with the content covered on the test, but with practice items representing a variety of test-item formats. This practice satisfies both the ethical and the educational defensibility guidelines; content tied to the curricular aims is practiced in a variety of formats, so both test scores and curricular-aim mastery should rise.
A popular expression today is "teaching to the test," a statement that can carry a negative connotation. Negatively applied, it means the teacher is directing instruction specifically at the items on the test itself, which is of course a form of bad instruction. A positive way to "teach to the test" is to aim instruction toward the curricular aims represented by the test. To avoid confusion about instructional practice, the author suggests abandoning the phrase "teaching to the test" in favor of "teaching to the curricular aim represented by the test."
Raising students' scores on high-stakes tests is a common theme throughout public education. If teachers and administrators are pressured to do this, they should do so only if they are provided curricular aims that are aligned with the assessment. Test items should be accompanied by descriptions of what they represent that are suitable for instructional planning; if curricular-aim descriptions are not provided for test items, score boosting should not be expected.
This chapter presented appropriate and inappropriate test-preparation practices. Two guidelines were explained: professional ethics and educational defensibility. Five test-preparation practices were described, along with how the guidelines apply to each; varied-format and generalized test preparation were the two sound practices that satisfy both guidelines. The phrase "teaching to the test" was examined, both negatively and positively, and a new phrase was suggested: "teaching to the curricular aim." The idea that high-stakes tests should supply curricular-aim descriptions so that solid instructional decisions can be made was also discussed.
Wednesday, May 21, 2008
Chapter 13 – Making Sense Out of Standardized Test Scores
Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - By W. James Popham
Chapter 13 focuses on standardized test scores and making sense of them. There are a variety of ways to interpret standardized tests and their scores; which interpretation is appropriate depends on what the scores are intended to measure. Teachers need to understand how the various tests they administer are scored. By understanding the scoring, teachers can inform their instruction and make sense of students' performance on these tests and what the scores mean.
There are two types of standardized tests: one is designed to yield norm-referenced inferences, the other criterion-referenced inferences. These tests are administered, scored, and interpreted in a standard, predetermined manner. National standardized achievement tests are typically used to provide norm-referenced interpretations focused on the measurement of aptitude or achievement. State-developed standardized achievement tests are used in many states for accountability purposes; some states use them to assess the basic skills of high school students and to allow or disallow a student from receiving a diploma, even if curriculum requirements have been met. Results are often publicized and treated as indicators of educators' effectiveness. State standardized tests are intended to produce criterion-referenced interpretations.
Test scores can be interpreted individually or as a group. Group-focused interpretations are necessary for looking at all your students or at groups of students, and three ways of doing so are described. The first is computing central tendency, an index of the group's scores such as the mean or median. A raw score is the number of items a student answered correctly. The median raw score is the midpoint of a set of scores; the mean raw score is the average of a set of scores. Both show a center point for group scores. A second useful way to describe group scores is variability, that is, how spread out the scores are. A simple measure of the variability of a set of students' scores is the range, calculated by subtracting the lowest score from the highest score. A third way to look at group scores is the standard deviation, which reflects the average difference between the individual scores in a group and the mean of that set of scores. The larger the standard deviation, the more spread out the scores in the distribution. The formula for the standard deviation is as follows:
SD = √( Σ(x − M)² / N )

Σ(x − M)² = the sum of the squared deviations of each raw score (x) from the mean (M)
N = the number of scores in the distribution
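Putting the three group-score descriptors together, here is a minimal Python sketch of the mean, median, range, and standard deviation formula above. The raw scores are hypothetical, invented only for illustration.

    from math import sqrt

    def describe_scores(scores):
        n = len(scores)
        mean = sum(scores) / n
        ordered = sorted(scores)
        mid = n // 2
        # Median: midpoint of the ordered scores (average the two
        # middle values when the count is even).
        median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
        score_range = max(scores) - min(scores)
        # SD = sqrt( sum((x - M)^2) / N ), per the formula above.
        sd = sqrt(sum((x - mean) ** 2 for x in scores) / n)
        return mean, median, score_range, sd

    scores = [14, 18, 20, 21, 25, 25, 27, 30]  # hypothetical raw scores
    mean, median, score_range, sd = describe_scores(scores)
    print(f"mean={mean:.2f} median={median} range={score_range} sd={sd:.2f}")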
The mean and standard deviation are the best ways to describe and discuss group scores. The median and range may be easier to compute, but they are a less reliable way to characterize a set of scores, so the mean and standard deviation are preferred.
Individual test interpretations are also necessary. There are two ways to interpret individual test scores: in absolute or in relative terms. An absolute inference is when we infer from a score what the student has or has not mastered of the skills and/or knowledge being assessed. A relative inference is when we infer from a score how the student stacks up against other students currently taking, or who have already taken, the test. The terms below average and above average are typically used with relative inferences.
There are three interpretive schemes to consider for relative score interpretations. The first is percentiles, or percentile ranks. A percentile compares a student's score with those of other students in a norm group; it indicates the percent of students in the norm group the student outperformed. The norm group consists of the students who took the test before it was published, in order to establish norms and help identify appropriate test items. There are different types of norm groups, such as national norm groups and local norm groups. The second interpretive scheme is grade-equivalent scores. A grade-equivalent is an indicator of student test performance based on grade level and month of the school year. The purpose of grade-equivalent scores is to convert scores on a standardized assessment to an index reflecting a student's grade-level progress in school. This is a developmental score and is written as follows:
grade.month of school year
e.g., 5.6 (fifth grade, sixth month of the school year)
Grade-equivalent scores are typically seen in reading and math. They are determined by administering the same test to several grade levels and establishing a trend line that reflects how raw scores increase at each grade level. Estimates at points along the trend line indicate what the grade-equivalent of a given raw score would be. Many assumptions are built into this scheme, making it rather questionable, and it can mislead parents about what a score really means. The appropriate interpretation is that a grade-equivalent score is an estimate of how the average student at a certain grade might score on this test. The third scheme is scale-score interpretations. Scale scores are converted raw scores that use a new, arbitrarily chosen scale to represent levels of achievement or ability. The most popular scale-score systems are based on item response theory (IRT). These differ from raw-score reporting in that IRT scales take into consideration the difficulty and other technical properties of every single item on the test, and there is a different average scale score for each grade level. Scale scores are used heavily to describe group test performance at the state, district, and school levels; they permit longitudinal tracking of students' progress and direct comparisons of classes, schools, and districts. It is very important to remember that not all scale scores are alike, so scores from different scale-score exams cannot be compared consistently. Standardized tests also report normal curve equivalents, or NCEs, which attempt to convert students' raw scores into percentile-like units. If students' scores formed a perfectly symmetrical bell curve the NCE would work as intended, but sometimes the distribution is not normal and the meaning of the NCE evaporates, so NCEs never became a solution for comparing different standardized tests. A stanine is like an NCE, but it divides a score distribution into nine segments that, though equal along the baseline of the set of scores, contain different proportions of the distribution's scores. Stanines are approximate scale scores.
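To make two of these relative interpretations concrete, here is a small Python sketch of a percentile rank (the percent of the norm group a student outperformed) and a stanine lookup. The 4-7-12-17-20-17-12-7-4 percentage split is the conventional stanine scheme; the norm-group scores and the cut points are illustrative assumptions, not data from the chapter.

    import bisect

    def percentile_rank(score, norm_scores):
        # Percent of the norm group this score outperformed.
        outperformed = sum(1 for s in norm_scores if s < score)
        return 100 * outperformed / len(norm_scores)

    # Cumulative upper bounds (in percentile terms) of stanines 1-8 under
    # the conventional 4-7-12-17-20-17-12-7-4 split; stanine 9 is the rest.
    STANINE_CUTS = [4, 11, 23, 40, 60, 77, 89, 96]

    def stanine(pr):
        return bisect.bisect_right(STANINE_CUTS, pr) + 1

    norm_group = [12, 15, 18, 20, 22, 25, 27, 29, 31, 34]  # hypothetical
    pr = percentile_rank(28, norm_group)
    print(f"percentile rank = {pr:.0f}, stanine = {stanine(pr)}")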
There are two tests widely used to predict a high school student's academic success in college. The first is the SAT, or Scholastic Aptitude Test. Its original function was to assist admissions officials at a group of elite Northeastern universities in deciding whom to admit, and the test was designed to compare inherited verbal, quantitative, and spatial aptitudes. Today it is divided into three sections: critical reading, writing, and mathematics. The SAT uses a score range of 200 to 800 for each of the three sections, so as of 2005 the highest total score that can be earned is 2400. The test takes about three hours to administer, and there is also a PSAT to help students prepare for it. The second test is the ACT, also an aptitude test. Unlike the SAT, the ACT, or American College Test, was created as a measure of educational development for soldiers taking advantage of the GI money awarded to them for college; the SAT was sometimes too difficult or inappropriate for them, so a new measure was needed. The ACT addresses four content areas: English, mathematics, reading, and science. Like the SAT, it takes about three hours to administer. The ACT is scored by giving one point for every correct answer, with no points subtracted for wrong answers; an average is then computed across the four sections, unlike the SAT, where section scores are added together. One very important point about the SAT and ACT is that only about 25 percent of students' academic success in college is associated with their performance on these tests; the other 75 percent has to do with non-test factors. Therefore, students who do not do well on these tests should not be discouraged from attending college.
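As a small illustration of the two scoring schemes just described (section scores summed for the SAT, averaged for the ACT), here is a hedged Python sketch. The section scores are invented, and the rounding of the ACT composite to the nearest whole number is an assumption for illustration, not the testmakers' exact procedure.

    def sat_total(critical_reading, writing, mathematics):
        # SAT: each section ranges 200-800; sections are added (max 2400).
        return critical_reading + writing + mathematics

    def act_composite(english, mathematics, reading, science):
        # ACT: section scores are averaged into a composite (rounding assumed).
        return round((english + mathematics + reading + science) / 4)

    print(sat_total(620, 580, 700))        # -> 1900
    print(act_composite(24, 27, 25, 23))   # -> 25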
Standardized tests are assessment instruments that are administered, scored, and interpreted in a predetermined, standard manner, and they are used to get a handle on students' achievement and aptitude. Group test scores are described in two ways, by central tendency and by variability: central tendency uses the mean and median, and variability uses the range and standard deviation. Interpreting results through percentiles, grade-equivalent scores, scale scores, stanines, and normal curve equivalents was also explained, along with the strengths and weaknesses of each. The SAT and ACT were also described and explained.
Saturday, May 10, 2008
Chapter 12 – Instructionally Oriented Assessment
Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - By W. James Popham
Chapter 12 discusses how classroom instruction can be guided by assessment. Two strategies are discussed: making instructional decisions in light of assessment results, and planning instruction to achieve the curricular aim(s) represented by an assessment.
The first strategy, making instructional decisions in light of assessment results, is known to most teachers; it is just not widely used. Since teachers are typically in the practice of using assessments to assign grades, they don't always see them as tools for making instructional decisions. Teachers use tests to find out the level of knowledge learned. The curricular aims tested could be cognitive, affective, or psychomotor, each of these areas being quite substantial in scope. Sampling these curricular aims through assessment can support inferences about student status, and those inferences can then be used to make instructional decisions. Other types of tests can be used to help determine grades, but assessments written to determine levels of knowledge should be used to inform instruction only.
Assessments used to inform instruction should rest on three categories of decisions. The first is what to teach; this can be addressed with a pre-assessment given before instruction for a specific objective. The second is how long to keep teaching toward a particular objective; this can be assessed during instruction, and the resulting formative-assessment decision can be to continue or cease instruction on an objective for a student or for the whole class. The third is how effective an instructional lesson or unit was; this can be assessed by comparing students' pretest and posttest results, and the decision to retain, discard, or modify a given lesson or unit can follow from it.
The second assessment-based strategy for improving classroom instruction is planning instruction to achieve the curricular aim(s) represented by an assessment. This strategy is a rethinking of educational assessment: for testing to influence teaching, tests should be constructed prior to instructional planning. Planned instructional units can then better coincide with the content of the test, and the test can inform instructional planning. In this model the teacher starts with the instructional objectives set forth in the curriculum, then creates an assessment based on those goals; afterward, the pre-assessment helps the teacher plan instructional activities intended to promote student mastery of the knowledge, skills, and/or attitudes to be post-assessed. Curriculum should always be the starting point. The assessment then clarifies the instructional aims and whether the intended skills have been mastered; teachers should therefore never teach toward the test items themselves.
There are three benefits to this strategy. The first is more accurate task analysis: with a clearer idea of the results you are after, you can better identify the enabling knowledge and skills students need before mastering what is taught. The second is more on-target practice activities: with a better sense of your end-of-unit outcomes, you can choose guided-practice and independent-practice activities more aligned with the targeted outcomes. The third is more lucid expositions: understanding more clearly what will be assessed at the conclusion of instruction lets you give students clearer explanations about the content and where instruction is heading.
The idea of assessment for learning has also evolved over time. In the past, teachers used assessments mostly to establish what students know and don't know; this is known as assessment of learning. While assessment of learning is important and should be utilized, assessment for learning should be utilized as well, and even more so. It has been shown that students who were given assessment for learning achieved in six to seven months what it took other students a full year to achieve. Five strategies are suggested for implementing an assessment-for-learning sequence: first, clarify and share learning intentions and criteria for success; second, engineer effective classroom discussions, questions, and learning tasks; third, provide feedback that moves learners forward; fourth, activate students as owners of their own learning; and fifth, activate students as instructional resources for one another. Again, this is a huge shift for many teachers, but the learning benefits for students are tremendous. Formative assessments can also be used to help students achieve higher scores, especially on summative assessments.
The ideas of this chapter centered on improving instructional decisions and instruction based on information gained from assessments. Two strategies were described: making instructional decisions in light of assessment results, and planning instruction to achieve the curricular aim(s) represented by an assessment.
Saturday, May 3, 2008
Chapter 11 – Improving Teacher-Developed Assessments
Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - By W. James Popham
Chapter 11 provides procedures for improving teacher-developed assessments. Two general improvement strategies are discussed: the first is judgmental item improvement and the second is empirical item improvement. Teachers need to set aside adequate time and be motivated to apply either of these strategies.
The first improvement strategy is judgmentally based improvement procedures. These judgments are provided by yourself, your colleagues, and your students, and each of the three judges follows different procedures. The first judge is you, judging your own assessments. The five criteria for judging your own assessments are: first, adherence to item-specific guidelines and the general item-writing commandments; second, contribution to the score-based inference; third, accuracy of content; fourth, absence of content lacunae, or gaps; and fifth, fairness. The second judge is a collegial judge, who is given the same five criteria along with a brief description of each. The third judge is the students. Students are often overlooked as a source of test analysis, but who is better positioned or more experienced than the test takers themselves? There are five improvement questions students can be asked after an assessment is administered. First, did any of the items seem confusing, and which ones were they? Second, did any items have more than one correct answer? If so, which ones? Third, did any items have no correct answer? If so, which ones? Fourth, were there words in any items that confused you? If so, which ones? Fifth, were the directions for the test, or for particular subsections, unclear? If so, which ones? These questions can be altered slightly for constructed-response assessments, performance assessments, or portfolio assessments. Ultimately the teacher needs to be the final decision maker on any needed changes.
The second improvement strategy is known as empirically based improvement procedures. This strategy is based on the empirical data supplied when students respond to teacher-developed assessments. A variety of well-honed techniques involving simple counts or formula calculations have been used over the years. The first technique is the difficulty index. A useful index of an item's quality is its difficulty, also referred to as its p-value. The formula is as follows:
p = R / T

R = the number of students responding correctly to the item
T = the total number of students responding to the item
A p-value can range from 0 to 1.00. Higher p-values indicate items that more students answered correctly; an item with a low p-value is one most students missed. A p-value should be viewed in relation to the students' chance probability of a correct response: for example, on a four-option multiple-choice item, a p-value of .25 would be expected by chance alone. The actual difficulty of an item should also be tied to the instructional program, so a high p-value does not necessarily mean the item was too easy; it may mean the content was taught well.
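A minimal Python sketch of the difficulty index p = R / T, where each response record simply marks whether the student answered the item correctly. The response data are invented for illustration.

    def p_value(responses):
        # p = R / T: correct responses over total responses.
        return sum(responses) / len(responses)

    # Hypothetical responses to one item (True = answered correctly).
    item_responses = [True, True, False, True, False, True, True, False]
    p = p_value(item_responses)
    print(f"p = {p:.2f}")  # 0.62, well above the .25 chance level of a four-option item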
The second empirically based technique is the item-discrimination index. An item-discrimination index typically tells how frequently an item is answered correctly by those who perform well on the total test. It reflects the relationship between students' responses to a particular test item and their scores on the total test; the author uses a correlation coefficient between students' total scores and their performance on a particular item, also referred to as a point-biserial correlation. A positively discriminating item is answered correctly more often by those who score well on the total test than by those who score poorly. A negatively discriminating item is answered correctly more often by those who score poorly on the total test than by those who score well. A nondiscriminating item is one for which there is no appreciable difference in correct responses between those who score well and those who score poorly. Four steps can be applied to compute an item's discrimination index: first, order the test papers from high to low by total score; second, divide the papers evenly into a high group and a low group; third, calculate a p-value for each group; and lastly, subtract p_low from p_high to obtain the item's discrimination index (D):

D = p_high − p_low

The following chart can also be created:
Positive discriminator: high scorers > low scorers
Negative discriminator: high scorers < low scorers
Nondiscriminator: high scorers = low scorers
Negatively discriminating items are a sign that something about the item is wrong, because the item is being missed more often by students who perform well on the test overall than by students who do not. Ebel and Frisbie (1991) offer the following table:
.40 and above = very good items
.30 to .39 = reasonably good items; possible improvement needed
.20 to .29 = marginal items; these items should be improved
.19 and below = poor items; remove or revise these items
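The four-step procedure can be expressed as a short Python sketch. This is an illustrative implementation, not code from the book: the (total score, item-correct) records are invented, and splitting the ordered papers into equal halves is one simple reading of step two.

    def discrimination_index(records):
        # records: (total_test_score, answered_this_item_correctly) pairs.
        ordered = sorted(records, key=lambda r: r[0], reverse=True)
        half = len(ordered) // 2
        high, low = ordered[:half], ordered[-half:]
        # Step three: a p-value for each group; step four: D = p_high - p_low.
        p_high = sum(correct for _, correct in high) / half
        p_low = sum(correct for _, correct in low) / half
        return p_high - p_low

    records = [(38, 1), (35, 1), (33, 1), (30, 1), (27, 0),
               (25, 1), (22, 0), (20, 0), (18, 0), (15, 0)]
    D = discrimination_index(records)
    print(f"D = {D:.2f}")  # 0.60: a very good item by the Ebel and Frisbie table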
The third technique offered for empirically based improvement is distracter analysis. In a distracter analysis we see how the high and low groups responded to each of an item's distracters. This analysis is used to dig deeper into items flagged by their p-value or discrimination index as needing revision. A table such as the following is set up:
Test item: p = .50, D = −.33 (B* is the keyed answer)

Group       A    B*   C    D    Omit
Upper 15    2    5    0    8    0
Lower 15    4   10    0    0    1
In this example the correct response is B, marked with the asterisk. Notice that the students doing well are choosing D, while the students not doing well are choosing B, and that no one is choosing C. As a result of this analysis, something should be changed in C to make it more appealing as a distracter, and B should be examined to see what is causing weaker students to choose it while stronger students avoid it.
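A small Python sketch of this analysis, tallying the table above so that weak distracters (like C) and a troublesome keyed answer (like B) stand out. The counts reproduce the hypothetical table; the helper names are my own.

    from collections import Counter

    # How each group of 15 students chose among the alternatives.
    upper = Counter({"A": 2, "B": 5, "C": 0, "D": 8, "Omit": 0})
    lower = Counter({"A": 4, "B": 10, "C": 0, "D": 0, "Omit": 1})
    key = "B"

    for alt in ["A", "B", "C", "D", "Omit"]:
        flag = " <- key" if alt == key else ""
        print(f"{alt:>4}: upper {upper[alt]:>2}  lower {lower[alt]:>2}{flag}")

    # D for the keyed answer: (5 - 10) / 15 = -0.33, a negative discriminator.
    print(f"D = {(upper[key] - lower[key]) / 15:.2f}")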
Criterion-referenced measurements can also use a form of item analysis, to avoid the low discrimination indices that traditional item analysis tends to yield for them. The first approach requires administering the assessment to the same group of students prior to and following instruction. One disadvantage is that the item analysis cannot occur until after both assessments have been given and instruction has occurred. Pretests may also be reactive, sensitizing students to certain items, so that posttest performance is a result of the pretest plus instruction, not just instruction. The formula for this procedure is as follows:
Dppd = Ppost − Ppre

Ppost = the proportion of students answering the item correctly on the posttest
Ppre = the proportion of students answering the item correctly on the pretest
The value of Dppd (discrimination based on pretest-posttest differences) can range from −1.00 to +1.00, with high positive values indicating effective instruction. The second approach to item analysis for criterion-referenced measurement is to locate two different groups of students, one that has received instruction and one that has not. By assessing both groups and applying the item-analysis procedures described earlier, you can gain information on item quality more quickly and avoid reactive pretests. The disadvantage is that you are relying on human judgment.
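A minimal Python sketch of the pretest-posttest index: it converts the correct counts on each administration to proportions and takes the difference. The counts are hypothetical.

    def d_ppd(pre_correct, post_correct, n_students):
        # Dppd = Ppost - Ppre for a single item.
        p_pre = pre_correct / n_students
        p_post = post_correct / n_students
        return p_post - p_pre

    # Hypothetically, 6 of 25 students correct before instruction, 21 of 25 after.
    print(f"Dppd = {d_ppd(6, 21, 25):.2f}")  # 0.60: a high positive, suggesting effective instruction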
Two strategies have been described: judgmental and empirical methods. Both can be applied to improve assessments. When using judgmental procedures, five criteria should be used by the assessment's creator and by a collegial judge (with descriptions), along with a series of questions for students about the assessment. When using the more empirical methods, several types of item analysis and discrimination indices can be computed by applying the various formulas. Ultimately the assessment's creator needs to apply good techniques and be willing to spend some time improving their assessments.