Overview of Summative Assessment
States, J., Detrich, R. & Keyworth, R. (2018). Overview of Summative Assessment. Oakland, CA: The Wing Institute. https://www.winginstitute.org/assessment-summative.
Research supports the power of assessment to amplify learning and skill acquisition (Başol & Johanson, 2009). Summative assessment is a form of appraisal that occurs at the end of an instructional unit or at a specific point in time, such as the end of the school year. It evaluates mastery of learning and offers information on what students know and do not know. Frequently, summative assessment consists of evaluation tools designed to measure student performance against predetermined criteria based on specific learning standards. Examples of commonly employed tools include Advanced Placement exams, the National Assessment of Educational Progress (NAEP), end-of-lesson tests, midterm exams, final projects, and term papers. These assessments are routinely used for making high-stakes decisions; for this purpose, student knowledge or skill acquisition is often compared with standards or benchmarks (for example, the Common Core Standards and high school graduation tests).
Summative assessment carries great weight because educators use data from high-stakes tests to make decisions with significant long-term consequences for a student’s future. Passing bestows important benefits, such as receiving a high school diploma, a scholarship, or entry into college, and failure can affect a child’s future employment prospects and earning potential as an adult (Geiser & Santelices, 2007). Additionally, summative assessment plays a role in improving future instruction by providing educators with data on the effectiveness of curriculum and instruction. Knowing what methods worked for a lesson or semester may not help current students, but it can provide educators with the necessary insights into how and where to redesign instructional practices to elevate next year’s student scores (Moss, 2013).
Despite the important role of summative assessment in education, research finds little evidence to support it as a critical factor in improved student achievement (Rosenshine, 2003; Yeh, 2007). Figure 1 provides a comparison of the effect size of formative assessment and high-stakes testing (an instrument of summative assessment), gleaned from multiple studies conducted over more than 40 years.
Figure 1. Comparison of formative assessment and summative assessment impact on student achievement
Because summative assessment happens after instruction is over, it has little value as a diagnostic tool to guide teachers in making timely adjustments to instruction aimed at catching students who are falling behind. It does not provide teachers with vital information to use in crafting remedial instruction. Formative assessment is a much more effective instrument for adjusting instruction to help students master material (Garrison & Ehringhaus, 2007; Harlen & James, 1997).
Despite these shortcomings, summative assessment plays a pivotal role in education by troubleshooting weaknesses in the system. It provides educators with valuable information to determine the effectiveness of instruction for a particular unit of study, to make high-stakes decisions, and to evaluate the effectiveness of schoolwide interventions. It works to improve overall instruction (1) by providing feedback on progress measured against benchmarks, (2) by helping teachers to improve, and (3) by serving as an accountability instrument for continuous improvement of systems (Hart et al., 2015).
Types of Summative Assessment
Educators generally rely on two forms of summative assessment: teacher-constructed (informal) and standardized (systematic). Teacher-constructed assessment is the most common form of assessment found in classrooms. It can provide objective data for appraising student performance, but it is vulnerable to bias. Standardized assessment is designed to overcome many of the biases that can taint teacher-constructed tools, but this form of assessment has its own limitations. Both types of summative assessment have a place in an effective education system, but for maximum positive effects they should be employed to meet the needs for which they were designed.
Teacher Constructed (Informal)
Teacher-constructed assessment, the most common and frequently applied type of summative assessment, is derived from teachers’ daily interactions and observations of how students behave and perform in school. Since schools began, teachers have depended predominantly on informal assessment, which today includes teacher-constructed tests and quizzes, grades, and portfolios, and relies heavily on a teacher’s professional judgment. Teachers inevitably form judgments, often accurate, about students and their performance (Barnett, 1988; Spencer, Detrich, & Slocum, 2012). Although many of these judgments help teachers understand where students stand in mastering a lesson, a meaningful percentage result in false understandings and conclusions. To be effective, a teacher-constructed assessment must deliver the information the teacher needs to draw accurate conclusions about each student’s performance in a content area and to feel confident that performance is linked to instruction. Ensuring that a teacher-constructed instrument is reliable and valid is central to the assessment design process.
Research suggests that the main weaknesses of informal assessment relate to validity and reliability (AERA, 1999; Mertler, 1999). That is why it is crucial for teachers to adopt assessment procedures that are valid indicators of a student’s performance (they appraise what the assessment claims to appraise) and reliable (they provide information that can be replicated).
Validity is a measure of how well an instrument gauges the relevant skills of a student. The research literature identifies three basic types of validity: construct, criterion, and content. Students are best served when the teacher focuses on content validity, that is, making sure the content being tested is actually the content that was taught (Popham, 2014). Content validity requires no statistical calculations whereas both construct validity and criterion validity require knowledge of statistics and thus are not well suited to classroom teachers (Allen & Yen, 2002).
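The content validity check described above can be sketched as a simple comparison between what was taught and what is tested. The following is only an illustration: the objective names, the item labels, and the `content_coverage` helper are hypothetical and do not come from the cited sources.

```python
from collections import Counter

# Hypothetical objectives that were actually taught (illustrative names only).
taught_objectives = {"fractions_add", "fractions_compare", "decimals_convert"}

# Hypothetical test blueprint: each item is tagged with the objective it targets.
test_items = {
    "Q1": "fractions_add",
    "Q2": "fractions_add",
    "Q3": "fractions_compare",
    "Q4": "percent_of_number",  # this objective is not in the taught set
}


def content_coverage(items, taught):
    """Compare the objectives a test covers with the objectives that were taught."""
    counts = Counter(items.values())
    return {
        # How many items target each objective.
        "items_per_objective": dict(counts),
        # Objectives that were taught but have no test item.
        "taught_but_untested": sorted(taught - set(counts)),
        # Items that test content outside what was taught.
        "tested_but_untaught": sorted(
            q for q, objective in items.items() if objective not in taught
        ),
    }


report = content_coverage(test_items, taught_objectives)
print(report["taught_but_untested"])   # ['decimals_convert']
print(report["tested_but_untaught"])   # ['Q4']
```

A report like this requires no statistics, which is consistent with the point that content validity is within reach of classroom teachers.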
Ultimately, prompt feedback on student performance after an assessment enhances the value of all forms of assessment. To maximize the positive impact, both student and teacher should be provided with detailed and specific information on a student’s achievement. Timely comments and explanations from teachers can clarify how a student performed and are essential components of quality instruction and performance improvement. This information tells students where they stand with regard to the teacher’s expectations. Timely feedback is also essential for teachers (Gibbs & Simpson, 2005). Otherwise, teachers remain in the dark about the effectiveness of their instructional strategies and methods. Research suggests that testing without feedback is likely to produce disappointing results, and the quantity and quality of the research support including feedback as an integral part of assessment (Başol, 2003).
Designing Teacher-Constructed Assessments
The essential question to ask when developing an informal teacher-constructed assessment is this: Does the assessment consistently assess what the teacher intended to be evaluated based on the material being taught? Best practices in assessment suggest that teachers start answering this question by incorporating assessment design into the instructional design process. Assessments are best generated at the same time as lesson plans. Although teaching to the test has acquired negative overtones, it is precisely what all student assessment is meant to accomplish. Teachers cannot and should not assess every item they teach, but it is important that they identify and prioritize the critical lesson elements for inclusion in a summary assessment.
Instruction and assessment are meant to complement one another. When this occurs it helps teachers, policymakers, administrators, and parents know what students are capable of doing at specific stages in the education process. A good match of assessment with instruction leads to more effective scope and sequencing, enhancing the acquisition of knowledge and the mastery of skills required for success in subsequent grades as well as success after graduation from school (Reigeluth, 1999).
The following guidelines lead to increased effectiveness of teacher-constructed assessment (Reynolds, Livingston, & Willson, 2010; Shillingburg, 2016; Taylor & Nolen, 2005):
- Clarify the purpose of the assessment and the intended use of its results.
- Define the domain (content and skills) to be assessed.
- Match instruction to standards required of each domain.
- Identify the characteristics of the population to be assessed and consider how these data might influence the design of the assessment.
- Ensure that all prerequisite skills required for the lesson have been taught to the students.
- Ensure that the assessment evaluates skills compatible with and required for success in future lessons.
- Review with the students the purpose of the assessment and the knowledge and skills to be assessed.
- Consider possible task formats, timing, and response modes and whether they are compatible with the assessment as well as how the scores will be used.
- Outline how validity will be evaluated and measured.
  - Methods include matching test questions to lesson plans, lesson objectives, and standards, and obtaining student feedback after the assessment.
  - Content-related evidence often consists of deciding whether the assessment methods are appropriate, whether the tasks or problems provide an adequate sample of the student’s performance, and whether the scoring system captures the performance.
- When possible, review test items with colleagues and students; revise as necessary.
- Review issues of reliability.
  - Make sure that the assessment includes enough items and tasks (examples of performance) to report a reliable score.
- Evaluate the relative weight allotted to each task, to each content category, and to each skill being assessed.
- Pilot-test the assessment, then revise as necessary. Are the results consistent with formative assessments administered on the content being taught?
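One way to look at the reliability guideline above after a pilot test is an internal-consistency estimate. The sketch below uses Cronbach's alpha, which is a common statistic but is an assumption here, since the guidelines do not name a specific method. The score matrix is made-up pilot data (rows are students, columns are items scored 0 or 1).

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix of item scores (rows = students, columns = items)."""
    k = len(scores[0])                          # number of items
    items = list(zip(*scores))                  # transpose: one tuple per item
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical pilot results for five students on a four-item quiz.
pilot_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(pilot_scores)
print(round(alpha, 2))  # 0.8
```

With so few students and items the estimate is only illustrative. Adding items that measure the same skill generally raises alpha, which is one reason the guideline asks for enough items and tasks.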
Standardized (Systematic)
Standardized testing is the second major category of summative assessment commonly used in schools. Students and teachers are very familiar with these tests, which have become ubiquitous. Over the past 20 years, they have played an ever-increasing role in schools, especially since the passage in 2001 of the No Child Left Behind Act (NCLB, 2002). Standardized tests have increased not only in influence but also in quantity. Typically, students spend between 20 and 25 hours each year taking standardized tests (Bangert-Drowns, Kulik, Kulik, & Morgan, 1991; Hart et al., 2015). The average 8th grader spends between 1.6% and 2.3% of classroom time on standardized tests, not including test preparation (Bangert-Drowns et al., 1991; Lazarín, 2014). A student will be required to take approximately 112 mandatory standardized exams during his or her academic career (Hart et al., 2015).
Although research finds that student performance increases with the frequency of assessment, it also shows that improvement tapers off with excess testing (Bangert-Drowns et al., 1991). Regardless of where educators stand on the issue of standardized testing, most can agree that these assessments should be reduced to the minimum number required to obtain the critical information for which they were designed. The aim is to decrease the number of standardized tests to those indispensable in providing educators with the basic information to make high-stakes decisions and for schools to implement a continuous improvement process. Ultimately, everyone is best served by reducing redundancy in test taking in order to maximize instructional time (Wang, Haertel, & Walberg, 1990).
Standardized tests provide valuable data to be used by educators for school reform and continuous improvement purposes. Data from these tests can include early indicators that point to interventions for preventing potential future problems. The data can also reveal when the system has broken down or highlight exemplary performers that schools can emulate. Using such data can be invaluable as a systemwide tool (Celio, 2013). Despite the potential value of summative assessment as a tool to monitor and improve systems, research finds minimal positive impact on student performance when the tests are used for high-stakes purposes or to hold teachers and schools accountable (Carnoy & Loeb, 2002; Hanushek & Raymond, 2005). The increased use of incentives and other accountability measures, which have cost enormous sums, reduced instruction time, and added stress to teachers, can be linked to only an average effect size of 0.05 in improvement of student achievement (Yeh, 2007).
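An effect size of 0.05 can be hard to picture. The sketch below translates it into a percentile shift, assuming the effect size is a standardized mean difference and that scores are normally distributed; both are assumptions made for illustration, and the `percentile_after_effect` helper is hypothetical.

```python
from statistics import NormalDist


def percentile_after_effect(effect_size, start_percentile=50.0):
    """Expected percentile of a student after a shift of `effect_size` standard deviations."""
    standard_normal = NormalDist()              # mean 0, standard deviation 1
    z_start = standard_normal.inv_cdf(start_percentile / 100)
    return 100 * standard_normal.cdf(z_start + effect_size)


# A student at the 50th percentile, after an effect of 0.05 standard deviations.
print(round(percentile_after_effect(0.05), 1))  # 52.0
```

Under these assumptions, the average student would move from the 50th to about the 52nd percentile, which illustrates how small the reported gain is.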
As previously noted, formative assessment has been shown to be a much more effective tool in helping individual students maintain progress toward meeting accepted performance standards, and the rigor and cost required to design valid and reliable standardized tests place them outside the realm of tools that teachers can personally design. In the end, it is important to understand what summative assessment is best suited to accomplish. When it comes to improving systems, standardized assessment is well suited for meeting a school’s needs. But for improving an individual student’s performance, formative assessment is more appropriate.
Summative assessment is a commonplace tool used by teachers and school administrators. It ranges from a simple teacher-constructed end-of-lesson exam to standardized tests that determine graduation from high school and entry into college. If used for the purposes for which it was designed, summative assessment plays an important role in education. When used appropriately, it can deliver objective data that support a teacher’s professional judgment, inform high-stakes decisions, and supply the information needed for adjustments in curriculum and instruction that will ultimately improve the education process. When used incorrectly or for accountability purposes, summative assessment can take valuable instruction time away from students and increase teacher and student stress without producing notable results.
Allen, M. J., & Yen, W. M. (2002). Introduction to measurement theory. Long Grove, IL: Waveland Press.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association (AERA).
Bangert-Drowns, R. L., Kulik, C. L. C., Kulik, J. A., & Morgan, M. (1991). The instructional effect of feedback in test-like events. Review of Educational Research, 61(2), 213–238.
Barnett, D. W. (1988). Professional judgment: A critical appraisal. School Psychology Review, 17(4), 658–672.
Başol, G. (2003). Effectiveness of frequent testing over achievement: A meta-analysis study. Unpublished doctoral dissertation, Ohio University, Athens, OH.
Başol, G., & Johanson, G. (2009). Effectiveness of frequent testing over achievement: A meta analysis study. International Journal of Human Sciences, 6(2), 99–121.
Belfield, C. R., & Crosta, P. M. (2012). Predicting success in college: The importance of placement tests and high school transcripts. CCRC Working Paper No. 42. New York, NY: Community College Research Center, Teachers College, Columbia University.
Brennan, R. L. (Ed.) (2006). Educational measurement (4th ed.). Westport, CT: Praeger Publishers.
Carnoy, M., & Loeb, S. (2002). Does external accountability affect student outcomes? A cross-state analysis. Educational Evaluation and Policy Analysis, 24(4), 305–331.
Celio, M. B. (2013). Seeking the magic metric: Using evidence to identify and track school system quality. In Performance Feedback: Using Data to Improve Educator Performance (Vol. 3, pp. 97–118). Oakland, CA: The Wing Institute.
Espenshade, T. J., & Chung, C. Y. (2010). Standardized admission tests, college performance, and campus diversity. Unpublished paper, Office of Population Research, Princeton University, Princeton, NJ.
Fuchs, L. S. & Fuchs, D. (1986). Effects of systematic formative evaluation: A meta-analysis. Exceptional Children, 53(3), 199–208.
Garrison, C., & Ehringhaus, M. (2007). Formative and summative assessments in the classroom. Westerville, OH: Association for Middle Level Education. https://www.amle.org/portals/0/pdf/articles/Formative_Assessment_Article_Aug2013.pdf
Geiser, S., & Santelices, M. V. (2007). Validity of high-school grades in predicting student success beyond the freshman year: High-school record vs. standardized tests as indicators of four-year college outcomes. Research and Occasional Paper Series. Berkeley, CA: Center for Studies in Higher Education, University of California.
Gibbs, G., & Simpson, C. (2005). Conditions under which assessment supports students’ learning. Learning and Teaching in Higher Education, 1, 3–31.
Hanushek, E. A., & Raymond, M. E. (2005). Does school accountability lead to improved student performance? Journal of Policy Analysis and Management, 24(2), 297–327.
Harlen, W., & James, M. (1997). Assessment and learning: Differences and relationships between formative and summative assessment. Assessment in Education: Principles, Policy & Practice, 4(3), 365–379.
Hart, R., Casserly, M., Uzzell, R., Palacios, M., Corcoran, A., & Spurgeon, L. (2015). Student testing in America’s great city schools: An inventory and preliminary analysis. Washington, DC: Council of the Great City Schools.
Lazarín, M. (2014). Testing overload in America’s schools. Washington, DC: Center for American Progress.
McMillan, J. H., & Schumacher, S. (1997). Research in education: A conceptual approach (4th ed.). New York, NY: Longman.
Mertler, C. A. (1999). Teachers’ (mis)conceptions of classroom test validity and reliability. Paper presented at the annual meeting of the Mid-Western Educational Research Association, Chicago, IL.
Moss, C. M. (2013). Research on classroom summative assessment. In J. H. McMillan (Ed.), Handbook of research on classroom assessment (pp. 235–255). Los Angeles, CA: Sage.
No Child Left Behind (NCLB) Act of 2001, Pub. L. No. 107-110, 115 Stat. 1425 (2002).
Popham, W. J. (2014). Classroom assessment: What teachers need to know (7th ed.). Boston, MA: Pearson Education.
Reigeluth, C. M. (1999). The elaboration theory: Guidance for scope and sequence decisions. In C. M. Reigeluth (Ed.), Instructional design theories and models: A new paradigm of instructional theory (Vol. II, pp. 425–453). Mahwah, NJ: Lawrence Erlbaum.
Reynolds, C. R., Livingston, R. B., & Willson, V. (2010). Measurement and assessment in education. Upper Saddle River, NJ: Pearson Education.
Rosenshine, B. (2003). High-stakes testing: Another analysis. Education Policy Analysis Archives, 11(24), 1–8.
Shillingburg, W. (2016). Understanding validity and reliability in classroom, school-wide, or district-wide assessments to be used in teacher/principal evaluations. Retrieved from https://cms.azed.gov/home/GetDocumentFile?id=57f6d9b3aadebf0a04b2691a
Spencer, T. D., Detrich, R., & Slocum, T. A. (2012). Evidence-based practice: A framework for making effective decisions. Education and Treatment of Children, 35(2), 127–151.
Taylor, C. S., & Nolen, S. B. (2005). Classroom assessment: Supporting teaching and learning in real classrooms (2nd ed.). Upper Saddle River, NJ: Pearson Prentice Hall.
Wang, M. C., Haertel, G. D., & Walberg, H. J. (1990). What influences learning? A content analysis of review literature. The Journal of Educational Research, 84(1), 30–43.
Yeh, S. S. (2007). The cost-effectiveness of five policies for improving student achievement. American Journal of Evaluation, 28(4), 416–436.
The Programme for International Student Assessment (PISA) is a survey that aims to evaluate education systems worldwide by testing the skills and knowledge of 15-year-old students.
PISA Reports Retrieved from http://www.oecd.org/pisa/.
Stepping stones: Principal career paths and school outcomes
This study examines the detrimental impact of principal turnover, including lower teacher retention and lower student achievement. Particularly hard hit are high poverty schools, which often lose principals at a higher rate as they transition to lower poverty, higher student achievement schools.
Beteille, T., Kalogrides, D., & Loeb, S. (2012). Stepping stones: Principal career paths and school outcomes. Social Science Research, 41(4), 904-919.
BURIED TREASURE: Developing a Management Guide From Mountains of School Data
This report provides a practical “management guide,” for an evidence-based key indicator data decision system for school districts and schools.
Celio, M. B., & Harvey, J. (2005). Buried Treasure: Developing A Management Guide From Mountains of School Data. Center on Reinventing Public Education.
Testing High Stakes Tests: Can We Believe the Results of Accountability Tests?
This study examines whether the results of standardized tests are distorted when rewards and sanctions are attached to them.
Greene, J., Winters, M., & Forster, G. (2004). Testing high-stakes tests: Can we believe the results of accountability tests? The Teachers College Record, 106(6), 1124-1144.
Student testing in America's great city schools: An inventory and preliminary analysis.
This study examines the extent to which standardized testing is impacting schools. The researchers conducted a survey of member districts, analyzed district testing calendars, conducted interviews, and reviewed and analyzed federal, state, and locally mandated assessments to determine which tests are being mandated in schools and how frequently.
Hart, R., Casserly, M., Uzzell, R., Palacios, M., Corcoran, A., & Spurgeon, L. (2015). Student testing in America's great city schools: An inventory and preliminary analysis. Washington, DC: Council of the Great City Schools.
A Longitudinal Examination of the Diagnostic Accuracy and Predictive Validity of R-CBM and High-Stakes Testing
The purpose of this study is to compare different statistical and methodological approaches to standard setting and determining cut scores using R-CBM and performance on high-stakes tests.
Hintze, J. M., & Silberglitt, B. (2005). A longitudinal examination of the diagnostic accuracy and predictive validity of R-CBM and high-stakes testing. School Psychology Review, 34(3), 372.
The Nation’s Report Card
The National Assessment of Educational Progress (NAEP) is a national assessment of what America's students know in mathematics, reading, science, writing, the arts, civics, economics, geography, and U.S. history.
National Center for Education Statistics
International Comparisons in Fourth-Grade Reading Literacy: Findings from the Progress in International Reading Literacy Study (PIRLS) of 2001
This report describes the reading literacy of fourth-graders in 35 countries, including the United States. The report provides information on a variety of reading topics, but with an emphasis on U.S. results. The report also presents information on reading and instruction in the classroom and explores the reading habits of fourth-graders outside of school. This report defines reading literacy for fourth-graders, highlights the performance and distribution of fourth-graders relative to fourth-graders in other countries, and illustrates, through international benchmarking, the performance of assessed students.
Ogle, L. T., Sen, A., Pahlke, E., Jocelyn, L., Kastberg, D., Roey, S., & Williams, T. (2003). International Comparisons in Fourth-Grade Reading Literacy: Findings from the Progress in International Reading Literacy Study (PIRLS) of 2001.
Incorporating End-of-Course Exam Timing Into Educational Performance Evaluations
There is increased interest in extending the test-based evaluation framework in K-12 education to achievement in high school. High school achievement is typically measured by performance on end-of-course exams (EOCs), which test course-specific standards in subjects including algebra, biology, English, geometry, and history, among others. Recent research indicates that when students take particular courses can have important consequences for achievement and subsequent outcomes. The contribution of the present study is to develop an approach for modeling EOC test performance that takes the timing of the course into account.
Parsons, E., Koedel, C., Podgursky, M., Ehlert, M., & Xiang, P. B. (2015). Incorporating end-of-course exam timing into educational performance evaluations. Journal of Research on Educational Effectiveness, 8(1), 130-147.
Creating reports using longitudinal data: how states can present information to support student learning and school system improvement
This report provides ten actions to get data into the right hands of educators.
Data Quality Campaign. (2010). Creating reports using longitudinal data: how states can present information to support student learning and school system improvement.
Synthesis of research on reviews and tests.
This study looks at the use of properly spaced reviews and tests as a practice that can dramatically improve classroom learning and retention.
Dempster, F. N. (1991). Synthesis of Research on Reviews and Tests. Educational Leadership, 48(7), 71-76.
A Meta-Analytic Review Of The Distribution Of Practice Effect: Now You See It, Now You Don't
This meta-analysis reviews 63 studies on the relationship between conditions of massed practice and spaced practice with respect to task performance, which yields an overall mean weighted effect size of 0.46.
Donovan, J. J., & Radosevich, D. J. (1999). A meta-analytic review of the distribution of practice effect: Now you see it, now you don't. Journal of Applied Psychology, 84(5), 795.
Dealing with Flexibility in Assessments for Students with Significant Cognitive Disabilities
Alternate assessment and instruction is a key issue for individuals with disabilities. This report presents an analysis, by assessment system component, to identify where and when flexibility can be built into assessments.
Gong, B., & Marion, S. (2006). Dealing with Flexibility in Assessments for Students with Significant Cognitive Disabilities. Synthesis Report 60. National Center on Educational Outcomes, University of Minnesota.
Leaders and Laggards: A State-by-State Report Card on Educational Innovation
This report is a call to action in response to how poorly states measured up on key indicators of educational innovation.
Hess, F. M., & Boser, U. (2009). Leaders and Laggards: A State-by-State Report Card on Educational Innovation. American Enterprise Institute for Public Policy Research.
Uneven Transparency: NCLB Tests Take Precedence in Public Assessment Reporting for Students with Disabilities
This report analyzes the public reporting of state assessment results for students with disabilities.
Klein, J. A., Wiley, H. I., & Thurlow, M. L. (2006). Uneven transparency: NCLB tests take precedence in public assessment reporting for students with disabilities (Technical Report 43). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Retrieved from http://education.umn.edu/NCEO/OnlinePubs/Technical43.html
Use of Education Data at the Local Level: From Accountability to Instructional Improvement
This report looks at the implementation of student data systems and the use of data for improving student performance.
Means, B., Padilla, C., & Gallagher, L. (2010). Use of Education Data at the Local Level: From Accountability to Instructional Improvement. US Department of Education.
Strategic responses to school accountability measures: It's all in the timing
This paper examines efforts in the State of Wisconsin to improve test scores.
Sims, D. P. (2008). Strategic responses to school accountability measures: It's all in the timing. Economics of Education Review, 27(1), 58-68.
2005 State Special Education Outcomes: Steps Forward in a Decade of Change
This report provides a snapshot of new initiatives, trends, accomplishments, and emerging issues of education reform as states document the academic achievement of students with disabilities during standards-based reform.
Thompson, S., Johnstone, C., Thurlow, M., & Altman, J. (2005). 2005 State Special Education Outcomes: Steps Forward in a Decade of Change. National Center on Educational Outcomes, University of Minnesota.
The impact of high-stakes testing on student proficiency in low-stakes subjects: Evidence from Florida’s elementary science exam
This paper utilizes a regression discontinuity design to evaluate the impact of Florida's high-stakes testing policy on student proficiency in the low-stakes subject of science.
Winters, M. A., Trivitt, J. R., & Greene, J. P. (2010). The impact of high-stakes testing on student proficiency in low-stakes subjects: Evidence from Florida's elementary science exam. Economics of Education Review, 29(1), 138-146.
Effects of massed versus distributed practice of test taking on achievement and test anxiety
This study examines the effects of massed versus distributed practice on achievement and test anxiety.
Zimmer, J. W., & Hocevar, D. J. (1994). Effects of massed versus distributed practice of test taking on achievement and test anxiety. Psychological Reports, 74(3), 915-919.