Accountability 101: Tests Are Blunt Instruments
By Nancy Kober
Large-scale tests like those used for NCLB have advantages over less standardized forms of measurement. They can provide results that are more consistent and useful for comparisons than those from assessments based on individual judgment. They can also produce extensive information about student performance at lower costs and with less testing time than many other forms of assessment. Because large-scale tests are developed in a scientific manner and report results in numbers, many people assume they are very precise. But even well-designed tests have limitations that should be considered by users of AYP and other test-based data.
As testing experts often note, a test score is more like an estimate than an exact measurement. If a student took the same test on consecutive days without studying in between, the student's scores may still vary due to factors unrelated to learning, such as the sample of questions on the particular test version, the student's physical condition or state of mind, lucky guesses, or errors in recording answers. Aggregate scores for a group of students—whether a school, classroom, or NCLB-defined subgroup—may also fluctuate due to factors unrelated to teaching and learning, such as yearly changes in the test-taking population.
Here are some aspects of testing that could produce these types of test score fluctuations:
a. A test is a sample of all possible questions that could be asked about a subject. The questions on a test are merely a sample of the vast store of knowledge and skills in a subject like math. A test that lasts a few hours can't possibly address all the important topics, concepts, or math skills that students are expected to learn during the school year.
Test developers try to minimize the impact of this form of sampling variation by selecting questions that cover a representative sample of important knowledge and skills in the subject being tested. They also try to ensure that different versions of the same test (developed for security reasons and to limit teaching to the test) are parallel in content and difficulty. Still, there will always be students who would have scored higher if a particular test version had included a different sample of questions that they happened to know well.
b. A test administration is a sample of a student's behavior at a single point in time. On any given day, a variety of external factors—a headache, an argument with a parent that morning, a jackhammer or barking dog outside the school—could negatively affect a student's performance. If the test had been given at another time, the student might have scored higher.
When student scores are combined across a large enough group, fluctuations in individual test scores due to sampling variations in test questions and external conditions tend to offset each other. For example, the uncharacteristically low performance of a student with a headache on test day might be offset by the unexpectedly high score of another student who felt rested and confident and made a few lucky guesses. In smaller groups, however, there is less opportunity for this kind of offsetting. Because AYP is calculated by looking at the percentages of students scoring at the state-set "proficient" level on state tests, the scores of a few students can mean the difference between making or not making AYP, especially when it comes to subgroups, which include fewer students.
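The offsetting effect, and how it weakens in small groups, can be illustrated with a short simulation. This is a minimal sketch, not a model of any actual state test: the 10-point standard deviation for day-to-day noise, the group sizes, and the function name `group_mean_noise` are all assumptions chosen for illustration.

```python
import random
import statistics

def group_mean_noise(group_size, noise_sd=10.0, trials=2000, seed=0):
    """Estimate the typical error in a group's average score that comes
    purely from random day-to-day noise in individual scores.

    noise_sd is a hypothetical figure, not taken from any real test.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        # Observed score = true score + random noise. Only the noise
        # matters here, so true scores are left out.
        noise = [rng.gauss(0.0, noise_sd) for _ in range(group_size)]
        errors.append(statistics.fmean(noise))
    # Spread of the group average across many simulated test days.
    return statistics.pstdev(errors)

for n in (1, 25, 400):
    print(f"group of {n:3d}: typical error in average = {group_mean_noise(n):.2f} points")
```

In theory the three figures should come out near 10, 2, and 0.5 points, because the noise in an average shrinks with the square root of the group size. A subgroup of 25 students is left with about a fifth of the individual noise, while a group of 400 is left with about a twentieth.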
c. Yearly changes in the test-taking population can produce fluctuations in aggregate test scores. As any teacher can attest, each year's group of students represents a unique mix of economic, linguistic, and racial/ethnic backgrounds and different capabilities, personalities, and behavior. Countless factors can change the composition of test-takers from year to year in ways that affect aggregate scores. Family income, for example, is a strong predictor of student test scores, so the loss of a major manufacturer could increase poverty and lead to lower aggregate test scores for a school in that community.
Or a school could experience an influx of immigrants, adding more English language learners to the test-taking population. This year's third grade (a tested grade in many states) could have a higher share of students with severe disabilities than last year's third grade. A cluster of students with behavioral problems could create an unusually disruptive classroom environment. An exodus of high-achieving students from neighborhood schools to private or charter schools or construction of an upscale housing development in the neighborhood could change the test-taking group in meaningful ways.
If the number of test-takers is large, these types of yearly changes may have little effect. But in a relatively small group (fewer than 100 students, according to Haney, 2002), annual changes in group composition can produce wider fluctuations because each student's score has a greater impact on the aggregate. With the average elementary school containing only 68 students per grade, score instability is not unusual. It is also common among schools with high mobility or very diverse enrollments.
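A rough calculation shows how much group size matters. The sketch below assumes, purely for illustration, that each year's cohort is an independent random draw from a stable population in which half of all students would score proficient. Real cohorts differ in more systematic ways, so these figures are a simplified lower bound on instability rather than a description of any actual school. The function name `cohort_swing` and the 50 percent rate are hypothetical.

```python
import math

def cohort_swing(group_size, true_rate=0.5):
    """Approximate spread of a percent-proficient figure, in percentage
    points, when each cohort is a random draw from a stable population.

    Returns (single_year_sd, year_to_year_change_sd).
    """
    single_year = math.sqrt(true_rate * (1 - true_rate) / group_size)
    # The gap between two independent years has sqrt(2) times the spread.
    change = math.sqrt(2) * single_year
    return 100 * single_year, 100 * change

for n in (30, 68, 100, 500):
    one_year, change = cohort_swing(n)
    # One student's share of the percent-proficient figure is 100 / n points.
    print(f"n={n:3d}  one student = {100 / n:4.1f} pts  "
          f"single-year sd = {one_year:4.1f} pts  "
          f"year-to-year sd = {change:4.1f} pts")
```

Under these assumptions, a 68-student grade has a single-year spread of about 6 percentage points and a typical year-to-year change of about 8.6 points from chance alone. For a group of 500, the year-to-year figure falls to roughly 3 points.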
These fluctuations matter because under NCLB, aggregate scores of students in tested grades and subgroups are used to make judgments about the effectiveness of the entire school. When a school fails to make AYP, people generally don't consider whether the students tested that year are truly representative of the broader universe of students served by that school across the years, and as a result, they don't question what the test scores truly say about the school's effectiveness.
d. Most, though not all, states use confidence intervals to compensate for test score variations. Recognizing that tests are not precise instruments, many states are using a statistical tool called "confidence intervals" to make allowances for score fluctuations unrelated to changes in achievement. Somewhat like the margin of error in a public opinion poll, a confidence interval creates a window of plus or minus a few points around the state AYP target. Test results that fall slightly below the target but within the window are counted as having met the target, so confidence intervals make it somewhat less likely that a school or subgroup will fail to make AYP due to chance fluctuations. The size of the window is determined by two factors: the number of students tested and the degree of confidence that test administrators wish to have in the accuracy of the results. The smaller the group of students tested, the wider the window.
Imagine that 40 percent of the students in a school score at the proficient level in math. Using a 95 percent confidence interval, test administrators can be 95 percent certain that the true achievement of the school falls within a range of 35 to 45 percent proficient. If the AYP target is 42 percent proficient, then the school makes AYP because its score falls within the window. If test administrators want to be 99 percent confident that true achievement falls within the window, then the window would have to be wider.
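The arithmetic behind an example like this can be sketched with the textbook normal-approximation formula for a proportion. This is only one common method, and states may calculate their intervals differently. The group size of 369 students is a hypothetical figure chosen because it roughly reproduces the 35-to-45 range above. The function names are likewise assumptions made for illustration.

```python
import math
from statistics import NormalDist

def proportion_interval(percent_proficient, n_students, confidence=0.95):
    """Normal-approximation interval for a percent-proficient figure.

    Returns (low, high) in percentage points. One textbook method only;
    actual state formulas may differ.
    """
    p = percent_proficient / 100
    # Two-sided critical value, about 1.96 for 95 percent confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * math.sqrt(p * (1 - p) / n_students)
    return 100 * (p - margin), 100 * (p + margin)

def meets_target(percent_proficient, n_students, target, confidence=0.95):
    """Count the target as met if it lies at or below the top of the interval."""
    _, high = proportion_interval(percent_proficient, n_students, confidence)
    return high >= target

for n, conf in ((369, 0.95), (369, 0.99), (50, 0.95)):
    low, high = proportion_interval(40, n, conf)
    print(f"n={n:3d}, {conf:.0%} confidence: {low:.1f} to {high:.1f} percent; "
          f"meets a 42 percent target: {meets_target(40, n, 42, conf)}")
```

With 369 students, the 95 percent interval is about 35 to 45 percent, and the 99 percent interval widens to roughly 33 to 47 percent. With only 50 students, the 95 percent interval stretches to roughly 26 to 54 percent, which shows why smaller groups get wider windows.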
The majority of states currently use confidence intervals for various AYP decisions, more than in earlier years (Center on Education Policy, 2005). In states that do not use confidence intervals, one school could be labeled as low-performing while another is deemed adequate, even though there's no meaningful difference between their aggregate scores.
* * *
Sampling variation in test questions and external conditions and changes in group composition can produce yearly fluctuations in test scores unrelated to educational effectiveness. A school's test score trends across several years can provide a trustworthy indicator of student achievement, but a single year's scores may or may not be a good indicator of the quality of teaching and learning in a school.
Nancy Kober is a consultant to the Center on Education Policy and co-author and editor of the Center's annual reports on NCLB. This sidebar is an adaptation of the October 2002 issue of Test Talk for Leaders published by the Center on Education Policy. The original is available on CEP's Web site at www.cep-dc.org.
Center on Education Policy. (2005). From the capital to the classroom: Year 3 of the No Child Left Behind Act. Washington, D.C.: Author.
Haney, W. (July 10, 2002). "Ensuring failure: How a state's achievement test may be designed to do just that." Education Week.