Determining Measurement

Executive Summary: Determining when Measurement is Adequate

Determining What to Measure

At the heart of any evidence-based approach is systematic measurement.
Determining what to measure is ultimately a question of values and cannot be answered by empirical methods. The decisions about what to measure are determined through social processes.

Determining How to Measure

Measurement science can guide how to measure once it has been determined what is important to measure.

Reliability of Measurement Systems

The first requirement of any measurement system is that it is reliable (consistent).

Validity of Measurement Systems

The second requirement of a measurement system is that it measure what it is purported to measure (validity). Measurement systems can be reliable without being valid but cannot be valid without being reliable.
How the unit of interest is defined can have a powerful influence on reliability and validity. The measurement of constructs such as social skills poses challenges for demonstrating reliability and validity.
Systematic sampling of the behavior of interest increases validity.

The Relationship Between Measurement Science and Judgment

Ultimately, the adequacy of a measurement system is a judgment of the researcher and the consumer of research. To a large extent the adequacy of the measurement depends on the question being asked and the steps taken to assure the reliability and validity of the measurement system. The final answer to questions about the adequacy of a measurement system is a pragmatic one.

Determining if Measurement is Adequate

At the heart of any evidence-based education approach is the systematic measurement of whatever is being studied. The evidence one uses to guide decisions about education interventions is only as good as the measurement system used to develop the evidence. Given the importance of measurement, it is important for decision makers to have an understanding of what constitutes good measurement.

Determining What to Measure

Determining what to measure is not, ultimately, a question that can be answered by empirical methods. The decision about what to measure reflects the goals one has for education and what is meant by the term “well-educated.” Some segments of the population want education to teach skills that are directly related to jobs. Others want education to provide a more classic education with an emphasis on literature, art, and history; the immediate application to job skills is not a concern for this group. Clearly, how education is evaluated will differ because the ultimate goals are very different. Both sets of goals are legitimate, and progress toward each goal can be evaluated only in the context of that goal. It would be unreasonable to evaluate education that emphasizes job-related skills by standards that are relevant to a classic education.

The first question to ask when developing a measurement system is “why am I interested in this?” The answer to that question can guide many decisions about what to measure and how to measure it.

Determining How to Measure

Once the goals for education and what to measure have been established, the process becomes somewhat easier. Many of the decisions about the measurement system can be guided by a science of measurement, but not all of the decisions are entirely straightforward. Consider the example of the goal that all students of a certain age will be able to read at the 4th grade level. Immediately, questions arise. Which 4th graders are going to be the referent group? Will it be a sample of 4th graders nationally or will it be 4th graders from the local area? There may well be significant differences between the national sample and the local sample. The answer to the question will not necessarily influence the way reading is measured but will certainly influence how the resulting measures are interpreted.

A second question that immediately arises is what do we mean by reading? Are we referring to decoding skills or are we interested in reading comprehension? Clearly the answer to this question will determine how and what we measure. If we are measuring decoding then decisions must be made about what materials will be used as the assessment materials and how to score words read correctly. Such questions as how to score self-corrections while reading must be addressed and explicated so scores can be interpreted and compared against the way in which reading levels were determined in the referent group.

Differences in the way reading is measured can make it difficult to make any meaningful comparisons. Ultimately, the point of collecting data is to be able to make comparisons against some other performance. The usual comparisons are with a person’s earlier performance, with the performance of another similar group, or with an absolute standard such as words read correctly per minute. The specific comparison being made depends on the exact question being asked. To the extent that performance is measured in different ways at the two comparison points, any conclusions about the performance are likely to be flawed.

Reliability of Measurement Systems

The first requirement of any measurement system is that it is reliable, which means that the system is consistent under a standard set of conditions. That is to say, with repeated measures taken under similar circumstances, similar data will be obtained regardless of who is taking the measure or any other variable not directly related to what is being measured. To the extent that a measurement system is reliable, confidence in the data is increased. A simple measurement system is a thermometer.

The ideal thermometer is one that will (1) yield the same temperature whenever a standard set of conditions is present; (2) yield the same temperature as any other thermometer measuring under those same conditions; (3) perform this consistently across a range of temperatures; and (4) yield the same temperature regardless of who is taking the temperature. To the extent that any one of these statements is not true, the reliability of the thermometer is questionable and the resulting data are to be interpreted cautiously. For example, if the thermometer meets all of the above criteria but is not reliable at extreme temperatures, then the data may be unusable if the sampled temperatures include extremes.

If any of the data in the sample are suspect then the conclusions are suspect. There are two solutions to this problem. Either use a thermometer that is reliable across all temperatures or throw out all data that fall at the extremes. The first solution is certainly more desirable. If data are thrown out then any generalizations about the data are limited because nothing is known about what happens at the extremes. The requirements of a measurement system in education are no different than for a thermometer. The measurement system has to meet all of the same standards for reliability.
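One common way to check the "regardless of who is taking the measure" requirement in an education setting is to have two observers score the same performance and compare their records. The following is a minimal sketch in Python, assuming a simple item-by-item percent agreement index; the function name `percent_agreement` and the observer data are hypothetical and chosen only for illustration, and other reliability indices exist.

```python
def percent_agreement(rater_a, rater_b):
    """Share of items on which two raters gave the same score (0.0 to 1.0)."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same items")
    if not rater_a:
        raise ValueError("at least one scored item is required")
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return matches / len(rater_a)


# Hypothetical data: each entry is one word in the same reading passage,
# 1 = scored as read correctly, 0 = scored as an error.
observer_1 = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
observer_2 = [1, 1, 0, 1, 0, 1, 0, 1, 1, 1]

agreement = percent_agreement(observer_1, observer_2)
print(f"Inter-observer agreement: {agreement:.0%}")
```

How much agreement is enough is not settled by the calculation itself; it depends on the question being asked of the data.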

Validity of Measurement Systems

Once a system has been shown to be reliable, it becomes necessary to demonstrate that it is valid. In this context, valid means that the measurement system is measuring what it is purported to measure and not some other phenomenon. One of the most frequently cited concerns with high stakes testing is that it is not actually a measure of what students know but rather a measure of how well teachers have taught to the particular test. One way to address this concern is to demonstrate that the students’ scores on the high stakes test are highly correlated with their scores on another test that is considered to measure the same skills and whose validity has already been established. To the extent that the scores on the two measures are correlated, the results from the high stakes test can be considered valid. If there is a low correlation between the two tests, then the meaning of the scores from the high stakes test is brought into question.

It is possible for a measurement system to be reliable but not valid. Conversely, it is not possible for it to be valid if it is not reliable. By definition, data based on an unreliable measurement system cannot be valid, because it is unclear what the system is measuring when the obtained data are not consistent under a standard set of conditions.
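The correlation check described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed procedure: it uses the Pearson correlation coefficient, the helper name `pearson_r` is hypothetical, and the two score lists are invented data for six students.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, with 2+ scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when scores do not vary")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for the same six students on two tests.
high_stakes_scores = [410, 455, 390, 520, 480, 430]
established_test_scores = [62, 70, 58, 85, 77, 66]

r = pearson_r(high_stakes_scores, established_test_scores)
print(f"Correlation between the two tests: {r:.2f}")
```

A coefficient near 1 supports the claim that the two tests measure the same skills; a coefficient near 0 calls the meaning of the high stakes scores into question. Where to draw the line between the two remains a judgment.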

One of the biggest threats to both reliability and validity is how the unit being measured is defined. If what is being measured is composed of separate events that are assumed to be part of the same construct, then it is often much more difficult to develop a measurement system that can reliably measure all of the component events. Similarly, it is not always clear how each of these components contributes to the obtained data, which raises questions about the validity of the measurement system. Consider the construct of social skills. Clearly, what is considered to be social skills is not a short list of discrete behaviors but rather a broad set of behaviors that are, in part, defined by contextual variables. Out of necessity, any measurement system designed to assess social skills will involve multiple behavioral events that ultimately fall under the general construct of social skills. Because social skills is a construct, within the context of a specific piece of research the indices of social skills will be defined differently and will be a subset of all the behaviors that might be considered socially skilled. Under these conditions it is very difficult to develop a measurement system that will reliably measure the unit of interest, regardless of whether skills are measured through direct observation of persons or through some type of rating system. In either instance, achieving a high level of consistency across raters and across time will be challenging.

Assuming a reliable measurement system can be developed, the validity of the measures is the next obstacle. If validity is the extent to which a measurement system is actually measuring what it is purported to measure, then it is incumbent on the researcher to demonstrate that the obtained data are a function of differences in social skills and not some other variable. One variable highly correlated with ratings of social skills is physical attractiveness, so any valid measurement system of social skills must demonstrate that the obtained data are a measure of social skills and not of physical attractiveness.

Currently, there is a great deal of interest in graduation rates as a measure of the effectiveness of an educational institution. The issues related to defining the measurement unit of interest are well illustrated in this measure. The first question that has to be answered is what counts as graduating. Clearly, receiving a diploma from the granting institution is the most easily defined measure; however, questions arise when students leave school and obtain a high school credential through the GED. For the purposes of measuring graduation rates, should this be counted as a positive instance of graduation? The answer to this question will influence the overall graduation rate.

A second question that arises when trying to measure graduation rates is what the total population is from which the rate will be calculated. For example, one way to determine the population is to count the number of students who entered high school in a particular year and calculate the rate as the proportion of those students who subsequently graduated. Immediately questions arise about how to deal with students who move to another educational institution before graduation. Similarly, how should students be counted if they move into the school after the base year? If these students graduate, should they be included in the calculations? If they fail to graduate, should they be counted?

Finally, questions have to be answered about how to count students who graduate but take longer than the usual amount of time to graduate. For the purposes of calculation should these students be counted as failing to graduate? Should they be counted in some subsequent year as graduating even though they were not counted in the base year for determining the graduation rates?

Before the data about graduation rates can be well understood, these questions must be answered. In part, the answers to these questions should be influenced by why questions are being asked about the graduation rate. Depending on how these questions are answered, it is likely that the resulting data about graduation rates will either be an overestimate or an underestimate of actual graduation rates. It is important that the persons who make the decisions about how to measure the graduation rates provide clear explanations about why they made the decisions they did so that the various stakeholders (politicians, educators, parents, voters) can interpret the data in meaningful ways. In some instances, an underestimate of graduation rates may be better than an overestimate and it is important that the consumers of the data know if the data they are evaluating are an overestimate or underestimate. In the current political climate surrounding education the way graduation rates are calculated is important because these rates are part of the formula for determining if the educational institution (school, district, or state system) is making adequate yearly progress.
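To make concrete how much these counting decisions can move the result, here is a small Python sketch that computes a graduation rate for one cohort under two sets of rules. Everything in it is an assumption made for illustration: the `Student` record, its field names, the option names, the four-year "on time" window, the choice to drop students who transferred out from the denominator, and the twelve invented student records.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Student:
    in_base_cohort: bool            # entered high school in the base year
    transferred_out: bool           # moved to another institution
    credential: Optional[str]       # "diploma", "ged", or None
    years_to_finish: Optional[int]  # None if no credential was earned


def graduation_rate(students, count_ged=False, count_late=False,
                    include_transfers_in=False, on_time_years=4):
    """Graduation rate under one explicit set of counting rules."""
    # Assumption: students who transferred out are removed from the population.
    cohort = [s for s in students
              if (s.in_base_cohort or include_transfers_in)
              and not s.transferred_out]
    if not cohort:
        raise ValueError("no students left in the cohort under these rules")

    def graduated(s):
        if s.credential is None:
            return False
        if s.credential == "ged" and not count_ged:
            return False
        late = s.years_to_finish is not None and s.years_to_finish > on_time_years
        return count_late or not late

    return sum(graduated(s) for s in cohort) / len(cohort)


# Hypothetical records for one school.
students = (
    [Student(True, False, "diploma", 4)] * 6    # on-time diplomas
    + [Student(True, False, "diploma", 5)]      # diploma after a fifth year
    + [Student(True, False, "ged", 4)]          # GED instead of a diploma
    + [Student(True, False, None, None)] * 2    # did not finish
    + [Student(True, True, None, None)]         # transferred out
    + [Student(False, False, "diploma", 4)]     # transferred in, graduated
)

strict = graduation_rate(students)
inclusive = graduation_rate(students, count_ged=True, count_late=True,
                            include_transfers_in=True)
print(f"Strict rules:    {strict:.0%}")
print(f"Inclusive rules: {inclusive:.0%}")
```

The same twelve students yield noticeably different rates depending on the rules, which is why the rules need to be reported alongside the number.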

An additional threat to the validity of the obtained data is the method of sampling the unit of interest. Since it is very unlikely that any measurement system will count every instance of whatever is being measured, how sampling is done will be critical to assuring the validity of measurement. Consider the following example of sampling. When physicians are interested in the count of red and white blood cells, they do not drain all of the blood from a person to count all of the red and white blood cells. Rather, they draw a fixed amount of blood and from the counts in this sample they make judgments about the health of the patient. The sampling procedure has to be valid to assure that the estimates of the patient’s health are correct.

Similarly in education, we want to make sure that we sample a sufficient amount of the student’s performance to assure our estimates about the student’s learning are correct. When we are interested in a student’s performance in reading, it is not necessary to count every word that the student reads in a given year to estimate progress. With curriculum based measurement systems, teachers can sample one-minute readings by the student and count the number of words read correctly. These measures can be taken at regularly scheduled intervals, such as weekly or monthly, to measure progress across time. If reading is sampled only once per year, it is not clear whether the obtained data are representative of the student’s performance or were influenced by other factors such as illness. By sampling on a weekly basis, it is possible to have higher confidence in the validity of the reading measure. From a pragmatic perspective, the problem is to determine how much to sample to assure the obtained data are valid, but not to sample more than necessary, so that scarce resources can be distributed in the most effective manner possible.
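The weekly one-minute reading samples described above lend themselves to a simple calculation. The Python sketch below scores each sample as words read correctly per minute and then estimates the average weekly gain with an ordinary least-squares slope. The function names, the five weekly probes, and the choice of a least-squares trend are assumptions made for illustration.

```python
def words_correct_per_minute(words_attempted, errors, seconds=60):
    """Words read correctly, scaled to a one-minute rate."""
    return (words_attempted - errors) * 60 / seconds


def weekly_growth(scores):
    """Least-squares slope of score against week number (0, 1, 2, ...)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two samples to estimate growth")
    mean_week = (n - 1) / 2
    mean_score = sum(scores) / n
    num = sum((w - mean_week) * (s - mean_score) for w, s in enumerate(scores))
    den = sum((w - mean_week) ** 2 for w in range(n))
    return num / den


# Hypothetical weekly probes: (words attempted in one minute, errors).
probes = [(52, 6), (55, 5), (57, 5), (61, 4), (63, 3)]

scores = [words_correct_per_minute(attempted, errors)
          for attempted, errors in probes]
growth = weekly_growth(scores)
print("Weekly scores:", scores)
print(f"Average gain: {growth:.1f} words correct per minute per week")
```

With several samples, one unusually low week (for example, when the student is ill) shifts the trend only slightly, whereas a single annual sample taken that week would be the entire record.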

The Relationship Between Measurement Science and Judgment

As has been highlighted in this discussion, there is no fixed method for assuring the adequacy of a measurement system. Ultimately, the adequacy of a measurement system will depend on the judgments of the researchers and the consumers of research. In many instances, the adequacy will depend on the specific question being asked. There are occasions when the researcher is asking one question and the consumer of the research is asking a different but related question. The measurement system may not be adequate to answer both questions. The discussion here is intended to give the reader an idea of the types of considerations that apply to determining if a measurement system is adequate for the particular question. Ultimately, the judged adequacy of a measurement system will depend on the clarity of the question being asked. If one is unclear about the question being asked, it is more difficult to determine if the measurement system is adequate.

To close this discussion, let us revisit the thermometer example. There are some instances in which a thermometer that is accurate to within one degree is sufficiently precise to accept the measurement system. On other occasions, that same difference is simply too great and the thermometer is deemed unreliable. As always, pragmatic judgments are necessary. With any measurement system it is important to be sufficiently precise but not more precise than one needs for a particular question.