Executive Summary: Determining when Measurement is Adequate
Determining What to Measure
- At the heart of any evidence-based approach is systematic measurement.
- Determining what to measure is ultimately a question of values and cannot be
answered by empirical methods. The decisions about what to measure are
determined through social processes.
Determining How to Measure
- Measurement science can guide how to measure once it has been determined what is important to measure.
Reliability of Measurement Systems
- The first requirement of any measurement system is that it is reliable (consistent).
Validity of Measurement Systems
- The second requirement of a measurement system is that it measure what it
is purported to measure (validity). Measurement systems can be reliable
without being valid but cannot be valid without being reliable.
- How the unit of interest is defined can have a powerful influence on
reliability and validity. The measurement of constructs such as social
skills poses challenges for demonstrating reliability and validity.
- Systematic sampling of the behavior of interest increases validity.
The Relationship Between Measurement Science and Judgment
- Ultimately, the adequacy of a measurement system is a judgment of the researcher
and the consumer of research. To a large extent the adequacy of the
measurement depends on the question being asked and the steps taken to
assure the reliability and validity of the measurement system. The
final answer to questions about the adequacy of a measurement system is
a pragmatic one.
Determining if Measurement is Adequate
At
the heart of any evidence-based education approach is the systematic
measurement of whatever is being studied. The evidence one uses to
guide decisions about education interventions is only as good as the
measurement system used to develop the evidence. Because so much
rests on measurement, decision makers need to understand what
constitutes good measurement.
Determining
what to measure is not, ultimately, a question that can be answered by
empirical methods. The decision about what to measure reflects the
goals one has for education and what is meant by the term
“well-educated.” Some segments of the population want education to
teach skills that are directly related to jobs. Others want
education to provide a more classic education with an emphasis on
literature, art, and history.
The immediate application to job
skills is not a concern for this group. Clearly, how education will be
evaluated will be different because the ultimate goals are very
different. Both sets of goals are legitimate and progress toward each
goal can be evaluated only in the context of the goal. It would be
unreasonable to evaluate education that emphasized job related skills
by standards that are relevant to a classic education. The first
question to answer when developing a measurement system is
“Why am I interested in this?” The answer to that question can guide
many decisions about what to measure and how to measure it.
Once
the goals for education and what to measure have been established,
the process becomes somewhat easier. Many of the decisions about the
measurement system can be guided by a science of measurement, but not
all of the decisions are entirely straightforward. Consider the example
of the goal that all students of a certain age will be able to read at
the 4th grade level. Immediately, questions arise. Which 4th graders
are going to be the referent group? Will it be a sample of 4th graders
nationally or will it be 4th graders from the local area? There may
well be significant differences between the national sample and the
local sample. The answer to the question will not necessarily influence
the way reading is measured but will certainly influence how the
resulting measures are interpreted.
A second question that
immediately arises is what do we mean by reading? Are we referring to
decoding skills or are we interested in reading comprehension? Clearly
the answer to this question will determine how and what we measure. If
we are measuring decoding then decisions must be made about what
materials will be used as the assessment materials and how to score
words read correctly. Such questions as how to score self-corrections
while reading must be addressed and explicated so scores can be
interpreted and compared against the way in which reading levels were
determined in the referent group.
Differences in the way reading
is measured can make it difficult to make any meaningful comparisons.
Ultimately, the point of collecting data is to be able to make
comparisons against some other performance. The usual comparisons are
with a person’s earlier performance, with the performance of another
similar group, or with an absolute standard such as words read
correctly per minute. The specific comparison being made depends on the
exact question being asked. To the extent that performance is measured
in different ways at the two comparison points, any conclusions about
the performance are likely to be flawed.
The
first requirement of any measurement system is that it is reliable,
which means that the system is consistent under a standard set of
conditions. That is to say, repeated measures taken under similar
circumstances will yield similar data regardless of who is taking the
measure or of any other variable not directly related to what is being
measured. To the extent that a measurement system is reliable,
confidence in the data is increased. A simple example of a measurement
system is a thermometer.
The ideal thermometer is one that will (1)
yield the same temperature whenever a standard set of conditions is
present; (2) yield the same temperature as any other thermometer
measuring under those same conditions; (3) remain this consistent
across the full range of temperatures; and (4) yield the same
temperature regardless of who is taking the reading. To the extent that
any one of these statements is not true, the reliability of the
thermometer is questionable and the resulting data should be
interpreted cautiously. For example, if the thermometer meets the
other criteria but is not reliable at extreme temperatures, the
data may be unusable if the sampled temperatures include extremes.
If any of the data in the sample are
suspect then the conclusions are suspect. There are two solutions to
this problem. Either use a thermometer that is reliable across all
temperatures or throw out all data that fall at the extremes. The first
solution is certainly more desirable. If data are thrown out then any
generalizations about the data are limited because nothing is known
about what happens at the extremes. The requirements of a measurement
system in education are no different from those for a thermometer. The
measurement system has to meet all of the same standards for
reliability.
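The thermometer criteria can be sketched as a simple consistency check. This is a minimal illustration rather than a standard procedure: the readings, the two observers, and the 0.5-unit tolerance are all hypothetical values chosen for the example.

```python
def max_disagreement(readings):
    """Largest gap between any two readings taken under the same conditions."""
    return max(readings) - min(readings)


def is_consistent(readings, tolerance):
    """True when all repeated readings fall within `tolerance` of each other."""
    return max_disagreement(readings) <= tolerance


# Hypothetical repeated readings of one standard condition by two observers.
observer_a = [20.1, 20.0, 20.2]
observer_b = [20.0, 20.1, 20.1]

# Hypothetical readings at an extreme condition, where the instrument drifts.
extreme = [98.0, 101.5, 95.5]

TOLERANCE = 0.5  # assumed acceptable spread, in the instrument's units

# Pooling both observers checks consistency across time and across people.
print("standard condition consistent:", is_consistent(observer_a + observer_b, TOLERANCE))
print("extreme condition consistent:", is_consistent(extreme, TOLERANCE))
```

An instrument that passes the first check but fails the second is in the situation described above: either a better instrument is needed or the extreme readings have to be set aside, which limits what can be said about the extremes.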
Once
a system has been shown to be reliable, it becomes necessary to
demonstrate that it is valid. In this context, valid means that the
measurement system is measuring what it is purported to measure and not
some other phenomenon. One of the most frequently cited concerns with
high stakes testing is that it is not actually a measure of what
students know but rather a measure of how well teachers have taught to
the particular test. One way to address this concern is to demonstrate
that the students’ scores on the high stakes test are highly
correlated with their scores on another test that is considered to
measure the same skills and whose validity has already been
established. To the extent that the scores on the two measures are
correlated, the results from the high stakes test can be considered
valid. If there is a low correlation between the two tests, the
meaning of the scores from the high stakes test is brought into
question. It is possible for a measurement system to be reliable but
not valid. Conversely, it is not possible for it to be valid if it is
not reliable. Data from an unreliable measurement system cannot
be valid because it is unclear what the system is measuring when
the obtained data are not consistent under a standard set of
conditions.
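The comparison of two tests described above can be sketched with a Pearson correlation coefficient. The scores below are invented for illustration, and the text does not specify how high a correlation must be to count as adequate, so the sketch reports the coefficient without applying a cutoff.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for the same five students on both tests.
high_stakes_scores = [410, 455, 500, 530, 580]
established_scores = [38, 44, 47, 55, 59]

r = pearson_r(high_stakes_scores, established_scores)
print(f"correlation between the two tests: {r:.2f}")
```

A coefficient near 1 would support the claim that the two tests measure the same skills; a coefficient near 0 would bring the meaning of the high stakes scores into question.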
One of the biggest threats to both reliability and
validity is how the unit being measured is defined. If what is being
measured is composed of separate events that are assumed to be part of
the same construct, it is often much more difficult to develop a
measurement system that can reliably measure all of the component
events. Similarly, it is not always clear how each of these
components contributes to the obtained data, which raises questions
about the validity of the measurement system. Consider the construct of
social skills. Clearly, what is considered to be social skills is not a
short list of discrete behaviors but rather a broad set of behaviors
that are, in part, defined by contextual variables. Out of necessity,
any measurement system designed to assess social skills will involve
multiple behavioral events that ultimately fall under the general
construct of social skills. Because social skills is a construct,
each piece of research will define its indices of social skills
differently, and those indices will be only a subset of all
the behaviors that might be considered socially skilled. Under these
conditions it is very difficult to develop a measurement system that
will reliably measure the unit of interest regardless of whether skills
are measured through direct observation of persons or through some type
of rating system. In either instance, achieving a high level of
consistency across raters and across time will be challenging.
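Consistency across raters can be quantified in several ways. The sketch below uses simple percent agreement and Cohen's kappa, which corrects agreement for chance. Neither statistic is named in the discussion above; they are offered here as one common option, and the ratings are hypothetical.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of observations on which two raters gave the same rating."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equal-length lists of ratings")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for agreement expected by chance."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of the same eight observed interactions.
rater_1 = ["skilled", "skilled", "not", "skilled", "not", "not", "skilled", "not"]
rater_2 = ["skilled", "not", "not", "skilled", "not", "skilled", "skilled", "not"]

print("percent agreement:", percent_agreement(rater_1, rater_2))
print("Cohen's kappa:", cohens_kappa(rater_1, rater_2))
```

In this invented example the raters agree on six of eight observations, but half of that agreement would be expected by chance alone, which is why the corrected figure is noticeably lower than the raw one.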
Assuming
a reliable measurement system can be developed, the validity of the
measures is the next obstacle. If validity is the extent to which a
measurement system is actually measuring what it is purported to
measure, then it is incumbent on the researcher to demonstrate that the
obtained data are a function of differences in social skills and not
some other variable. One variable highly correlated with ratings of
social skills is physical attractiveness, so any valid measurement
system of social skills must demonstrate that the obtained data are a
measure of social skills and not of physical attractiveness.
Currently, there is a great deal
of interest in graduation rates as a measure of the effectiveness of an
educational institution. The issues related to defining the measurement
unit of interest are well illustrated in this measure. The first
question that has to be answered is what counts as graduating.
Clearly, receiving a diploma from the granting institution is the most
easily defined measure; however, questions arise when students leave
school and later obtain a high school equivalency credential through
the GED. For the purposes of measuring graduation rates, should this be
counted as a positive instance of graduation? The answer to this
question will influence the overall graduation rate.
A second question that
arises when trying to measure graduation rates is what is the total
population from which the rate will be calculated? For example, one way
to determine the population is to count the number of students who
entered high school in a particular year and calculate the rate as the
proportion of those students who subsequently
graduated. Immediately questions arise about how to deal with students
who move to another educational institution before graduation.
Similarly, how should students be counted if they move into the school
after the base year? If these students graduate, should they be
included in the calculations? If they fail to graduate, should they be
counted?
Finally, questions have to be answered about how to
count students who graduate but take longer than the usual amount of
time to graduate. For the purposes of calculation should these students
be counted as failing to graduate? Should they be counted in some
subsequent year as graduating even though they were not counted in the
base year for determining the graduation rates?
Before the data
about graduation rates can be well understood, these questions must be
answered. In part, the answers should depend on why the graduation rate
is being examined. Depending on how these questions are answered, the
resulting data about graduation rates are likely to be either an
overestimate or an underestimate of actual graduation rates. It is
important that the
persons who make the decisions about how to measure the graduation
rates provide clear explanations about why they made the decisions they
did so that the various stakeholders (politicians, educators, parents,
voters) can interpret the data in meaningful ways. In some instances,
an underestimate of graduation rates may be better than an overestimate
and it is important that the consumers of the data know if the data
they are evaluating are an overestimate or an underestimate. In the
current political climate surrounding education, the way graduation
rates are calculated is important because these rates are part of the
formula for determining if the educational institution (school,
district, or state system) is making adequate yearly progress.
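How much the counting rules matter can be sketched by computing a rate for the same group of students under two sets of rules. Everything here is hypothetical: the eight student records, the field names, the parameter names, and the assumption that four years is the usual time to graduate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Student:
    entered_in_base_year: bool      # False for students who moved in later
    transferred_out: bool           # left for another institution
    outcome: str                    # "diploma", "ged", or "none"
    years_to_finish: Optional[int]  # None when no credential was earned


def counts_as_graduate(s, count_ged, count_late, on_time_years):
    """Apply the chosen definition of 'graduating' to one student."""
    earned = s.outcome == "diploma" or (count_ged and s.outcome == "ged")
    if not earned:
        return False
    on_time = s.years_to_finish is not None and s.years_to_finish <= on_time_years
    return count_late or on_time


def graduation_rate(students, *, count_ged, count_late,
                    include_transfers_in, drop_transfers_out, on_time_years=4):
    """Graduates divided by cohort size, under the chosen counting rules."""
    cohort = [
        s for s in students
        if (s.entered_in_base_year or include_transfers_in)
        and not (drop_transfers_out and s.transferred_out)
    ]
    if not cohort:
        raise ValueError("no students left in the cohort under these rules")
    graduates = sum(
        counts_as_graduate(s, count_ged, count_late, on_time_years) for s in cohort
    )
    return graduates / len(cohort)


# Hypothetical records for eight students.
students = [
    Student(True, False, "diploma", 4),
    Student(True, False, "diploma", 5),   # graduated a year late
    Student(True, False, "ged", 4),
    Student(True, True, "none", None),    # transferred out
    Student(False, False, "diploma", 4),  # moved in after the base year
    Student(True, False, "none", None),
    Student(True, False, "diploma", 4),
    Student(True, False, "diploma", 4),
]

strict = graduation_rate(students, count_ged=False, count_late=False,
                         include_transfers_in=False, drop_transfers_out=False)
lenient = graduation_rate(students, count_ged=True, count_late=True,
                          include_transfers_in=True, drop_transfers_out=True)
print(f"strict rules:  {strict:.0%}")
print(f"lenient rules: {lenient:.0%}")
```

The same students yield very different rates under the two rule sets, which is why the people who choose the rules need to explain their choices to those who interpret the result.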
An
additional threat to the validity of the obtained data is the method of
sampling the unit of interest. Because it is very unlikely that any
measurement system will count every instance of whatever is being
measured, how sampling is done is critical to assuring the
validity of measurement. Consider the following example of sampling.
When physicians are interested in the count of red and white blood
cells they do not drain all of the blood from a person to count all of
the red and white blood cells. Rather they draw a fixed amount of blood
and from the counts in this sample they make judgments about the health
of the patient. The sampling procedure has to be valid to assure that
the estimates of the patient’s health are correct.
Similarly in
education, we want to make sure that we sample a sufficient amount of
the student’s performance to assure our estimates about the student’s
learning are correct. When we are interested in a student’s performance
in reading, it is not necessary to count every word that the student
reads in a given year to estimate progress. With curriculum-based
measurement systems, teachers can sample one-minute readings by the
student and count the number of words read correctly. These measures
can be taken at regularly scheduled intervals, such as weekly or
monthly, to measure progress across time. If reading is sampled only
once per year, it is not clear whether the obtained data are
representative of the student’s performance or were influenced by other
factors such as illness. By sampling on a weekly basis, it is possible
to have higher confidence in the validity of the reading measure. From
a pragmatic perspective, the problem is to determine how much to sample
to assure the obtained data are valid without sampling more than
necessary, so that scarce resources can be distributed in the most
effective manner possible.
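The value of repeated one-minute samples can be sketched as follows. The weekly probe results are invented, including one low week standing in for an illness, and the least-squares slope is simply one convenient way to summarize progress across time, not a method prescribed above.

```python
def words_correct_per_minute(words_attempted, errors, seconds=60):
    """Words read correctly, scaled to a one-minute rate."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (words_attempted - errors) * 60 / seconds


def weekly_trend(scores):
    """Least-squares slope of scores against week number (gain per week)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two weekly scores to estimate a trend")
    mean_week = (n - 1) / 2
    mean_score = sum(scores) / n
    num = sum((w - mean_week) * (s - mean_score) for w, s in enumerate(scores))
    den = sum((w - mean_week) ** 2 for w in range(n))
    return num / den


# Hypothetical weekly one-minute probes: (words attempted, errors).
# The third week stands in for a week when the student was ill.
probes = [(52, 4), (55, 5), (40, 6), (58, 4), (61, 3), (63, 3)]

scores = [words_correct_per_minute(attempted, errors) for attempted, errors in probes]
print("weekly scores:", scores)
print(f"estimated gain per week: {weekly_trend(scores):.1f} words")
```

Had the student been measured only once, in the third week, the single score would have understated the student's reading. The weekly series shows both the dip and the overall upward trend.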
The Relationship Between Measurement Science and Judgment
As
has been highlighted in this discussion, there is no fixed method
for assuring the adequacy of a measurement system. Ultimately, the
adequacy of a measurement system will depend on the judgments of the
researchers and the consumers of research. In many instances, the
adequacy will depend on the specific question being asked. There are
occasions when researchers are asking one question and the consumer of
the research is asking a different but related question. The
measurement system may not be adequate to answer both
questions. The discussion here is intended to give the reader an idea
of the types of considerations that apply to determining if a
measurement system is adequate for the particular question. Ultimately,
the judged adequacy of a measurement system will depend on the clarity
of the question being asked. If one is unclear about the question being
asked it is more difficult to determine if the measurement system is
adequate. To close this discussion let us revisit the thermometer
example. There are some instances in which a difference of one degree
is sufficiently precise to accept the measurement system. On other
occasions, that same difference is simply too great and the thermometer
is deemed unreliable. As always, pragmatic judgments are necessary.
With any measurement system it is important to be sufficiently precise
but not more precise than one needs for a particular question.