Executive Summary
- Science is not a fixed set of practices but rather a highly developed logic system designed to rule out alternative explanations for obtained outcomes.
Measurement, Reliability, and Validity
- Measurement is a critical component of rigorous research.
- Two primary concerns for measurement are reliability and validity.
- Critical to educational reform is measuring the right things and measuring them well.
Selecting Important Goals
- The selection of the best indicators of education is ultimately a values question.
- Once the goals of education have been defined, science can provide meaningful measures of these goals.
Fundamentals of Scientific Proof
- There are specific steps that are required for a demonstration of scientific proof.
Demonstrations of Experimental Effects
- There are several methods for demonstrating the effects of an educational intervention, each with strengths and limitations.
Accountability for Individual Students
- Ultimately, it is the responsibility of education to demonstrate that each child is benefiting from educational services.
Science and Pragmatism
- Educators will have to make decisions based on the best evidence available, which requires a pragmatic approach to evaluating data.
One of the mandates of No Child Left Behind is that educational
institutions will use “scientifically-based research” to guide their
decisions about which interventions to implement. On the surface, it
certainly seems reasonable that educators spend taxpayers' money on
those procedures that have been demonstrated to be effective; however,
it is not clear that all decision makers share an understanding of
what counts as evidence-based practice or have sufficient training to
read research and determine whether it meets the minimum standards of
proof for science. The purpose of this paper is to provide a brief
overview of science and to suggest how the requirement to rely on
scientifically based research can be implemented by public school
educators.
It is important to understand that science is not a
rigid set of practices but rather a highly refined logic system for
demonstrating that the results from research are a function of the
experimental procedures rather than some other unexamined variable. The
particular set of practices in a given piece of research is, in part,
determined by the phenomenon being studied and the question being
asked. Generally, it is more productive to think of the practices of
science as falling on a continuum of rigor. The challenge for the consumer is
to determine if an acceptable level of rigor has been achieved in a
given instance. There are certain defining features of science that can
be useful in guiding the consumer of science.
The first defining feature of science is the reliance upon objective data
to draw conclusions about the phenomenon of interest. While it seems
straightforward to rely on data to inform judgments, there are a number
of issues to be considered when evaluating data. There are two
fundamental questions with regard to the data. The first question
relates to the reliability of the data. Data are reliable if a
measuring device produces similar readings when repeatedly exposed to
the same conditions. A simple example is a thermometer. A thermometer is
considered reliable if it yields the same temperature reading when
repeatedly exposed to a fixed temperature. For pragmatic reasons it is
helpful if the thermometer is reliable across a range of temperatures.
Similarly, it is deemed reliable if it yields the same reading as a
second thermometer under the same conditions. To the extent that data
are reliable, confidence in the data increases.
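The two checks in the thermometer example can be sketched in a few lines of Python. This is only an illustration: the readings, the function names, and the tolerance of 0.3 degrees are hypothetical values chosen for the example, not established standards.

```python
from statistics import mean


def max_spread(readings):
    """Largest gap between any two repeated readings of one fixed condition."""
    return max(readings) - min(readings)


def mean_disagreement(readings_a, readings_b):
    """Average absolute difference between two instruments read side by side."""
    return mean(abs(a - b) for a, b in zip(readings_a, readings_b))


# Hypothetical readings (degrees F) of the same fixed temperature, taken five times.
thermometer_1 = [98.6, 98.7, 98.6, 98.5, 98.6]
thermometer_2 = [98.7, 98.6, 98.6, 98.6, 98.7]

TOLERANCE = 0.3  # assumed acceptable variation; not a standard value

# Check 1: does one thermometer repeat itself under a fixed condition?
repeatable = max_spread(thermometer_1) <= TOLERANCE
# Check 2: does it agree with a second thermometer under the same conditions?
agrees = mean_disagreement(thermometer_1, thermometer_2) <= TOLERANCE

print("repeatable:", repeatable, "| agrees with second instrument:", agrees)
```

Neither check says anything about validity; a thermometer under a heat lamp could pass both.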
The second question regarding the data relates to the validity of the
data. Data are considered valid if they are a measure of what the
experimenter thought was being measured. To stay with the thermometer
example, we generally take a person's body temperature to determine
whether that person is ill. If the person sits under a heat lamp with
the thermometer in her mouth, then the resulting data are not valid,
since it is just as likely that what was being measured was the
temperature from the heat lamp rather than body temperature.
The two measures of reliability and validity are related. It is
axiomatic that data can be reliable without being valid but cannot be
valid without being reliable. Obviously, the tests for both
reliability and validity have to be satisfied when conducting
research. There are specific practices that increase the reliability
and validity of data collected during research.
The ultimate questions for any measurement-based system are these: are
we measuring the right things, and are we measuring them well? These
questions can only be partially answered by empirical methods. The first question
(are we measuring the right things) is, to a large extent, a question
of what is valued. When trying to evaluate education, it is often easy
to agree in the very broad sense what we mean when we say we want our
children to be well-educated. Once we try to more carefully define what
we mean by the term, it becomes clear that there are very different
meanings. Ultimately, what is meant reflects the values of the person
defining it. For some, being well-educated might mean having a set of
skills that are directly applicable to the work place. For others, it
might mean to have a broad, classic education in literature, history,
science, and arts with relatively little emphasis on work place skills.
The definition of well-educated will determine what is measured to
evaluate the impact of the educational system. Both are perfectly
legitimate goals for education, and each can be evaluated only in the
context of its stated goals. Before any measurement system can be
implemented, it will be necessary to have some agreement about what the
goals are. This is usually accomplished through some kind of social
process that involves gaining input from the various stakeholders. The
extent to which large segments of the population agree with the goals
of the educational system becomes a measure of the social validity of
the goals.
Once the goals have been determined, questions about the
appropriateness of the measurement system must be answered.
Fortunately, there are empirical methods to assist in answering these
questions. As discussed above, the primary issues with respect to how
well we are measuring what we are interested in are issues of
reliability and validity.
Assuming reliable and valid methods for measurement have been
established, it is necessary to determine if the measures selected
effectively predict performance on the broader goal. It is possible to
have reliable and valid measures of performance and yet for the unit
measured to produce results that are of limited social value.
The question of social value is also a question that is contextual in
nature. The question is often: is this outcome valuable relative to
other means for accomplishing a goal? The function of the SAT is to
predict how well a prospective student will do in college. There are
really two questions to be answered: (1) how well does the SAT
predict, and (2) how well do alternative measures predict? It may well
be that a given instrument does not predict particularly well but
predicts far better than alternative methods. The question then
becomes whether it predicts well enough to be used even though it is
less than a perfect instrument. In medicine, there is a wide variety
of measures that can be used to predict the health of the patient. Some
of these measures are far better at predicting but in many cases they
are also far more expensive. It is untenable for all patients to be
assessed with the highly accurate but very expensive methods. The goal
is to find the most highly predictive measure that has the lowest costs
associated with it. For many measures of health, the body mass index
(BMI) is a good predictor of health. It is simple to obtain and has a
high enough predictive validity that it is a good broad measure of
health. Similarly, blood pressure is an effective predictor of health
and simple to obtain. Neither of these measures is a direct measure of
heart functioning, but they have enough predictive validity that if
someone has a high BMI or high blood pressure then it may be necessary
for the physician to use more invasive and expensive measures to assess
patient health. The invasive and expensive procedures are reserved for
those who have been identified as being at risk.
Similarly, in education there are some measures that are relatively
easy to obtain, that have high reliability and validity, and that
predict how a student will do subsequently in school. Research by Hart
and Risley (1995) found that the size of a child's vocabulary by age 3
predicts how well a child will perform in later grades in elementary
school. Equally important, they have identified the types of
experiences that are most likely to result in a well-developed
vocabulary. The measure that Hart and Risley used was the number of new
words spoken during weekly observations of interactions between parents
and child. Using the medical analogy, it may be wise for early
childhood educators to routinely sample the words spoken by a child. If
the child's vocabulary does not keep pace with developmental norms, then
more intensive kinds of education should follow. The point is that
vocabulary can be easily assessed in ways that are reliable and valid
and this score can be used to predict how well a child will do in
school. Even though it may not predict with absolute certainty it is an
effective measure because it is directly related to something that is
broadly valued in the culture, i.e., obtaining an education.
As stated in the beginning of this paper, science is ultimately a logic
system for assuring that the results obtained from an experiment are
the result of variables identified by the experimenter rather than some
uncontrolled, unidentified variable. The general logic of this
enterprise is to show that when subjects are exposed to the
experimental variable, different results are obtained than when the
variable is not present or when the subjects are exposed to some other
variable. It is necessary to demonstrate that these results can be
repeated across many instances. To strengthen the argument that the
experimental variable resulted in the change, there are several basic
steps that distinguish science from other enterprises. The first step
is to assure that the individuals being exposed to the variable and the
individuals who are not exposed are equivalent at the beginning of the
study. Any differences between these two groups make it impossible to
discern if the results of the experiment are a function of the
experimental variable or the fact that the two groups were different
prior to the experiment starting.
The second step is to assure that only the experimental variable is
changed. If other variables are changed at the same time as the
experimental variable, it becomes very difficult to determine exactly
what produced the result: the experimental variable or one of the
other changes that occurred.
Closely related to changing only one variable at a time is the
systematic measurement of the phenomenon of interest. The phenomenon
should be measured using a reliable measurement instrument under very
similar conditions such as time of day, type of activity, time period
between measures being taken, etc. To the extent that there are
differences in the measurement process, the interpretation of the
obtained data becomes more problematic. It is not clear whether the
changes are the result of the experimental variable or of differences
in the measurement process.
Finally, it is important to repeatedly measure the impact of the
experimental variable. This will be discussed in greater detail below,
but it can be accomplished in two different ways. The most common
method is to expose a large number of different individuals to the
variable and have a comparably sized, equivalent group that is not
exposed to the variable. Any observed differences
between the two groups are assumed to be a function of the experimental
variable. The second method is to systematically expose a small group
of individuals to the experimental variable on repeated occasions that
systematically alternate with these same individuals not being exposed
to the variable. This repeated exposure of the experimental variable
allows the researcher to assess the stability of the change. If each
time the subject is exposed to the experimental variable there is a
different level of performance than when not exposed then it can be
concluded that the experimental variable accounts for the differences.
There are two general strategies for making this demonstration. The
first method, and the most common in education, is to repeat the
demonstration across a large number of students. For purposes of this
discussion, this method will be referred to as group design. The
second method is to repeatedly expose a small number of students to
the experimental variable. This approach will be referred to as single
subject design. Each has its own set of requirements for the
demonstration of proof, and each has limitations.
Methods of Group Design:
With group design approaches to research, a large group of subjects is
divided into two or more groups. One group will receive the
experimental intervention;
the second group will not. Additional groups may be necessary depending
on the exact experimental question. With group design research, it is
important to demonstrate that the groups are equal prior to the
beginning of the experiment. As stated previously, any differences
between groups prior to the experiment make interpretation of the
results more complicated because it is less clear that the results are
a function of the experimental variable rather than the differences
between the groups.
The general method with group design research
is to assess student performance prior to the beginning of the
experiment and then again at specific points during the experiment. On
some occasions, the assessments are completed only pre-experiment and
post-experiment. Using a large number of subjects in the experiment
satisfies the requirement of repeated demonstrations of the effect of
the experimental variable. To analyze the effect of the experimental
variable, a variety of statistical methods are used to discern if there
are differences between the experimental and control groups. All of the
statistical methods are ultimately statistical averaging methods that
make it very difficult to determine how any given student performed as
a function of exposure to the experimental variable. To determine
whether the differences between groups were a result of the experiment
or a result of random variability, levels of statistical significance
are assigned. Generally, the least stringent acceptable level of
statistical significance is .05. This means that, if the experimental
variable had no real effect, a difference at least this large would be
expected to arise by chance in no more than 5% of repetitions of the
experiment.
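One way to make the comparison against random variability concrete is a permutation test, sketched below in Python. It is offered as an illustration rather than the method any particular study uses: the scores are invented, the function name is hypothetical, and the number of shuffles is an arbitrary choice. The idea is to shuffle the group labels many times and count how often chance alone produces a difference between group means at least as large as the one observed.

```python
import random
from statistics import mean


def permutation_p_value(treated, control, n_shuffles=5000, seed=0):
    """Share of random relabelings whose mean difference is at least as
    large (in absolute value) as the observed difference."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(treated) - mean(control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend group membership was arbitrary
        diff = abs(mean(pooled[:n_treated]) - mean(pooled[n_treated:]))
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_shuffles


# Hypothetical post-test scores for an experimental and a control group.
treated_scores = [78, 85, 82, 90, 88, 84, 91, 79, 86, 87]
control_scores = [70, 75, 72, 80, 68, 74, 77, 71, 73, 76]

p_value = permutation_p_value(treated_scores, control_scores)
print("significant at .05:", p_value < 0.05)
```

A small value says only that chance is an unlikely explanation for the group difference; it says nothing about how any individual student responded.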
One of the difficulties of using statistical averaging procedures is
that the performance of subjects in the experimental group can vary
very widely while the results remain statistically significant. With a
group of 30 subjects it would be possible to have statistically
significant results if 10 of the subjects drastically improved, 10 of
the subjects had no change as a result of the experimental variable,
and 10 of the subjects actually performed worse than they did on the
pre-test. The use of statistical averaging and reliance on tests of
statistical significance obscure the analysis of individual
performance.
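The 30-subject scenario can be checked with a short calculation. The sketch below is a simplification under stated assumptions: the gain scores (post-test minus pre-test) are invented, and a one-sample t-test on those gains stands in for whatever analysis a real study would use. The critical value 2.045 is the two-tailed cutoff for a significance level of .05 with 29 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical gain scores: 10 subjects improve sharply, 10 do not change,
# and 10 score slightly worse than on the pre-test.
gains = [20] * 10 + [0] * 10 + [-3] * 10

# One-sample t statistic: mean gain divided by its standard error.
t_statistic = mean(gains) / (stdev(gains) / sqrt(len(gains)))

T_CRITICAL = 2.045  # two-tailed, alpha = .05, df = 29
significant = abs(t_statistic) > T_CRITICAL

improved = sum(1 for g in gains if g > 0)
unchanged = sum(1 for g in gains if g == 0)
worse = sum(1 for g in gains if g < 0)

print(f"t = {t_statistic:.2f}, significant: {significant}")
print(f"improved: {improved}, unchanged: {unchanged}, worse: {worse}")
```

With these numbers the average gain is statistically significant even though only a third of the subjects improved, which is exactly the pattern an average conceals.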
These concerns aside, randomized
trials are considered the gold standard for research. In this
particular approach subjects are randomly assigned to either the
experimental group or the control group. Usually large numbers of
subjects are in each group so that the differential impact of a few
exceptional performances can easily be averaged out. This approach is
an exceptionally powerful method for conducting program evaluation
research.
In education, we often want to know not only what works
but also what works best. A randomized trial effectively answers this
question. In this instance, the number of groups would depend on the
number of different programs being evaluated. The randomized trial
allows for direct comparison of the different programs and, if all goes
well, an unambiguous interpretation of the results. An example of this
type of research would be comparing different reading programs to
evaluate which one produces the best initial outcomes for students. In
a randomized trial, all first grade students in a particular school
district who meet criteria for inclusion into the study would be
randomly assigned to one of the reading programs. Students who can
already read at a particular level might be excluded from the study
because their results might obscure the effects of the reading program.
To assure that the groups are equal at the beginning of the experiment
all students would take the same pre-test under the same testing
conditions. Assuming that the groups are equivalent at the beginning of
the experiment, the students might be assessed at the end of the first
semester and again at the end of the year. Once all of the data have
been collected, the scores are aggregated and appropriate statistical
analyses are performed. Any differences that are found between the
groups can reasonably be attributed to the differences between the
reading programs that were evaluated.
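The random assignment step in this example can be sketched as follows. The student identifiers, program names, and function name are hypothetical, and a real study would also apply the inclusion criteria and compare pre-test scores across groups before proceeding.

```python
import random


def randomly_assign(student_ids, programs, seed=None):
    """Shuffle the students, then deal them out to programs in turn so
    that group sizes differ by at most one."""
    rng = random.Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    assignment = {program: [] for program in programs}
    for index, student in enumerate(shuffled):
        assignment[programs[index % len(programs)]].append(student)
    return assignment


# Hypothetical roster of eligible first grade students and reading programs.
students = [f"student_{n:03d}" for n in range(1, 91)]
programs = ["Program A", "Program B", "Program C"]

groups = randomly_assign(students, programs, seed=42)
for program, members in groups.items():
    print(program, len(members))
```

Because chance alone decides who lands in which group, pre-existing differences among students tend to balance out across programs as the groups get larger.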
While the randomized trial is the gold standard, it is not without its
problems. As has been noted earlier, aggregating scores may obscure
important differences between individuals. A second concern is that
from a logistical perspective it is very difficult and time consuming
to run randomized trials. To get around this concern,
quasi-experimental designs have been developed that are not as
powerful in determining differences between groups but still
contribute to understanding. In a quasi-experimental design, subjects
in the groups are matched on some relevant dimensions rather than
randomly assigned. In our evaluation of
reading programs it is difficult to truly randomize assignment of all
first grade students because they are grouped in classrooms and they
would all have to move to different groups for reading. This causes
some important logistical problems. In a quasi-experimental design two
first grade classrooms might be matched on a number of variables such
as gender, ethnic background, socio-economic status, and performance on
the pre-test. If these two classrooms are equivalent at the beginning
of the study then any differences between the two classrooms at the end
of the study are likely to be the result of the differences in the
reading program. While it is likely, it cannot be stated with the same
degree of certainty as with a randomized trial. The differences might
be a result of the differences in the teacher experience rather than
the reading program. Generally, what is lost in quasi-experimental
designs is a degree of rigor and a resultant loss of confidence in the
experimental results.
Just as in randomized trials, statistical averaging and tests of
statistical significance are the methods of analysis in
quasi-experimental designs. The same problems exist as with the
randomized trial. The advantage of the quasi-experimental design is
that logistically these studies are generally easier to implement so
more research can actually be completed in a given area.
Methods of Single Subject Design:
With group design research, large numbers of subjects are used to satisfy
the requirement of repeated demonstrations of the experimental effect.
In single subject designs, the effect is repeatedly demonstrated with a
small number of subjects. In this case, instead of a group of subjects
who are similar to the experimental subjects but are not exposed to the
experimental variable being used as a control group, the subject serves
as his/her own control. In single subject design, repeated measures are
obtained with the subject exposed to the experimental variable and
compared with repeated measures when the subject is not exposed to the
experimental variable. The exposure to the experimental variable is
introduced and removed in a systematic manner so that any differences
between the performance when exposed to the experimental variable and
when not exposed can be attributed to the experimental variable rather
than some other unidentified variable. The strength of the method lies
in the repeated, systematic alternation between experimental conditions
and the baseline or non-experimental conditions. One of the major
strengths of this approach is that the same subject is used in both the
experimental and control conditions. No inference about the equivalence
of subjects is necessary.
A second strength of single subject methods is that the reliance on
fewer subjects and repeated measures allows the impact of the
experimental variable to be known much sooner. Instead of waiting
until the experiment is over to analyze the data, data are analyzed as
they are obtained. Instead of relying on
statistical analysis to interpret the results, trends in the data are
used to evaluate the impact of the experimental variable. Generally,
the results are interpreted by a visual analysis of the data displayed
on a graph.
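The logic of alternating baseline and experimental conditions can be sketched with a small reversal (A-B-A-B) example. The session scores and function names below are hypothetical, and comparing phase averages is only a crude numerical stand-in for the visual analysis of graphed data that researchers actually rely on.

```python
from statistics import mean

# Hypothetical session scores for one student across four phases.
phases = [
    ("baseline", [12, 14, 13, 12, 13]),
    ("intervention", [20, 23, 25, 26, 27]),
    ("baseline", [15, 14, 13, 14, 13]),
    ("intervention", [24, 26, 27, 28, 29]),
]


def phase_means(phases):
    """Average score within each phase, in the order the phases occurred."""
    return [(label, mean(scores)) for label, scores in phases]


def effect_replicated(phases):
    """True if every intervention phase averages higher than each
    neighboring phase (assumed to be a baseline phase)."""
    means = phase_means(phases)
    for i, (label, value) in enumerate(means):
        if label != "intervention":
            continue
        neighbors = [means[j][1] for j in (i - 1, i + 1) if 0 <= j < len(means)]
        if any(value <= baseline for baseline in neighbors):
            return False
    return True


print(phase_means(phases))
print("effect replicated:", effect_replicated(phases))
```

Each return to baseline followed by a renewed change under the intervention is one more demonstration that the experimental variable, rather than something else, accounts for the difference.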
Single subject methodologies are powerful methods
that can fill the knowledge gap while waiting for the larger group
designs to be completed. Given the mandate in No Child Left Behind to
use scientifically based research for improving instruction, the
relative paucity of randomized trials evaluating instructional
practices, and the length of time and effort that group designs
require, single subject designs are a valuable tool for educators
answering questions about instructional methods in the short run.
An important advantage of single subject designs is that they do not
obscure the effects on the individual subject but rather highlight how
individuals perform. If there is considerable variability across
subjects in their response to the experimental variable then additional
research is required to determine the source of this variability rather
than having the effects obscured by group design methods.
One limitation of single subject design methods is that by relying on a
small number of subjects it is difficult to know how generalizable the
obtained effects are (external validity). Group designs are generally
considered to better address questions of generalizability, but even
with these designs no single study can establish the generalizability
of results. Just as with single subject designs,
multiple studies are required until there is a preponderance of
evidence that the obtained results hold across a wide range of subjects
and settings.
There is clearly a trade-off of costs and benefits
between group and single subject design methods. Group designs obscure
the effects on individuals. Single subject designs address issues of
effects on individuals but make it difficult to answer the question
about how universal these results might be. While group designs are
generally considered to have greater external validity, there is still
the problem of being able to assume that the intervention will be
beneficial for a given student. The external validity of group designs
allows us to generalize to populations but not to individuals so it is
still necessary to evaluate the impact of an intervention for a
specific student. Single subject designs are appropriate for evaluating
the impact of an intervention on a particular student and with
sufficient replications across subjects we can generalize to
populations of students.
One way to think about this problem is that single subject designs are
especially effective in discovering the impact a particular
experimental variable might have. Group designs are especially
effective in program evaluation when the questions are how universal
the obtained results are and which interventions produce the greatest
benefits relative to other interventions that address the same
problem. In some respects, the two approaches answer different
questions and should be chosen according to the question being asked.
This treatment of both group designs and single subject designs is
admittedly brief. Entire textbooks and graduate level
courses at universities are devoted to the issues identified in this
discussion. It is hoped that this discussion gives a better
understanding of the fundamental process of science and allows the
reader to better evaluate claims of effectiveness and perhaps design
their own experiments.
Given the limitations of each type of design and the ultimate
responsibility of education administrators to assure that each child
is benefiting from the instruction being offered, it is important that
the performance of each student be systematically monitored and that
changes in instruction be made if warranted. It is not sufficient to measure
student performance once a year during high stakes testing because an
entire year of ineffective instruction will have occurred before low
performance is identified through the high stakes testing. A second
difficulty is that the data from high stakes testing are aggregated so
educational administrators cannot determine which students are doing
well and which require different instructional methods. The type of
measurement that is warranted to evaluate the impact of instruction for
a particular student requires direct, frequently occurring, systematic
measurement. Curriculum based measurement strategies have been well
developed for reading, math, and spelling. This approach to measurement
allows the educational decision makers to get regular feedback about
how an individual student is progressing, and if progress is
insufficient, changes in instruction can occur as soon as difficulty is
identified. The advantage to the student is that minimum time is spent
receiving ineffective instruction. The advantage to the educational
decision makers is that more students will make progress over the
course of a year.
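The kind of frequent monitoring described here can be sketched as a simple trend check. Everything specific in the example is an assumption made for illustration: the weekly words-correct-per-minute scores, the expected growth of 1.5 words per week, and the function names are hypothetical and are not taken from any published curriculum based measurement norm.

```python
def weekly_growth(scores):
    """Least-squares slope of scores against week number (0, 1, 2, ...).
    Requires at least two scores."""
    n = len(scores)
    weeks = range(n)
    mean_week = sum(weeks) / n
    mean_score = sum(scores) / n
    numerator = sum((w - mean_week) * (s - mean_score) for w, s in zip(weeks, scores))
    denominator = sum((w - mean_week) ** 2 for w in weeks)
    return numerator / denominator


def needs_instructional_change(scores, expected_growth):
    """Flag a student whose observed growth falls short of the expected rate."""
    return weekly_growth(scores) < expected_growth


# Hypothetical weekly reading probes for one student (words correct per minute).
words_correct_per_minute = [22, 23, 22, 24, 23, 24]
EXPECTED_GROWTH = 1.5  # assumed goal in words per week; not a published norm

flagged = needs_instructional_change(words_correct_per_minute, EXPECTED_GROWTH)
print("change instruction:", flagged)
```

With weekly data, a flat trend becomes visible after a few weeks rather than after a full year of instruction.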
One of the difficulties in moving toward evidence-based decision-making
is that for many of the decisions that administrators have to make
there simply are not a sufficient number of well-designed, rigorous
studies to guide the decision. Rather than leave the administrator
with only personal judgment and opinion, the decision-maker is
encouraged to review the available evidence, even if it is less
rigorous, to make judgments. For example, in the absence of
well-controlled, randomized trials, if there is a series of
quasi-experimental studies that answer the question then certainly
these studies should be considered when making the decision.
In the absence of a series of well-controlled, quasi-experimental
studies, the decision maker has to move to other less rigorous sets of
data such as a single quasi-experimental piece of research and perhaps
a series of correlational studies and case studies. The basic notion
is that the decision maker should take advantage of the existing data
even if they do not always meet the most rigorous standards. The caveat is that
the certainty or confidence in the correctness of the decision should
be much more conservative when the data are less rigorous. Ultimately,
there may be questions for which there is no research. In those
instances, the decision maker will have to rely on the judgment and
opinion of others but the confidence in the decision should be very low
and once data have been accumulated, decisions should be reviewed and
revised to fit the best available evidence.
Science is, by nature, a conservative enterprise, but once studies have
been completed that meet acceptable standards of proof, comfort with
the decisions can be much higher. Just as science is conservative,
when using science as the foundation for decision making one must
accept that as new data become available decisions should be adjusted
accordingly; however, it is important to note that decisions should
not be adjusted on the basis of a single study if there is a
preponderance of data to support a decision. The conservative nature
of science requires that when a single study yields data that
disconfirm a large body of data, that study should be replicated by
individuals who have no affiliation with the first group to verify the
obtained results. If the results still hold, then additional work is
required to determine the processes that account for the new data as
well as the original data. The conservative and self-correcting
features of science are its greatest strengths, but for the public
school administrator trying to use empirical evidence to guide
decisions, these strengths can be seen as limitations because the
answers are not readily available. Using the decision rule of
pragmatism to make decisions with the best available evidence, even
though it may be flawed, can help the administrator rely on the methods
of science to guide decisions.