Standards of Proof

Executive Summary

  • Science is not a fixed set of practices but rather a highly developed logic system designed to rule out alternative explanations for obtained outcomes.

Measurement, Reliability, and Validity

  • Measurement is a critical component of rigorous research.
  • Two primary concerns for measurement are reliability and validity.
  • Critical to educational reform is measuring the right things and measuring them well.

Selecting Important Goals

  • The selection of the best indicators of education is ultimately a values question.
  • Once the goals of education have been defined, science can provide meaningful measures of these goals.

Fundamentals of Scientific Proof

  • There are specific steps that are required for a demonstration of scientific proof.

Demonstrations of Experimental Effects

  • There are several methods for demonstrating the effects of an educational intervention, each with strengths and limitations.

Accountability for Individual Students

  • Ultimately, it is the responsibility of education to demonstrate that each child is benefiting from educational services.

Science and Pragmatism

  • Educators will have to make decisions based on the best evidence available, which requires a pragmatic approach to evaluating data.

Standards of Proof

One of the mandates of No Child Left Behind is that educational institutions will use “scientifically-based research” to guide their decisions about which interventions to implement. On the surface, it certainly seems reasonable that educators spend taxpayers' money on those procedures that have been demonstrated to be effective; however, it is not clear that all decision makers share an understanding of what constitutes evidence-based practice or have sufficient training to read research and determine whether it meets the minimum standards of proof for science. The purpose of this paper is to provide a brief overview of science and to suggest how the requirement to rely on scientifically based research can be implemented by public school educators.

It is important to understand that science is not a rigid set of practices but rather a highly refined logic system for demonstrating that the results from research are a function of the experimental procedures rather than some other unexamined variable. The particular set of practices in a given piece of research is, in part, determined by the phenomenon being studied and the question being asked. Generally, it is more productive to think of the practices of science as falling along a continuum of rigor. The challenge for the consumer is to determine whether an acceptable level of rigor has been achieved in a given instance. There are certain defining features of science that can be useful in guiding the consumer of science.


The first defining feature of science is the reliance upon objective data to draw conclusions about the phenomenon of interest. While it seems straightforward to rely on data to inform judgments, there are a number of issues to be considered when evaluating data. There are two fundamental questions with regard to the data. The first question relates to the reliability of the data. Data are reliable if a measuring device produces similar data when repeatedly exposed to the same conditions. A simple example is a thermometer. A thermometer is considered reliable if it yields the same temperature reading when repeatedly exposed to a fixed temperature. For pragmatic reasons it is helpful if the thermometer is reliable across a range of temperatures. Similarly, it is deemed reliable if it yields the same reading as a second thermometer under the same conditions. To the extent that data are reliable, confidence in the data increases.

The second question regarding the data relates to the validity of the data. Data are considered valid if they are a measure of what the experimenter thought was being measured. Consider the thermometer example again. Generally, we take a person's body temperature to determine if they are ill. If the person sits under a heat lamp with the thermometer in her mouth, the resulting data are not valid, since it is just as likely that what was being measured was the temperature from the heat lamp rather than body temperature.

The two measures of reliability and validity are related. It is axiomatic that data can be reliable without being valid but cannot be valid without being reliable. The tests for both reliability and validity have to be satisfied when conducting research. There are specific practices that increase the reliability and validity of data collected during research.
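The relationship between reliability and validity can be made concrete with a minimal Python sketch of the thermometer example. All readings, the reference temperature, and the two tolerance thresholds below are invented for illustration; the spread of repeated readings stands in for reliability and the distance from the quantity of interest stands in for validity, which is only one simplified way of operationalizing the two ideas.

```python
from statistics import mean, pstdev

# Hypothetical values (degrees Fahrenheit), invented for illustration only.
TRUE_BODY_TEMP = 98.6
normal_readings = [98.5, 98.7, 98.6, 98.6, 98.5]           # thermometer used normally
heat_lamp_readings = [103.1, 103.0, 103.2, 103.1, 103.0]   # same person under a heat lamp
erratic_readings = [97.0, 100.2, 98.9, 96.5, 100.4]        # a faulty thermometer


def spread(readings):
    """Reliability proxy: how much repeated readings disagree with one another."""
    return pstdev(readings)


def bias(readings, true_value):
    """Validity proxy: distance between the average reading and the quantity of interest."""
    return abs(mean(readings) - true_value)


def describe(label, readings, max_spread=0.2, max_bias=0.5):
    """Return (reliable, valid); the thresholds are arbitrary assumptions."""
    reliable = spread(readings) <= max_spread
    # Data cannot be valid without first being reliable.
    valid = reliable and bias(readings, TRUE_BODY_TEMP) <= max_bias
    print(f"{label}: reliable={reliable}, valid={valid}")
    return reliable, valid


describe("normal use", normal_readings)
describe("under heat lamp", heat_lamp_readings)
describe("faulty thermometer", erratic_readings)
```

The heat lamp readings agree closely with one another, so they count as reliable, yet they do not measure body temperature, so they are not valid. The faulty thermometer averages out near the true temperature but is not treated as valid because its individual readings cannot be trusted.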

Judging Reliability

Determining Validity

Selecting Important Goals

The ultimate questions for any measurement-based system are these: are we measuring the right things, and are we measuring them well? These questions can only be partially answered by empirical methods. The first question (are we measuring the right things) is, to a large extent, a question of what is valued. When trying to evaluate education, it is often easy to agree in a very broad sense on what we mean when we say we want our children to be well-educated. Once we try to define the term more carefully, it becomes clear that it carries very different meanings. Ultimately, what is meant reflects the values of the person defining it. For some, being well-educated might mean having a set of skills that are directly applicable to the workplace. For others, it might mean having a broad, classic education in literature, history, science, and arts with relatively little emphasis on workplace skills. The definition of well-educated will determine what is measured to evaluate the impact of the educational system. Both are perfectly legitimate goals for education, and each can be evaluated only in the context of its stated goals. Before any measurement system can be implemented, it will be necessary to have some agreement about what the goals are. This is usually accomplished through some kind of social process that involves gaining input from the various stakeholders. The extent to which large segments of the population agree with the goals of the educational system becomes a measure of the social validity of the goals.

Once the goals have been determined, the questions about the appropriateness of the measurement system must be answered. Fortunately, there are empirical methods to assist in answering these questions. As discussed above, the primary issues with respect to how well we are measuring what we are interested in are issues of reliability and validity.

Assuming reliable and valid methods for measurement have been established, it is necessary to determine whether the measures selected effectively predict performance on the broader goal. It is possible to have reliable and valid measures of performance and still find that the unit measured produces results that are of limited social value.

The question of social value is also contextual in nature. The question is often whether an outcome is valuable relative to other means of accomplishing a goal. The function of the SAT is to predict how well a prospective student will do in college. There are really two questions to be answered: (1) how well does the SAT predict, and (2) how well do alternative measures predict? It may well be that a given instrument does not predict particularly well but still predicts far better than alternative methods. The question then becomes whether it predicts well enough to be used even though it is a less than perfect instrument. In medicine, there is a wide variety of measures that can be used to predict the health of the patient. Some of these measures are far better at predicting, but in many cases they are also far more expensive. It is untenable for all patients to be assessed with the highly accurate but very expensive methods. The goal is to find the most highly predictive measure that has the lowest costs associated with it. For many measures of health, the body mass index (BMI) is a good predictor of health. It is simple to obtain and has a high enough predictive validity that it is a good broad measure of health. Similarly, blood pressure is an effective predictor of health and simple to obtain. Neither of these measures is a direct measure of heart functioning, but they have enough predictive validity that if someone has a high BMI or high blood pressure, it may be necessary for the physician to use more invasive and expensive measures to assess patient health. The invasive and expensive procedures are reserved for those who have been identified as being at risk.

Similarly, in education there are some measures that are relatively easy to obtain, that have high reliability and validity, and that predict how a student will do subsequently in school. Research by Hart and Risley (1995) has identified that the size of a child's vocabulary by age 3 predicts how well a child will perform in later grades in elementary school. Equally important, they have identified the types of experiences that are most likely to result in a well-developed vocabulary. The measure that Hart and Risley used was the number of new words spoken during weekly observations of interactions between parents and child. Using the medical analogy, it may be wise for early childhood educators to routinely sample the words spoken by a child. If the child's vocabulary does not keep pace with developmental norms, more intensive kinds of education should follow. The point is that vocabulary can be easily assessed in ways that are reliable and valid, and this score can be used to predict how well a child will do in school. Even though it may not predict with absolute certainty, it is an effective measure because it is directly related to something that is broadly valued in the culture, i.e., obtaining an education.

Fundamentals of Scientific Proof

As stated in the beginning of this paper, science is ultimately a logic system for assuring that the results obtained from an experiment are the result of variables identified by the experimenter rather than some uncontrolled, unidentified variable. The general logic of this enterprise is to show that when subjects are exposed to the experimental variable, different results are obtained than when the variable is not present or when the subjects are exposed to some other variable. It is necessary to demonstrate that these results can be repeated across many instances. To strengthen the argument that the experimental variable resulted in the change, there are several basic steps that distinguish science from other enterprises. The first step is to assure that the individuals being exposed to the variable and the individuals who are not exposed are equivalent at the beginning of the study. Any differences between these two groups make it impossible to discern whether the results of the experiment are a function of the experimental variable or of the fact that the two groups were different before the experiment started.

The second step is to assure that only the experimental variable is changed. If other variables are changed at the same time as the experimental variable, it becomes very difficult to determine exactly what produced the result: the experimental variable or one of the other changes that occurred.

Closely related to changing only one variable at a time is the systematic measurement of the phenomenon of interest. The phenomenon should be measured using a reliable measurement instrument under very similar conditions such as time of day, type of activity, time period between measures being taken, etc. To the extent that there are differences in the measurement process, the interpretation of the obtained data becomes more problematic. It is not clear whether the changes are the result of the experimental variable or of differences in the measurement process.

Finally, it is important to repeatedly measure the impact of the experimental variable. This will be discussed in greater detail below, but it can be accomplished in two different ways. The most common method is to expose a large number of different individuals to the variable and have a comparably sized, equivalent group that is not exposed to the variable. Any observed differences between the two groups are assumed to be a function of the experimental variable. The second method is to systematically expose a small group of individuals to the experimental variable on repeated occasions that systematically alternate with these same individuals not being exposed to the variable. This repeated exposure to the experimental variable allows the researcher to assess the stability of the change. If each time the subject is exposed to the experimental variable there is a different level of performance than when not exposed, it can be concluded that the experimental variable accounts for the differences.

Demonstrations of Experimental Effects

There are two general strategies for making this demonstration. The first method, and the most common in education, is to repeat the demonstration across a large number of students. For purposes of this discussion this method will be referred to as group designs. The second method is to repeatedly expose a small number of students to the experimental demonstration. This approach will be referred to as single subject design. Each has its own set of requirements for the demonstration of proof and each has limitations.

Methods of Group Design:

With group design approaches to research, a large group of subjects is divided into two or more groups. One group will receive the experimental intervention; the second group will not. Additional groups may be necessary depending on the exact experimental question. With group design research, it is important to demonstrate that the groups are equal prior to the beginning of the experiment. As stated previously, any differences between groups prior to the experiment make interpretation of the results more complicated because it is less clear that the results are a function of the experimental variable rather than the differences between the groups.

The general method with group design research is to assess student performance prior to the beginning of the experiment and then again at specific points during the experiment. On some occasions, the assessments are completed only pre-experiment and post-experiment. Using a large number of subjects in the experiment satisfies the requirement of repeated demonstrations of the effect of the experimental variable. To analyze the effect of the experimental variable, a variety of statistical methods are used to discern if there are differences between the experimental and control groups. All of the statistical methods are ultimately statistical averaging methods that make it very difficult to determine how any given student performed as a function of exposure to the experimental variable. To determine whether the differences between groups were a result of the experiment or a result of random variability, levels of statistical significance are assigned. Generally, the least stringent level of statistical significance that is accepted is .05. This means that, if the experimental variable truly had no effect, a difference at least as large as the one observed would be expected to arise from random variability in no more than 5% of repetitions of the experiment.
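The logic behind a level of statistical significance can be illustrated with a permutation test, sketched below in Python. This is one simple technique chosen for illustration, not necessarily the method used in any particular study, and the test scores are invented. The function name permutation_p_value and its parameters are hypothetical. The idea is to shuffle the group labels many times and count how often chance alone produces a difference between group averages at least as large as the one actually observed.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(treatment, control, n_shuffles=10_000, seed=0):
    """Estimate how often a difference in group means at least as large as the
    observed one arises when group labels are assigned purely at random."""
    rng = random.Random(seed)
    observed = abs(mean(treatment) - mean(control))
    pooled = list(treatment) + list(control)
    n_treat = len(treatment)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # randomly relabel who is "treatment" and who is "control"
        diff = abs(mean(pooled[:n_treat]) - mean(pooled[n_treat:]))
        if diff >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimate from ever being exactly zero.
    return (at_least_as_large + 1) / (n_shuffles + 1)


# Hypothetical post-test scores, invented for illustration only.
treatment_scores = [58, 61, 57, 63, 60, 59, 62, 56]
control_scores = [52, 48, 55, 50, 47, 53, 49, 51]

p = permutation_p_value(treatment_scores, control_scores)
print(f"estimated p-value: {p:.4f}; significant at .05: {p < 0.05}")
```

A small value indicates that random variability alone rarely produces a gap this large. Note that the calculation works entirely with group averages and says nothing about how any individual student responded.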

One of the difficulties of using statistical averaging procedures is that there can be very wide ranging performance by subjects in the experimental group and still have statistically significant results. With a group of 30 subjects it would be possible to have statistically significant results if 10 of the subjects drastically improved, 10 of these subjects had no change as a result of the experimental variable, and 10 of the subjects actually performed worse than they did on the pre-test. The use of statistical averaging and reliance on tests of statistical significance obscures the analysis of individual performance.
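The 30-subject scenario can be checked with a short calculation. The Python sketch below uses invented gain scores (post-test minus pre-test) and a paired t statistic, which is one common averaging method and is assumed here only for illustration. The critical value of 2.045 is the standard two-tailed value for a .05 significance level with 29 degrees of freedom.

```python
import math
from statistics import mean, stdev

# Hypothetical gain scores (post-test minus pre-test), invented for illustration:
# 10 subjects improve drastically, 10 do not change, 10 do worse.
gains = [30] * 10 + [0] * 10 + [-5] * 10

# Two-tailed critical t for a .05 significance level with 29 degrees of freedom.
T_CRITICAL_DF29 = 2.045


def paired_t(gain_scores):
    """t statistic for the mean gain: mean divided by its standard error."""
    n = len(gain_scores)
    return mean(gain_scores) / (stdev(gain_scores) / math.sqrt(n))


t_value = paired_t(gains)
improved = sum(1 for g in gains if g > 0)

print(f"t = {t_value:.2f}, significant at .05: {t_value > T_CRITICAL_DF29}")
print(f"subjects who actually improved: {improved} of {len(gains)}")
```

With these numbers the t statistic is about 2.9, which exceeds the critical value, so the group result is statistically significant even though only one third of the subjects improved.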

These concerns aside, randomized trials are considered the gold standard for research. In this particular approach subjects are randomly assigned to either the experimental group or the control group. Usually large numbers of subjects are in each group so that the differential impact of a few exceptional performances can easily be averaged out. This approach is an exceptionally powerful method for conducting program evaluation research.
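Random assignment itself is mechanically simple, as the following Python sketch suggests. The function name randomly_assign, the group names, and the student identifiers are hypothetical; an actual study would also need inclusion criteria and a pre-test to confirm that the resulting groups are equivalent.

```python
import random


def randomly_assign(subject_ids, group_names=("experimental", "control"), seed=None):
    """Shuffle the subjects, then deal them out to the groups in turn so that
    group sizes differ by at most one."""
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    groups = {name: [] for name in group_names}
    for position, subject in enumerate(shuffled):
        groups[group_names[position % len(group_names)]].append(subject)
    return groups


# Hypothetical roster of 60 student identifiers.
roster = [f"student_{i:02d}" for i in range(1, 61)]
groups = randomly_assign(roster, seed=42)

for name, members in groups.items():
    print(name, len(members), members[:3])
```

Because chance alone decides who lands in which group, any pre-existing differences among subjects are expected to be spread evenly across groups as the number of subjects grows. Passing more than two names in group_names extends the same procedure to comparisons among several programs.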

In education, we often want to know not only what works but also what works best. A randomized trial effectively answers this question. In this instance, the number of groups would depend on the number of different programs being evaluated. The randomized trial allows for direct comparison of the different programs and, if all goes well, an unambiguous interpretation of the results. An example of this type of research would be comparing different reading programs to evaluate which one produces the best initial outcomes for students. In a randomized trial, all first grade students in a particular school district who meet criteria for inclusion into the study would be randomly assigned to one of the reading programs. Students who can already read at a particular level might be excluded from the study because their results might obscure the effects of the reading program. To assure that the groups are equal at the beginning of the experiment all students would take the same pre-test under the same testing conditions. Assuming that the groups are equivalent at the beginning of the experiment, the students might be assessed at the end of the first semester and again at the end of the year. Once all of the data have been collected, the scores are aggregated and appropriate statistical analyses are performed. Any differences that are found between the groups can reasonably be attributed to the differences between the reading programs that were evaluated.

While the randomized trial is the gold standard, it is not without its problems. As has been noted earlier, aggregating scores may obscure important differences between individuals. A second concern is that, from a logistical perspective, it is very difficult and time consuming to run randomized trials. To get around this concern, quasi-experimental designs have been developed that are not as powerful in determining differences between groups but still contribute to understanding. In a quasi-experimental design, subjects in the groups are matched on some relevant dimensions rather than randomly assigned. In our evaluation of reading programs it is difficult to truly randomize assignment of all first grade students because they are grouped in classrooms and they would all have to move to different groups for reading. This causes some important logistical problems. In a quasi-experimental design, two first grade classrooms might be matched on a number of variables such as gender, ethnic background, socio-economic status, and performance on the pre-test. If these two classrooms are equivalent at the beginning of the study, any differences between the two classrooms at the end of the study are likely to be the result of the differences in the reading program. While it is likely, it cannot be stated with the same degree of certainty as with a randomized trial. The differences might be a result of differences in teacher experience rather than the reading program. Generally, what is lost in quasi-experimental designs is a degree of rigor and, as a result, some confidence in the experimental results.

Just as in randomized trials, statistical averaging and tests of statistical significance are the methods of analysis in quasi-experimental designs. The same problems exist as with the randomized trial. The advantage of the quasi-experimental design is that logistically these studies are generally easier to implement, so more research can actually be completed in a given area.

Methods of Single Subject Design:

With group design research, large numbers of subjects are used to satisfy the requirement of repeated demonstrations of the experimental effect. In single subject designs, the effect is repeatedly demonstrated with a small number of subjects. In this case, instead of a group of subjects who are similar to the experimental subjects but are not exposed to the experimental variable being used as a control group, the subject serves as his/her own control. In single subject design, repeated measures are obtained with the subject exposed to the experimental variable and compared with repeated measures when the subject is not exposed to the experimental variable. The exposure to the experimental variable is introduced and removed in a systematic manner so that any differences between the performance when exposed to the experimental variable and when not exposed can be attributed to the experimental variable rather than some other unidentified variable. The strength of the method lies in the repeated, systematic alternation between experimental conditions and the baseline or non-experimental conditions. One of the major strengths of this approach is that the same subject is used in both the experimental and control conditions. No inference about the equivalence of subjects is necessary.

A second strength of single subject methods is that the reliance on fewer subjects and repeated measures allows the impact of the experimental variable to be known much sooner. Instead of waiting for the experiment to be over to analyze the data, data are analyzed as they are obtained. Instead of relying on statistical analysis to interpret the results, trends in the data are used to evaluate the impact of the experimental variable. Generally, the results are interpreted by a visual analysis of the data displayed on a graph.
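The alternation between baseline and experimental conditions can be sketched in Python as follows. The data, the phase labels, and the function names are invented for illustration, and comparing phase averages is only a crude numerical stand-in for the visual analysis of a graph described above. The sketch assumes that phases strictly alternate and that higher scores are better.

```python
from statistics import mean

# Hypothetical data for one student (e.g., words read correctly per minute).
# Each entry: (phase label, intervention in place?, repeated measures).
phases = [
    ("A1 baseline", False, [21, 23, 22, 20, 22]),
    ("B1 intervention", True, [30, 33, 35, 34, 36]),
    ("A2 baseline", False, [25, 23, 24, 22, 23]),
    ("B2 intervention", True, [34, 36, 37, 38, 37]),
]


def phase_means(phases):
    """Average performance within each phase."""
    return [(label, mean(scores)) for label, _, scores in phases]


def effect_replicated(phases):
    """True if, at every change of condition, the intervention phase has a
    higher average than the adjacent baseline phase."""
    for (_, on_first, first), (_, on_second, second) in zip(phases, phases[1:]):
        if on_first == on_second:
            raise ValueError("phases must alternate between conditions")
        with_var, without_var = (first, second) if on_first else (second, first)
        if mean(with_var) <= mean(without_var):
            return False
    return True


for label, average in phase_means(phases):
    print(f"{label}: {average:.1f}")
print("effect replicated at every phase change:", effect_replicated(phases))
```

Here the same student serves as the control: performance rises each time the intervention is introduced and falls when it is withdrawn, which is the pattern that supports attributing the change to the experimental variable.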

Single subject methodologies are powerful methods that can fill the knowledge gap while waiting for the larger group designs to be completed. Given the mandate in No Child Left Behind to use scientifically based research for improving instruction, the relative paucity of randomized trials evaluating instructional practices, and the length of time and effort required by group designs, single subject designs are a valuable tool for educators for answering questions about instructional methods in the short run.

An important advantage of single subject designs is that they do not obscure the effects on the individual subject but rather highlight how individuals perform. If there is considerable variability across subjects in their response to the experimental variable then additional research is required to determine the source of this variability rather than having the effects obscured by group design methods.

One limitation of single subject design methods is that by relying on a small number of subjects it is difficult to know how generalizable the obtained effects are (external validity). Group designs are generally considered to better address questions of generalizability, but even with these designs no single study can address the generalizability of results. Just as with single subject designs, multiple studies are required until there is a preponderance of evidence that the obtained results hold across a wide range of subjects and settings.

There is clearly a trade-off of costs and benefits between group and single subject design methods. Group designs obscure the effects on individuals. Single subject designs address issues of effects on individuals but make it difficult to answer the question about how universal these results might be. While group designs are generally considered to have greater external validity, there is still the problem of being able to assume that the intervention will be beneficial for a given student. The external validity of group designs allows us to generalize to populations but not to individuals so it is still necessary to evaluate the impact of an intervention for a specific student. Single subject designs are appropriate for evaluating the impact of an intervention on a particular student and with sufficient replications across subjects we can generalize to populations of students.

One way to think about this problem is that single subject designs are especially effective in discovering the impact a particular experimental variable might have. Group designs are especially effective in program evaluation when the questions are how universal the obtained results are and which interventions produce the greatest benefits relative to other interventions that address the same problem. In some respects, the two different approaches answer different questions and should be used according to the question being asked.

This treatment of both group designs and single subject designs is admittedly brief. Entire textbooks and graduate level courses at universities are devoted to the issues identified in this discussion. It is hoped that this discussion gives a better understanding of the fundamental process of science and allows the reader to better evaluate claims of effectiveness and perhaps design their own experiments.

Accountability for Individual Students

Given the limitations of each type of design and the ultimate responsibility of education administrators to assure that each child is benefiting from the instruction being offered, it is important that the performance of each student is systematically monitored and that changes in instruction are made if warranted. It is not sufficient to measure student performance once a year during high stakes testing because an entire year of ineffective instruction will have occurred before low performance is identified through the high stakes testing. A second difficulty is that the data from high stakes testing are aggregated, so educational administrators cannot determine which students are doing well and which require different instructional methods. Evaluating the impact of instruction for a particular student requires direct, frequently occurring, systematic measurement. Curriculum-based measurement strategies have been well developed for reading, math, and spelling. This approach to measurement allows educational decision makers to get regular feedback about how an individual student is progressing and, if progress is insufficient, changes in instruction can occur as soon as a difficulty is identified. The advantage to the student is that minimum time is spent receiving ineffective instruction. The advantage to the educational decision makers is that more students will make progress over the course of a year.
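Frequent measurement lends itself to a simple decision rule, sketched below in Python. The weekly scores, the target growth rate, and the minimum number of data points are all hypothetical assumptions chosen for illustration, and fitting a least-squares trend line is just one plausible way to summarize progress; it is not presented as the procedure of any particular curriculum-based measurement system.

```python
def weekly_growth(scores):
    """Least-squares slope of scores against week number (0, 1, 2, ...)."""
    n = len(scores)
    weeks = range(n)
    mean_week = sum(weeks) / n
    mean_score = sum(scores) / n
    numerator = sum((w - mean_week) * (s - mean_score) for w, s in zip(weeks, scores))
    denominator = sum((w - mean_week) ** 2 for w in weeks)
    return numerator / denominator


def needs_instructional_change(scores, target_growth_per_week, min_points=6):
    """Flag a student whose trend falls short of the target growth rate."""
    if len(scores) < min_points:
        return False  # too few measurements to judge a trend
    return weekly_growth(scores) < target_growth_per_week


# Hypothetical weekly scores for two students (e.g., words read correctly per minute).
student_a = [20, 21, 20, 22, 21, 22, 23, 22]
student_b = [20, 22, 23, 25, 27, 28, 30, 32]
TARGET = 1.5  # assumed goal: gain of 1.5 per week

for name, scores in (("student A", student_a), ("student B", student_b)):
    flag = needs_instructional_change(scores, TARGET)
    print(f"{name}: growth {weekly_growth(scores):.2f} per week, change instruction: {flag}")
```

With weekly data, a student who is not progressing can be identified after a few weeks rather than after a full year of instruction.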

Science and Pragmatism

One of the difficulties in moving toward evidence-based decision-making is that, for many of the decisions that administrators have to make, there simply is not a sufficient number of well-designed, rigorous studies to guide the decision. Rather than being left with only personal judgment and opinion, the decision maker is encouraged to review the available evidence, even if it is less rigorous, to make judgments. For example, in the absence of well-controlled, randomized trials, if there is a series of quasi-experimental studies that answer the question, then certainly these studies should be considered when making the decision.

In the absence of a series of well-controlled, quasi-experimental studies, the decision maker has to move to other, less rigorous sets of data such as a single quasi-experimental piece of research and perhaps a series of correlational studies and case studies. The basic notion is that the decision maker should take advantage of the existing data even if they do not always meet the most rigorous standards. The caveat is that the certainty or confidence in the correctness of the decision should be much more conservative when the data are less rigorous. Ultimately, there may be questions for which there is no research. In those instances, the decision maker will have to rely on the judgment and opinion of others, but the confidence in the decision should be very low, and once data have been accumulated, decisions should be reviewed and revised to fit the best available evidence.

Science is, by nature, a conservative enterprise, but once studies have been completed that meet acceptable standards of proof, the comfort with the decisions can be much higher. Just as science is conservative, when using science as the foundation for decision making one must accept that as new data become available decisions should be adjusted accordingly; however, it is important to note that decisions should not be adjusted on the basis of a single study if there is a preponderance of data to support a decision. The conservative nature of science requires that when a single study yields data that disconfirm a large body of data, that study should be replicated by individuals who have no affiliation with the first group to verify the obtained results. If the results still hold, additional work is required to determine the processes that account for the new data as well as the original data. The conservative and self-correcting features of science are its greatest strengths, but for the public school administrator trying to use empirical evidence to guide decisions, these strengths can be seen as limitations because the answers are not readily available. Using the decision rule of pragmatism to make decisions with the best available evidence, even though it may be flawed, can help the administrator rely on the methods of science to guide decisions.