Going Beyond the NCLB-Era to Reduce Achievement Gaps

Authors: Denis Newman and Andrew P. Jaciw

Ending a Two-Decade-Old Research Legacy

As a research organization specializing in rigorous program evaluations, Empirical Education has, for 17 years, been developing methods for providing schools with useful and affordable information about the products they purchase. Working with Evidentally, Inc., we have recently been focusing on education technology (edtech) and related distance learning products. This has led us to move away from what have become the conventional methods of education research. There’s an urgency. With the COVID-19 crisis and school closures in full swing, use of edtech products is predicted to continue expanding even when the classrooms they have replaced come back into use. There is little information available about which edtech products work for whom and under what conditions, and this lack of evidence has been widely noted. This posting describes the rationale for a methodological solution to get schools the information they need.

The “modern” era of education research was established with No Child Left Behind (NCLB), the education law passed in 2002, which established the Institute of Education Sciences (IES). NCLB’s declaration of “scientifically-based research” sparked a major undertaking assigned to the IES. To meet NCLB’s goals for improvement, it was essential to overthrow and replace education research methods that lacked the practical goal of determining whether programs, products, practices, or policies moved the needle on an outcome of interest. These outcomes could be test scores or other measures, such as discipline referrals, that schools were concerned with.

The kinds of questions that the learning sciences of the time answered were different: for example, how do results on a test compare with performance on tasks outside the test? Or what is the structure of classroom dialogue? Instead of these qualitative methods, IES looked to the medical field and other areas of policy, such as workforce training, where randomization was used to assign subjects to a treatment group and a control group. Statistical summaries were then used to decide whether a difference of the magnitude observed was big enough that it was unlikely to occur without a real effect.
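As a minimal sketch of that logic (with entirely simulated scores, not data from any real study), a randomized comparison reduces to asking whether the observed group difference is larger than chance alone would plausibly produce:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated illustration: 100 students randomly assigned to a
# treatment or a control condition; outcomes are test scores.
control = rng.normal(70, 10, size=50)    # control-group scores
treatment = rng.normal(75, 10, size=50)  # treatment-group scores

# A two-sample t-test summarizes whether a difference of the observed
# magnitude is unlikely to have occurred without a real effect.
t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"difference = {treatment.mean() - control.mean():.2f}, p = {p_value:.4f}")
```

Random assignment is what licenses the causal reading of that p-value: on average, the two groups differ only in the treatment itself.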

Mimicking the medical field, IES set up the What Works Clearinghouse (WWC), which became the arbiter of acceptable research. The WWC focused on getting the research design right. Many studies would be disqualified, with very few, and sometimes just one, meeting the design requirements considered acceptable for valid causal evidence. This led many to the false idea that all a program, product, practice, or policy needed was one good study to prove its efficacy. The focus on design (or “internal validity”) was to the exclusion of generalizability (or “external validity”). Our team has conducted dozens of RCTs and appreciates the need for careful design. But we try to keep in mind the information that schools need.

Schools need to know whether the current version of the product will work when implemented using district resources and with its specific student and teacher populations. A single acceptable study may have been conducted a decade ago with ethnicities different from those of the district that is looking for evidence of what will work for it. The WWC has worked hard to be fair to each intervention, especially where multiple studies meet its standards. But the key is that each separate study comes with an average conclusion. While the WWC notes the demographic composition of the subjects in a study, differential results for subgroups, when the study tests them, are considered secondary.

The Every Student Succeeds Act (ESSA) was passed with bipartisan support in late 2015 and replaced NCLB. ESSA replaced the scientifically-based research that NCLB required with four evidence tiers. While this was an important advance, ESSA retained the WWC as the source of the definitions for the top two tiers. The WWC, which remains the arbiter of evidence validity, gives the following explanation of the purpose of the evidence tiers.

“Evidence requirements under the Every Student Succeeds Act (ESSA) are designed to ensure that states, districts, and schools can identify programs, practices, products, and policies that work across various populations.”

The key to ESSA’s failings is in the final clause “that work across various populations.” As a legacy of the NCLB-era, the WWC is only interested in the average impact across the various populations in the study. The problem is that district or school decision-makers need to know if the program will work in their schools given their specific population and resources.

The good news is that the education research community is recognizing that ignoring population, region, product, and implementation differences is no longer necessary. Strict rules were needed to make the paradigm shift to the RCT-oriented approach. But now that NCLB’s paradigm has been in place for almost two decades and generations of education researchers have been trained in it, we can begin to broaden the outlook. Mark Schneider, IES’s current director, defines the IES mission as being “in the business of identifying what works for whom under what conditions.” This framing is a move toward a broader focus with more relevant results. Researchers are looking at how to generalize results; for example, the Generalizer tool developed by Tipton and colleagues uses the demographics of a target district to generate an applicability metric. The Jefferson Education Exchange’s EdTech Genome Project has focused on implementation models as an important factor in efficacy.

Our methods move away from the legacy of the last 20 years to lower the cost of program evaluations, while retaining scientific rigor and avoiding biases that can give schools misleading information. Lowering cost makes it feasible for thousands of small local studies to be conducted on the multitude of school products. Instead of one or a small handful of studies for each product, we can use dozens of small studies encompassing a variety of contexts so that we can determine for whom and under what conditions the product works.

In the next section we explain how the new education law, while introducing flexibility, can at the same time raise issues about biases, which can mislead school decision-makers.

ESSA’s Evidence Tiers and Potential for Bias

The tiers of evidence defined in the Every Student Succeeds Act (ESSA) give schools and researchers greater flexibility, but are not without controversy. The ESSA law, for example, states that studies must statistically control for “selection bias”, recognizing that teachers who “select” to use a program may have other characteristics that give them an advantage and the results for those teachers could be biased upward. As we trace the problem of bias it is useful to go back to the interpretation of ESSA that originated with the NCLB-era approach to research.

When we helped develop the research guidelines for the Software & Information Industry Association, we took a close look at ESSA and how it is often interpreted. Now, as research is evolving with cloud-based educational products that automatically report usage data, it is important to clarify both ESSA’s useful advances and how the four tiers fail to address a critical scientific concept needed for schools to make use of research.

We’ve written elsewhere how the ESSA tiers of evidence form a developmental scale. The four tiers give educators as well as developers of educational materials and products an easier way to start examining effectiveness without making the commitment to the type of scientifically-based research that NCLB once required.

We think of the four tiers of evidence defined in ESSA as a pyramid as shown in this figure.

1. RCT. At the apex is Tier 1, defined by ESSA as a randomized controlled trial (RCT), considered the gold standard in the NCLB era.

2. Matched Comparison or “quasi-experiments”. With Tier 2, the WWC also allowed for a less rigorous experimental research design: matched comparisons or quasi-experiments (QEs), where schools, teachers, and students (the experimental units) independently chose to engage in the program. QEs are permitted but accepted “with reservations” because without random assignment there is the possibility of “selection bias.” For example, teachers who do well at preparing kids for tests might be more likely to participate in a new program than teachers who don’t excel at test preparation. With an RCT we can expect that such positive traits are equally distributed between users and non-users in the experiment.

3. Correlational. Tier 3 is an important and useful addition to the evidence hierarchy, as a weaker but readily achieved method once the developer has a product running in schools. At that point, the developer has an opportunity to see whether critical elements of the program correlate with outcomes of interest. With controls for “selection bias,” as ESSA calls for, this provides promising evidence. It is useful both for improving the product and for giving schools some indication that it is helping. Such evidence suggests that it might be worthwhile to follow up with a Tier 2 study for more definitive results.

4. Rationale. The base level or Tier 4 is the expectation that any product should have a rationale based on learning science for why it is likely to work. Schools will want this basic rationale for why a program should work before trying it out. Our colleagues at Digital Promise have announced a service in which developers are certified as meeting Tier 4 standards.

Each successive tier of evidence (from Tier 4 up to Tier 1) improves what’s considered the “rigor” of the research design. It is important to understand that this hierarchy has nothing to do with whether the results can be generalized from the setting of the study to the district where the decision-maker resides.

While the NCLB-era focus on strong design puts the emphasis on the Tier 1 RCT, we see Tiers 2 and 3 as an opportunity for lower-cost, faster-turnaround “rapid-cycle evaluations” (RCEs). Tier 1 RCTs have given education research a well-deserved reputation as slow and expensive. It can take one to two years to complete an RCT, with additional time needed for data collection, analysis, reporting, and review. This extensive work also includes recruiting districts that are willing to participate in the RCT, and it often puts the cost of the study in the millions of dollars. We have conducted dozens of RCTs following the NCLB-era rules, but we advocate less expensive studies in order to get the volume of evidence schools need. In contrast to an RCT, an RCE that uses existing data from a school system can be both more efficient and far less expensive.

There is some controversy about whether schools should use lower-tier evidence, which might be subject to “selection bias.” Randomized controlled trials are protected from selection bias because users and non-users are assigned randomly, whether they like it or not. It is well known, and has recently been pointed out by Robert Slavin, that using a matched comparison, a study in which teachers chose to participate in the pilot of a product, can leave unmeasured variables, technically “confounders,” that affect outcomes. These variables may be associated with the qualities that motivate a teacher to pursue pilot studies and with their desire to excel in teaching. The comparison group may lack the characteristics that help the self-selected program users succeed. Studies in Tiers 2 and 3 will always have, by definition, unmeasured variables that may act as confounders.

While this is obviously a concern, there are ways that researchers can statistically control for the effects of important characteristics associated with selection to use a program. For example, a teacher’s motivation to use edtech products can be controlled for by collecting information from the prior year on the amount of usage, by the teacher and their students, of a full set of products. Past studies examining the conditions under which the results of RCTs and of matched comparison studies evaluating the same program correspond have established that it is exactly “focal” variables, such as motivation, that are the influential confounders. Controlling for a teacher’s demonstrated motivation and students’ incoming achievement may go very far in adjusting away bias. We suggest this in a design memo for a study now being undertaken. This statistical control meets the ESSA requirement for Tiers 2 and 3.
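A hedged sketch of this kind of statistical control follows. The data, covariates, and effect sizes are simulated assumptions for demonstration, not taken from any study mentioned here; the point is only that regression adjustment with a prior-usage covariate can remove much of the bias a naive comparison of self-selected groups would carry:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 400

# Simulated setting: teachers with higher prior-year edtech usage
# (a proxy for motivation) are more likely to select into the program
# AND tend to produce better outcomes, creating selection bias.
prior_usage = rng.normal(0, 1, n)   # prior-year usage (standardized)
pretest = rng.normal(0, 1, n)       # students' incoming achievement
uses_program = (prior_usage + rng.normal(0, 1, n) > 0).astype(float)

true_effect = 2.0
outcome = (50 + true_effect * uses_program
           + 3 * prior_usage + 4 * pretest + rng.normal(0, 2, n))

# A naive comparison of group means is biased upward by self-selection.
naive = outcome[uses_program == 1].mean() - outcome[uses_program == 0].mean()

# OLS with the selection-related covariates adjusts much of that bias away.
X = np.column_stack([np.ones(n), uses_program, prior_usage, pretest])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
adjusted = beta[1]

print(f"naive estimate:    {naive:.2f}")
print(f"adjusted estimate: {adjusted:.2f}")
```

In this simulation the adjusted estimate lands much closer to the true effect than the naive difference of means, which is exactly the ESSA requirement to “statistically control” for selection at work.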

We have a more radical proposal for controlling important sources of bias, which we address in the next section, where we show that the biases affecting NCLB-era average impact results are not necessarily inherited by the differential subgroup results that schools need to know about.

Validating Research that Helps Reduce Achievement Gaps

There’s a joke among researchers: “When a study is conducted in Brooklyn and in San Diego the average results should apply to Wichita.”

Following the NCLB-era rules, researchers usually put all their resources for a study into the primary result, providing a report on the average effectiveness of a product or tool across all populations. This leads to disregarding subgroup differences and misses the opportunity to discover that the program studied may work better or worse for certain populations. A particularly strong example comes from our own work, where this philosophy led to a misleading conclusion. While the program we studied was celebrated as working on average, it turned out not to help the Black students; it widened an existing achievement gap. Our point is that differential impacts are not just extras; they should be the essential results of school research.

In many cases we find that a program works well for some students and not so much for others. In all cases the question is whether the program increases or decreases an existing gap. Researchers call this differential impact an interaction between the characteristics of the people in the study and the program, or they call it a moderated impact, as in: the program effect is moderated by the characteristic. If the goal is to narrow an achievement gap, the difference in impact between subgroups (for example, English language learners, kids in free lunch programs, or girls versus boys) provides the most useful information.
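A moderated impact of this kind is typically estimated with a treatment-by-subgroup interaction term in a regression. The sketch below is a simulated illustration (the subgroup, sample size, and effect sizes are assumptions for demonstration only):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 600

# Simulated setting: the program helps English language learners (ELL)
# less than non-ELL students (effect = 4 for non-ELL, 1 for ELL).
ell = rng.integers(0, 2, n).astype(float)      # subgroup indicator
treated = rng.integers(0, 2, n).astype(float)  # random assignment
outcome = (50 + 4 * treated - 3 * treated * ell
           + 2 * ell + rng.normal(0, 3, n))

# OLS with a treatment-by-subgroup interaction; the interaction
# coefficient is the differential impact (the "moderator effect").
X = np.column_stack([np.ones(n), treated, ell, treated * ell])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)

print(f"effect for non-ELL students: {beta[1]:.2f}")
print(f"differential impact (interaction): {beta[3]:.2f}")
```

A negative interaction coefficient here is the statistical signature of a widening gap; a positive one would indicate the program narrows it.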

Examining differential impacts across subgroups also turns out to be less subject to the kinds of bias that have concerned NCLB-era researchers. In a recent article, Andrew Jaciw showed that with matched comparison designs, estimation of the differential effect of a program on contrasting subgroups of individuals can be less susceptible to bias than estimation of the average effect. He develops the idea that when evaluating differential effects using matched comparison studies that involve cross-site comparisons, standard selection bias is “differenced away,” or negated. While a different form of bias may be introduced, he shows empirically that it is considerably smaller. This is a compelling and surprising result, and it speaks to the importance of making moderator effects a much greater part of impact evaluations. Jaciw finds that the differential effects for subgroups do not necessarily inherit the biases found in average effects, which were the core focus of the NCLB era.

Therefore, matched comparison studies may be less biased than one might think for certain important quantities. On the other hand, RCTs, which are often believed to be without bias (hence “gold standard”), may be biased in ways that are often overlooked. For instance, results from RCTs may be limited by selection based on who chooses to participate. The teachers and schools who agree to be part of an RCT might bias results in favor of those more willing to take risks and try new things. In that case, the results wouldn’t generalize to less adventurous teachers and schools.

A general advantage that rapid-cycle evaluations (RCEs) in the form of matched comparison studies have over RCTs is that they can be conducted under more true-to-life circumstances. When existing data are used, outcomes reflect field implementations as they happened in real life. Such RCEs can be performed more quickly and at lower cost than RCTs. They can be used by school districts that have paid for a pilot implementation of a product and want to know in June whether the program should be expanded in September. The key to this kind of quick-turnaround, rapid-cycle evaluation is to use data from the just-completed school year rather than following the NCLB-era habit of identifying schools that have never implemented the program and assigning teachers as users and non-users before implementation begins. Tools such as the RCE Coach (now the Evidence to Insights Coach) and Evidentally’s Evidence Suite are being developed to support district-driven as well as developer-driven matched comparisons.

Commercial publishers have also come under criticism for potential bias. The Hechinger Report recently published an article entitled “Ed tech companies promise results, but their claims are often based on shoddy research.” Also from the Hechinger Report, Jill Barshay offered a critique entitled “The dark side of education research: widespread bias.” She cites a working paper, recently published in a well-regarded journal, by a Johns Hopkins team led by Betsy Wolf, who worked with reports of studies within the WWC database (all using either RCTs or matched comparisons). Wolf compared results from studies paid for by the program developer with results from studies paid for through independent sources (such as IES grants to research organizations). She found that effect sizes substantially favored developer-sponsored studies. The most likely explanation was that developers are more prone to avoid publishing unfavorable results. This is called a “developer effect.” While we don’t doubt that Wolf found a real difference, the interpretation and importance of the bias can be questioned.

First, while more selective reporting by developers may bias their results upward, other biases may lead to smaller effects in independently funded research. Following NCLB-era rules, independently funded researchers must convince school districts to use the new materials or programs being studied. But many developer-driven studies are conducted where the materials or program being studied is already in use (or is being considered for use and has been established as a good fit for adoption). For independently recruited sites, the lack of inherent interest in the program may bias the overall average effect estimate downward.

When a developer is budgeting for an evaluation, working in a district that has already invested in the program and succeeded in its implementation is often the best way to provide the information that other districts need, since it shows not just the outcomes but a case study of a district-initiated implementation. Results from a district that has chosen a program for a pilot and succeeded in its implementation may not constitute an unfair bias. While developer-selected districts with successful pilots may score higher than districts recruited by an independently funded researcher, they are also more likely to have commonalities with districts interested in adopting the program. Recruiting schools with no experience with the program may bias the results to be lower than they should be.

Second, the fact that bias was found in the standard NCLB-era average difference between the user and non-user groups provides another reason to drop the primacy of the overall average and put our focus on subgroup moderator analysis, where there may be less bias. An average outcome across all populations has little information value for school district decision-makers. Moderator effects are what they need if their goal is to reduce rather than enlarge an achievement gap.

We have no reason to assume that the information school decision-makers need has inherited the same biases that have been demonstrated in the use of developer-driven Tier 2 and 3 studies in evaluation of programs. We do see that the NCLB-era habit of ignoring subgroup differences reinforces the status quo and hides achievement gaps in K-12 schools.

In the final section, we address the second major recommendation. If evaluation research follows the recommendation in this section and produces moderator results as the primary goal, we will need enough studies to generalize about the moderators. We will need ways to put many small studies together.

Putting Many Small Studies Together

The NCLB era of the single big study should give way to analysis of the differential impacts for subgroups from multiple studies. This is the information that schools need in order to reduce achievement gaps. Today’s technology landscape is ready for this major shift in the research paradigm. The school shutdowns resulting from the COVID-19 pandemic have demonstrated that the value of edtech products goes beyond the cost reduction of eliminating expensive print materials. Over the last decade, digital learning products have collected usage data that provide rich and systematic evidence of how products are being used and by whom. At the same time, schools have accumulated huge databases of digital records on demographics and achievement history, with public data at a granularity down to the grade level. Using today’s “big data” analytics, this wealth of information can be put to work for a radical reduction in the cost of showing efficacy.

Fast-turnaround, low-cost research studies will enable hundreds of studies to be conducted, providing information to school decision-makers that answers their questions. Their question is not just “which program, on average, produces the largest effect?” Their questions are “which program is most likely to work in my district, with my kids and teachers, and with my available resources, and which is most likely to reduce the gaps of greatest concern?”

Meta-analysis is a method for combining multiple studies to increase generalizability (Shadish, Cook, & Campbell, 2002). With meta-analysis, we can test for the stability of effects across sites and synthesize those results, where warranted, based on specific statistical criteria. While moderator analysis was considered merely exploratory in the NCLB era, with meta-analysis the moderator results from multiple small studies can, in combination, provide confirmation of a differential impact. Meta-analysis, and other approaches to research synthesis, combined with big data present new opportunities to move beyond the NCLB-era philosophy that prizes the single big study as proof of the efficacy of a program.
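A minimal sketch of the synthesis step follows. The six study estimates and standard errors are invented for illustration; the technique shown, inverse-variance (fixed-effect) pooling, is the simplest form of meta-analysis for combining a moderator effect across small studies:

```python
import numpy as np

# Invented example data: each pair is a small study's estimated
# differential impact (moderator effect) and its standard error.
studies = [(-0.30, 0.15), (-0.10, 0.20), (-0.25, 0.12),
           (-0.40, 0.25), (-0.15, 0.18), (-0.22, 0.10)]

est = np.array([e for e, _ in studies])
se = np.array([s for _, s in studies])
w = 1.0 / se**2                       # inverse-variance weights

pooled = np.sum(w * est) / np.sum(w)  # pooled differential impact
pooled_se = np.sqrt(1.0 / np.sum(w))  # SE of the pooled estimate
z = pooled / pooled_se                # combined evidence for the moderator

print(f"pooled estimate = {pooled:.3f} (SE {pooled_se:.3f}), z = {z:.2f}")
```

No single study here is individually conclusive, but the pooled standard error is smaller than any one study’s, which is how many small studies together can confirm a differential impact that each alone could only suggest. (A random-effects model would be the natural extension when effects genuinely vary across sites.)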

While addressing WWC and ESSA standards, we caution that a single study in one school district, or even several studies in several school districts, may not provide enough useful information to generalize to other school districts. For research to be most effective, we need studies in enough districts to represent the full diversity of relevant populations. Studies need to systematically include moderator analysis as an effective way to generalize impacts for subgroups.

The definitions provided in ESSA do not address how much information is needed to generalize from a particular study to a decision about implementation in other school districts. While we accept that well-designed Tier 2 or 3 studies are necessary to establish an appropriate level of rigor, we do not believe a single study is sufficient to declare that a program will be effective across varied populations. We note that the Standards for Excellence in Education Research (SEER), recently adopted by the IES, call for facilitating generalizability.

After almost two decades of exclusive focus on the design of the single study, we need to address achievement gaps more effectively with the specifics that school decision-makers need. Lowering the cost and turnaround time of research studies that break out subgroup results is entirely feasible. With enough studies qualified for meta-analysis, a new wealth of information will be available to educators who want to select the products that will best serve their students. This new system will democratize learning across the country, reducing inequities and raising student achievement in K-12 schools.

For comments on earlier versions of this article, we are grateful to Beth Tipton, Wendy Lewis, and Andrew Coulson. Address comments to dn@empiricaleducation.com.

Finding evidence of impact for K-12 products ☁ http://empiricaleducation.com