Demystifying California State Tests. Part 1:
The Sky Is Not Falling

Tracy in LAUSDLand
12 min read · Feb 10, 2023

--

J. M. Urrutia, Ph.D.

February 8, 2023

Photo (ORHA)

On Friday, September 9, six weeks before the California Department of Education (CDE) released the results for the 2022 administration of the Smarter Balanced tests (also known as the SBAC tests, as they were initially developed by the Smarter Balanced Assessment Consortium), LAUSD’s Superintendent Alberto Carvalho announced that 58% of the District’s students had not met the standards in English while 78% had not met the math standards. The L.A. Times interpreted this news as “deep setbacks for a majority of Los Angeles schoolchildren who were already far behind,” presumably because the percentage of LAUSD students deemed proficient has never been above 50% (44.1% in 2019, 42.31% in 2018, 39.96% in 2017, 39% in 2016, and 33% in 2015).

Superintendent Carvalho was further quoted:

The pandemic deeply impacted the performance of our students …particularly kids who were at risk, in a fragile condition, prior to the pandemic, as we expected, were the ones who have lost the most ground.

The Times article follows this quote with “Carvalho … wants to make up two years of lost ground this year and make up five years of reverses over the next two years.” Clearly, Superintendent Carvalho (and perhaps the L.A. Times) believes a low proficiency percentage constitutes a reversal in academic achievement.

While the L.A. Times headlines and Superintendent Carvalho’s quotes are good attention grabbers, this “sky is falling” narrative ignores important facts buried in the SBAC tests. As noted in a previous article, “Proficiency for All,” during a town hall in Van Nuys on May 2, Superintendent Carvalho stated his belief that the level of proficiency demonstrated on state tests by LAUSD students is not up to par with his expectations. As that article explains, the data already predict he will never succeed in his goal of 100 percent of students being proficient because the SBAC test design prevents it. Six months later, Superintendent Carvalho continues the same narrative.

Instead of a superintendent (and he is not the first) conferring a mystical power on the SBAC scores and subjecting four hundred thousand LAUSD students to them, it is time to ditch the narrative and demystify the tests. There are simple facts about these tests that should make us all question why we are using SBAC results to set LAUSD’s goals.

As shown in the previous article, “meeting the standards,” also known as being “proficient,” has hovered around the 50% level for all the years state tests have been administered to California students. To present this observation in a more informative way, Fig. 1 displays the four performance bands reported by the CDE for all the years of SBAC administrations at three levels: the entire state, Los Angeles County, and LAUSD (note that all charter schools, both independent and affiliated, are included because LAUSD is their charter authority). Please note that no test was administered in the 2019–20 school year due to COVID. Also note that participation in the 2021 administration was seriously reduced: the share of students with valid scores was roughly 20% at the state and county levels and as low as 10% for LAUSD. Therefore, the CDE does not advise direct comparisons of the 2020–21 scores with prior years.

Figure 1: (a) Percent of students in the achievement bands for the administration of the English Language Arts CAASPP test for all California students for the years 2015 through 2022, (b) same but for all students in Los Angeles County, and (c) same but for all LAUSD students, including those attending charter schools, both affiliated and independent. The four achievement bands are “Standard Not Met,” “Standard Nearly Met,” “Standard Met,” and “Standard Exceeded.” The number above each column is the number of students with valid test scores.

It is important to note what these graphs actually represent: they are not the actual scores of students but groupings based on their level of performance on the test, as determined by four predefined achievement levels (“Standard Not Met,” “Standard Nearly Met,” “Standard Met,” and “Standard Exceeded”). Also, rather than presenting the actual number of students in each band, the counts are presented as a percentage of that year’s total. This makes it easier to compare the achievement bands across the years because the actual number of students taking the test varies from year to year.
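For readers who want to check the normalization, it is nothing more than dividing each band’s count by that year’s total of valid scores. A minimal sketch in Python, using made-up counts purely for illustration:

```python
# Hypothetical band counts for one test year (not real CDE data).
band_counts = {
    "Standard Not Met": 120_000,
    "Standard Nearly Met": 95_000,
    "Standard Met": 105_000,
    "Standard Exceeded": 80_000,
}

total = sum(band_counts.values())  # total students with valid scores that year

# Express each band as a percentage of the year's total, as in Fig. 1.
for band, count in band_counts.items():
    print(f"{band}: {100 * count / total:.1f}% of {total:,} valid scores")
```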

The public, unfortunately, has been told by the media that these groupings are directly related to whether or not a student is learning. In fact, EdSource, a respected digital publication, stated in its article discussing the 2022 results that:

In third grade English language arts, the first statewide measure of children’s ability to read, scores fell 6.5 percentage points to 42.2%, the lowest of any grade.

This is simply astonishing because it equates “not meeting the standard” with being illiterate, which it is not.

Then there is the highest performing band. What does it tell us? This band is labeled “Standard Exceeded,” implying that students in it are performing at a 4th grade level or higher. Examination of the data available on the CDE’s website reveals that 22.77% of 413,308 3rd graders are in that band. If so, are we to assume that more than 94,000 3rd graders have teachers who are teaching 4th grade material? Of course not. Teachers often barely have time to teach the standards of their own grade. Similarly, why assume “not meeting the standard” means that a child placed in that band is failing the 3rd grade? Or, as EdSource suggests, is unable to read? There are plenty of LAUSD parents who know their child is a solid reader, understands math, and gets 3s and 4s on report cards, yet according to the SBAC test the child is labeled “Standard Nearly Met” or even “Standard Not Met.” Those parents run to the teacher, who then patiently explains the student’s actual ability and advises them not to pay attention to the test.

This is the problem with what Superintendent Carvalho pretends exists: that the majority of Los Angeles schoolchildren are already far behind because they are not deemed proficient by the SBAC tests. If that were so, then half the students across the state have been below grade level every single year for as long as the tests have been administered, and half of LAUSD students should have been retained rather than advanced to the next grade, as suggested in a recent article in the L.A. Times. That article implies there may be a vast conspiracy by LAUSD teachers and their administrators across close to 800 schools to inflate classroom marks, which is preposterous.

What are the SBAC tests actually measuring?

According to the CDE, all students, except those “who participate in the alternate assessments” and certain English learners, must take the SBAC tests. The tests, labeled the Smarter Balanced Summative Assessments by the CDE, are

…delivered by computer, consist of two sections: a computer adaptive test and a performance task (PT) based on the Common Core State Standards (CCSS) for ELA and mathematics. The computer adaptive section includes a range of item types, such as selected response, constructed response, table, fill-in, graphing, and so forth. The PTs are extended activities that measure a student’s ability to integrate knowledge and skills across multiple standards — a key component of college and career readiness.

That may sound reasonable, except that the actual test processing should raise flags when compared to how previous generations of tests were taken. According to page 2 of the California Assessment of Student Performance and Progress Smarter Balanced 2019–2020 Technical Report,

The computer-adaptive portion of the test [N.B.: the CAT] is designed to present items of difficulty to match the ability of each student, as indicated by the responses the student provided to previous test items. By adapting to the student’s ability as the assessment is being taken, the CAT presents an individually tailored set of questions that is appropriate for each student. As a result, it provides more accurate scores for all students across the full range of the achievement continuum. Compared with a fixed-form assessment — that is, a test where all students are given the same questions, regardless of their responses or ability — a CAT requires fewer questions to obtain an equally precise estimate of a student’s ability.

Only a technically proficient person can decide whether the Artificial Intelligence of the test delivery system (TDS) is a better predictor of a student’s ability than a fixed-form test. While this method might be useful in discouraging a test taker from choosing random answers, it does prevent a student who might be stumped by a “simple” question from attempting “harder” questions, something that was possible when the tests were paper-and-pencil rather than on a screen. It also opens up many questions: what makes a question “simple” or “hard”? If the goal is to test a student’s mastery of the standards, and the standards are all reasonable, why are some questions “harder” to satisfy than others, and why are the questions presented by the TDS meant to start at “average”?
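To make the mechanics concrete, here is a deliberately simplified Python sketch of how an adaptive test might pick its next question. This is not Smarter Balanced’s actual algorithm, which relies on item response theory and a large calibrated item pool; it only illustrates the basic pattern of “harder after a correct answer, easier after a miss.”

```python
import random

# Toy item pool: each item gets a single difficulty rating from 1 (easy) to 10 (hard).
# A real CAT calibrates items with item response theory, not a single integer.
item_pool = [
    {"id": i, "difficulty": d}
    for i, d in enumerate(random.choices(range(1, 11), k=200))
]

def next_item(pool, target_difficulty, asked):
    """Pick the unused item whose difficulty is closest to the current target."""
    candidates = [it for it in pool if it["id"] not in asked]
    return min(candidates, key=lambda it: abs(it["difficulty"] - target_difficulty))

def run_toy_cat(answers_correctly, n_items=10):
    """answers_correctly: function(item) -> bool, standing in for the student."""
    target, asked, history = 5, set(), []  # start at "average" difficulty
    for _ in range(n_items):
        item = next_item(item_pool, target, asked)
        asked.add(item["id"])
        correct = answers_correctly(item)
        history.append((item["difficulty"], correct))
        # Crude adaptation: step difficulty up after a correct answer, down after a miss.
        target = min(10, target + 1) if correct else max(1, target - 1)
    return history

# Simulate a student who reliably answers items of difficulty 6 or below.
print(run_toy_cat(lambda item: item["difficulty"] <= 6))
```

Run with a simulated student who can only handle items up to difficulty 6, the loop tends to home in on questions near that level, which is the “individually tailored set of questions” the technical report describes.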

Imagine taking the DMV test with the questions getting harder based on your answer to a simple one. Are you allowed to make a right turn on a red light? Are you allowed to make a right turn on a red light when you hear a siren? Are you allowed to make a right turn on a red light when you hear a siren and you are in a hospital zone? It would be hard to fault a parent or teacher who wonders whether the test system is rigged against their child and their child’s peers.

Which raises an important question: who creates the test questions?

We get a hint of how these questions come about from page 2 of the above referenced technical report:

The CAT required a large pool of test questions statistically calibrated on a common scale to cover the ability range. For the Smarter Balanced Online Summative Assessments, the test question statistics were obtained mainly from the spring 2013–2014 field test. Each year, new items are field-tested and added to the Smarter Balanced item pools.

So a team of statistical researchers periodically comes up with a set of questions that are then test-driven, and the responses of the test takers define their level of difficulty? If so, these questions are not truly measuring mastery of the standards but are simply documenting how convoluted a question can be made to test what are, in fact, simple standards. Given this, the message prominently displayed on the main page of the Smarter Balanced website:

Developed by Teachers for Teachers

At the center of our work is a commitment to equity, accessibility, and ensuring that teachers are equipped with what they need to support students of diverse backgrounds and abilities for successful productive futures.

does not ring true. Perhaps teachers suggested the core content of the questions, but it is statisticians who shape them and ultimately decide whether to include them in the question pool in order to produce a particular distribution of responses, as discussed below.
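What does “statistically calibrated” mean in practice? Roughly this: give a candidate question to a field-test sample and turn the share of correct answers into a difficulty number. The sketch below is a toy version of that idea; Smarter Balanced actually uses full item response theory models, so treat this only as an illustration of how student responses, not the standards themselves, end up defining how “hard” a question is.

```python
import math

def toy_difficulty(responses):
    """Estimate an item's difficulty from field-test responses (1 = correct, 0 = wrong).

    Uses the logit of the proportion correct: items that fewer students answer
    correctly get a higher difficulty value. Real calibration (e.g., Rasch or
    2PL models) jointly estimates item and student parameters instead.
    """
    p = sum(responses) / len(responses)   # proportion of students answering correctly
    p = min(max(p, 0.01), 0.99)           # keep the logit finite
    return -math.log(p / (1 - p))         # 0 ~ average difficulty, > 0 ~ harder

# Hypothetical field-test results for two candidate questions.
easy_item = [1] * 80 + [0] * 20   # 80% of the sample answered correctly
hard_item = [1] * 30 + [0] * 70   # only 30% answered correctly
print(toy_difficulty(easy_item), toy_difficulty(hard_item))
```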

At this point it is fair to ask what the standards are. It is not the purpose of this writing to march through the standards and examine whether they are reasonable and age-appropriate. To this writer, examination of the California Common Core State Standards English Language Arts & Literacy in History/Social Studies, Science, and Technical Subjects is sufficient to determine that the standards seem reasonable, at least for the lower grades. However, it is important to note that the introduction of this document warns the reader, on page 5, about the design limitations of the standards:

1. The Standards define what all students are expected to know and be able to do, not how teachers should teach.

2. While the Standards focus on what is most essential, they do not describe all that can or should be taught.

3. The Standards do not define the nature of advanced work for students who meet the Standards prior to the end of high school.

4. The Standards set grade-specific standards but do not define the intervention methods or materials necessary to support students who are well below or well above grade-level expectations.

5. It is also beyond the scope of the Standards to define the full range of supports appropriate for English language learners and for students with special needs.

6. While the ELA and content area literacy components described herein are critical to college and career readiness, they do not define the whole of such readiness.

Despite all this, the above cited technical report states that:

The primary purpose of the CAASPP System of assessments is to assist teachers, administrators, and students and their parents/guardians by promoting high-quality teaching and learning through the use of a variety of item types and assessment approaches.

How can this happen when the standards are not meant to tell teachers how they should teach? If the standards do not define what “advanced work” is, why is one of the performance bands labeled “Standard Exceeded”? More importantly, if the tests are taken at the end of the year, how can they assist in promoting high-quality teaching and learning? At best, the tests serve as a warning that the student did not perform as the Artificial Intelligence behind the TDS expected. But does that prove the student failed to demonstrate sufficient mastery in the classroom and therefore should have been retained?

What is the result of all this effort? The simplest glimpse comes from plotting the data in the table titled “Frequency Distribution of Overall Scale Scores” included in the technical reports. Unfortunately, only the reports for 2015, 2016, and 2017 are publicly posted. The CDE has not publicly released the 2018 and 2019 reports because “the report has not completed the accessibility tagging process and therefore should not be widely shared.” Nevertheless, the reports for these years are available upon request. From these reports, the results for 3rd graders for all years are shown in Fig. 2. Why choose the 3rd graders? Because 3rd grade students have never taken a test like this in their lives. They are essentially a blank slate that allows us to see how nearly half a million California children respond to this grand statistical experiment.

Figure 2: (a) Distribution of scaled scores for 3rd grade students with valid scores for the 2015 through 2019 administrations of the ELA SBAC tests. (b) Percent of students in the defined achievement bands as calculated from the distribution for the 2019 administration of the ELA SBAC test.
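A plot like the one in Fig. 2a can be reproduced from the published frequency tables with a few lines of code. The file name and column names below are placeholders for however the table is transcribed, so adjust them to your own copy:

```python
import csv
import matplotlib.pyplot as plt

# Hypothetical CSV transcribed from the "Frequency Distribution of Overall
# Scale Scores" table: one row per scale-score bin, one column per year.
with open("ela_grade3_scale_score_frequencies.csv", newline="") as f:
    rows = list(csv.DictReader(f))

scores = [int(r["scale_score_bin"]) for r in rows]
for year in ("2015", "2016", "2017", "2018", "2019"):
    counts = [int(r[year]) for r in rows]
    total = sum(counts)
    plt.plot(scores, [100 * c / total for c in counts], label=year)

plt.xlabel("ELA scale score (grade 3)")
plt.ylabel("Percent of valid scores")
plt.legend()
plt.show()
```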

As can be observed, the distribution of scaled scores is not radically different from year to year. This observed lack of change runs counter to what is demanded by politicians and educrats (“we must increase our students’ proficiency in order to ensure they are ready for college and career”) and reported by the media. Worse, since the scaled scores are forced to approximate a bell curve, students are subjected to what is known in the business world as “stacked performance ranking.” As noted in a recent Bloomberg article reprinted by the L.A. Times, ranking employees in this manner forces a negative grading on half of the employees and “creates a dysfunctional, stressful, top-down work environment.” It is not surprising, therefore, that roughly half the students across the state are deemed “Standard Nearly Met” or “Standard Not Met” every single year that the SBAC has been administered.
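The arithmetic behind that last observation is easy to demonstrate. In the sketch below, every “year” of scores is drawn from essentially the same bell-shaped distribution and the same fixed cut points are applied; the band percentages barely move, year after year. The distribution parameters and cut points are invented for illustration and are not the actual SBAC values.

```python
import random

random.seed(1)

# Invented scale-score distribution and cut points, for illustration only;
# these are NOT the actual SBAC parameters.
MEAN, SD = 2425, 85
CUT_NEARLY, CUT_MET, CUT_EXCEEDED = 2380, 2432, 2490

def band(score):
    if score < CUT_NEARLY:
        return "Standard Not Met"
    if score < CUT_MET:
        return "Standard Nearly Met"
    if score < CUT_EXCEEDED:
        return "Standard Met"
    return "Standard Exceeded"

for year in range(2015, 2020):
    scores = [random.gauss(MEAN, SD) for _ in range(400_000)]
    counts = {"Standard Not Met": 0, "Standard Nearly Met": 0,
              "Standard Met": 0, "Standard Exceeded": 0}
    for s in scores:
        counts[band(s)] += 1
    pcts = {b: round(100 * n / len(scores), 1) for b, n in counts.items()}
    print(year, pcts)
```

As long as the cut points stay fixed and the underlying distribution stays roughly the same, the share of students below “Standard Met” cannot meaningfully change, no matter what happens in classrooms.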

Are politicians, educrats and the media aware that this is what is behind all those claims that the sky is falling?

While Fig. 2b is barely different from the one included in Fig. 1a, presenting the performance bins side by side shows that each bin contains roughly one-quarter of the entire population. Putting aside what was raised above about the impossibility that 3rd grade teachers across the state are teaching 4th grade materials, it seems that the cutoff points have been chosen without consideration of statistical concepts (for example, the cutoff points are nowhere near the standard deviation of the distribution).
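For the statistically minded: if a roughly normal distribution is simply sliced into four equal groups, the cut points land at about ±0.67 standard deviations from the mean, not at ±1 standard deviation. A quick check, assuming nothing about the actual SBAC scale:

```python
from statistics import NormalDist

# Where quartile cuts fall on a standard normal distribution (mean 0, sd 1).
for q in (0.25, 0.50, 0.75):
    z = NormalDist().inv_cdf(q)
    print(f"{int(q * 100)}th percentile sits at {z:+.3f} standard deviations")
# -> 25th at -0.674, 50th at +0.000, 75th at +0.674
```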

How were they chosen? According to the SBAC Consortium, this is how the scaled score cutoff points were determined:

Through a series of online and in-person activities, educators, parents, and community leaders helped ensure the assessments are based on fair and rigorous expectations for students. The process consisted of three phases:

1. An Online Panel allowed thousands of K-12 educators, higher education faculty, parents, and other interested parties to participate virtually in recommending achievement levels.

2. An In-Person Panel with educators and other stakeholders working in grade-level teams deliberated and made recommendations for the threshold scores of the four achievement levels.

3. The Cross-Grade Review Committee, a subset of the In-Person Panel, examined recommendations across all grades to consider the reasonableness of the system of cut scores.

In other words, the levels were selected by multiple committees, and there is no way to trace the origin of these cutoff points, which have not changed since 2013, much less identify their justification. Hence, there is plausible deniability all around, since no one group is responsible for the decision.

While the SBAC itself says the tests should be continuously validated, the mere fact that the distribution of scaled scores has not changed significantly is clear evidence that they have been left alone since their initial design, albeit with minor tinkering through new questions. Yet politicians, educrats, and the media insist on using the proficiency levels derived from the SBAC test results to judge whether schools are failing their students. Using the results this way does not help students master the standards since, by definition, the questions are matched to their abilities. Because this implicitly enshrines a hierarchy of abilities among students, there will never be an opportunity to achieve 100% proficiency, as the federal No Child Left Behind Act once demanded.

Because LAUSD is the largest school district in California, our leaders should be the loudest voices calling out the conflict in having a test defined by a set of questions of varying difficulty that ultimately herds the test taker into a particular “mastery” bin, particularly since this placement is based on responses to questions chosen by an Artificial Intelligence algorithm from a predefined pool.

So far, Superintendent Carvalho has chosen to prop up SBAC.

Yet, when we drill deeper into the tests in Demystifying California State Tests. Part 2: The Biases, we discover that the biases in the tests might not line up with the public’s assumptions.

It isn’t hard to see that the sky is not falling. The tests are simply failing our students.
