Improving CS assessment with careful (data) analysis

Benji Xie
Published in Bits and Behavior
11 min read · Mar 1, 2019

Yes, you can use data to improve assessments. But you have to be careful. In this post, I describe a psychometric process at a high level and then demonstrate how we applied it (with Item Response Theory) to improve the SCS1, a popular introductory CS1 assessment.

We can’t reasonably assume that students will always understand a new concept on the first pass. We can’t reasonably assume that the first version of learning materials will be effective. Likewise, we cannot assume that tests will effectively measure what we want to. It’s all about designing, evaluating, and iterating to get to better.

In our current day and age, test scores matter. A lot. We interpret test scores to determine grades and advanced placement. But perhaps more importantly, learners’ interpretations of their test scores can affect their sense of mastery and identity. So, it is important to ensure our interpretations of test scores are “good”: that they accurately reflect the knowledge the test was designed to measure for a target group of test-takers. By doing so, we can ensure test score interpretations are meaningful and improve upon our tests for future use.

But how do we evaluate the quality of our tests and their scores? One way to do so is by considering response data to identify patterns, drawing upon the educational field of psychometrics. At a high level, we can break this process down into 3 steps:

The psychometric process involves 1) understanding the test design to know how to interpret scores, 2) analyzing learner response data to flag potentially problematic questions, and 3) following up to understand whether the flagged questions need to be revised.
  1. Understand test design. By understanding what the test was intended to measure and for whom, we can better understand how to interpret the scores.
    We can answer questions such as: How are we supposed to interpret this test score? What assumptions, inferences, or decisions do we rely on to make this interpretation? Who is this test intended for? How will this score be used?
  2. Conduct statistical analyses. By measuring test response data for learners in the target population, we can check for patterns which may suggest problems with the test.
    We can answer questions such as: Which questions are too easy/hard for learners? Which questions do (not) differentiate between high and low performers? Which questions measure knowledge which is not related to the rest of the test? Do learners of diverse gender identities or ethnicities respond differently to a given question?
  3. Follow-up with domain experts. With the potential problems the statistical analyses flagged, we can consult with domain experts (students, test creators) to better understand problems and potentially revise the test.
    We can answer questions such as: Is there something in the wording or structure of the test which makes it confusing? Why do even high-performing learners consistently select a particular wrong answer for this question? Does the test require knowledge which is beyond the scope of the learning objectives?

The main takeaway is this: We can develop high-quality tests, but we must be skeptical of how we interpret test scores and iterative in design. We can use statistical methods to identify POTENTIALLY problematic test questions; we must conduct follow-up analysis on these questions to understand issues and improve our tests.

In the rest of this post, I’ll demonstrate how we applied the psychometric process to see how effectively the SCS1 assessment measured introductory computer science (“CS1”) knowledge. We emphasize our application of statistical methods related to Item Response Theory (IRT) as these results can be very insightful, but IRT is underutilized in computing education. This work was accepted to the research track of SIGCSE 2019 (PDF of pre-print) and will be presented on Friday afternoon at SIGCSE 2019.

Research Questions

For our analysis, we sought to answer the following research questions:

  1. Do all the SCS1 questions measure the same underlying construct (CS1 knowledge)?
    All the SCS1 questions should assess knowledge taught in an introductory computer science course. Unusual response patterns for some questions may suggest that those questions assess unrelated knowledge.
  2. How closely do the difficulty levels of the SCS1 questions align with the knowledge levels of our sample group?
    Alignment between the content of a test and the knowledge of learners is crucial in test score interpretations. If tests are too easy or too difficult, there will not be enough variation in scores to make them useful.
  3. For what levels of CS1 knowledge does the SCS1 measure well?
    There is a broad distribution of knowledge levels among novice programmers and only one SCS1, so the SCS1 will likely be most effective for novice programmers within a certain range of knowledge. Knowing this range will help ensure the SCS1 is an appropriate test for future learners.
  4. What do the response patterns of problematic questions reveal?
    Some questions may have unusual response patterns (e.g. a question which low-performers get correct but high-performers get wrong). Follow-up analyses with domain-experts will help us understand the cause of these unusual response patterns.

Test design: SCS1 designed to measure CS1 knowledge for novice programmers

The Second Computer Science 1 (SCS1) is an assessment which measures whether a novice programmer has accurate working knowledge of CS1 concepts. Its scores have been used by researchers to measure learner understanding of CS1 concepts and to evaluate the effectiveness of learning interventions. The SCS1 is a 27-question multiple-choice test administered online. Each question asks a test-taker to choose the best option out of five available.

“The tricky question.” A mock SCS1 question (which we made to reflect a real question). The correct answer is “B,” but most high-performing students selected “C.” Our analysis explores why that is the case.

Now that we have an idea of how we’re supposed to interpret these test scores, we’ll move onto the analysis of learner response data!

Data Analysis: Three question properties of interest

In test data, we want trends (where more knowledgeable learners perform better), but we also want variation. Neither a test where everyone scores near 0 nor a test where everyone scores 100% gives us much information about the differences in knowledge levels between students. When analyzing test response data, there are three question attributes to investigate to ensure good trends exist and variation is sufficient: difficulty, discrimination, and distractors. We summarize these attributes below.

Question attributes of interest: Understanding these three attributes will help us understand if questions are performing well.
  • Difficulty has an inverse relationship with correctness: a well-designed difficult question tends to require more knowledge to answer and is answered correctly less frequently. Questions with too low a difficulty (too easy) exhibit a ceiling effect, where almost all learners get them correct. Questions with too high a difficulty exhibit a floor effect, where almost no learners get them correct.
  • Discrimination refers to how well a question differentiates between learners who know the material and those who do not.
  • Distractors refer to the distribution of responses, with emphasis on which wrong answers were frequently or infrequently selected. For example, an incorrect response which is selected more frequently than the correct answer may be potentially problematic.

Now, we analyze response data from 489 learners in our target population of novice programmers.

Data Analysis Pt 1: CTT as a “first pass”

Just as we collect descriptive statistics before conducting more complex analyses, it’s often helpful to conduct Classical Test Theory (CTT) analysis as a first pass at understanding response data. To some extent, CTT is done by almost all teachers and educators! CTT is about considering results in aggregate, often in relation to a total test score. So if you’ve ever calculated the total score on a test, you’ve done CTT! Now we’ll consider the three question attributes mentioned earlier: difficulty, discrimination, and distractors.

Difficulty for a question (in CTT) is calculated as the proportion of respondents who answered the question correctly. We show the difficulty for each question below.
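To make this concrete, here is a minimal sketch (in Python, with made-up data rather than our actual analysis scripts) of computing CTT difficulty from a matrix of scored responses:

```python
import numpy as np

# Toy scored responses: rows = learners, columns = questions,
# 1 = correct, 0 = incorrect. Real SCS1 data would be a 489 x 27 matrix.
responses = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
])

# CTT difficulty (sometimes called the item "p-value"):
# the proportion of learners who answered each question correctly.
# Lower values mean harder questions.
difficulty = responses.mean(axis=0)
print(difficulty)  # e.g. [0.75 0.5 0.75 0.] -- the last question shows a floor effect
```

In practice, you would load your full response matrix instead of the toy one here.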

A tricky thing about CTT’s version of difficulty is determining whether a question is too hard (or too easy) for a target population of learners. This is hard to determine because CTT uses the total test score to represent learners’ knowledge levels. So a shortcoming of CTT is that there is no way to separate learners’ knowledge levels from question difficulty.

Question difficulty as measured by the percentage of learners who got each question correct. The SCS1 is a difficult test with all questions having <50% correctness. Given that this is a multiple-choice test with 5 options, questions with < 20% correctness (as indicated by pink boxes) may be problematic.

The next thing we’ll consider is discrimination. Again, discrimination reflects how well a question distinguishes between learners of different knowledge levels. As a rule of thumb, greater discrimination is typically desirable and low discrimination is troublesome. We show the discrimination for each SCS1 question below:

Question discrimination as measured by point-biserial correlation (relationship between question and overall test performance). Higher correlation is better, with a point-biserial correlation of <0.2 (blue line) typically being deemed as problematic. Q5 and Q20 have both poor discrimination (blue circle) and high difficulty (pink squares, carry over from the previous figure).
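For reference, here is a similarly hedged sketch of computing point-biserial discrimination. Correlating each question against the rest score (total score minus that question) is a common refinement to avoid a question correlating with itself, not necessarily the exact procedure we used:

```python
import numpy as np
from scipy.stats import pointbiserialr

# Toy scored responses (1 = correct, 0 = incorrect), as before.
responses = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
])
total_scores = responses.sum(axis=1)

for q in range(responses.shape[1]):
    # Correlate each question with the "rest score" (total minus that question).
    rest_score = total_scores - responses[:, q]
    r, _ = pointbiserialr(responses[:, q], rest_score)
    print(f"Q{q + 1}: discrimination = {r:.2f}")
```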

There are two questions (Q5, Q20) which have potentially problematic difficulty and discrimination. Q20 is actually “the tricky question” I showed you at the beginning of this post. So let’s look at the distractors for the tricky question, as shown in the figure below.

Distractors for “the Tricky Question” (Q20, shown above). When we aggregate learner responses, we see that options C and D are selected more frequently than the correct response (B). This suggests that we may need to review this question’s response options.

We see that there are multiple distractors which are chosen more frequently than the correct answer. This could imply many things, ranging from the question assessing knowledge that learners are missing to the answer key containing an error.
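A quick way to surface this kind of problem is to tabulate the raw answer choices for a question and flag it when any distractor beats the keyed answer. A sketch with made-up choices (the option letters and answer key here are hypothetical):

```python
import pandas as pd

# Hypothetical raw answer choices (A-E) for one question, one entry per learner.
choices = pd.Series(list("CBDCCEBDCDCEBDCCCDB"))
answer_key = "B"  # hypothetical keyed answer

# Proportion of learners selecting each option.
distribution = choices.value_counts(normalize=True).sort_index()
print(distribution)

# Flag the question if any distractor is chosen more often than the keyed answer.
if (distribution.drop(answer_key) > distribution[answer_key]).any():
    print("Potential problem: a distractor is more popular than the keyed answer.")
```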

What would be more insightful than seeing the distribution of distractors in aggregate is to compare the distribution of high performers to the distribution of low performers. While CTT looks at learners in aggregate, item response theory enables us to disaggregate learners and compare response patterns of learners with different knowledge levels.

Data Analysis Pt 2: IRT to model question & learner properties

Whereas CTT confounds learner and test properties, IRT models learner properties (e.g. knowledge level) and test properties (e.g. question difficulty) separately. A fundamental aspect of IRT is that question difficulty and learner knowledge level lie on the same continuum. We show this in the figure below, where a hypothetical learner (in blue) would likely get questions A and B correct because their knowledge level is greater than the difficulties of those questions. They would likely get question C incorrect because their knowledge level is lower than the difficulty of question C.

Representation of latent variable continuum with 3 questions (A,B,C). Because this learner’s knowledge level is greater than the difficulties of A and B, we would predict they get those questions correct. Their level is lower than C’s difficulty, so we predict they get that question wrong. Zero reflects the knowledge level for the average test-taker, where learners are often assumed to be distributed normally. (sorry…)
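To make the continuum idea concrete, here is a sketch of a two-parameter logistic (2PL) item response function, one common IRT model (our paper details the specific model we fit). It predicts the probability of a correct answer from the gap between a learner’s knowledge level and a question’s difficulty:

```python
import numpy as np

def p_correct(theta, difficulty, discrimination=1.0):
    """2PL item response function: probability that a learner with knowledge
    level `theta` answers a question with these parameters correctly."""
    return 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))

# Hypothetical questions A, B, C from the figure: A and B sit below the
# learner's knowledge level, C sits above it (0 = the average test-taker).
theta = 0.5
for name, b in [("A", -1.5), ("B", -0.5), ("C", 2.0)]:
    print(name, round(p_correct(theta, b, discrimination=1.5), 2))
# A and B come out well above 0.5 (likely correct); C well below (likely incorrect).
```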

A “good” question should discriminate (differentiate) between learners of different knowledge levels. The figure below shows a nice steep “S” curve, indicating high discrimination. This is desirable because a learner with a lower knowledge level (e.g. the outlined orange learner) will likely get the question wrong, while a learner with a higher knowledge level (e.g. the solid green learner) will likely get it correct.

Item Characteristics Curve (ICC) for a well-performing question from the SCS1 (Q19). This question has a reasonable difficulty level (0.68) and a high discrimination (steep S curve).

In stark contrast, poor questions tend not to discriminate between learners. This is visualized in the figure below as very flat curves. This is not ideal because the probability of selecting a correct answer does not change much between learners of different knowledge levels.

ICC for poorly performing questions in the SCS1. The questions have poor discrimination, as indicated by the very flat curves. This is not ideal because the probability of getting a question correct does not change much for learners of varying knowledge levels.
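Numerically, the difference between a steep and a flat ICC comes down to the discrimination parameter. A small sketch with hypothetical parameter values (same 2PL formula as above):

```python
import numpy as np

# Same 2PL formula as the earlier sketch, contrasting a steep
# (high-discrimination) curve with a flat (low-discrimination) one.
thetas = np.array([-2.0, 0.0, 2.0])  # low, average, and high knowledge levels

steep = 1.0 / (1.0 + np.exp(-2.5 * (thetas - 0.0)))  # discrimination a = 2.5
flat = 1.0 / (1.0 + np.exp(-0.2 * (thetas - 0.0)))   # discrimination a = 0.2

print("steep:", steep.round(2))  # ~[0.01 0.5 0.99] -- strongly separates learners
print("flat: ", flat.round(2))   # ~[0.4  0.5 0.6 ] -- barely separates learners
```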

Modeling learners’ knowledge levels independently of question properties also helps us shed light on distractor patterns, because we can analyze how often each wrong answer is chosen by learners at varying levels of knowledge. As a rule of thumb, we want learners with low knowledge levels to select distractors (wrong answers). As knowledge level increases, we want the likelihood of selecting the correct answer to increase and eventually become the most likely option selected. In the figure below, the good question on the left reflects this pattern. In stark contrast, the poor question on the right is unusual because learners of all knowledge levels are more likely to select a distractor than the correct answer. Furthermore, as knowledge level increases, the likelihood of selecting the correct answer (B) decreases. This is typically a bad thing.

Distractor patterns for a good question (left) and poor question (right). For the good question (Q19 in SCS1), learners with lower knowledge levels will likely select certain distractors, or wrong answers (A, B). As learners’ knowledge levels increase, they are more likely to select the correct answer (C). In contrast, the poor question (Q20) performs poorly because learners of all knowledge levels are more likely to select a distractor than the correct answer. Furthermore, as knowledge levels increase, the likelihood of selecting the correct answer DECREASES, which is not good.
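If you want to approximate this disaggregated view without a full IRT distractor model, one rough approach is to bin learners by an estimated knowledge level (or even total score) and tabulate choice proportions within each bin. A sketch with simulated data (the column names and values here are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 489  # same sample size as our data, but the values below are simulated

# Hypothetical data: an estimated knowledge level per learner (an IRT theta
# estimate, or even just a total score) and their raw choice on one question.
df = pd.DataFrame({
    "knowledge": rng.normal(size=n),
    "choice": rng.choice(list("ABCDE"), size=n),
})

# Bin learners into knowledge quartiles, then tabulate choice proportions per bin.
df["bin"] = pd.qcut(df["knowledge"], 4, labels=["Q1 (low)", "Q2", "Q3", "Q4 (high)"])
table = pd.crosstab(df["bin"], df["choice"], normalize="index").round(2)
print(table)  # rows: knowledge quartiles; columns: proportion selecting each option
```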

So data analysis helped us identify potentially problematic questions. We know the response patterns for these questions are odd, but we don’t know why. And we need to understand why before we can improve our assessment! Follow-up analyses are all about helping us understand why questions have odd response patterns so that we can decide what to do.

Follow-up: Expert review of problematic items

Follow-up analyses are all about contextualizing and verifying the findings from the data analysis. We must understand the test design, look at the questions, consider the learners, and be very skeptical. A great start for follow-up analyses is expert review: sharing your data analysis with a domain expert (e.g. a computer science educator) and hypothesizing potential explanations for the odd response patterns.

For “the tricky question,” we know from CTT that the question is very difficult and does not discriminate between learners of varying knowledge levels well. From IRT, we identified that low-performing learners tend to select incorrect option C and high-performers tend to select incorrect option E. Why is that?

A typical suspect is some misunderstanding caused by confusing wording in the question. The figure below shows “the tricky question.” We see that some of the wording of the question (underlined in gold) may be confusing or unfamiliar to learners. Or perhaps the wording of a response option (e.g. A or E) confused learners.

Another suspect is a potential misalignment between the knowledge the test assesses and the content covered in class. Survey data from learners revealed that most learners in our sample took a data programming version of CS1 which did not emphasize function scope. So perhaps this question assesses knowledge that learners did not learn. This reveals a key tension: there may be a misalignment between the knowledge learned and what the test measures. We often design standardized tests but CS courses are very diverse and teach different topics, perhaps introducing this misalignment.

Follow-up analyses in the form of expert review. We found that low-performers and high-performers were likely to select different incorrect options. This may be because of a misunderstanding in the question prompt (underlined in yellow) or in the answer options (e.g. wording of option E). Or this may be because the knowledge this question assessed (function scope) was not covered in the course learners took.

So we were able to rely on domain experts to generate a few hypotheses as to why a question was problematic. To determine which hypothesis is true, it’s typical to conduct think-alouds or cognitive walkthroughs with learners in the target population. By understanding learners’ thought processes, we can get more evidence as to why a question is problematic. And with that evidence, we can decide how to change, iterate, and improve our tests!

Conclusion: More rigorous evaluation of instruments in computing education

We live in a world of reductionism. CS students often receive letter grades which are intended to reflect their mastery of certain knowledge. And test scores make up a large part of those grades.

We also live in a world of beautiful diversity. Initiatives such as CS4All are working for more inclusive learning experiences. So as we welcome more diversity into computing, we must work hard to ensure our measurements of what students know measure what we intend them to and do not bias against certain groups. Psychometrics provides frameworks and methodologies to evaluate and ensure validity and reliability in how we interpret our measurements, in effect ensuring instruments are considerate of the growing diversity of learners. So let’s iterate to better!

Our full paper contains a lot of rich, thorough detail on how we conducted the IRT analysis, including confirmatory factor analysis to verify the questions measured the same latent construct (CS1 knowledge). We also describe and differentiate between CTT and IRT to identify the merits of IRT. I’ve also linked slides from my SIGCSE 2019 talk and supplementary resources. And of course, please reach out to me if you would like some ideas on how to use evidence to improve your tests and how you interpret your test scores!
