Reading Thousands of Python Solutions at Once

You teach programming, and you’re in the class computer lab, helping students. The big deadline is approaching, and it’s getting very late. A student asks for help. Why doesn’t his code pass all the required test cases?

You ask him to explain his logic to you. You do not recognize his approach to the problem. It could be a novel idea or fatally flawed.

You look at his solution and the test cases it fails. His code is not clearly written. You cannot tell why his code is returning the wrong answers.

This assignment has been around for years. You wonder if anyone else has ever successfully implemented this approach. If only you could read the thousands of previous student solutions that are sitting in your file system!

You start randomly printing out solutions and looking through them, such as these five examples, which are all supposed to iterative exponentiate the argument “base” to the argument “exp”:

def iterPower(base, exp):
'''
base: int or float.
exp: int >= 0

returns: int or float, base^exp
'''
# Your code here
res=1.
while exp>0:
exp-=1
res=res*base
return res
def iterPower(base, exp):
if exp == 0:
return 1
ans = base
while exp >1 :
ans *= base
print base
exp -= 1
print exp
print ans
return ans
def iterPower(base, exp):
'''
base: int or float.
exp: int >= 0

returns: int or float, base^exp
'''
# Your code here
result = 1
while exp > 0:
result *=base
exp -= 1
return result
def iterPower(base, exp):
'''
base: int or float.
exp: int >= 0

returns: int or float, base^exp
'''
ans = base
if (exp == 0):
return round(base/base, 4)
else:
for n in range(exp-1):
ans *= base
return round(ans, 4)
def iterPower(base, exp):
result = base
if exp==0:
return 1
if exp==1:
return result
while exp > 1:
result *= base
exp -= 1
return result

You cannot read thousands of solutions in a reasonable amount of time. So what would you want to read, to get you as close to the real thing as possible?

You find two solutions that are identical, except have different comments. You decide that you don’t need to read both of them. Then you write the first step of your analysis pipeline:

Analysis Pipeline Pseudocode
1 A. Remove comments from all solutions
B. Remove solutions that are now duplicates

You find another solution that is identical to the first, but with different formatting choices, such as extra white space within and between lines. You decide you do not need to read that one either. You add another step to your analysis pipeline.

Analysis Pipeline Pseudocode (cont'd)
2 A. Reformat all solutions, so they all have the same formatting
B. Remove solutions that are now duplicates

You find another solution that is identical to the first, but with different variable names. You decide that you do not need to read this one either. You pause.

Variable names are pretty critical to the readability of code. You don’t want to be stuck reading student code with variable names ‘temp1’, ‘temp2’, ‘temp3’… How do you pick which one of the two solutions, which only differ by variable names, to read?

You don’t pick. You let the students vote, and pray that the wisdom of the crowd solves it for you. And the students have already voted. They already submitted solutions with their own choices of variable names, independent of everyone else. You could figure out the most popular name given to each variable, and use those names for the variables in the solution you, the teacher, has to read.

How do you tell if two students voted for the same variable name? How do you tell that the variable ‘i’ in one solution is the “same” as the variable ‘x’ in another solution? What does the “same variable in two different solutions” even mean?

You could probably hand-label variables across student solutions as the “same” if they are playing the same “role,” e.g., indexing into the same array or accumulating the same result. How do you automate labeling variables as the “same” across student solutions?

You execute all solutions on a test case, like:

iterPower(5,3)

The variables in each solution take on a sequence of values. You can see this yourself, using the Online Python Tutor. As you step through each solution, you notice that the variables that play the same role in each program take on the same sequence of values during execution on this test case.

For example, the variable that takes on sequence

1, 5, 25, 125

shows up in the most solutions, 3081 out of 3842, to be exact. Its most common name is ‘result’ but other names students (from around the world) give it are ‘wynik’, ‘out’, ‘total’, ‘ans’, ‘acum’, ‘num’, and ‘mult’, to name a few.

This is a conservative notion of “same” in some cases, when two variables have the same role but differ by their first or last values. This is an overconfident notion of “same” when two boolean flags switch from “false” to “true” at different points in execution, capturing different information but still taking on the same sequence of values — “false” to “true”. But it works well in practice on the introductory Python programs you have, so you run with it, as imperfect but a good start.

You find all the unique sequences of values that variables in these thousands of student solutions take on. You call them “common variables.” You keep track of all the names students have ever given them. You determine the most popular one.

If two different common variables are both most commonly called ‘i’ then you let the common variable that shows up in the most student solutions be named ‘i’ and you name other, less popular common variable ‘i2’. Now every variable name corresponds to a distinct sequence of values during execution on the test case(s).

Then you add a new step to the pipeline:

Analysis Pipeline Pseudocode (cont'd)
3 A. Rename all variables in all solutions to their most popular names
B. Remove solutions that are now duplicates

These two solutions, which only differ by variable names:

def iterPower(mybase, exp):
res=1
while exp>0:
exp-=1
res=res*mybase
return res
def iterPower(base, exp):
result=1
while exp>0:
exp-=1
result=result*base
return result

…become identical.

Variable names have semantic value now, and there is a shared vocabulary across all solutions. The variable ‘x’ in one solution behaves the same way on test cases as the variable ‘x’ in another solution.

You have taken your stack of thousands of solutions, and reduced it by a large fraction. There are hundreds of solutions now, without comments or unusual formatting, and with common, semantically meaningful variable names that reflect most students’ variable name choices.

You start looking through this reduced set. Two solutions have the same set of lines, but the order of statements within the loop is swapped:

def iterPower(base, exp):
result=1
while exp>0:
exp-=1
result=result*base
return result
def iterPower(base, exp):
result=1
while exp>0:
result=result*base
exp-=1
return result

You recognize that changing the order of statements did not affect the values of the variables during execution, so it must not have mattered. You don’t need to read both. You add a final step to the analysis pipeline:

Analysis Pipeline Pseudocode (cont'd)
4 A. Find all the solutions that only differ by statement order
B. Keep the solution with the most popular statement order
C. Remove all the others

Now, when you read a solution from the collection of solutions that remain, it is not necessary a solution any particular student wrote, but it does represent the solutions of hundreds or thousands of students. You calculate how many students each solution represents. You read them, from “largest” to “smallest”, skimming because they all use the same set of variable names, in an interface that highlights the differences between solutions, as you read.

This is one way to read thousands of Python solutions at once.

A screenshot of OverCode

This represents the work of Rishabh Singh, Philip Guo, Jeremy Scott, Rob Miller, and myself, with additional help from Stacey Terman, who expanded OverCode to handle incorrect solutions (not discussed here). For more information, go to our project homepage or the github repo.

One clap, two clap, three clap, forty?

By clapping more or less, you can signal to us which stories really stand out.