My MIT PhD thesis in CS, told using the 8-point story arc

…found here in How to Structure A Story.

Stasis

This is the “every day life” in which the story is set. Think of Cinderella sweeping the ashes, Jack (of Beanstalk fame) living in poverty with his mum and a cow, or Harry Potter living with the Dursley’s.

I was hired as a teaching assistant in Computation Structures (6.004). This class is famous for its culminating semester-long project; every student must make an entire processor out of simulated logic gates. Much of my time was spent helping ~200 enrolled students debug their simulated circuits. This was fun, but it had several frustrating elements:

  1. The simulated circuits were created by students using a language that was invented by the lecturing professor and very difficult to read or skim.
  2. Since student solutions were only evaluated based on input-output behavior, they could create many different innovative (or inefficient) designs. Established optimizations, like ripple carry adders, were discussed in class, but the diversity of good and bad designs found within our own crowd of students’ solutions were never systematically examined. One student had the courage to publish his own innovative design on the course forum, and we teaching assistants spent much of the rest of the week helping everyone else understand and implement it. How many other students came up with a cool idea but were afraid to share it, or did not even realize that it was new and different? Some students had no idea whether their solutions were “normal” or unusual.
  3. Students failed many of the same teacher-provided tests for the same reasons. These <test failure, problem> patterns would evolve in my mind as I helped more and more students. I was sometimes reduced to an inefficient middleman of debugging advice serving a long queue of customers: “Oh, you failed test 286? Check your Z logic!” The class forum did not seem to alleviate this problem, possibly because students did not encode their test failure problems in a consistent manner for easy searchability.

The professor, who had created the course decades ago, had created multiple implementations of every major circuit design assignment over the years. He was the real resident expert. I shadowed him helping students like medical student might shadow a senior physician. And yet, in order to have the perspective he had without also spending decades teaching the same class, I wanted to augment my cognition with technology. Without knowledge of human-computer interaction, UI design, or any toolkits for creating an interface, I did not even know how to start.

Trigger

Something beyond the control of the protagonist (hero/heroine) is the trigger which sparks off the story. A fairy godmother appears, someone pays in magic beans not gold, a mysterious letter arrives … you get the picture.

I was wrapping up my late-night shift helping students in the computer lab, when a student asked for help with the Turing machine lab. His Turing machine could not detect whether a string of open and closed parentheses was properly balanced, i.e., had an appropriate closing parenthesis for every open parenthesis. He was running out of time. The semester was almost over, and if he was not successful within the next day or two, he would fail our course.

On that late night, I had a hard time helping Paul turn his Turing machine into one that would pass all the automatic grader’s test tapes. I could not tell whether his approach could work with a little debugging or was fatally flawed. I did not want to demoralize him by telling him to abandon his efforts so far and start over unless it was truly necessary. And what if he was on the path to a novel solution no one else had created before?

While not a necessary condition for the correctness of his intended approach, finding a similar working solution in the pool of previous student submissions would serve as an existence proof and evidence that he should persevere. I had copies of hundreds of other students’ correct Turing machines, but it was difficult to see by inspection whether any of these solutions were successful implementations of his intended approach.

This was due to both superficial static and dynamic differences between solutions. Students design their own symbol library and state names for the finite state machine portion of their Turing machine. Using their custom symbol and state names, students can list behavioral specifications in any arbitrary order. Even after translating these textual statements into a diagram, the diagrams can look very different from one correct solution to the next. It is difficult to immediately see any differences in behavior or strategy based on these representations.

Eventually, through trial and error, he got his Turing machine to work on all the test tapes, and passed the course. But I was deeply troubled by the experience. I felt like the information was there, in our staff server, but my brain could not see through the superficial textual differences nor mentally execute all the student solutions I had access to in order to have the perspective that would have helped me help Paul.

The quest

The trigger results in a quest — an unpleasant trigger (e.g. a protagonist losing his job) might involve a quest to return to the status quo; a pleasant trigger (e.g. finding a treasure map) means a quest to maintain or increase the new pleasant state.

After the semester ended, I looked into this further. I ran all the correct two-state Turing machines collected during the semester on the same test tape containing a string of open and closed parentheses. The movement of the tape-reading head across this input was logged in coordinates relative to the common starting point at the left end of the test tape (see Figure 1). I had watched so many students’ Turning machines execute before that I knew there were be some patterns, I just did not know exactly what or how many there would be.

Based on visual inspection, I found that most machines exhibited one of two common movement patterns. Roughly 50% of students used Strategy A: pairing inner sets of open and closed parentheses (Figure 2). Another 40% used Strategy B: pairing the first open with the first closed parenthesis, the second open with the second closed parenthesis, etc. (Figure 3). The bold trajectories in these figures represent particularly clean representative examples.

Figure 1. Tape on which all 148 student-written two-state Turing machines from one semeter of 6.004 were tested, and the numbering system by which locations along the test tape are identified.
Figure 2. Strategy A Turing machines: those which paired inner sets of open and closed parentheses, as is standard in mathematical notation. (73 out of 148 Turing machines)
Figure 3. Strategy B Turing machines: those which paired the first open with the first closed parenthesis, the second open with the second closed parenthesis, etc. (58 out of 148 Turing machines)

The remaining ~10% solutions (not shown) include less common strategies, including at least two that are actually incorrect, i.e., cannot handle arbitrarily deep nesting of parentheses, but are marked as correct because they pass all the test cases in the teacher-designed suite.

Many teaching staff members:

  1. were aware of the Turing machine solution they each had found on their own time.
  2. not aware that there were two mutually exclusive common solutions. At least one staff member admitted steering students away from solutions they did not recognize, but in retrospect may have indeed been valid solutions.
  3. not aware that the test suite was insufficient to determine correctness.

Based on my analysis, I was able to brief fellow staff members on the space of solutions — both good and bad — in preparation for subsequent semesters and suggested additions to the test tape suite to catch more submissions that were previously erroneously marked as correct.

Surprise

This stage involves not one but several elements, and takes up most of the middle part of the story. “Surprise” includes pleasant events, but more often means obstacles, complications, conflict and trouble for the protagonist.

I bumped into Rob Miller in the stairway of my building. “I’m interested in helping people learn. Do you know anyone who is working on that?” I asked. “I do. Come to our research group meeting,” he replied. By the end of the semester, I officially joined Rob’s group as one of his PhD students.

Encouraged by my early success analyzing Turing machine, I tried employing a similar methodology to student-written simulated digital circuits. However, I was using the language I was most comfortable with at the time, MATLAB. Not the right language. I got nowhere.

Then, Rishabh Singh arrived on my virtual doorstep, with a pile of student-written Python solutions to the same programming exercise, freshly dumped from the logs of MIT’s introductory Python programming course on edX. Like tracking of the head of the Turing machines, we realized, after a few false starts, that we could identify meaningful patterns in Python solutions by tracking the values of variables during execution on test cases.

We brought in Philip Guo, who contributed variable value-logging code from his Online Python Tutor, and Jeremy Scott, who helped implement the UI, and created OverCode:

We evaluated OverCode against a non-clustering baseline in a within-subjects study with 24 teaching assistants, and found that the OverCode interface allows teachers to

  1. more quickly develop a high-level view of students’ understanding and misconceptions
  2. provide feedback that is relevant to more students

OverCode addressed the first two of my three frustrations as a teaching assistant:

  1. It canonicalizes students’ code so that I can skim their answers as a group or as individuals without first adjusting to each of their variable naming, statement ordering, and style choices.
  2. The diversity and distribution of good and bad designs generated by our students is made explicit and explorable. For example, OverCode shows common solutions, which are often reasonable since many students independently arrived at them, and as the user scrolls to less common solutions, solutions can become more convoluted, include unnecessary code, use reasonable alternative Python language keywords and constructs, or solve the problem using teacher-forbidden strategies, e.g., recursively rather than interatively.

Critical choice

At some stage, your protagonist needs to make a crucial decision; a critical choice. This is often when we find out exactly who a character is, as real personalities are revealed at moments of high stress. Watts stresses that this has to be a decision by the character to take a particular path — not just something that happens by chance.

I felt I needed to do more — more impact within teaching environments or more technically sophisticated. I approached the lecturer who served both MIT’s residential and online introductory Python programming courses. I asked her questions like, “What do you think? Would you like to use it? What do you need that it does not do yet?”

She played with it. She thought it was cool. Then she asked about something I never expected, “I want to see their variable names.”

Some people undervalue variable names, but they are the most basic form of documentation that helps humans read programs. This @HackerNewsOnion tweet is included for the sake of humor; if you ever wrote a function which used these variable names, everyone else, including your future self, would hate you. Some introductory students mix themselves up by creating names that suggest something other than what the variable contains, e.g., swapping names for the index into an array with the name for the value in the array at that index. It can throw a wrench into their program comprehension process.

Donald Knuth compares a good programmer to an essayist who, “with thesaurus in hand, chooses the names of variables carefully and explains what each variable means” [Literate Programming]. Without modifying execution, names can express to the human reader the type and purpose of an object, as well as suggest the kinds of operators used to manipulate it.

When I asked our resident lecturer of software engineering what makes for a good variable name, he asked, “Do you want to start a holy war?” Various naming conventions, like Hungarian notation, have evolved to help developers use their freedom wisely. The Google C++ Style Guide authors assert that their most important consistency rules govern naming, which are arbitrary but consistent in order to increase human readability.

Programmers can develop their own heuristics for good variable names through the experiential learning process of building, debugging, and sharing increasingly large programs with others and their future selves. During interviews, one professor explained an elaborate set of guidelines that she personally developed and teaches to her students, e.g., all method names must be verbs.

In practice, very little time is spent on variable naming in introductory classes. Maybe a lecture, at most. It felt like a risk to build something specifically for this, given that it was not already a task weighing down on teachers. But, the user had spoken. Maybe, with new tools, giving feedback on variable names would become time-efficient and scalable enough to become common place.

Building on top of OverCode and the support of my co-authors Jeremy Scott and Lyla Fischer, I designed and implemented Foobaz, a system that enables time-efficient, scalable, personalized variable name feedback within the context of existing assigned programming exercises. The teacher’s view is a variable name browser where names are shown along with all the context necessary to judge their quality. Teachers only need to annotate a small number of variable names with qualitative judgements in order to generate personalized formative assessments about variable naming for the majority of their students.

In our first user study, the Foobaz system helped each of 10 teachers give personalized variable name feedback on thousands of student solutions from MIT’s edX introductory Python programming course. In the second study, students composed solutions to the same programming assignments and immediately received personalized quizzes composed by the teachers in the previous user study.

Like OverCode, Foobaz addressed the second of my three frustrations as a teaching assistant: making the diversity and distribution of good and bad choices, i.e., variable names, explicit. But what about my third original frustration, about helping students even get to a correct solution?

Climax

The critical choice(s) made by your protagonist need to result in the climax, the highest peak of tension, in your story.

I had built, validated, and published two systems now, all for correct solutions. It is not unreasonable to focus on correct solution in an environment where most students in introductory programming classes can check and check and check their code against teacher-designed test suites until they pass. However, a huge amount of teaching staff time, especially in residential environments, on helping students get to a correct solution.

I recalled yet another stressful moment in the computer lab as a teaching assistant:

Student A: “I cannot figure out why I’m having this bug.”
Me: “Oh, I haven’t seen this behavior before. Strange…”
<We sit together, brainstorming, trying things, debugging, stuck…>
Student B walks in, overhears us: “Oh, I just had that buggy behavior. Have you considered…?”
They chatted. Yes, they had separately created the same bug, one after the other. Student B helped Student A debug it. Mystery solved!

By random chance or because I am not a novice, I had not generated that problem in my own code before, but Student B had. I could model good debugging skills, but Student B could be just as, if not more, helpful to Student A in a short amount of time. Student B had earned expertise through experience that I had not.

I felt a mixture of pride at this peer-teaching moment, relief that Student A could move on, and embarrassment that I had not found the bug myself, as the supposed expert in the room.

Reversal

The reversal should be the consequence of the critical choice and the climax, and it should change the status of the characters — especially your protagonist.

Shortly after that experience of coincidental peer teaching, I stopped thinking of myself as a “resident expert” and started thinking about how to help students more directly share their earned expertise with other students, in a way that would be beneficial for both the giver and the receiver.

After helping each student fix a bug and observe a test case go from failed to passed, I asked the student to compose a hint for other students like them who are still failing the same test. I then asked them to enter that hint, indexed by the id of the test case it might resolve, in a web application I wrote, where it would be accessible to all students in the course. Other teaching assistants started pointing students to the web application before helping them, to decrease repeating themselves. We also released it to edX students taking the same course online.

When I saw two students with complementary optimizations of an adder circuit, I made them teach each other their respective optimizations before I gave them their required lab check off.

Then I pushed on toward automating this peer-to-peer help on optimalizing circuits. For a simpler circuit design assignment, I used the number of transistors as an approximate metric of optimality. When students submitted a correct circuit, they were automatically shown a slightly more optimal student solution and a slightly less optimal student solution. They then had to

  1. write a hint about how to turn a solution like theirs into the more optimal solution, forcing them to engage with and hopefully learn from a more optimal solution
  2. write a hint about how to turn a solution like the less optimal solution into a more optimal solution like theirs, using their earned implementation expertise in the process

In a small lab study, we found that students could improve their own solutions based on these hints written for solutions like theirs by other students who had the experience to write them. I dubbed this “targeted learnersourcing.”

These workflows also addressed the third and final of my three frustrations as a teaching assistant, by letting students be the messengers of the lessons they learned during debugging. Rather than doling out hints based on my own observations from seeing the same bugs over and over again. I no longer had to be omniscient, like the professor I shadowed. I could let them step into that role and compose hints for each other, for the bugs they had resolved and the designs they had implemented, after they had earned the necessary knowledge through their own experience.

Resolution

The resolution is a return to a fresh stasis — one where the characters should be changed, wiser and enlightened, but where the story being told is complete.

So here I am, on the cusp of a postdoc at a school with even larger residential programming classes, with new assignments and professors to support. To get here, I shed my former life as a roboticist and learned how to write Python programs that process and manipulate other Python programs, design and implement student and teacher facing systems, and deploy web applications. I found and fell in love with the human-computer interaction community, which provided venues to share my exploits in user interface design and crowdsourcing with learners. As long as I stay curious about all the different solutions to programming problems, I’ll try to deploy forms of these systems that illuminate and make use of the great diversity of solutions (and bugs) students create.

Thank you Ali Alkhatib, Niloufar Salehi, Sharon Zhou, and Ethan Fast (and one person whose name I didn’t catch!) for your feedback during #writers-block.

Acknowledgements

Thank you, thank you, thank you to my advisor, Rob Miller, and my other co-authors on this thesis work, Rishabh Singh, Jeremy Scott, Philip Guo, Lyla Fischer, Aaron Lin, and Carrie Cai. The past and present members of our User Interface Design group were instrumental in helping me feel at home in a new area and get up to speed while having fun. I’m also grateful to my past internship mentors at Google and MSR, specifically Dan Russell, Merrie Ringel Morris, and Andrés Monroy-Hernández, who gave me the chance to try out my chops in new environments. I also want to thank my friends outside MIT, especially the greater Boston community of wrestling coaches, who helped me learn important lessons outside the classroom. And, last but not least, my partner Victor, my parents, and my brother, who have each supported me in their own way, from hugs, to proof-reading papers, to being technical sounding-boards, and finally, to demonstrating, as my brother has, how to switch fields, pick up a whole new set of skills, and look good doing it.