MSR Interview #5: Jürgen Cito and Gerald Schermann

Alexander Kristian Dahlin · sourcedtech · Mar 7, 2019 · 10 min read

This article is the fifth episode of our MSR Interview blog series. In case you missed it, you can check out our previous interviews with Abram Hindle, Georgios Gousios, Vasiliki Efstathiou, and Sarah Nadi. This week, we’re publishing the interview with Jürgen Cito and Gerald Schermann. Jürgen is a postdoc at MIT and Gerald is a PhD student at the University of Zurich. Thanks to Waren Long, Vadim Markovtsev, and Francesc Campoy for conducting the interview.

Jürgen Cito (Left) and Gerald Schermann (Right)


Could you please introduce yourself?

Jürgen Cito (JC): My name is Jürgen Cito, I currently work as a postdoctoral researcher at MIT, but I did my Ph.D. at the University of Zurich. There, I also worked on empirical research related to software engineering (SE). The recent study we published at MSR performed empirical research specifically on Dockerfiles from GitHub.

Gerald Schermann (GS): My name is Gerald Schermann, I’m a Ph.D. student at the University of Zurich. My main research focuses on Continuous Delivery and Continuous Experimentation (e.g., canary releases, A/B testing), where we try to use insights from empirical data to provide tool support for developers and release engineers when conducting experiments. Jürgen and I were co-advising a student who developed the tooling to mine Docker-related projects from GitHub and their evolution history.

Is it your first time at MSR?

GS: No, it’s the fourth time we’ve attended MSR and the second year we’ve been presenting. Last year we presented the insights we learned from an empirical study of the Docker ecosystem on GitHub; this year we’re presenting the underlying (now updated and enriched) dataset that we collected to draw those insights from.

When you say richer data, what do you think is the main improvement? What was missing in the previous version?

GS: It’s more the structure of the database in general that we improved. Last year’s version was more like “get the data in the fastest way possible so that we can run some analyses”. That obviously doesn’t lead to the best data representation, yet it was good enough for us to write first exploratory queries. However, in order to make it more accessible, also for other researchers, we had to restructure it. The main improvement was that we introduced the notion of a project; a project can contain multiple Dockerfiles. The previous version focused on Dockerfiles, and project information was just metadata associated with them. Now, having the project as a starting point allows us to run more detailed analyses by taking into account all the Dockerfiles and their evolution within a project.

Okay, and so when you query, do you just get the Dockerfile itself, or is there also some kind of internal structure?

JC: We store a fine-grained model of Dockerfiles and their changes over time (snapshots) by parsing each revision of the Dockerfile on GitHub. This allows us to execute very specific queries against that structure. For instance, we can ask “Give me all the RUN commands that contain Python”. And then we can correlate this with other information by conditioning on the first query: “How many projects that install Python also install nginx?”
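
To make that concrete, here is a minimal sketch of what such queries could look like against a relational export of the dataset. The table and column names (snapshots, project_id, instruction, arguments) are hypothetical placeholders for illustration, not the actual schema of the published dataset:

```python
import sqlite3

# Hypothetical schema: one row per Dockerfile instruction per snapshot,
# with assumed columns project_id, instruction, arguments.
conn = sqlite3.connect("dockerfiles.db")

# "Give me all the RUN commands that contain Python"
run_python = conn.execute(
    """
    SELECT project_id, arguments
    FROM snapshots
    WHERE instruction = 'RUN' AND arguments LIKE '%python%'
    """
).fetchall()

# "How many projects that install python also install nginx?"
both = conn.execute(
    """
    SELECT COUNT(DISTINCT p.project_id)
    FROM snapshots p
    JOIN snapshots n ON n.project_id = p.project_id
    WHERE p.instruction = 'RUN' AND p.arguments LIKE '%python%'
      AND n.instruction = 'RUN' AND n.arguments LIKE '%nginx%'
    """
).fetchone()[0]

print(len(run_python), "RUN commands mention python;", both, "projects install both")
```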

The thing that has changed most during the last year regarding Dockerfiles, I’d say, is the addition of multi-stage builds. Has that impacted your research at all, or is it just something more you need to parse and that’s it?

JC: To capture that, it was just a simple change to the parser, basically. It was so easy to do that we didn’t even think about it too much. Adoption of features and trends over time is one of the interesting analyses we can perform based on our snapshot model. We can see how many Dockerfiles have actually adopted these multi-stage builds. To give another example, we have also analyzed the adoption and distribution of the newer “health check” feature over time. I think this can provide very interesting insights not only for single projects but rather on an ecosystem level.
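
As a rough illustration of that kind of feature-adoption analysis, the snippet below flags multi-stage builds and HEALTHCHECK usage in a raw Dockerfile. It is a deliberately naive sketch (it ignores line continuations and parser directives), not the parser behind their dataset:

```python
def dockerfile_features(text: str) -> dict:
    """Detect multi-stage builds and HEALTHCHECK usage in a Dockerfile.

    A Dockerfile is treated as multi-stage if it contains more than one
    FROM instruction; comments are skipped, continuations are ignored.
    """
    instructions = [
        line.strip().split(None, 1)[0].upper()
        for line in text.splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]
    return {
        "multi_stage": instructions.count("FROM") > 1,
        "has_healthcheck": "HEALTHCHECK" in instructions,
    }

example = """
FROM golang:1.11 AS build
RUN go build -o app .
FROM alpine:3.9
COPY --from=build /go/app /usr/local/bin/app
HEALTHCHECK CMD wget -q -O- http://localhost:8080/health || exit 1
"""
print(dockerfile_features(example))  # {'multi_stage': True, 'has_healthcheck': True}
```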

Was there anything that was unexpected out of this analysis that you did of the dataset, something where you were like “oh, that’s weird”?

JC: We encountered some surprising findings when performing analysis on the RUN command in Dockerfiles. It allows you to run any arbitrary command in the container. So we asked ourselves whether we can find any structure within it. We analyzed a sample of 500 RUN statements and saw that about five categories emerged. About half of the statements are dependencies (apt-get, yum, npm, pip, etc.), 26% are filesystem commands (cp, mv, mkdir, etc.), and then there are permissions, build instructions, and environment changes. The surprising bit in all this was how neatly the majority of RUN commands (~80%) fall into these five categories. We expected it to be quite a bit more heterogeneous, given the variety of projects on GitHub. This insight allowed us to write quite a simple classifier that can now also be used in different contexts.
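
A toy version of such a classifier could be keyword-based, as sketched below. The keyword lists are illustrative guesses in the spirit of the five categories they mention, not the authors’ actual rules:

```python
# Toy keyword-based classifier for RUN commands; first matching category wins.
CATEGORIES = {
    "dependencies": ("apt-get", "apt ", "yum", "apk", "npm", "pip", "gem"),
    "filesystem":   ("cp ", "mv ", "mkdir", "rm ", "ln ", "tar ", "curl", "wget"),
    "permissions":  ("chmod", "chown", "useradd", "usermod"),
    "build":        ("make", "mvn", "gradle", "cargo build", "go build", "configure"),
    "environment":  ("export ", "echo ", "locale-gen", "ldconfig"),
}

def classify_run(command: str) -> str:
    cmd = command.lower()
    for category, keywords in CATEGORIES.items():
        if any(k in cmd for k in keywords):
            return category
    return "other"

print(classify_run("apt-get update && apt-get install -y python3"))  # dependencies
print(classify_run("mkdir -p /var/log/app"))                         # filesystem
print(classify_run("chmod +x /entrypoint.sh"))                       # permissions
```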

What other things have you seen at MSR this year that caught your attention, or thought were kind of cool? Your favorite talk, favorite dataset, whatever?

GS: I really liked the talk about the StackOverflow dataset that shows the evolution of SO posts and their code snippets. Mining StackOverflow is pretty common in the MSR community; if I remember correctly, every year I have attended MSR there was at least one paper that used information gained from SO to run some analyses. However, looking at the evolution of posts, especially updates to code snippets, is interesting. One of their main observations was that code snippets on SO are rarely changed, and consequently, top-voted posts and code snippets might not even work anymore because of changed language or library APIs.

And do you have any ideas on how you could apply that to your research? What cool stuff do you think it could be applied to?

JC: Staying with the theme of StackOverflow, I have seen people applying various NLP methods to learn models from both natural text and code snippets in SO posts. These would also be interesting for my own research interests. I am currently looking into learning program analysis models from software repositories. I think the rich data in StackOverflow questions and answers has the potential to allow us to learn rules for static analysis, for instance.

GS: Another angle is code reuse or code clone detection, checking how many snippets from StackOverflow actually end up in open source projects. Let’s assume that a SO snippet gets updated because of a new API; we could identify projects that incorporated the previous, now broken version of this snippet and could, for example, inform the contributors of these projects, making them aware that their code might break if they update to version X.

So for your paper, you used GitHub to get all the Dockerfiles. How was the experience? Was it easy for you to find all the files, or was it painful?

GS: No, it was actually quite easy; the GitHub API in combination with Google’s BigQuery are powerful tools. We used BigQuery for preselection, querying for all projects that contain Dockerfiles. Docker’s naming convention for Dockerfiles helped us on that front. Once we had the list of all Dockerfiles, we used the GitHub API to check out and analyze the projects.
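
For readers who want to try the preselection step, a sketch using the public GitHub snapshot on BigQuery might look like the following. It assumes the bigquery-public-data.github_repos.files table and the google-cloud-bigquery Python client; running it counts against your BigQuery quota:

```python
from google.cloud import bigquery

# Preselect repositories that contain a Dockerfile, relying on Docker's
# naming convention. Assumes the public GitHub dataset on BigQuery.
client = bigquery.Client()

query = """
SELECT repo_name, path
FROM `bigquery-public-data.github_repos.files`
WHERE path = 'Dockerfile' OR path LIKE '%/Dockerfile'
"""

for row in client.query(query).result():
    print(row.repo_name, row.path)
    # Next step (not shown): clone or query each repo via the GitHub API
    # and walk its history to collect Dockerfile revisions.
```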

JC: At first we tried to answer all the questions we had through BigQuery. The problem is that if you want to do any exploratory work, you hit the free tier limit rather quickly. If you don’t want to spend a couple of thousand dollars to do this research, then I guess our approach of mining after the fact is definitely better. If you have a very specific question and you know the exact query that will provide an answer, then, by all means, I would say BigQuery is the way to go.

Something else I’m really interested in is MLonCode. Have you seen some of that during MSR this year, and what do you think about it?

JC: Maybe not at MSR specifically, but I have seen research at ICSE that combines program synthesis and repair with learning. My current work actually applies ML techniques to our dataset to improve the synthesis of Dockerfiles. The idea of the project is to learn from developer interactions in containers to infer a generalized state machine and synthesize Dockerfiles (I hope to publish a paper on this soon). ML techniques can help guide the search space for this synthesis task.

What do you think of the concept of MLonCode? Have you tried it in your research at all, or is it out of scope for you?

GS: I haven’t applied it in my research yet. My core research focuses on the deployment and execution aspects of software, and apart from feature toggles, which are a way to implement continuous experiments, my research doesn’t really touch source code. However, I guess in your case it’s different…

JC: The work I described earlier on synthesizing Dockerfiles definitely goes into that direction, but it is new territory for me. Going forward, I would like to incorporate more of it in my research.

You were talking about the talk you gave today, and about how you’re using ML to predict the performance of programs, could you talk a little bit about that?

JC: In my dissertation work, I designed an approach that constructs performance models from production performance measurements (e.g., latency) and integrates them into the source code view of developers. To provide near real-time performance feedback, the model is updated incrementally based on changes made by developers. For inference, we use a combination of lightweight analytical models augmented by ML models. The analytical part captures the programming model, and the ML models are trained on measurements that capture the dynamic effects of production systems.
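
As a toy illustration of combining an analytical model with a learned one (not the actual PerformanceHat models), the sketch below fits a per-operation latency model on hypothetical production measurements and composes it analytically over a loop:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical production measurements for one operation (a DB call):
# (payload_size_kb, observed_latency_ms). Values are made up for illustration.
measurements = np.array([[1.0], [5.0], [10.0], [50.0], [100.0]])
latencies = np.array([2.1, 4.8, 8.9, 41.0, 83.5])

# ML part: learn how the operation's latency scales with a runtime feature.
db_call_model = LinearRegression().fit(measurements, latencies)

def predict_method_latency(payload_kb: float, loop_iterations: int) -> float:
    """Analytical part: a loop that issues one DB call per iteration."""
    per_call = float(db_call_model.predict([[payload_kb]])[0])
    return loop_iterations * per_call

print(round(predict_method_latency(payload_kb=10.0, loop_iterations=3), 1), "ms")
```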

That’s really interesting, I’m very curious and I want to try it. Is it open source, can I install it from somewhere easily?

JC: Two of the systems that implement this framework are open source: PerformanceHat for Java in Eclipse [1] and Visual Studio PerfViz for C# [2]. Unfortunately, the complexity of software performance models does not allow for a one-size-fits-all solution that can be used out of the box. The project sites provide some cool demos and documentation on how to get started.
[Aside: The research on PerformanceHat has been accepted at the upcoming ICSE; more details can be found in the following paper: https://2019.icse-conferences.org/event/icse-2019-technical-papers-interactive-production-performance-feedback-in-the-ide]

In general at MSR, what was your favorite thing that you’ve learned during these two days, or three days?

GS: It is impressive to see how MSR is growing from year to year, not just in the number of people attending the conference, but also topic-wise. It is not limited to “let’s mine source code” all over the place; it is very diverse. You see papers about performance engineering, code review, testing, API usage, or, in our case, the Docker ecosystem, and many more.

JC: I think that it’s also getting better every year. I feel papers are consistently becoming more rigorous, in terms of methodology but also in terms of overall quality. I feel like the methods and techniques that people have used over the years can also benefit many fields that are not necessarily classical source code, like configuration code or, more generally, other models of computation that are formally expressed as code.

Apart from yours, did you see any specific datasets where you were like “that’s kind of cool”?

GS: Besides the StackOverflow dataset from this year’s edition that I pointed to before, the TravisTorrent and GHTorrent datasets are always worth mentioning. We used them last year for a paper when we were analyzing build failures in open source projects. We queried failed builds from TravisTorrent and then extracted the specific build logs and classified them according to a build failure taxonomy.

Let’s talk a little bit about the research that you are working on now. You were mentioning types on top of configuration code, on top of Dockerfiles. What do you mean by that?

JC: I want to make writing configurations more robust. I think developing infrastructure and configuration aspects of software is still very exploratory and feels a lot like ad hoc hacking. My new research wants to bring more program analysis to this setup and achieve some form of infrastructure code engineering. I think there might be an opportunity to combine work from type theory and empirical research we see at MSR to learn types derived from probabilistic models. Rules for configuration files are evolving, but so are our repositories. With all the information out there, can we infer static typing rules that are informed from a probabilistic model learned from datasets such as the one we presented?

That is very interesting. By doing that, would you expect to build something that helps people with their existing languages, or do you expect to maybe create a new one that is more robust?

JC: My research philosophy is to observe people in their natural habitat to guide my research and develop approaches that will augment their activity, but not interrupt it. Specifically for software developers, I try to design systems so that they can be integrated into their development workflow, with the hope that they will more likely adopt it. If an intervention is too disruptive, I think people are hesitant to use it (maybe I am also just a bit too pragmatic).

Are you planning to go to MSR 2019, and what do you expect to see there?

GS: As I’m wrapping up my Ph.D., it is not certain whether I’ll have the resources for a submission to the next edition, so I will very likely skip it. I would love to go; MSR is one of the coolest conferences around, especially because of the community.

JC: It’s not far for me so I will probably go. I look forward to papers pushing the boundaries on what we can learn from all this data surrounding software.

