CSR Tale #11: All Things Accelerators with Chris Rossbach

Published in CSR Tales · 13 min read · May 17, 2019

By Amogh Akshintala

This story is structured as a conversation between one of our editors, Vijay Chidambaram, and Prof. Christopher Rossbach. Chris is an Assistant Professor at UT Austin and a Senior Affiliated Researcher at the VMware Research Group. Chris' work tends to focus on compute accelerators, their interfaces, and support for them in system software. Vijay sat down with Chris to talk about the impetus for this line of research, how he picks projects, and lessons he'd learnt that might be of interest to the broader community.

— — — — — — — — — — — — — — —

The root of the research vision

Vijay: One of the things I really like about your research is that you have a strong vision for where you want to go. How did it all start? How did you get interested in heterogeneous computing? Let's trace back to the roots of one of your recent works, AmorphOS.

Chris: My interest in this is really driven by the desire to tease out the realism of the heterogeneous vision by building real systems. For 15 years we've been reading and writing papers that say Moore's law is ending and we're going to have specialization. I find the end point of the heterogeneous vision, the one in which you have a bunch of specialized devices all in some constellation, to be a compelling vision, but it's not as clear to me that we can ever get there. So what I've really tried to do for the past four or five years is to say, okay, if we get there, what sorts of things should we be able to do that interest me, that I would want to build? Can I realize them with some combination of current and hypothesized hardware? What is really involved in building a system that does that? Ultimately, it all started with 3D cameras. Back in 2009, when I was in graduate school — it doesn't pay well, as you might recall.

V: (laughter) Yes.

C: I found that it did a lot for my state of mind to make money on the side. So I was doing consulting for a company that built these 3D cameras, that eventually was acquired by Microsoft and became the pieces of hardware that you know as the Kinect line. The company I was working for was a competitor, but Microsoft bought us out.

V: Who were the folks involved in the startup? Was this a UT startup?

C: This is actually back in California. So before I went to graduate school, I was playing music for a bit. When I decided to go back to grad school, I started playing with computers again. I worked with a company called Canesta for three or four years; they had infrared, laser-based 3D cameras with a 64-by-64-pixel array. I started doing consulting for them, and over the course of the next several years, all through grad school, I came to own more and more of their technical infrastructure. I wound up writing all of their user-mode libraries, their SDK, many of their demos, and much of their driver stack.

V: And you were doing this while in grad school?

C: Yes. By the way, if you're a grad student reading this, don't do this! My happiest time in grad school was a period of about seven months when Canesta didn't have any work for me. I was able to focus on just classes and research. Oh my god, it's so much easier. So whether it was a good idea or not, it is what I did. Where I made the leap from transactional memory to emerging hardware was when we were actually trying to build the gesture recognition system for the 3D cameras: a major PC vendor had contracted us to build the next generation of TouchSmart, which would have four depth cameras on either side. The first thing we did was assemble a screen that had these cameras on it and hook it up to a CPU, and it was consuming 99.9% of the CPU just unpacking images and looking for hands. That's not what you buy a computer for.

So GPUs seemed like a natural fit for this. So I got in my car, we were in Austin, drove up to Fry's and bought a GTX 580. Drove home. Didn't fit in my computer. Okay, fine. I actually wound up taking a saw to the chassis so that I could get the thing in there. Okay. Put it in, turn it on. Won't boot. Why? Oh, the power supply is not big enough. Drive back to Fry's. Come back. Put the new power supply in. It's still not enough power. Drive back to Fry's. Come back with a thousand-watt power supply. By the time this was over, my computer in my office just looked like a mound of all these things. I couldn't get the power supply in it. The GPU is hanging out the back. It's got 4 cameras. It looked like a, you know, sort of Frankenstein-movie computer. Eventually I learned to program the GPU by trying to process the images from these cameras on it. The next step was to put it into a driver, because that's what was required in order to use it as a HID device in Windows. And lo and behold, you can't do it.

V: How come?

C: There were no kernel-side interfaces. Part of the issue is that you have the user-mode library at the top, a well-defined interface, and at the hardware boundary you've got something like MMIO, another well-defined interface. Everywhere in between, it's opaque by design, and there are lots of good reasons for it. I call this siloization.

The lack of interfaces at any other layer means the OS is at the vendor's mercy; there's no way to do it. So what we wound up having to do is figure out some way to make up-calls from kernel mode into user space, etc. It's just that the interfaces weren't there. And because the interfaces weren't there, the OS can't do resource management. You can't provide any guarantee to the user of the form “your input device will work.” So the project failed; they scrapped it.

V: But it seems like what you were describing would be doable to build: a giant engineering task, but doable.

C: So if you start from a clean slate, it's doable. In fact, if you start with a clean slate, it's not that you can't figure out a way to get the data. But this is also about low latency, right? You could build a pipeline that captures data from USB and exports it up into user space, then into some user-facing GPU library, down into the GPU, and back and forth. That's doable. But the end-to-end latency would be milliseconds, and that is not what you want from your mouse.

V: (laughs)

C: The devil's in the details, and that's probably not a very detailed rendering of it, but basically the trade-off space was such that you couldn't build this system. Then I started digging into it, and I'm like, well, actually there's a whole bunch of problems that stem from stack structure and from the abstractions for other compute devices — they are too device-centric. For example, when you write a program on your CPU, you don't open up an emacs buffer, type int main, and then, on line one, go find my CPU device and start it up… You allocate memory and stuff, but you don't launch a kernel on it. You don't copy data to and from it, etc.
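To make the contrast concrete, here is a minimal sketch of the device-management pattern Chris is describing, written against the standard CUDA runtime API. The workload and the numbers are made up for illustration; this is not code from PTask or Dandelion, just the ordinary find-the-device, copy-in, launch, copy-out boilerplate that GPU programs carry.

```c
// Toy example (compile with nvcc): scale every value in a hypothetical frame.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main(void) {
    const int n = 1 << 20;                 // one frame's worth of pixels (made up)
    size_t bytes = n * sizeof(float);
    float *host = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) host[i] = 1.0f;

    float *dev;
    cudaSetDevice(0);                      // "go find my device and start it up"
    cudaMalloc(&dev, bytes);               // manage the device's memory explicitly
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   // copy data to it

    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);           // launch a kernel on it
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);    // copy data back from it

    printf("first pixel after scaling: %f\n", host[0]);
    cudaFree(dev);
    free(host);
    return 0;
}
```

Almost none of this code is about the computation itself; it is about locating, preparing, and feeding a device, which is exactly the device-centricity being described.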

The device management API as the fundamental abstraction for heterogeneous compute breaks a whole bunch of things. Once I realized that, a whole bunch of adjacent problems became clear. That culminated in my conclusion that, okay, what we need is data flow. We need to get device management out of the set of abstractions. So I built PTask, which is a data-flow system, and then discovered, okay, well, that solves one problem. Now the OS can manage it, but it introduces another problem, which is that nobody can write a data-flow program. Out of the frying pan, into the fire. So what do we need? We need a compiler that can take your normally expressed program and, underneath the covers, turn it into data flow. And so we built such a compiler, Dandelion. It also had an FPGA backend, built by Eric Chung and John Davis. So it was almost like I was traversing this labyrinth of adjacent problems — OS, GPU, data flow, compilers for data flow on GPUs and FPGAs. Finally I found myself circling back to the question of whether the lessons and ideas I learned doing GPUs generalize as well as my instincts tell me they do for arbitrary accelerators.

So the seeds of AmorphOS can clearly be traced all the way back to Canesta, in 2003. But the formalization of the idea, you know, the first time I put a stake in the ground that I wanted to explore that area, was 2013 or 2014. I didn't actually do any system building for FPGAs until late 2015, and then there's the long road of getting from the idea of 'I want to do FPGA OS support' to 2018, when it was published.

The journey from idea to publication

V: Could you talk about the journey from idea to publication?

C: The journey was challenging partly because there's a new level of difficulty associated with FPGA programming that I sort of underestimated. There are certain parts of the development cycle that you take for granted. I don't think the hard part is so much the mindset of going from a CPU to a GPU, like, oh my God, we have to think in parallel. You can do that. You can learn abstractions. Similarly, the mindset of going from a CPU or GPU to writing Verilog, that's a broader chasm, but again, it's an abstraction. Programmers, we learn how to write code to abstractions.

It's just more that the development process is so much more frustrating. You actually need to take some time to be clear about your design. There's just much, much higher latency between whatever change you make, whether it's to your design or to your code, and testing and deploying it.

The first thing you’re going to do is you’re going to build something and then you’ll simulate it, and you’ll get the wrong answer. Okay, what do you do? Pretty soon, you’re looking at wave forms. After some time, okay great, it’s sort of abstractly logically correct. Now I’m going to put it on the FPGA board and we’re going to be good, right? No.

A whole different set of problems comes up: even if the logic works in simulation, it may fail in ways that are difficult to understand if you haven't got the background. Timing will fail because your wire paths are too long, or you don't have enough buffers, or enough clock domains. These are things that people coming from a systems standpoint are not used to dealing with. You're like, “I debugged it, it's correct, why doesn't it work?” The observable behavior is either a compile error with some inscrutable message that is typically worse than LaTeX, or you start it up on the board and it just doesn't work. It just sits there doing nothing. You want to debug it, so you put it back in the waveform viewer, but there doesn't seem to be anything wrong.

V: So how did you get past this?

C: Sheer obstinacy. Just keep going and you get better at it.

V: Just to add a little context to this: you've been looking at this since 2015, right? Has the tool chain gotten better?

C: (Deep sigh) No. Well, anybody who's been working on FPGA development this whole time would have to say yes; the relative impact of each of these death-by-a-thousand-cuts problems is lower. One big problem was that I would make a change and it would take hours to compile. That's improved: it takes 90 minutes now. Still, you've totally forgotten what you were doing. You've gone to lunch, you come back, you're completely onto a hundred other things by the time your build completes. I think the constellation of challenges has not changed.

The other reason doing this work was hard is that I felt there was some pushback from the systems community and the FPGA community about the idea that you would ever share this hardware. You could put 120 CPU cores on an FPGA from Xilinx circa 2014. The F1 devices that we use in the AmorphOS paper are huge! Saturating them is very, very difficult. So there is tremendous opportunity for sharing.

However, the FPGA community's perspective is different: they believe users buy an FPGA and use it to build an application-specific custom solution. They don't think users would buy an FPGA to actually reconfigure it in situ in a server. There was never this kind of malleable-accelerator vision that is behind the AmorphOS paper. We also got pushback saying that even if your FPGA is huge, you're pin- or I/O-constrained along the edges, so you wouldn't share it.

I think much of the history of this project was about subsequent revisions of the core idea, which was present in the first prototype I built in 2015: presenting the idea in a compelling way to the systems community, and convincing them that sharing was a good idea. Very early on, with my own prototype, I had data on a small number of applications showing that you could share an FPGA profitably. We got more data in 2016, and then the paper was submitted multiple times as we learned how to write it and as we improved the system. I think we realized that positioning it as a utilization and compatibility problem helped people appreciate it. “I've got to rent this thing, and I'm going to be billed whether I utilize it or not.” “I don't want to be locked into F1. I want to be able to go to Catapult. I want to be able to go between Xilinx and Altera.” All of the same ideas that inform and enable OS-level sharing support also enable these other things. Focusing on those, I think, made the motivation compelling and accessible.

Positioning is crucial

C: People think the challenges are core technical challenges, and oftentimes they're not; it's more like, how do you present your solution? How do you make it accessible to your community? I think saying things like this in the open totally helps with that. Many graduate students don't know about these things. I think I've learnt something in the process: for pretty much every other system I'd ever built, I didn't get these forms of pushback. I thought that if you had a really good technical idea, it was okay to submit a less-than-polished version. However, I learnt quickly that if people don't buy the vision and the writing is sloppy, it's over.

V: Completely agree. Writing is very important, and my view is that in the final days near the deadline, you have some time, and you can spend it on doing some more experiments or on polishing the writing. My take is that someone (perhaps the advisor) should polish the writing. I feel like it has more return on investment for time spent than, say, one more experiment. Of course, you have to be over the hump to do this; if you're missing crucial experiments, it won't work.

C: I think that's something that, before AmorphOS, I probably would have come down on the other side of, because every previous paper was so easy to motivate — “Oh, you're going to build a compiler that takes a sequential program to a cluster of GPUs? OK.” There was always one more experiment to run.

V: Now that AmorphOS is published, I feel like it falls into the bucket of work that, in hindsight, feels obvious, like we should have been doing this all along. It's interesting to hear that that wasn't the experience for you while you were doing it.

C: We did see that for every single objection people had, there is some nugget in the AmorphOS paper that empirically shows that, yeah, that's a concern. But it's not necessarily the only point in the space. You don't want to put two applications that both hog all the memory bandwidth on the same FPGA; that's a bad idea. Co-schedule applications with complementary resource demands. Everyone was right in their objections, but we weren't expressing it in a way that made the whole picture accessible and compelling.

How to formulate a research vision

V: A common problem in coming up with a research vision is that researchers find many things interesting, and they may not have the one compelling thing that they really want to see happen. Now that I know this story, I understand how you came up with your vision: you actually worked on it and you saw the problems first-hand. Do you have any tips for other people who are struggling to formulate a vision?

C: I mean, you just sort of said it. My tip might not be realizable; I've had kind of a hard time imparting this wisdom in a usable way to my students. The tip would be: always be driven by a concrete workload. I could have sat down and said, oh, GPUs don't have OS-level resource management. That's a problem. We ought to have it. But I don't think I would have stumbled on the important issues nearly as quickly, and in fact, I don't think I would have had nearly as easy a time putting it in a context where people can relate to that research, if I hadn't been looking at this broader agenda of “We've got cameras, we've got parallel hardware, shouldn't we be able to do X?” In my case, X was building gesture recognition. Anytime you approach a problem, come up with the following statement — I'd really like to be able to do X. What are the technical barriers? As long as X is sufficiently forward-looking, you're gonna find technical barriers at every layer. So then the challenge becomes one of choosing X such that X is something that other people also think we ought to be able to do.

My X was “I want to be able to fully utilize this piece of expensive hardware, and if I don't build this, I can't.” When I talk to students, some of them go, “Oh, I want to make this tweak to Spark, I want to introduce caching.” I mean, there are compelling research agendas that are about improving caches or extending a system. Sometimes research projects like that materialize, and you can get traction with them. But the best stuff really comes from something audacious, like “Shouldn't we be able to have a self-driving car?” So that would be my advice for formulating a research vision.
