AMPLab’s co-creator on where big data is headed, and why Spark is so big
Michael Franklin is professor at the University of California, Berkeley, and co-director (and co-creator) of the popular AMPLab there (AMPLab is short for Algorithms, Machines and People). The wildly popular big data framework Apache Spark came out of AMPLab, as have a handful of other popular technologies.
In Part 1 of this edited interview with Franklin, he discusses how the AMPLab came to be, why Spark has caught on the way it has, and what problems he and his colleagues will tackle next.
Part 2 of the interview, which will publish soon, focuses on Franklin’s thoughts about the database industry, where he focused much of his early-career attention.
SCALE: Can you start with your personal history, for those who are unfamiliar?
MICHAEL FRANKLIN: I am a database guy by training. Very early in my career I got involved in building one of the earliest massively parallel database systems as a programmer. That kind of got me hooked on research and the enjoyment and the challenge of trying to think a little bit beyond — or maybe more than a little bit beyond — what the industry is currently doing. Figuring out what new techniques and what new methodologies you need to come up with to solve bigger problems than people are currently solving.
After working on parallel database systems, I decided to go back to graduate school and get my Ph.D. My work there was mostly on distributed systems and trying to understand how to optimize distributed memory, particularly for database applications, and also how to provide good fault tolerance in a distributed environment.
When was this?
This was in the early ‘90s.
After my Ph.D., I got a professor job. Really, my research in my early years as a faculty member was along those lines, parallel and distributed database systems. When I got here to Berkeley, there was a lot of excitement around the Internet of Things and sensor networks, so we started looking at how do you provide data management in highly distributed, highly dynamic environments like sensor networks.
I also did a lot of work on real-time stream processing. Part of that led me to start a company with one of my students to do streaming database query processing. We had a company called Truviso, which really pioneered the concept of what we were calling a stream relational database system, where within a single system you can work on streaming data and you can work on stored data.
The philosophy was that all data begins life as streaming data. It’s moving from where it’s created to wherever it is going to be processed. Data only really stops moving when you store it. So the idea was to build a system where you could unify the processing of data in motion and data at rest, and that’s really what Truviso did.
Truviso was ultimately acquired by Cisco and, as far as I know, is currently part of the technology stack of what they call connected analytics, which is at the heart of a bunch of their Internet of Things initiatives. It’s also being used for network analytics there.
“What I realized was that you could … have the systems people kind of working for the machine learning people to try to solve scalable machine learning. That was really the start of AMPLab.”
Uniting algorithms, machines and people at AMPLab
How did you get involved with AMPLab?
As part of starting and running Truviso, I took a two-year leave from Berkeley, which is something a lot of faculty do. While I was out doing that two-year leave, in the 2007–2008 timeframe, if you were talking to any customers at all you couldn’t help but notice that everybody was just starting to collect more and more data, more and more quickly. And they were trying to figure out how they were going to deal with all this information they were collecting, and how they were going to get value out of it.
When I came back to Berkeley after that leave, it was pretty clear that was an area I wanted to work in. Now, that’s what people call big data analytics; I’m not sure that the term was there quite at that point. But it just so happened [Databricks co-founder] Ion Stoica, one of my colleagues here, had also been out doing a company and his company was actually using a lot of data. They were collecting a lot of data because they were doing video distribution and they were using huge amounts of data to make intelligent decisions about how to route and how to transmit video. So Ion and I and some of the other faculty started talking, and it became clear that as a research topic, big data analytics was going to be a big deal.
We were fortunate that there was a project going on here at Berkeley at the time called the RADLab — short for Reliable Adaptive Distributed systems — which was a cloud computing project, really looking at how do you use machine learning to automatically provision, maintain and run large computer systems. So in a sense, machine learning people were brought together to work with computing systems people to help solve this problem, and in some sense the machine learning people were kind of working for the computer systems people.
What I realized was that you could take that relationship and flip it — keep the team together, keep the machine learning people and the systems people working together, but now in this case have the systems people kind of working for the machine learning people to try to solve scalable machine learning. That was really the start of AMPLab.
People really like to focus on the data analysis part of machine learning, but the fact that we have optimized systems to do that at scale is also important.
There has always been a huge disconnect. Machine learning was largely looking at fairly small data and how do you come up with more efficient learning algorithms that had provable properties about convergence, correctness and things like that. A lot of the scalable data-management work was focused on fairly simple analytics — counting things and aggregating things. So there was a need to bring these two things together to do advanced analytics at scale.
How has AMPLab been so successful in fostering such popular research?
Our unfair advantage at AMPLab was that at just about any other academic or even industrial lab, there will be a machine learning group — or even more than one — and they will be working on their machine learning problems. There will be systems groups and they will be working on systems problems. They, by and large, don’t talk to each other.
Because of this RADLab project, we had a group of world-leading machine learning people and a group of world-leading systems people that had been together long enough that they could sort of speak each other’s language. That was a huge benefit that we could leverage trying to do scalable analytics. That was a big part of AMPLab getting a fast start.
Another interesting thing about AMPLab is its interdisciplinary nature, at least across computer science. We have machine learning people, systems people, I’m a database guy. We have security people, networking people, human interface people. So it really covers a wide swath of computer science. It’s a real collaboration, where people sit together, they work together, the lab is in one physical space — so it’s optimized for interaction across these different areas.
Another thing we did, initially out of necessity, was at the very beginning we engaged industry quite closely. When we were starting the lab, our intention was to fund it completely with industry money. We weren’t expecting to get any government funding for it. IBM was actually very interested and made a small seed grant initially.
“We try to look beyond the day-to-day problems people in the industry are facing, but part of the reason we have been able to have so much direct impact is because of this constant engagement we have with people in the industry.”
Google was really the first place we approached. We told them this vision we had of integrating algorithms, which is machine learning and statistical processing; machines in terms of cloud computing and cluster computing; and people in terms of crowdsourcing and human computation, and putting together this team to do this. Google got very excited about this and made a major multi-year commitment to supporting the lab, and then we were off and running.
Where we are now is we have 30 industrial sponsors and we meet with the sponsor companies throughout the year. But we have two main meetings per year where we show them everything we are doing and we get very detailed feedback from some of their best technical people, where they talk to our students, talk to our faculty and give their opinions on the directions we’re taking. They tell us about things that they are seeing in their worlds that they would like us to look into, so we get great insight into where some of the problems are in industry.
By and large, they’re happy with us leading the intellectual directions of the lab. They are not interested in giving us very detailed guidance, but rather want to see where we take things, because as a university we have fewer constraints in many ways than they have.
Is this the future model for how you do a research lab? If I look at Mesos, Spark and the stuff coming out of AMPLab, they’re commercially valuable perhaps because of the nexus between academia and industry.
I think so. Our work is very much informed by what’s going on in the industry and what the open problems are in industry. We try to look beyond the day-to-day problems people in the industry are facing, but part of the reason we have been able to have so much direct impact is because of this constant engagement we have with people in the industry.
The other thing we were able to do is while this was all going on, the federal government started getting interested in big data. Ultimately, through a long process, AMPLab was chosen as what’s called an Expeditions in Computing project from the National Science Foundation. That, along with some other funding we get from DARPA and from the Department of Energy, means our profile right now is about half government funding and half industry funding. It’s kind of the perfect combination for a lot of reasons.
“The decision to go with Scala was an incredibly risky one, and I think at the time if we had realized the stakes we were playing for, it’s an interesting question as to whether we would have really done it.”
Apache Spark as big data revolutionary
Is Spark the biggest thing to come of AMPLab so far?
In terms of what the lab has produced, it’s a pretty sizeable effort overall. There are probably about 80 people involved, including probably 40 to 45 graduate students and postdocs, so it’s a pretty big research effort.
And a lot of what we do ends up being put into the open source system we’re building called the Berkeley Data Analytics Stack, BDAS — it’s pronounced badass. In order to test ideas, we want to implement them, and in order to really implement them we put them in the context of this bigger system so they really solve problems.
If you look at the BDAS stack, there is a bunch of stuff in there. Mesos is part of it. Mesos really grew out of this earlier project called RADLab. Spark interestingly started in RADLab, but it was while all the conversations for creating AMPLab were going on, so we really consider Spark to be the first AMPLab project.
In fact, it is hard to compete with the impact that Spark has had in the short time since it was created. It’s really pretty much taken over the big data ecosystem in terms of parallel processing engines.
Spark is clearly to date the thing that’s had the most external impact, but there’s lots of stuff that has come up that’s been impactful. The Tachyon system for distributed storage is getting a huge following and has a company spun out around it. Then the machine learning stack that we’ve been building — MLlib and the pipelines work we’ve been doing — a lot of that is getting a lot of interest and increasing traction.
What’s happening in the lab now is that, because we’ve established this ability to build useful systems that really do solve problems, by and large when students work on software and release it, people pick it up and try it.
For example, we had a few graduate students as a class project look into how would you execute programs written in the statistical programming language R, which traditionally runs on a single machine. How would you run that over a Spark cluster? They worked on that problem, they solved a part of it (not the whole thing), they called it SparkR, they put it out on GitHub and people started using it. Now, SparkR has been picked up by a number of companies, it’s got an open source following and it’s a real part of the Spark ecosystem.
As the project continues, more and more of the things we’re doing are starting to have impact, and part of the appeal of the Spark ecosystem turns out to be just the breadth of things you can do. There’s a lot of systems out there that are targeted at real-time processing or targeted at MapReduce style processing or targeted at relational query processing. Spark does all of those things pretty well, and the ability to do all of that in a unified system, I think, is what’s really driving Spark adoption now.
“Hadoop MapReduce really whetted people’s appetites and got people to understand that parallel processing wasn’t just for people in supercomputer centers and could be used by mere mortals.”
Speed and utility aside, how much do you attribute Spark’s success to the fact that it’s so developer-friendly?
It turns out to be incredibly programmer-friendly and developer-friendly. The decision to go with Scala was an incredibly risky one, and I think at the time if we had realized the stakes we were playing for, it’s an interesting question as to whether we would have really done it. As a research project, it’s like the people building the system really were excited about Scala. They saw the advantages of using it and that’s what they wanted to use. And I think the faculty by and large said, “Hey, if you’re writing it and that’s what you want to do, that sounds fine.”
What’s been amazing is that, far from Scala holding back adoption of Spark, the ease of use, the conciseness of the programs, and the flexibility and accessibility are such that Spark is actually helping bring Scala into the mainstream, which is a really interesting outcome.
There’s a bunch of decisions that were made, either explicitly or that just happened as we were going along, that had we realized we were building the next generation of big data software infrastructure for the world, I’m not sure we would have made. Scala was certainly one. The idea of putting all these different modalities into the same system — the fact that you can treat the same data as a table, as a graph, as a data frame — is another.
As a research question, it’s fascinating. Can you build a system that encompasses all those different views of data and still gives you good performance? That is a fantastic research question.
It’s not clear if you were starting a project whose goal is to achieve wide adoption that that’s the way you would go. In fact, one of the big reasons Spark is catching on is that you can use it in all these different ways and it works really well in all of them.
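To make the “same data, multiple views” idea concrete, here is a minimal sketch in plain Python (deliberately not Spark’s actual API; the names and structure are invented for illustration) of one underlying dataset exposed both as table rows for relational-style filtering and as an adjacency structure for graph-style queries:

```python
# Hypothetical sketch: one dataset, two "views" (table and graph).
# This is illustrative plain Python, not Spark's real API.

edges = [("a", "b"), ("a", "c"), ("b", "c")]  # the single underlying dataset

# Table view: rows with named columns, queried relationally.
table = [{"src": s, "dst": d} for s, d in edges]
from_a = [row["dst"] for row in table if row["src"] == "a"]

# Graph view: the same records interpreted as adjacency, queried structurally.
adjacency = {}
for s, d in edges:
    adjacency.setdefault(s, []).append(d)
out_degree = {node: len(nbrs) for node, nbrs in adjacency.items()}

print(from_a)      # ["b", "c"]
print(out_degree)  # {"a": 2, "b": 1}
```

The research question Franklin describes is whether a real system can keep all such views over one dataset while still delivering good performance for each.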
It was like the perfect thing to come along after Hadoop whetted everyone’s appetites.
We owe a huge debt to Hadoop. For one thing, we still use the Hadoop file system extensively. But also, Hadoop MapReduce really whetted people’s appetites and got people to understand that parallel processing wasn’t just for people in supercomputer centers and could be used by mere mortals. That was a great start, but it was just lacking in so many ways and Spark got to ride those coattails.
So much of what is coming out of AMPLab, including Spark and Tachyon, is based on in-memory processing. Commercially speaking, is there a point where the price of RAM needs to drop for adoption to skyrocket?
I think this is a misunderstanding about Spark in a lot of ways. Spark is optimized for being smart about how it uses memory, and certainly when everything fits in memory is when it’s at its best. But from the very early days of the project, we knew that there’d be cases where data that you cared about wasn’t going to fit in memory and the idea was that you would want the system to degrade gracefully.
The worst case would be if everything fits in memory, it’s running great, but the minute you grow the data just a little bit larger than memory, you fall off a performance cliff. We engineered from the beginning to make sure that that didn’t happen.
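The graceful-degradation idea can be sketched with a toy two-tier store (this is an invented illustration, not Spark’s actual memory manager): items that fit within a fixed memory budget stay in the fast tier, and overflow spills to a slower tier rather than failing, so performance declines gradually instead of dropping off a cliff:

```python
# Toy illustration of graceful degradation beyond a memory budget.
# Not Spark's real mechanism; the "disk" tier is a dict standing in
# for slower storage.

class SpillingCache:
    def __init__(self, memory_slots):
        self.memory_slots = memory_slots
        self.memory = {}   # fast tier, bounded
        self.disk = {}     # slow tier stand-in

    def put(self, key, value):
        if len(self.memory) < self.memory_slots:
            self.memory[key] = value   # fits in memory: fast path
        else:
            self.disk[key] = value     # overflow: slower, but still correct

    def get(self, key):
        if key in self.memory:
            return self.memory[key]
        return self.disk[key]          # performance degrades, results don't

cache = SpillingCache(memory_slots=2)
for i in range(4):
    cache.put(i, i * i)

print([cache.get(i) for i in range(4)])  # [0, 1, 4, 9] despite only 2 slots
```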
What’s really happening now is, I think, we’re going the other way. We feel that we’ve made a lot of progress in scale-out architectures, so if you get more data you buy more servers and you get more memory, and things just keep scaling that way. Now that we’ve got the problem largely under control, we’re going back and looking at even single-node performance.
There’s a project that’s happening called Project Tungsten, which is a Spark community effort that is focused more on what’s going on in a single server. The Spark community — not as much in the lab but the Spark community — is really looking at, “OK. Now how do we go back and start attacking some of the single-node performance problems, which in many cases are actually easier than the scale-out problems.”
“This is always a pendulum where you swing from highly distributed to more centralized and back again. My guess is there’s going to be another swing of the pendulum, where we really need to start thinking about how do you distribute processing throughout a wide area network.”
The future of data processing is at the network edge
What’s the next big thing that’s going to come up in big data, whether it’s out of AMPLab or somewhere else?
Basically, what we’re trying to do is a couple of things. The first thing we’re trying to do is make machine learning easy to use. Just in the way that you don’t really have to understand the algorithms of a database engine to write an application that has a database behind it, we want to do that for machine learning. We want to make it possible for mere mortals to do advanced analytics and do predictions without having to be experts in machine learning or having to know how to write the algorithms for machine learning.
We’re following the database query-processing playbook in terms of doing that. So in the database system, you have a library of operators that work on the data; you have a language for creating query plans that then takes those operators and combines them to actually produce an answer; and then you have a language on top of that for a programmer to write, where they specify the answer that they’re interested in getting, but not necessarily how to get that answer. There’s a compilation process that goes through all those levels. We’re trying to do that for machine learning.
At the operator level, we have MLlib. At the pipeline level, we have KeystoneML, which is another pipeline system that’s already in Spark. On top of that, we’ll have a declarative interface for people to be able to say, “Hey, here’s what I want to predict,” and then have that be compiled and optimized and executed by the system. I think that’s going to be a huge game changer if we can do that.
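The three-level playbook Franklin describes can be sketched in miniature (all names here are invented for illustration; this is not MLlib’s or KeystoneML’s real API): a library of operators at the bottom, a planner that “compiles” a declarative request into an ordered pipeline of those operators, and an executor that runs the pipeline:

```python
# Hypothetical sketch of the query-processing playbook applied to ML.
# Invented names; not MLlib's or KeystoneML's actual interfaces.

def scale(rows):                    # operator level: reusable primitives
    top = max(rows) or 1
    return [x / top for x in rows]

def threshold(rows, cutoff=0.5):
    return [1 if x >= cutoff else 0 for x in rows]

def compile_plan(spec):             # declarative level: say *what*, not *how*
    if spec == "predict: binary":
        return [scale, threshold]   # the planner picks and orders operators
    raise ValueError("unknown spec")

def run(plan, data):                # pipeline level: execute composed operators
    for op in plan:
        data = op(data)
    return data

print(run(compile_plan("predict: binary"), [1, 4, 8, 10]))  # [0, 0, 1, 1]
```

The point of the layering, as with SQL, is that the user states the desired answer and the system is free to choose and optimize the operator pipeline underneath.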
The other thing that we’re trying to do is to make good on a promise that the AMPLab made when we started, which is the people part. We’re working on how do you integrate human-in-the-loop processing along with the other things you need to do for big data analytics.
“There are just environments and applications where you can’t tolerate the latency. You can’t tolerate the risk of data loss or even a temporary communications hiccup, and you need to react in real-time to the environment.”
I think a bunch of the work we’re doing there could have some real impact, because the reality is today, there’s still a bunch of things where you really want people looking at the data to make sure that everything’s okay. How do you bring that kind of processing into the system?
In terms of where things go from there, as I said earlier, when I first got to Berkeley, people were really excited about these widely distributed systems where you had all sorts of smarts out in the network — you had sensors, you had processing going on at the edge of the network.
This is always a pendulum where you swing from highly distributed to more centralized and back again. My guess is there’s going to be another swing of the pendulum, where we really need to start thinking about how do you distribute processing throughout a wide area network, with highly variable and unreliable components out in the real world that are interacting with the physical world and with people in a real-time way. I think there’s a huge set of problems that people have to solve for that. That’s what I’m going to be thinking about.
So, essentially, we’re looking past the current cloud model of dumb devices sending data back to a central database for processing?
Yeah, understanding where processing has to happen. Clearly, if you can do it back in the central location, that’s going to be the easiest in many ways, but there are just environments and applications where you can’t tolerate the latency. You can’t tolerate the risk of data loss or even a temporary communications hiccup, and you need to react in real-time to the environment. I think that kind of distributed intelligence is a hard problem and an important one.
How much of this is a data problem, and how much is a silicon problem?
I think it’s a data problem. I think the platforms are really pretty powerful already. There are definitely challenges around power and communications and things like that but, by and large, I think it’s getting to the point where it’s going to be a data problem.