Michael Franklin teaching big data. Source: Hong Kong University of Science and Technology

Database expert on why NoSQL mattered — and SQL still matters

The first part of this interview with University of California Berkeley professor and AMPLab co-director Michael Franklin focused on the creation of AMPLab and some of the important big data projects that have emerged from it.

Part 2 focuses on database technology, where Franklin spent much of his career. Aside from studying it and teaching it, Franklin also started a database company called Truviso, which Cisco acquired in 2012.


How far along are we down the road of distributed databases and data management systems?

Usually, when you say “database person,” you precede it by “grumpy, old.” We’re famous for looking at a lot of systems work and saying, “Yeah, we’ve been doing that for 20 years, 30 years.” Even, frankly, if you look at MapReduce — inside of any parallel database system like Teradata, like IBM’s parallel edition, like Oracle’s RAC, inside there is a MapReduce engine. Those techniques have been known for many years. So grumpy, old database people really did figure out a lot of things in the past.

That being said, I do think as a database person that things have really fundamentally changed. Probably, things have changed the most since the adoption of the relational model back in the ‘80s. The big data ecosystem is really fundamentally different in many ways from traditional data management. In particular, people like to talk about scalability, because the big in big data means you have lots of data.

But again, scale-out techniques have been known for quite some time. They’re a little different now because of some different systems assumptions and so on, and maybe whereas before somebody might think that a thousand-node system was a big system, now it’s easy to talk about 10,000 nodes.

“Compared to a lot of grumpy, old database people … I believe that things have fundamentally changed and they’re not going to change back. I think we’re really at the beginning of a new era.”

To me, what’s really fundamentally different about this new-generation data management isn’t just scalability; it’s really flexibility. If you look at the ability to store data first and then impose structure on it later — sometimes this is called schema on read or schema on need — that’s a complete game changer.

Because the way things used to work, if you wanted to do a data management project, is you’d say, “OK, step number 1: Figure out every piece of data that you might ever want to store in your system, what it looks like, how it’s organized, and how it’s related to all the other pieces of data that you might ever want to store in your database system. Then step 2 is get your hands on some real data. And then step 3 is try to make the real data conform to the model that you created in step 1.”

Many projects never made it that far, and back when people were first starting to do things like data warehousing, the literature was just full of horror stories where people would throw millions, even billions, of dollars into these systems and never get them to work.

In this new regime, where you store the data first and then figure out what to do with it, things have completely changed. Now you can collect all the data you can think of collecting. Yes, you have to do some extra work when you go to use it; and, yes, you might take a little bit of a performance hit because you don’t have the storage completely optimized; and, yes, there may be some consistency problems that you need to understand. But by and large, the friction of getting your data-management system put together has just decreased dramatically.
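As a rough sketch of what schema on read looks like in practice — plain Python over hypothetical log records, not any particular system — you append raw records as they arrive and only impose the structure a given analysis needs when you read:

```python
import json

# Ingest first: append raw records as JSON lines, with no schema declared up front.
raw_log = [
    '{"user": "ann", "page": "/home", "ms": 12}',
    '{"user": "bob", "page": "/search", "ms": 48, "query": "big data"}',  # extra field is fine
    '{"user": "cal", "ms": 7}',                                          # missing field is fine too
]

# Read later: project out only the fields this analysis cares about
# ("schema on read"), tolerating missing or extra fields.
def project(record, fields):
    row = json.loads(record)
    return {f: row.get(f) for f in fields}

rows = [project(r, ["user", "ms"]) for r in raw_log]
avg_ms = sum(r["ms"] for r in rows) / len(rows)
```

The extra work and performance cost mentioned above show up here: every read re-parses the raw records instead of scanning a pre-validated, optimized layout.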

“The real breakthrough [of the relational model] was the separation of the logical view you have of the data and how you want to work with it, from the physical reality of how the data is actually stored.”

If you look at elastic computing, through cloud computing, and some of the mechanisms that are in Hadoop MapReduce and then things like Spark, just the ability to add more resources and have the system gracefully absorb those resources is something that didn’t exist before. And it’s not just the ability to grow your system, but it’s the ability to expand your system as you need it and then shrink it back down when you don’t need it anymore.

Again, this completely reduces the friction. It used to be that you would have to build your datacenter or your system for the biggest problem you would ever imagine that you’d have to solve, and now you don’t have to do that anymore. Now you can build your system for what you think you’re going to need, and then you can surge with cloud resources when you need to do that, or you can just do the whole thing in the cloud in the first place.

That has changed things pretty fundamentally.

Then there’s this ability to move smoothly between languages like SQL for querying, languages like R for doing statistical processing, and graph processing — the things that you can do easily in Spark. That’s completely different, so you no longer have to commit to a single paradigm for working with your data. You can store the data in the system and then you can do the things that make sense with your graph system using that, the things that make sense with relational query processing using that, the things that make sense for statistical processing using that. And you can mix and match them.
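A minimal illustration of that mix-and-match idea, using only the Python standard library (sqlite3 for the relational view, the statistics module for the statistical view, a plain adjacency map for the graph view) over one made-up set of page-view events — a sketch of the concept, not how Spark itself does it:

```python
import sqlite3
import statistics

# One dataset, stored once: (source page, destination page, dwell time in ms).
events = [("home", "search", 120), ("search", "results", 300),
          ("home", "results", 90), ("results", "checkout", 45)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE views (src TEXT, dst TEXT, ms INTEGER)")
con.executemany("INSERT INTO views VALUES (?, ?, ?)", events)

# Relational paradigm: declarative aggregation over the rows.
top = con.execute(
    "SELECT dst, COUNT(*) AS c FROM views GROUP BY dst ORDER BY c DESC, dst"
).fetchone()

# Statistical paradigm: summary statistics over the same records.
median_ms = statistics.median(ms for _, _, ms in events)

# Graph paradigm: treat the same records as edges and walk them.
adjacency = {}
for src, dst, _ in events:
    adjacency.setdefault(src, []).append(dst)
reachable_from_home = set(adjacency.get("home", []))
```

The point is that nothing about the stored events commits you to one of the three paradigms; each view is imposed at processing time.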

So compared to a lot of grumpy, old database people you might talk to, I believe that things have fundamentally changed and they’re not going to change back. I think we’re really at the beginning of a new era. Sure, just like the beginning of the relational revolution, there’s a lot of work to do to make systems more robust, there’s a lot of work you have to do to make systems more performant, there’s a lot of work you have to do to make systems easier to use. But we’re just at the start of that journey.

“Even as Hadoop was getting more popular … many of my colleagues and I were just waiting until people realized that writing MapReduce programs directly is a real pain and that there were languages, in particular SQL, that had been designed to solve many of these problems.”

You mentioned SQL. Did you think, as you watched Hadoop and Spark get popular, that SQL would be the focus of so much attention on those systems?

I think I can say without fibbing too much that, yes, even as Hadoop was getting more popular and people were getting more excited about it, many of my colleagues and I were just waiting until people realized that writing MapReduce programs directly is a real pain and that there were languages, in particular SQL, that had been designed to solve many of these problems. I was pretty sure SQL was going to play a big role in these systems.
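To see the pain he is pointing at, compare a hand-rolled MapReduce-style word count — where you write the map, shuffle, and reduce phases yourself — with the same computation as one declarative SQL statement. This is an illustrative single-process sketch in Python (with sqlite3 standing in for a SQL engine), not real distributed Hadoop code:

```python
import sqlite3
from collections import defaultdict

docs = ["big data big ideas", "data systems", "big systems"]

# MapReduce style: map each document to (word, 1) pairs...
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# ...shuffle the pairs by key...
shuffled = defaultdict(list)
for doc in docs:
    for key, value in map_phase(doc):
        shuffled[key].append(value)

# ...then reduce each group to a count.
mr_counts = {key: sum(values) for key, values in shuffled.items()}

# The same computation, expressed declaratively in SQL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE words (word TEXT)")
con.executemany("INSERT INTO words VALUES (?)",
                [(w,) for doc in docs for w in doc.split()])
sql_counts = dict(con.execute("SELECT word, COUNT(*) FROM words GROUP BY word"))
```

The SQL version says *what* to compute and leaves the *how* — grouping, aggregation, parallelization — to the engine, which is exactly why SQL layers grew on top of Hadoop and Spark.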

I guess maybe you could see it coming as far back as Hive.

You don’t even have to go to Hive. This is exactly why database systems caught on in many ways. It’s because it’s just too hard to write that stuff directly. Furthermore, you don’t want to, because the thing that a lot of people don’t realize about the relational model and languages like SQL is that the real breakthrough there wasn’t the language. The language is just sort of an artifact.

The real breakthrough was the separation of the logical view you have of the data and how you want to work with it, from the physical reality of how the data is actually stored. Built into the relational model is that vision; it’s called data independence. What that lets you do is change the layout of your data, the organization of your data, the systems that you’re using and the machines that you’re using without having to rewrite your applications every time you change something.

Likewise, it lets you write the application in a way that you’re not really too concerned about how the data is organized at any particular minute. That flexibility is absolutely vital for data-oriented systems because once you collect data, you tend to keep it. Applications that you write tend not to go away. You need that ability to evolve the physical layout of the data, and you need that ability to protect developers — even though they may not want to be protected — from those sorts of changes.
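Data independence can be shown in a few lines: the application’s query text never changes while the physical organization underneath it does. A small sketch using Python’s built-in sqlite3 with made-up data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("ann", 10.0), ("bob", 5.0), ("ann", 2.5)])

# The application's logical view of the data: one declarative query.
QUERY = "SELECT SUM(amount) FROM orders WHERE customer = ?"
before = con.execute(QUERY, ("ann",)).fetchone()[0]

# Change the physical organization (add an index). The engine may now
# use an index lookup instead of a full scan, but the query is untouched.
con.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
after = con.execute(QUERY, ("ann",)).fetchone()[0]
```

The same answer comes back before and after the physical change; the developer is protected from it, whether they wanted to be or not.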

Anybody that worked with database systems for any amount of time could see this happening, because Hadoop was basically breaking all those rules, and that was a lesson that had been learned decades earlier.

Sample SQL code. Source: Wikipedia Commons

I’m leaning toward saying the NoSQL movement was overblown, but you’ve made me think. Maybe the core capabilities, such as schema on need, were more important than the name itself?

I think what the NoSQL movement showed was that there’s a valuable set of applications that didn’t require the guarantees that traditional database systems were trying so hard to preserve: consistency, concurrency control, recovery, these sorts of things. You could never lose a piece of data. You could never have a piece of data in your database that didn’t conform to the schema. All these things that were put there to protect the database, really, from the programmer.

There were a lot of applications where those things really weren’t needed, and the cost that you paid for those guarantees in terms of performance, in terms of scalability and in terms of ease of use was just too high. That’s really what the NoSQL movement showed, that there are valuable — not just interesting but also financially valuable — applications for which you don’t need those guarantees. By getting rid of those guarantees, you can build a much more flexible and easier-to-use system.

Frankly, had it not been for the NoSQL movement, I don’t believe that traditional database vendors would have headed that way.

What does the future look like for traditional database vendors? There’s still so much business in those software licenses.

I think there are a bunch of applications for which the traditional systems have shown that they’re a very good solution. Certainly, anything that requires it to be the system of record, anything where you can’t afford to lose any data, where you can’t afford to make a decision based on information that is corrupted in some way. Those problems aren’t going to go away. I think the traditional vendors have a role to play there.

In terms of this other world, analytics and all that, it’s going to be a little tough for them. Partially this is because this world has gone open source and so their traditional business model won’t work and they’re going to have to rethink what that business model’s going to be. Then, part of it is just the mindset shift that you have to go through, and it will be interesting to see if those traditional, legacy companies are able to bring in enough talent with a new mindset to be able to compete in these new areas.

“I feel that there were certain important areas, like big data analytics and scientific computing, where open source is absolutely going to be the way things go because you can just get so many people working on a problem. But I’m not sure it’s for everything.”

Databases have traditionally been proprietary technology, but is open source the only way to build a technology going forward? And if so, how do you build a viable business around that?

In terms of “Is this the future for all or most data-oriented software?”, I think it remains to be seen, honestly.

One of the great things about the open source world is how quickly things change and how quickly you can evolve and bring in new functionality. That also leads to some challenges for people that need some amount of stability. Certainly, big data analytics turned out to be a sweet spot for open source development and open source systems, partially because the people who were using the systems were technically savvy and confident enough to deal with some of these issues of having to make sure you have the right versions of everything, and of having to change parts of your system when one of the underlying pieces changed.

I don’t know if that’s going to carry over to other areas of data management. I feel that there were certain important areas, like big data analytics and scientific computing, where open source is absolutely going to be the way things go because you can just get so many people working on a problem. But I’m not sure it’s for everything.

Can you think of many databases that caught on in the past decade or so that haven’t been open source? I can think of maybe one.

Momentum is certainly going toward open source, but the answer depends on the business model question. That could have a huge impact on whether the open source model ends up taking over because, ultimately, people are going to have to get paid for standing behind these systems and for making sure that they are robust, that they are secure and that they are doing what they’re supposed to do.

I think there’s lots of interesting business models out there that people are doing for open source, and I think there’ll be a lot of innovation in business models as well as in software.

“As a researcher and as an academic, I’m just thrilled that people in the real world are willing to try out and use and adopt and deploy open source software because that gives us direct influence.”

The biggest challenge might be converting users of good free products into paying customers.

I think that’s going to be a big challenge for these companies. … Although I would like to say as an academic, I love the fact that open source is taking over the world.

I’ll go back into grumpy-old-database-guy mode: The way I used to have to do my work, and all the people in my field used to have to do the work, is we would come up with a new algorithm, a new join method, a new index, a new whatever, and we would do some prototyping to show it was a good idea. Then, we would make the rounds and we would go to Oracle and IBM and Microsoft and these companies, and we’d tell people about this great thing that we had come up with. Those people would either ignore it or put it into their product, and sometimes you didn’t even know.

But we were always one big step removed from actual users. Open source has just completely removed that barrier. Now, a student in our lab has a good idea, they code it up, it looks like it works, they put a little more work into it to make it so that other people can understand what it’s doing and use it, and they put it out on GitHub and all of a sudden, it’s out in the real world.

We actually had a case recently where Evan Sparks, one of the students in the lab, gave a talk at our research retreat where we meet with the sponsors. During the retreat, a lot of students were getting up and saying, “Here’s the piece of the stack that I’m working on. Here’s what it does.” At the end, they would say, “Oh, there will be an alpha release of this at the end of the month, or we did an alpha release of this last week, or whatever.”

Evan gets up and he talks about Keystone ML, our machine learning pipeline system. He says, “And there’ll be an alpha release of the system right now.” He brings up, in front of 200 people, his GitHub page, and flips the switch from private repository to public. This is all about removing friction from having a really good idea and showing that it works and building an artifact that shows that it works, to actually having people get to use it and try it out and maybe adopt it. That friction is just gone.

As a researcher and as an academic, I’m just thrilled that people in the real world are willing to try out and use and adopt and deploy open source software because that gives us direct influence. And you’re seeing that with the impact that AMPLab is having.