Hadoop creator Doug Cutting on evolving and succeeding in open source

By Derrick Harris
Published in S C A L E · 13 min read · Dec 2, 2015


Doug Cutting helped create Apache Hadoop nearly a decade ago while working at Yahoo, where he helped the project become an open source juggernaut and the figurehead of the big data movement. In 2009, Cutting joined Cloudera as chief architect and continues to evangelize Hadoop, help lead the Apache project, and make the technology consumable by mainstream enterprises.

Before Hadoop, Cutting worked primarily in web search, developing the open source Lucene search engine, as well as a Hadoop precursor called Nutch.

In this interview, which has been edited for clarity, Cutting shares his opinion on the continuing evolution of Hadoop, the rise of Apache Spark, and the challenges facing open source projects and the companies that want to monetize them.

SCALE: When people say “Hadoop” today, what does that mean? I think it means a lot more than it probably did a few years ago.

DOUG CUTTING: Definitely. It started out meaning this one project, but pretty rapidly, as other projects started up that depended on Hadoop, people started referring to the whole collection as Hadoop.

It has come to mean this movement of folks onto a platform based around open source. In the long term it might evolve completely away from Hadoop as a technology, but we’ll still refer to the ecosystem as “Hadoop.” I think it’s more the data-processing style that Hadoop established that people now refer to.

You work at Cloudera, which now markets the idea of a “data hub” or platform rather than a distribution of Hadoop. Does this have something to do with this evolution of the technology?

Calling something a distribution of Hadoop was always a problematic phrase. With all these different interests at play, Apache has tried to, and ought to, maintain its trademark.

And Cloudera needs to develop its own product identity. So at some point we actually switched to CDH and removed all association of the word Hadoop with CDH. Most people think of it as meaning Cloudera’s Distribution of Hadoop, but it doesn’t say that anywhere.

Cloudera would like to move people toward thinking about our Enterprise Data Hub as the technology we’re selling. But it’s hard to say whether that’s what the world will end up doing. Language is a complicated social phenomenon. The legal and business interests are only one factor at play.

The Hadoop ecosystem, as presented in Cloudera’s software. Credit: Cloudera

Regardless, it’s handy to have a name for whatever this thing is and whatever it becomes. I think there is something real here — worthy of a name, of an identity — in that folks really are switching to using a suite of open source tools. And that suite of tools is changing over time in response to user needs. It’s not only technically different in that we’ve got distributed computation running on commodity hardware, but also politically different in that the center of control is no longer just a couple of vendors.

The center of control is whatever users find most useful. We see things like Kafka and Spark come to the fore, really outside of any central control because users have found them fabulously useful. Now they’re core elements of this stack. I think we’re going to see more of that.

“I think the name will be around for a long time, and that we’ll keep calling the ecosystem ‘Hadoop’ after Hadoop is no longer the core component of it.”

Is there a point where an open source community — Hadoop or otherwise — evolves so much that it outgrows the initial name, or loses any discernible connection to the original technology?

If there’s a thing there, then it deserves a name. Even if the thing changes, we can tolerate the notion of the name surviving as long as it’s the same thing that has evolved. I think the name will be around for a long time, and that we’ll keep calling the ecosystem “Hadoop” after Hadoop is no longer the core component of it.

That seems likely to happen. HDFS and YARN are still central, but it doesn’t seem unlikely that someone will come up with a better storage system or scheduler that is compatible enough to replace those two things.

“[M]ore often we’re better off to follow folks like UC-Berkeley, who developed Spark, and LinkedIn, who developed Kafka.”

You mentioned Spark and Kafka as two projects that have risen to the fore in the big data space. What else do you look at as technologies that might step in and replace things in the Hadoop stack?

Cloudera just launched Kudu, which is another storage layer — one of the first substantial innovations in the storage area in a long time for Hadoop. It’s really optimized for the class of applications folks want to build with this ecosystem of technologies. That’s a neat one to see take off.

If I knew what the next big thing was going to be, I’d be doing it. I’d rather think an appropriate role for Cloudera is to follow rather than lead, to a large degree. Now and then, we can make a bet like Kudu, but more often we’re better off to follow folks like UC-Berkeley, who developed Spark, and LinkedIn, who developed Kafka.

For each of those, there were 10 other projects that haven’t really captured the imagination of the community to the same degree. We didn’t have to invent all of those and go through that cost. There’s a lot of risk in that.

“In any one area, whoever gets there first with something that’s high-enough quality and is substantially different than other things can take over.”

How to stand out in open source

Do you have a sense of what makes one technology stand out against another in a crowded space?

I think it’s a combination of things. You have to be sufficiently different than other things, and people need to find advantage and value in using it. You need sufficiently high-quality technology that really is usable. Oftentimes, there’s also a substantial institution behind a project to get it to that point.

In any one area, whoever gets there first with something that’s high-enough quality and is substantially different than other things can take over. Then, they’ve got to follow through with having people who are capable of building a productive community to keep it moving and keep it ahead of competitors.

Spark is a great example of something that attracted a lot of contributors and kept them, so it’s been growing. It hasn’t alienated people, and that has been a real key to its success.

The so-called Big Data Analytics Stack. Credit: University of California, Berkeley

Spark has some competitors, but they all seem to be a little behind with level of quality, level of completeness and level of contribution. Maybe they’ll catch up. Some of them claim to have better architectures, but I’m a little skeptical that they’ve got enough of a fundamental improvement that they can surpass all the other things.

We can see a similar story going back to Hadoop. There could have been other things out there, but we were early enough, and with help from Yahoo we got it to be good enough quality that it was out there and people figured they didn’t need to recreate it. They could just pile on and use it.

“It has been amazing to me the degree to which people really follow what’s going on in open source and experiment with the bleeding-edge stuff at pretty stodgy institutions.”

Has the pace of open source picked up markedly in the past several years? Spark is by no means a mainstream technology, generally speaking, and yet we’re already talking about “next-generation” competitors.

You’re right. It does move quickly, and there are people who are very speculative. But I think we’ve seen enough interest in Spark that it’s been anointed — not just by Cloudera, but by IBM and lots of folks.

People in industry are really, seriously investing in Spark. It has been amazing to me the degree to which people really follow what’s going on in open source and experiment with the bleeding-edge stuff at pretty stodgy institutions. Even in banks, the people in the tech departments are really on top of all this. They follow all the open source politics and they follow all the latest technologies, and they’re experimenting with them.

Something I learned many years ago in Lucene was that we had a mailing list and there was a community of people I knew who were involved in using Lucene. Yet, I’d go to a conference and I’d run into hundreds of people I’d never heard of who said they built their applications around Lucene. It surprised me a lot at first: it was at least a 10-to-1, and maybe even 100-to-1, ratio of people actually using Lucene to people I knew were using it.

“[T]here’s a big community out there of people who aren’t talking about [Spark], who’ve already settled on it as their next-generation technology.”

Think about your own software use: You download an app and use it, and you don’t contact the developers. You don’t add a review. You just use it because it works.

With open source, we don’t even have download stats, really. Cloudera has a little bit of a handle on that, but Apache doesn’t, really. And people, of course, can copy things around and get them through different channels. I think there’s a lot more use out there than we’re aware of.

So with Spark, it’s at a young phase and there are only so many public endorsements of people using it in production and so on. But there’s a big community out there of people who aren’t talking about it, who’ve already settled on it as their next-generation technology. I run into this everywhere.

You run into it around the world. I was recently in Japan and in Budapest two weeks before that. Everybody I spoke to was like, “Yeah, we’re really gearing up with Spark. We know that’s where things are going and we’re on with it.” Maybe that’s just because of the people I run into at the conferences I attend, but these conferences aren’t tiny anymore. It’s a pretty substantial swathe of IT folks.

A packed hall at the recent Spark Summit Europe. Credit: Databricks

Even if competitive projects can claim architectural advantages, isn’t the sheer momentum from large user communities — and large committed users — too much to overcome at some point?

It’s a lot harder to supplant something when you’ve got people who already know the technology. You have to be a substantial leap ahead of that in order to replace it. Things are going to continue to change.

I see a lot of companies out there that are founded around single technologies, claiming “We’re the X company.” I worry about those. I think there’s going to continue to be turnover, and the lifespan of a lot of these technologies is going to be short. Five years is going to be really the time, their golden age, before something might come along and replace them.

You hope a company’s going to last longer than five years. You hope that people are not all forming companies that are going to be acquisition targets. I’m really pleased that Cloudera has found a way around becoming just “the Hadoop company,” but rather (and pardon the buzzwords) saying, “We’re the open-source big-data ecosystem platform company, or the Enterprise Data Hub company.” I think that’s a smarter play.

“I see a lot of companies out there that are founded around single technologies, claiming ‘We’re the X company.’ I worry about those. I think … the lifespan of a lot of these technologies is going to be short.”

When I talk to customers, it becomes clear that users — not vendors — are the people who choose what they’ll use. We have this stack of tools that we support. We encourage people, “You only want to use supported tools.” We’re the vendor. We can tell you what we support and you ought to just pick from our menu. I’d say nearly all of our customers, it feels like, use one or two things we don’t support.

They all have a slightly different mix of things they pick. Maybe 75 or 80 percent of their stack overlaps with the one that we support, and they pick a few other things that they decided, “For us, this is really the best tool.” They’re not just picking the one thing and saying, “This is it for us.”

And if they’re picking a stack of six components, I don’t think they want to work with six different vendors. They’ll work with one, maybe, and they’ll take a risk on a couple other things and not have support for them. That seems to be the pattern I’m seeing. So for businesses, I think adopting a stack is a better story, and then you can evolve that stack.

“It’s harder to make a fortune in open source. The more you pick a niche within that, the harder the road you chart for yourself.”

With a very focused business model, it seems like you also have to give yourself a lot of runway. Even a few million dollars in funding doesn’t necessarily buy you a lot of time.

The size of the market and the margins are not what they were in the heyday of Oracle and Microsoft. One of the attractions of open source is they’re less expensive technologies and that you’re not necessarily dependent on the vendor. It’s harder to make a fortune in open source. The more you pick a niche within that, the harder the road you chart for yourself.

The business of big data

Do you concern yourself much with the business side of the open source and big data spaces — revenues, margins, IPOs, stock prices and the like?

As a part of this industry, it’s a pretty critical component. We’re certainly venture-fueled, as much of it is. I’m starting to work more with Mike Olson in Cloudera’s strategy office figuring out what our long-term directions are. Some of that’s technical and some of that’s business. Understanding how the market works is pretty important. Also, understanding how Cloudera’s going to be able to keep its lights on.

At the same time, I obviously care a lot about this open source ecosystem and maintaining its vitality, and I think about how Cloudera and the open source community can play together and reinforce one another.

“I think Google is still operating at a slightly larger scale than most folks, with a slightly more evolved stack, but less and less so. Whether the open source ecosystem will ever decisively pass Google, I don’t know.”

You used to mention Google as the company that people should look to if they’re trying to predict the next big open source projects, especially in the data space. Is that still the case, or has the torch been passed to LinkedIn or Facebook or any other companies?

I think Google is still operating at a slightly larger scale than most folks, with a slightly more evolved stack, but less and less so. Whether the open source ecosystem will ever decisively pass Google, I don’t know.

On the other hand, the more interesting question might be how wise it is for big web companies to keep going it on their own and inventing new bits. To some degree, they’ve been forced to because there aren’t open source tools yet in different areas that really scale and really address their problems. But it’s becoming less and less the case that web companies need things different than retailers and other industries.

The trend I’m expecting to see is more that the big web companies will start sharing the same technologies as everybody else. This means that we’ll start seeing things popping up from lots of different places, not just the web companies. And instead of always independently managing their stacks, web companies will start to work with vendors more. I think they’ll start to look more like banks and whatnot in their use of technology as the stack matures.

Doug Cutting presenting at an event in 2015. Credit: Flickr user techmsg

What is the next “web” then, in terms of being that edge case industry that develops new tech? Is it the Internet of Things, or Uber and the on-demand economy?

I think this stuff is going mainstream. I think Marc Andreessen’s “software is eating the world” is really happening. We really see users across the industrial spectrum. You’ve got agriculture, railway, manufacturing — all these guys are deploying big systems.

IoT is a symptom of that. IoT is, itself, a cross-cutting industry. Is Tesla an IoT company? Or Union Pacific or Caterpillar? I don’t see any one industry being the biggest driver any longer. I think we’ll see it spread out and we’ll see the “IT-ification” of industries, in general. Probably there’s going to be some that advance more so than others, but it’s hard to predict which ones.

Telecom, for example, is a big area in which we see a lot of customers. They’ve seen huge growth in mobile, so they’ve got the things out there that are generating the data, phones covered with sensors, and they’re able to exploit them. They’ve got more online things in the world than just about anybody else at present.

“I also don’t want to get high on my own fumes here. … I live out in the country, where nobody knows anything about this stuff, which is really nice.”

I would argue that Hadoop helped move open source into the public eye, to a large degree. Do you take any personal pride in having helped push the world into this open source model?

I definitely am very pleased. On the other hand, I don’t want to take too much credit. In a lot of ways I was lucky to be a person who had the right experience to both see and act on the opportunity at the time. I don’t deserve credit for that too much.

I see myself as being a symbol, in a lot of ways. I go around for Cloudera, and people want to have someone who embodies this movement a little, to whom they can ask questions about it. I try to be that person for them. It has been a tremendous success, by every measure, much more than I ever would have imagined or expected.

But I also don’t want to get high on my own fumes here. I try to keep it in perspective. I go to conferences and everybody’s like, “Whoa, you invented Hadoop!” However, I live out in the country, where nobody knows anything about this stuff, which is really nice.
