From scaling LinkedIn to selling a nervous system for enterprise data

Derrick Harris
Published in SCALE

14 min read · May 12, 2015

In the first of a series of in-depth interviews we’re planning at SCALE, I spoke with big data expert Jay Kreps. Kreps is most often associated with a real-time data pipeline technology called Kafka, which he created at LinkedIn and which is now the foundation of his new startup, Confluent.

Below is an edited transcript of our discussion, which covers everything from Kreps’ tenure at LinkedIn — where he helped build some of the company’s core data infrastructure — to the reasons why open source software has become such a big, but limited business.

Taking LinkedIn from an Oracle shop to a big data leader

SCALE: You’re best known for co-creating Kafka at LinkedIn, but what was your history there before Kafka?

Jay Kreps: I joined LinkedIn with the intention of being more of a user of data infrastructure, working on recommendation algorithms and that kind of thing. That’s what I had done previously; I had more of a machine learning background.

I did actually work on a bunch of things like that at LinkedIn, but my observation was that a lot of our problems were less about creating fancy algorithms and more just basic data and scaling issues. That was what held us back on almost everything we did.

The first project I did there in the infrastructure space was a key-value store called Voldemort, one of the very early NoSQL systems. It was a distributed key-value store and basically a clone of Amazon Dynamo. I had read that paper, and we did an implementation of it and put it into production. It’s actually still used at LinkedIn at very large scale — I think probably getting north of a million requests per second.
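The Dynamo lineage Kreps mentions shows up most clearly in how such a store partitions and replicates keys across machines. As a rough illustration of the idea (not Voldemort’s actual code), a Dynamo-style store hashes each key onto a ring and assigns it to the next N distinct nodes clockwise:

```python
import hashlib
from bisect import bisect

class HashRing:
    """Toy Dynamo-style consistent-hash ring.

    Keys map to the next `replicas` distinct nodes clockwise on the ring.
    Illustrative only: real systems add virtual nodes, versioning,
    hinted handoff, and so on.
    """

    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        # Place each node at a deterministic position on the ring.
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def nodes_for(self, key):
        """Return the `replicas` distinct nodes responsible for `key`."""
        positions = [p for p, _ in self.ring]
        i = bisect(positions, self._hash(key)) % len(self.ring)
        owners = []
        while len(owners) < self.replicas:
            node = self.ring[i % len(self.ring)][1]
            if node not in owners:
                owners.append(node)
            i += 1
        return owners

ring = HashRing(["node-a", "node-b", "node-c"], replicas=2)
print(ring.nodes_for("member:42"))  # two distinct nodes own this key
```

The appeal of the scheme is that adding or removing a node only moves the keys adjacent to it on the ring, rather than reshuffling everything.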

We open sourced it, and that was really fun, and after that I thought, “Hey, this actually is a much bigger impact to work on these data systems that, in some sense, make every problem better.” So I was kind of hooked on it.

“Over time the systems that we had that did that kind of data stream transport were just melting under scale and growth, so it became clear we had to do something.”

I really liked the open source aspect of it, as well. That was the first time I had had any involvement in open source, and when we did Voldemort there were a bunch of different people contributing code. I got to work with these really smart people in different parts of the world, and that was really fun.

The next thing I did was the Hadoop deployment at LinkedIn. The goal was this data lake idea where you get a copy of everything happening in the organization into Hadoop. We thought that would be a really easy thing to do, so we budgeted 3 weeks to get data in, and then a couple months to rebuild the People You May Know feature.

And we said, “Well, the hard thing will be this cool recommendation algorithm, but the first thing we’ll have to do is just get the data.” It turned out to be a bit of the reverse. We did that, and we did a couple of other projects, and we were just struggling to build a pipeline of data for Hadoop.

Kreps (left) and some former LinkedIn colleagues in 2013. Credit: Derrick Harris

Because of that, I got interested in how data was flowing in the rest of the organization, and my observation was we had this problem everywhere. We wanted to make use of the same data in real time for security and fraud detection, and for more-real-time recommendations so we don’t show you the same stuff every time. And the same stuff we wanted to get into Hadoop, we also wanted to get into the data warehouse and we wanted to get into search indexes.

We had all these different systems, some we wanted to respond and act in real time while some were offline data dumps like Hadoop, and that was how we came up with the idea of Kafka. We pitched it internally and people were like, “Well, that sounds like a lot of work.” But over time the systems that we had that did that kind of data stream transport were just melting under scale and growth, so it became clear we had to do something.

“You always have the advantage of youthful arrogance. With Kafka, we said it would take about 3 months, and then we worked on it for the next 5 years.”

When did you join LinkedIn? I recall you telling me previously that you joined when People You May Know was still an Oracle application.

That’s right. I joined in 2007, and we got to do a bunch of really fun work to scale the whole site in different parts, which was really challenging. It was fun, too. One of the things that came out of really getting all the data together was the ability to do much more sophisticated stuff with it, to use more predictive ingredients on it.

Of course, the rest of the company took that much further after I was not involved in that area much anymore. They added a whole team of data scientists doing this stuff that ended up being much better than I ever was.

How difficult was it in those early days to build new tech from scratch and deploy systems like Hadoop at scale?

You always have the advantage of youthful arrogance. With Kafka, we said it would take about 3 months, and then we worked on it for the next 5 years. We did at least ship it after 4 or 5 months, so we got something done in that time period.

Kafka’s a fun one because it’s one of the core problems in distributed systems, which is producing an ordered log or stream of data that’s fault-tolerant and replicated over machines. This is something that exists in the world, it’s a common algorithm or thing that you would learn about if you took a distributed systems class, but there was this opportunity to do it at a really massive scale. Now, I think, LinkedIn has about 800 billion requests going through these logs.
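The abstraction Kreps describes is simple at its core: an append-only sequence of records where every append gets a monotonically increasing offset, and each consumer tracks its own read position independently. A minimal in-memory sketch of that idea (the real Kafka partitions, replicates, and persists this across machines):

```python
class Log:
    """Append-only log: a toy version of the core abstraction behind Kafka.

    Each append returns a monotonically increasing offset; consumers
    read from any offset independently, at their own pace.
    """

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def read(self, offset, max_records=10):
        """Return up to `max_records` records starting at `offset`."""
        return self.records[offset:offset + max_records]

log = Log()
for event in ["page_view", "connection_request", "profile_edit"]:
    log.append(event)

# Two independent consumers of the same stream: a batch one (e.g. Hadoop)
# replaying from the start, and a real-time one reading only new events.
print(log.read(0))  # → ['page_view', 'connection_request', 'profile_edit']
print(log.read(2))  # → ['profile_edit']
```

Because consumers own their offsets, the same log can feed the offline data dumps and the real-time systems Kreps describes without the producer knowing about either.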

Some things were definitely kind of risky projects, but they paid off in the end.

A diagram of where Kafka fits into the data architecture. Source: Confluent

Freeing Kafka from its glass ceiling in Silicon Valley

So what does the user base of Kafka look like today? It’s much more than LinkedIn at this point.

We released it really to very little fanfare, but we talked to some other Silicon Valley companies, and we quickly learned that seemingly every company there had some bad solution to this problem. It was interesting to see that so many companies had built basically similar things all around the same time — something happened in the world and everybody basically developed the same idea of how they’re going to handle this log or event data.

We ended up with a pretty good user base in Silicon Valley, but that was it — nobody else in the rest of the world had ever heard of this thing. Often it would just kind of spring up, usually for a couple applications, and then spread from there. Over time, that just leaked out of the valley and now you see companies in every industry — retail, finance and everything — doing these really big, aggressive projects, which is really awesome to see.

That was really what inspired us to strike out on our own. We were like, “Hey, this actually makes sense in every company, in so many industries, to have this central nervous system for data where everything can tap into it.” It solves that basic problem of “How do you get this stuff, where is it available quickly and when you need it?”

“When people say that a lot of these big data problems are really a tech company thing, I think it’s actually totally untrue.”

That’s a good point. Last year, I spoke with a mid-sized sensor company out of Alabama that decided it wanted to get into the Internet of Things and real-time connected devices, so it taught itself Kafka, Amazon Web Services and other components and built a backend.

For a long time, web companies have had this discipline of measuring the world. They’re instrumenting their product — which is all software so it’s relatively easy to instrument it — then measuring what’s happening to a fine grain and trying to improve it. But it’s not that practical, or hasn’t been that practical, outside these pure software artifacts where it’s really easy to put measurements everywhere.

What’s really exciting to me is to see that happen in the rest of the world. There’s definitely some hype around the “industrial internet” or the “Internet of Things,” but getting that level of instrumentation on things that other really large-scale businesses are doing is awesome. When you think about a big retail company or FedEx or an airline, these companies actually have massive scale and a ton of things happening.

So when people say that a lot of these big data problems are really a tech company thing, I think it’s actually totally untrue. A lot of these other companies are actually at much larger scale than some Silicon Valley companies in terms of their business, they just don’t yet record everything that’s going on.

Kreps presenting in 2009. Credit: Flickr / Russ Garrett

The future of web innovation is up the stack

Tech companies are really big into building their own technologies, but certain things — like Kafka — seem to have become de facto standards. Did you notice a point where it became acceptable to use stuff created elsewhere?

That’s exactly what happened. I think about it in terms of barriers to entry. This was true with key-value stores, for example. Everybody made a key-value store, including us at LinkedIn, but you don’t really imagine an end-state of the world where there’s like 50 of these.

The reason everybody does it is because when you have some new big change in the world, the barrier to entry is actually very low. You have this new feature — “We want to be able to scale horizontally” — and that motivates something new to come along. And of course there’s nothing there that competes, so the barrier to entry is maybe 3 months of work.

But once you’re 3 or 4 years in, it’s a pretty big investment. If you want to make something that’s as good as Kafka, it’s actually really hard. You can do it, but it’s hard.

The same is true for these data stores. Cassandra has kept developing, and if you want to make something as good and full-featured as Cassandra that solves the full set of problems in that space, it’s a pretty big effort. That’s why you see people say, “Yeah, it doesn’t make sense to try and rebuild this.”

That’s totally happening now. Netflix recently gave a talk on how they’re taking out most of their homegrown data pipeline and replacing it with Kafka, and their reasoning was basically, “Hey, there’s a bunch of stuff that’s in that system that we don’t have the time to do, and it seems to be moving forward at a faster pace.”

A slide illustrating Netflix’s future Kafka use. Source: Slideshare / wangxia5

Where will the future of innovation in companies like LinkedIn or other large web companies be focused, considering they’ve already built all this infrastructure and solved those problems?

I think the most exciting thing is actually that in software you always move up the stack until all of a sudden there’s some big tectonic shift, and then the foundation has to be rebuilt. Right now we’re rebuilding the foundations. I don’t think that’s going to go on forever, but I think there’s still a lot of work to do there.

Some interesting stuff I saw happening at LinkedIn that I’ve seen at other companies is really awesome machine learning libraries that work on top of distributed systems. LinkedIn developed a fantastic online machine learning library that adapted to data as it came in.

I’ve seen really interesting stuff done in the monitoring space. I think, actually, that’s another area where each company we talk to has some in-house setup. There are a bunch of companies tackling that commercially, but I don’t know if anybody has quite got it yet.

“The most exciting thing is actually that in software you always move up the stack until all of a sudden there’s some big tectonic shift, and then the foundation has to be rebuilt.”

Another example of that would be anomaly detection. One of the interesting things that becomes possible when you have, say, a thousand data streams about your business that you capture, is to do some kind of automatic analysis on this that alerts you when something is going wrong.
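A minimal version of that kind of automatic analysis can be sketched as a running mean and variance (Welford’s online algorithm) with a z-score alert; any production system would add per-stream models, seasonality handling, and smarter thresholds:

```python
import math

class StreamMonitor:
    """Flag values that sit far from the running mean of a stream.

    Uses Welford's online algorithm, so no history is stored:
    only a count, a mean, and a sum of squared deviations.
    """

    def __init__(self, threshold=3.0):
        self.threshold = threshold
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def observe(self, x):
        """Update running stats with `x`; return True if `x` looks anomalous."""
        if self.n >= 2:
            std = math.sqrt(self.m2 / (self.n - 1))
            anomalous = std > 0 and abs(x - self.mean) / std > self.threshold
        else:
            anomalous = False  # not enough history to judge yet
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous

# A hypothetical business metric: signups per minute, with one spike.
monitor = StreamMonitor(threshold=3.0)
signups_per_minute = [100, 103, 98, 101, 99, 102, 97, 100, 300]
alerts = [x for x in signups_per_minute if monitor.observe(x)]
print(alerts)  # → [300]
```

Run one such monitor per metric and you get exactly the alerting Kreps describes: a thousand data streams, each watched automatically for values that break their own historical pattern.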

That’s the kind of next level up the stack that I think is still infrastructure-y, but not the core foundational bits. That would be my prediction. I think that’s usually how these kinds of companies work — you see a lot of really smart people in some of these companies and they’re willing to throw resources at whatever their current point of pain is.

My expectation is that a lot of the infrastructure, as it begins to work, will become less of a pain. It will become much harder for a tech company to differentiate on a key-value store, just because there’s so many of them that even if you make the best one in the world, it’s only going to be better by 5 percent, which isn’t meaningful to the business overall.

Kreps (center) with co-founders Jun Rao (left) and Neha Narkhede (right). Source: Confluent

Launching Confluent in a sea of open source startups

You mentioned striking out on your own, which was a reference to your startup, Confluent. What does it mean to be a commercial Kafka company?

We have this vision of a central platform for stream data. However, while we built the core engine as an open source project, actually putting it in place and achieving full connectivity within an organization was a lot of work. That’s not really the type of problem you can attack purely in open source.

People were building homegrown monitoring systems for Kafka, they were building all these custom connectors. You could see where in an organization, over time, they’d have a team of 10 full-time people whose job was to do this. And each company would repeat that over and over again.

This makes sense for some early tech adopters, but if you really want to make this change in the world, you probably need to give people something that works and provides the complete picture out of the box. So we’ve started to put together all the best practices we saw from early adopters in Silicon Valley and elsewhere.

What we’re offering is basically this package that does all that stuff. We’re offering training around it, and support — similar to what you’d see at open source companies like Elasticsearch or any of the NoSQL companies — really to help companies get started in this area. We hope to add to the platform and make Kafka a really powerful core engine in every company.

“The thing that has made open source work is just that the survival characteristics of open source systems are so much better than proprietary software.”

Open source has become a big deal for a variety of reasons, but it sounds like there’s still something to be said about being able to buy a working software product off the shelf.

The way I see it is this: In most areas of business, open source doesn’t work. So you don’t see a lot of applications that are purely open source.

I think the area where it works is pure platforms, and the reason it works is because that platform is so tied in across everything you’re doing — and for a large company that’s such a large commitment — that you need some kind of open standard. You need something that is going to make that turn into a good long-term bet.

I think the thing that has made open source work is just that the survival characteristics of open source systems are so much better than proprietary software. They outlive companies that sponsor them, they outlive companies that adopt them. They just keep going on and on.

So for these big platform decisions … it would be impossible for us to pitch this crazy idea of a stream data platform to companies as a proprietary thing that they would build everything on. Nobody would have any interest in doing that. … It’s just way too risky.

This is actually the big thing with open source. A lot of the original open source companies were really purely commoditization. It’s like you have Oracle. Oracle is really good, but it’s expensive. Now you have MySQL. MySQL is not as good, but it’s cheap.

I think what we’re doing … is we’re actually creating something totally new in the world — something that’s an original thought. And if you want to get an original thought off the ground, you want to get adoption and you want to get usage for some kind of new technology, the survival and adoption characteristics of open source are just hugely powerful and you really want that on your side.

A recent Gartner Magic Quadrant for database systems illustrates Kreps’ point about proven platforms, as well as a point he makes below about the competition in the data-management space. Source: Gartner

Realistically, do you think these types of distributed systems that people like us take for granted will be commonplace building blocks for any company wanting to do anything with data?

That’s my strong belief. We were willing to quit our jobs and give up our awesome LinkedIn stock options because (A) this makes sense in every company and (B) it’s actually a pretty valuable thing. The big transition I see happening is making the use of streaming data inside companies something that actually happens for real.

It’s an interesting area because, intuitively, most of what happens inside a company is some new information comes in and the company reacts to that asynchronously. It very much fits in the problem space of stream data and stream processing, and yet very few things in companies are actually built that way. In some sense, it’s the area that is the biggest but the most underserved.

“My suspicion is we’ll have probably another 5 years of relative chaos before there’s any kind of stasis or major consolidation.”

Is there going to be one company to rule them all in this world, like Oracle is today with its breadth of products? Or will the future be a bunch of specialists like we see today?

It’s hard to say exactly what the technical landscape will look like and it’s hard to say what the company landscape will look like. There are a bunch of things changing. There’s actually pretty rapid adoption of infrastructure in the cloud, people are moving to Amazon Web Services. It’s not like that’s going to happen overnight, but that’s a big change.

There’s also a lot of innovation. I think it’s a little too early to call it in any of the areas where you see startups doing development. You see something and you see a clear market leader, and you’re like, “Oh, the Hadoop companies have it in the bag.”

You don’t know yet. This is an area where there’s a lot of change happening and no company, regardless of how much of an advantage you have today, can relax on the technology, because there’s a fair amount of development happening in each of these spaces.

In the NoSQL space, you see some bigger players emerging, but actually if you look at the state of distributed databases, they’re really not done yet. They’re missing half the features you would expect out of Oracle; they’re just distributed and easy to scale. There’s still a huge opportunity for somebody to come in there and add in the rest of it and take over a much bigger area.

My suspicion is we’ll have probably another 5 years of relative chaos before there’s any kind of stasis or major consolidation. But what do I know? We’ll see how it turns out.
