Open Adoption Software Interviews: StreamSets

StreamSets is trying to bring order to big data with data-collection and pipeline technologies built with today’s systems in mind. In a field traditionally dominated by proprietary products, StreamSets’ Data Collector has been open source from the beginning, designed to integrate with an ecosystem that includes open technologies such as Hadoop, Spark, Kafka and more.

In the latest of our Open Adoption Software (OAS) interviews, StreamSets co-founder and CEO Girish Pancha explains why the company opted for an open source approach, and how it planned from its inception to monetize a fully functional open core technology with higher-level products. Pancha also discusses the power of open source communities, which have helped StreamSets speed its roadmap significantly by contributing important integrations and giving the company insights into how data architectures are evolving.

JAKE FLOMENBERG: Tell us about StreamSets and where the company is in its OAS evolution — Project, Product or Profit.

GIRISH PANCHA: I would probably put us in a transition between product and profit. We started the company in 2014 as just the proverbial two guys and a PowerPoint. My co-founder and CTO Arvind and I had worked together at Informatica, where I had been the chief product officer. He’d also done a long stint at Cloudera, picking up the open source DNA and vibe.

When we connected, we came to the realization that a number of disruptions around big data, open source and so on could provide us with an opportunity to do something meaningful in the data middleware and data integration space. So we set out to build the industry’s first data operations platform. What StreamSets wants to do is disrupt the current model, where data engineers custom-code data movement pipelines, set them in motion, and hope and pray that they work and continue to function over time.

This industry has historically focused on the developer side of the data movement and integration problem. But in this new age of streaming data and real-time analytics, it’s critical to manage the quality of the data flows on a continuous basis. If you think about the initiatives big data is applied to now — things like “know your customer” initiatives, customer experience personalization, cybersecurity and IoT-driven projects — these are all real-time applications. They’re no longer your old-school strategic BI use cases. These apps can only work well and be trusted if the data is continuously clean and complete.

We set out to build a 100 percent Apache-licensed open source product in the middle of 2014. We had been through some beta cycles, but within 15 months we released a commercially viable version of that product and were ready to accelerate from project to product.

That particular product and open source project, StreamSets Data Collector, has done very well for us; we’ve had over a quarter million downloads. Given that it’s a product meant for a very targeted data engineering audience, we’re very excited about that. But we’re even more excited that we’ve identified nearly 1,000 enterprises that have downloaded or are using the product, including nearly a quarter of the Fortune 500. Not bad for less than two years in market.

More recently we’ve also released important add-on proprietary products that multiply the value for our customers and promise even greater commercial success for us.

StreamSets has been open source since the outset, but did you ever consider building a proprietary company?

Actually, we never did. We had this vision of building an operations product — what we call dataflow performance management, or DPM. Think of it as the equivalent of what New Relic and AppDynamics have done for app operations. So we didn’t have any worries about being able to monetize effectively.

Getting ubiquity for our data-collection technology was all-important, so we thought from the beginning, “How do we get the word out? How do we get quickly to this huge market opportunity?” So when we launched the company, we announced that it was available as a 100 percent Apache-licensed product. I think companies that start from scratch have a harder time coming to that conclusion. They start off with proprietary IP and then slowly get dragged into providing open source components, but we embraced open source fully from the get-go.

One key reason is that we have the OAS DNA in the company. My co-founder is an ASF member, which is a big deal because it’s by nomination only. A number of our other early employees are well versed in ASF projects and technologies. And for me as CEO, technology-hardening was all-important, because there are a lot of moving parts.

In my previous career, I spent quite a bit — probably a third of all my R&D budget — on testing and validation of my technology, and I really didn’t have insight into it. It would come to me as a set of reports about the quality of the product. Now, with the model we’ve adopted, I’m in the Slack channel, I’m in the Google Groups. I get to see firsthand where our users are being successful and where we may have deficiencies in the product. This type of visibility is a major strong point for open source.

Why did you decide to make the project Apache-licensed but to keep it to yourself, as opposed to putting it in a foundation out of the gate?

Most people would say that by having it as an Apache project, you can accelerate the product roadmap because you have a large community. I think that works well if it is an operating system, a database, or something that’s somewhat standalone.

But in this case, the nuances of a data middleware solution are not really well understood except by a select few. This is a small community of developers who over the years have ended up at a small number of companies — Informatica, IBM, maybe some folks who ended up at SAP and Oracle. It’s really a very small community, and even most of them are not big-data savvy.

We believed that by providing product management guidance in this process, we could actually get a better, more functional product than if we had to work with a number of different vendors. At this point in time, we are erring on the side of keeping the project going in a straight line; we know where we need to take it.

But I think once it hits a critical stage, a point where it reaches MVP for a large set of enterprises, a foundation would absolutely make sense, and we’d be open to other vendors taking it and doing what they want with it.

How do you decide which features belong in open source and which don’t?

An important principle for us has been that there is only one fully featured version of the open source StreamSets Data Collector. It is important that the community knows we’re not going to hold anything back. Plus, for our users, this is core data movement infrastructure that replaces custom coding or even legacy products. It must be rock-solid and enterprise-grade. We focused on delivering an order-of-magnitude improvement in the developer and data-engineering experience — one that has a nice user interface, is secure, and is elegant enough architecturally to integrate with enterprise tools, DevOps tools and scale-out technologies such as YARN, Mesos and Kubernetes.

On top of that, we have delivered a semi-independent or autonomous product, StreamSets Dataflow Performance Manager, an “Integration Platform as a Service,” or iPaaS, offering that manages and monitors data movement across complex enterprise environments involving dozens or hundreds of data collectors.

What role, if any, does community play in the development of, or around integrations with, your open source product?

Our open source project is somewhat unique in that, as I mentioned earlier, it’s a data middleware technology. We help bring control and visibility to data in motion across the enterprise data sprawl, which includes different databases, data stores, storage and compute engines, and the cloud. This is a very large and unbounded problem, and one that’s continuously prone to change. Connectivity to all systems in an enterprise, both legacy and modern, is critical.

The community has been instrumental in helping expand our reach across this space. As an example, we have excellent Hadoop and RDBMS DNA within the company, but the community has contributed integrations with a number of other technologies that were not initial areas of focus. This includes things like Elastic, MongoDB, Cassandra, Couchbase and Neo4j.
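For readers curious what contributing an integration like these involves: Data Collector stages are plain Java classes written against the project’s public stage API. The sketch below is illustrative only, loosely modeled on StreamSets’ published custom-stage tutorials; the package, class and configuration names are hypothetical, and a real connector would replace the placeholder delivery logic with a client for the target system.

```java
// Illustrative sketch of a custom Data Collector destination ("target")
// stage, loosely following the public Java stage API used in StreamSets'
// custom-stage tutorials. Package, class and config names are hypothetical.
package com.example.stage.destination;

import java.util.Iterator;
import java.util.List;

import com.streamsets.pipeline.api.Batch;
import com.streamsets.pipeline.api.ConfigDef;
import com.streamsets.pipeline.api.Record;
import com.streamsets.pipeline.api.StageDef;
import com.streamsets.pipeline.api.StageException;
import com.streamsets.pipeline.api.base.BaseTarget;

@StageDef(version = 1, label = "Example Destination", icon = "default.png",
    onlineHelpRefUrl = "")
public class ExampleTarget extends BaseTarget {

  // A single connection setting surfaced in the pipeline UI (hypothetical).
  @ConfigDef(
      required = true,
      type = ConfigDef.Type.STRING,
      defaultValue = "http://localhost:8080",
      label = "Endpoint URL"
  )
  public String endpointUrl;

  @Override
  protected List<ConfigIssue> init() {
    // Validate config and open connections here; returning a non-empty
    // list of issues prevents the pipeline from starting.
    return super.init();
  }

  @Override
  public void write(Batch batch) throws StageException {
    // Data Collector hands records to a destination one batch at a time.
    Iterator<Record> records = batch.getRecords();
    while (records.hasNext()) {
      Record record = records.next();
      // Placeholder: a real connector would hand the record to the target
      // system's client library here.
      System.out.println(record.get());
    }
  }

  @Override
  public void destroy() {
    // Release any resources acquired in init().
    super.destroy();
  }
}
```

Packaged as a stage library, a class along these lines appears in the pipeline canvas alongside the built-in connectors, which is roughly how third-party contributions such as the Couchbase connector mentioned later can be folded back into the project.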

How does community adoption factor into economically acquiring customers, customer lifetime value and things like that?

Ultimately, we think this is a very, very broadly applicable technology problem and that there’s a huge market opportunity. People try to segment this in a variety of ways but, from my perspective, choosing the right size of company is critical. Community adoption has allowed us to see where the technology is taking hold and to pick and choose the right companies to target.

As I mentioned, about a quarter of the Fortune 500 have come to us, and that’s plenty of opportunity for us to make them successful and monetize them. We thrive where there is complexity, so we expect that every enterprise in the Global 2000 and, ultimately, the Global 8000 is fair game for us. Keeping our open source community engaged and successful through Slack and Google Groups has helped us prospect for customers and, equally important, allows them to self-run product evaluations.

How do you work with the other players in the big data community, given that it’s full of other open source technologies?

There are two elements to that. The first is simply that the ease of technology certification in open source is very handy. With the obvious vendors, like Hortonworks and Cloudera, we have done that ourselves. But recently, a distributor of Couchbase in a part of the world we wouldn’t have gone to, I think it was South Africa, ended up contributing a Couchbase connector back to us. That can happen because we’re open, they’re open, and anybody in the middle can go about helping out.

The other piece is really more about go-to-market strategy. Many of these vendors ultimately have a land-and-expand strategy. They make money based on how successful their customers are at standing up applications, and how much data these applications have under management.

But one of the reasons why big data remains a science project and doesn’t get “operationalized” or “productionalized” is that the movement of data into these systems is very manual and brittle. So there’s a natural inclination for these vendors to endorse and bring in a vendor like us to help their customers be more successful. They can effectively refer us after their initial commercial transaction, so that they can come back and close larger deals and deliver on their expansion strategy.

From that perspective, we have gone to market in partnership with a number of these big data companies and technologies. Those include independent companies such as Cloudera and MapR, as well as bigger cloud infrastructure providers. For example, we engage with Microsoft Azure and have started doing the same with Amazon Web Services.

Is this unique to the big data space because of how inherently open source it is?

I’ve found the big data ecosystem to be orders of magnitude, maybe a hundred times, easier to integrate and interoperate with than traditional vendors.

At Informatica, I spent many, many years dealing with the proprietary vendors — Oracle, Teradata, IBM, etc. — figuring out how to integrate with them. In the early days, it was a challenge. They tended to make it hard. I guess a way to describe it is that they made it pretty hard to get data out and they didn’t make it much easier to get data in.

I think today they are more open, possibly driven by open source. We’ve also been able to support most of the incumbent technologies. Exadata and SAP HANA and all of those are also fair game for us, and we do have commercial customers using these offerings.

Do you have visibility into how data pipelines are shifting, or what technologies are catching on or slowing down in terms of usage?

We absolutely do. In fact, we surveyed our community a few months ago about how people are using the open source software. There are a couple of different points that I would make here.

First, from a data architecture perspective, more often than not there are multiple data storage and compute technologies being used in a single application, what we call the data sprawl. This is in contrast to the old world of, let’s say, business intelligence, where everything could be thrown into a SQL database. Here we find that there’s a little bit of machine learning, maybe a little bit of Spark, a little bit of SQL. Maybe people are throwing data into a system to create search indexes. All of these together end up solving a business or operational problem, which could be that “know your customer” problem, or fraud detection or cybersecurity.

Second, the trend in the data space toward the cloud is very evident. The first wave was really the application wave, the SaaS wave. But now we find that people are much more comfortable storing big data in the cloud. This is evidenced by the success of companies like Amazon and Microsoft, but you can also see it in newer entrants, companies like Cloudera, making their technologies available in the cloud.

We’re definitely at the forefront of that, because a number of our commercial customers end up in a hybrid world, where they are starting to blend on-premises and cloud infrastructure. That’s a trend that will continue on the data front, because data is different from, say, applications. In addition, a multi-cloud architecture is more realistic for data than for apps, because there’s no reason for all the data to end up in a single cloud. Cloud is really more of an economic model for them, a way to burst capacity and reduce maintenance costs. Workload portability, allowing the data to move to the best place to run the job, will be an increasingly common requirement.

Then I think the final trend we see — even though the name of the company, StreamSets, may suggest that we saw potential only in the streaming world — is that you can’t leave the batch world behind. We’re finding that almost all of our customers have a blend of both streaming and batch. It really depends on the data urgency, so to speak, which mode you need to use. You really need to think about it, and our customers do: they use StreamSets as a single data movement technology to solve both problems.