Introducing Datacoral: A Secure, Scalable Data Infrastructure — No Management Required

Today, I could not be more excited to publicly launch Datacoral and announce our $10M Series A, led by Sudip Chakrabarti at Madrona Venture Group with participation from Social Capital and other investors.

At Datacoral, we are taking a fresh look at how data-driven companies can dramatically reduce the time they spend managing data infrastructure and instead focus on driving more business value from that data. Today, companies of all sizes want to be data-driven, which means successfully using the data they own to make their products or services better. But building data infrastructure can be one of the biggest barriers for companies of any size, and maintaining it is a continuing burden.

With Datacoral, companies can access the power of their data insights without having to worry about the plumbing of their underlying data infrastructure. Several customers in different verticals are already reaping the benefits of our platform today (Greenhouse, Front, Ezetap, and Fin, to name a few), and our new funding will accelerate our onboarding of enterprise customers and expand the functionality of our solution.

Lessons learned building data infrastructure at Facebook and Yahoo!

The idea for Datacoral is a culmination of learnings from working on data infrastructure and distributed systems for over 15 years at companies including Yahoo! and Facebook.

My experience with data infrastructure started in 2000, at Yahoo!, before the term “big data” was even coined. A couple of years later, we built a parallel SQL engine that could store a couple hundred terabytes of data on NetApp filers and execute queries on tens of machines that mounted those filers via NFS. We were thrilled about it, even termed it a “nirvana” architecture, because who would ever need to manage more data!

Fast forward to 2008 at Facebook, where our data infrastructure team was building a SQL engine, which would become Apache Hive, on top of a very nascent Hadoop. While many ideas from the “nirvana” architecture applied, the scale was much bigger. While working on Hive, I built an auto-instrumentation system, nectar, which captured all user activity with a standard set of attributes. This may sound menial, but it relieved us of endless hours spent instrumenting manually, and enabled Facebook’s infrastructure to accumulate and log data at scale. We went from a fifty-terabyte cluster to hundreds of petabytes over a five-year period! It got to a point where we could not build a data center big enough to hold all of our data. So, I worked on a project to make our data infrastructure multi-datacenter and multi-tenant.

Over the years, I have learned that scaling data infrastructure has some unique challenges. Data infrastructure is not one size fits all. Different technologies are needed for different use cases and different scales. Systems that work well at one scale may keel over at the next order of magnitude. Data infrastructure in fact needs to scale faster than the application itself. The load on data infrastructure grows as the product of customer growth, engagement growth, and the increase in data applications. For example, if customers triple while engagement and the number of data applications each double, the load grows roughly twelvefold. This hyper-scaling is almost always unanticipated and significantly compounds the challenges of generating value out of that data.

And finally, as data infrastructure engineers, we never trusted outside parties with our data management. We built functionality within our own data centers instead of sending data out to third party systems.

Why start yet another data infrastructure company?

In 2015, when I was an Entrepreneur in Residence at Social Capital, I talked to many companies about their data infrastructure needs. One would think that over time, the cost and cognitive overhead of running a data infrastructure stack would have decreased given the popularity of open source tools, SaaS services, and public clouds. But, surprisingly, it seems the opposite is true! There is a mind-boggling number of choices for collecting and managing data, which results in more complexity than ever. Companies invariably cobble together several systems and services to get the end-to-end functionality they need, then have to write complex data pipelines, which are hard to manage. In many cases, they end up sending their data to third-party SaaS tools for analysis, which means loss of control over that data.

Working closely with Jay Zaveri and Social Capital’s Discover team, which helps incubate startups solving hard problems, I started exploring what it might take for a company to maximize the productivity of its data teams while still retaining full control over its data. It seemed like the critical new ingredient was serverless computing, which could allow for a fully managed data infrastructure stack hosted entirely in the customer’s cloud. In other words, a true “private SaaS model” seemed realistic because of serverless computing.

Given the market requirements and technology trends, I realized that there were four demands that needed to be met together for that to happen.

  1. Remove management burden: We needed a fully-managed service that automates management of data flows while ensuring data freshness and integrity. Companies shouldn’t have to worry about the day-to-day operations to guarantee the data is flowing and running smoothly.
  2. Elevate data scientists beyond jobs and tasks: We needed to provide an interface that frees up data scientists to focus on the shape and semantics of data, instead of jobs and tasks in data pipelines. This interface should allow a data practitioner to instantly gain data insights and address any data problems with proactive alerts.
  3. Ensure companies maintain control of their data: We needed to back that interface with a serverless implementation so that the solution could easily be fully deployed into the company’s cloud, thus making sure that their data never leaves their systems.
  4. Offer a flexible solution that could grow with the business: We needed to make it easy for companies to try this out without having to rip and replace their existing systems. A “building blocks approach” could start small on top of their existing systems and then grow from there.

Introducing Datacoral

These four demands informed how we built Datacoral. We have been at it now for the past two and a half years. Our data infrastructure as a service can be deployed fully within a company’s virtual private cloud. It automates the tedious work of data plumbing, while empowering teams to focus on generating insights. It has been heartening to see our initial customers get started with a single use case and then expand their footprint, so that we grow with them over time. We will write more about our product and use cases in future posts.

Today I am really thankful to the Datacoral team, our customers, investors, and partners who have supported us through our initial years. I speak for everyone here at Datacoral when I say we couldn’t be more excited about the opportunity to help companies everywhere get the most out of their data.

You can reach out to us at hello@datacoral.co.

Image Credit: Vincent Pommeyrol / Getty Images