In my initial post, I introduced Datacoral and our focus on helping enterprises everywhere get the maximum value out of their data.
This week I am kicking off a series of posts that dive into Datacoral’s core technology and offer practical “how to” examples that show what life as a data engineer or data scientist is like for Datacoral customers. But first, let’s get oriented on a problem that has already been discussed a great deal.
The Challenge — The Data Hairball
“All sorts of businesses are collecting a ‘rolling hairball’ of data on their customers. Whether they can use their insights to transform into tech companies is another matter.”
This quote is from a recent Wall Street Journal article that highlights how unlocking the value of data is at the core of every company that wants to become a technology company. We see two key reasons why it is hard for companies to truly leverage the data they have.
- The number of technologies and services available for collecting, analyzing, and managing data is mind-boggling: ingest tools and services, ETL and job orchestration systems, data warehouses, and big data query engines, to name a few. Significant expertise is required to make those choices and actually assemble a system that works for end-to-end data flows, from data in different sources to insights that are acted upon. Such expertise is in short supply.
- Data flows are coded up as data pipelines, which require a deep understanding of the underlying technologies. Data pipelines implement the business logic of data in scripts that are filled with boilerplate code to handle the integration points between the different systems, plus orchestration logic to make sure that data is processed in the right order. These pipelines become brittle and hard to maintain over time as the business logic of the data changes, again requiring expertise that is hard to come by.
At Datacoral, we have worked to overcome the complexities of securely piecing together the systems needed for end-to-end data flows and have dramatically simplified how data flows are specified.
Data Programming — Moving Data Teams Up the Stack
It all starts with elevating data scientists and engineers beyond today’s reality of manually orchestrating jobs and tasks in data pipelines, so that they can focus instead on the business logic of the data.
We call it Data Programming, rather than Data Engineering.
Data engineering today can really be boiled down to the integration of a variety of systems and the scripting of data pipelines that span those systems. Data engineers have to understand the architecture of the underlying data infrastructure, the jobs in data pipelines used for automation, and the dependencies that govern how the orchestration happens at scale.
In the world of programming, this is akin to working with an assembly language, where there is a strong correspondence between the language’s syntax and structure and the architecture of the target microprocessor. Most programmers use higher-level languages and platforms (Node.js, Java, .NET, Python) that allow them to write code that describes the business logic of an application and is portable across hardware architectures.
So, why shouldn’t there be a higher-level language to create programs that focus on the business logic of the data without having to know about the architecture of the underlying data pipeline and data infrastructure?
At Datacoral, we are introducing a SQL-like high-level language, the Data Programming Language (DPL), which allows data professionals to author data programs that manage end-to-end data flows without having to understand the underlying systems.
So, for example, instead of thinking about building an ingest pipeline from Salesforce into a data warehouse with its myriad jobs and tasks, one would just write a single statement for the “collect data function”, i.e., something like:
UPDATE SCHEMA salesforce
Once such a statement is executed, data from different Salesforce objects automatically starts flowing into the corresponding tables in the salesforce schema of the data warehouse!
We have added SQL-like syntax to specify the different data functions that are typically performed in end-to-end data flows. The signature, or type, of a data function is essentially the schema of the data it returns, so data programs can be statically type checked. For example, changing the schema of a data function without also changing all the transformations that use its output results in a compile-time error. This capability significantly simplifies how end-to-end data flows are built and maintained over time.
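To make the type-checking idea concrete, here is a minimal Python sketch. It is not Datacoral's compiler or DPL syntax; it only assumes that a data function's signature is a mapping of column names to type names and that each transformation declares the columns it reads. The names `DataFunction`, `Transformation`, `check_program`, and `SchemaError`, as well as the example columns in the note below, are hypothetical.

```python
from dataclasses import dataclass


class SchemaError(Exception):
    """Raised when a transformation reads a column its source does not provide."""


@dataclass(frozen=True)
class DataFunction:
    name: str
    output_schema: dict  # column name -> type name (hypothetical representation)


@dataclass(frozen=True)
class Transformation:
    name: str
    source: str  # name of the data function this transformation reads from
    reads: dict  # column name -> type name the transformation expects


def check_program(functions, transformations):
    """Validate every transformation against its source schema before anything runs."""
    schemas = {f.name: f.output_schema for f in functions}
    for t in transformations:
        if t.source not in schemas:
            raise SchemaError(f"{t.name}: unknown source {t.source!r}")
        schema = schemas[t.source]
        for column, expected in t.reads.items():
            actual = schema.get(column)
            if actual is None:
                raise SchemaError(f"{t.name}: {t.source}.{column} does not exist")
            if actual != expected:
                raise SchemaError(
                    f"{t.name}: {t.source}.{column} is {actual}, expected {expected}"
                )
    return True
```

If a function such as `salesforce.account` renames `annual_revenue` to `revenue` while a transformation still reads `annual_revenue`, `check_program` raises `SchemaError` before any pipeline is run, which is the behaviour a compile-time check would give.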
Data programs get compiled into data pipelines that then get executed on Datacoral’s data programming runtime platform. The platform consists of:
- a scalable way to manage state through a shared metadata layer, and
- a data-event-driven pipeline orchestration layer.
The runtime captures the necessary state to provide users visibility into both data freshness and data quality. The platform itself has been built in a fully serverless manner. More on this in later posts.
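As a rough illustration of how a shared metadata layer and data-event-driven orchestration can fit together, consider the toy Python scheduler below. It is a sketch under simplifying assumptions (in-memory state, a single process, one event per dataset and batch), not a description of Datacoral's runtime. The `Orchestrator` class and its method names are hypothetical.

```python
from collections import defaultdict, deque


class Orchestrator:
    """Toy data-event-driven scheduler backed by an in-memory metadata store."""

    def __init__(self):
        self.upstream = {}                  # dataset -> datasets it depends on
        self.downstream = defaultdict(set)  # dataset -> datasets that depend on it
        self.ready = defaultdict(set)       # batch id -> datasets available for it
        self.latest = {}                    # dataset -> last batch seen (freshness)
        self.run_log = []                   # (dataset, batch) in execution order

    def register(self, dataset, depends_on=()):
        self.upstream[dataset] = set(depends_on)
        for dep in depends_on:
            self.downstream[dep].add(dataset)

    def on_data_event(self, dataset, batch):
        """Record that `dataset` has data for `batch`, then run whatever is unblocked."""
        queue = deque([dataset])
        while queue:
            current = queue.popleft()
            self.ready[batch].add(current)
            self.latest[current] = batch
            for child in sorted(self.downstream[current]):
                already_done = child in self.ready[batch]
                unblocked = self.upstream[child] <= self.ready[batch]
                if unblocked and not already_done and child not in queue:
                    self.run_log.append((child, batch))  # stand-in for running the job
                    queue.append(child)
```

A downstream dataset runs only once every one of its inputs has reported data for the same batch, so the processing order falls out of the recorded state rather than from a hand-written schedule. The `latest` map is the kind of state from which a freshness view could be derived.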
Datacoral Slices — Abstracting the Complexity of System Integration
Data programs typically consist of several data functions, such as collecting data from different sources, transforming that data in different query engines, and publishing the transformed data into different systems. We have implemented these data functions as serverless microservices with standardized interfaces for data and metadata, coordinated through the runtime’s shared metadata layer. We have also standardized how these microservices are packaged, deployed, and monitored. With this standardization, we have encapsulated and abstracted away the complexity of piecing together the different systems that are needed for end-to-end data flows. These microservices can be added on an as-needed basis, based on the functions specified in the data programs. We call these microservices slices.
We have built an extensive catalog of slices of different types. Collect slices make raw data available consistently. They provide modular endpoints for instrumentation, capture changes in production databases, and retrieve data from any API. Organize slices use the notion of materialized views to support consistently transforming data in any query engine. Harness slices can publish data to third-party apps for company-wide use, or to production databases for direct access by applications.
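The value of a standardized slice interface can be sketched in a few lines of Python. This is an illustrative model only: the class names mirror the three slice types described above, but the `run(tables, metadata)` signature, the in-memory `tables` dictionary standing in for a warehouse, and the `run_pipeline` helper are all assumptions, not Datacoral's actual interfaces.

```python
from abc import ABC, abstractmethod


class Slice(ABC):
    """Hypothetical common interface: each slice works on named tables and
    reports what it produced to a shared metadata record."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def run(self, tables, metadata):
        """Update `tables` (name -> list of row dicts) and `metadata`."""


class CollectSlice(Slice):
    def __init__(self, name, fetch):
        super().__init__(name)
        self.fetch = fetch  # callable returning raw rows from a source

    def run(self, tables, metadata):
        tables[self.name] = list(self.fetch())
        metadata[self.name] = {"type": "collect", "rows": len(tables[self.name])}


class OrganizeSlice(Slice):
    def __init__(self, name, source, transform):
        super().__init__(name)
        self.source = source
        self.transform = transform  # callable rows -> rows, like a view definition

    def run(self, tables, metadata):
        tables[self.name] = list(self.transform(tables[self.source]))
        metadata[self.name] = {"type": "organize", "rows": len(tables[self.name])}


class HarnessSlice(Slice):
    def __init__(self, name, source, publish):
        super().__init__(name)
        self.source = source
        self.publish = publish  # callable that sends rows to an external system

    def run(self, tables, metadata):
        rows = tables[self.source]
        self.publish(rows)
        metadata[self.name] = {"type": "harness", "rows": len(rows)}


def run_pipeline(slices):
    """Run slices in the given order against fresh tables and metadata."""
    tables, metadata = {}, {}
    for s in slices:
        s.run(tables, metadata)
    return tables, metadata
```

Because every slice exposes the same `run` method and writes to the same metadata record, the caller does not need to know whether a given step collects, transforms, or publishes data. That uniformity is what lets slices be added only when a data program needs them.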
Serverless all the way — Scalable and Secure Architecture that Works where Your Data is
Traditionally, SaaS offerings share the same infrastructure across customers in a multi-tenant installation to minimize operations overhead and amortize infrastructure costs. Because we are serverless-native, there is no need for us to share infrastructure across our customers. Given Datacoral’s architecture, our service is not only serverless for the customer, but also serverless for us.
The deployment and consumption of Datacoral itself is similar to the deployment and consumption of other AWS services. Our software gets deployed inside each customer’s VPC, which means that their data never leaves their environment and is also encrypted using customer-managed keys. The result is an unprecedented level of security for an end-to-end data infrastructure stack in the cloud.
Solving Real Problems for Enterprises Today
We are super excited about the incredible business value this set of combined technology choices is already bringing to companies like Front, Greenhouse, Jyve, and MealPal. If you are trying to best leverage Amazon Redshift and deciding which tools to use, or are just embarking on creating your first data infrastructure stack, Datacoral can help.
Check out the Data Engineering Podcast I did on Serverless Data Pipelines using Datacoral to learn more.
We are also looking for strong engineers to join the team!
Next up — the data programming interface.