Data Programming Interface (DPI) for Analytics Tools (instead of APIs)
In the previous post, we introduced the concept of a data program where a data practitioner who knows SQL can declaratively specify data flows as a combination of data functions. The data functions themselves have signatures/types that are based on the data schema — the attributes (and their types) that form the input to the data functions. In this post we will talk a little more about what it means for the data schema itself to be the interface, which we call the Data Programming Interface.
First, let's discuss how data processing for analytics is typically thought about. A good example is Google Analytics, which is used for instrumenting and analyzing website activity. The simplified anatomy of such an analytics tool is shown in Figure 1 below.
As far as the user of the analytics tool is concerned, it is just a two-step process:
- Step 1. The developer makes an API call, essentially a function call that requires a standard set of attributes (like userid, timestamp) and some application-level information.
- Step 2. Once the API call is made, pretty graphs appear in a UI provided by the tool (with some latency).
The developer doesn’t have to worry about what kind of processing happened for the data to get cleaned and aggregated. The analytics tool handles it.
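To make the first step concrete, here is a minimal sketch of what such an API call might look like from the developer's side. The `track_event` function, the payload layout, and the `page` property are all hypothetical and are not taken from any real analytics client; the only thing borrowed from the description above is that `userid` and `timestamp` are required.

```python
from datetime import datetime, timezone

# Hypothetical stand-in for an analytics tool's client API; not a real library.
REQUIRED_ATTRIBUTES = ("userid", "timestamp")


def track_event(event_name, userid, timestamp, **app_info):
    """Build the payload an API call might send (illustrative only)."""
    payload = {"event": event_name, "userid": userid, "timestamp": timestamp}
    missing = [name for name in REQUIRED_ATTRIBUTES if payload[name] is None]
    if missing:
        raise ValueError(f"missing required attributes: {missing}")
    # Everything else is application-level information.
    payload["properties"] = dict(app_info)
    return payload  # a real client would send this to the tool over the network


event = track_event(
    "page_view",
    userid="u-123",
    timestamp=datetime(2019, 1, 1, tzinfo=timezone.utc).isoformat(),
    page="/pricing",
)
print(event)
```

The point of the sketch is that the contract lives in the function signature: anything that wants to feed the tool has to go through code like this.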
In reality, these analytics tools use multiple systems to collect and transform the data in preparation for visualization. But functionally, every such tool can be thought of as performing the following three steps internally:
Step a. Events that the tool receives through the API call are stored in a raw table
Step b. A data pipeline takes the raw table through a series of transformations — cleaning, enriching and aggregating
Step c. The output of the data pipeline is stored as a set of summary tables from which data is extracted for visualization
The visualizations are made available (Step 2 for the user) because of Steps a, b, c happening within the tool.
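Steps a, b, and c can be sketched in a few lines of SQL, run here through Python's built-in `sqlite3` module. This is only an illustration under assumed names: the `raw_events` and `daily_event_summary` tables, their columns, and the specific cleaning rule (dropping rows without a userid) are made up for the example and do not describe how any particular tool works.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Step a: events received through the API call land in a raw table.
conn.execute("CREATE TABLE raw_events (userid TEXT, ts TEXT, event TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [
        ("u1", "2019-01-01T10:00:00", "page_view"),
        ("u1", "2019-01-01T10:05:00", "page_view"),
        ("u2", "2019-01-01T11:00:00", "signup"),
        (None, "2019-01-01T12:00:00", "page_view"),  # malformed: no userid
    ],
)

# Step b: the pipeline cleans (drops rows without a userid) and aggregates.
# Step c: the result is stored as a summary table that drives visualization.
conn.execute(
    """
    CREATE TABLE daily_event_summary AS
    SELECT substr(ts, 1, 10) AS day,
           event,
           COUNT(*) AS events,
           COUNT(DISTINCT userid) AS users
    FROM raw_events
    WHERE userid IS NOT NULL
    GROUP BY substr(ts, 1, 10), event
    """
)

summary = conn.execute(
    "SELECT day, event, events, users FROM daily_event_summary ORDER BY event"
).fetchall()
print(summary)
```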
While this level of abstraction provided by the tools is great to get started on analytics, most data scientists find such tools pretty rigid and constraining. A few problems include:
- Because input data is provided through API calls, application code is required for every kind of event that has to be sent to the tool. For example, if analytics needs to be done on events coming from devices as well as ticketing events generated in Zendesk, separate code has to be written to send each of these event types to the analytics tool.
- One cannot easily “backfill/replay” events from a historical time period or redo the analysis after repopulating the events data.
Now, imagine that instead of making an API call, a data scientist can create the input events table directly in a database. The table has the same standard attributes and application-level information as columns, but it can hold data from different sources.
In Datacoral’s Data Programming Language, this would look something like below:
So, Step 1 (for the user) and Step a (internal to the tool) in Figure 1, can be replaced by the data program above. This leads to the creation of a raw table, such as the one below.
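As a rough stand-in, the same idea can be sketched in plain SQL run through Python's `sqlite3` module. This is not Datacoral's Data Programming Language. The source tables `device_events` and `zendesk_tickets`, the target table `input_events`, and every column name are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical source tables that are already loaded into the database.
conn.executescript(
    """
    CREATE TABLE device_events (device_user TEXT, event_time TEXT, action TEXT);
    CREATE TABLE zendesk_tickets (requester TEXT, created_at TEXT, status TEXT);
    INSERT INTO device_events VALUES
        ('u1', '2019-01-01T10:00:00', 'app_open'),
        ('u2', '2019-01-01T10:30:00', 'app_open');
    INSERT INTO zendesk_tickets VALUES
        ('u1', '2019-01-01T11:00:00', 'open');
    """
)

# Map both sources onto the single schema the tool expects as its input:
# userid, ts (timestamp), source, and one application-level detail column.
conn.execute(
    """
    CREATE TABLE input_events AS
    SELECT device_user AS userid, event_time AS ts,
           'device' AS source, action AS detail
    FROM device_events
    UNION ALL
    SELECT requester, created_at, 'zendesk', 'ticket_' || status
    FROM zendesk_tickets
    """
)

rows = conn.execute(
    "SELECT userid, ts, source, detail FROM input_events ORDER BY ts"
).fetchall()
for row in rows:
    print(row)
```

No application code is needed per event type here: adding a third source is one more `SELECT` in the union.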
A data pipeline similar to the one within the analytics tool can then take the raw table through a series of transformations. This would result in the same output summary tables that can then drive the visualizations that our user is familiar with.
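One consequence worth spelling out is that backfilling becomes ordinary data manipulation. In the sketch below, the pipeline is a function that recomputes the summary from whatever is in the input table, so replaying a historical period is just an insert followed by a rerun. The `input_events` and `daily_summary` tables and the `rebuild_summary` helper are hypothetical, and a full rebuild is only one of several possible strategies.

```python
import sqlite3


def rebuild_summary(conn):
    """Recompute the summary table from scratch from the rows in input_events."""
    conn.execute("DROP TABLE IF EXISTS daily_summary")
    conn.execute(
        """
        CREATE TABLE daily_summary AS
        SELECT substr(ts, 1, 10) AS day, COUNT(*) AS events
        FROM input_events
        GROUP BY substr(ts, 1, 10)
        """
    )
    return conn.execute(
        "SELECT day, events FROM daily_summary ORDER BY day"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE input_events (userid TEXT, ts TEXT, event TEXT)")
conn.execute(
    "INSERT INTO input_events VALUES ('u1', '2019-01-02T09:00:00', 'login')"
)
before = rebuild_summary(conn)

# Backfill: add events from an earlier period, then rerun the same pipeline.
conn.executemany(
    "INSERT INTO input_events VALUES (?, ?, ?)",
    [
        ("u1", "2018-12-31T09:00:00", "login"),
        ("u2", "2018-12-31T10:00:00", "login"),
    ],
)
after = rebuild_summary(conn)

print(before)
print(after)
```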
With data programs, instead of using an API to interface with the analytics tool, the data scientist is using the data directly to interface with it. To be more specific, the analytics tool would specify a data interface, i.e., a schema, for its input, rather than an application programming interface (API). This is what we are calling a Data Programming Interface (DPI).
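If the schema is the interface, then checking whether a table can be used as input reduces to comparing its columns against the expected ones. The sketch below does this against SQLite using `PRAGMA table_info`. The `EVENTS_DPI` dictionary, the `conforms_to_dpi` helper, and the exact-type-match rule are assumptions for illustration; they are not how Datacoral defines or enforces a DPI.

```python
import sqlite3

# Hypothetical DPI: the column names and declared types the tool expects.
EVENTS_DPI = {"userid": "TEXT", "ts": "TEXT", "event": "TEXT"}


def conforms_to_dpi(conn, table, dpi):
    """Return a list of problems; an empty list means the table fits the DPI."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    # The table name is interpolated directly, so it must be a trusted value.
    actual = {
        row[1]: row[2].upper()
        for row in conn.execute(f"PRAGMA table_info({table})")
    }
    problems = []
    for column, expected_type in dpi.items():
        if column not in actual:
            problems.append(f"missing column: {column}")
        elif actual[column] != expected_type:
            problems.append(
                f"{column}: expected {expected_type}, found {actual[column]}"
            )
    return problems


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE good_events (userid TEXT, ts TEXT, event TEXT, extra TEXT)"
)
conn.execute("CREATE TABLE bad_events (userid INTEGER, event TEXT)")

good_problems = conforms_to_dpi(conn, "good_events", EVENTS_DPI)
bad_problems = conforms_to_dpi(conn, "bad_events", EVENTS_DPI)
print(good_problems)
print(bad_problems)
```

Extra columns are allowed in this sketch, which mirrors how an API call can carry additional application-level information beyond the required attributes.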
At Datacoral, we believe that data practitioners who are comfortable in SQL can utilize a DPI to provide input data more flexibly into tools for analytics (amongst other things). Datacoral also makes it easy for data practitioners to build data pipelines as data programs. These data programs would then describe their input via a DPI. However, that is a topic for a future post.
A similar idea has been discussed before. Jonathan Hsu wrote on his blog about a raw table that corresponds to a specific schema:
This table can activate sophisticated analyses around growth, churn, and LTV. At Datacoral, we have generalized some of these concepts into the much more powerful notion of Data Programming.
In the next edition of this series of blog posts, we will describe in detail the features of our Data Programming Language.
If you're interested in learning more about Data Programming, or want to chat about building scalable data flows, reach out to us at email@example.com or sign up for a demo.