Building Embedded Analytics with Cube

Published in

Engineering at Zafin

9 min readFeb 27, 2024

By: Jae Li, Software Engineer Co-op, Chandoksahej, Software Engineer Co-op, Rutvij Sharma, Sr. Data Engineer, Marcin Zegarmistrz, Technical Product Manager, Adnan Haider, SVP Analytics

Modern analytics workflow requires embedded analytics where reports, dashboards, models, and data visualizations can be accessed through a single interface. The goal of embedded analytics is to bring analytics as close to business operations as possible, reducing the amount of unnecessary context-switching between application interfaces. Conventional business intelligence (BI) solutions introduce friction in analytics operations since analytics and operations are configured as separate processes. For example, a workflow that consists of designing a product offering and gathering information about client behaviors requires two different workspaces, since they are imagined as categorically different processes: one operational and the other analytical. The former is performed through an internal tool and the latter through BI software. The purpose of embedded analytics is to bridge this gap between the operational and analytical processes by bringing analytics into the familiar context of operational workflow. With embedded analytics, the same workflow of designing a product offering can be redefined as a multi-step form where for each condition the user sets, a dynamic chart will display the clientele breakdown accordingly. To enable embedded analytics, a performant and easy-to-use data visualization solution is a key requirement.

In application development, building data visualizations is brittle and time-consuming. First, data has to be prepared based on the data series that needs to be rendered on the front-end, and then an API needs to be implemented to assemble and return the data series as JSON. Finally, the front-end needs to unmarshall the JSON and bind it to the charting libraries. This is already labor-intensive, but it gets worse: If we want to change the charts on the front end, perhaps to go from a daily aggregate to a weekly aggregate, then the data model, API, and front end all need to be changed.

The cumbersome process of building data visualizations is only one of many challenges. A bigger issue is the inconsistencies that arise when multiple downstream applications define their versions of the same metrics. In stand-alone BI solutions, commonly used metrics are defined internally. As a result, other downstream applications must define their versions of the metrics. Managing metric definitions becomes a task in itself. To address the difficulties that arise in building data visualizations, we introduce a semantic layer in the data pipeline, powered by Cube.

Semantic Layer

A semantic layer in the data stack serves as a central repository of metric definitions. The semantic layer is where raw data are made intelligible for analytics use cases. Consider the typical demands of business reporting: a sales VP would likely look at Key Performance Indicators (KPIs) like close rates, deal velocities, and transactions by segment, region, and team. These KPIs are metrics with fixed definitions to measure business operations. Other examples of metrics include aggregations that define categories and segmentations of data, such as active users and business regions. Seemingly crisp segmentations in fact leave many ambiguities if analysts define them on the fly each time. Do the weeks start on Sunday or Monday? Are British customers part of the EU jurisdiction? While some of these definitions are part of the data table, many are further aggregations on top of the existing information. The semantic layer provides a central location where metrics are defined once and for all for downstream applications. The layer also enables implementation of data controls by supporting fine-grained access to metrics.

A semantic layer provides data modeling for the underlying database, simplifying the data provisioning for making visualizations. As mentioned above, provisioning data for visualizations is an arduous process. Most modern applications use an Object-Relational Mapper (ORM) that maps underlying data representations in databases to those in object-oriented programming to allow programmatic interactions with databases instead of SQL. While ORMs abstract away SQL queries, each downstream application must implement its ORMs. The semantic layer provides a universal data modeling layer that is independent of downstream data applications. Downstream applications no longer need to build their ORMs. Instead, they query the database through the data representations, including metric definitions, defined in the semantic layer. Internally, the semantic layer transforms queries and metric requests into SQL queries. This architectural shift is depicted below, where data modeling and metric definitions move from individual applications to the central location of the semantic layer. Because the semantic layer provides the metric definitions for analytics use cases, it is also referred to as Headless BI, where the headless means the lack of a graphic interface or the capacity for rendering visuals.

Figure 1: Architecture of the Semantic Layer, Source: Benn Stancil, The missing piece of the modern data stack.

Why We Chose Cube as Our Semantic Layer

At Zafin Analytics, we use a semantic layer, Cube, to manage our metric definitions and accelerate product development in embedded analytics. The semantic layer is part of the Zafin Data Fabric that transforms and provisions data for analytics use cases. We initially started to investigate semantic layer libraries because we needed to build embedded analytics in the analytics product suite. The existing development process for making visualizations is time-consuming since each chart requires a REST endpoint and a dedicated microservice. We are looking for an embedded analytics solution to accelerate the development process. And since our product suite has several products that share the same underlying data source, we need a semantic layer to define metrics and data models once and for all. Another important requirement in our selection process is performance. We would like to cache frequent queries to accelerate query response and allow scaling as the user activity grows.

Since the semantic layer is relatively new in the data engineering toolkit, the options for semantic layer products are constrained. BI tools such as Power BI still dominate the market, and major vendors support embedding dashboards in custom applications. For Zafin’s use cases, using BI tools for embedded analytics feels like overkill. Also, as mentioned above, corporate BI products are black boxes with metric definitions tied to the product. Therefore, we mainly surveyed semantic layers that can serve as a headless BI component and integrate with our existing tech stack such as dbt’s semantic layer, Atscale and Stardog.

We landed on Cube or CubeJS as our semantic layer of choice. Built with performance in mind, Cube is a popular open-source semantic layer built in Rust. We chose Cube as our semantic layer because it offers the following advantages:

Support for complex queries and visualizations: Cube accelerates the data provisioning process for visualizations. Instead of making tailored APIs for each chart, data provisioning in Cube is achieved through Cube queries written in JavaScript. Managing charts no longer requires backend changes. Instead, one only needs to add the charting component along with the Cube queries to fetch the underlying data source. In addition, Cube also provides out-of-box support for building visualizations through its development mode and React library.
Caching: Caching is a critical part of any large-scale data-intensive applications that help optimize performance and reduce cloud data storage costs. Cube provides two forms of caching, in-memory cache and materialized queries called pre-aggregations. When Cube receives a query, it will search for an existing cached query. Using cached queries significantly accelerates the query speed. The pre-aggregations are refreshed on a customized schedule.
Self-hosted and horizontal scaling: Cube.dev offers open-source self-hosted and managed solutions. Unlike its closed-source competitors, Cube is under the Apache License 2.0. Cube gives us the flexibility to customize and self-host our Cube instance. As a self-hosted solution, Cube also provides options for horizontal scaling, enabling expansion and resource optimization as data needs evolve. Cube also has a managed offering which we may use in the future to lower our operational overhead.

Building Metrics at Zafin

Design user-defined metrics with Cube data models

As a pilot project, we integrated Cube with our existing application, Zafin Studio, to build data visualizations. Our visualization use cases are centered around the notion of user-defined metrics. An example of user-defined metrics is the monthly sum of average deposit balances for that exceeds $2000 at the account level. After some research, we found that we can leverage metric definitions in Cube data models to build these user-defined metrics. In Cube, we use measures to define common aggregations that form the building blocks of user-defined metrics in Zafin Studio. Each table in the underlying database is mapped to a data model file. For the balances table, a model file can look like below, where we define aggregations, such as sum, minimum, and maximum on the column average_balance. In Cube, these measures can grow more complicated as the corresponding SQL definition grows more complex.

Based on the data model, the user-defined metric can be composed as a Cube query shown below. In the query, we specify the aggregation average_balance_sum in the balances table, with the month as our granularity. We also added two filters, one on average_balance and another on customer_id, to our search. Since these queries are composed on the client side, building visualizations based on the metric becomes a matter of composing queries based on Cube data models. Here, the team has made an architectural decision not to hardcode user-defined metrics as Cube measures, which sit on the server side, but convert them into Cube queries instead. This decision is based on several considerations. In terms of operation, updating data models on the server based on user activities is in general not advised since the server should be stateless according to the RESTful principle. In our case, updating data models based on unpredictable user activities is impractical. In terms of privacy, user data should not be exposed as data models for all users of the Cube server.

Translate data representations to Cube queries

To integrate Cube with Zafin Studio, we also built a few utility microservices, the most important being a microservice that translates the internal representation of user-defined metrics to Cube queries. In Zafin Studio, there is a one-to-one relationship between user-defined metrics and Cube queries. The user-defined metrics have an internal data model that represents the example metric as the JSON object below. The microservice is responsible for converting the internal representation of user-defined metrics to Cube queries.

Allow clients to query Cube through a dedicated API

In addition to web application integration, we built a client-facing API to allow users to query user-defined metrics. We designed the API based on the client’s need to streamline data access and increase transparency. Through this API, users gain direct access to the dataset associated with the metrics they created through Zafin Studio. The client-facing API also gives users more flexibility to specify the time range and interval.

Together, there are multiple points of data access for analytics consumers through Cube, as shown in the diagram below.

Figure 2: Semantic Layer for Multiple Points of Data Access

Build CI for production-grade Cube cluster

Moving Cube to production and integrating with Zafin’s CICD pipeline is another major endeavor in the project. To meet the performance demand, we choose to deploy Cube in Zafin’s Azure Kubernetes cluster. The Kubernetes cluster consists of two main components, the Cube API and Cubestore. As depicted in the diagram below, the Cube API handles incoming requests and decides whether the query exists in the cache. If so, it queries the Cubestore for the cached query. Otherwise, it fetches the query from the database and saves the query to Cubestore.

Figure 3: Cube API Kubernetes Architecture

In addition to the Kubernetes cluster, we are also building a few components to streamline the DevOps process:

A script to generate Cube data models based on database changes
Enable rolling updates to eliminate server downtime

Conclusion

As an adaptive semantic layer, Cube made it easy made it easy for teams to use their existing data warehouse as a single source of truth, allowing them to serve consistent metrics to both enterprise customers and internal stakeholders. At Zafin, we chose Cube as our semantic layer to build embedded analytics, letting us shrink the time to embed visualizations from four person days to just one. As we progress in Cube implementation, we plan to expand into incremental pre-aggregations to leverage Cube’s powerful caching features and introduce Cube to other products at Zafin.

References

Introducing JSON. Available at: https://www.json.org/json-en.html (Accessed: February 16, 2024)
The Semantic Layer for every data app. Available at: https://cube.dev/ (Accessed: February 16, 2024)
Semantic Layer. Available at: https://www.getdbt.com/product/semantic-layer (Accessed: February 16, 2024)
Semantic Layer Solution — BI & Data & Analytics Software. Available at: https://www.atscale.com/ (Accessed: February 16, 2024)
The Enterprise Knowledge Graph Platform. Available at: https://www.stardog.com/ (Accessed: February 16, 2024)
Introduction to RESTful Web services — IBM Developer. Available at: https://developer.ibm.com/articles/ws-restful/ (Accessed: February 16, 2024)
Stancil (2021) The missing piece of the modern data stack. Available at: https://benn.substack.com/p/metrics-layer (Accessed: Feb. 22, 2024)