How We Use Cube as an API-first Business Intelligence Platform to Build Real-Time Dashboards

Max
Revel Systems Engineering Blog
10 min read · Jan 31, 2023

Come again, what is Cube exactly?

Cube is an API-first business intelligence platform that helps data engineers and application developers access data from modern data stores, organize it into consistent definitions, and deliver it to every application via REST, GraphQL, or SQL APIs, while maintaining security and performance.

Cube’s semantic layer

Cube is a headless BI platform, which means that it is decoupled from frontend interfaces. As headless BI, Cube provides an API-first, four-part semantic layer, consisting of data modeling, access control, caching, and APIs on top of streaming SQL, for building custom analytics experiences with modern tools.

In other words, it acts as a data access layer (or middleware) which, based on the data model, translates incoming API requests into SQL queries and executes them in your database, while fully managing query orchestration, database connections, and, as mentioned above, caching and access control. The result is then sent back to the client. Cube accepts queries via REST, GraphQL, or SQL interfaces.
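For example, a query can reach Cube over its REST API as JSON posted to the /cubejs-api/v1/load endpoint. A minimal sketch (the URL, token, and query members below are placeholders, not our actual setup):

const loadFromCube = async (apiUrl: string, jwt: string) => {
  const response = await fetch(`${apiUrl}/cubejs-api/v1/load`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: jwt, // JWT used to build the security context
    },
    body: JSON.stringify({
      query: {
        measures: [ 'Order.count' ],
        dimensions: [ 'Order.establishmentId' ],
      },
    }),
  });
  const { data } = await response.json(); // result rows
  return data;
};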

  • The data modeling part allows you to define data models (centrally managed definitions), which are also called cubes. You can define them either in YAML or JS:
Cube definitions in YAML and JS

These schemas are exposed to end users through the querying API, so that they can query the data in the database.

Measures are aggregations over a certain column in your database table. They mostly represent data that you want to calculate.

Dimensions are attributes related to a measure. They mostly represent descriptive data. For example, the measure userCount can have dimensions like country, age, occupation, etc.
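To make this concrete, a minimal, hypothetical cube with that measure and those dimensions could be defined in JS like this:

cube(`Users`, {
  sql: `SELECT * FROM public.users`,

  measures: {
    userCount: {
      type: `count`, // aggregation over the rows of the table
    },
  },

  dimensions: {
    country: {
      sql: `country`,
      type: `string`,
    },
    age: {
      sql: `age`,
      type: `number`,
    },
  },
});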

Later, I’ll provide an example of how we use one of our cubes.

  • The access control part helps manage security so that only the right people have access to the right data. Cube provides multitenancy support on both the database and data schema levels. As multiple drivers are supported, you can keep customers’ data in different databases with a single Cube instance. Cube’s role- and column-based data access control ensures isolation of each customer’s data.

Cube’s authentication and authorization are based on JWT tokens and the security context (or access control). Authentication is handled outside of Cube via any available and configured JWKS (JSON Web Key Set). Cube currently has out-of-the-box support for identity providers such as Auth0 and AWS Cognito. Once the JWT has been verified and decoded, its claims are added to the security context, which is then used to limit data access for particular users.
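To illustrate the database-level multitenancy mentioned above: Cube’s configuration file can route each tenant to its own database based on the security context. A rough sketch, assuming a hypothetical merchantId claim and per-merchant database naming (not our actual setup):

// cube.js configuration file
const PostgresDriver = require('@cubejs-backend/postgres-driver');

module.exports = {
  // Compile a separate schema context per tenant
  contextToAppId: ({ securityContext }) => `CUBEJS_APP_${securityContext.merchantId}`,

  // Pick the tenant's database based on a JWT claim
  driverFactory: ({ securityContext }) =>
    new PostgresDriver({ database: `merchant_${securityContext.merchantId}` }),
};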

How do we use Cube in our real-time analytics app?

Let’s see how we utilise Cube’s features in two of our use cases: a simple data card and a bar chart.

A real-world, but intentionally simplified example of our Order cube, which we defined in our Cube API instance (backend), looks like this:

cube(`Order`, {
  sql: `SELECT *
        FROM public.order`,
  refreshKey: {
    every: `5 second`
  },
  preAggregations: {
    ...
  },
  measures: {
    ...
    final_total: {
      sql: `final_total`,
      type: `number`
    },
    tax_amount: {
      sql: `tax_amount`,
      type: `number`
    },
    discount_total_amount: {
      sql: `discount_total_amount`,
      type: `number`
    },
    discount_total_amount_sum: {
      sql: `discount_total_amount`,
      type: `sum`
    },
    gross_amount: {
      sql: `${final_total} - ${tax_amount} + ${discount_total_amount}`,
      type: `sum`
    }
    ...
  },
  dimensions: {
    ...
    order_id: {
      sql: `order_id`,
      type: `number`
    },
    establishmentId: {
      sql: `establishment_id`,
      type: `number`
    },
    last_updated_timestamp: {
      sql: `last_updated_timestamp`,
      type: `time`
    },
    ...
  },
  ...
});

A similarly simplified example of querying such a cube from the client side, to retrieve orders with the total discount amount (a measure) applied to each and sort them by last updated timestamp (a dimension) in descending order (Cube queries are plain JavaScript objects):

const queryRecentActivityDiscounts: Query = {
  measures: [ 'Order.discount_total_amount_sum' ],
  dimensions: [ 'Order.order_id', 'Order.last_updated_timestamp' ],
  filters: [
    {
      member: 'Order.establishmentId',
      operator: 'equals',
      values: [ String(dataLink.mfMcUserEstablishmentData.activeEstablishmentId) ],
    },
    {
      member: 'Order.is_closed',
      operator: 'equals',
      values: [ 'true' ],
    },
    {
      member: 'Order.discount_total_amount_sum',
      operator: 'gt',
      values: [ '0' ],
    },
  ],
  timeDimensions: [
    {
      dimension: 'Order.last_updated_timestamp',
      dateRange: /* value */,
    },
  ],
  order: [ [ 'Order.last_updated_timestamp', 'desc' ] ],
  timezone: /* value */,
};

We use React 18 with TypeScript 4.8 and Material UI v5 for our microfrontend, the latest Node LTS version (v18.13.0 at the time of writing) for the backend, and Amazon Aurora PostgreSQL v13.6 as our relational database.

We can now use the retrieved measure (discount_total_amount_sum) and dimensions (order_id and last_updated_timestamp) to display the data in one of our widgets, which shows recent activity at one of our test client’s stores:

Recent Activity widget

Another quick example: one of our widgets, which shows the gross sales amount per hour over the past 24 hours at one of our test client’s stores, looks like this:

Gross Amount per Hour widget

We used only the gross_amount measure to query the data and provided a granularity of hour in the timeDimensions property (the dimension referenced there can be any dimension of type time). As the dateRange value you can provide a time range in the format ['from', 'until'], e.g. ['2023-01-01', '2023-01-31'], or you can even describe your range in words, like from 24 hours ago to now or last week. How cool is that?
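Side by side, both dateRange forms look like this (values here are purely illustrative):

timeDimensions: [
  {
    dimension: 'Order.last_updated_timestamp',
    dateRange: [ '2023-01-01', '2023-01-31' ], // explicit range
    // or, described in words:
    // dateRange: 'from 24 hours ago to now',
  },
],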

const queryGrossAmountChart: Query = {
  measures: [ 'Order.gross_amount' ],
  filters: [
    {
      member: 'Order.establishmentId',
      operator: 'equals',
      values: [ String(dataLink.mfMcUserEstablishmentData.activeEstablishmentId) ],
    },
  ],
  timeDimensions: [
    {
      dimension: 'Order.timestamp',
      granularity: 'hour',
      dateRange: [ /* value */ ],
    },
  ],
  timezone: /* value */,
};

As a headless platform, Cube is decoupled from any visualisations, so you can easily use any charting library or connect your existing dashboarding and reporting tools to present the data.

We decided to use the Recharts charting library, which is built for React. It is built on top of D3.js, another popular JS library, and ships with type definitions. It provides a wide variety of customisable components, and we chose the BarChart component for our widget. Since Cube integrates well with visualisation tools, we can use the retrieved result’s chartPivot method (queryResult.resultSet.chartPivot()) to pass chart-ready data to this component and visualise the data.
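Here is a rough sketch of how such a bar chart can be wired up with Recharts (the component name and styling are simplified for illustration, not our production code):

import { Bar, BarChart, ResponsiveContainer, Tooltip, XAxis, YAxis } from 'recharts';
import { ResultSet } from '@cubejs-client/core';

const GrossAmountBarChart = ({ resultSet }: { resultSet: ResultSet }) => (
  <ResponsiveContainer width="100%" height={300}>
    <BarChart data={resultSet.chartPivot()}>
      {/* chartPivot() returns rows like { x: '2023-01-31T13:00:00.000', 'Order.gross_amount': 42 } */}
      <XAxis dataKey="x" />
      <YAxis />
      <Tooltip />
      <Bar dataKey="Order.gross_amount" fill="#8884d8" />
    </BarChart>
  </ResponsiveContainer>
);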

To retrieve the data from Cube, we use the useCubeQuery hook from '@cubejs-client/react'.

As we are building a web app with a real-time dashboard, it’s important for us to deliver live data to our clients with as little latency as possible. To achieve that, we use the WebSocket transport.

On the backend side, we set the CUBEJS_WEB_SOCKETS=true environment variable in the .env file. On the client side, we initialise the Cube client by passing a WebSocketTransport instance to the cubejs constructor:

const cubeJsApi: CubejsApi = useMemo(() => cubejs({
  transport: new WebSocketTransport({
    authorization: /* idToken */,
    apiUrl: `${ process.env.NODE_ENV === 'development' ? 'ws' : 'wss' }://${ process.env.REACT_APP_API_URL }/cubejs-api/graphql`,
  }),
}), [ /* idToken */ ]);

We then pass this client object to useCubeQuery hook along with the query itself to query the data:

const queryResult = useCubeQuery(query /* e.g. queryGrossAmountChart */, {
  cubejsApi,
  subscribe: true, // subscribe to the WebSocket channel
});

And voilà! Our client receives real-time messages and updates its widgets whenever there is an update in our database (with a delay of about 5 seconds).
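For completeness, a typical way to consume the hook’s result inside a component looks roughly like this (a sketch; GrossAmountBarChart is the illustrative component from above):

import { Alert, CircularProgress } from '@mui/material';

const { resultSet, isLoading, error } = queryResult;

if (isLoading) return <CircularProgress />; // Material UI spinner while loading
if (error) return <Alert severity="error">{error.toString()}</Alert>;
if (!resultSet) return null;

return <GrossAmountBarChart resultSet={resultSet} />;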

As already mentioned, Cube’s access control is based on JWT tokens as the security context. Cube provides an example that shows how to generate a JWT on the Cube side and use it on the client side, and later replaces this flow with a better one: an integration with Auth0 as an identity provider.

In our use case, we chose a semi-custom authentication and authorization method. Our microfrontend integrates with the main backend (out of scope here) and receives a JWT (idToken) from it. The microfrontend then queries an Auth0 endpoint, passing it the Auth0 audience and the token itself, to check whether the token is valid. The token is then injected into the cubejs constructor when we initialise the Cube client (see the example above) and is sent to Cube’s backend in the Authorization header every time the user queries the data.

Cube allows you to implement your own JWT verification logic using the checkAuth function in Cube’s configuration file. This function is expected to parse the JWT from the Authorization header and, once the token is verified against Auth0 and decoded, set its claims on the security context (req.securityContext). This is exactly how we use it:

export const checkAuth = async (req, auth) => {
  if (process.env.NODE_ENV === 'development') {
    return;
  }

  const idToken = auth || req.get('Authorization');

  const decodedIdToken = await validateAndDecodeAuthenticationToken(idToken);

  // The req object has to be mutated directly: https://cube.dev/docs/security#custom-authentication
  req.securityContext = decodedIdToken;
}
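The validateAndDecodeAuthenticationToken helper above is our own. A minimal sketch of what such a helper can look like, using the jsonwebtoken and jwks-rsa libraries (the Auth0 environment variables are placeholders):

import jwt, { JwtHeader, SigningKeyCallback } from 'jsonwebtoken';
import jwksClient from 'jwks-rsa';

const client = jwksClient({
  jwksUri: `https://${process.env.AUTH0_DOMAIN}/.well-known/jwks.json`,
});

// Resolve the signing key matching the `kid` from the token header
const getKey = (header: JwtHeader, callback: SigningKeyCallback) => {
  client.getSigningKey(header.kid, (err, key) => {
    callback(err, key?.getPublicKey());
  });
};

export const validateAndDecodeAuthenticationToken = (idToken: string) =>
  new Promise((resolve, reject) => {
    jwt.verify(
      idToken.replace(/^Bearer /, ''), // the header may carry a "Bearer " prefix
      getKey,
      { audience: process.env.AUTH0_AUDIENCE },
      (err, decoded) => (err ? reject(err) : resolve(decoded)),
    );
  });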

We also use the queryRewrite function to amend incoming queries with filters. As a rule of thumb, this function should be used when you want to enforce row-level security within the same database for different users. Since the JWT comes from the main backend with claims that distinguish our clients (we call them merchants) and their stores (we call them establishments), we can use securityContext to get them. This way, we can be sure that each user will only see the data they have access to: data from a particular merchant and a particular establishment.

export const queryRewrite = async (query: Query, { securityContext }: RequestContext): Promise<Query> => {
  // logic to extract allowed merchants and establishments from security context

  query.filters?.push({
    member: 'Order.merchant',
    operator: 'equals',
    values: /* allowed merchants */,
  });

  query.filters?.push({
    member: 'Order.establishmentId',
    operator: 'equals',
    values: /* allowed establishments */,
  });

  return query;
}

This is how we implement row-level security.

To accelerate our queries, we also defined a preAggregations block with a rule in our Order cube:

cube(`Order`, {
  sql: `SELECT *
        FROM public.order`,
  // This refreshKey tells Cube to refresh data every 5 seconds
  // https://cube.dev/docs/caching#in-memory-cache-refresh-keys
  refreshKey: {
    every: `5 second`
  },
  preAggregations: {
    ordersByDate: {
      measures: [Order.count],
      timeDimension: Order.timestamp,
      buildRangeStart: {
        sql: `SELECT NOW() - interval '2 day'`,
      },
      buildRangeEnd: {
        sql: `SELECT NOW()`,
      },
      partitionGranularity: `day`,
      granularity: `hour`,
      refreshKey: {
        every: `5 seconds`,
        incremental: true,
        updateWindow: `1 day`,
      },
    },
  },
  ...
});

Pre-aggregations are materialized query results persisted as tables. Cube.js can analyze queries against a defined set of pre-aggregation rules in order to choose the optimal one to use for building the pre-aggregation table.

If Cube.js finds a suitable pre-aggregation rule, database querying becomes a two-stage process:

1. Cube.js checks whether an up-to-date copy of the pre-aggregation exists.

2. Cube.js executes the query against the pre-aggregated tables instead of the raw data.

Whenever a user’s query includes Order.count as a measure and Order.timestamp as a time dimension, the query is executed against a pre-aggregation table. We defined granular update settings here: the build range (buildRangeStart and buildRangeEnd) defines which partitions should be built during the scheduled refresh (data will never be queried beyond this range).
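For instance, a query shaped like the one below should be served from the ordersByDate pre-aggregation instead of the raw public.order table (assuming it matches the rule as described):

const queryServedFromPreAggregation: Query = {
  measures: [ 'Order.count' ],
  timeDimensions: [
    {
      dimension: 'Order.timestamp',
      granularity: 'hour', // matches the pre-aggregation's granularity
      dateRange: 'from 24 hours ago to now', // within the 2-day build range
    },
  ],
};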

We can also accelerate pre-aggregation build and refresh times by partitioning the data. Partitioning effectively divides the data between multiple tables based on some attribute. We defined partitionGranularity as day, which means Cube will generate a separate table for each day’s worth of data. In our case, these partitions are stored in our AWS S3 bucket in Parquet format. Once these tables are built, updateWindow will refresh the last 1 day of data every 5 seconds (without going beyond the provided build range). The granularity property with a value of hour allows Cube to aggregate the dataset to a single entry for each hour inside each table.

At first we tried to deploy our Cube infrastructure (backend) on AWS ECS, but we encountered some networking issues there and decided to use Kubernetes along with Helm charts. You can read Cube’s general overview of deployment, which components a Cube cluster includes, and their purpose here.

A quick word on scaling: since Cube’s storage layer is decoupled from the query processing engine, it’s relatively easy to achieve as much concurrency as you need by horizontally scaling the worker nodes of a Cube cluster (more on this here).

Let us know if you have similar issues or concerns about the deployment process, and we might expand on this topic in our next blog post!

Conclusion

This article is just a quick overview of how our team uses some of Cube’s features. Diving into every detail of how we implement the authentication and authorization process, or how we deploy our app on Kubernetes (with coverage of our Cube cluster infrastructure), would require much more effort and reading time. I hope you found something useful here, as I tried to cover the main concepts and provide links to the sources in place.

Anyway, all of the concepts mentioned and explained in this article made us realise that real-time data streaming can be quite a complex thing to do, but Cube makes getting it done much easier.

As for our next steps, we need to fully productionize our app and review how we would scale our infrastructure in production.

Feel free to reach out if you’re interested in us covering any related topic in more detail (we haven’t gone to prod yet, though)!
