Data Analytics Reference Architecture

Venkatesh Shanmyganathan
May 29, 2023 · 5 min read

--

Context

To achieve the end goal of analytics and beyond (ML and AI), we need to start from the basics: where data originates, how to manage and model it, and how to ingest it into the analytics pipeline.

This article presents one way in which it may be thought through.

Logical Architecture

Let us start from the top of the architecture diagram and move down.

Following are the ways in which systems and applications may supply data for analytics:

  • Origin systems or applications with a web or mobile front end use Google Analytics or Firebase to instrument clickstream data. There are many articles, developer notes, and guides for working with Google Analytics, Google Tag Manager, and Firebase.
  • Applications that have backend services or middleware, on the other hand, may instrument for analytics using the Domain-Oriented Observability design pattern from martinfowler.com, explained in the linked article.

Moving on, the data produced by Google Analytics may be backed up to Google BigQuery using the standard export that Google Cloud provides.

The domain object probes are sent to a messaging system, in this case Google Pub/Sub. More details about this are in the next section.

We attach a stream processor: Google Dataflow, a managed service for Apache Beam pipelines. It receives inbound messages from Pub/Sub and performs three key functions:

  • transform data
  • compute metrics
  • store data

The Dataflow pipeline stores the transformed data in Google BigQuery and the computed metrics in Google Bigtable. More details follow in later sections.

Analytical Data Services (the one circled in the picture above) is a simple REST-based API that queries Bigtable, which is essentially a data mart because it stores computed data.

BigQuery itself becomes a modern data warehouse, and a visualization tool such as Tableau or Power BI may be used to explore the data.

Instrumentation of Domain Objects

Typical instrumentation in services or microservices needs to be expressive and observable, but at the same time decoupled from the existing functional code. The objective is to define domain events and pass them on without littering the codebase with crufty, verbose calls. The underlying requirement is to add business-relevant observability in a clean, testable way.

What to Observe?

For a service that deals with the functional management of domain objects, instrumentation for “Observability” means a combination of the following:

  • High-level business key performance indicators (KPIs).
  • Business/domain metrics that track events such as Access Code Redeemed, Product Subscribed, etc.

To capture these, the domain objects are instrumented with code that uses the data already available in the respective microservice.

Probing the Domain Objects

A Domain Probe is a design pattern that enables us to add observability to domain logic while still talking in the language of the domain.

A Domain Probe presents a high-level instrumentation API that is oriented around domain semantics and encapsulates the low-level instrumentation plumbing required to achieve Domain-Oriented Observability. Services can therefore add observability without getting into the distracting details of the instrumentation technology.

Domain Probes should also allow tests to be conducted independently of the functional management of domain objects.

In an event-oriented design for the Domain Observability API, rather than making a direct method call into the instrumentation, the domain object emits Domain Observation events (called Announcements) that announce its progress to any interested observer.
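
Below is a minimal Java sketch of this idea. The probe interface, the Pub/Sub-backed implementation, and the SubscriptionService names are all illustrative assumptions, not code from the pattern's write-up; the point is only that the domain object announces progress in domain language and never touches the instrumentation technology directly.

import java.time.Instant;

// The probe speaks the language of the domain, not of the instrumentation technology.
interface SubscriptionProbe {
    void productSubscribed(String productId, Instant when);
    void subscriptionFailed(String productId, Exception cause);
}

// One concrete probe publishes Announcements (here just printed; real publishing code elided).
// Another implementation could write to a log or a metrics registry without touching domain logic.
class PubSubSubscriptionProbe implements SubscriptionProbe {
    @Override
    public void productSubscribed(String productId, Instant when) {
        System.out.printf("announce: ProductSubscribed %s at %s%n", productId, when);
    }

    @Override
    public void subscriptionFailed(String productId, Exception cause) {
        System.out.printf("announce: SubscriptionFailed %s (%s)%n", productId, cause.getMessage());
    }
}

// The domain object announces its progress; it knows nothing about Pub/Sub or metrics APIs.
class SubscriptionService {
    private final SubscriptionProbe probe;

    SubscriptionService(SubscriptionProbe probe) {
        this.probe = probe;
    }

    void subscribe(String productId) {
        try {
            // ... functional domain logic (persist the subscription, etc.) ...
            probe.productSubscribed(productId, Instant.now());
        } catch (RuntimeException e) {
            probe.subscriptionFailed(productId, e);
            throw e;
        }
    }

    public static void main(String[] args) {
        new SubscriptionService(new PubSubSubscriptionProbe()).subscribe("prod-42");
    }
}

In tests, a fake SubscriptionProbe can simply record the announcements it receives, which keeps the observability behaviour verifiable independently of the messaging technology.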

Aspect-Oriented Programming (AOP)

Now that we have the observations and the domain objects, and coming back to the requirement of removing this noise from the code entirely, in microservices we turn to Aspect-Oriented Programming (AOP). AOP is a paradigm that extracts cross-cutting concerns, such as observability, from the main code flow. An AOP framework modifies the behavior of a microservice by injecting logic that is not directly expressed in the source code. This is also called meta-programming: we annotate the source code with metadata that controls where that cross-cutting logic is injected and how it behaves.
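
As a rough illustration of how such injection can work, here is a small Java sketch using a custom annotation and a JDK dynamic proxy; a real setup would more likely rely on Spring AOP or AspectJ, and the @Announce annotation, OrderService interface, and event names are hypothetical, not from the article.

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Proxy;

// Metadata that marks which domain methods should emit an announcement.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Announce {
    String value(); // the domain event name to announce
}

interface OrderService {
    @Announce("ProductSubscribed")
    void subscribe(String productId);
}

class OrderServiceImpl implements OrderService {
    public void subscribe(String productId) {
        // Functional domain logic only; no instrumentation calls here.
        System.out.println("subscribed to " + productId);
    }
}

public class ObservabilityAspectDemo {
    // Wrap any service so that @Announce-annotated methods emit a domain observation.
    @SuppressWarnings("unchecked")
    static <T> T withAnnouncements(Class<T> iface, T target) {
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(),
                new Class<?>[] {iface},
                (proxy, method, methodArgs) -> {
                    Object result = method.invoke(target, methodArgs);
                    Announce announce = method.getAnnotation(Announce.class);
                    if (announce != null) {
                        // Here the event would be handed to a Domain Probe / Pub/Sub publisher.
                        System.out.println("announce: " + announce.value());
                    }
                    return result;
                });
    }

    public static void main(String[] args) {
        OrderService service = withAnnouncements(OrderService.class, new OrderServiceImpl());
        service.subscribe("prod-42"); // prints the domain output, then the announcement
    }
}

The functional code in OrderServiceImpl stays free of instrumentation calls; the cross-cutting announcement logic lives entirely in the injected proxy.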

Having said this, the message data itself may be modeled using Activity Streams.
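
For instance, assuming the W3C Activity Streams 2.0 vocabulary, a Product Subscribed announcement might be expressed as an "Add" activity along these lines (the Product type, URNs, and values are illustrative, not mandated by the vocabulary):

{
  "@context": "https://www.w3.org/ns/activitystreams",
  "summary": "Alice subscribed to a product",
  "type": "Add",
  "published": "2023-05-29T10:15:00Z",
  "actor": {
    "type": "Person",
    "id": "urn:example:person:alice",
    "name": "Alice"
  },
  "object": {
    "type": "Product",
    "id": "urn:example:product:prod-42"
  },
  "target": {
    "type": "Collection",
    "id": "urn:example:person:alice:subscriptions"
  }
}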

Stream Processing

As noted earlier, Google Dataflow, a managed service for Apache Beam pipelines, receives inbound messages from Pub/Sub and performs three key functions:

  • Transforming Data: this involves transformations that rename certain attributes and/or change the structure of the data. For example, suppose an origin system is streaming information about Person entities using a structure like this:
{
  "first_name": "Alice",
  "last_name": "Smith",
  "home_phone": "555-123-4567",
  "work_phone": "555-987-6543"
}

We may want to transform such records into a structure that is consistent with the canonical model as shown in the next snippet:

{
  "givenName": "Alice",
  "familyName": "Smith",
  "contactPoint": [
    {
      "contactType": "Home",
      "telephone": "555-123-4567"
    },
    {
      "contactType": "Work",
      "telephone": "555-987-6543"
    }
  ]
}
  • Computing Metrics: every analytics solution operates on two different kinds of data, raw facts and aggregations.

A raw fact is a record published by an origin system which describes something that is meaningful to the business such as a single event or a single transaction. An aggregation summarizes information from a collection of raw facts. For instance, an aggregation might count the number of events within a given time period, or sum up the sales within a given geography.

A Dataflow pipeline is responsible for performing these aggregations; a sketch of such a pipeline follows this list.

  • Storing Data: raw facts are stored in the BigQuery warehouse, and the computed metrics are stored in the Bigtable data mart. The data mart design follows an OLAP cube model.
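
The sketch below shows, using Apache Beam's Java SDK, how the three functions above might hang together in one streaming pipeline. The subscription, table names, and helper methods are assumptions for illustration, and the Bigtable write for the computed metrics is left out; this is not the article's actual pipeline code.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class AnalyticsStreamPipeline {

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Inbound domain observation messages from Pub/Sub (subscription name is illustrative).
    PCollection<String> messages = pipeline.apply("ReadFromPubSub",
        PubsubIO.readStrings().fromSubscription(
            "projects/example-project/subscriptions/domain-events"));

    // 1. Transform data: map origin attributes onto the canonical model
    //    (e.g. first_name -> givenName); toCanonicalJson is a hypothetical helper.
    PCollection<String> canonical = messages.apply("ToCanonicalModel",
        MapElements.into(TypeDescriptors.strings())
            .via((String json) -> toCanonicalJson(json)));

    // 2. Compute metrics: count events per event type over fixed one-hour windows.
    PCollection<KV<String, Long>> hourlyCounts = canonical
        .apply("HourlyWindows", Window.<String>into(FixedWindows.of(Duration.standardHours(1))))
        .apply("KeyByEventType", MapElements.into(TypeDescriptors.strings())
            .via((String json) -> extractEventType(json)))
        .apply("CountPerEventType", Count.perElement());
    // hourlyCounts would then be written to the Bigtable data mart (omitted here).

    // 3. Store data: append the raw facts to the BigQuery warehouse.
    canonical.apply("WriteRawFacts", BigQueryIO.<String>write()
        .to("example-project:analytics.raw_facts")
        .withFormatFunction((String json) -> new TableRow().set("payload", json))
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    pipeline.run();
  }

  // Hypothetical helpers; a real pipeline would parse and restructure the JSON payload.
  private static String toCanonicalJson(String originJson) { return originJson; }

  private static String extractEventType(String canonicalJson) { return "DomainEvent"; }
}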

Analytics Data Services

As mentioned earlier, the analytics data services are REST data services that query the data mart. The following are sample requests:

  • Get one Cell from the Cube
GET /{realm}/object? 
platform=98uu3&
intervalStart=2019-12&
durationUnit=Month&
location=urn_of_the_locn
  • Get a range of cells over a specific time span
GET /{realm}/object? 
platform=98uu3&
intervalStart.minInclusive=2019-01&
intervalStart.maxInclusive=2019-12&
durationUnit=Month&
location=urn_of_the_locn
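
One simple way such an API can serve these queries is to map the query parameters onto a Bigtable row key, so that a single cell becomes a point read and a time span becomes a row-range scan. The key layout below is purely an assumed example; the article does not specify the actual schema.

public class MetricKey {

    // Compose a Bigtable row key for one cell of the cube from the query parameters.
    static String rowKey(String platform, String location, String durationUnit, String intervalStart) {
        return String.join("#", platform, location, durationUnit, intervalStart);
    }

    public static void main(String[] args) {
        // GET /{realm}/object?platform=98uu3&intervalStart=2019-12&durationUnit=Month&location=urn_of_the_locn
        String cell = rowKey("98uu3", "urn_of_the_locn", "Month", "2019-12");
        System.out.println(cell); // 98uu3#urn_of_the_locn#Month#2019-12

        // A range query (intervalStart.minInclusive / maxInclusive) becomes a row-range scan
        // between two keys, e.g. with the Cloud Bigtable client's readRows(Query) API.
        String start = rowKey("98uu3", "urn_of_the_locn", "Month", "2019-01");
        String end = rowKey("98uu3", "urn_of_the_locn", "Month", "2019-12");
        System.out.println(start + " .. " + end);
    }
}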

What next?

The stream processor may be extended into further analytics pipelines, such as TensorFlow, to include ML and AI algorithms.
