Is this your next data logistics platform architecture?

Otrek Wilke · Published in CodeX · 6 min read · Oct 8, 2021

Digitalization moves on: more and more digital tools are implemented, and each of them produces data about your business. To create insight from all these data points, a flexible approach to a data logistics platform is needed.

The basic 4-tier architecture

Creating knowledge from data takes at least three steps: first, ingest data into the platform; second, integrate the various data sources; and third, present a unified, consumable, and understandable data model to your consumer systems.

Data is produced in various systems. Every system has a purpose, and since all-in-one systems tend to be difficult to understand and maintain, a variety of systems and platforms will be in use. The data produced in all of these systems needs to be ingested into a unified data platform. This is the job of the first layer, the data ingestion layer. It consists mainly of the staging areas and service connectors that need to be in place for a specific business.

This data then needs to be processed and merged into a unified data model, which happens in the data integration layer. Here the various data sources are filtered, transformed, and combined. Additionally, this layer often orchestrates the data ingestion and processing pipelines.

To have a stable data model for reporting, and to move data from one system to another, a data presentation layer is used. This layer is typically implemented as a data warehouse, a (REST) API, or “just” a database.

The fourth and last tier is the consumption layer, which can be a reporting/BI tool or a consuming system. This layer only consumes data from the underlying presentation layer and is therefore decoupled from the data pipelines. This increases the stability of reporting and consumption, while adding flexibility to the data integration and making the presentation layer versionable.

Basic schema of a data logistics platform architecture
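To make the separation tangible, here is a toy Python sketch of how the four layers hand data to each other. None of this maps to a specific product; the record fields are made up, and the point is only that consumers touch the presentation layer and nothing else.

```python
# Toy sketch of the 4-tier flow; each function stands in for a whole layer.

def ingest(sources: list[dict]) -> list[dict]:
    # Ingestion layer: land raw records from the source systems in a staging area.
    return [record for source in sources for record in source["records"]]

def integrate(raw: list[dict]) -> dict:
    # Integration layer: filter, transform, and combine into a unified data model.
    return {r["customer_id"]: r for r in raw if r.get("is_active")}

def present(model: dict) -> dict:
    # Presentation layer: expose a stable, versioned view of the unified model.
    return {"version": "v1", "customers": model}

def consume(presentation: dict) -> None:
    # Consumption layer: reports and downstream systems only see the presentation layer.
    print(f"{len(presentation['customers'])} customers in model {presentation['version']}")

consume(present(integrate(ingest([{"records": [{"customer_id": 1, "is_active": True}]}]))))
```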

Enough theory: how do you implement this architecture on a hyperscaler? Every concrete implementation will be a bit different, so let’s look at some tools for building such a platform.

Tools to implement this data logistics platform on Google Cloud

First, which tools are relevant for implementing a data integration platform as described above? Google offers the BigQuery service, which can be your go-to place to implement all four layers, as long as your data sources are available as connectors to Google BigQuery and you only want to implement one-directional data transport.

Otherwise, for data ingestion, there are tools like Google’s Cloud Storage, Cloud Functions, and Cloud Run to get data from a source system and ingest it into the data logistics platform.
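As a rough sketch, an HTTP-triggered Cloud Function could receive a payload from a source system and drop it into a Cloud Storage staging bucket. The bucket name and path layout below are assumptions made for illustration, not part of any real project.

```python
# Hypothetical HTTP-triggered Cloud Function: lands raw payloads in a staging bucket.
import json
from datetime import datetime, timezone

import functions_framework
from google.cloud import storage

STAGING_BUCKET = "my-raw-staging-bucket"  # assumed bucket name

@functions_framework.http
def ingest(request):
    payload = request.get_json(silent=True) or {}
    # Partition raw files by ingestion timestamp so later pipeline runs stay traceable.
    now = datetime.now(timezone.utc)
    blob_name = f"crm/{now:%Y/%m/%d}/{now:%H%M%S}.json"
    storage.Client().bucket(STAGING_BUCKET).blob(blob_name).upload_from_string(
        json.dumps(payload), content_type="application/json"
    )
    return ("stored", 201)
```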

For data integration, there is a wide variety of services available on GCP. Cloud Data Fusion and Cloud Composer are two services for orchestrating and implementing data transport and transformation. For big data workloads, Google’s BigQuery service might also be a valid choice, while streaming data calls for tools like Dataflow or Pub/Sub.
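Cloud Composer is managed Apache Airflow, so an integration pipeline can be expressed as a DAG. Here is a minimal sketch; the bucket, project, dataset, and table names are assumptions, and a real pipeline would of course do more than one aggregation.

```python
# Minimal Airflow DAG sketch for Cloud Composer: load staged files, then build the unified model.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    "integrate_crm",
    start_date=datetime(2021, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Move the staged raw files into a BigQuery staging table.
    load_staging = GCSToBigQueryOperator(
        task_id="load_staging",
        bucket="my-raw-staging-bucket",                        # assumed bucket
        source_objects=["crm/{{ ds_nodash }}/*.json"],
        destination_project_dataset_table="staging.crm_raw",   # assumed table
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_TRUNCATE",
    )

    # Transform and combine into the presentation-layer model.
    build_model = BigQueryInsertJobOperator(
        task_id="build_model",
        configuration={
            "query": {
                "query": "SELECT customer_id, SUM(amount) AS revenue "
                         "FROM staging.crm_raw GROUP BY customer_id",
                "destinationTable": {
                    "projectId": "my-project",                 # assumed project
                    "datasetId": "presentation",
                    "tableId": "customer_revenue",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )

    load_staging >> build_model
```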

The transformed and integrated data, with all its possible integration layers, can then be stored in a database like Cloud SQL, Bigtable, Cloud Spanner, or Firestore. Again, BigQuery might be the tool of choice here to store and present the integrated data.

To consume the data, the possibilities are endless: you can implement an API on top of the integrated data model and use this API in any of your business applications. Apigee is Google’s tool for API management.

For reporting on GCP, the choice is between Looker, Google Data Studio, and again BigQuery. Connecting other BI systems like Power BI is possible as well.

Example implementation on Google Cloud Platform

Tools to implement this data logistics platform on Microsoft Azure

Not a Google user? Well, let’s see how to implement a data logistics platform on Microsoft Azure. First, the data ingestion part: similar to GCP, there is Azure Functions to ingest data from complex APIs or data lakes. Speaking of data lakes, a data lake together with Azure Storage is the go-to solution to dump data from sources outside of Azure. For simple APIs and already available connectors, Azure Data Factory can be part of the data ingestion layer.

Power Apps can also be used to import data into the data logistics platform, though it is more of a solution for manual data entry.
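As a rough sketch, an HTTP-triggered Azure Function could land incoming payloads in a storage container that serves as the staging area. The container name, path layout, and app setting below are made up for illustration.

```python
# Hypothetical HTTP-triggered Azure Function: writes raw payloads to a staging container.
import json
import os
from datetime import datetime, timezone

import azure.functions as func
from azure.storage.blob import BlobServiceClient

def main(req: func.HttpRequest) -> func.HttpResponse:
    payload = req.get_json()
    # Partition raw files by ingestion timestamp, mirroring the GCP sketch above.
    blob_path = f"crm/{datetime.now(timezone.utc):%Y/%m/%d/%H%M%S}.json"
    client = BlobServiceClient.from_connection_string(
        os.environ["STAGING_STORAGE_CONNECTION"]  # assumed app setting
    )
    client.get_blob_client(container="raw-staging", blob=blob_path).upload_blob(
        json.dumps(payload), overwrite=True
    )
    return func.HttpResponse("stored", status_code=201)
```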

Azure Data Factory is mainly used as the tool to orchestrate data integration and data transformation pipelines. The transformations themselves can be implemented in Data Flows, in Databricks or Azure Batch, and in many cases directly within Azure Data Factory.

Hence, Azure Data Factory is the main tool to implement the integration layer. If the platform needs to handle streaming data, however, Azure Stream Analytics and Azure Event Hubs need to be taken into consideration; Databricks and HDInsight could be used here as well.
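If the transformations run in Databricks, one step of such a pipeline might look roughly like the PySpark sketch below. The storage paths, table, and column names are assumptions chosen to mirror the earlier examples.

```python
# Rough Databricks (PySpark) sketch: merge two staged sources into one unified customer model.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read raw files from the staging area (paths are assumptions).
crm = spark.read.json("abfss://raw-staging@mydatalake.dfs.core.windows.net/crm/")
orders = spark.read.json("abfss://raw-staging@mydatalake.dfs.core.windows.net/orders/")

# Filter, transform, and combine into the unified data model.
unified = (
    crm.filter(F.col("is_active") == True)
       .join(orders, on="customer_id", how="left")
       .groupBy("customer_id", "customer_name")
       .agg(F.sum("amount").alias("revenue"))
)

# Persist the result for the presentation layer (table name is an assumption).
unified.write.format("delta").mode("overwrite").saveAsTable("presentation.customer_revenue")
```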

The result of the second tier, the data integration, is again a unified data model. Storing this data model can be done using Azure SQL Database (or a variation of it), Azure Cosmos DB, or even an Azure Data Lake.

The third tier, the data presentation, can be implemented using Azure Synapse Analytics (the data warehousing solution on Azure), an Azure SQL Database dedicated to data presentation, or even via a (custom) REST API.
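A custom presentation API can be a thin, read-only layer over the presentation database. The following sketch uses FastAPI and pyodbc as one possible stack; the connection setting, query helper, and table name are assumptions, not a prescribed implementation.

```python
# Minimal sketch of a read-only presentation API (framework choice and names are assumptions).
import os

import pyodbc
from fastapi import FastAPI

app = FastAPI(title="Customer presentation API")

def query(sql: str, *params):
    # Connection string points at the presentation database (assumed to be Azure SQL).
    with pyodbc.connect(os.environ["PRESENTATION_DB_CONNECTION"]) as conn:
        cursor = conn.cursor()
        cursor.execute(sql, params)
        columns = [c[0] for c in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]

@app.get("/customers/{customer_id}/revenue")
def customer_revenue(customer_id: str):
    # Consumers only read from the stable presentation model, never from staging tables.
    return query(
        "SELECT customer_id, revenue FROM presentation.customer_revenue WHERE customer_id = ?",
        customer_id,
    )
```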

Power BI is Microsoft’s tool for BI, consuming data from Azure Synapse Analytics, an Azure SQL Database, or an Azure Cosmos DB. Many other connections are available in Power BI as well; though it is tempting, using Power BI for data integration is not recommended.

When presenting data via an API, Azure API Management should be considered for API management and security.

Example implementation on Azure

Afterthoughts

The 4-tier architecture and its implementation on a hyperscaler always come with the need for DevOps and CI/CD processes. Therefore, the infrastructure needs to be implemented via an IaC solution. Depending on the selected solution, this comes at the cost of a more or less complex CI/CD workflow and of writing tests that verify the state of the infrastructure, the configuration, and the data.

Also, the configuration of all services needs to be specified as code. The paradigm of declarative programming is useful when working with infrastructure, databases, and integration services. This paradigm is different from standard imperative development approaches, so a shift in thinking about development is needed.
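To make the declarative idea concrete, here is a tiny sketch using Pulumi’s Python SDK, which is only one of several IaC options and not something the architecture depends on; the resource names are assumptions. You declare the desired state, and the tool works out how to reach it.

```python
# Declarative IaC sketch with Pulumi (Python): describe the desired state, not the steps.
import pulumi_gcp as gcp

# Staging bucket for the ingestion layer (name and location are assumptions).
staging = gcp.storage.Bucket("raw-staging", location="EU")

# Dataset holding the presentation-layer tables.
presentation = gcp.bigquery.Dataset("presentation", dataset_id="presentation", location="EU")
```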

Have you already implemented a data integration platform? Would it help your business to create more knowledge from your data? Let me know in the comments!

Every business has its own special needs, and the list of services above is by no means comprehensive. If your business needs data logistics and a custom-fit implementation, feel free to get in contact.

And as always, if you liked the article or found it helpful, leave a clap.
