How we built a master dataset to transform 700B data points into an economic index

juanjolainez · CreditorWatch · Feb 19, 2022 · 5 min read

This article is the first in a series of four in which we’ll cover, end to end, how we transformed 700 billion data points into an index that tracks credit risk across Australia.

We will start from the very bottom, with raw data sourcing, make our way through regular databases, leap into the realm of big data and finish in the complicated world of machine learning.

This series of articles should give any engineer and/or data engineer a (hopefully entertaining) journey through how to create complex, end-to-end, data-centric projects, covering the basics and adding complexity until we finish with a very advanced enterprise-level data project involving dozens of technologies.

The first thing we’ll define is the difference between OLTP and OLAP, because the two are closely related and, in a data project of this magnitude, both are used.

OLTP stands for Online Transaction Processing. An OLTP system should be able to handle a large number of requests and respond very quickly. The queries we see in these systems are usually simple, touch few records and are served by indexes, so an OLTP system is specialized in running a certain type of operation. An example would be any of the websites we use every day, such as e-commerce. In our case, it would be our website and APIs, where we serve credit profiles and assess risk in real time, in seconds or milliseconds.

OLAP, on the other hand, stands for Online Analytical Processing. It is used for complex analysis, predictions and aggregates over big datasets. In this case, response times are expected in seconds or minutes, there’s usually not much traffic and the queries are large aggregates over thousands or millions of data points. The business risk index would be an example of an OLAP platform.
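To make the contrast concrete, here is a minimal sketch (Python with boto3, not our production code) of the two access patterns. The table names, keys and query are hypothetical.

```python
import boto3

# OLTP: a point lookup served in milliseconds, e.g. fetching a single credit
# profile by its key from DynamoDB (hypothetical table and key names).
dynamodb = boto3.resource("dynamodb")
profiles = dynamodb.Table("credit-profiles")
profile = profiles.get_item(Key={"abn": "12345678901"}).get("Item")

# OLAP: a large aggregate over millions of rows, e.g. an Athena query against
# the data lake. Response times in seconds or minutes are acceptable here.
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="""
        SELECT region, AVG(risk_score) AS avg_risk
        FROM data_lake.risk_scores   -- hypothetical table
        GROUP BY region
    """,
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```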

OLAP projects run on data, and most of the time that data is generated in the OLTP layer. This is also the case with our business index. So in this first article, we will show how we build the master dataset¹.

In this article, we’ll talk exclusively about our OLTP platform and how we ingest and store the data to be used later by our OLAP platform, pointing at key concepts that are vital to creating a robust dataset.

Our OLTP platform consists of a number of micro-services² (around 30) that follow Domain-Driven Design³. This means that each micro-service owns a slice of information it is responsible for ingesting, processing and serving. This allows us, as a company, to increase our scalability in terms of teams, since each team is assigned a small number of micro-services to maintain, scale, expand and monitor. The micro-service approach has proven vital to our team growth: we’ve been doubling the team size every year without major losses in productivity or other issues, such as the need to synchronize deployments or breaking other teams’ functionality.

The data for each micro-service comes from many different sources (emails, APIs, SFTP drops, manual uploads, …), so the ingestion processes are quite heterogeneous⁴.

Some micro-services will need to use some data from other micro-services, which leads us to the data-sharing problem, probably one of the biggest problems of micro-services architectures and distributed systems⁵.

Most of the time, and for simple real-time data sharing, we use endpoints that surface that data. If that’s not enough and we need a larger set of records, we use batch endpoints to request multiple data points at once and reduce the number of calls the micro-services need to make and receive.
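As an illustration, a per-record call versus a batch call could look like the sketch below; the service URL, paths and payloads are made up for the example.

```python
import requests

BASE = "https://risk-profile-service.internal"  # hypothetical internal service

# Real-time sharing: one endpoint call per record.
profile = requests.get(f"{BASE}/companies/12345678901", timeout=2).json()

# Batch sharing: many records in a single round trip, reducing the number of
# calls micro-services have to make and receive.
batch = requests.post(
    f"{BASE}/companies/batch",
    json={"abns": ["12345678901", "98765432109"]},
    timeout=10,
).json()
```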

The problem is much more difficult once we want to perform joins with a dataset that is entirely in another database.

To solve this issue and share data between micro-services, we use a technique called Event Data Pumping². It consists of triggering an event every time a data point changes in any of the micro-services. That event can then be caught by one or many micro-services, which can do whatever they see fit with the data.
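In its simplest form, the producing side of the pump is just a service emitting a change event onto a stream whenever one of its records changes. The sketch below uses Kinesis via boto3; the stream name and event shape are assumptions for illustration, not our exact implementation.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def publish_change_event(entity_type: str, entity_id: str, payload: dict) -> None:
    """Emit a change event so any interested micro-service can react to it."""
    event = {"entity_type": entity_type, "entity_id": entity_id, "payload": payload}
    kinesis.put_record(
        StreamName="data-change-events",         # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=entity_id,                  # keeps events for one entity in order
    )
```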

An example is our RiskScore. The RiskScore uses dozens of different data points, and we want to be sure that we always have the latest RiskScore on file in our OLTP platform. So the RiskScore service listens to every single one of the events that might change its value and, every time one is received, it re-computes the RiskScore with the newly updated data. Similarly, when a new RiskScore is generated, it creates an event that gets picked up by our alerting system and, if that RiskScore belongs to a company in the portfolio of any of our customers, an alert is created.
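A hedged sketch of what such a listener could look like as a Lambda function subscribed to the stream is shown below; the event shape, stream names and scoring helper are hypothetical placeholders.

```python
import base64
import json
import boto3

kinesis = boto3.client("kinesis")

def recompute_risk_score(entity_id: str) -> dict:
    # Placeholder for the real scoring logic over the latest data points.
    return {"entity_id": entity_id, "risk_score": 0.0}

def handler(event, context):
    for record in event["Records"]:
        # Kinesis delivers the payload base64-encoded inside the Lambda event.
        change = json.loads(base64.b64decode(record["kinesis"]["data"]))
        score = recompute_risk_score(change["entity_id"])

        # Publishing the new score as its own event lets the alerting system
        # (or any other listener) pick it up downstream.
        kinesis.put_record(
            StreamName="risk-score-events",          # hypothetical stream name
            Data=json.dumps(score).encode("utf-8"),
            PartitionKey=change["entity_id"],
        )
```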

And this is how we can go from ingesting a datapoint to alerting a user in a matter of a few seconds.

All this was a small introduction to probably the most relevant fact of this first part: every event gets persisted in our data lake. This means that every single data change in our platform is documented. We will see later in the series how this is extremely relevant, not just for auditing and debugging, but also for building our master dataset.

Simplifying it a bit and following our AWS implementation, it would look something like this:

Recapping on that diagram: our micro-services (left) ingest (or change) data. This data gets stored in DynamoDB, which triggers two Lambda functions⁶ via DynamoDB Triggers.

One Lambda function backs the event up into our data lake (implemented on S3 using Parquet) and the other publishes the event to Kinesis Data Streams, a managed data-streaming service, which then re-routes it to the listeners of that particular stream.
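As a rough sketch of the backup path, the Lambda below receives DynamoDB Stream records and writes them to the S3 data lake as Parquet. The bucket name, key layout and the use of pyarrow are illustrative assumptions (in practice the function would need pyarrow packaged, e.g. as a Lambda layer).

```python
import io
import json
from datetime import datetime, timezone

import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # hypothetical bucket name

def handler(event, context):
    # Each stream record carries the keys and new image of the changed item.
    rows = [
        {
            "event_name": r["eventName"],  # INSERT / MODIFY / REMOVE
            "keys": json.dumps(r["dynamodb"]["Keys"]),
            "new_image": json.dumps(r["dynamodb"].get("NewImage", {})),
        }
        for r in event["Records"]
    ]

    # Write the batch as a single Parquet file, partitioned by date in the key.
    buffer = io.BytesIO()
    pq.write_table(pa.Table.from_pylist(rows), buffer)

    now = datetime.now(timezone.utc)
    key = f"events/dt={now:%Y-%m-%d}/{now:%H%M%S%f}.parquet"
    s3.put_object(Bucket=BUCKET, Key=key, Body=buffer.getvalue())
```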

A listener doesn’t necessarily need to be another micro-service; it can be a real-time analytics platform (such as QuickSight) or any other application or data store, such as Elasticsearch.

As a summary, we’ve identified the most relevant players in our OLTP platform:

+ Microservices:

  • Ingest and broadcast changes of a small domain of information

+ Event Driven Architecture:

  • Architecture that allows us to share information between micro-services via events in semi-real time

And we’ve highlighted the most relevant facts:

+ Microservices follow a Domain Driven Design

+ Microservices communicate with each other using:

  • Endpoints
  • Batch Endpoints
  • Events

+ Every data change triggers an event

+ Every event is persisted in our data lake

In our next article, we’ll cover probably the biggest tech leap in our history: we’ll explain how we built our OLAP platform, a key step in building our business risk index, and how it relates to our OLTP platform.

Sources:

[1] Nathan Marz and James Warren. 2015. Big Data: Principles and Best Practices of Scalable Realtime Data Systems (1st ed.). Manning Publications Co., USA.

[2] Sam Newman. 2015. Building Microservices (1st ed.). O’Reilly Media, Inc.

[3] https://martinfowler.com/bliki/DomainDrivenDesign.html

[4] https://medium.com/creditorwatch/microprocesses-a-new-architectural-design-pattern-for-background-jobs-on-a-microservice-172a8a19ba8f

[5] https://medium.com/creditorwatch/join-data-among-microservices-1cda360c6c1c

[6] https://medium.com/creditorwatch/aws-lambda-facts-you-wish-to-know-before-processing-2-billion-lambda-executions-2021-78fe77183c80
