Data Management in Data Warehouses

Debashis Das
Feb 9, 2023


A data warehouse is a place where we keep all the data coming from different sources. Consider it a central repository for all data points, which can be used for analytics, machine learning, archiving, data enrichment, as well as data logging. In this blog I talk about how to handle different types of data sources, how we should store data, and the basic architectures for batch and real-time processing.

Data Sourcing

There are three types of data sources I can think of:

  1. Event-based
  2. File-based
  3. API-based

Event-based Sources

Events can be generated by multiple systems asynchronously, along with writes into their local DBs. Commonly people use an event bus to keep track of events, and whoever needs them can pick them up from the event bus. Event-bus tech like Kafka sends the warehouse different events carrying different data points. To visualise it, let’s talk about a loan creation process. When a customer applies for a loan, he is given a score by multiple agencies based on his past records, like banking information and his current assets. This is called a credit score. Based on his credit score he is offered an eligible loan amount: if a customer applies for a loan of $10,000 and his credit score is low, he might be eligible for only $5,000. Like this, a lot of intermediate events are generated until his loan is approved.

This generated data comes from multiple microservices, and other than applicationId all the data points are different, so it is manageable by an event bus. Let’s talk about ways we can publish this data into the event bus:

  1. Have a single schema for all the data points, with an event-type field that tells the current state of the application.
  2. Another approach would be to have multiple types of data with a status, but schema-less.
  3. We could also have a different topic for each of them, which adds more infra-based dependencies every time a new microservice is added. And having so many individual topics is crazy!! So I rule this option out.

Having a schema is very critical when it comes to data management, as data-level validations can be done at the schema level: you never get a NullPointerException, and a field cannot arrive as a String if it is defined as an Integer in the schema. This is one of its top benefits, and it is why data serialisers like Apache Avro exist: they validate the data provided by parsing it against the schema. If we get a parse exception, it means we have a data issue. So we can trust our data to be clean if it goes through schema-based parsing.
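To make this concrete, here is a minimal sketch of schema-level validation, assuming the fastavro Python library; the schema, field names, and sample event are hypothetical, not a real contract.

```python
from fastavro import parse_schema
from fastavro.validation import validate

# Hypothetical schema: one record type for all loan events,
# with an event_type field that carries the current state.
loan_event_schema = parse_schema({
    "name": "LoanEvent",
    "type": "record",
    "fields": [
        {"name": "application_id", "type": "string"},
        {"name": "event_type", "type": "string"},  # e.g. CREDIT_SCORED, APPROVED
        {"name": "approved_amount", "type": ["null", "int"], "default": None},
        {"name": "event_time", "type": "long"},    # epoch millis from the producer
    ],
})

event = {
    "application_id": "123",
    "event_type": "APPROVED",
    "approved_amount": 5000,
    "event_time": 1675900800000,
}

# validate() raises if a field is missing or has the wrong type,
# so bad records are caught before they reach the warehouse.
validate(event, loan_event_schema, raise_errors=True)
```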

File-based Sources

You can also have data dropped as files by 3rd-party services into blob stores like AWS S3 periodically, which is then consumed and stored in the data warehouse. Here we need triggers when the data becomes available, and this is where frameworks like Apache Airflow come in handy: we can track these upload events, download the files, and dump the values into our warehouse. Apache Airflow is a tool that works on the principle of DAGs (Directed Acyclic Graphs), where each node does some task and tasks can be linked to one another. We will talk about it more when we discuss data processing.

But can we expect the data to be clean? No, it will not be, so you might need a data-validation layer before storing it. One other thing we can do is talk to the 3rd-party teams, define a schema, and then parse the files using Apache Avro; if parsing fails, the data is not in the correct format. We can even have a common place like a git repo where we maintain these schemas and version them, so if a contract change happens, all the users of the repo come to know about it. This way contracts never break and are even versioned.
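As a rough sketch of how this wiring can look, assuming Airflow with the Amazon provider installed; the bucket, key pattern, and task bodies are hypothetical placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor


def download_and_validate(**context):
    # Placeholder: download the file from S3, parse it against the
    # agreed Avro schema, and fail the task on a parse error.
    ...


def load_to_warehouse(**context):
    # Placeholder: bulk-load the validated records into the warehouse.
    ...


with DAG(
    dag_id="third_party_file_ingest",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="partner-drops",          # hypothetical bucket
        bucket_key="loans/{{ ds }}/*.avro",   # hypothetical key pattern
        wildcard_match=True,
    )
    validate = PythonOperator(task_id="validate", python_callable=download_and_validate)
    load = PythonOperator(task_id="load", python_callable=load_to_warehouse)

    # Each task is a node in the DAG; >> links them in order.
    wait_for_file >> validate >> load
```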

API-based Sources

API sources don’t come into the picture until we need them. They may be the source of truth for some data points and can even be used for data enrichment.

Data Processing

After having a clear idea about sources, let’s talk about processing mechanisms. Data warehouses generally support both table-based relational data as well as raw formats like JSON. They generally don’t enforce foreign keys, and you can query inside unstructured data as well. So do we need to normalise/denormalise the data coming in JSON format into multiple tables? This question is still a mystery for me, but I lean towards taking out only the values we need from these JSONs and keeping the rest in raw format. The only issue with this approach is that you need to backfill when one more data point is needed in the future.
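For illustration, a minimal sketch of that approach: promote only the fields we query today into real columns and keep the full payload in a raw column. The field names here are hypothetical.

```python
import json

raw_payload = (
    '{"application_id": "123", "event_type": "APPROVED", '
    '"approved_amount": 5000, "extra": {"agency": "X"}}'
)

record = json.loads(raw_payload)

# Promote only the fields we query today into real columns; keep the full
# JSON in a raw column so new data points can be backfilled from it later.
row = {
    "application_id": record["application_id"],
    "event_type": record["event_type"],
    "approved_amount": record.get("approved_amount"),
    "raw": raw_payload,
}
```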

Now let’s talk about DB updates. Since we are managing data in a data warehouse, should we overwrite column values directly with new values? Suppose event A says the approved loan amount is $5,000, but later an event B comes which says the approved loan amount is $7,000. Should we change the value from $5,000 to $7,000 directly? No, never: as a data warehouse you need to know what happened, when, and why it happened. By updating this data in place you are deleting critical information, which can later be used by data scientists/analysts to derive insights. So if you don’t update it, what do you do? This is where data milestoning comes into the picture. Milestoning is a process of keeping track of data in a timeline format. You can milestone data in two ways: either by the time the event was generated or by the time the event was processed.

I would like to do both, as that is also critical information. Suppose an event is generated at time t1 and processed at time t2; t2 - t1 tells us about the processing lag, whether it was caused by Kafka, by our DAGs failing, etc. So this information is also critical. We are building a data warehouse where any data is information for us. It may not seem so now, but having a timeline-based structure helps us identify critical points in a distributed system.

Bitemporal Milestoning

This is a milestoning technique which helps us milestone data based on both when the event happened and when the event was processed. Let’s talk about the fields required for this. We can have event_start and event_end times for when the event happened, and system_start and system_end times for when the values were processed. Let’s take an example to understand how these values will be stored in our warehouse.

Insert:

(figure: bitemporal insert example)

Update:

(figure: bitemporal update example)

Here the date '9999-12-31T23:59:59 +0000' is used as an infinity date. Above you can see what happens when an update comes. When a new insert comes, we set event_start to the date we get from Kafka, i.e. when the event happened. event_end is set to the infinity date, which means it is the currently active record. system_start is the date the record was processed for warehouse insertion, and system_end set to infinity says it is the currently active record as per our system.

Now if you want to view the timeline of the data, you can check which records are currently active in our DB using application_id = '123' AND system_end = '9999-12-31T23:59:59 +0000' and do ORDER BY event_start/event_end. That gives you the whole set of changes the application went through from start to end. It is a little complex to understand; if you need help, contact me and we can discuss it.
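To make the mechanics concrete, here is a minimal Python sketch of bitemporal rows, using the loan example where the approved amount moves from $5,000 to $7,000. The in-memory table and helper functions are only an illustration, not a warehouse implementation.

```python
from datetime import datetime, timezone

# The '9999-12-31T23:59:59 +0000' infinity date from above.
INFINITY = datetime(9999, 12, 31, 23, 59, 59, tzinfo=timezone.utc)

table = []  # stand-in for the warehouse table


def record_event(application_id, approved_amount, event_time):
    """Milestone a new event: close the open row on the event timeline,
    then insert the new row as active on both timelines."""
    processed_at = datetime.now(timezone.utc)
    for row in table:
        if (row["application_id"] == application_id
                and row["event_end"] == INFINITY
                and row["system_end"] == INFINITY):
            row["event_end"] = event_time  # this version stops being true at the new event
    table.append({
        "application_id": application_id,
        "approved_amount": approved_amount,
        "event_start": event_time, "event_end": INFINITY,       # when it happened
        "system_start": processed_at, "system_end": INFINITY,   # when we processed it
    })


def timeline(application_id):
    """All rows we still believe (system_end = infinity), ordered by event time."""
    rows = [r for r in table
            if r["application_id"] == application_id and r["system_end"] == INFINITY]
    return sorted(rows, key=lambda r: r["event_start"])


record_event("123", 5000, datetime(2023, 2, 1, tzinfo=timezone.utc))
record_event("123", 7000, datetime(2023, 2, 5, tzinfo=timezone.utc))
for row in timeline("123"):
    print(row["approved_amount"], row["event_start"], row["event_end"])
```

Running this prints two rows: the $5,000 version closed on the event timeline at the time of the second event, and the $7,000 version still open until infinity.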

Data Enrichment

Now you have at least the raw data cleaned and milestoned. Next comes business-level validation. Business-level validations are the validations you do on top of the data so that there is no issue with data integrity. Let me explain with an example: you approve a loan of $1,000, but when a transaction happens you see a withdrawn amount of $2,000, which is not possible if the approved loan is $1,000. This simple example shows how important business-level validations are when it comes to data. Next, if you want to check the data against some value which doesn’t exist in the warehouse and is managed by a different microservice, we do API sourcing: fetch those values, validate the current set of values against them, and if we need a few fields extracted from the response, we can enrich our data based on it.
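A minimal sketch of such a check and the enrichment step, assuming a hypothetical fetch_customer client for the microservice that owns the customer data:

```python
def validate_withdrawal(approved_amount: int, withdrawn_amount: int) -> None:
    """Business-level validation: you cannot withdraw more than was approved."""
    if withdrawn_amount > approved_amount:
        raise ValueError(
            f"withdrawn {withdrawn_amount} exceeds approved {approved_amount}"
        )


def enrich(record: dict, fetch_customer) -> dict:
    """API sourcing for enrichment: fetch_customer is a hypothetical client
    call to the microservice that owns the customer data."""
    customer = fetch_customer(record["application_id"])
    validate_withdrawal(record["approved_amount"], record["withdrawn_amount"])
    return {**record, "customer_segment": customer.get("segment")}
```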

Architecture

I want to give two designs here: the first for batch-based processing and the second for real-time processing, as users might need either depending on their needs.

Design 1 : Batch processing using Airflow

(figure: batch processing architecture)

Here tasks are individual nodes of Airflow’s DAG structure. We can link them however we want and process them accordingly, as sketched below.
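For example, assuming Airflow 2.3+ (where EmptyOperator is available), a tiny hypothetical DAG that fans out into parallel validations and fans back in:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="batch_design",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = EmptyOperator(task_id="extract")
    validate_schema = EmptyOperator(task_id="validate_schema")
    validate_business = EmptyOperator(task_id="validate_business")
    milestone = EmptyOperator(task_id="milestone")

    # Fan out into parallel validations, then fan back in before milestoning.
    extract >> [validate_schema, validate_business] >> milestone
```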

Design 2: Real-time stream processing using Flink

(figure: stream-processing architecture)

Here we take raw streams and use a sink to store the raw messages, and in parallel process them however we want based on requirements. Flink also has extra features like sliding windows, which is a cherry on top. However, doing an API call during a Flink job makes it synchronous. To avoid this, we can back these values into our warehouse beforehand (which makes the DB call synchronous, but with lower latency), and the most efficient option would be to store them on the event bus, create a parallel stream, and join the streams while processing.
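A minimal PyFlink sketch of the “store raw, process in parallel” idea, with a bounded in-memory source standing in for the Kafka source and print sinks standing in for real sinks; the parsing logic is a hypothetical placeholder.

```python
import json

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Stand-in for the Kafka source: raw loan events as JSON strings.
raw_stream = env.from_collection([
    '{"application_id": "123", "event_type": "APPROVED", "approved_amount": 5000}',
    '{"application_id": "123", "event_type": "APPROVED", "approved_amount": 7000}',
])

# Branch 1: sink the raw messages untouched (printed here; in practice a
# warehouse or object-store sink).
raw_stream.print()


# Branch 2: parse and process the same stream in parallel.
def to_row(message: str) -> str:
    event = json.loads(message)
    return f'{event["application_id"]},{event["event_type"]},{event["approved_amount"]}'


raw_stream.map(to_row).print()

env.execute("raw_and_processed")
```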

Conclusion

Data management is not as simple as it seems; it takes a lot of components in place to store clean data without losing any information. You may be unknowingly deleting metadata which can be very critical when scale increases. So think before you implement components. In my next blog I will talk about how to store data if you have schema-less data.

Keep Designing !!

