Designing More Flexible and Scalable DataVault Components — SATs (Part 2)

Cesar Segura
SDG Group

--

Data Vault is undoubtedly one of the best candidates to consider when determining which methodology to adopt in your data architecture. At its core, Data Vault offers an agile way to design and build efficient data warehouses.

The main objective of a satellite is to store the detailed information of the business concepts and their relationships/transactions from your data sources in an auditable way. Data sources typically evolve: new sources appear, others stop delivering data, field names change, and countless other circumstances arise. For this reason, our Data Vault system and its components must be not only flexible but also scalable, efficient and maintainable. In this part we will pay special attention to satellites, since they change far more often than HUBs do. We will leave aside record-tracking, effectivity and similar satellites, whose structure rarely changes.

In the scenarios below, we will dive into several design approaches for Data Vault models and the challenges associated with each of them. Our end goal is to apply a good design with the correct approach, ensuring that our Data Vault model is efficient, scalable and adaptable to our data architecture.

Satellites — Approach and design

When the need arises to implement the satellites around our HUBs or LINKs, several questions come to mind:

  • How many satellites of detailed information do we need?
  • How much detail should we manage in a satellite?
  • What criteria should I follow to design them?
  • How could I evolve them?

The methodology is to segment the information of our component along three fundamental axes: Functionality, Security and Cadence. This allows us to organize our data in the most efficient, secure, manageable and least redundant way possible.

Usually we tend to put everything into a single satellite, and this is our first error: by default it is not in line with best practices. First, we segment by Functionality; here the separation into information subdomains, based on our company reference model, comes into play. Second, for each Functionality we define the level of Security, and finally we split the resulting satellites by their different Cadences. This whole process is called FSC segmentation.

FSC segmentation

As we can see in the following diagram, the three main guidelines in satellite segmentation are:

FSC Segmentation

As an example, for our CUSTOMER use case (Business Concept), let's do the FSC satellite segmentation at a high level.

Functionality:

Security:

Cadence:

How well we adapt to changes depends on how we segment our information across satellites, how our information sources evolve and how flexible our DV architecture is. In those cases it would be hard to manage the changes in our satellites manually. For this reason it is very important to rely on a metadata system to manage all these operations easily and efficiently; a metadata-driven DataOps approach is especially interesting in a Data Vault architecture.

Satellite Evolution — changes in the information structure related to business concepts and their relationships

Once we have done our FSC segmentation, we must think about the different scenarios for implementing a satellite and, most importantly, about the way in which satellites will evolve in our architecture.

Although we could expand on each satellite type, we are going to talk about a standard satellite (one that contains detailed information); most of what follows can be transferred, with minor adjustments, to the other satellite types.

Initial SCENARIO

The SAT_CUSTOMER_SHOPPING_HABIT_S1_D satellite has already been defined by FSC at the lowest level. Our satellite hangs off HUB_CUSTOMER; you can check the load mapping in the following diagram (pay attention to the satellite part of the diagram):

Initial Scenario for SAT_CUSTOMER_SHOPPING_HABIT_S1_D Satellite

The previous diagram shows the different information mappings used to load data into our RDV (Raw Data Vault) architecture, and defines how each field is generated.

On the left side, we have the PSA/NPSA stages where the different data sources are located; the same layout applies to the scenarios that follow. There is one information table per CUSTOMER country, one for Spain and the other for Italy. Both have a similar schema and number of fields, differing only in the field names. We will focus on the attribute mapping (SAT).

On the right side, we describe the format of our RDV structures (HUB and satellite):

  • HUB_ID_CUSTOMER: Hash ID of the concatenation of the BK (Business Key) + tenant(s).
  • COD_TENANT: Multitenancy; in this case we apply only a single level, but there can be more than one level.
  • The attributes, defined as follows:
    -> SW_VISITED_SHOP_LAST_7D: Attribute 1
    -> DT_VISIT_LAST_SHOP: Attribute 2
  • EDW_DT_LOAD: System incorporation date.
  • EDW_HASH_DIFF: Hash field generated by concatenating all attributes. This field makes it easy to detect changes in your information (comparing the current hash with the previous one stored) before inserting new records.
  • RECORD_SOURCE: Tracks the origin of our data.
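To make the structure concrete, here is a minimal DDL sketch of this satellite in Snowflake SQL. It is only an illustration: the schema name (RDV) and the column data types are assumptions, not taken from the original model.

    -- Hypothetical DDL for the initial satellite (schema name and types are illustrative)
    CREATE TABLE IF NOT EXISTS RDV.SAT_CUSTOMER_SHOPPING_HABIT_S1_D (
        HUB_ID_CUSTOMER          VARCHAR(32)   NOT NULL,  -- hash of the BK + tenant(s)
        COD_TENANT               VARCHAR(10)   NOT NULL,  -- tenant code (e.g. SP, IT)
        SW_VISITED_SHOP_LAST_7D  BOOLEAN,                 -- attribute 1
        DT_VISIT_LAST_SHOP       DATE,                    -- attribute 2
        EDW_DT_LOAD              TIMESTAMP_NTZ NOT NULL,  -- system incorporation date
        EDW_HASH_DIFF            VARCHAR(32)   NOT NULL,  -- hash of all concatenated attributes
        RECORD_SOURCE            VARCHAR(100)  NOT NULL   -- origin of the data
    );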

Among the scenarios that we will plan, we rule out the option of creating a new satellite for the new fields (complementing the previous one at the same FSC level). There may be further changes, and we do not want to add a new one- or two-field satellite to our architecture every time; that situation would force us to completely refactor our BDV (PITs). It is not a scalable option.

The scenarios that we propose deal with a new source of information (the company merging with another). They also apply to other cases, for example when an existing source of information adds or changes fields within the same FSC segment of our satellite.

We will zoom into the satellite mapping; the HUB mapping will be the same for all three scenarios.

SCENARIO Single(Attribute) Satellite

It is based on the principle of storing each attribute in its own field. The load compares the new hash diff (built by concatenating all attributes) with the most recent one stored, and inserts a new record only if they differ.
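As a minimal sketch of that delta check (the staging table STG.CUSTOMER_SHOPPING_HABIT and the use of MD5 as the hash function are assumptions), the load could look like this in Snowflake SQL:

    -- Illustrative delta load: insert only rows whose hash diff differs from the last stored one
    INSERT INTO RDV.SAT_CUSTOMER_SHOPPING_HABIT_S1_D
    WITH stg AS (                          -- hypothetical staging view over PSA/NPSA
        SELECT
            HUB_ID_CUSTOMER,
            COD_TENANT,
            SW_VISITED_SHOP_LAST_7D,
            DT_VISIT_LAST_SHOP,
            RECORD_SOURCE,
            MD5(CONCAT_WS('||',
                COALESCE(TO_VARCHAR(SW_VISITED_SHOP_LAST_7D), ''),
                COALESCE(TO_VARCHAR(DT_VISIT_LAST_SHOP), ''))) AS EDW_HASH_DIFF
        FROM STG.CUSTOMER_SHOPPING_HABIT
    ),
    last_sat AS (                          -- latest record per customer already in the satellite
        SELECT HUB_ID_CUSTOMER, EDW_HASH_DIFF
        FROM RDV.SAT_CUSTOMER_SHOPPING_HABIT_S1_D
        QUALIFY ROW_NUMBER() OVER (PARTITION BY HUB_ID_CUSTOMER ORDER BY EDW_DT_LOAD DESC) = 1
    )
    SELECT
        s.HUB_ID_CUSTOMER,
        s.COD_TENANT,
        s.SW_VISITED_SHOP_LAST_7D,
        s.DT_VISIT_LAST_SHOP,
        CURRENT_TIMESTAMP() AS EDW_DT_LOAD,
        s.EDW_HASH_DIFF,
        s.RECORD_SOURCE
    FROM stg s
    LEFT JOIN last_sat l ON l.HUB_ID_CUSTOMER = s.HUB_ID_CUSTOMER
    WHERE l.HUB_ID_CUSTOMER IS NULL            -- brand-new customer
       OR l.EDW_HASH_DIFF <> s.EDW_HASH_DIFF;  -- attributes changed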

In this scenario our mission is:

  • Incorporate new Tenant Information.
  • Incorporate the new AT (Attribute) into our existing SAT (Satellite): it will add the COD_TYPE_LAST_BUYMENT field. Check the FS3 field behavior in the HUB.

Single(Attribute) Satellite Scenario

The main challenges of Single AT will be refactoring our existing processes and structure.

  • It is unavoidable that we alter the physical structure of our satellite by adding a new field.
  • We will have to refactor our current process in order to manage the new field (see the sketch after this list).
  • The current process may also have to assign either null values or pre-established defaults where the new field does not apply; in our case, for Spain / Italy.
  • This scenario requires significant DataOps / DevOps coordination effort, which means more risk.
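A hedged sketch of that refactoring in Snowflake SQL (the data type of the new field and the default used for the tenants where it does not apply are assumptions):

    -- Illustrative refactoring for the Single AT scenario
    ALTER TABLE RDV.SAT_CUSTOMER_SHOPPING_HABIT_S1_D
        ADD COLUMN COD_TYPE_LAST_BUYMENT VARCHAR(20);   -- type is an assumption

    -- The load process must now also feed the new column, e.g. with a default
    -- for the tenants (SP / IT) where the attribute does not apply:
    --   COALESCE(src.COD_TYPE_LAST_BUYMENT, 'N/A') AS COD_TYPE_LAST_BUYMENT
    -- ...and the EDW_HASH_DIFF expression must be rebuilt to include the new attribute.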

SCENARIO Composed(Attribute) Satellite

It is based on the principle of storing all ATs in one single field, concatenating their values in a specific order.

In this scenario our mission is:

  • Incorporate new Tenant Information.
  • Incorporate the new AT into our existing SAT: the new AT information will be managed in the current DESC_COMP_ATTRIBUTES field, stored together with the existing attributes.

Check the FS3 field behavior flowing into DESC_COMP_ATTRIBUTES in the diagram below:

Composed(Attribute) Satellite

The main challenge of Composed AT is that the methods for storing, accessing and labeling our attributes become more sophisticated.

  • All attributes always have to be stored in a specific order for each tenant.
  • We need a function that lets us access a certain position of our Composed AT field to retrieve the value of our new AT.
  • It can directly impact access performance. Imagine that we have many fields in the same satellite: a function that parses the string to extract each attribute by position needs significant compute resources (think of every attribute at every position, across many satellites).

Example of an accessing function (Snowflake): SPLIT_PART(<string_with_composed_AT>, <delimiter>, <position_to_retrieve>)

  • Refactoring Order vs Labeling - Concept:

If we want to extend this method to all of our satellites, we have to keep two factors in mind whenever our satellites change: Refactoring Order vs Labeling. Refactoring Order means keeping a homogeneous order in the field across all tenants. Labeling arises when we end up with many different aliases for the same position of the field value, depending on the tenant.

  • Refactoring Order — example:

For this satellite:

Top 3 rows extracted from our satellite

The Composed AT has a single type of labeling. Although the COD_TYPE_LAST_BUYMENT field only applies to the UK tenant, you can re-use the same access logic because all the fields exist in the same order.

Query on our satellite, accessing our attributes
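For illustration, the query behind the previous caption might look like the sketch below; the '|' delimiter and the positions of the attributes are assumptions.

    -- Illustrative access to the Composed AT field by position
    SELECT
        HUB_ID_CUSTOMER,
        COD_TENANT,
        SPLIT_PART(DESC_COMP_ATTRIBUTES, '|', 1) AS SW_VISITED_SHOP_LAST_7D,
        SPLIT_PART(DESC_COMP_ATTRIBUTES, '|', 2) AS DT_VISIT_LAST_SHOP,
        SPLIT_PART(DESC_COMP_ATTRIBUTES, '|', 3) AS COD_TYPE_LAST_BUYMENT  -- only meaningful for the UK tenant
    FROM RDV.SAT_CUSTOMER_SHOPPING_HABIT_S1_D;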

If we wanted to add 2 new fields to the existing satellite (for the SP + IT tenants) without modifying our business query, we would have to refactor our process and the history of our data:

Check that the previous rows for our tenants have been changed!

The previous query will still work:

If we want to select the new fields in our query:
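A hypothetical version of that query, assuming the two new fields were appended at positions 4 and 5 after the refactoring (the new field names are placeholders):

    -- Illustrative extension of the previous query after the refactoring
    SELECT
        HUB_ID_CUSTOMER,
        COD_TENANT,
        SPLIT_PART(DESC_COMP_ATTRIBUTES, '|', 1) AS SW_VISITED_SHOP_LAST_7D,
        SPLIT_PART(DESC_COMP_ATTRIBUTES, '|', 2) AS DT_VISIT_LAST_SHOP,
        SPLIT_PART(DESC_COMP_ATTRIBUTES, '|', 3) AS COD_TYPE_LAST_BUYMENT,
        SPLIT_PART(DESC_COMP_ATTRIBUTES, '|', 4) AS NEW_FIELD_1,   -- placeholder name
        SPLIT_PART(DESC_COMP_ATTRIBUTES, '|', 5) AS NEW_FIELD_2    -- placeholder name
    FROM RDV.SAT_CUSTOMER_SHOPPING_HABIT_S1_D;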

  • Labeling — example:

Now imagine that not 100% of the fields exist in every tenant. Say there are 20 different field values, with a discrepancy of 1 or 2 fields between tenants: there are always a few fields that sit in the same position but whose meaning is different, or that simply do not exist.

In our scenario, a variation of the previous satellite, position 2 has a different business meaning for each tenant:
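A sketch of what that tenant-dependent labeling looks like when consuming the data; the positions and the attribute chosen here are invented purely to illustrate the problem:

    -- Illustrative labeling: the same business attribute must be resolved differently per tenant
    SELECT
        HUB_ID_CUSTOMER,
        COD_TENANT,
        CASE COD_TENANT
            WHEN 'SP' THEN SPLIT_PART(DESC_COMP_ATTRIBUTES, '|', 2)  -- lives at position 2 for Spain
            WHEN 'IT' THEN SPLIT_PART(DESC_COMP_ATTRIBUTES, '|', 3)  -- lives at position 3 for Italy
            ELSE NULL                                                -- does not exist for the UK
        END AS DT_VISIT_LAST_SHOP
    FROM RDV.SAT_CUSTOMER_SHOPPING_HABIT_S1_D;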

This turns one single field value position into 3 different meanings. You will have one labeling per tenant in the same satellite, i.e. 3 label methods in total for access. This increases the complexity of managing the meaning of every position for each tenant. If you add more tenants, or more than one divergent position, it can become hard to manage the consumption of the information. Rearranging the values is discarded, because we do not want to refactor the current information.

If this happens, the Data Governance team will want to have a word with you about that system.

  • As an alternative to concatenating fields, you can use an array if your technology supports it. You will face similar difficulties, but you may improve the efficiency of information access (see the sketch below).
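A minimal sketch of the array alternative in Snowflake; the DESC_ARRAY_ATTRIBUTES column is hypothetical:

    -- Building an attribute array instead of a concatenated string
    SELECT ARRAY_CONSTRUCT(SW_VISITED_SHOP_LAST_7D, DT_VISIT_LAST_SHOP) AS DESC_ARRAY_ATTRIBUTES
    FROM STG.CUSTOMER_SHOPPING_HABIT;

    -- Accessing a position of the array (0-based) instead of using SPLIT_PART
    SELECT DESC_ARRAY_ATTRIBUTES[1]::DATE AS DT_VISIT_LAST_SHOP
    FROM RDV.SAT_CUSTOMER_SHOPPING_HABIT_S1_D;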

SCENARIO Flexible(Attribute) Satellite

It is based on the principle of storing all ATs in one single field in a semi-structured format. The most commonly used format is JSON.

In this scenario our mission is:

  • Incorporate new Tenant Information.
  • Incorporate the new AT into our existing SAT: the new AT information will be managed in the current DESC_FLEX_ATTRIBUTES field, stored together with the existing attributes. Check the FS3 behavior flowing into DESC_FLEX_ATTRIBUTES.
    In the following diagram, we show how our architecture should behave:

Flexible(Attribute) Satellite

The main challenge of Flexible AT is enriching our process to handle the new AT.

  • Our process will have to implement semi-structured logic. Depending on the technology you use, there are special functions that ease this logic.
  • We will store that information in an appropriate field format. For example, in Snowflake you can use VARIANT.
  • The main advantage of using a VARIANT to store JSON values is that the order of the field values, and how you store them, does not matter: they will always be stored in the same way. You can see this on Snowflake below (see the sketches after this list):

The order doesn't matter! The hash will be the same!

  • We can use that advantage to implement our EDW_HASH_DIFF field from our DESC_FLEX_ATTRIBUTES.
  • The labeling complexity disappears: we query the field we want to access, and if it doesn't exist we simply get a null value.

Simple query on the top 3 rows of our satellite

Query accessing our attributes in the Flexible Satellite
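Hedged sketches of those points in Snowflake SQL; HASH() is used for the demonstration, and the attribute names are taken from the diagrams:

    -- Key order does not matter: the VARIANT is stored canonically, so the hash matches
    SELECT
        HASH(PARSE_JSON('{"SW_VISITED_SHOP_LAST_7D": true, "DT_VISIT_LAST_SHOP": "2023-05-01"}')) AS hash_a,
        HASH(PARSE_JSON('{"DT_VISIT_LAST_SHOP": "2023-05-01", "SW_VISITED_SHOP_LAST_7D": true}')) AS hash_b;
    -- hash_a = hash_b

    -- Using that property to derive EDW_HASH_DIFF from the flexible field
    SELECT HASH(DESC_FLEX_ATTRIBUTES) AS EDW_HASH_DIFF
    FROM RDV.SAT_CUSTOMER_SHOPPING_HABIT_S1_D;

    -- Accessing attributes: a missing key simply returns NULL, so no labeling is needed
    SELECT
        HUB_ID_CUSTOMER,
        COD_TENANT,
        DESC_FLEX_ATTRIBUTES:SW_VISITED_SHOP_LAST_7D::BOOLEAN AS SW_VISITED_SHOP_LAST_7D,
        DESC_FLEX_ATTRIBUTES:DT_VISIT_LAST_SHOP::DATE         AS DT_VISIT_LAST_SHOP,
        DESC_FLEX_ATTRIBUTES:COD_TYPE_LAST_BUYMENT::STRING    AS COD_TYPE_LAST_BUYMENT  -- NULL for SP / IT
    FROM RDV.SAT_CUSTOMER_SHOPPING_HABIT_S1_D;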

Considerations for Satellites Scenarios

Watch out for the order in which you store your AT fields: it must always be the same if you want to re-use them to build your hash diff field, and your process should manage that. In addition, in the Composed case it will also be important to identify the correct attributes for each tenant.

  • Composed AT: Refactoring Order vs Labeling adds extra complexity to the current information process. The order of the attributes used to build your composed attribute is important. The decision you take will affect scalability, efficiency and governance.
  • Composed AT: In any case, it will noticeably affect the performance of information access.
  • Flexible AT: It easily manages changes in your sources with no impact on your DV structure. It is also easy to generate your EDW_HASH_DIFF from that field (as long as the technology automatically orders the values in the same way).
  • Flexible AT: It does not significantly affect access performance. In any case, I recommend you check your performance resources; it will depend on the structure of your satellite and how the information is currently accessed. If Snowflake is your database, monitor your warehouse.
  • Single AT: It does not really affect performance, but it adds a significant refactoring burden to our model, with its associated risk.

Comments: If you want to know more about this series, please follow the next parts:

--


Cesar Segura
SDG Group

SME @ SDG Group || Snowflake Architect || Snowflake Squad Spotlight Member || CDVP Data Vault