Designing More Flexible and Scalable DataVault Components — Evaluation (Part 3)

Cesar Segura
SDG Group

--

DataVault is undoubtedly one of the strongest candidates to consider when deciding which methodology to adopt in your data architecture. At its core, DataVault offers an agile way to design and build efficient data warehouses.

However, managing and defining data within a data architecture comes with a number of challenges and questions. For example:

  • What happens when we work with several countries/corporate segments that need to provide information to our model?
  • What happens if, suddenly and unexpectedly (as often happens), we need to incorporate the same business concept but with a completely different paradigm? (HUB scenarios in Part 1)
  • How should I distribute fields in the satellite? How many fields should I use? (FSC Segmentation in Part 2)
  • How do we design our components when our source systems periodically incorporate new information and we want to manage the number of satellites efficiently? (SAT scenarios in Part 2)
  • What is the best strategy to make my Raw DataVault model more scalable and efficient, with the lowest cost of evolution in my data architecture?

In the previous stories, we dived into some of the various design approaches for DataVault models and the challenges associated with them.
In addition, throughout this series we have walked through some of the main parts you have to take into consideration in a DataVault architecture: its main pillars, HUBs and Satellites.

  • We have reviewed different scenarios and how to implement them.
  • We also saw how to implement a good satellite strategy so that our satellites can evolve in a scalable way.
  • And finally, in this part we evaluate all the different design scenarios, with their pros and cons, to conclude which one is best and why.

Evaluation of the different Evolution Scenarios

In order to assess which scenario is best, we should first ask: what are the common goals when dimensioning an architecture? Let's think about the SET acronym. It stands for MAXIMIZE Scalability and Efficiency, and REDUCE the "T", which means Total Operating Cost (TOC). TOC is in turn segmented into three pillars: Cost, Infrastructure and Maintenance. Together, these are base factors used in many data and IT architectures.

Below is a brief definition of each one:

Scalability: We highlight this aspect because it allows our model to scale with as little impact as possible: the model can grow without being affected by changes in the approach under study. When there are multiple sources of information, when many more emerge over time, and when even the existing ones evolve and frequently incorporate new information, we need an architecture whose model allows adaptation with the least possible impact. We will heavily penalize any need to change physical structures.

Efficiency: This is an important factor to consider, if not one of the most important, when it comes to managing and incorporating new information, as well as consuming it. Having the data directly in columnar form gives us higher performance, which translates into lower cost and less demanding requirements for our architecture.

Cost: Here we group the factors that mainly drive cost: the compute cost of the data management processes, the cost required to provide adequate comfort for the consumption of information, and the cost of maintaining our processes caused by the dispersion of different complexities.

Infrastructure: The requirements needed, at the technology/hardware level, to support the DataVault architecture defined in our scenario. Our DV components will need special treatment for semi-structured information, and we must be able to scale our resources quickly, whether because the amount of information increases suddenly or because we need more computational power to finish queries and balance their new complexity.

Maintenance: Having homogeneous data component structures (processes, data management, data objects… ) will ease the governance of our data architecture, allowing us to modify, analyze and consume our data easily. For example, we don't want a large number of processes with dispersed logic, making it harder to understand which case applies to each one. Another point to consider is how quickly we can identify the business information in our system, and how we apply that information in each case when we want to exploit our data.
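To make the scoring concrete before we build the matrices, here is a minimal sketch, in Python, of how the five factors could be captured per scenario. The structure, field names and the 0-10 scale are illustrative assumptions on my part, not part of the article's matrices.

```python
from dataclasses import dataclass

@dataclass
class ScenarioEvaluation:
    """Hypothetical container for one scenario's SET factor scores (0-10)."""
    name: str
    scalability: float     # maximize: higher is better
    efficiency: float      # maximize: higher is better
    cost: float            # TOC pillar: lower is better
    infrastructure: float  # TOC pillar: lower is better
    maintenance: float     # TOC pillar: lower is better

# Illustrative placeholder values, not the article's actual matrix scores.
single = ScenarioEvaluation("Single", scalability=5, efficiency=6,
                            cost=3, infrastructure=3, maintenance=6)
flexible = ScenarioEvaluation("Flexible", scalability=9, efficiency=8,
                              cost=4, infrastructure=6, maintenance=3)
```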

Below is the matrix Evaluation Comments Detail for SET:

Matrix Evaluation Comments Detail for SET

Below is the matrix Evaluation Category Detail for SET:

Once we have the different considerations from the previous matrix, we will build a category evaluation matrix for the different factors.

We will keep in mind that TOC scoring is based on "lower is better".

Evaluation Category Detail for SET

Below is the matrix Evaluation Category Summarized for SET:

Our final goal is to obtain a summarized evaluation, with a final score, of which scenario is better.

To summarize in final SET terms, we will assign the Scoring Points Matrix below:

Evaluation Matrix Score

Applying that scoring, we get the following:

Evaluation Category Detail for SET with scoring

We will compute the indicators below:

Scalability vs Efficiency
This is the average: (Scalability + Efficiency) / 2

TOCp (TOC from a positive perspective)
Since lower TOC is better, we invert it so that higher TOCp is better: 10 - (Cost + Infrastructure + Maintenance) / 3

SET: Scalability vs Efficiency vs TOCp
This is the overall average: (Scalability + Efficiency + TOCp) / 3
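As a quick sanity check of the three indicators above, here is a minimal sketch in Python. The formulas are the ones just defined; the scores passed in are hypothetical placeholders on a 0-10 scale, not the article's actual matrix values.

```python
def scalability_vs_efficiency(scalability: float, efficiency: float) -> float:
    # Average of the two "maximize" factors.
    return (scalability + efficiency) / 2

def tocp(cost: float, infrastructure: float, maintenance: float) -> float:
    # TOC pillars are "lower is better", so we invert: higher TOCp is better.
    return 10 - (cost + infrastructure + maintenance) / 3

def set_score(scalability: float, efficiency: float,
              cost: float, infrastructure: float, maintenance: float) -> float:
    # Overall SET: average of the "maximize" factors and the inverted TOC.
    return (scalability + efficiency
            + tocp(cost, infrastructure, maintenance)) / 3

# Hypothetical Flexible-style scores:
print(scalability_vs_efficiency(9, 8))      # 8.5
print(round(tocp(4, 6, 3), 2))              # 5.67
print(round(set_score(9, 8, 4, 6, 3), 2))   # 7.56
```

A scenario that scales and consumes efficiently while keeping its TOC pillars low will therefore converge toward a high SET score.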

Applying the “Evaluation Score” from the Evaluation Matrix Score:

Summarized Evaluation SET for the different Scenarios

Conclusions

On many occasions we use the Single scenario or, failing that, add a new component so as not to alter the current DV architecture. But as we have seen, using a Flexible scenario brings a lot of benefits. The SET of the Flexible scenario can give any company the confidence it needs, even though it requires more demanding infrastructure than Single. The Flexible scenario will be able to scale without any problem, and will even be able to process semi-structured information without altering the current DV structure. If you make a good FSC segmentation and apply a Flexible scenario, you will get scalable DataVault components in your Raw architecture.

Finally, and regardless of the analysis factors we base ourselves on, I will always recommend using DataOps techniques in your DataVault architecture. Metadata-driven DataOps will improve the SET of your data architecture.

In my 15+ years of experience, SET is usually the most pressing concern in the data projects of the companies I have collaborated with. I am sure you agree with me on the SET aspect, don't you?

If you want to know more about this series, please follow the next parts.

--


Cesar Segura
SDG Group

SME @ SDG Group || Snowflake Architect || Snowflake Squad Spotlight Member || CDVP Data Vault