Middle out your Data Strategy on Data Vault and Apache Iceberg


If we were to swap out each component of a data warehouse over time until every component was replaced, is it still the same data warehouse? — adapted from Theseus’ Paradox

Something profound is happening across the data analytics landscape; yet another open-source project is leading commercial enterprises to adopt an open-source standard, and this time it is a table format called Apache Iceberg. We discussed how you can easily use Snowflake to build and query a data vault on Iceberg here; the focus of this article is the wider implication for your cloud data architecture strategy, whether you're using data vault or not.

Despite the computing industry being decades old, the fundamental components of computing remain the same:

  • A method to process instructions (code) — CPUs, GPUs.
  • A need to store data ephemerally or permanently — memory and disk storage.
  • A communication medium to transmit data and instructions — network.
  • Software to manage it all — operating system, protocols, drivers, security and so on.

Fast forward to today and these components are available for rent on the cloud, software is scaled out and containerised as microservices, and there have been attempts to scale out analytical data as well. What we have learnt, however, is that data has special considerations before it can scale out:

“Data is not a Microservice” — Chad Sanderson. bit.ly/3VBanNt

Data is not the new oil either; data has properties that are not comparable to oil:

  • Unlike oil, data has no scarcity. If it is shared then others will have copies of that data.
  • Data on its own has no value unless it is given context and refined to extract meaningful insight, whereas oil has intrinsic value as raw material.
  • Data has sensitivity and requires classification for its use.
  • Data is ephemeral; it can easily be destroyed out of existence. Oil cannot be deleted, only transformed into another state (gasoline, plastics, clothing, etc.).
  • The value of data depreciates sharply over time and is only as useful as it is accurate and enriched with more data.
  • Data is inherently biased.

Data captures state information about business objects and their interactions at a point in time; to analyse that data, however, it must be accumulated and stored in a columnar data store. The momentum in the data analytics industry today has settled on a common table format, and that format is Apache Iceberg.

Apache Iceberg, Polaris and the myriad of compatible tools

The beauty of Apache Iceberg is in its management of files. The alternatives are either:

  • Tying into a particular vendor's proprietary block-based storage, which is expensive and difficult to maintain, with nuanced administrative tasks to manage block-based data, referential integrity and storage reclamation.
  • Resorting to cumbersome file management exposed as table constructs through a metastore, as Apache Hive has done.

What made Big Data accessible to the wider pool of data professionals was the introduction of Hive and Spark SQL, turning Big Data into a Big Data Warehouse; but both are band-aids designed to apply relational algebra as table semantics over data organised into files and managed through a metastore. Hive and Spark SQL are limited in their ANSI SQL compliance, and what makes Iceberg particularly attractive is that it imposes no technical debt on the SQL that uses the data. Users working on data stored as Apache Parquet, for example, must include a part_date column in their SQL queries; a table modelled as an SCD Type 2 dimension and deployed as a parquet table does not perform well because a query across time must scan every file that constitutes the parquet table. Platform administrators are forced to repartition a table intended to track only true changes; something Databricks performs for you automatically, but not for free.

The point is, with Apache Iceberg analysts can work with the analytical data dimensionally instead of incorporating technology limitations into their SQL. This brings Big Data table structures on par with the foundational relational algebra that SQL is based on and that vendor SQL semantics already provide. In other words, Apache Iceberg provides uninterrupted SQL (or debt-free SQL) through hidden partitioning.
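To make hidden partitioning concrete, here is a minimal sketch in Spark SQL against an Iceberg catalogue; the catalogue, schema and table names (lake.sales.dim_customer_history) are illustrative assumptions, not from the original article:

```sql
-- Hidden partitioning: the partition is derived from a business column,
-- so queries never need to reference a technical part_date column.
CREATE TABLE lake.sales.dim_customer_history (
  customer_id   BIGINT,
  customer_name STRING,
  valid_from    TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(valid_from));   -- partition transform is hidden from query authors

-- Analysts filter on the business column; Iceberg prunes data files automatically.
SELECT customer_id, customer_name
FROM lake.sales.dim_customer_history
WHERE valid_from >= TIMESTAMP '2024-01-01 00:00:00';
```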

The momentum behind Apache Iceberg ultimately means the industry is settling on a table format for the AI Data Cloud, one that is universally agreed on and that all the vendors that matter are embracing. The game has shifted from going all-in on a particular vendor to choosing a vendor for each fundamental computing component of the data warehouse (CPU, memory, storage, networking) based on what that vendor does well.

The Apache Iceberg committee has recognised this trend and has, in response, developed a REST API specification so that Iceberg catalogues remain portable across vendor tools at any scale.

Apache Polaris is built on this specification.

Apache Polaris

Layers of an Iceberg

The Apache Iceberg documentation includes a REST API specification, and Apache Polaris (a project donated by Snowflake) defines a data catalogue on that specification that includes:

  • Defining catalogues (databases & schemas) to organise Iceberg tables and views under custom and nested namespaces up to 32 levels deep managed by Polaris or by an external catalogue provider.
  • Connectivity configuration defined as service principals.
  • Access control via catalogue roles (assigned privileges) and principal roles (grouping service principals and users).
  • Tables and views that map to Iceberg tables and views.

Apache Polaris can be hosted on the infrastructure of your choosing and is, of course, available within Snowflake's secure architecture as a managed open catalogue.
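As a rough sketch of what working against a REST-based catalogue like Polaris looks like from an engine, the Spark SQL below assumes a session already configured with an Iceberg REST catalogue hypothetically named polaris, and that both the engine and the catalogue support nested namespaces; all names are illustrative:

```sql
-- Organise tables under (optionally nested) namespaces managed by the catalogue.
CREATE NAMESPACE IF NOT EXISTS polaris.finance;
CREATE NAMESPACE IF NOT EXISTS polaris.finance.data_vault;   -- nested namespace

-- Tables created here map to Iceberg tables registered in the catalogue.
CREATE TABLE polaris.finance.data_vault.raw_transactions (
  transaction_id STRING,
  amount         DECIMAL(18, 2),
  transaction_ts TIMESTAMP
)
USING iceberg;
```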

The folly of avoiding vendor lock-in

Avoiding vendor lock-in is an admirable trait in an IT architect. Doing so obstinately, however, ignores the strengths for which that vendor was chosen in the first place, or would qualify ahead of other vendors. For example, dbt does not support Snowflake's streams, but it does support Snowflake's dynamic tables. Because dbt did not adopt Snowflake streams, dbt customers using Snowflake as a data platform must develop a cumbersome framework for managing Snowflake stream offsets if they choose to use them. By applying a strategy of being data platform agnostic, dbt can never be a master of any data platform.
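For context, this is roughly the kind of stream-offset handling that is left to the customer; a minimal sketch in Snowflake SQL, with the stg_orders source table, the orders_history target and their columns all assumed for illustration:

```sql
-- A stream records changes on the source table since its creation (or last consumption).
CREATE OR REPLACE STREAM stg_orders_stream ON TABLE stg_orders;

-- Consuming the stream inside DML advances its offset; wrapping the DML in a
-- transaction keeps the offset advance and the load atomic.
BEGIN;
INSERT INTO orders_history (order_id, order_status, order_ts, load_ts)
SELECT s.order_id, s.order_status, s.order_ts, CURRENT_TIMESTAMP()
FROM stg_orders_stream s
WHERE METADATA$ACTION = 'INSERT';   -- stream metadata column describing the change
COMMIT;
```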

Consider:

  • Apache Iceberg is an open-source project, and by adopting Apache Iceberg as your chosen table format you are committing to an open-source solution. Is that any different from locking into a data platform or vendor tool?

Yes, it is different: you do not get any software support, and you must rely on the open-source community and the Apache Foundation to prioritise change requests and bug fixes. Recall that this ticket has still not been resolved in Hive and Spark SQL respectively (I first mentioned this particular shortcoming back in 2019).

  • Choosing a cloud service provider (CSP: AWS, Azure, GCP) locks you into how that CSP provides its infrastructure. Is it worth the cost and replication of going multi-cloud?

That depends on the proximity of the CSP's regions to your customers, your regulatory needs, and which of the CSP's features you deem more attractive than others.

The trend today is to not adopt a single vendor's capabilities for all components of your computing platform; thus, you should consider a mix of fundamental computing components that serve as:

  • A processing engine (compute) for analytical workloads and machine learning.
  • Data medium (storage) best suited for your use cases.
  • Channels for transferring data (network) as in batched and streaming workloads and the connectivity of custom and vendor tools to your data.
  • Orchestration software to manage and automate it all which depending on the service model may even include managing an operating system.

A goal of solution architects is to consider how easy it would be to swap any of these components out. A solution architect grades a tool's stickiness (its options): the lower the stickiness, the easier it is to swap out components of your data architecture. Open-source and vendor tools are therefore graded on how easy they are to remove. Commercial tools, from their perspective, must continue to show ongoing business value so that solution architects would not even consider an alternative vendor for that fundamental computing component.

A great resource for thinking about vendor lock-in, and about data strategy metaphorically as options, is the work published by Gregor Hohpe, which you will find here:

If you want to know what popcorn analytics is about, please take a moment to read the article “The OBT Fallacy & Popcorn Analytics” here: bit.ly/3Pfhbgz

“Uncertainty increases an option’s value” — Gregor Hohpe

Onward to part two of this article…

Data Vault Data Products

“Non-openness and collecting rents impede the success of a standard, because it impedes adoption.” — Auren Hoffman and Will Lansing

What we saw above is the decentralisation of technology that was once concentrated in one vendor tool; each tool is chosen for what it is good at, and therefore the goal for vendor tools is interoperability. Leveraging these components in a data architecture is what we will discuss next.

Data vault has only three table types, and they are modelled on the three elements every business looks for in its data:

  • Business objects / business entities — how to uniquely identify a business object and how to describe it (glossary). Hub tables.
  • The interactions between business objects, whether a transaction, a business event or another activity recorded from business processes. Link tables.
  • The information state of business objects and their interactions. Hub and link satellite tables.
Data vault model building blocks

Independently, each table is not a data product; the product of these tables combined, however, supports consumer data products. Aggregations of data vault tables support a data product either horizontally, by joining data together, or vertically, by providing the corporate memory that we can replay at any point in time and that a downstream data product can leverage.
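For readers new to the three table types, here is a minimal, hedged sketch of their typical shape; table and column names are illustrative and simplified (real implementations add hashing standards, multi-part keys and so on):

```sql
-- Hub: the unique list of business keys for a business object.
CREATE TABLE hub_customer (
  hub_customer_hk STRING,      -- hash of the business key
  customer_id     STRING,      -- business key
  load_ts         TIMESTAMP,
  record_source   STRING
);

-- Link: the unique list of interactions between business objects.
CREATE TABLE link_customer_account (
  link_customer_account_hk STRING,   -- hash of the participating business keys
  hub_customer_hk          STRING,
  hub_account_hk           STRING,
  load_ts                  TIMESTAMP,
  record_source            STRING
);

-- Satellite: the state information describing a hub or link over time (insert-only).
CREATE TABLE sat_customer_details (
  hub_customer_hk  STRING,
  load_ts          TIMESTAMP,
  hashdiff         STRING,     -- change-detection hash over the descriptive columns
  customer_name    STRING,
  customer_segment STRING,
  record_source    STRING
);
```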

Anatomy of a data product

Data Contract

A data product does not exist without a data contract; otherwise, what guarantee do you have that the data product meets any expectations? The cornerstone of a data mesh is to treat data as a product; therefore, a data contract will address:

  • Discoverability — a data product should be registered and discoverable within an enterprise data catalogue.
  • Addressability — a data product should have a unique address following a global convention that helps its users to programmatically or manually access it.
  • Understandability — providing the information (schemas, metadata, etc.) for a consumer to be able to use a data product:

o Purpose — the original intent of the data product.

o Business terms — describing the data assets, business entities, relationships and state information.

o Semantics — logical data model — constraints and relationships.

o Domain — maps to domain map, business capability and ownership.

o Data product type — operational (source-aligned), analytical (aggregate-aligned), engagement (consumer-aligned).

o Schema — attributes and data types, mandatory and optional fields.

  • Trustworthy and truthful — domain owners will guarantee and communicate their data product characteristics and service-level objectives (SLOs). These include data quality attributes such as:

o Interval of change — how often changes in the data are reflected, batch, near-realtime or realtime.

o Timeliness — the skew between the time that a business fact occurs and becomes available to the data users.

o Completeness — the degree to which all the necessary information is available.

o Statistical shape of data — its distribution, range, volume.

o Lineage — the data transformation journey from source to here.

o Usage — popularity

o Domain ownership — subject matter expert, custodian

o Version — denotes schema evolution and business rule evolution.

o Explainability — if the product includes machine learning features and how trustworthy they are.

  • Natively accessible — users can access data product outputs with their tools of choice.
  • Valuable on its own — a key consideration for data product owners when defining a data product is ensuring that it will deliver value, whether for its own domain or for other domains.
  • Secure — restricting access and classifying its content

o Encryption — at rest, in transit

o Role-based access control — authorisation

o Data classifications — personal data, sensitivity classes

o Data retention policy — life cycle and expiration

o Service level agreements and thresholds

Components of a Data Vault Data Product

Data contracts are change-managed and, depending on your philosophy, they are owned either by the data consumers or the data producers. Something to consider when deciding on your data strategy is that an upstream software domain owner has their own set of priorities to ensure their data products deliver their intended business service; they know little about the downstream data needs of data engineering pipelines. Only when business cases are defined and the source-data needs are identified does any of the data the source-aligned data product offers show any downstream value. These data elements are called critical data elements (CDEs) and, once identified, they are included in a data contract as the features that need to meet certain guarantees and service level agreements (SLAs). As a guide, consider this stepwise process:

  • Identify data and information needs.
  • Identify source-domains that meet those needs, if they do exist.
  • Identify stakeholders and software domains owners.
  • Determine if the source-software is a custom application, vendor tool or legacy platform.
  • Start the process of modelling the needs and identifying the gaps and timelines, which helps determine the timeliness of delivery.
  • Set expectations for data quality, SLAs, frequency of change, etc.

Once the data has been identified and the attributes describing a business entity or business event (an interaction between entities) have been located, the method of ingestion is designed; it will aggregate the data into an analytical data product. There will be a need to historise that state change as slowly changing dimensions of type 2 (SCD Type 2). The problem with SCD Type 2 tables, however, is that they require an update operation to end-date the currently active record in the dimension when its state does change. This operation is not feasible on a file-based object store due to a file's immutability, and the execution of an update itself is expensive because it must locate the currently active record in an SCD Type 2 table backed by file-based partitions. Apache Iceberg has the option to operate in one of two update modes (a sketch of switching between them follows this list):

  • Copy-on-write mode essentially creates new data files containing the updated record (end-dated) and the new record carrying the new state (high-date). The single file where this record was located will also contain other records that did not need an update at all but must be copied into the new file alongside the new and updated records. Because the state of these records is co-located, a single file is read when deciphering the current state of a record. The downside, however, is that if you expect high-frequency updates on an Iceberg table, you will experience high churn in rewriting those files. The same occurs if we need to execute a delete operation.
  • Merge-on-read mode instead deploys remora-styled files that effectively act as markers denoting which records have been deleted (aka tombstone markers), while new active records are persisted into new data files. Instead of rewriting entire data files to accurately capture information state and copying over records that are not updating, small files are generated with these tombstones and are included in the read operation that retrieves the object state. On the plus side, this makes insert, update and delete operations faster, but reading the data is slower because there are more files to read to arrive at the accurate state.
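The choice between the two modes is a table property in Iceberg; a hedged sketch in Spark SQL follows, with the table name assumed for illustration:

```sql
-- Favour faster writes for an update-heavy, SCD Type 2 style table.
ALTER TABLE lake.vault.scd2_dim SET TBLPROPERTIES (
  'write.update.mode' = 'merge-on-read',
  'write.delete.mode' = 'merge-on-read',
  'write.merge.mode'  = 'merge-on-read'
);

-- Favour faster reads for a table that is read-heavy and rarely updated.
ALTER TABLE lake.vault.scd2_dim SET TBLPROPERTIES (
  'write.update.mode' = 'copy-on-write'
);
```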

Data vault satellite tables are insert-only, which means they do not have an end-date column and therefore do not need the execution of an update operation to end-date records. To find the current record we rely on a remora-styled construct of our own called the CPIT, along with the execution of bloom filters, to find the current active record in an SCD Type 2-styled table. The same operation is possible between staged content, ready to be checked for true change against the target satellite table. A satellite table is an SCD Type 2 table without the end-date column, and the same satellite table construct is used for high-churn inserts (metrics and facts) as well. Advantage, data vault.
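As a rough illustration of the insert-only, true-change check (leaving the CPIT construct itself aside), here is a hedged sketch in Snowflake-flavoured SQL; the staging table, satellite and columns are assumed names:

```sql
-- Insert only the staged rows that are new keys or genuine changes:
-- compare the staged hashdiff against the latest satellite record per key.
INSERT INTO sat_customer_details
SELECT s.hub_customer_hk, s.load_ts, s.hashdiff,
       s.customer_name, s.customer_segment, s.record_source
FROM staged_customer s
LEFT JOIN (
    SELECT hub_customer_hk, hashdiff
    FROM sat_customer_details
    QUALIFY ROW_NUMBER() OVER (PARTITION BY hub_customer_hk ORDER BY load_ts DESC) = 1
) cur
  ON cur.hub_customer_hk = s.hub_customer_hk
WHERE cur.hub_customer_hk IS NULL      -- brand-new business key
   OR cur.hashdiff <> s.hashdiff;      -- true change in descriptive state
```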

Let's describe some common scenarios that could form the modelled foundations of data products, or at least the analytical data model patterns we see supporting data products backed by a data vault.

Aggregate Data Vault Products

“Gather together those things that change for the same reason, and separate those things that change for different reasons.” — Single-responsibility principle, Robert C Martin.

Aggregate in this sense does not strictly mean taking the lowest grain of data and transforming it into a sum, average or outlier of that data. It means taking all the data we have captured about a business object and its relationships to all other business objects, and aggregating that to support the proposed business case. Of course, your analytical need may require metric aggregation; the term simply has a latitude of meaning in the world of data vault.

Each aggregate product of data vault artefacts fulfils an aggregate need as an engagement data product, enriching business objects (hub tables) with more and more business cases. New or existing business cases may choose to ingest more data already provided by other aggregate data product artefacts (hub, link and satellite tables) in the data vault.

The question here is: who now owns that shared data vault table? If a business case and its aggregate data product become deprecated, ownership of that artefact falls to the remaining data product owners. A secondary question you might ask is: should a satellite table evolve (adding or removing columns), what are the downstream implications if that satellite table is shared? If none, then carry on; if significant, you may consider a satellite split.

Let’s demonstrate by illustrating common aggregates across a data vault.

Batched Transaction Product

Batched Transactions

Instinctively, when data vault modellers hear the word ‘transactions’ they default to modelling them as a transactional link (t-link for short); the reality is that this modelling form has been deprecated because loading batched transactions as a link table breaks the original definition of a link table. We can instead land batched transactions in a link-satellite table, with either the transaction id or the transaction date defined as the intra-day key (the satellite with dependent child key pattern). The link parent remains the distinct interactions between business objects, on which you would build further analytics.
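A hedged sketch of that shape follows; the card and merchant business objects and all names are illustrative, with the transaction id playing the role of the dependent child (intra-day) key:

```sql
-- Link: the distinct card-to-merchant interactions we build further analytics on.
CREATE TABLE link_card_merchant (
  link_card_merchant_hk STRING,    -- hash of the participating business keys
  hub_card_hk           STRING,
  hub_merchant_hk       STRING,
  load_ts               TIMESTAMP,
  record_source         STRING
);

-- Link-satellite: batched transactions, keyed by the link plus the intra-day key.
CREATE TABLE sat_card_merchant_txn (
  link_card_merchant_hk STRING,
  transaction_id        STRING,          -- dependent child (intra-day) key
  load_ts               TIMESTAMP,
  transaction_ts        TIMESTAMP,
  transaction_amount    DECIMAL(18, 2),
  record_source         STRING
);
```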

Furthermore, by deploying autoincrement columns in each satellite table needed for the downstream Kimball model, you could essentially model a star schema on top of the underlying data vault model by reusing those autoincrement ids as temporal surrogate keys to slice and dice your dimensional data model.

For a reference of the above schema, see bit.ly/3qHKHSS

Business Vault Data Products

Sparsely modelled, business vault transforms raw application data

Raw vault tables are the aggregated outcomes of the source data landscape; this means we have mapped business keys (as best we can), mapped interactions between business keys, and appropriately mapped satellite attributes to what they describe from the business process domain automated as application software. Business vault (done correctly) is sparsely modelled, and it must meet the same auditability standards as raw vault; let me explain.

  • Raw vault outcomes are capturing the business processes and rules as defined by source applications that may meet business requirements exactly or nearly.
  • Business vault is taking raw vault outcomes and applying business rule or process transformations we could not offload to the source application either to solve a source-system gap permanently or temporarily.

Therefore, raw vault is made up of hubs, links and satellites, and business vault extends raw vault with links and satellites. This distinction is important because it promotes the fact that business logic is nothing more than business rules deployed as code, an asset in its own right, which should be broken down into a taxonomy to promote reusability, scalability and auditability. A business rule change will then be reflected in every artefact that uses that business rule. A business rule change has a timestamp; deploying the rule as an SQL view is therefore folly, because you have then hardcoded business rule evolution into your rule, when in fact a business rule should be autonomous of what was applicable before. A business rule logic change infers a data change only if the outcome of applying the business rule produces a different value output. If the business rule evolves but no true-change record is inserted into the business vault satellite table, then there is no way of knowing that the new business rule is in place except by discovering it in the metadata of the business rule repository (git version control). If the rule does cause a true change, then the record-source column must carry that change version. Now you could track all of it using a record tracking satellite table!

Business rule code is separate from the business vault link and satellite tables, which means the logic itself can be deployed in any language or tool and the outcomes treated as a data source, so that the data vault automation tool can simply load those outcomes as business vault link and satellite tables. We should not rule out business vault views entirely, but understand the repercussions if you default to deploying business rules as views: a rule applied in a view is applicable across time, so introducing business rule evolution into the rule ultimately means you will hard-code a from-date for certain rules. The other precaution is the danger of stacking views; should a change be needed in a stacked view, the impact is felt in every dependent view.
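To make the versioning point tangible, here is a hedged sketch of loading a business rule's output as a business vault satellite, carrying the rule version in the record-source column; the rule output table, satellite name and version string are all assumptions for illustration:

```sql
-- The rule itself runs elsewhere (any language or tool); its output is treated
-- as a source and loaded like any other satellite.
INSERT INTO bv_sat_customer_risk
SELECT
    r.hub_customer_hk,
    CURRENT_TIMESTAMP()                 AS load_ts,
    r.applied_ts,
    MD5(CAST(r.risk_score AS STRING))   AS hashdiff,
    r.risk_score,
    'rules_repo/customer_risk@v2.3'     AS record_source   -- rule id + git version
FROM customer_risk_rule_output r;
```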

Business rules should be categorised before they are implemented by a business rule automation tool. The taxonomy of business rules will include reuse of component code that does not paint the complete business rule, but its reusability means metrics and calculations are shared across analytical data products.

The last point needed to make physicalised business vault tables work is the adoption of a bitemporal data vault. A bitemporal data vault includes two non-business date columns for tracking time:

  • Load timestamp, for when the record is inserted into the enterprise data warehouse.
  • Applied date, which is the state date of the data from the source application or source-aligned data product; aka the extract date.

Applied dates between raw and business vault must align; this also supports loading the same data more than once for the same applied date to track business rule outcome change. In that case, of course, the load date acts as a version identifier of that business rule outcome.
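A small, hedged sketch of that versioning in practice (Snowflake-flavoured SQL, assumed names): the latest load per key and applied date is the current version of the business rule outcome for that state date.

```sql
SELECT *
FROM bv_sat_customer_risk
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY hub_customer_hk, applied_ts   -- one outcome per key and state date
    ORDER BY load_ts DESC                      -- load date acts as the version
) = 1;
```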

Apache Iceberg's management of the files underneath its table format as pointers also enables certain desirable characteristics for machine learning. By tracking the state of the table at a point in time through time travel, you can create clones of the original table as it was before data leakage occurred, and restore table state to replay data ingestion in case of a corrupted load.
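For illustration, a hedged sketch in Spark SQL on Iceberg; the table name and snapshot id are placeholders:

```sql
-- Query the table as it was at a given point in time (e.g. before suspected leakage).
SELECT * FROM lake.vault.sat_customer_details TIMESTAMP AS OF '2024-06-30 00:00:00';

-- Or pin the read to an explicit snapshot id.
SELECT * FROM lake.vault.sat_customer_details VERSION AS OF 1234567890123456789;

-- Roll the table back to a known-good snapshot to replay ingestion after a corrupted load.
CALL lake.system.rollback_to_snapshot('vault.sat_customer_details', 1234567890123456789);
```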

For a reference of the above schema, see bit.ly/3Zt6adM

Near-Realtime Data Products

Near-realtime analytics as a product

Before you consider near-realtime analytics, determine if you need it. Most analytics is based on the current state of business objects, and other analytics requires history to be processed in batch to infer outcomes based on learned business object behaviour. The realtime portion needs a history of data to infer outcomes, and that history is accumulated over time; for these requirements the ingestion is not realtime but in fact batched. The inference happens in the application layer when predicting customer behaviour in realtime as they interact with an application's frontend.

It is extremely important to qualify the business value upfront because it influences the complexity needed. We may decide to implement near-realtime streaming, but the analytics calculated is still based on data at rest, whereas true realtime analytics is based on data in motion. Most (if not all) analytics needs to aggregate data into a centralised store, which implies data is moved and aggregated far away from the actual business event. Business events are raw events of interest and, like CDEs, not all the raw data is that interesting. But the events that are interesting must be aggregated and stored to make realtime decisions and to show the lineage of why a decision was made in the first place.

Business events are true changes by definition and therefore do not follow the regular satellite table ingestion pattern if the business need is for near-realtime ingestion. Non-historised link and satellite tables do not have a record hash because the upstream application manages true changes (exactly once semantics). The job of business vault in this sense is to turn that raw data into business insights as soon as possible, and we can do this in a repeatable pattern using the activity schema modelling methodology.

As far as Apache Iceberg support for near-realtime data is concerned, Iceberg can be used as a sink for Kafka topics through the Kafka Connect framework.

This is a great resource to reference when thinking about streaming analytics and event-based architecture: https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101

Business Key Mastering Service

One key to rule them all, hub tables map to the business ontology

Hub tables in a data vault reflect the business objects integrated by business key, the immutable identifier used by a business to track everything it needs to know about that business object. As a multitude of relationships and state information accumulates around a hub table in link and satellite tables, we need to ensure:

  • A universal business key is used to identify a business entity, always and forever.
  • No unrelated data is ever associated with the wrong entity.
  • We passively integrate across the software automation landscape (bounded contexts) by that business key, the universal identifier for that business entity.
  • The hub table is one-to-one with what the business considers a business object, and nothing else.
Patterns we see to resolve Passive Integration

For a large organisation integrating a multitude of source applications, the value is in bringing that data together, following the shift-left philosophy (move tech debt to the source where we can) and, where we must, taking advantage of the capabilities of an OLAP platform.

How do we efficiently get the data out?

For example, everything we know about a customer, account or financial transaction (and any other business concept you can think of) may span a multitude of source applications you have built or acquired, and is aggregated into a 360 view of that entity. Data vault includes query assistance structures we call point-in-time (PIT) and bridge tables to join raw and business vault attributes (features) around an entity or relationship. These structures serve two purposes (and are not business vault structures):

  • Hide the complexity of joining related satellite tables (RV & BV).
  • Efficiently use hash-join optimisation from the underlying OLAP compute engine.

(How do you accomplish both if BV is not physicalised?)

Not too long ago, Bill Inmon updated his definition of what a data warehouse is:

“A data warehouse is a subject-oriented, integrated (by business key), time-variant and non-volatile collection of data in support of management’s decision-making process, and/or in support of auditability as a system-of-record.” — Bill Inmon, father of the data warehouse, updated definition @ wwdvc 2019

Resolving business keys is an imperative! PIT tables must be constructed with these objectives in mind to satisfy analytics:

  • Per consumer domain
  • Per the semantic model
  • Per cadence of the business date requirements

Of course, this means PIT tables are ephemeral and mouldable to any business requirement on the day. The business key mastering service is the careful governance of business keys upwards to the business and downwards across the software (source-system) domain.

Upper- and lower-case business keys should not differentiate between business entities, but it has happened; Salesforce recognised this error and introduced case-safe ids to mitigate it. However, when we discuss the mastering of business entities into a single entity, it goes further than this. Each source application built or bought to automate part of or all of a business process needs to work with the same business entities. Because each software package is cheaper to acquire (they are the best at the portions of the business process they automate), we must somehow synchronise these business entities across the software landscape. That ‘somehow’ is the job of master data management (MDM): it assesses business processes in realtime (one method of MDM implementation) and applies business rules to determine whether we are managing data about the same business entity. In realtime, MDM injects an MDM id and match-merges against those business rules to collapse multiple MDM ids into a single MDM id for all related data around a business object.

Data vault is not involved in MDM, but it does treat the MDM id as a source because it is the single universal id that reliably (as defined by the business) integrates this data in realtime based on defined match-merge rules. It is not necessarily the id we use to represent the business object, that is up to the business, but it is the source that helps us define the universal business key.

Optimised Information Delivery

Snapshots of keys and dates from Raw and Business Vault

Following the principle that query assistance structures such as PITs and bridges are not business vault artefacts, you can see that these table structures are merely used to:

  • Reduce data vault querying complexity, and
  • Make full use of the ‘build and probe’ (hash-join) optimisation of the OLAP platform.

The loading of the raw and business vault artefacts is autonomous, each aggregate artefact serving its respective business cases; taking a snapshot of the keys and dates is an aggregate artefact in its own right and should be treated as such. In other words, the operation to retrieve and persist satellite table keys and dates to populate PIT tables is independent of the loading of those satellite tables. This is achievable while utilising the data platform's ACID isolation level; Snowflake uses READ COMMITTED for its proprietary table format and complies with Iceberg's serializable isolation level.
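As a hedged illustration of what such a snapshot looks like (the names, the snapshot_dates helper table and the two satellites are assumptions; production PIT loaders also add ghost records and incremental logic):

```sql
-- For each hub key and reporting date, capture the latest applicable load timestamp
-- of each satellite, so downstream queries become simple equi-joins (hash joins).
INSERT INTO pit_customer
SELECT
    h.hub_customer_hk,
    d.snapshot_date,
    MAX(s1.load_ts) AS sat_customer_details_ts,
    MAX(s2.load_ts) AS bv_sat_customer_risk_ts
FROM hub_customer h
CROSS JOIN snapshot_dates d
LEFT JOIN sat_customer_details s1
       ON s1.hub_customer_hk = h.hub_customer_hk AND s1.load_ts <= d.snapshot_date
LEFT JOIN bv_sat_customer_risk s2
       ON s2.hub_customer_hk = h.hub_customer_hk AND s2.load_ts <= d.snapshot_date
GROUP BY h.hub_customer_hk, d.snapshot_date;
```

Downstream, an information mart query joins the PIT to each satellite on the key and load timestamp pair, which is exactly the equality predicate the build-and-probe hash join wants.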

The applied date in business vault will be the same applied date as its parent raw vault tables and, in fact, business vault may include other business vault intermediary artefacts as a source. Should a business rule outcome not produce a business vault record, the design of a PIT will retrieve the nearest applicable applied date anyway. The outcome will be the same if you are using a SNOPIT table instead.

For a deep dive into implementation detail, visit this recently published blog post: “Simple PIT Constructs”, https://medium.com/the-modern-scientist/simple-pit-table-constructs-ca40ee9305d5

Complicated subsystem data products

From time to time a complex business scenario may present itself that requires independent complex components and/or modules. We always want to solve complex scenarios at the source application, but from time to time it is either not feasible or not possible to do so, and the scenario itself requires the orchestration of complicated subsystems and recording the outcome in an auditable business vault artefact. Take the data model below as an example:

Solving the application versus business view conundrum

Raw vault is used to capture the polyglot data sources into hub, link and satellite tables; however, how the business views the business process does not match how it is depicted in the source application. To support this, we have deployed the business logic (using Spark GraphX's Pregel API in this case) between resolving the joins across the raw vault link and hub tables and loading the Spark module's output as a business vault link table. Modularising the components in this way also frees us up to version the module code, or even replace it should a better-suited and less complex solution become available.

Modularise components making everything in the data vault portable and versionable

This implementation is complex on its own and has its own article describing the business problem and how it was solved.

By the way, Snowflake does support recursive common table expressions (CTEs); the platform in question where this solution was developed was using Spark.

The Connected Data Vault Universe

The beauty of materialising your information landscape as a data vault model is that the more use cases you bring to fill in the canonical enterprise data model, the more the enterprise corporate memory becomes physicalised and auditable. Data vault hub tables are strictly based on the business objects the business cares about (based on core business capabilities). All other data vault artefacts are built to efficiently serve the business cases they were designed for, as link and satellite tables. All that aside, let's take a different approach to data vault modelling: let's say there is no existing data vault model to integrate with; what would you do to design and develop an enterprise canonical data model?

  • Identify the business entities.
  • Identify how these business entities interact.
  • Determine whether the attributes we get describe a specific business entity or the interaction between business entities.
  • Define business-view relationships as constraints in the physical model, thereby ingraining the business processes as referential integrity.

These imply a traditional data modelling approach of:

a. Define the conceptual data model (CDM).

b. Define the logical data model (LDM).

c. Define the physical data model (PDM).

A data vault does not strictly require this stepwise approach because the business objects are already known; they are based on an enterprise's business capabilities, and this covers our CDM.

Business processes and rules are resolved as link table structures; if new relationships need to be depicted in the canonical data vault model, we simply add a new link table structure, with no need to change any existing table structures. That covers the LDM (mostly).

Raw vault satellite table attributes are identical to the source application that supplied them; therefore the task of defining data types for these attributes is already resolved, and this covers the PDM.

Aggregate of aggregates — canonical data vault model

Stream-aligned data teams integrate their independent data vault models by business key into the canonical data vault model. A business key is the start and end of all business objects; data vault's flexibility gives a business object context, the same context we desire for how to treat those business objects.

Flexibility is defined as:

  • Non-destructive change to data vault satellite tables if new attributes are needed; Apache Iceberg supports schema evolution (see the sketch after this list).
  • Deprecated data is simply switched off and if needed archived which has zero impact to the existing canonical data vault model.
  • New business object types are vetted and added to the canonical data model.
  • Polyglot data sources are conformed to the three table types implying that retrieval of this data follows repeatable query patterns over this data.
  • Modularising logical artefacts (code) from physical artefacts (tables) allows us to independently upgrade or version either component without necessarily impacting the other.
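A hedged sketch of the schema-evolution point above, in Spark SQL on Iceberg with assumed names; adding the column is a metadata-only operation and does not rewrite existing data files:

```sql
-- Non-destructively add a new descriptive attribute to an existing satellite.
ALTER TABLE lake.vault.sat_customer_details
  ADD COLUMN customer_tier STRING COMMENT 'new descriptive attribute';
```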

Context is defined as:

  • Enriching data with business terms and definitions achieved by using a solution like reference data management (RDM), and an enterprise business catalogue tool.
  • Bounded contexts within the enterprise model imply that the canonical model itself brings multiple contexts together into the corporate memory. It also means portions of the overall model should be used for the correct purpose; achieving that requires the mandatory documentation all business cases must provide, as well as utilising that enterprise catalogue tool. What is common between bounded contexts are those business objects.
  • New business cases which add to the overall canonical model can also enrich their own use cases with data already being ingested into data vault.
  • Separating the identifying attributes into their own satellite (i.e. personally identifiable records) allows the rest of the contextual data to be utilised without re-identifying those specific entities.

Records associated with a business object can be expired as well. Although machine learning needs as much data as possible for training, there are a multitude of reasons why the training data itself can become polluted. The flexibility to change and replay business rule outputs, along with inexpensive methods to version and correct data as well as code, means data scientists can experiment more with less.

“Any organisation that designs a system will inevitably produce a design whose structure is a copy of the organisation’s communication structure.” -Melvin E. Conway

The Metadata Asset Layer Cake

Make no mistake: although each stream-aligned team in an organisation is building its own data vault, even within the same data platform, the hub tables for an enterprise are indeed shared. We call this concept a “shared kernel”, a term borrowed from domain-driven design (DDD) to emphasise shared responsibility for a common artefact, which in turn implies collaboration and governance around that artefact's growth and outcomes; a conceptual discussion on this is presented here.

As more assets are created, updated and used in engagement data products, the metadata of these assets must be tracked to efficiently measure usage, lineage and access. All these endpoints are available from a data product's anatomy as a data contract; the data product is an end product of the data assets aggregated to support it, and it must be tracked and traced as well. This is the function of DataOps, tracked and traced using enterprise business data catalogue tools.

Unified Data Platform

Onward to part three of this article…

Data Lakehouse meets Data Cloud Architecture

The product of aggregating data vault tables and code is an information mart. Both are tangible and portable components used to deliver data lakehouse-based data products (Apache Iceberg tables and views). A data cloud architecture scales analytical data and logic (code) across fundamental computing components, managing your own and your customers' and partners' data and code assets. As we have seen in the previous sections of this article, we have scaled out the fundamental computing components: compute, memory, storage and software. We have also highlighted how these concepts on the cloud give the architect options.

The Data Cloud Architecture is not single-vendor based and promotes the interoperability of compatible fundamental computing components as managed by an enterprise's preferred operating model. The operating model will influence what domain data and code is managed and shared, and how. The four operating models (according to “Enterprise Architecture As Strategy” by Jeanne W. Ross) are:

  • Unified — (high standardisation, high integration)
  • Replicated — (high standardisation, low integration)
  • Coordinated — (low standardisation, high integration)
  • Diversified — (low standardisation, low integration); diverse and independent business units or subsidiary corporations.

Before we delve into each of these operating models, let’s set up some fundamental concepts key to data cloud architecture.

Data Sharing

No-data movement, secure data sharing

Data sharing is the platform capability to share data with customers, partners and suppliers without moving the data at all; a unique capability of the data cloud, as you have effectively rented blob storage from the cloud service provider and secured that data through encryption keys, access control and authentication. Data sharing operates by authenticating users outside your business unit or organisation to have realtime, read-only access to that data. The options for data sharing include:

  • Direct private data shares — negotiated sharing of realtime data between a data producer and a data consumer. Data products are treated as inbound data for downstream processing by the data consumer (a sketch of a direct share follows this list).
  • Data sharing marketplace — data products are placed in a private or public marketplace, for free or under negotiated consumption terms.
  • Data clean rooms — a type of private share in a walled garden where only authorised queries are executed and the results returned are non-identifying. Techniques such as differential privacy are used to inject noise into aggregated results between producers and consumers. To see how a data clean room framework works, go to bit.ly/3IeDzSu
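As a hedged sketch of the direct-share option, in Snowflake SQL; the database, schema, share and consumer account names are illustrative:

```sql
-- Producer side: expose a data product schema through a share; no data is copied or moved.
CREATE SHARE customer_360_share;
GRANT USAGE  ON DATABASE analytics              TO SHARE customer_360_share;
GRANT USAGE  ON SCHEMA   analytics.customer_360 TO SHARE customer_360_share;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.customer_360 TO SHARE customer_360_share;

-- Entitle the consumer account; the consumer then creates a read-only database from the share.
ALTER SHARE customer_360_share ADD ACCOUNTS = my_org.partner_account;
```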

Business Logic Sharing

As with the sharing of data depicted above, business logic sharing follows a similar pattern:

  • Direct business application sharing — applications shared between organisations or business units, where the consumer pays to utilise that logic on their own data. No data is sent back to the producer except authorised usage statistics to enable bug fixing and application improvement.
  • Business application marketplace — applications are advertised in the marketplace and available as a service or as a one-time purchase. The same framework applies and no data is sent back to the producer.

Replication and Multi-Cloud

Data and application sharing without movement is possible if all parties who wish to share data and code operate in the same cloud service provider's (CSP — AWS, Azure and GCP) physical data centre, otherwise known as an availability zone (AZ). A single AZ may operate two or more data centres and, depending on how you configure your software and data, this can be replicated across data centres within the AZ on your behalf. To share your assets across AZs or between CSPs, replication is required, i.e. automated data and code movement, which implies a degree of latency that is usually also configurable by you. Considerations for enterprise asset replication include:

  • Will replication meet the desired business outcome and service level? Where are our customers and how stale can the data be?
  • Does replication of data have the potential to violate regulatory or contractual sovereignty concerns?
  • What is the cost of replication? How can we minimise that?
  • The type of operating model may negate replication.
  • Replication is also a great way to ensure business continuity and disaster recovery.
  • Vendor lock-in, although if there is a capability overlap this may imply a high implementation cost.

Let’s take what we have learnt and apply it to the four enterprise operating models we highlighted earlier.

Let’s wrap this discussion up…

Middle-out

Why middle-out? Rarely does an industry adopt a pattern or template as universally as in the opportunity we face today. Apache Iceberg has that momentum, and multiple vendors and hardcore data engineers have realised this, seeing something that is almost as flexible for data as git version control is for code. A single table format is available to support tabular analytics as well as the needs of machine learning. Every organisation and its data teams are nuanced and opinionated and will therefore have a preferred suite of practices and tools. For a single vendor to attempt to cater to everything in data analytics, they risk sacrificing the expertise and nuance needed to excel in any one of the fundamental components needed for analytics. Moreover, decades of data warehousing have littered the internet with stories of slow and costly failure, which solution architects are judged against today. It is imperative for them not to repeat the mistakes of the past and thus to pick an option that will have the lowest strike price in the future (if you choose to exercise it).

Programming code is declarative whereas data is naturally imperative. Think about it: code is applicable for all time, whereas data is inherently applicable from a point in time (unless you decide to rebuild it). Code evolves (learns) based on data; the rules you apply as code today are not relevant to the data of the past. This is why the treatment of data and code should forever be managed separately; a manipulation of past data with currently versioned code is misinformation and in most cases fails the regulatory compliance of that data.

From the fundamental computing components perspective, we are seeing a reversal of the days of Big Data, taking data processing back to what it was before: uninterrupted SQL and relational algebra. Global participation in innovation has meant that no single player in the analytics domain will dominate analytics; instead, by turning the database inside out, all fundamental computing components become portable and available off the shelf for solution architects to choose and mix. Gone are the days when choosing a vendor meant that all the fundamental computing components were locked into a single vendor with multi-year financial commitments.


The views expressed in this article are my own; you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.
