Time to virtualize virtualization — Why Schema Description Languages are the start of data asset management

Steve Jones
Collaborative Data Ecosystems
Jan 11, 2022

I’ve said that I’m not a fan of Data Virtualization, but there is an architectural pattern within virtualization that I think needs to be surfaced, and it’s one that addresses two core problems in modern data infrastructures.

Foreign Keys and relationships

The first challenge is relationships, or to use old-school SQL terms, Foreign Keys. File-based approaches like Parquet and CSV, and data warehousing technologies like Azure Synapse, don’t have the concept of relationships within their infrastructure. It means that while we’ve put the data in a single place, it isn’t actually connected. This makes it a pain for business users, and it also makes it a pain to do data quality checks. There are good performance reasons why these things aren’t implemented at the base storage level, but as end-users we need this stuff.

This is the Schema Description Language — the SDL
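
As a rough sketch of what I mean (the structure and field names here are mine, purely illustrative, not any existing standard), an SDL just needs to carry the entities, their fields and the relationships that the storage layer refuses to hold for us:

```python
from dataclasses import dataclass, field
from typing import List

# A minimal, illustrative Schema Description Language (SDL) model.
# The names and structure are assumptions for the sake of the sketch.

@dataclass
class Field:
    name: str
    dtype: str            # e.g. "string", "int", "date"
    nullable: bool = True

@dataclass
class Relationship:
    from_entity: str      # entity owning the foreign key
    from_field: str
    to_entity: str        # referenced entity
    to_field: str

@dataclass
class EntitySchema:
    name: str
    fields: List[Field]
    keys: List[str] = field(default_factory=list)

# Describe two entities and the relationship the file store doesn't know about.
customer = EntitySchema(
    name="Customer",
    fields=[Field("customer_id", "string", nullable=False),
            Field("name", "string"),
            Field("country_code", "string")],
    keys=["customer_id"],
)

order = EntitySchema(
    name="Order",
    fields=[Field("order_id", "string", nullable=False),
            Field("customer_id", "string", nullable=False),
            Field("order_date", "date")],
    keys=["order_id"],
)

customer_orders = Relationship("Order", "customer_id", "Customer", "customer_id")

# The relationship now lives alongside the schema rather than inside the storage
# engine, so a catalog, a quality check or a join generator can all read it.
print(f"{customer_orders.from_entity}.{customer_orders.from_field} -> "
      f"{customer_orders.to_entity}.{customer_orders.to_field}")
```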

Store access definitions

The second thing we need is a way to describe the stores: where they are, what they are, and the security and access management for the store (unless everything is properly linked to a single IAM solution).

This is the Store Description Language — also confusingly an SDL, but let’s say SyDL, for System Description Language
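
Again as a sketch, and again with entirely illustrative field names, a store description is not much more than where the store is, what it is, and who is allowed to do what with it:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# An illustrative Store/System Description Language (SyDL) record: where the
# store is, what technology it is, and who can do what. Field names are assumptions.

@dataclass
class StoreDescription:
    name: str
    technology: str                     # e.g. "parquet", "synapse", "postgres"
    location: str                       # path, URI or connection string
    iam_provider: str                   # which IAM solution governs access
    grants: Dict[str, List[str]] = field(default_factory=dict)  # role -> permissions

sales_lake = StoreDescription(
    name="sales_lake",
    technology="parquet",
    location="abfss://lake@example.dfs.core.windows.net/sales/",
    iam_provider="azure_ad",
    grants={"sales_analyst": ["read"], "ingestion_service": ["read", "write"]},
)

# With the store described outside the store itself, provisioning and access
# requests can be driven from the definition rather than hand-configured.
print(sales_lake.technology, sales_lake.location, sales_lake.grants)
```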

Virtualization is built on these two things, so let’s virtualize virtualization

Virtualization technologies do a good job of these two things, but then decide to virtualize the data as well. In a world where copying 100 terabytes isn’t really copying but caching, we can use these descriptions, together with data catalogs, to ‘virtualize virtualization’. What do I mean by that?

Well, I mean that we have two description languages:

  1. A storage description language
  2. A schema description language
Simple SDL/SyDL model

Nothing complex here, and we’ve all probably written these a bunch of times. Microsoft’s CDM is actually built on just such a thing, and this enables PowerBI to access stores using a Dataverse. If we take that model one stage further and link it to a data catalog, then we get the following:

  1. An operational business data catalog which enables us to describe data products and where they come from
  2. A mechanism for describing those data products across data stores
  3. A way to automate the provisioning of data products, and of changes to them, into new business-specific data lakehouses

This is an absolute requirement if we are to realize a data mesh and enable a more systematic way of incorporating data products outside of single-technology solutions. An advantage of this sort of approach is that you can include not only unstructured data but ad-hoc information as well, and either have it provisioned into a store for analytical performance or choose to access it remotely if you are in one of those cases where virtualization genuinely does work.
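
To make that catalog linkage a bit more concrete, here is a rough sketch of what an operational catalog entry could look like, with my own made-up structure rather than any particular catalog’s API: a data product that points at the SDLs that describe it, the SyDLs for where it currently lives, and the targets it can be provisioned into.

```python
from dataclasses import dataclass, field
from typing import List

# A sketch of an operational catalog entry: a business data product referencing
# the schemas that describe it and the stores it lives in. All names are illustrative.

@dataclass
class DataProduct:
    name: str
    owner: str                     # the business owner of the product
    schema_refs: List[str]         # SDL documents describing the entities
    store_refs: List[str]          # SyDL documents describing where it is held
    provision_targets: List[str] = field(default_factory=list)  # business-specific stores

customer_360 = DataProduct(
    name="customer_360",
    owner="sales_operations",
    schema_refs=["sdl/customer.yaml", "sdl/order.yaml"],
    store_refs=["sydl/sales_lake.yaml"],
    provision_targets=["sydl/marketing_lakehouse.yaml"],
)

# The catalog entry is enough to answer "what is this product, where does it come
# from, and where should it go" without touching the stores themselves.
print(customer_360.name, "->", customer_360.provision_targets)
```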

Tool vendors, hear my cry

Having this sort of description, including the store security model, means we can automate this provisioning. I’ve spent a bunch of years with companies industrializing ingestion and movement, because I firmly believe it has zero value. If I had the above approach, especially with the connection between a business catalog and an SDL and SyDL tool that enables the selection and provisioning of data products into stores, then using an SDL I could also pre-build relationships between products. So if I import customers and orders, the relationships between those two products are already captured in another SDL, and importing the two SDLs naturally imports the third.
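
A sketch of how that relationship resolution could work, with illustrative structures rather than anything from a real tool: relationship SDLs declare which products they span, and they come along automatically when all of those products are selected.

```python
# Illustrative sketch: relationship SDLs are resolved automatically when the
# products they connect are both imported. Structures and names are assumptions.

product_sdls = {
    "customers": {"entities": ["Customer"]},
    "orders": {"entities": ["Order"]},
}

# Relationship SDLs live separately and declare which products they span.
relationship_sdls = [
    {"name": "customer_orders",
     "spans": {"customers", "orders"},
     "join": "Order.customer_id = Customer.customer_id"},
]

def resolve_imports(selected_products):
    """Return the selected product SDLs plus any relationship SDL whose
    spanned products are all present in the selection."""
    selected = set(selected_products)
    resolved = [product_sdls[p] for p in selected]
    resolved += [r for r in relationship_sdls if r["spans"] <= selected]
    return resolved

# Importing customers and orders brings the customer_orders relationship with it.
for sdl in resolve_imports(["customers", "orders"]):
    print(sdl)
```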

Data Products referencing Stores

I want the sources to be allowed to be files, databases, SAP, mainframes and so on, and I want to be able to create my own SyDLs and SDLs for systems I have that aren’t ‘normal’ (like if someone has a Bull Mainframe or some other fringe technology).
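
That implies the SyDL layer has to be extensible rather than a fixed list of connectors. Purely as a hypothetical sketch, registering a custom store type could be as simple as this:

```python
# Hypothetical sketch: an extensible registry of store types, so a Bull mainframe
# or any other fringe technology can be described with the same SyDL mechanism.

store_types = {}

def register_store_type(technology, reader_factory):
    """Associate a store technology name with a function that knows how to read it."""
    store_types[technology] = reader_factory

# Built-in types...
register_store_type("parquet", lambda location: f"parquet reader for {location}")
register_store_type("postgres", lambda location: f"postgres reader for {location}")

# ...and my own one-off for a legacy system nobody ships a connector for.
register_store_type("bull_gcos", lambda location: f"custom GCOS extractor for {location}")

print(store_types["bull_gcos"]("mainframe://bull01/finance"))
```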

Then I want to be able to do two things:

  1. Be able to construct new data products at the catalog level based on underlying models
  2. Enable people to use those product (and source) schemas to construct their own business-specific stores, which will have a target technology

So at this stage I don’t want anything to actually exist; a virtualization tool at the design/sampling level could be nice, but that is a precursor to the next bit. When a business person hits ‘provision’, I want the following to happen (roughly as sketched after this list):

  1. Using the various tiers of transformations, provision a business-specific store
  2. Keep it up to date using change data capture
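
Here is the sketch I promised, and it is only that: the DDL generation is grossly simplified and the CDC subscription is a stand-in print for whatever engine would actually do the work, but it shows the shape of a ‘provision’ action driven entirely from the SDL and SyDL definitions.

```python
# Rough sketch of the 'provision' action. Everything here is illustrative: the
# type mapping and DDL generation are simplified, and the CDC subscription is a
# placeholder for a real change data capture engine.

TYPE_MAP = {"string": "VARCHAR(255)", "int": "INTEGER", "date": "DATE"}

def ddl_for_entity(entity):
    """Generate a CREATE TABLE statement for one SDL entity in the target store."""
    cols = ", ".join(f"{f['name']} {TYPE_MAP[f['dtype']]}" for f in entity["fields"])
    return f"CREATE TABLE {entity['name']} ({cols});"

def provision(product, target_store):
    """Provision a business-specific store from a data product definition."""
    for entity in product["entities"]:
        print(f"[{target_store['name']}] {ddl_for_entity(entity)}")       # 1. create structures
    print(f"[{target_store['name']}] subscribe CDC from {product['source_store']}")  # 2. keep them current

customer_product = {
    "name": "customer_360",
    "source_store": "sales_lake",
    "entities": [
        {"name": "Customer",
         "fields": [{"name": "customer_id", "dtype": "string"},
                    {"name": "name", "dtype": "string"}]},
    ],
}

provision(customer_product, {"name": "marketing_lakehouse", "technology": "postgres"})
```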

Now, to do this it might be good to materialize certain things, like the enterprise state and the data product layer, speeding up onward consumption by always having this data available, including the streams and reconciliation parts. In other words, instead of virtualizing we are going to use change data capture to replicate data, and that means we’ve sort of automatically created our operational reporting layer.
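
Mechanically, ‘replicate rather than virtualize’ is not much more than applying a stream of change events to a materialized state. A minimal sketch, with a made-up event shape (real CDC tools each have their own formats):

```python
# Illustrative only: applying a stream of CDC events to a materialized replica.
# The event shape is an assumption, not any particular CDC tool's format.

replica = {}   # key -> latest row: our materialized operational reporting layer

def apply_change(event):
    """Apply a single change event (insert/update/delete) to the replica."""
    key = event["key"]
    if event["op"] == "delete":
        replica.pop(key, None)
    else:                      # insert and update both upsert the latest state
        replica[key] = event["row"]

events = [
    {"op": "insert", "key": "C1", "row": {"customer_id": "C1", "name": "Acme"}},
    {"op": "update", "key": "C1", "row": {"customer_id": "C1", "name": "Acme Ltd"}},
    {"op": "delete", "key": "C2", "row": None},
]

for e in events:
    apply_change(e)

print(replica)   # the replicated state layer, kept current without a virtual query layer
```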

Using CDC and meta-models to physicalize virtual schemas

My point here is that mappings and transformations are something we do a lot of in data, and metadata-driven transformations are quite a cool and efficient way to do them. This is effectively the pipeline that we build today; it’s what we build in virtualization tools. I’m just suggesting we publish those elements, including their associated security models, and then physically create them in stores: the base layers into some sort of data lake/mesh architecture and the higher layers into business-specific stores. That replicated state layer in particular can be really useful to support operational analytics without thrashing traditional databases.
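
A tiny sketch of what metadata-driven means here, with an illustrative mapping structure: the mapping is just data, and one generic engine executes any mapping you publish.

```python
# Illustrative sketch of a metadata-driven transformation: the mapping is held as
# data, and a single generic function executes any mapping. Structure is an assumption.

TRANSFORMS = {
    "strip": str.strip,
    "title": str.title,
    "upper": str.upper,
}

mapping = {
    "source": "crm_customer",
    "target": "customer",
    "columns": [
        {"from": "CUST_NO",   "to": "customer_id",  "transform": "strip"},
        {"from": "CUST_NAME", "to": "name",         "transform": "title"},
        {"from": "CTRY",      "to": "country_code", "transform": "upper"},
    ],
}

def run_mapping(mapping, rows):
    """Apply a column-level mapping definition, held as metadata, to source rows."""
    return [{c["to"]: TRANSFORMS[c["transform"]](row[c["from"]])
             for c in mapping["columns"]}
            for row in rows]

source_rows = [{"CUST_NO": " C1 ", "CUST_NAME": "acme ltd", "CTRY": "gb"}]
print(run_mapping(mapping, source_rows))
# The same engine runs every mapping; publishing a mapping into a store becomes a
# provisioning step rather than hand-written pipeline code.
```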

So in this model I use a catalog to share both data assets and business data products, use that to construct new business stores, and manage the entire provisioning and security process. Performance-wise, if I can have relatively low latency from sources (which a streaming CDC transformation engine can make possible), I’m creating an extremely active data estate, managed through an explicit set of definitions.

Testing is Cool

The other advantage here is that if I have these meta-models I can also do something that low/no-code platforms don’t do, namely

Testing

Because I can do three things (the test data side is sketched after this list):

  1. Build in reconciliation to ensure data consistency
  2. Use the schema description language to generate test data, for instance using Synthetic Data Generation that is within bounds, to help with secure, managed development and prototyping of data-driven solutions
  3. Use the schema description language to generate test data that is out of bounds to validate data quality and governance processes (e.g. date of birth that is in the future, or a country code that doesn’t exist)
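
Here is the test data sketch referred to above, assuming the SDL carries simple bounds (allowed values, maximum dates) alongside the types; the bounds format is my own invention for the purpose of the example.

```python
import random
from datetime import date, timedelta

# Illustrative sketch: using the schema description (field types plus simple bounds)
# to generate both in-bounds synthetic test data and deliberately out-of-bounds
# records for exercising data quality rules. The bounds format is an assumption.

VALID_COUNTRIES = ["GB", "FR", "DE", "US"]

schema = {
    "name": "Customer",
    "fields": [
        {"name": "customer_id",  "dtype": "string"},
        {"name": "country_code", "dtype": "string", "allowed": VALID_COUNTRIES},
        {"name": "date_of_birth", "dtype": "date", "max": date.today()},
    ],
}

def generate_row(schema, out_of_bounds=False):
    """Generate one synthetic row; optionally break the declared bounds on purpose."""
    row = {}
    for f in schema["fields"]:
        if f["dtype"] == "string" and "allowed" in f:
            row[f["name"]] = "ZZ" if out_of_bounds else random.choice(f["allowed"])
        elif f["dtype"] == "date":
            offset = timedelta(days=random.randint(1, 365 * 40))
            # out of bounds: a date of birth in the future
            row[f["name"]] = f["max"] + offset if out_of_bounds else f["max"] - offset
        else:
            row[f["name"]] = f"{f['name']}_{random.randint(1, 9999)}"
    return row

print("in bounds:    ", generate_row(schema))
print("out of bounds:", generate_row(schema, out_of_bounds=True))
```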

The point here is that having this sort of formal description of a system provides a platform for building a lot more extremely useful tools. This is the data-driven testing for traditional processes. If you are building event and streaming solutions you’ll need an async testing process in addition, but this can at least help create the data and streams that can be used for testing.

Collaboration needs these sorts of asset definitions

Now, part of the reason I think this is important isn’t just that I’m sick and tired of building out the base layers of data infrastructure, and of lots of this data being trapped in physical models, SharePoint or *shudder* Excel spreadsheets. As Collaborative Supply Chains and Collaborative Data Ecosystems become more critical to organizations, we really need ways to describe the data assets that we are sharing, including their security, that are more sophisticated than either “it’s that table there” or “look at this website”.

As I incorporate external data into my organization I want to track where it goes. Lineage is automatic in the model I’m talking about, right down to the field dependency level, and indeed to the field flow and calculation level. So if I get a revocation request I can actually propagate it fully, right the way through.
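
As a sketch of that propagation, with an illustrative lineage structure: field-level lineage is just edges from source fields to the fields derived from them, and a revocation is a walk over those edges.

```python
# Illustrative sketch: field-level lineage recorded as edges from source fields to
# the fields derived from them, so a revocation request can be propagated downstream.

# edge: source field -> fields that depend on it (directly or via a calculation)
lineage = {
    "external.partner_feed.email":  ["lake.customer.email"],
    "lake.customer.email":          ["mart.marketing.contact_email"],
    "mart.marketing.contact_email": ["report.campaign.audience_size"],
}

def downstream(field, graph):
    """Return every field that ultimately depends on the given field."""
    seen, stack = set(), [field]
    while stack:
        current = stack.pop()
        for dependent in graph.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                stack.append(dependent)
    return seen

# A revocation against the external feed tells us everything to purge or mask.
print(downstream("external.partner_feed.email", lineage))
```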

Pretty Please?

My point here is that lots of data work is a commodity that can be automated. Virtualization does this by putting a layer across the top, but that has known performance limitations, and those limitations don’t need to exist in a world where we can store petabytes of data in high-performance stores at relatively low cost.

These descriptions of data, which enable it to be defined (including its security model) and then automatically provisioned, provide the foundation for other elements such as commercial models, quality models and testing plans to be attached at the data asset level. By having languages that work outside of the stores, and leveraging catalogues as the mechanism of interaction, we can start shifting towards actual data asset management, and turn provisioning into a dial-tone for the business, both internally and for collaboration.

So if someone could just tool that up for me, that would be great. Or help me raise some startup cash; I think I might go and do it myself.
