Data Virtualization isn’t magic

Steve Jones · Nerd For Tech · Dec 10, 2021

I don’t like data virtualization. Why?

Because I’ve used data virtualization.

I don’t totally hate it. It has some great uses in prototyping and discovery, and if you use it as a smart cache against a known query and dashboard set it really can work well across disparate data sources. What I’ve found, though, is that even when I’ve implemented it and it’s successful as the project goes live, six months or so later I get a call:

“Steve, the data virtualization performance is terrible, can you help?”

Off I trundle and find, without exception, that free-form queries have been allowed. These tried to do complex joins across data sets in ways that couldn’t leverage the cache, and I’m quickly able to identify the problem:

“Your issue is right there, governor: the speed of light”

The client, concerned, asks me if I can fix the issue, and the answer is always the same:

“You’ve got two choices, gov. One: bring these disparate data sets together in one place to enable complex joins to be done in a reasonable time without latency being a problem”

This they don’t want to do; after all, it means admitting that they’ve used the virtualization technology incorrectly. So they ask for the second:

“Well, that involves either solving the problem of quantum entanglement at scale, or some sort of 11-dimensional solution that violates the Standard Model of physics as we know it”

The point is that if you have large-scale data sets in multiple places and want to allow free-form joins between them, the laws of physics are not going to be your friend. Data Virtualization is not a magic bullet that you can slap over a data estate and have it magically work for every scenario. It is a technology that can add value, but one where you’ve got to be super careful about how you manage it and what you are using it for. I had one client who very successfully used a data virtualization solution, because their disparate datasets were, in total, only about 40GB when compressed; in other words, the virtualization solution was really just an in-memory cache. Another client used it as the x-ref holder and the holder of PII information, again a very cacheable data set and a nice way to add a consistent security model across multiple sources while leaving the transactional data in those sources.

Data Virtualization solutions are normally sold as ‘magic’: a single pane of glass through which to view everything. They demo REALLY well, particularly when doing key-based queries between different data sets (Data Virtualization works well with good MDM x-refs), but when it becomes a platform for self-service, my experience is consistently that people start doing queries where latency between sources becomes an issue. For instance, a virtualization solution designed to do revenue roll-ups across regional data warehouses works great, as the pushdown is pretty simple. You can even do ranking of product sales, as it’s a simple post-aggregation.
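To make that pushdown-with-late-aggregation shape concrete, here is a minimal Python sketch of the roll-up: each regional warehouse answers its own aggregate query, and only the small per-region result sets travel to be merged. The region list and the run_query() helper are illustrative placeholders, not any particular DV product’s API.

```python
from collections import defaultdict

# SQL pushed down to every regional warehouse: the heavy scan happens there.
REGIONAL_QUERY = """
    SELECT product_id, SUM(revenue) AS revenue
    FROM sales
    GROUP BY product_id
"""

def run_query(region, sql):
    """Placeholder for running SQL against one regional warehouse and
    yielding (product_id, revenue) rows. Wire this to a real connection."""
    raise NotImplementedError(f"no connection configured for {region}")

def revenue_rollup(regions):
    """Late aggregation: merge the small per-region result sets centrally."""
    totals = defaultdict(float)
    for region in regions:
        # Only (product_id, revenue) pairs cross the wire, not raw rows.
        for product_id, revenue in run_query(region, REGIONAL_QUERY):
            totals[product_id] += revenue
    # Ranking product sales is a cheap post-aggregation over the merged totals.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```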

It’s working great, then someone decides to roll it out for ‘self-service’. The business users see a nice flat view (“awesome,” they think) and don’t think about the need to break their queries down into regional pushdown with late aggregation. So someone decides what they want to do is look at basket affinity by local time of day and identify regional differences in baskets; in other words, joining the transactional data between regions based on local time (not UTC timestamps) to identify affinities in one region that don’t exist in another. In each regional view those basket line items number in the billions; caching is not going to cut it.
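A rough back-of-envelope, with entirely illustrative numbers, shows why no cache saves that query: the join is at the level of individual line items and local time, so the rows themselves have to move.

```python
# Illustrative assumptions only, not measurements from any real deployment.
line_items_per_region = 3_000_000_000   # "billions" of basket line items
bytes_per_line_item = 64                # basket id, product, local hour, qty...

bytes_to_move = line_items_per_region * bytes_per_line_item
print(f"Data that must leave one region: {bytes_to_move / 1e9:,.0f} GB")
# ~192 GB per region, per query: far beyond what a DV cache can hold, and it
# has to cross the network before the cross-region join can even start.
```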

Kerrrr THUNK!

[Figure: speed-of-light round-trip times between the DV layer and the underlying stores]

DV works when it can make a SINGLE query, push it down to the store, and get a single response. It tries to do this by caching things like keys and core data from those stores, so it can do the query at the DV level and just send single requests to the various stores. The problem comes, as with REST, when a query becomes “chatty” and starts creating a large number of requests: a nested query that starts with an outer join, for instance, where that join result can’t be cached, so you have to start taking data from one store and sending it to another store. The DV tool can do this automatically; the problem is that very quickly the speed of light becomes your issue. I’m not even talking about latency or bandwidth here, I’m talking about the theoretical maximum transmission speed through a vacuum assuming infinite bandwidth.
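As a sketch of that physics-only floor, here is a quick calculation. The distance and request count are purely assumed for illustration, but even ignoring real network latency and bandwidth entirely, a chatty query fanning out tens of thousands of round trips between distant sites burns tens of minutes on light travel time alone.

```python
# Physics-only lower bound: time spent on light in a vacuum, ignoring real
# network latency, bandwidth and query execution. Distance and request count
# are illustrative assumptions.
SPEED_OF_LIGHT_KM_PER_S = 299_792

def round_trip_seconds(distance_km: float) -> float:
    return 2 * distance_km / SPEED_OF_LIGHT_KM_PER_S

distance_km = 8_000   # e.g. a DV layer in Europe talking to a store in the US
requests = 50_000     # a "chatty" nested query fanning out request by request

floor_s = requests * round_trip_seconds(distance_km)
print(f"One round trip:   {round_trip_seconds(distance_km) * 1000:.1f} ms")
print(f"Query-wide floor: {floor_s / 60:.1f} minutes of pure light travel time")
# Roughly 53 ms per round trip and about 44 minutes in total, before a single
# byte of real query work, buffering or protocol overhead is counted.
```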

There are some queries that fundamentally won’t scale, and some that could be made to scale if someone designed them in a way that could. Most of the time the issues are in the latter group, but that doesn’t make them any less of an issue. I had one client where someone ran a perfectly reasonable query based on the single schema they had in front of them. They didn’t know it was federated data, and as a business user they certainly didn’t have the depth of SQL experience required to restructure their query to be efficient in that scenario. This, for me, is the biggest challenge with DV: in many cases I’ve found that it could theoretically have worked in the scenario, but in reality the user community did not have the deep, deep SQL experience required to construct queries in the ‘right’ way for DV, and certainly didn’t have the desire to gain those skills.

Today, techniques like Federated Analytics are being used for extremely complex scenarios where the data is kept separate, often for legal reasons. The teams doing this normally do have very deep data experience and are addressing significantly more challenging problems. Most of these problems wouldn’t allow a DV solution for privacy, regulation and IP-protection reasons, but they are enabling federated data sets to collaborate. As Collaborative Data Ecosystems and data sharing become more of a ‘thing’, we’ve got to be careful about the technologies and techniques that we use.

Data Virtualization can be a valuable tool. Use it for specific cases and specific problems, don’t try to use it as a general fix-all, and remember: in this day and age, 100TB isn’t copying, it’s caching.
