100 Terabytes isn’t duplication, it’s caching

Steve Jones · Published in Nerd For Tech · Nov 30, 2021 · 4 min read

A common mistake I see in enterprises is a ‘fear of replication’ when it comes to transactional data. You end up with people fully committed to a rule that data is stored in only one place. Then they use a virtualization technology to connect two different data stores, and discover that the laws of physics are not their friend.

As we get to 2022 it’s time to stop thinking of enterprise data replication between stores as something to avoid. I can go to Costco and buy 4 TB of SSD storage for $499, you can get a 1TB microSD card that fits even in my phone, and a 144TB home NAS comes in at under $5k.

In 2022, 1 TB of storage is nothing

1TB microSD card

That 1TB is $23 a month on S3, $20 a month on BigQuery, and $18.40 a month on ADLS in the ‘hot’ tier. I’ve said before that data is cloud’s dirty little secret, and yet people don’t seem to want to take advantage of what that means. So let’s be clear:

If you have multiple stores and can copy data between them, you should

“Can” here means that there is no legal, regulatory or other reason not to do it. The default should be to replicate data between stores. “Oh no,” I hear people scream, “but this means duplication, it means security problems,” and the ‘what if’ scenarios start being created.

This is why CDC (Change Data Capture) is so fundamental to a robust internal data ecosystem. What CDC needs to mean is not just the propagation of change, but also the propagation of security models and of revocation. If I build these mechanisms into my data marketplace, a core part of my internal collaborative data ecosystem, then those mechanisms become part of the standard facilities of the organization. I’m forced to treat security and revocation as a fundamental part of the architecture. If I pretend that data won’t be replicated, I con myself that I’ve handled security through isolation.
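To make that concrete, here is a minimal sketch of what a CDC event could look like if it carried security and revocation metadata alongside the change itself. The field names, the apply_event handler and the downstream store interface are all hypothetical, meant to illustrate the idea rather than any particular CDC product’s format:

```python
# A minimal sketch of a CDC event envelope that carries security and
# revocation metadata with the data change. All names here are
# hypothetical, not a real library's API.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Optional

@dataclass
class ChangeEvent:
    source_table: str                     # e.g. "orders" in the system of record
    operation: str                        # "insert" | "update" | "delete"
    key: dict[str, Any]                   # primary key of the changed row
    payload: Optional[dict[str, Any]]     # new row image (None for deletes)
    classification: str = "internal"      # security label that travels with the data
    allowed_groups: list[str] = field(default_factory=list)  # who may see the copy
    revoked: bool = False                 # set True to pull the data back everywhere
    emitted_at: datetime = field(default_factory=datetime.utcnow)

def apply_event(event: ChangeEvent, store) -> None:
    """Apply a change to a downstream cache, honouring revocation first."""
    if event.revoked:
        # Revocation propagates like any other change: every subscriber
        # that cached this row is told to remove it.
        store.delete(event.source_table, event.key)
        return
    if event.operation == "delete":
        store.delete(event.source_table, event.key)
    else:
        store.upsert(event.source_table, event.key, event.payload,
                     acl=event.allowed_groups, label=event.classification)
```

The point is that revocation travels down the same pipe as inserts and updates, so pulling data back from every cache is the same kind of operation as keeping those caches up to date.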

The reason I say ‘con’ is that at some stage you’ll find out that you need to bring the data together, or you’ll need a federated learning solution. And while I love federated learning, it’s not something I’d recommend as a ‘normal’ approach if you can just copy the data, because federated learning is hard and a table join isn’t. So at that stage you end up doing the worst of all things: a one-off replication of the data, or even worse a one-off copy with a batch feed. Odds are that when the need arises you’ll forget about all the excuses you used not to duplicate it, and you won’t have the infrastructure in place to automate the security and revocation mechanisms.

A key here is that I’m talking about transactional data. I’m not talking about logs, IoT streams, video feeds or other elements that can rapidly run into petabytes and where the value mostly still has to be extracted. I’m talking about transactional data, data that often compresses very well in formats like Parquet, so 100TB of data might take up less than 10TB of actual storage. If you then add storage tiering you could reduce the costs still further. Even if it’s a ‘whole’ $200 a month for that 100TB, you’d probably spend more than that on the first Zoom call discussing whether you can even get the data into the other environment. It means that 1PB of data can be cached for only $2,000 a month. $24,000 a year isn’t nothing, but when that data is required you’ll spend more than that in meetings, and doing a bulk upload of 1PB of data will tend to incur much larger charges than if you’d spread it over multiple months.
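For anyone who wants to check the arithmetic, here is a quick back-of-the-envelope sketch, assuming roughly $20 per TB-month for hot object storage and a 10:1 raw-to-Parquet compression ratio. Both figures are illustrative assumptions, not any provider’s price list:

```python
# Back-of-the-envelope cache cost, under two assumptions:
# ~$20 per TB-month for hot object storage and 10:1 Parquet compression.
PRICE_PER_TB_MONTH = 20.0   # USD, assumed hot-tier price
COMPRESSION_RATIO = 10.0    # assumed raw-to-Parquet ratio

def monthly_cache_cost(raw_tb: float) -> float:
    """Monthly cost to cache raw_tb of transactional data stored as Parquet."""
    stored_tb = raw_tb / COMPRESSION_RATIO
    return stored_tb * PRICE_PER_TB_MONTH

print(monthly_cache_cost(100))        # 100 TB raw -> ~10 TB stored -> $200/month
print(monthly_cache_cost(1000))       # 1 PB raw -> ~100 TB stored -> $2,000/month
print(monthly_cache_cost(1000) * 12)  # ~$24,000 a year
```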

My point here is that in a data sharing and data collaboration world, it’s more about getting the sharing and security model right than about the cost of raw storage. If someone requests access to “customer shipping information”, they are requesting a subscription to that feed into their environment. If you have already done the engineering to ensure the data is there, that security is automatic, that revocation is built in, and that the historical copy (by far the most costly in terms of transfer time and bandwidth) doesn’t need to be done, then the speed to value is massively compressed.
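As a rough illustration of that flow, here is a tiny, hypothetical sketch of a marketplace subscription where the feed has already been pre-cached. The Marketplace class, its method names and the feed name are invented for this example:

```python
# A hypothetical sketch of subscription-based access in a data marketplace
# where feeds are already replicated via CDC. Granting access becomes a
# metadata operation, not a bulk copy.
class Marketplace:
    def __init__(self) -> None:
        # Feeds already cached in this environment (hypothetical names).
        self.cached_feeds = {"customer_shipping"}
        self.grants: dict[str, set[str]] = {}

    def subscribe(self, feed: str, team: str) -> str:
        if feed not in self.cached_feeds:
            # Without pre-emptive caching this becomes a one-off bulk copy:
            # slow, expensive, and outside the standard security machinery.
            return f"{feed}: bulk copy required before {team} can start"
        # Data is already there, so subscription is just an access grant;
        # revoking it later means removing the grant and emitting a revoke event.
        self.grants.setdefault(feed, set()).add(team)
        return f"{feed}: access granted to {team} immediately"

print(Marketplace().subscribe("customer_shipping", "logistics-analytics"))
```

Because the historical copy already exists, granting the subscription is a permissions change rather than a data transfer, which is where the speed to value comes from.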

So when architecting a modern data infrastructure, not only should you not use “only one copy” as a mantra, you should actively consider replication as part of a pre-emptive caching strategy: one where you concentrate on the security and revocation challenges and make provisioning the simplest, rather than the most complex, part of the problem. If you need to provision a new store, for instance because you get IoT data and need to match it against the shipping data, then you’ve already got the mechanism to make that happen.

By focusing on enabling sharing rather than treating it as something to avoid, you can construct an internal ecosystem that is better able to collaborate. This also prepares you to collaborate externally, where the security and technical challenges will require these foundations and add more complexity.

This is not your data store


My job is to make exciting technology dull, because dull means it works. All opinions my own.