By Justin Norton from Bristol, United Kingdom (uganda2009–49) [CC BY 2.0], via Wikimedia Commons

3 Rules to Follow When Adopting Big Data in the Cloud

The gloves came off this week. At Gartner’s Catalyst conference in San Diego, the message in the big data track was loud and clear: go to the cloud *now*. From Carlie Idoine’s session, “From Data to Insight to Action: Building a Modern End-to-End Data Architecture”, to Svetlana Sicular’s “Implementing Data Lakes and the Ocean of Big Data in the Cloud”, to John Hagerty’s “Architecting Business Analytics in the Cloud: SaaS, PaaS and IaaS”, the analysts were remarkably consistent: Cloud is no longer an option in your data architecture; it is a requirement.

John Hagerty tweets about going to the cloud NOW during Gartner Catalyst 2016

But for many enterprises, the question is how? The classic (and justified) answer from most analysts is: “It depends”. Let’s set aside the start-ups of the world, who are starting from scratch and likely already heavily using the cloud. For most enterprises, that’s not the reality, and there is no switch to flip that suddenly puts a data lake in the cloud.

Data lakes in the cloud were all the buzz at Gartner Catalyst 2016

Most of the enterprises I speak to, and caught up with at the Pivotal booth at the conference, have some kind of legacy architecture on premises. One of the most pressing pain points is an appliance-based data warehouse: a capital-intensive and inelastic environment, the complete antithesis of the cloud. But changing is a risky proposition, and without a clear plan, it’s hard to see how the clouds on the horizon aren’t mirages in the desert.

For those enterprises, I recommend a “bridge” solution that gives them a path forward without completely unwinding critical business practices:

1) Obey the laws of data gravity: As Gartner analyst Angelina Troy articulated in her session, “One of the hardest things you can do in your hybrid cloud architecture is move data around”. Having worked at the leading WAN optimization vendor, I know that network bandwidth should not be taken lightly. Troy has even calculated the “ship it” threshold for moving data in bulk to the cloud, the point at which you put disks on a truck and physically drive them to the cloud provider. Net takeaway: Leave the data where it is. If it’s born in the cloud, great. If not, leave it on prem. The mix will change over time, so architect for both.

The point at which it makes more sense to FedEx your data to the cloud
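That threshold comes down to back-of-the-envelope arithmetic: transfer time grows with data volume divided by usable bandwidth, and past some volume, shipping disks is faster. The sketch below is illustrative only; the formula and numbers are mine, not Troy’s actual model.

```python
def transfer_days(data_tb, bandwidth_mbps, utilization=0.5):
    """Days to move data_tb terabytes over a bandwidth_mbps link,
    assuming only part of the link is usable for the bulk transfer."""
    bits = data_tb * 1e12 * 8  # decimal terabytes -> bits
    seconds = bits / (bandwidth_mbps * 1e6 * utilization)
    return seconds / 86400

# A 100 TB warehouse over a half-utilized 1 Gbps link:
print(f"{transfer_days(100, 1000):.1f} days in transit")  # ~18.5 days
```

At two to three weeks in transit, a courier with a box of disks (or a service built for exactly this) starts to look very attractive, which is the intuition behind the “ship it” threshold.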

2) Adopt software data warehousing: The nice thing about software is that it can generally run anywhere, as long as its hardware requirements are met. You can adopt software on premises and later move it to the cloud. You may be able to completely replace an on-premises data warehouse appliance with data warehouse software that runs on commodity hardware, or you can at least start to move new and less sensitive workloads to the new software-based system.

Wait, wouldn’t I just move those new workloads to a data lake? Well, “it depends”, but as the slide below from Sicular’s presentation illustrates (deliciously, I might add), the data warehouse remains relevant and complementary to data lakes. But since data warehouse appliances are so acutely un-cloud, it’s worth building a strategy for those workloads, which do have different requirements and will remain valuable.

Gartner, “Implementing Data Lakes and the Ocean of Big Data in the Cloud”, Svetlana Sicular, presented at Gartner Catalyst, August 2016

3) Mind the compute-storage debate for cloud-based data lakes: Sicular made an unapologetic prediction that the future of data lakes would entail the separation of compute and storage. Cloud offerings like Amazon S3 epitomize this type of architecture, so this prediction goes hand in hand with the call to go to the cloud *now*, but it also implies that Hadoop does not necessarily have to be the basis of that data lake. In fact, my colleague Jeff Kelly just wrote about how we are observing (and supporting) the rise of the S3-based data lake.

So, does that mean the future of data lakes is NOT Hadoop? Again, it depends. Another colleague, Jagdish Mirani, recently covered the debate between local and network-attached disk in Hadoop. There is a case to be made on both sides within Hadoop, which means the separation of compute and storage, in the public cloud or not, does not require abandoning Hadoop.
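One concrete way Hadoop accommodates this separation today is the S3A connector, which lets Hadoop and Spark jobs read object storage directly in place of local HDFS disks. A minimal configuration sketch follows; the property names are the real S3A ones, but the values are placeholders:

```xml
<!-- core-site.xml: credentials for Hadoop's S3A filesystem (placeholder values) -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```

With this in place, jobs can address `s3a://bucket/path` where they would otherwise use `hdfs://` paths, keeping Hadoop as the compute layer while the storage layer lives in S3.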

Gartner, “Implementing Data Lakes and the Ocean of Big Data in the Cloud”, Svetlana Sicular, presented at Gartner Catalyst, August 2016

With changes afoot, it’s time to evaluate your options and find a bridge from your current architecture to a future-state architecture. Expect change, and seek out solutions that offer flexibility in what you deploy and how you deploy data management software. Find partners with experience across domains to de-risk your transitions. Pivotal is here to help.