Data Lakes: The Lure of the Cloud

Data Lakes Look to the Clouds

You can be excused for conflating the concept of a data lake with the set of open source technologies collectively called Apache™ Hadoop®. Since Pentaho CTO James Dixon coined the term data lake in 2010, the vast majority of real-world implementations have been built using Hadoop, after all. And it’s not uncommon for a technology-related concept to become synonymous with the dominant technology used to implement it (e.g., server virtualization and VMware).

But that’s just what a data lake is: a concept. The key requirements of a data lake, in my opinion, are (1) it is capable of economically storing vast amounts of raw data, from numerous sources, in their native formats, (2) it provides some mechanism for data scientists and data analysts to explore, analyze and extract the data with relative ease, and (3) it provides a complete and effective mechanism for data governance. You can check out other definitions from Gartner, Whatis.com, and KDNuggets. Note that none of them say data lakes must be built using Hadoop.

Having said that, most existing data lakes are deployed on-premises in corporate data centers and are based on Hadoop, including the Hadoop Distributed File System (HDFS) for data storage. There’s good reason for this. Hadoop’s strengths directly align with the key requirements of a data lake as detailed above. Hadoop is relatively inexpensive (at least compared to traditional, proprietary data warehouses), it can store virtually any type and volume of data, and there are a number of tools for analyzing and extracting data stored in HDFS. There’s a ton of value in this approach, and I expect many enterprises will continue to use Hadoop and HDFS as the foundations of their on-premises data lakes.

There are other options out there, however, and it appears more than a few enterprise data pros are starting to explore them. According to a new report from 451 Research, more and more enterprises are turning to public cloud providers and cloud-based storage services to support data lakes. I can say, anecdotally at least, we at Pivotal are also seeing this trend develop, with a number of our customers using cloud-based storage alternatives to on-premises HDFS deployments. Some Pivotal customers leverage Greenplum’s external table support for Amazon Web Services’ S3 to connect S3-based data lakes with internal data for analysis. DataXu, a Boston-based marketing analytics company and Pivotal Greenplum customer, for example, maintains its 20 petabyte data lake in the cloud on S3.

The Lure of the Cloud

There are a number of reasons why companies like DataXu look to the cloud to support data lake deployments. For one, it allows enterprises to trade large, upfront CapEx for smaller, ongoing OpEx (like pretty much all cloud services). It also removes the need to hire large teams of Hadoop specialists, who are in perpetual short supply, to deploy and administer on-premises deployments.

Perhaps the most compelling reason to leverage the cloud to support data lake deployments is that as more data is created in the cloud, it is simply easier to explore and analyze it there rather than moving it to an on-premises data lake. This is sometimes referred to as the Law of Data Gravity, and it is growing stronger every day. IDC estimates the SaaS market will top $50 billion by 2018, making up over a quarter of the entire enterprise applications market. That’s a lot of applications generating data in the cloud.

But why not just use Hadoop and HDFS in the cloud to store and analyze all that data? As the 451 report states, more and more data pros are also turning to alternative cloud-based storage services, such as S3, Microsoft’s Azure Blob Storage, and Google Cloud Storage, rather than HDFS in the cloud, for their data lakes. 451 expects this trend to continue largely for two reasons. The first is cost. From the report:

“… it costs significantly less to store data in S3 (to use AWS as an example) than HDFS running on EC2, not least because HDFS requires storing three copies of each block of data for resiliency, while S3 offers additional cost benefits in terms of automated backups and file compression. The separation of compute and storage also has the potential to reduce compute costs since users only pay for the compute resources they consume as and when they analyze the data.”
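To make the replication point in the quote concrete, here is a minimal Python sketch of the arithmetic. It assumes the three-copies figure from the report and uses DataXu’s 20 petabytes as the logical data size. The function names and the placeholder unit price are illustrative assumptions, not real AWS pricing, and the sketch ignores the cost of the EC2 instances themselves.

```python
def raw_storage_tb(logical_tb: float, replication_factor: int = 3) -> float:
    """Raw capacity consumed when every block is stored `replication_factor` times."""
    return logical_tb * replication_factor


def monthly_storage_cost(logical_tb: float, price_per_tb_month: float,
                         replication_factor: int = 1) -> float:
    """Monthly storage bill for the raw capacity behind `logical_tb` of data."""
    return raw_storage_tb(logical_tb, replication_factor) * price_per_tb_month


# Assumption: 20 PB of logical data (the DataXu figure), expressed in TB.
LOGICAL_TB = 20_000
# Assumption: a placeholder price of 1 cost unit per TB-month, NOT a real quote.
# Using the same unit price on both sides isolates the effect of replication.
UNIT_PRICE = 1.0

# HDFS on rented disks: the user provisions and pays for all three copies.
hdfs_cost = monthly_storage_cost(LOGICAL_TB, UNIT_PRICE, replication_factor=3)
# Object storage: the user is billed on the logical size of the stored data.
object_store_cost = monthly_storage_cost(LOGICAL_TB, UNIT_PRICE, replication_factor=1)

print(f"HDFS raw capacity: {raw_storage_tb(LOGICAL_TB):,.0f} TB")
print(f"Cost ratio (HDFS / object storage): {hdfs_cost / object_store_cost:.1f}x")
```

At an equal price per terabyte, the self-managed HDFS layer pays for three times the capacity. Real bills will differ once actual storage prices, compression and compute are factored in, so treat the ratio as a way to reason about the replication overhead only.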

The second reason is scalability. Again from the report:

“While Hadoop is inherently scalable, HDFS relies on local storage, which means that scaling HDFS in the cloud requires manual configuration and management of associated storage. In comparison, cloud storage is designed to automatically scale as more data is added, without any direct user involvement.”

Both are solid, reasoned arguments for using cloud-based storage services rather than HDFS in the cloud for data lake deployments. And as cloud data gravity grows stronger, we’ll see more and more data lakes and associated analytics processes move to the cloud too.

There’s Got to Be a Catch

But there are also drawbacks to this approach.

One is performance. As the 451 report points out, there are trade-offs when choosing cloud-based storage services over HDFS for data lake purposes. Unlike Hadoop and HDFS, cloud-based storage services such as S3 separate compute from storage. From the 451 report:

“… there are some potential advantages of using HDFS in the cloud, rather than analyzing data in cloud storage. Performance is the primary reason, with the separation of compute and storage naturally eradicating any data locality benefits that come with HDFS’s exploitation of local storage. Analyzing data in cloud storage therefore involves lower throughput and higher latency.”

For use cases that require high-throughput, low-latency analytical queries, cloud-based storage services may not fit the bill.

Another drawback is that there are not nearly as many tools for analyzing data natively in cloud-based storage services as there are for analyzing data in Hadoop. In the Hadoop ecosystem, there are dozens of SQL-on-Hadoop engines and databases for analyzing data stored in HDFS, each with its own strengths. Hive, the original data warehouse framework for Hadoop, is adept at large-scale ETL, for example, while Apache HAWQ is ideal for data science and machine learning workloads at scale.

The most popular cloud-based storage service, AWS S3, integrates with AWS’s own analytics tools and a handful of third-party analytics services, but lacks the robust analytics ecosystem of Apache Hadoop. Data scientists want choice when it comes to the tools they use to analyze and explore data.

Choose the Best Approach for the Job

So where does this leave us? Clearly there is no one-size-fits-all approach to supporting data lakes, and each approach has its advantages and disadvantages. I’m a firm believer in “the best tool for the job” philosophy and suggest enterprise data pros use the data lake approach that makes the most sense for each particular use case. Some criteria to consider are:

  • Where does the majority of data that you plan to fill the data lake with live?
  • Where will future sources of data most likely come from?
  • What are the analytical query performance requirements?
  • Which analytics tools do your data scientists typically use?
  • Are there security or compliance rules restricting how or where the data in question is stored?

Depending on the answers to these and other questions, enterprise data pros should be able to make informed decisions about which deployment options to go with to support data lakes: on-premises Hadoop, cloud-based Hadoop or cloud-based storage services. In most large enterprises, I suspect we will see a mix of all three deployment styles, as storage and analytics requirements can vary, sometimes wildly, by use case.
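As one possible way to make the checklist above repeatable, the sketch below encodes the five questions as boolean answers and maps them to one of the three deployment options. The field names and the order of the rules are illustrative assumptions drawn from the arguments in this post (compliance first, then data gravity, then performance and tooling). Treat it as a starting point for discussion rather than a decision procedure.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    """Answers to the checklist questions; all field names are hypothetical."""
    data_mostly_in_cloud: bool          # where does most of the data live today?
    future_sources_in_cloud: bool       # where will new data sources come from?
    needs_high_query_throughput: bool   # analytical query performance requirements
    tools_require_hdfs: bool            # do the team's tools expect Hadoop/HDFS?
    must_stay_on_premises: bool         # security or compliance restrictions


def suggest_deployment(uc: UseCase) -> str:
    """Return one of the three deployment styles named in the post."""
    # Compliance rules trump everything else.
    if uc.must_stay_on_premises:
        return "on-premises Hadoop"
    # Data gravity: with no current or expected cloud data, stay where the data is.
    if not (uc.data_mostly_in_cloud or uc.future_sources_in_cloud):
        return "on-premises Hadoop"
    # Data locality and the Hadoop tool ecosystem favor HDFS, even in the cloud.
    if uc.needs_high_query_throughput or uc.tools_require_hdfs:
        return "cloud-based Hadoop"
    # Otherwise cost and automatic scaling favor cloud storage services.
    return "cloud-based storage services"


# Example: a cloud-native workload with modest performance needs.
example = UseCase(
    data_mostly_in_cloud=True,
    future_sources_in_cloud=True,
    needs_high_query_throughput=False,
    tools_require_hdfs=False,
    must_stay_on_premises=False,
)
print(suggest_deployment(example))
```

Running the helper once per use case, rather than once per company, matches the expectation that a large enterprise will end up with a mix of all three styles.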

To learn more about how to get the best of both worlds — cloud-based data lakes with powerful on-premises analytical databases — check this blog post on Pivotal Greenplum integration with AWS S3.