Why and When to Avoid S3 as a Data Platform for Data Lakes

Farhan Siddiqui
Towards Data Science
8 min read · Mar 24, 2020


Data lakes are all the rage these days in large enterprises. A data lake is a single store for both raw copies of source system data and transformed data for use in tasks like reporting, visualization, advanced analytics, and machine learning.

Figure 1: Data lake ecosystem

Object stores (like S3) are becoming the platform of choice for data lakes for two main reasons:

  • They provide cheap, durable, and virtually unlimited storage in the cloud
  • They separate compute from storage, allowing either one to be scaled independently

In this blog post, I will dig deeper into some of the advantages of object stores that are responsible for their popularity as a platform for data lakes. I will also examine some of the often-underestimated challenges that plague the use of object stores for many data lake use cases.

Object store benefit: durable, cheap, and virtually unlimited storage

Object stores like S3 provide eleven 9’s of durability (99.999999999%) and four 9’s of availability (99.99%), and they manage to do it at virtually unlimited scale, at the unbelievably low price of around $23/TB/month. Contrast this with the on-prem data warehouse appliances (DWAs) that were quite popular just a few years back. DWAs cost tens of thousands of dollars per terabyte, excluding enterprise support, and multi-million-dollar contracts for appliances that supported only a few hundred terabytes were quite common.

When IT leaders mull over data platform choices for their data lakes, object stores’ $23/TB/month price tag is just too good to resist. It makes sense to use the cheapest storage available for the large volumes of data (from hundreds of terabytes to petabytes) that data lakes are expected to hold. Object stores like S3 seem (incorrectly, as we’ll see later in this post) to represent a thousand-fold pricing advantage over the DWAs still in use at many large enterprises.

Object store benefit: separation of storage and compute

The scale of storage required for a data lake makes it prohibitively expensive to use architectures like DWA in which storage and compute are coupled together in a single package. Decoupling storage and compute allows us to bring the right amount of on-demand compute to bear on the data that needs to be analyzed, at any given time. This significantly reduces the overall cost of data analytics solutions.

Figure 2: Separation of storage and compute
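To make the decoupling concrete, here is a minimal PySpark sketch (the bucket path and column name are hypothetical): a short-lived cluster attaches to data that lives permanently in S3, does its work, and is torn down, while the storage and its cost remain unchanged.

    from pyspark.sql import SparkSession

    # Spin up (or attach to) an on-demand compute cluster.
    spark = SparkSession.builder.appName("decoupled-compute").getOrCreate()

    # The data lives in S3 independently of any cluster; this is just one of
    # many compute layers that could read it.
    orders = spark.read.parquet("s3a://example-data-lake/raw/orders/")
    orders.groupBy("order_status").count().show()

    # Tearing down the cluster removes the compute cost; the data in S3 remains.
    spark.stop()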

All of these advantages are understandably crucial to fueling the popularity of S3 and other object stores as a platform for data lakes. But object stores come with many challenges that don’t get enough attention. This is especially true for RDBMS-sourced and frequently-refreshed (daily/hourly) data, which forms the bulk of high-quality data in an enterprise.

Object store drawback: immutability

All object stores, including S3, GCS, and Azure Blob Storage, are immutable. This means that a file, once written to the object store, can never be edited in place. Users can only replace it, either by hard-deleting the old file and creating a new one, or by logically deleting the old file and creating a new one (versioning).
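As a concrete illustration, here is a minimal boto3 sketch (the bucket and key names are hypothetical) of what immutability means in practice: objects can only be written whole, replaced whole, or deleted; there is no API to edit or append to an existing object.

    import boto3

    s3 = boto3.client("s3")

    # Writing to the same key replaces the entire object (or, with bucket
    # versioning enabled, logically supersedes the old version).
    s3.put_object(Bucket="example-lake", Key="orders/2020-03-24.csv",
                  Body=b"id,amount\n1,100\n")
    s3.put_object(Bucket="example-lake", Key="orders/2020-03-24.csv",
                  Body=b"id,amount\n1,150\n")

    # The only other option is removing the object outright; there is no call
    # to modify a byte range in place.
    s3.delete_object(Bucket="example-lake", Key="orders/2020-03-24.csv")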

When using S3 as a data platform for RDBMS-sourced, frequently-refreshed data, this leads to the creation of an unwieldy number of small files for each table.

Figure 3: The problem of many small files for RDBMS-sourced data

As inserts, updates, and deletes pile up over time, deriving the current state of the table becomes increasingly time- and compute-intensive. Most data scientists balk at this complex undertaking and instead request direct access to source systems, defeating the purpose of using a data lake in the first place.

Figure 4: Problems with using raw changesets on S3

U = Update, I = Insert, D = Delete
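To see why this replay is so painful, here is a minimal PySpark sketch (the paths, the id key, the op flag, and the changed_at timestamp are all hypothetical) of what an end-user has to do when only raw changeset files land on S3: read every file ever written for the table and reduce them to the latest surviving row per key.

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("replay-changesets").getOrCreate()

    # Every changeset file for the table must be read on every run.
    changes = spark.read.parquet("s3a://example-lake/changesets/orders/")

    # Keep the latest change per key, then drop keys whose latest change is a delete.
    latest_first = Window.partitionBy("id").orderBy(F.col("changed_at").desc())
    current_state = (
        changes.withColumn("rn", F.row_number().over(latest_first))
               .filter("rn = 1")
               .filter(F.col("op") != "D")
               .drop("rn", "op")
    )
    current_state.count()  # the cost of this grows with every new changeset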

Solution, part 1: partition the data

One solution that lifts the responsibility of merging changes from end-users is to partition the data and then rewrite the partition(s) targeted by the latest inserts, updates, and deletes. This eases the burden on end-users somewhat. However, performance problems remain, especially if the table has a large number of columns and only a subset of them is needed for the analysis.

Figure 5: Using partitions to merge changesets
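A minimal PySpark sketch of this approach, assuming the latest changes have already been merged into the affected partitions upstream and that the table is partitioned by a hypothetical order_date column:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-rewrite").getOrCreate()
    # Only overwrite the partitions present in the DataFrame being written.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    # Partitions already merged with the latest inserts/updates/deletes.
    merged = spark.read.parquet("s3a://example-lake/staging/orders_merged/")

    (merged.write
           .mode("overwrite")
           .partitionBy("order_date")
           .parquet("s3a://example-lake/curated/orders/"))

End-users now read pre-merged data directly, at the cost of rewriting whole partitions on every refresh.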

Solution, part 2: use columnar storage

The above solution can be improved by using a columnar format like Apache Parquet or Apache ORC. Columnar formats improve performance significantly through better compression and by limiting I/O to only the columns needed for the analysis. However, reading Parquet files from languages and tools (like Python, R, or Tableau) remains challenging.

Figure 6: Columnar storage helps with performance
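A minimal sketch of the benefit, assuming the hypothetical orders table from above is now stored as Parquet: a query that touches two columns reads only those two columns' data from S3, rather than every row in full.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("columnar-read").getOrCreate()

    # Parquet stores data column by column, so only the selected columns
    # (and their compressed chunks) are fetched from S3.
    orders = spark.read.parquet("s3a://example-lake/curated/orders/")
    (orders.select("order_date", "amount")
           .groupBy("order_date")
           .sum("amount")
           .show())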

Solution, part 3: simplify access with SQL interfaces

To further build on this solution, many engineers add a SQL interface (like AWS Athena, Presto, or Spark SQL) over the raw Parquet files. This makes data access much more streamlined for end-users, who can now issue SQL queries from their favorite programming languages and tools (like Python, R, or Tableau).

Figure 7: SQL interfaces simplify access to data in a data lake
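For example, with AWS Athena a Python end-user can query the curated data with plain SQL. A minimal sketch, assuming the pyathena client, a hypothetical lake.orders table registered in the catalog, and an S3 location for query results:

    from pyathena import connect

    # Athena runs the SQL against the Parquet files in S3; the user never
    # deals with files, partitions, or changesets directly.
    cursor = connect(s3_staging_dir="s3://example-athena-results/",
                     region_name="us-east-1").cursor()
    cursor.execute("SELECT order_date, SUM(amount) AS total "
                   "FROM lake.orders GROUP BY order_date")
    for row in cursor:
        print(row)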

Solution, part 4: add capabilities with Delta Lake

The above solution can be improved once more by using an open-source storage layer like Delta Lake. Delta Lake builds on the Parquet format by adding support for ACID (atomicity, consistency, isolation, durability) transactions, a lambda architecture that supports both streaming and batch use cases, and the ability to access data as it was at a prior refresh date/time (time travel).

Figure 8: Delta Lake adds transactions, simultaneous batch and streaming use cases, and time travel
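A minimal sketch of two of those capabilities, assuming hypothetical S3 paths, an id key column, and a Spark session already configured with the Delta Lake package:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-merge").getOrCreate()

    # Latest changeset to apply.
    changes = spark.read.parquet("s3a://example-lake/changesets/orders/latest/")
    target = DeltaTable.forPath(spark, "s3a://example-lake/delta/orders/")

    # ACID upsert: inserts and updates are applied in a single transaction.
    (target.alias("t")
           .merge(changes.alias("c"), "t.id = c.id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

    # Time travel: read the table as of an earlier version.
    previous = (spark.read.format("delta")
                     .option("versionAsOf", 1)
                     .load("s3a://example-lake/delta/orders/"))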

Problem solved?

Not so fast! The above architecture does represent a workable solution, and many enterprises pat themselves on the back for being able to engineer and operationalize such a solution. To be fair, it is quite an accomplishment to be able to pull this off at scale. However, this architecture is still plagued by many problems and has a lot of room left for improvement. Key issues with Delta Lake on top of S3 as the platform for a data lake include:

  • The architecture does not address how the changesets themselves are created, which can be quite challenging
  • It is quite complex to implement and support enterprise-grade, resilient Extract, Transform, and Load (ETL) solutions
  • Writing Parquet and Delta files requires additional compute, as well as the technical know-how to configure and operationalize cluster compute platforms like Apache Spark at scale
  • SQL interface access (via technologies like AWS Athena, Presto, or Spark SQL) requires additional compute infrastructure, adding to the overall complexity and cost of the solution
  • The complexity of the solution makes it expensive to support
  • S3 provides limited metadata and tagging capabilities
  • Integrating table- or row-level security on objects in S3, especially for large and complex enterprises, can be quite challenging
  • Last but not least, the performance of such a platform falls far behind the performance of the data warehouse appliances it set out to replace

Given the hidden compute and support costs, security integration, and performance issues, S3 as a data platform for RDBMS-sourced, frequently-refreshed data is a far cry from its $23/TB/month promise. Once we add up all the costs, it starts to creep into the range of thousands of dollars per TB per month. For that kind of money, there are much better alternatives available.

Cloud-scale managed analytics databases such as Snowflake, Google BigQuery, or Azure Synapse Analytics offer the best of both worlds. By separating storage and compute, they provide S3-comparable storage costs along with a managed data platform that abstracts away the complexity of implementing cloud-scale analytics solutions. They offer a TCO similar to S3-based Parquet/ORC/Delta Lake with an AWS Athena/Presto/Spark SQL interface, while boasting better performance, security integration, and schema support. They also reduce operational overhead while transferring technical and talent risk to a third-party vendor.

Figure 9: Advantages of managed analytics DBs over the “object-store + Delta Lake + SQL interfaces” solution

What about RDBMS-sourced, mostly static data?

RDBMS-sourced, mostly static data (i.e., data that does not change for weeks or months) does not incur as much ETL compute and support overhead as RDBMS-sourced, frequently-refreshed data. However, my recommendation would still be to prefer cloud-scale managed analytics databases over S3-based Parquet/ORC/Delta Lake storage for such use cases, since all the challenges and costs surrounding metadata management, security integration, and performance remain.

What about semi-structured data?

Most semi-structured data coming into the enterprise (via formats like XML, JSON, and CSV) has a reasonably stable schema and can be ingested into relational tables. In large enterprises, most such data is either ingested into analytics databases like AWS Redshift or accessed via SQL interfaces like AWS Athena, Presto, or Spark SQL over S3-based Parquet/ORC/Delta Lake storage. For this type of use case, my recommendation would be to consider managed analytics databases that separate storage and compute.

TCO should be your north star

In the end, solutions should be judged based on the total cost of ownership (TCO), considering the capabilities they bring to the table and risks inherent in the solution. If two solutions have similar TCO, but one of them provides better capabilities, it should be a no-brainer to align with that solution. Moreover, careful consideration should be given to technical and talent risks associated with solutions developed in-house. Generally, for large enterprises, it makes more sense to offload technology and talent risks to reputable vendor products, where reasonable.

Figure 10: Balancing TCO, performance, features, and risk

So when are object stores useful as data lake platforms?

Object stores (like S3) remain an excellent data platform for other use cases, such as semi-structured and unstructured data that cannot or should not (for cost or utility reasons) be ingested into cloud-scale analytics databases. For example, it wouldn’t make sense to ingest images, audio files, videos, emails, PowerPoint presentations, Word documents, or PDFs into managed analytics databases. Additionally, many of these cloud-scale distributed databases use object stores (like S3) as their data ingestion interface, and some even use object stores as internally-managed storage platforms behind the scenes.

Table 1: Recommendations

In a future post, we will discuss in-depth why cloud-scale managed analytics databases that separate storage and compute (like Snowflake, Google BigQuery or Azure Synapse Analytics), coupled with purpose-built CDC tools (like Qlik Replicate, Oracle GoldenGate or HVR CDC), are a better fit for RDBMS-sourced, frequently-refreshed data in enterprise data lakes.

Disclaimer: This is a personal blog post. Opinions expressed here represent my own and not those of my current or former employers. All content is provided for educational purposes only, without any warranty of suitability.


Technology leader with over 20 years of experience delivering analytics solutions for Fortune 100 companies.