dataxu’s journey from an Enterprise MPP database to a cloud-native data warehouse, Part 1

This is part 1 of a series of posts on dataxu’s efforts to build out a cloud-native data warehouse and what we learned in the process.

At dataxu, we deal with data collection, storage, processing, analysis, and consumption at massive scale. For this reason, we were an early adopter of the Hadoop framework. We quickly discovered that Hadoop and Hive alone were not sufficient for the growing needs of interactive analysis and querying. And so, about five years ago, we incorporated an MPP database as our warehouse solution.

The on-premises solution served us well as the cluster size expanded 16-fold over the course of five years. However, even with the MPP in place, we started to run into significant operational challenges:

  • While it is possible to expand an MPP database, doing so takes months of planning and execution, and capacity planning is particularly tricky. If the business experienced unexpected growth, it would be difficult to bring additional capacity online in a timely fashion; if growth slowed, we could be stuck with an oversized cluster for months before volume eventually caught up.
  • The database requires constant maintenance, both in hardware (like replacing failed disks) and software (like vacuuming the catalog). Moreover, the database constantly experienced failed processes, which required a DBA to perform recovery operations.
  • Ad-hoc query users constantly compete with production ETL loads for the fixed capacity, leading to unpredictable load times and SLA misses.

The MPP solution was clearly not a sustainable option for serving dataxu’s business needs. As such, we started to look for an AWS cloud-native solution. After reviewing several competing options, we settled on Apache Spark on EMR as our primary ETL solution and Amazon Athena as our primary query solution.

In this post, we will compare a cloud-native warehouse with an MPP database, with some focus on Spark as an ETL solution.

Cloud-native warehouse vs. MPP

First and foremost, performance is not the primary reason to choose Spark. As it currently stands, even a Spark cluster configured with an equivalent amount of CPU, RAM, and disk capacity is unlikely to beat the query performance of an MPP solution. There are many reasons why an MPP database will “beat” Apache Spark on paper:

  • EMR clusters run on VMs, while an on-prem MPP runs on highly tuned bare-metal servers.
  • EMR clusters run in a VPC, while an on-prem MPP has a dedicated network switch with 100 Gbit throughput.
  • EMR clusters use S3 as storage, while an on-prem MPP gets superior I/O performance from direct-attached disks in RAID 10.
  • MPP databases have years of query-optimization expertise behind them, while Spark still has a lot of catching up to do.
  • MPP databases allow for data locality. That means the data can be split into shards by a key, so each fixed node “owns” a shard. A well-chosen distribution key cuts down on so-called broadcasts (data movements over the network across nodes).

Reasons for choosing Spark

While an MPP may look like the better option on the surface, there are a number of reasons why here at dataxu we chose to implement Spark as our primary ETL solution.

  1. Separation of Compute and Storage. The storage tier is S3, with durability, near-infinite scalability, and recoverability built in. The compute tier is EMR, with a wide range of instance families to match CPU and RAM needs. We also use transient, dedicated EMR clusters for ETL loads, with no sharing of workloads: each cluster is dedicated to a single load, which makes execution times predictable and lets us guarantee SLAs. We never have to worry about striking the right balance between storage and compute, or about running out of either.
  2. Separation of Metadata and Data. The Hive metastore has evolved into the de facto open source standard for managing schema objects, supporting tables, views, partitions, and UDFs. We no longer need to vacuum the data catalog, which also removes a common locking condition found in MPP databases.
  3. Dynamic Elasticity. We can scale computing resources to dynamically match business and technical requirements. In addition, with EMR auto-scaling features, a cluster can be scaled up or down even while queries are executing, and newly provisioned capacity is put to work on in-flight queries immediately.
  4. Reduction of Maintenance. There is no longer a physical database to manage: no more backups, no more failed disks to replace, and no more shutting the database down for hardware or network maintenance.
  5. Resilience and Fault Tolerance. Even a minor issue in an MPP database can fail an entire query, while Spark retries failed tasks multiple times. This makes jobs significantly more resilient, particularly in a cloud environment where individual nodes can disappear at any time.
  6. Cost Efficiency. Spot pricing and Instance Fleets yield significant savings over a fixed-cost, on-prem solution.
  7. Open Source. Spark is the most active big data open source project, and the volume of contributions it receives dwarfs that of any MPP solution out there.
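
Points 1, 5, and 6 above can be combined in a single cluster launch. The following is a hypothetical sketch of a transient, single-purpose EMR cluster: the cluster name, instance type, release label, bucket, and job path are all illustrative, not dataxu’s actual configuration. The spark-defaults override shows where task-retry behavior (point 5) would be tuned.

```shell
# Hypothetical sketch: launch a transient EMR cluster dedicated to one
# ETL load; it terminates itself when the step finishes, so we pay only
# for the duration of the job.
aws emr create-cluster \
  --name "nightly-etl" \
  --release-label emr-5.13.0 \
  --applications Name=Spark \
  --instance-type m4.2xlarge \
  --instance-count 10 \
  --use-default-roles \
  --auto-terminate \
  --configurations '[{
      "Classification": "spark-defaults",
      "Properties": {"spark.task.maxFailures": "8"}
    }]' \
  --steps Type=Spark,Name=etl,Args=[--deploy-mode,cluster,s3://my-bucket/jobs/etl.py]
```

Because the cluster exists only for this one load, there is no contention with ad-hoc users, and sizing it is a per-job decision rather than a multi-month capacity plan.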

Electrical wiring: A visual analogy

For simplicity’s sake, we can compare MPP vs. the cloud-native warehouse to different methods of electrical wiring:

The left side is a series circuit with no switches. Capacity is fixed, must be sized for peak load, and is always on. Bringing in a new workload means less throughput for every existing workload; such is the case with an MPP solution.

On the other hand, the right side is a parallel circuit with switches. Capacity is elastic: you use, and pay for, capacity only when needed. Bringing on a new workload does not diminish or interfere with any existing workload; such is the case with our cloud-native warehouse.

Cloud-native warehouse: Endless possibilities

With a cloud-native warehouse, there is no longer a monolithic database stack. The database is decomposed into:

  • Storage (open file formats on S3). File formats are open, like Parquet or ORC, so data is no longer locked into a proprietary format and can be queried from many different platforms and query engines.
  • Compute (EMR, Athena, Redshift Spectrum), with pluggable query engines like Spark or Presto, as well as platforms like Qubole.
  • Metadata (the schema layer), which can be hosted on RDS/Aurora or the AWS Glue Data Catalog.
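
What makes this decomposition work is that every layer speaks a shared convention. A minimal sketch of one such convention is the Hive-style `column=value` partition layout on S3, which Spark, Presto, and Athena all understand, so the same files can be queried from any engine once the partition is registered in the metastore. The bucket and table names below are hypothetical.

```python
from datetime import date

def partition_path(bucket: str, table: str, dt: date, part: int) -> str:
    """Build a Hive-style partitioned S3 key for one Parquet file.

    The `dt=YYYY-MM-DD` directory segment is the partition key; engines
    prune partitions by matching on this path convention, so a query
    filtered on `dt` only reads the matching prefixes.
    """
    return f"s3://{bucket}/{table}/dt={dt.isoformat()}/part-{part:05d}.parquet"

# Hypothetical example: one file of one daily partition.
print(partition_path("dataxu-warehouse", "impressions", date(2018, 3, 1), 0))
# → s3://dataxu-warehouse/impressions/dt=2018-03-01/part-00000.parquet
```

Because the layout lives in S3 rather than inside any one engine, swapping or adding a compute layer never requires moving the data.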

Each of the component layers has many options that allow you to piece together different systems with different characteristics, suitable for a variety of use cases.

dataxu’s Cloud-native warehouse:

Performance is better with Spark

And finally, let’s take a look at the performance benefits Spark brings to the table:

  1. As we iterated through versions of Spark, from 1.6 to 2.0 to 2.2, the common thread was a significant performance improvement with each release, see discussions here and here. With this track record, we are confident that the Spark project will continue on this trajectory.
  2. The separation of compute and storage is huge. A single query run as the sole workload will perform better on an MPP database than on Spark on EMR; in reality, though, the MPP database is almost always shared among a variety of competing workloads, resulting in much slower performance than a dedicated EMR cluster or Athena.
  3. And last but not least, with dynamic elasticity, the option to scale out to meet SLA and performance requirements is always at your fingertips.
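
As a concrete illustration of that elasticity, EMR lets you attach an auto-scaling policy to an instance group so the cluster grows when YARN memory runs low and pays for the extra nodes only while they exist. The sketch below is hypothetical: the cluster and instance-group IDs are placeholders, and the capacity limits and thresholds are illustrative rather than dataxu’s actual settings.

```shell
# Hypothetical sketch: scale an EMR task instance group out by 4 nodes
# whenever available YARN memory drops below 15%, up to a ceiling of 40.
aws emr put-auto-scaling-policy \
  --cluster-id j-XXXXXXXXXXXXX \
  --instance-group-id ig-XXXXXXXXXXXX \
  --auto-scaling-policy '{
    "Constraints": {"MinCapacity": 4, "MaxCapacity": 40},
    "Rules": [{
      "Name": "ScaleOutOnLowMemory",
      "Action": {"SimpleScalingPolicyConfiguration": {
        "AdjustmentType": "CHANGE_IN_CAPACITY",
        "ScalingAdjustment": 4,
        "CoolDown": 300
      }},
      "Trigger": {"CloudWatchAlarmDefinition": {
        "ComparisonOperator": "LESS_THAN",
        "EvaluationPeriods": 1,
        "MetricName": "YARNMemoryAvailablePercentage",
        "Namespace": "AWS/ElasticMapReduce",
        "Period": 300,
        "Threshold": 15.0,
        "Statistic": "AVERAGE",
        "Unit": "PERCENT"
      }}
    }]
  }'
```

A symmetric scale-in rule on the same metric would shrink the cluster back when the pressure subsides.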

Keep an eye out for the second post in this series, where we will take a deeper dive into how we “rewired the house” at dataxu.

Please post your feedback in the comments — what kind of solutions do you employ? If you found this post useful, please feel free to applaud and share!