When Linear Scaling Is Too Slow — Isolate Your Workloads

Paige Roberts
Feb 6, 2024


Agenda image with 1. What is the worst strategy to get performance at scale? 2. Useful strategies for achieving high performance at extreme scale. 3. A practical example of these strategies in use. 4. Takeaways, Next Steps, and Q and A.
Agenda for the Data Day Texas talk.

This is part 2 of my series of posts on software architecture strategies for high performance at extreme scale, based on a talk I gave at Data Day Texas recently. The previous post covered the one strategy that should be your last resort, but is usually the first, and sometimes the only, strategy in software designed for high-scale data processing.

I drew on knowledge from many years in this industry, including five years working at Vertica, plus a fair amount gained from working on an O’Reilly book for Aerospike alongside its architect and a lead database engineer from The Trade Desk, a company that uses both Vertica and Aerospike.

The main question the talk sought to answer was: What strategies do cutting-edge database technologies use to get eye-popping performance at petabyte scale?

The first good strategy I presented was workload isolation.

A long time ago, Vertica just did normal BI analytical workloads, like most data warehouse-type databases. But its query engine also handled data lake data, and a lot of Vertica’s customers wanted to use its fast, parallel engine to do machine learning. This was early days, when in-database machine learning was still considered blasphemy and the term “data lakehouse” had recently been coined but wasn’t widely used yet. (We called it Unified Analytics for a long time.)

Vertica product managers listened to their customers and built some pretty robust end-to-end machine learning functions into the database.

And no one used them.

Not even the customers who asked for them.

The Vertica PMs went back to those customers and said, “WTH? Why aren’t you using the capabilities you specifically asked for?”

Here’s the thing: Since they were already using Vertica to power bread-and-butter applications and/or BI dashboards that all the executives and a fair number of other co-workers used, the last thing they could afford to do was miss their SLAs and bog down their boss’ dashboard to train a machine learning model.

No one was willing to take the chance that the different workloads would interfere with each other. The few that did take that chance quickly backed out when they started seeing problems.

Lesson learned.

If you have more than one workload on the same data processing engine, isolate those workloads so they don’t interfere with each other.

The thing about workloads and users is that neither one wants to stand in line to use your software one at a time.

A bunch of people standing in line.
People do not stand in line politely to use software one at a time.

Vertica used subclustering: separate compute for each workload, sharing only the data, which lives in an object store, HDFS, or some other common storage location.

Once Vertica got subclustering for workload isolation working, some follow-on strategies made it even more powerful. When you isolate workloads, you no longer have to give everything generic, middle-of-the-road compute; you can give each workload the compute it ideally requires.

Now that you have all your workloads separated, each on its ideal compute infrastructure, you can tag incoming workloads in one way or another and route them to exactly the right compute automatically. The efficiency gains, and the cost and energy reductions, from workload routing are outstanding. The product management guys estimated a 3–5X ROI increase for the whole product from that one improvement.
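To make that concrete, here’s a minimal sketch in Python. The subcluster names, node counts, and instance types are hypothetical, and a real database routes on connection properties or query labels rather than a dict lookup, but the shape is the same: tag the workload, then map the tag to compute built for that workload type.

```python
from dataclasses import dataclass

@dataclass
class Subcluster:
    name: str
    node_count: int
    node_type: str  # compute profile suited to the workload

# Hypothetical subclusters, all reading the same shared object store,
# each sized for its workload type.
SUBCLUSTERS = {
    "etl":       Subcluster("etl", 4, "compute-optimized"),
    "dashboard": Subcluster("dashboard", 12, "small-general"),
    "app":       Subcluster("app", 8, "small-general"),
    "ml":        Subcluster("ml", 3, "large-ram"),
}

def route(workload_tag: str) -> Subcluster:
    """Route each tagged workload to its own compute; untagged
    work falls back to a default subcluster."""
    return SUBCLUSTERS.get(workload_tag, SUBCLUSTERS["app"])

# A dashboard query lands on many small nodes; a model-training
# job lands on a few RAM-heavy ones. Neither touches the other.
print(route("dashboard").node_type)  # small-general
print(route("ml").node_type)         # large-ram
```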

Machine learning workloads work better on a few large nodes with lots of RAM. Dashboards work better on lots of small nodes that can serve more concurrent users. Isolating each workload on its own separate compute provides non-interference to protect SLAs, ideal compute for each workload, and the ability to scale each workload independently.

Illustration of a cluster with a single shared object storage data store and 4 different workloads: 1. ETL/data ingestion and transformation 2. powering a dashboard 3. application support 4. advanced analytics and machine learning. Each subcluster has a different number and type of nodes, and the workload spikiness is also illustrated as different for each.
Isolate workloads on best compute for workload type and scale them independently.

Need more nodes for the data ingestion and processing subcluster because you added a new stream? Add some, just to that subcluster. No need to scale anything else.

Is the application the database drives suddenly way more popular? Scale just that subcluster up to handle as many users as needed.

Scale only what needs to scale.
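Continuing the hypothetical sketch above, independent scaling falls out almost for free: growth touches exactly one subcluster.

```python
def scale(name: str, extra_nodes: int) -> None:
    """Add nodes to exactly one subcluster; every other
    subcluster's compute (and bill) stays untouched."""
    SUBCLUSTERS[name].node_count += extra_nodes

scale("etl", 2)  # new ingestion stream: grow only the ETL subcluster
scale("app", 4)  # the app got popular: grow only the app subcluster
```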

As one of the folks attending my talk at Data Day pointed out, Vertica also used aggressive caching here, since object storage calls are not known for their performance. Caching just the data each workload needs, on the compute that runs that workload, made performance far better as well.
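Here’s the caching idea as a sketch, using Python’s built-in LRU cache as a stand-in. (Vertica’s Eon-mode version of this is called the depot; this isn’t its implementation, just the concept: each subcluster keeps a local copy of only the data its workload actually touches.)

```python
import time
from functools import lru_cache

def read_from_object_store(key: str) -> bytes:
    """Stand-in for a slow object-storage GET (S3, GCS, HDFS...)."""
    time.sleep(0.05)  # simulate the network round trip
    return b"column data for " + key.encode()

@lru_cache(maxsize=1024)
def read_cached(key: str) -> bytes:
    """Hot data is served from the local cache; only cold reads
    pay the object-store latency."""
    return read_from_object_store(key)

read_cached("sales_2024_q1")  # cold: fetched from object storage
read_cached("sales_2024_q1")  # hot: served from the local cache
```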

Concurrent workloads are only part of the story, though. When scaling becomes an issue, it can also be due to a big increase in the number of concurrent users. You can’t isolate users from each other, obviously, but you can make sure the part of the database that drives their application is isolated from every other demand. And you can give it optimal compute infrastructure.

Shows response-time comparisons between Snowflake and Vertica for 200 and 500 simultaneous users, separating the easiest 95% of queries from the most complex 5% (95th percentile and up). Vertica times for the easiest queries are 2–3 seconds, regardless of the number of concurrent users, and 6–7.5 seconds for the hardest 5%. Snowflake takes 25–35 seconds for the easier queries, and 35–52 seconds for the tough ones.
Published POC results comparing Snowflake and Vertica query response times on a cloud application with 200 and 500 simultaneous users.

A popular web application found that the more popular it became, the more Google BigQuery struggled with the number of concurrent users. Wanting to stick with Google Cloud, the company tried Snowflake and Vertica, both of which work fine on GCP. But it was important to keep the majority of queries under 3 seconds, no matter how many users were hitting the application simultaneously. Even the 5% of really complex searches needed to return in under 10 seconds.

Obviously, Vertica outperformed Snowflake, even with hundreds of concurrent users. (If it hadn’t, this POC wouldn’t have been published; no one ever shows you the tests where they don’t do well.)

As a call-back to my previous post, I can’t help but point out that Snowflake was using 24 nodes, while Vertica used 6. Snowflake autoscales, which is cool. But that’s all it does. That’s all it can do if there are more users, more workloads, or more data: just add more nodes. As I pointed out in the previous post, adding more nodes is very much NOT the best strategy for scaling, especially if it’s your ONLY strategy.

Lest you think I’m just dogging on Snowflake: it’s good software, and beautifully easy to use. I’ve got no beef with Snowflake (since I don’t work for Vertica anymore), except that it has a vulnerability that lots of software has right now. It bundles hardware and software and sells them both to you, like an old-school appliance.

If I were a developer at Snowflake and came up with a cool way to use a fifth as many nodes and get 10X the performance, so Snowflake would easily win that POC above, I’d get fired.

It would be catastrophic for Snowflake’s business model to make their software more efficient.

I’ve expressed this concern before, but this is not a Snowflake problem. It’s a problem inherent in the SaaS software model. I know, it’s easy and nice for folks who don’t want to have to do their own database administration. It’s great to just have to pay one bill, but long term, it’s a trap.

SaaS vendors buy the hardware from a public cloud like Amazon, Google, Azure, Alibaba, whatever, then they sell you their software plus the hardware plus a nice markup on that hardware. The more infrastructure their software needs, the more money they make.

So, there is a strong counter-incentive for SaaS software to use any of the strategies these posts discuss. “Throw more nodes at it” works great for your bottom line if you charge per node.
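To put hypothetical numbers on that counter-incentive (the per-node prices here are invented for illustration; the 24-versus-6 node counts come from the POC above):

```python
# Invented prices: the vendor rents nodes from the cloud at $2/hr
# and resells them inside the bundled service at $3/hr.
cloud_cost = 2.00  # $/node/hr the vendor pays
resale = 3.00      # $/node/hr the customer pays
margin = resale - cloud_cost

print(f"24-node deployment: ${24 * margin:.2f}/hr hardware margin")
print(f" 6-node deployment: ${6 * margin:.2f}/hr hardware margin")
# Efficiency that cuts 24 nodes down to 6 cuts the vendor's
# hardware margin by 4x. Hence the incentive problem.
```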

It doesn’t work so great, though, if your customers require high performance at scale.

Tune in next time for a strategy that is foundational for high performance data processing at extreme scale.

Same bat-time, same bat-channel.

(Yes, I’m old.)


Paige Roberts

27 years in data management: engineer, trainer, PM, PMM, consultant. Co-author of O’Reilly’s “Accelerate Machine Learning” and “97 Things Every Data Engineer Should Know.”