When Linear Scaling is Too Slow — The last strategy you should use.

Paige Roberts
4 min read · Feb 5, 2024


[Image: talk agenda slide: 1. What is the worst strategy to get performance at scale? 2. Useful strategies for achieving high performance at extreme scale. 3. A practical example of these strategies in use. 4. Takeaways, next steps, and Q&A.]
Agenda for the Data Day Texas talk

I spoke at Data Day Texas in Austin last weekend, and as always it was an honor, as much to give the talk as to listen to all the other great presentations and to meet and hang out with a lot of very cool folks in the data space. (Looking at you, Jesse Anderson, Hala Nelson, …) Since I’m making a job jump to ThatDot, I’ll be talking more about streaming graph in the future, and this talk will most likely never see the light of day again. Folks who attended seemed to get a lot out of it, though, so I thought I’d see how much of its essence I could fit into an article or several. A 40-minute talk doesn’t fit well in a single blog post, so I’ll break it up into one main point per post over the next few days.

All this insight comes from working for five years with a high-scale analytics database called Vertica (which got bought by OpenText, so it will have a new name soon), and from working on an O’Reilly book with some cool folks from Aerospike, including its architect, Srini V. Srinivasan. For an extra cool factor, one of the largest customers of both Vertica and Aerospike is The Trade Desk, and one of their lead database engineers, Albert Autin, is helping us write the Aerospike book. With all that knowledge to draw on, I tackled a big question:

How do you design for data processing performance at extreme scale?

Sometimes linear scaling just isn’t good enough. What strategies do cutting-edge technologies use to get eye-popping performance in petabyte-scale databases?

First off, I want you to think about some strategies for performance at scale that you know of. Consider the challenge. You have a customer who is scaling up to hundreds or thousands of users, or hundreds or thousands of terabytes of data, or both, and needs to execute highly demanding workloads at high speed. What strategies can you build into your software that will handle that?

Take your time, I’ll wait.

But, of course, you probably already peeked at the next section headline.

Everyone’s first thought for scaling — Throw more nodes at it.

If one of the strategies you thought of was “add more nodes” (or cloud instances, whatever), I am metaphorically slapping your wrist. That is possibly the worst strategy available to you, and should be your last resort, not your first thought.

“My cluster is bigger than your cluster.” — Techbro bragging line.

Once upon a time, I was briefing an analyst and mentioned that Vertica handled very high scale with exceptional performance for an analytical database. He asked, “How much data does your largest customer use with Vertica?”

I mentioned The Trade Desk. They were analyzing over a dozen petabytes with Vertica at the time. They’re over 15 PB last I heard.

He said, “Yeah, but I mean, how many nodes?”

“About 640 on Amazon when they use them all at once. Why?”

“I talk to people with thousands of nodes.”

Note: Not more data. Not more users. Not more demanding workloads.

Just more nodes.

Me, “Um … you know that’s not a good thing, right?”

Using huge numbers of nodes is inefficient, expensive, and SLOW.

If I told you I built a calculator with 5000 lines of code, you’d laugh at me.

If someone tells you they’re using 5000 nodes to manage a half petabyte of data, you need to be laughing at them just as loud.

More nodes = more network bandwidth required and more coordination between nodes, so sooner or later the network itself becomes the performance limit.
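
To make that concrete, here’s a rough back-of-envelope sketch. This is my own simplified model for illustration, not how Vertica or Spark actually meters traffic: assume the data is spread evenly across the cluster and a query needs an all-to-all exchange (a shuffle). The share of data that has to cross the network approaches 100% of the dataset as you add nodes, while the number of point-to-point flows grows roughly with the square of the node count.

```python
# Back-of-envelope model of shuffle traffic vs. cluster size.
# Assumptions (mine, for illustration): data is evenly distributed across nodes,
# and a query needs an all-to-all exchange (repartition / shuffle) of the dataset.

def shuffle_costs(total_tb: float, nodes: int) -> dict:
    cross_node_fraction = (nodes - 1) / nodes     # share of data that must leave its node
    network_tb = total_tb * cross_node_fraction   # data actually crossing the network
    flows = nodes * (nodes - 1)                   # point-to-point connections to manage
    return {
        "nodes": nodes,
        "network_tb": round(network_tb, 1),
        "cross_node_pct": round(100 * cross_node_fraction, 1),
        "flows": flows,
    }

for n in (9, 278, 1000, 5000):
    print(shuffle_costs(total_tb=500, nodes=n))   # 500 TB = half a petabyte

# nodes=9    -> ~444 TB on the wire,  88.9% of the data,         72 flows
# nodes=278  -> ~498 TB on the wire,  99.6% of the data,     77,006 flows
# nodes=5000 -> ~500 TB on the wire, ~100%  of the data, 24,995,000 flows
```

The total bytes moved barely change once you pass a handful of nodes, but the number of flows, and the coordination, retries, and straggler waits that come with them, explodes. That is a big part of why simply throwing nodes at the problem stops helping.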

As an example, there was a company having performance issues, and they wanted to know if Vertica could help. (I’m not naming that company because embarrassing customers is a career-limiting move.)

The Vertica guys were like, “Of course, let’s see what you have.” Sales engineers always say that. But they were very confident on this one, because the company was using 278 Spark nodes for their analytics.

When the company asked how many nodes they needed to set up for the proof of concept, the Vertica guys said, “Maybe 10?”

“Wait, no, you misunderstood. We want you to do a real proof of concept. Use ALL of our data, and run ALL of the analytics queries we’ve run in the last 6 months, so we can compare performance levels.”

“Right. 10 nodes.”

They ended up needing only 9, and far exceeded the company’s existing analytics performance.

[Image: 278 little boxes contrasted with 9 little boxes to show the vast difference, with annual AWS costs: 278 nodes for a year at $365,292 versus 9 similar AWS nodes at $14,833.]
Big clusters are the opposite of efficient

So, the company was spending 25 times as much money as they needed to be, and burning up 30 times as much energy. (I hear the rainforest crying.)
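
For the curious, those ratios come straight from the numbers on the slide above, with the assumption that energy use scales roughly with node count:

```python
# Quick check of the cost and energy ratios, using the AWS figures from the slide.
spark_nodes, vertica_nodes = 278, 9
spark_cost_per_year, vertica_cost_per_year = 365_292, 14_833   # USD per year on AWS

print(f"cost ratio:   {spark_cost_per_year / vertica_cost_per_year:.1f}x")  # ~24.6x, call it 25x
print(f"energy ratio: {spark_nodes / vertica_nodes:.1f}x")                  # ~30.9x, call it 30x
```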

All for lousy performance.

This is what I call node bloat.

More nodes = higher cost, more energy burned.

Adding more nodes is almost never the right solution, not unless you’ve already exhausted every other strategy you can think of. Just like efficient code, efficient use of nodes is the sign of a software architect who knows what they’re doing.

Now that we know what NOT to do, in the next post let’s look at some good strategies that actually work.


Paige Roberts

27 yrs in data mgmt: engineer, trainer, PM, PMM, consultant. Co-author of O’Reilly’s “Accelerate Machine Learning” and “97 Things Every Data Engineer Should Know”.