When Linear Scaling is Too Slow — Share nothing.

Paige Roberts
3 min read · Feb 7, 2024


Agenda slide. 1. What is the worst strategy to get performance at scale? 2. Useful strategies for achieving high performance at extreme scale. 3. A practical example of these strategies in use. 4. Takeaways, Next Steps, and Q and A.
Agenda for talk at Data Day Texas 2024

(Skip the first paragraph if you’ve already read the earlier posts in this series. Jump down to “Start”)

This is part 3 of my series on strategies for high-performance data processing at extreme scale, based on a talk I did at Data Day Texas 2024. The previous two posts covered the one strategy that should be your last resort, but is usually the first and sometimes the only strategy in software designed for high-scale data processing. The second of those posts also launched into the first good strategy, workload isolation.

The main question the talk sought to answer was: What strategies do cutting-edge database technologies use to get eye-popping performance at petabyte scale?

Start

The next strategy I recommended at Data Day for designing software for high performance at extreme scale is a shared nothing architecture. Both Vertica and Aerospike have it, and frankly, any distributed data processing system built since Hadoop should have this foundational strategy. They don’t all have it, but they should. Hadoop taught us that specialized nodes (edge, master, leader, whatever you want to call them) were not helpful. I have been surprised over the years to find otherwise smart distributed data processing systems like Presto still making that mistake.

One lesson learned from the last 15 years was that differentiated nodes are bottlenecks that can limit scaling.

Aerospike shared nothing architecture — Every node is identical

In a shared nothing architecture, every node is identical. Every node is a peer to every other node, so any node can initiate things like data ingestion or data queries. That means when you add more nodes, you get exactly that much more parallelism. More users can query because there are more nodes for them to query. More data can be written to the database because there are more nodes to handle the writes. The data going in or coming out is not channeled through any one particular node.

Shared nothing architectures give you true linear data volume and concurrency scaling.
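To make the bottleneck argument concrete, here is a toy model in Python. It is only a sketch under loud assumptions: the function names and the capacity numbers are made up for illustration, and it ignores network and coordination overhead entirely. It compares a cluster where every request must pass through one coordinator node with a cluster of identical peers.

```python
def coordinator_throughput(workers, per_node_ops, coordinator_ops):
    """Requests/sec when every request must pass through one coordinator.

    The cluster can never go faster than the coordinator, no matter
    how many workers sit behind it.
    """
    return min(coordinator_ops, workers * per_node_ops)


def shared_nothing_throughput(nodes, per_node_ops):
    """Requests/sec when any node can accept any request (idealized)."""
    return nodes * per_node_ops


if __name__ == "__main__":
    PER_NODE_OPS = 10_000     # hypothetical capacity of one node
    COORDINATOR_OPS = 40_000  # hypothetical capacity of the coordinator
    for n in (2, 4, 8, 16):
        print(
            n,
            coordinator_throughput(n, PER_NODE_OPS, COORDINATOR_OPS),
            shared_nothing_throughput(n, PER_NODE_OPS),
        )
```

With these invented numbers, the coordinator design stops improving once four workers saturate the coordinator, while the peer design keeps growing with every node added. Real systems will not be this clean, but the shape of the curve is the point.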

Both Vertica and Aerospike use a shared-nothing architecture. In one configuration, Vertica shares the data from a single attached storage location such as an S3 bucket. Aggressive caching is used to compensate for sharing that one thing, data, but all the rest of the architecture is still shared nothing.

Aerospike doesn’t even share that. Every node has the information to tell it where all the data is, but the data itself is spread across the cluster. That information is just used to guide retrieval to whichever node has the data. When Aerospike says the “correct” node in the illustration above, what they mean is, the node that has the data you want. Computationally, they all work the same.
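The idea that every node knows where the data lives, while the data itself is spread out, can be sketched in a few lines of Python. This is not Aerospike's actual algorithm. The partition count, the SHA-256 hash, the round-robin assignment, and the `Node` and `Cluster` classes are all hypothetical stand-ins chosen to keep the example small.

```python
import hashlib

NUM_PARTITIONS = 16  # hypothetical; real systems choose their own count


def partition_of(key):
    """Deterministically hash a key to a partition id."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


def build_partition_map(node_names):
    """Assign partitions to nodes round-robin (an illustrative choice)."""
    return {p: node_names[p % len(node_names)] for p in range(NUM_PARTITIONS)}


class Node:
    """Every node is identical: same code, same copy of the partition map."""

    def __init__(self, name, partition_map):
        self.name = name
        self.partition_map = partition_map  # where ALL the data lives
        self.store = {}                     # only this node's share of the data

    def owner_of(self, key):
        return self.partition_map[partition_of(key)]


class Cluster:
    def __init__(self, node_names):
        pmap = build_partition_map(node_names)
        self.nodes = {name: Node(name, pmap) for name in node_names}

    def put(self, entry_node, key, value):
        # Any node can take the request; it routes to the node owning the key.
        owner = self.nodes[entry_node].owner_of(key)
        self.nodes[owner].store[key] = value
        return owner

    def get(self, entry_node, key):
        owner = self.nodes[entry_node].owner_of(key)
        return self.nodes[owner].store.get(key)


if __name__ == "__main__":
    cluster = Cluster(["node-a", "node-b", "node-c"])
    owner = cluster.put("node-a", "user:42", {"name": "Ada"})
    # A different entry node finds the same record on the same owner.
    print(owner, cluster.get("node-c", "user:42"))
```

In this sketch the "correct" node is simply whichever one the shared map names as the owner of the key. No node is special: each holds the same map and runs the same code, and only the slice of data it stores differs.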

When every node can do everything that every other node can do, that’s the foundation of an application that can genuinely scale without limits.

That’s cool, but this blog post series is about how to go beyond linear scaling. Stay tuned for the next post on reverse linear scaling and how to get it.


Paige Roberts

27 yrs in data mgmt: engineer, trainer, PM, PMM, consultant. Co-author of O’Reilly’s “Accelerate Machine Learning” and “97 Things Every Data Engineer Should Know”