When Linear Scaling is Too Slow — Compress your data.

Paige Roberts
7 min read · Feb 8, 2024


Agenda slide: 1. What is the worst strategy to get performance at scale? 2. Useful strategies for achieving high performance at extreme scale. 3. A practical example of these strategies in use. 4. Takeaways, next steps, and Q and A.
Agenda from my talk at Data Day Texas 2024

(Skip the first paragraph if you’ve already read the earlier posts in this series. Jump down to “Start”)

This is part 4 of my series on strategies for high-performance data processing at extreme scale, based on a talk I did at Data Day Texas 2024. The first post talked about the one strategy that should be your last resort, but is usually the first, and sometimes the only, strategy in software designed for high-scale data processing. The second post launched into the first good strategy, workload isolation. The third post focused on building a foundation that starts with true linear scaling: a shared-nothing architecture.

The main question the talk sought to answer was: What strategies do cutting edge database technologies use to get eye-popping performance at petabyte scale?

Start

The strategy that really takes an application past linear scaling and into reverse linear scaling is one that Vertica (or whatever OpenText is going to call it now) has mastered — aggressive data compression.

Data compression can take your application way beyond linear scaling to reverse linear scaling.

This isn’t something that you think about often when you think about data processing performance, but how the data is stored makes a massive difference in how fast it can be retrieved.

Before I dive into that, though, we need a quick definition of terms. Most folks are familiar with near linear scaling and linear scaling. But when the Vertica developer team told me, years back, that their application had reverse linear scaling, I had to get them to define it for me.

And then I called them liars.

For years, I’d heard linear scaling as the goal, the end all, be all of distributed architecture. Everyone was striving to go from near linear scaling to true linear scaling. The idea that anything could leave that in the dust just didn’t compute. I had to see the data from some benchmarks before I believed them.

What exactly is reverse linear scaling?

Linear scaling means that the ratio between performance and scale is always the same when compute infrastructure stays constant. If it takes 1 second to query 1 million rows, it’s going to take 10 seconds to query 10 million rows, and 100 seconds to query 100 million rows. That perfect, stable ratio is the goal for most software.

Most folks are familiar with near linear scaling. Basically, it comes close to being identical to linear scaling until the scale starts getting really large. Then it gets worse and worse until it just can’t handle any higher scale. It might take 1 second to process 1 million rows, 11 seconds to process 10 million rows, and 120 seconds to process 100 million rows. The problem comes in when you get to very high scale. It might take 1800 seconds to process 1 billion rows, and just flat never complete for 10 billion rows. Don’t even think about trying to use it for 100 billion or 1 trillion rows. The software will likely choke and crash.

Reverse linear scaling is the exact opposite of near linear scaling.

Graph: “Performance in time to complete” on the vertical axis. Various data volumes on the horizontal axis — 1 million rows, 10 million rows and so on. 3 lines are on the graph. 1. Straight 45 degree angle — Linear. 2. Upward curve, following the line for some ways, then curving more and more upward as the data volumes get higher — Near Linear. 3. Downward curve, following the line fairly closely at first, but curving toward flat as data volumes increase — Reverse Linear.
Near linear, linear, and reverse linear scaling.

With reverse linear scaling, at low scale, it looks pretty much like linear, but the higher scale you go, the better the performance increments become until more scale makes little to no difference. For example, it might take 1 second to execute a query at 1 million rows, 9.9 seconds at 10 million rows, 17 seconds at 100 million rows, and 19 seconds at 1 billion rows. By the time you get to a trillion rows, performance is fairly flat. It may take 20 seconds for 10 billion rows, 21 seconds for 100 billion rows, and 21.5 seconds for 1 trillion. The higher the scale, the less the difference matters to performance.
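Put those made-up numbers side by side and the shapes jump out. Here’s a quick Python sketch using the same illustrative figures from the examples above (they show the shape of each curve; none of them are benchmark results):

```python
# Illustrative query times in seconds, using the made-up numbers from this post.
# They show the *shape* of each scaling curve; these are not benchmarks.

linear         = {10**6: 1,  10**7: 10,  10**8: 100,  10**9: 1000}
near_linear    = {10**6: 1,  10**7: 11,  10**8: 120,  10**9: 1800}
reverse_linear = {10**6: 1,  10**7: 9.9, 10**8: 17,   10**9: 19}

print(f"{'rows':>15} {'linear':>8} {'near linear':>12} {'reverse linear':>15}")
for n in sorted(linear):
    print(f"{n:>15,} {linear[n]:>8} {near_linear[n]:>12} {reverse_linear[n]:>15}")
```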

Not possible, right?

That was my initial reaction.

But here’s how it works.

Most databases store data to optimize storage space. They use a star or snowflake schema, or something similar that minimizes storing the same data more than once.

Old school data modeling: the star schema.
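Here’s a minimal Python sketch of that star-schema idea. The table and column names are invented for illustration, and a real schema would live in the database, not in dictionaries:

```python
# Minimal star-schema sketch: the fact table holds keys and measures,
# descriptive attributes are stored exactly once in small dimension tables.
# Table and column names here are invented for illustration.

dim_date = {1: {"date": "2024-02-08", "month": "Feb", "year": 2024}}
dim_product = {42: {"name": "widget", "category": "hardware"}}

fact_sales = [
    # (date_key, product_key, quantity, revenue)
    (1, 42, 3, 29.97),
    (1, 42, 1, 9.99),
]

# Minimal duplication, but answering any analytical question means
# joining facts back to their dimensions at query time.
for date_key, product_key, qty, revenue in fact_sales:
    print({**dim_date[date_key], **dim_product[product_key],
           "quantity": qty, "revenue": revenue})
```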

Vertica started with a fundamental shift in thinking. Data in an analytical database shouldn’t be stored for optimal storage efficiency, it should be stored for optimal analytical efficiency.

The one thing Vertica is most known for is pioneering the storage of data in columns, rather than the historical way of storing data — in rows. This avoids reading over a lot of columns the query just doesn’t need, and saves massive amounts of time. Any analytical data format worth its salt now stores data in columns: Parquet, ORC, and nearly every analytical database invented since 2007, when Vertica first came out of stealth.
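A rough sketch of the difference (a simplified layout, not Vertica’s actual on-disk format):

```python
# Row storage: every row carries every column, so a query that only needs
# "revenue" still has to read past all the other fields in each row.
row_store = [
    {"order_id": 1, "customer": "A", "region": "TX", "revenue": 10.0},
    {"order_id": 2, "customer": "B", "region": "CA", "revenue": 12.5},
]
total = sum(row["revenue"] for row in row_store)   # touches whole rows

# Column storage: each column is its own contiguous array, so the same
# query reads only the single column it actually needs.
column_store = {
    "order_id": [1, 2],
    "customer": ["A", "B"],
    "region":   ["TX", "CA"],
    "revenue":  [10.0, 12.5],
}
total = sum(column_store["revenue"])               # touches one column
print(total)
```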

But none of the other software got reverse linear scaling from just switching to columnar data storage. There’s more to it. You still have to do all these compute-intensive (aka SLOW) joins and sorts and such every time you do a query.

What storage model gives the best analytical performance? Data marts.

Data marts or materialized views (or whatever the particular software you’re using calls the same concept) are designed to serve data pre-prepared for a particular type of query. They’re optimized for that analytical demand, which is what gives them their performance.

Vertica basically decided it was silly to store all the data one way, then have to pull it out and re-arrange it every time you want performant analytics. That’s what an analytical database is FOR.

So, Vertica stores all its data that way.

Instead of tables, it stores data in objects called projections — collections of columns pre-optimized for a particular type of analysis. Each one is a pre-sorted collection of ALL the columns needed for a particular type of analysis and ONLY the columns needed for those analytics. That way, everything needed to answer the question is already there and sorted. No need to read anything that won’t be used, and no need to do a slow join with another data storage object to get the full answer.

Projections — data storage model in Vertica.

If the data needs to be aggregated, another slow, compute-intensive operation, then it’s aggregated as it’s ingested and stored. If you need a ton of columns from a bunch of different tables to train a machine learning model or something, then they’re all stored together.
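As a rough mental model, not Vertica’s actual storage format, a projection built for a revenue-by-date query might look like this: only the columns that query needs, aggregated on the way in, and stored sorted.

```python
from collections import defaultdict

# Incoming raw data, with columns this particular analysis never uses.
raw_events = [
    {"date": "2024-02-08", "region": "TX", "revenue": 10.0, "notes": "..."},
    {"date": "2024-02-07", "region": "CA", "revenue": 12.5, "notes": "..."},
    {"date": "2024-02-08", "region": "TX", "revenue": 4.0,  "notes": "..."},
]

# Build the "projection" at ingest time: keep only the needed columns,
# aggregate as the data arrives, and store the result sorted.
revenue_by_date = defaultdict(float)
for event in raw_events:
    revenue_by_date[event["date"]] += event["revenue"]

projection = sorted(revenue_by_date.items())
# At query time there is nothing left to join, sort, or aggregate.
print(projection)   # [('2024-02-07', 12.5), ('2024-02-08', 14.0)]
```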

Naturally, no DBA is trained in this type of data modeling, so Vertica has a little AI-driven data modeler that does most of the work for you.

If all this sounds like they’re storing the exact same data sorted different ways, or in different collections of columns multiple times — you’re right. It’s definitely NOT optimized to keep you from storing the same data over and over.

That sounds like Vertica probably needs an order of magnitude more data storage space than other databases. In practice, Vertica often uses 1/10th the data storage space of other databases. It almost always uses less than half.

Aggressive compression makes projections efficient.

One advantage of storing data in columns of a single data type, pre-sorted, is that with the right algorithm you can compress that data to extreme levels. Add an AI that always chooses the ideal compression algorithm, and you get some seriously small data.

Aggressive compression is the key.

Consider a date column. There are only 365 days in a year. The right compression algorithm can reduce 1 million rows of data to around 200 entries, one for every date that appears. 10 million rows might cover almost every date: around 300 entries. 1 billion rows can be reduced to 365 plus some markers, so say 400 entries. Bump that up to 1 trillion rows and you’re still looking at around 400 entries once compressed. 10 trillion, the same. 100 trillion, still the same. No matter how much data you have, the compressed form isn’t getting appreciably bigger.
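Run-length encoding is one simple way to see why a sorted date column barely grows with row count. This sketch is just a stand-in for whichever algorithm Vertica would actually pick:

```python
from datetime import date, timedelta
from itertools import groupby

def sorted_date_column(n_rows, n_days=365):
    """Generate n_rows of dates spanning one year, already sorted --
    the shape a pre-sorted projection column would have."""
    start = date(2024, 1, 1)
    return [start + timedelta(days=(i * n_days) // n_rows) for i in range(n_rows)]

def run_length_encode(values):
    """Collapse each run of identical values into a (value, count) pair."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

for n in (10**4, 10**5, 10**6):
    encoded = run_length_encode(sorted_date_column(n))
    print(f"{n:>9,} rows -> {len(encoded)} encoded entries")

# However many rows you add, the encoded column tops out around 365 entries,
# because there are only that many distinct dates to represent.
```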

The query completion time on a date field is pretty much the same for 1 billion rows as it is for 100 trillion.

Hence, reverse linear scaling. The more data, the less it makes any difference in performance.

This shifts the performance limitation from disk I/O, which isn’t getting much faster, to CPU speed, which is getting faster every day.

Still, if a few hundred milliseconds to a few seconds at very high scale, the normal response speed for a basic Vertica query, is far too slow for your customers, then spinning disk I/O, no matter how optimized, isn’t going to cut it.

Check out the next post for a data processing strategy that lets the software using it regularly beat a one-microsecond SLA, even at petabyte scale.
