Computing’s journey from ownership TO short-term-rental


By: Ilai Malka, Opher Dubrovsky and Itai Yaffe

At Nielsen Marketing Cloud, we use Serverless computing and cloud storage as a core component of our data processing platform. In this blog series, we will share what we believe to be high-value insights on how the computing world is changing and how you can leverage that to build incredible systems.


Specifically, we’ll explore the computing world’s journey from an ownership model TO a leasing model and eventually TO a short-term-rental model.

The change is a ground-breaking shift equivalent to the move from propeller airplanes to jets.
In some ways, it’s analogous to the move from buying an office, to renting an office, to renting office space by the day from WeWork, only when that space is needed.

As we will show, the change is even more profound in the computing world. Because the compute requirements of most systems fluctuate by orders of magnitude over time, the very short-term-rental model is extremely economical and, at the same time, very powerful.

The change in the economic model has immense implications for software design. The benefits it unleashes are:

  • Unbelievable cost reduction
  • Infinite scalability
  • Reduction in processing times
  • New services can be implemented cost-effectively, including some that were simply not feasible in the past.

Benefiting from the above requires rearchitecting your systems to take advantage of these new capabilities. We’ve already done it in our systems, and we’ll show you how.
We’ll discuss the approaches and the resulting benefits in cost and performance.

We’ll show you how we evolved our systems using this new paradigm and achieved 10x performance gains with simultaneous 10x cost reductions.

Ownership model problems — performance, performance, performance

Just a few years ago, the computing world was based on the ownership model. To run your software, you had to purchase machines with computational power and storage, install your software on them and run it.
The software was tied to the available computational power and storage.

In this model, if you installed a database that you predicted would eventually need 400GB of storage (in a few years) but only needed 50GB initially, you would have purchased a machine with 400GB or more. In the first year, your machine's storage would have been only 12.5% utilized (50/400). In the second year, if your data grew to 120GB, it would have been 30% utilized.

By the time your machine got close to 100% utilization (and probably even before, at around 80%), you would have gone out and ordered a new, bigger machine to support the increased load.

As a consequence, the storage on that machine was never 100% utilized — on average it would have been utilized at around 40% over the life of that machine.
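To make the utilization arithmetic concrete, here is a minimal sketch. Only the first two yearly figures come from the example above; the later years are assumed growth numbers, chosen purely for illustration and landing roughly at the ~40% lifetime average mentioned.

```python
CAPACITY_GB = 400

# Years 1-2 come from the example above; years 3-5 are assumed growth
# figures, used only to illustrate the lifetime-average calculation.
data_gb_per_year = [50, 120, 180, 230, 280]

for year, used_gb in enumerate(data_gb_per_year, start=1):
    print(f"Year {year}: {used_gb / CAPACITY_GB:.1%} of the storage utilized")

lifetime_avg = sum(data_gb_per_year) / (CAPACITY_GB * len(data_gb_per_year))
print(f"Average utilization over the machine's life: {lifetime_avg:.0%}")
```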

The leasing model solution is just a few clicks away...

In today’s world, the cloud allows you to work in a leasing model, i.e. lease machines easily and quickly and return them when done. This ability to scale up and out fast opens the door to amazing opportunities, but it also introduces a new problem: runaway costs.

Here is a true story about how an application that initially cost us $1,500 a month ended up costing $15,000 a month without us even noticing.
A few years ago we had a Spark application running on an on-premise cluster.
Over time the volume of the incoming data grew significantly and the app execution time got longer.
Scaling an on-premise cluster up or out is a real hassle: you need to order new machines, physically set them up, install all the relevant software, and so on. This usually takes months.
This wasn't a reasonable option, so we made an effort to find ways to make the application more efficient so it could continue to run on the same cluster. We succeeded, and since the hardware stayed fixed, the application's costs obviously stayed constant.

But when we had a similar problem with another application, running in the cloud on a 40-machine r3.4xlarge EMR cluster on AWS, we acted differently.
On the cloud, it’s very easy to scale the machines up or out — all it takes is a few clicks.
Instead of “wasting” time on brainstorming, implementing and testing new ideas, we did what most developers would have done and simply increased the cluster size. That was the quickest and easiest choice to make. The consequences followed, slowly.
We increased the cluster to 45 machines and the costs went up from $1.5K to $1.7K a month — “no big deal” we thought.
After a few weeks, we had to increase the cluster to 60 machines and the costs went up again — to $2.4K.
Iteratively applying this method, we eventually grew the cluster to 150 machines, and costs reached $15K a month: 10x the initial cost!

Notice how in these two examples we had the same initial problem. The difference was the setting we were operating in: one was an owned cluster, and the other was a cluster in the cloud.
The different settings led us to take entirely different approaches, solely for psychological reasons: in the cloud, it's easy to scale up a cluster; on an on-prem cluster, it's not!
This core difference between the two situations dramatically affected our judgment, and the effect on costs was resounding.

The opportunity of the short-term-rental model — cutting 90% off of costs

The best way to use the cloud is to alter your architecture to take advantage of the benefits the cloud offers.
The cloud lets you break down your system into functions, and you can run each function on a container specifically suited to it.
This, in effect, is a short-term-rental model: you only use the functions when they are needed, and you only pay for them when they are used.
You can see where we are going with this…

Let's explain this with a simple analogy: buses and taxis.
We have two kinds of vehicles, a bus and a taxi. Each one is optimized for transporting a different number of riders, and each one costs a different amount:

  1. The bus is expensive at $200/hour and can carry 50 people. This works out to $4 per rider, if the bus is full.
  2. The taxi is cheaper overall at $20/hour, but it can only carry 4 people. It costs $5 per rider when the taxi is full.

As we shall see, a system that mixes buses and taxis reaches a lower cost of ridership, even though taxis have a higher cost per rider than a full bus. The overall costs still come out much lower.

The daily transportation costs come out to:

  • Bus only: $1,200/day for carrying 88 people. That comes out to $13.6 per rider.
  • Bus and taxis: $480/day for the same 88 people. That works out to just $5.5 per rider.

The reason the mixed model is so much cheaper (a savings of $720/day, 60% cheaper) is that we use the right vehicle for the number of riders at each hour of the day. As a result, we are able to achieve much better efficiency and better costs.

This table shows an example of this:

The costs over the whole day can be easily viewed in the graph below.
Notice the waste (in red) occurring in the bus model during most of the hours of the day.
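Since the table and graph from the original post aren't reproduced here, below is a minimal sketch of the same calculation. The hourly rider counts are an assumption (two rush-hour peaks over a 6-hour service window), chosen so the totals line up with the figures quoted above; the dispatch rule simply picks whichever vehicle mix is cheaper for each hour.

```python
import math

# Hypothetical hourly demand over a 6-hour service window (assumed for
# illustration; 88 riders in total, with two rush-hour peaks).
riders_per_hour = [2, 41, 2, 1, 1, 41]

BUS_COST_PER_HOUR, BUS_CAPACITY = 200, 50   # $/hour, riders
TAXI_COST_PER_HOUR, TAXI_CAPACITY = 20, 4   # $/hour, riders

assert max(riders_per_hour) <= BUS_CAPACITY  # one bus per hour is always enough here

def cheapest_hourly_cost(riders: int) -> int:
    """Use a bus for busy hours and taxis for quiet ones: whichever is cheaper."""
    taxis_needed = math.ceil(riders / TAXI_CAPACITY)
    return min(BUS_COST_PER_HOUR, taxis_needed * TAXI_COST_PER_HOUR)

total_riders = sum(riders_per_hour)
bus_only = BUS_COST_PER_HOUR * len(riders_per_hour)            # the bus runs every hour
mixed = sum(cheapest_hourly_cost(r) for r in riders_per_hour)  # right vehicle per hour

print(f"Bus only: ${bus_only}/day, ${bus_only / total_riders:.1f} per rider")  # $1200, $13.6
print(f"Mixed:    ${mixed}/day, ${mixed / total_riders:.1f} per rider")        # $480, $5.5
```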

Scalability - the most important feature of the cloud

The computing solutions the cloud offers let us take advantage of the same characteristics as the short-term-rental model and the bus/taxi example above.
Let's assume we have a service that processes files. Files are streamed to the service at varying volumes throughout the day, and the service has to process them.
In the “old world”, we probably would have set up an on-prem cluster.
The cluster would have contained a predetermined number of machines, based on the maximum load we would have expected to handle (or a bit less if we were willing to compromise on performance).
In this architecture, it’s very likely that the cluster would have been nearly idle most of the day — as in the database example above.
However, just switching to the cloud with the same architecture wouldn’t have reduced the system’s costs. We would still have had to pay for the idle time of the machines.

Possible solutions that would drive the costs down in this kind of system are:

  • Using auto-scaling capabilities, which allow us to define policies that automatically increase or decrease the number of machines in the cluster.
  • Using serverless computing which is scalable by nature.
  • A combination of the two solutions. For example: keep a few machines at constantly high utilization by having them process a fixed, steady volume, and use serverless computing to handle additional traffic spikes when they occur.

Using such solutions allows us to avoid paying for idle machines and gets us closer to paying for computing power only when we need it.
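To make the serverless option concrete, here is a minimal sketch of such a file-processing function, assuming AWS Lambda triggered by S3 object-creation events. The processing step itself is a hypothetical placeholder, not our actual implementation.

```python
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Runs only when a new file lands in S3 -- there are no idle machines to pay for."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        process(body)

def process(data: bytes) -> None:
    # Hypothetical placeholder for the real file-processing logic.
    print(f"processed {len(data)} bytes")
```

With a setup like this, a traffic spike simply means more concurrent invocations, and a quiet period costs nothing.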

The full potential of the cloud doesn’t end with mere scalability.
A cloud architecture can also be used to deal with outliers. Going back to the file-processing service example above, we can leverage the cloud to handle special cases. For example, suppose we mostly get files of a similar size (~50MB each), but once in a while receive an outlier: a huge file of over 5GB.
In addition to the standard service handling "normal-size" files, we can have a separate service (also based on a serverless function) dedicated to dealing with extremely large files, which will only be invoked (and paid for) when it is needed.
An addition like this can make our system more flexible, predictable and cost-effective.
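As a rough sketch of how such routing could look (the 5GB threshold comes from the example above, but the function names and the asynchronous Lambda invocation are illustrative assumptions, not our actual design), a small dispatcher can check each file's size and hand outliers to the dedicated service:

```python
import json
import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

LARGE_FILE_THRESHOLD_BYTES = 5 * 1024 ** 3   # ~5GB, per the example above
LARGE_FILE_FUNCTION = "process-large-file"   # hypothetical function name

def dispatch(bucket: str, key: str) -> None:
    """Send normal-size files down the standard path and outliers to a dedicated function."""
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

    if size >= LARGE_FILE_THRESHOLD_BYTES:
        # Asynchronously invoke the dedicated large-file service; it only
        # runs (and is only paid for) when an outlier actually shows up.
        lambda_client.invoke(
            FunctionName=LARGE_FILE_FUNCTION,
            InvocationType="Event",
            Payload=json.dumps({"bucket": bucket, "key": key}),
        )
    else:
        process_normal_file(bucket, key)  # hypothetical standard path

def process_normal_file(bucket: str, key: str) -> None:
    print(f"standard processing for s3://{bucket}/{key}")
```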

Summary

If you have read this far, we assume that you either have cloud-based systems and are concerned about their costs, or you are considering moving to the cloud (and are still concerned about the costs…).

If that is the case, you are not alone. We too had several pricey cloud-based systems and managed to dramatically reduce their costs.

By now, we have done this on multiple systems we run, saving Nielsen over 2 million dollars a year!

In most of our projects, we managed to achieve 80%-95% cost reduction while at the same time improving the latency and speed!

We’ve found there is a common pattern in these projects, which can be easily reproduced.

Check out our next posts in the series to find out how you too can achieve significant system improvements and huge cost savings by applying the same techniques to your projects.
