What is an optimal node in GCP?

DV Engineering
DoubleVerify Engineering
7 min read · Jan 24, 2024

Written By: Daniel Chernovsky

Working with cloud services gives you amazing technical opportunities. You can start and stop clusters on different continents in a few clicks, create pre-configured databases, configure your own databases based on your specific needs, and manage your company's IAM policies. However, such a large number of possibilities can also be frustrating. Tools you used to be familiar with may have different naming conventions, server types may be split into families that are hard to understand, and cost models can be strange and unpredictable.

So what is “optimal node type”? First of all, let’s redefine the question: “What is the optimal node type FOR OUR COMPANY?” Now it sounds much better, and the answer, like everything in the engineering world, is “It depends!”.

The optimal node type depends on your company's requirements and your budget. Let's take a look at the 3 main characteristics that can help you make the decision:

  • Performance
  • Cost
  • Availability

You will have to define your KPI metrics. For us, they were:

  • RPS (requests per second) with a 16-core CPU at 60% load
  • Max latency of 20 ms
  • More than 200 nodes of the selected type available per zone
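To make the gate concrete, here is a minimal sketch of what such a KPI check could look like. The RPS threshold is a hypothetical placeholder (the article does not publish the exact target); the 20 ms latency limit and the 200-nodes-per-zone requirement come from the list above.

```python
# Hypothetical KPI gate. The RPS target is an illustrative assumption;
# the latency and availability thresholds come from the KPI list above.

KPI = {
    "min_rps": 50_000,          # placeholder target, not from the article
    "max_latency_ms": 20.0,     # from the article
    "min_nodes_per_zone": 200,  # from the article
}

def passes_kpis(rps: float, max_latency_ms: float, nodes_available: int) -> bool:
    """Return True only if a candidate node type meets every KPI."""
    return (
        rps >= KPI["min_rps"]
        and max_latency_ms <= KPI["max_latency_ms"]
        and nodes_available >= KPI["min_nodes_per_zone"]
    )

# A node type that is fast enough but too scarce still fails the gate.
print(passes_kpis(rps=62_000, max_latency_ms=14.2, nodes_available=120))  # False
```

The point of an all-or-nothing gate is that a node type which wins on raw speed but misses availability (or vice versa) is still disqualified.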

Optimal nodes based on performance

Performance could be an excellent entry point for our research. Let’s take a look at what we have:

Before choosing what covers our requirements, it is better to eliminate what we do not need: E2 is out because it uses shared-core CPUs, which can affect latency and throughput; M1–M3 are optimized for databases, not computation; A2 is optimized for GPU workloads; and unless your code is compiled for the Arm architecture, T2A is also irrelevant.

Now that the list looks cleaner, we have N1, N2, N2D, C2, C2D, and T2D.

So what now? We have 6 families, so which one should we choose? Inexperienced users (as we were) might say, "Let's take the average CPU frequency of each family and divide it by cost. The machine with the higher final score wins." Easy, no?

And here we come to the first pitfall: CPU speed doesn't truly represent full CPU potential. So, what does? The main properties are CPU speed, power consumption (which can cause throttling), CPU cache size and speed, instruction pipeline depth, IPC (instructions per clock), and the vCPU provisioning model (threads or cores). You can find more info about different CPU architectures at https://en.wikichip.org/wiki.
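For illustration, the naive frequency-per-dollar ranking described above can be sketched in a few lines. The frequencies and prices below are made-up placeholders, and the takeaway is that the "winner" of this score may still lose a real benchmark, because the score ignores IPC, cache, pipeline depth, and the vCPU provisioning model.

```python
# The naive ranking: average CPU frequency divided by hourly cost.
# All numbers are illustrative placeholders, not real GCP prices or specs.

candidates = {
    # family: (avg_ghz, usd_per_hour_for_16_vcpu)
    "N1": (2.3, 0.76),
    "N2": (2.8, 0.78),
    "C2": (3.1, 0.84),
}

naive_score = {name: ghz / cost for name, (ghz, cost) in candidates.items()}
best = max(naive_score, key=naive_score.get)
print(best)  # C2

# Pitfall: nothing here accounts for IPC, cache, throttling, or whether a
# vCPU is a full core or a hyperthread, so the ranking can be misleading.
```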

N1 uses the oldest processor generations, with older and less performant architectures. With this information, and our company's 2 main KPIs (performance and massive throughput), N1 is definitely not our best option, so we can remove it from the list.

We can now turn our attention to a discussion about platforms. You can see that N2 and N2D have 2 CPU types. Why? What are they? CPU types (platforms) represent different generations of specific CPU families. For example, N2 Cascade Lake is the 2nd generation Intel Xeon, and N2 Ice Lake is the 3rd generation, which can give you around 20%-30% performance improvement for the same price as Cascade Lake. Later, in the “availability section”, we will discuss why you do not always want to choose the latest generation as your main CPU platform.

In addition, N2 Ice Lake has better performance than C2 for a lower price. This is a good reason to remove C2 from our list: not only does it have weaker performance than N2 Ice Lake, but its current availability in GCP is also limited (we will talk about it later).

The Sysbench test shows that the average latency of N2 Ice Lake is much lower, and the number of events processed by N2 is much higher than by C2.

Here we come to the moment where we have to run our test based on company KPIs. However, before starting, you should know one small but extremely important thing. Each CPU can run with or without hyperthreading. What does that mean for you? Hyperthreading creates the "illusion" of an additional logical thread per core, which allows you to process ~40% more calculations per cycle. However, such power comes at the price of speed (a drop in CPU speed of around 20%-30%).

When is it appropriate to enable hyperthreading?:

  • When you have many parallel threads that switch states frequently (sleeping, waiting for a response, etc.), and
  • When the ability to process more requests/data/instructions is much more important to you than the ability to process faster (mostly relevant for web servers).

When should you consider disabling hyperthreading?:

  • When you have a heavy computation process,
  • When your application heavily relies on synchronization (read about sync/async applications), and
  • When in a latency vs. throughput dilemma, latency is more important.

As you can see, the number of processed events with hyperthreading is higher; however, without hyperthreading, the average latency is lower.
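The trade-off can be captured in a toy model using the article's rough figures: ~40% more work per cycle with hyperthreading, at roughly 25% lower effective core speed. Both numbers are approximations from the text, not measurements, and the base values below are arbitrary.

```python
# Toy model of the hyperthreading trade-off: ~40% more parallel work per
# cycle, ~25% lower core speed. Figures are the article's rough estimates.

def effective_throughput(base_rps: float, smt_enabled: bool) -> float:
    """Requests per second after applying the SMT gain and clock penalty."""
    if smt_enabled:
        return base_rps * 1.40 * 0.75  # more work per cycle, slower cores
    return base_rps

def effective_latency(base_ms: float, smt_enabled: bool) -> float:
    """Per-request latency; each request runs on a slower effective core."""
    if smt_enabled:
        return base_ms / 0.75
    return base_ms

print(effective_throughput(10_000, smt_enabled=True))  # 10500.0
print(effective_latency(10.0, smt_enabled=True))       # ~13.33
```

Even in this crude model you can see both effects at once: throughput goes up (10,500 vs. 10,000 RPS) while per-request latency also goes up (13.3 vs. 10 ms), which is exactly the latency-vs-throughput dilemma from the bullet lists above.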

Optimal nodes based on cost

1. Cost based on provision type

Cost rankings in GCP change every single day: the cheapest workload on Friday morning may not look so good 12 hours later. To begin, let's see what types of nodes we can get:

When choosing, you must decide what is good for your company/application. If you are running a stateless application and have a solid backup plan, you can use spot instances. This is the approach we decided to move forward with.
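As a rough sketch of why spot instances are attractive for a stateless workload, here is the arithmetic under an assumed 70% spot discount and a hypothetical hourly price. Both figures are illustrative assumptions, not quoted GCP prices; actual spot discounts vary by machine type, region, and moment.

```python
# Spot-vs-on-demand saving for one node. Both the hourly price and the
# discount are illustrative assumptions, not real GCP figures.

ON_DEMAND_PER_HOUR = 0.70  # hypothetical 16-vCPU on-demand hourly price, USD
SPOT_DISCOUNT = 0.70       # assumed discount; real discounts vary

spot_per_hour = ON_DEMAND_PER_HOUR * (1 - SPOT_DISCOUNT)
monthly_saving = (ON_DEMAND_PER_HOUR - spot_per_hour) * 730  # ~hours/month
print(round(monthly_saving, 2))
```

The catch, as the paragraph above notes, is that spot capacity can be reclaimed at any time, so this only works with a stateless application and a solid backup plan.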

The second factor that can affect the cost is a predefined vs. custom resource configuration. If you decide to use custom resources, you will pay extra for the same amount of resources. Below is an example of N2D-standard, which has 16 vCPUs and 64 GB RAM, vs. N2D-custom with the same amount of resources.

That's a $20 difference per node per month. Now think about an average high-tech company workload with 100 servers: you would pay an extra $2,000 monthly, or $24,000 annually.
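The fleet-level arithmetic behind those numbers, spelled out:

```python
# Scaling the predefined-vs-custom premium from the example above
# ($20/month per node) to a 100-server fleet.

CUSTOM_PREMIUM_PER_NODE_MONTHLY = 20  # USD, from the example above
FLEET_SIZE = 100

monthly_extra = CUSTOM_PREMIUM_PER_NODE_MONTHLY * FLEET_SIZE
annual_extra = monthly_extra * 12
print(monthly_extra, annual_extra)  # 2000 24000
```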

2. Cost based on location

The last factor is the region where you want to run your workload. Let's take a look at the same node (N2D-standard-16) in 2 different regions:

Both regions are located on the West Coast, but the cost difference is almost $80 per node per month! Based on our previous 100-server calculation, the annual difference is $96,000. Impressive numbers, don't you think?
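The same fleet-level math applied to the regional price spread:

```python
# Scaling the regional price difference from the comparison above
# (~$80/month per node) to the same 100-server fleet.

REGION_PREMIUM_PER_NODE_MONTHLY = 80  # USD, from the comparison above
FLEET_SIZE = 100

annual_difference = REGION_PREMIUM_PER_NODE_MONTHLY * FLEET_SIZE * 12
print(annual_difference)  # 96000
```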

Ok, now let us choose one of the regions and compare the CPUs that are still on our list (N2, N2D, T2D).

As you can see, N2 costs more than N2D, and based on our tests, N2 Ice Lake is still faster, with much more stable latency, than N2D. However, they can both process the same amount of RPS, and N2D's latency results satisfy us. As a result, N2 can be removed from the list.

Now for our last 2 candidates: N2D and T2D. The first thing you probably notice is that T2D with 16 vCPUs can process twice as many RPS as N2D, while the cost difference is only 35%!

The reason is the vCPU provisioning model: N2D with 16 vCPUs gets 16 logical threads, while T2D with 16 vCPUs gets 16 full physical cores (the equivalent of 32 threads if hyperthreading were enabled).
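Putting the two ratios from the text together (2x the RPS at 35% higher cost) gives a simple price-performance comparison. The absolute RPS and price values below are normalized placeholders; only the ratios come from the article.

```python
# Price-performance of the last two candidates using the article's ratios:
# T2D does ~2x the RPS of N2D at ~35% higher cost. Absolute values are
# normalized placeholders, not real benchmark or pricing data.

n2d = {"rps": 10_000, "cost": 1.00}  # normalized baseline
t2d = {"rps": n2d["rps"] * 2, "cost": n2d["cost"] * 1.35}

cost_per_1k_rps = {
    "N2D": n2d["cost"] / (n2d["rps"] / 1_000),
    "T2D": t2d["cost"] / (t2d["rps"] / 1_000),
}
print(cost_per_1k_rps["T2D"] < cost_per_1k_rps["N2D"])  # True
```

On paper, T2D delivers each unit of throughput for roughly two-thirds of N2D's price, which is why it looks like the obvious winner before availability enters the picture.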

So, this time the choice should be obvious: T2D. But before jumping to that conclusion, let's talk about node availability.

Optimal nodes based on availability

Different node types in different regions have different availability. You can check it here (https://cloud.google.com/compute/docs/regions-zones).
One of our business requirements is low response time for clients all over the world, and network time can be a game-changer. For that reason, we scale our clusters across regions so we can spin up workloads closer to clients. In addition, to improve availability, we decided to use sole-tenant nodes for the base workload plus spot instances for traffic peaks, and not all regions support T2D sole-tenant node pools.

Finally! To summarize

In our case, we decided to stick with N2D (Rome and Milan platforms for higher availability) and to keep watching T2D as a potential future candidate. However, as you can see, "change is the only constant in our lives", and therefore our next steps will be:

  • Creating automated tests for each node type that will run once in a while.
  • Preparing a plan for moving fast from zone to zone.
  • Researching how we can run generic workloads with different machine types.
