Beyond Shared Infrastructure: Exploring the World of Sole-Tenant Nodes in GCP
Sole-Tenant Nodes: Google Cloud’s Dedicated Compute Option
To run on shared infrastructure, or not to run on shared infrastructure?
That is the question. In Google Cloud, you can run your virtual machines on shared infrastructure (the default) or on dedicated infrastructure. Sole-tenant nodes (STNs) are the dedicated option. Because STNs are the same servers used for Google Compute Engine, you get all the benefits of Compute Engine plus dedicated hardware, more control, and the potential to capitalize the lease.
What are the benefits of Compute Engine?
Google Compute Engine is Purpose-Built and Secure
- Purpose-Built Infrastructure: Google builds custom servers that are optimized for flexibility, performance, and security to run all its services at scale. With Compute Engine, you have access to the same servers that Google uses to run its services.
- Secure by Design: Google’s custom servers are secure. They don’t include unnecessary components that can introduce vulnerabilities. Google pioneered the idea of zero trust, which is to trust nothing and verify everything, so all servers include a Titan chip, a Google-designed custom security chip that establishes a hardware root of trust from the hardware up to the applications. Custom data center designs, which include multiple layers of physical and logical protection, also ensure that the servers are protected like gold in Fort Knox.
- Software Defined Infrastructure: All the servers are connected using Google’s advanced Jupiter software-defined network. This network guarantees sub-100µs latency, which means adding incremental servers or storage on Google’s network delivers a proportional increase in capacity and capabilities. Google’s research shows that its Jupiter network uses 40% less power, improves throughput by 30%, and delivers 50x less downtime than the best alternatives Google is aware of.
Google Compute Engine has Differentiating Features
- Custom Machine Types: You can choose the specific amount of vCPU, vMem, and vDisk that you want. This gives you lots of flexibility to fit different sized VMs on your STNs.
- Right Size Recommender: Google monitors your VM usage and recommends a larger or smaller VM so you’re always using the right sized VM.
- Live Migration: This feature is not emphasized enough and it took center stage during the Spectre and Meltdown fiasco. According to this blog post by Ben Treynor Sloss, Google Engineering VP, VM Live Migration technology installed critical updates with no user impact, no forced maintenance windows, and no required restarts.
- VM Manager: OS patches, OS configurations, and OS inventory for your fleet of VMs can be managed easily with VM Manager.
Google Compute Engine Operations Follow SRE Principles
Given how heavily used Google services are, is it surprising that its services rarely go down?
Google introduced the concept of Site Reliability Engineering (SRE) to the industry and the way I like to describe SRE is “how Google does operations”. It’s a methodology and a culture. All of Google’s servers are operated using SRE principles so you can expect enhanced reliability, efficiency, and performance.
If you’re curious to learn more about SRE, Google published all its ideas in several books.
Sole Tenant Node Pricing is Compute Engine Pricing + 10% Premium
You pay for all the vCPU and memory resources just like you would for a Compute Engine VM.
The difference is that you pay for all the resources on the node upfront, which makes sense since you are reserving all those resources in advance. There is also a 10% STN premium. So really, you’re only paying a 10% premium for the additional benefits of dedicated hardware and more control.
STNs also get the same Compute Engine benefits of per-second billing, sustained use discounts, and committed use discounts.
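To make the arithmetic concrete, here is a minimal sketch of the pricing model described above. The two hourly rates are made-up placeholders for illustration only, not Google's actual prices, and the function name is my own. Only the 10% premium comes from the text.

```python
# Hypothetical hourly rates, chosen only for illustration.
# Look up real values on the sole-tenant node pricing page.
VCPU_RATE_PER_HOUR = 0.03        # assumed USD per vCPU-hour
MEMORY_RATE_PER_GB_HOUR = 0.004  # assumed USD per GB-hour
SOLE_TENANCY_PREMIUM = 0.10      # the 10% premium described above


def node_hourly_cost(vcpus: int, memory_gb: int) -> float:
    """Hourly cost of one node: every vCPU and GB of memory, plus the premium."""
    base = vcpus * VCPU_RATE_PER_HOUR + memory_gb * MEMORY_RATE_PER_GB_HOUR
    return base * (1 + SOLE_TENANCY_PREMIUM)


if __name__ == "__main__":
    # A node with 60 vCPUs and 240 GB of memory, priced at the placeholder rates.
    print(f"{node_hourly_cost(vcpus=60, memory_gb=240):.3f} USD per hour")
```

Note that the cost depends on the size of the node, not on how many VMs you place on it, which is why packing the node well matters.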
Advanced Maintenance Control is only available on STN
You can control when maintenance events occur on your STN hosts and there are several options.
One option is the Migrate within a node group maintenance policy. This option is useful when licenses are tied to hardware. Maintenance is limited to a fixed set of physical servers, similar to how maintenance would occur in an on-prem environment. To ensure enough capacity for live migration, Compute Engine reserves 1 “spare” node for every 20 nodes in the group.
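As a rough capacity-planning aid, the 1-in-20 rule can be sketched as below. The rounding behavior is my assumption (a partial block of 20 still gets one spare), and the function name is hypothetical, so check the documentation before relying on the exact numbers.

```python
import math

NODES_PER_SPARE = 20  # from the text: 1 spare node for every 20 nodes


def spare_nodes(group_size: int) -> int:
    """Estimate spare nodes, ASSUMING partial blocks of 20 round up."""
    if group_size < 0:
        raise ValueError("group_size must not be negative")
    return math.ceil(group_size / NODES_PER_SPARE)


if __name__ == "__main__":
    for size in (5, 20, 21, 40):
        print(size, "nodes ->", spare_nodes(size), "spare")
```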
Another interesting feature is the ability to simulate a host maintenance event to test how VMs would behave during a maintenance event.
Overcommit CPUs on Sole-Tenant Nodes
Sole-tenant nodes offer the ability to overcommit the CPU.
Since you have control over the hardware, you can schedule more vCPUs than are physically available. Resource utilization can increase because bursty workloads can use CPU cycles that idle VMs are leaving unused. There is also the potential to reduce per-VM licensing costs if you’re using a per-socket or per-core licensing model, since you run more VMs on the same node.
The overcommit feature is a big cost savings factor in sole-tenant nodes.
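Here is a small sketch of why overcommit changes the math. The overcommit ratio of 2.0 and the 4-vCPU VM size are hypothetical values I picked for illustration, and the helper names are my own. The actual limits are set by Compute Engine.

```python
def schedulable_vcpus(physical_vcpus: int, overcommit_ratio: float) -> int:
    """vCPUs you can hand out to VMs at a given overcommit ratio."""
    if overcommit_ratio < 1.0:
        raise ValueError("overcommit_ratio must be at least 1.0")
    return int(physical_vcpus * overcommit_ratio)


def vms_per_node(physical_vcpus: int, vcpus_per_vm: int,
                 overcommit_ratio: float = 1.0) -> int:
    """How many equally sized VMs fit on one node, counting vCPUs only."""
    return schedulable_vcpus(physical_vcpus, overcommit_ratio) // vcpus_per_vm


if __name__ == "__main__":
    # A 60-vCPU node running 4-vCPU VMs, without and with overcommit.
    print("no overcommit:", vms_per_node(60, 4))
    print("ratio 2.0:", vms_per_node(60, 4, overcommit_ratio=2.0))
```

This sketch ignores memory, which is not overcommitted here, so in practice the memory on the node can become the limit before vCPUs do.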
Sole-Tenant Nodes can have Tax Implications
There is potential to classify STNs as CAPEX.
Classifying STNs as CAPEX is a popular option for organizations that prefer CAPEX over OPEX. Since STNs are dedicated hardware, you can get a unique ID to identify the unique server and trace its lineage. Google provides a FAQ but recommends that you consult with an accountant on how to classify your payment.
Note: Google does not provide guidance about accounting.
Configuring Sole-Tenant Nodes in the Cloud Console is Easy
Select Create Node Group.
Give it a Name and Select a Region and Zone.
Select Create Node Template.
Give the Node Template a Name and select a node type. The different node options are described in the documentation. The node types take the form XX-node-YY-ZZ, for example c2-node-60-240.
- XX = machine type.
- YY = number of vCPUs.
- ZZ = amount of memory in GB.
The different machine types can be quickly identified by the first letter.
- c = compute optimized. This node has a high ratio of vCPU to memory.
- m = memory optimized. This node has a high ratio of memory to vCPU.
- n = general purpose Intel. This node has an Intel CPU and a balanced ratio of vCPU to memory.
- n2d = general purpose AMD. This node has an AMD CPU and a balanced ratio of vCPU to memory.
- g = graphics optimized. This node supports GPUs.
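The naming convention above is regular enough to parse. The following is a minimal sketch, assuming every node type follows the XX-node-YY-ZZ pattern. The `NodeType` class and `parse_node_type` function are names I made up, and the family labels are copied from the list above.

```python
import re
from dataclasses import dataclass

# XX-node-YY-ZZ, for example c2-node-60-240
NODE_TYPE_PATTERN = re.compile(
    r"^(?P<series>[a-z0-9]+)-node-(?P<vcpus>\d+)-(?P<memory_gb>\d+)$"
)

# Checked in order, so the more specific "n2d" prefix wins over plain "n".
FAMILY_BY_PREFIX = (
    ("n2d", "general purpose AMD"),
    ("c", "compute optimized"),
    ("m", "memory optimized"),
    ("n", "general purpose Intel"),
    ("g", "graphics optimized"),
)


@dataclass(frozen=True)
class NodeType:
    series: str
    vcpus: int
    memory_gb: int
    family: str


def parse_node_type(name: str) -> NodeType:
    """Split a node type name into its series, vCPU count, and memory size."""
    match = NODE_TYPE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized node type: {name!r}")
    series = match["series"]
    family = next(
        (label for prefix, label in FAMILY_BY_PREFIX if series.startswith(prefix)),
        "unknown",
    )
    return NodeType(series, int(match["vcpus"]), int(match["memory_gb"]), family)


if __name__ == "__main__":
    print(parse_node_type("c2-node-60-240"))
```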
Specify the number of nodes in your node group. There is an option to enable autoscaling of nodes. While autoscaling is generally recommended with shared infrastructure, you might want to think twice about enabling it for STNs. For licensing or accounting reasons, you might need static infrastructure instead of elastic infrastructure.
You have the option to control the maintenance settings. I will stick with the recommended Default maintenance policy.
According to the best practices, if you have more than one workload, create a separate node group for each environment (i.e. dev and prod) in dedicated projects. Then share the node groups with individual projects. This design pattern simplifies access control, optimizes resource utilization, and maintains separation between environments.
Once the node group is provisioned, it will show up on the main page. Click on the node-group-name.
On the node group page, you can start provisioning VMs on your new node group.
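If you would rather script these steps than click through the console, the same flow can be sketched with the gcloud CLI. The snippet below only builds and prints the commands and does not run them. The resource names, region, and zone are placeholders I invented, and the flags reflect my understanding of the gcloud reference, so confirm them with `--help` before use.

```python
import shlex


def node_template_cmd(template: str, node_type: str, region: str) -> list[str]:
    """Command to create a node template for a given node type."""
    return ["gcloud", "compute", "sole-tenancy", "node-templates", "create",
            template, f"--node-type={node_type}", f"--region={region}"]


def node_group_cmd(group: str, template: str, zone: str, size: int) -> list[str]:
    """Command to create a node group of `size` nodes from a template."""
    return ["gcloud", "compute", "sole-tenancy", "node-groups", "create",
            group, f"--node-template={template}",
            f"--target-size={size}", f"--zone={zone}"]


def vm_on_group_cmd(vm: str, group: str, zone: str) -> list[str]:
    """Command to create a VM that is placed on the node group."""
    return ["gcloud", "compute", "instances", "create",
            vm, f"--node-group={group}", f"--zone={zone}"]


if __name__ == "__main__":
    # All names below are hypothetical placeholders.
    commands = [
        node_template_cmd("my-template", "c2-node-60-240", "us-central1"),
        node_group_cmd("my-group", "my-template", "us-central1-a", 1),
        vm_on_group_cmd("my-vm", "my-group", "us-central1-a"),
    ]
    for cmd in commands:
        print(shlex.join(cmd))
```

Printing the commands first gives you a chance to review them, which seems prudent when each node is billed in full from the moment it is created.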
Summary
Most people think that all servers are the same and that they’re a commodity.
I think Google Compute Engine proves that it’s still possible to innovate with servers. Google Compute Engine offers custom-built servers that are the same servers used to run all of Google’s services. They are secure, have differentiated features, and are managed by Site Reliability Engineers. Sole-tenant nodes take Compute Engine a step further and offer dedicated hardware, more control, and the potential to capitalize.
For organizations needing dedicated hardware, predictable performance, and enhanced control, STNs provide a compelling solution within Google Cloud’s portfolio.
Resources
- Wikipedia: Google Compute Engine
- Wikipedia: Live Migration
- Google shares data center security and design best practices
- Titan in depth: Security in plaintext
- Answering your questions about “Meltdown” and “Spectre”
- What Google Cloud, G Suite and Chrome customers need to know about the industry-wide CPU vulnerability
- Google: Site Reliability Engineering
- Sole-tenant Node Pricing
- Sole-tenancy accounting FAQ