Using any Windows Performance Metric in Google Cloud and Managed Instance Groups
Mike Francis, Cloud Transformation Architect, Google Cloud
Recently I was faced with a problem: how can we help a customer reduce the cost of running their Windows Terminal Services by using Google Cloud?
Reviewing the customer's current solution, I discovered it was quite a typical architecture. The customer used Windows Terminal Services for remote office workers who required access to traditional thick-client Windows applications. A VPN secured the connection to their data center, where a group of Windows Terminal Servers accepted connections from end users.
From an operational standpoint, there appeared to be a number of opportunities where they could reduce their operational costs. A few that stood out were:
- Eliminate the need for management of multiple Windows Servers.
- Automate the lifecycle management of the Windows Terminal Servers.
- Have the end users connect through some form of load balancing that assigns them to a Terminal Server, rather than each user connecting to a specific server. This would simplify client support.
From another cost perspective, the question surfaced: how could we use the elasticity and inherent automation of Google Cloud to provide Windows Terminal Services on demand, so the customer only paid for what they used?
Conceptually, I thought of the way Google Cloud can scale an application by combining Google Cloud Load Balancers with Managed Instance Groups, scaling out a set of Virtual Machines based on demand and scaling them back in when they are no longer required. The diagram below illustrates the conceptual model:
This concept is what led me to realize that the built-in ability to collect performance data from Windows machines and publish it into Google Cloud Operations is somewhat limited. Let me explain.
I wanted to follow the model I mentioned earlier, where the load balancer tracks the number of application requests and, when a metric threshold is breached, the Managed Instance Group adds Virtual Machines to support the higher load. When the load diminishes, it reduces the number of Virtual Machines.
So I thought: all I need to do is track how many connections are made to a Terminal Server, and when that count exceeds a threshold, the Managed Instance Group will add another Terminal Server. That sounded like something that could be done, and I could use a Google Cloud Load Balancer in front of the Terminal Servers to balance new sessions automatically based upon the least used backend server.
That was when I discovered that Google Cloud Operations only monitors a specific list of Windows metrics, and Terminal Server Active Sessions was not one of them. So then came the research into how I could make this happen. How could I monitor a Windows metric that is not captured by default? And if I could capture it, could I use it to trigger a Managed Instance Group to change scale?
I discovered that we support Custom Metrics in Google Cloud, and there are a couple of options:
- Use OpenCensus (now OpenTelemetry)
- Use the Cloud Monitoring API directly
I elected to go with the latter, as it seemed like the simplest path to get what I wanted at the time. In hindsight, I would probably dig into OpenTelemetry more; reviewing it again, it appears that it may be simpler than I had first thought.
Having decided to use the Cloud Monitoring API directly, I needed to use a supported language. As I ultimately wanted a Windows service, I chose C#. I had never developed anything in C#, and I thought ‘how hard could it be, I was quite good at C in 1994, surely I can pick this up’.
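To give a sense of what using the Cloud Monitoring API directly involves, here is a minimal sketch in Python (not the language of the tool described in this post) of the JSON body that the API's timeSeries.create method accepts for a single gauge reading. The project, instance ID and zone are placeholder values, and the helper name build_time_series_body is made up for illustration; the metric label is the one used later in this post.

```python
import json
from datetime import datetime, timezone


def build_time_series_body(metric_label, value, project_id, instance_id, zone, when=None):
    """Build the JSON body for one gauge point of a custom metric.

    The body is intended for:
    POST https://monitoring.googleapis.com/v3/projects/{project_id}/timeSeries
    Authentication and the HTTP call itself are left out of this sketch.
    """
    when = when or datetime.now(timezone.utc)
    end_time = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "timeSeries": [
            {
                # Custom metrics live under the custom.googleapis.com/ prefix.
                "metric": {"type": "custom.googleapis.com/" + metric_label},
                # Tie the reading to the VM that produced it.
                "resource": {
                    "type": "gce_instance",
                    "labels": {
                        "project_id": project_id,
                        "instance_id": instance_id,
                        "zone": zone,
                    },
                },
                # A gauge point only needs an end time and a value.
                "points": [
                    {
                        "interval": {"endTime": end_time},
                        "value": {"doubleValue": float(value)},
                    }
                ],
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    body = build_time_series_body(
        "TerminalServices/ActiveConnections", 3,
        "my-project", "1234567890123456789", "us-central1-a",
    )
    print(json.dumps(body, indent=2))
```

Attaching each reading to the gce_instance resource matters for this use case, because the autoscaler needs a per-VM value to work with.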
As the customer wished to see a proof of concept of the architecture in the short term, I created a C# application called GCPAnyMetric that used the Google Cloud .NET libraries to authenticate to Google Cloud Operations and push the metric data. The Windows metrics I want to collect are defined in the registry, along with the label under which each metric will appear in Google Cloud Operations. For the Windows Terminal Services Active Sessions metric, I mapped it to a Google Cloud custom metric label of TerminalServices/ActiveConnections as shown below.
GCPAnyMetric allows me to define any Windows metric together with a Google Cloud Operations custom metric label. You can see this configuration in the registry of my test Windows machine below.
The Windows metric in the screenshot is Processor\% User Time. As this counter has multiple instances, the ‘/’ denotes the instance, in this case _Total. The data element of the registry key is the Google Cloud Operations custom metric label.
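The naming convention described above can be sketched as a small parser. This is an illustration in Python, not the code GCPAnyMetric actually uses; it assumes the registry value name takes the form Category\Counter/Instance, with the instance part being optional. The names CounterSpec and parse_counter are hypothetical.

```python
from typing import NamedTuple, Optional


class CounterSpec(NamedTuple):
    category: str
    counter: str
    instance: Optional[str]


def parse_counter(name: str) -> CounterSpec:
    """Split 'Category\\Counter/Instance' into its parts.

    Assumption: the first backslash separates category from counter,
    and the last '/' (if any) introduces the counter instance.
    """
    category, sep, rest = name.partition("\\")
    if not sep or not category or not rest:
        raise ValueError(f"expected 'Category\\Counter[/Instance]', got {name!r}")
    counter, slash, instance = rest.rpartition("/")
    if not slash:
        # No '/' present: the counter has no instance.
        return CounterSpec(category, rest, None)
    return CounterSpec(category, counter, instance)


if __name__ == "__main__":
    print(parse_counter("Processor\\% User Time/_Total"))
    print(parse_counter("Terminal Services\\Active Sessions"))
```

One caveat with this simple split: some Windows counter names contain a ‘/’ themselves (for example, names ending in /sec), so a real implementation would need a way to tell those apart from an instance suffix.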
With the Terminal Services Active Sessions metric now being forwarded to Google Cloud Operations, I created a Terminal Server enabled Windows template on Google Compute Engine and configured a Managed Instance Group. This is shown below:
You can see in the screenshot above that I have defined my custom metric as the autoscaling metric. In this particular proof of concept I set the high-water mark for automated scaling to 0.5 (in other words, a single session would breach the threshold), as I wanted to illustrate the autoscaling.
With my template defined, GCPAnyMetric installed and configured to forward the Terminal Services Active Sessions metric, and a Managed Instance Group defined with autoscaling, the solution was almost complete.
I then added a Google Cloud TCP Load Balancer that listened on port 3389 and forwarded to my backend Managed Instance Group. The configuration of the Load Balancer is shown below:
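The effect of a 0.5 target can be checked with some rough arithmetic. The sketch below is a simplified model of per-instance target autoscaling, assuming the autoscaler aims to keep the average metric value per VM at or below the target. The real autoscaler also applies stabilization and initialization periods that are not modelled here, and the function name and the minimum and maximum sizes are illustrative.

```python
import math


def recommended_size(sessions_per_vm, target_per_vm, min_size=1, max_size=10):
    """Estimate how many VMs keep the average sessions per VM at or below the target.

    Simplified model: total load divided by the per-VM target, rounded up,
    then clamped to the group's minimum and maximum size.
    """
    total = sum(sessions_per_vm)
    wanted = math.ceil(total / target_per_vm)
    return max(min_size, min(max_size, wanted))


if __name__ == "__main__":
    # One VM, no sessions: the group stays at its minimum size.
    print(recommended_size([0], 0.5))
    # One VM with a single session: the 0.5 target is breached, so the group grows.
    print(recommended_size([1], 0.5))
    # Two VMs with one session each.
    print(recommended_size([1, 1], 0.5))
```

Under this model a target of 0.5 asks for roughly two VMs per active session, which is useful for demonstrating scale-out but would be set much higher for real workloads.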
The Outcomes
The proof of concept was able to demonstrate the following:
- Automated scaling of Windows Terminal Services based upon load.
- Single front end destination IP to simplify the client configuration.
- Single Windows Terminal Server template that can be updated, patched and generally managed in one place, with the Managed Instance Group rolling it out as needed.
- Use of any Windows metric to automatically scale a Managed Instance Group.
Business Outcomes
Using this solution the customer could achieve:
- Reduced capital costs by eliminating physical servers.
- Reduced Windows licensing costs using Google Compute Engine on-demand licensing.
- Reduced operational effort by leveraging the automated upgrade feature of Managed Instance Groups.
GCPAnyMetric
I have made GCPAnyMetric available on GitHub; feel free to download and improve it. You can find GCPAnyMetric here