Network Performance Diagnostics with GCP ‘Performance Dashboard’

Gauravmadan
Google Cloud - Community
8 min read · Apr 21, 2022

The network is the backbone of every deployment in the public cloud, so it is no surprise that network performance is pivotal for every organization running workloads there. Network performance management is top of mind because it protects the end-user experience and helps avoid financial losses from suboptimal network performance or downtime. Because network health is table stakes, network operators often rely on multiple tools in their daily workflows to monitor how the network is performing. As a result, operators often spend a lot of money on procuring network performance tools and training staff to use them, in order to avoid paying penalties later. Seen from a different perspective, performance tools are also a way to prove the network’s innocence when an issue arises somewhere in the customer’s infrastructure or application environment. A typical network operator faces the following questions while trying to address network performance issues.

Figure 1: Typical network performance dilemmas

Before we go deeper, let’s define what network performance means in this context. Network performance monitoring is the process of collecting and analyzing data about the network traffic that flows across your environment. Its main purpose is to retrieve network-level telemetry so that you can take meaningful action to maintain network health. The two parameters that matter most in cloud network performance monitoring are latency and packet loss, and both are covered in this blog.

Simply put, “latency” is the time taken to successfully transfer a packet of data from one point to another within a network. “Packet loss” refers to the packets that were sent from one point in a network but never reached their destination. It is a useful measure of network performance because the lost packets are expressed as a percentage of the total number of packets sent.
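To make the two definitions concrete, here is a minimal Python sketch. The function names and the sample numbers are hypothetical and are not taken from the dashboard; the sketch only illustrates how a loss percentage and a median round-trip time could be derived from raw counts and samples.

```python
from statistics import median


def packet_loss_percent(sent: int, received: int) -> float:
    """Lost packets expressed as a percentage of packets sent."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= received <= sent:
        raise ValueError("received must be between 0 and sent")
    return (sent - received) / sent * 100.0


def median_latency_ms(rtt_samples_ms: list[float]) -> float:
    """Median of a list of round-trip-time samples, in milliseconds."""
    if not rtt_samples_ms:
        raise ValueError("need at least one RTT sample")
    return median(rtt_samples_ms)


# Hypothetical sample values, for illustration only.
loss = packet_loss_percent(sent=1000, received=990)
rtt = median_latency_ms([120.0, 118.5, 131.2, 119.9, 122.4])
print(f"packet loss: {loss:.2f}%")
print(f"median RTT: {rtt:.1f} ms")
```

The median is used for latency here because a single slow sample (131.2 ms in the invented data) does not skew it the way it would skew an average.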

In Google Cloud Platform (GCP), the module under Network Intelligence Center that is designed specifically to provide visibility into these performance parameters is called ‘Performance Dashboard’.

At this moment, the Performance Dashboard covers only “intra-GCP” performance parameters: it supports informed decisions based on packet loss and latency metrics for intra-zone, inter-zone, and inter-region traffic within Google Cloud.

Figure 2: Current modules available in GCP ‘Network Intelligence Center’

At a very high level, users of the Performance Dashboard can view the information in one of two broad buckets:

  1. Per-project dashboard

The project dashboard shows packet loss and latency metrics only for the zones where your project has VM instances.

  2. Global dashboard

The global dashboard shows zone-to-zone packet loss and latency metrics across all of Google Cloud.

Figure 3: Options under GCP ‘Performance Dashboard’

So, how can each dashboard help you during your network troubleshooting journey?

Project specific dashboard

Customers can select up to 5 regions where their workloads are deployed. Once the regions are selected, the per-project dashboard lets the customer visualize and understand:

  1. Summary of packet loss [6 weeks of historical data available]
  2. Average packet loss between region pairs of the selected regions [current statistics]
  3. Average packet loss between zone pairs of the selected regions [current statistics]
  4. Summary of latency [6 weeks of historical data available]
  5. Median latency between region pairs of the selected regions [current statistics]
  6. Median latency between zone pairs of the selected regions [current statistics]

Global Dashboard

Customers can select any number of GCP regions. Once the regions are selected, the dashboard lets the customer visualize and understand:

  1. Summary of packet loss [6 weeks of historical data available]. This view can show up to 50 zone pairs with VM-to-VM packet loss across all of Google Cloud.
  2. Average packet loss between zone pairs [current statistics]
  3. Summary of latency [6 weeks of historical data available]. This view can show up to 50 zone pairs with VM-to-VM round-trip time (RTT) across all of Google Cloud.
  4. Median latency between zone pairs [current statistics]

Now that we know what these dashboards can tell us, let’s look at where to use these two views and how they help in troubleshooting GCP network issues.

Use case #1: Isolate an application issue from a network issue

Let’s assume that the network operations center is troubleshooting an issue reported as follows:

“Application slowness has been experienced since 10 AM in communication between VM1 and VM2. VM1 is in us-central1 and VM2 is in asia-east1.”

Here is a methodical approach to handling this issue:

Figure 4: Approach to troubleshooting reported slowness issues

At this stage of troubleshooting, we have established whether further troubleshooting is needed at the application level or at the network level. If it is needed at the application level, we save hours of network engineers’ time from this point on, because we have concluded that the network was clean at the time the issue was reported. The following evidence can be shared with the application team to show the troubleshooting done on the network side:

Figure 5: No packet loss reported in the project at 10 AM
Figure 6: No abnormal spike in latency reported around 10 AM
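The “was there a spike around 10 AM?” check can also be expressed as a small script once the readings have been noted down from the charts. The sketch below is only an illustration: `has_spike`, its `window`, `factor` and `floor` defaults, and all sample values are hypothetical assumptions, not something the dashboard provides.

```python
from datetime import datetime, timedelta
from statistics import median

Sample = tuple[datetime, float]  # (timestamp, metric value)


def has_spike(samples: list[Sample], incident: datetime,
              window: timedelta = timedelta(minutes=30),
              factor: float = 2.0, floor: float = 0.0) -> bool:
    """True if a reading near the incident clearly exceeds the baseline.

    A reading counts as a spike when it is above both `floor` and
    `factor` times the median of the readings outside the window.
    All defaults are arbitrary assumptions for illustration.
    """
    near = [v for t, v in samples if abs(t - incident) <= window]
    baseline = [v for t, v in samples if abs(t - incident) > window]
    if not near or not baseline:
        raise ValueError("need readings both inside and outside the window")
    return max(near) > max(factor * median(baseline), floor)


# Invented latency readings (ms), one every 15 minutes from 09:00 to 11:00.
start = datetime(2022, 4, 1, 9, 0)
latency_ms = [185.0, 186.2, 184.9, 185.5, 186.0, 185.1, 184.7, 185.8, 185.3]
samples = [(start + timedelta(minutes=15 * i), v)
           for i, v in enumerate(latency_ms)]
incident = datetime(2022, 4, 1, 10, 0)

if has_spike(samples, incident):
    print("spike near the incident: keep investigating the network")
else:
    print("no spike near the incident: hand over to the application team")
```

For packet loss, where the baseline is often zero, a non-zero `floor` (for example 0.1) keeps negligible loss from being flagged.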

Use case #2: Isolate a project-specific network issue vs. a Google Cloud network issue

Let me take another example. Assume that the network operations center is troubleshooting an issue reported as follows:

“The application has experienced unexpected packet loss since 10 AM in communication between VM1 and VM2. VM1 is in northamerica-northeast2 and VM2 is in australia-southeast1.”

The engineer attending to this issue followed the procedure described above and saw an unusual spike in packet loss between the regions of VM1 and VM2 around the time the incident was reported. This is a good point to dig deeper into what in the network could have caused it.

As the troubleshooting continues, one obvious question comes to mind: “Was it only my project, or did other projects see a similar spike?”

Below is a methodical approach to determine whether the problem was specific to the customer’s project or whether multiple projects were impacted by an issue in the Google Cloud network.

Figure 7: Approach to troubleshooting a project-specific vs. global Google network issue

If the comparison shows an abnormal spike in latency or packet loss at a given point in time for both Google Cloud as a whole and the specific customer project, we can say with confidence that the reported issue was not caused by a network abnormality in the customer’s project. Such an abnormal increase in latency or packet loss is expected to be very short-lived, and the system should recover from it automatically.

Figure 8: Performance for Google Cloud: spike in packet loss between 2 selected regions
Figure 9: Packet loss for Google Cloud between zones of selected regions

As the Google Cloud view of the Performance Dashboard shows, there was a spike in packet loss for communication between the two regions. Since more than 1% packet loss is reported for the whole of Google Cloud between these two regions, it is reasonable to conclude that all projects serving traffic between VMs in these regions would have been impacted. This is the right point to stop further network troubleshooting. Since this was an intermittent issue in the GCP network that resolved itself shortly afterwards, all projects that saw the packet loss should also have recovered automatically.

The charts above should be sufficient to include when submitting a root cause analysis (RCA) report for the issue.
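The project-versus-global comparison in this use case boils down to a small decision table, sketched below in Python. The function name, the returned labels, and the default 1% threshold (borrowed from the figure discussed above) are illustrative assumptions; the input values are invented.

```python
def classify_incident(project_loss_pct: float, global_loss_pct: float,
                      threshold_pct: float = 1.0) -> str:
    """Rough triage from packet loss (in percent) read off the two views.

    threshold_pct and the returned labels are assumptions for illustration.
    """
    project_hit = project_loss_pct > threshold_pct
    global_hit = global_loss_pct > threshold_pct
    if project_hit and global_hit:
        return "google-cloud-network"      # spike visible beyond this project
    if project_hit:
        return "project-specific"          # only this project saw the spike
    if global_hit:
        return "global-spike-project-ok"   # global spike, project unaffected
    return "no-anomaly"


# Hypothetical readings taken at the time of the incident.
print(classify_incident(project_loss_pct=1.8, global_loss_pct=1.4))
print(classify_incident(project_loss_pct=1.8, global_loss_pct=0.0))
```

Only the first outcome supports closing the network investigation on the customer side; a "project-specific" result means the troubleshooting should continue within the project.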

Use case #3: Day 0 planning of workload placement

A customer needs to deploy a microservice-based application in GCP in which micro-service1 must communicate with micro-service2 within 150 ms. The customer has decided to host micro-service1 in asia-south1 and asks the network team whether micro-service2 can be hosted in Europe.

Figure 10: 6-week latency trend between GCP regions

Again, as in the examples above, the global view of the Performance Dashboard provides up to 6 weeks of historical trends for latency and packet loss between selected GCP region pairs.

By looking at the last 6 weeks of latency numbers, we can eliminate the European regions that have consistently taken more than 150 ms to communicate with the asia-south1 region and are therefore not a fit for our application deployment.

Once we have done that, the next step is to look at zone pairs, going one level deeper to choose the region and zone more accurately.

Figure 11: Zone-pair latency numbers

After spending some more time on the zone pairs and their reported latencies, we see that communication between europe-west1-b and asia-south1-a is consistently below 150 ms. This region and zone can therefore be suitable for our application deployment.

Figure 12: 6-week latency trend between selected zones
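The elimination step in this use case can be sketched in a few lines of Python, assuming the latency history has been copied out of the dashboard by hand. `pairs_within_budget` is a hypothetical helper, and the latency values below are invented for illustration; they are not real measurements.

```python
ZonePair = tuple[str, str]


def pairs_within_budget(history_ms: dict[ZonePair, list[float]],
                        budget_ms: float) -> list[ZonePair]:
    """Return zone pairs whose every recorded latency is below budget_ms."""
    return sorted(pair for pair, values in history_ms.items()
                  if values and max(values) < budget_ms)


# Invented weekly median latencies (ms) over 6 weeks; not real measurements.
history = {
    ("asia-south1-a", "europe-west1-b"): [138.0, 139.5, 137.8, 140.2, 138.9, 139.1],
    ("asia-south1-a", "europe-north1-a"): [149.0, 152.3, 150.8, 151.1, 149.7, 153.0],
}
for src, dst in pairs_within_budget(history, budget_ms=150.0):
    print(f"{src} -> {dst} stays under budget")
```

Requiring every reading to stay under the budget, rather than only the average, mirrors the “consistently below 150 ms” criterion used above.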

Closing Notes

Network performance monitoring is one of the critical functions that any network operations team must perform in order to support the business and honor agreed SLAs. The network team therefore needs a tool that provides real-time and historical insights into network performance in a single pane of glass. The GCP Performance Dashboard offers great visibility into the performance of the entire Google Cloud network as well as the performance of the customer project’s resources. This not only helps reduce MTTR for network issues but also helps with day 0 planning and decisions such as where to place an application. Hence the GCP Performance Dashboard should be a must-have tool for GCP customers and for network operators managing a GCP network environment.

Disclaimer: This is to inform readers that the views, thoughts, and opinions expressed in the text belong solely to the author, and not necessarily to the author’s employer, organization, committee or other group or individual.
