Tackling Latency Issues on Google Cloud Platform’s Cloud Interconnect

Bernie Ongewe
5 min read · Aug 23, 2024


Are you experiencing sluggish network performance or unexpected delays when using Google Cloud Interconnect? If so, you’re not alone. Latency issues can significantly impact the speed, reliability, and overall user experience of your cloud-based applications. In this article, we provide practical troubleshooting steps to help you identify common issues.

Cloud Interconnect is a high-performance network service that enables private, low-latency connectivity between your on-premises data center and Google Cloud Platform. While it offers numerous benefits, including increased bandwidth, security, and control, it’s not immune to latency problems.

Cloud Interconnect has become a particularly vital tool for enterprises leveraging products such as Vertex AI to retrieve and train on their data, wherever it resides. By establishing a dedicated, high-bandwidth connection between their on-premises data centers or other cloud environments and GCP, users can efficiently retrieve large datasets stored in traditional data warehouses, databases, or cloud storage services. This direct connectivity eliminates the need for data transfer over the public internet, reducing latency and improving performance for Vertex AI workloads that require high-throughput, low-latency data access.

In this post we suggest steps to confirm transport latency when your users complain about a degraded experience, after you have confirmed that neither client nor API latency contributes to the problem.

Familiarity with the following is assumed:

  • Cloud Interconnect: Provides low-latency, high-availability connections that enable you to reliably transfer data between your VPC networks and your other networks
  • Wireshark: A free and open-source protocol analyzer
  • Iperf: A tool for network performance measurement and tuning

Latency from your VPC to other networks

In this section we discuss how to identify issues by capturing traffic from a host in your VPC negotiating with a host on another network over Cloud Interconnect.

You can capture the traffic on Linux clients using tcpdump with the filter below:

tcpdump -nvvvi {interface} host {ip-address} and port {port} -w {output-file}

Slow Throughput Due To Packet Reordering

To identify these in Wireshark, select a packet from the sending direction and navigate through Statistics -> TCP Stream Graphs -> Time Sequence (Stevens).

Slow-start behavior may manifest as occasional plateaus in the sequence-number-vs-time gradient. A tcp.options.sack filter may show an accumulation of TCP SACKs around this period.
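If you prefer the command line, tshark (installed alongside Wireshark) can count the suspect frames directly. This is a sketch assuming your capture file is named capture.pcap:

```shell
# Count frames Wireshark's analysis engine flags as out-of-order
tshark -r capture.pcap -Y "tcp.analysis.out_of_order" | wc -l

# Count frames carrying TCP SACK blocks
tshark -r capture.pcap -Y "tcp.options.sack_le" | wc -l
```

A rising SACK count clustered around the plateaus in the Stevens graph supports the reordering hypothesis.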

Slow Throughput Due To Insufficient TCP Window

In general:

Throughput <= TCPWindowSize / RoundTripTime

If the throttling is perceived over the Interconnect, you can experiment with increasing either the TCP receive or send window.
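Plugging illustrative numbers into the formula above shows how quickly a small window caps throughput on a high-RTT path (the window size and RTT below are assumptions, not measurements):

```shell
#!/bin/sh
window_bytes=65536   # assumed 64 KiB TCP window
rtt_ms=20            # assumed 20 ms round-trip time

# Throughput ceiling in bits/s = window_bytes * 8 / (rtt_ms / 1000)
max_bps=$(( window_bytes * 8 * 1000 / rtt_ms ))
echo "Throughput ceiling: ${max_bps} bit/s (~$(( max_bps / 1000000 )) Mbit/s)"
```

At 20 ms, a 64 KiB window caps a single TCP stream at roughly 26 Mbit/s, far below what an Interconnect attachment can carry.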

On Linux hosts, the receive buffer size is influenced by these kernel settings:

net.core.rmem_default
net.core.rmem_max

As an example, you can query and set the value for net.core.rmem_default respectively with:

sudo sysctl net.core.rmem_default
sudo sysctl -w net.core.rmem_default={new_value}

The send buffer size is influenced by these settings:

net.core.wmem_default
net.core.wmem_max
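One way to pick a sensible value is to size the buffers to the bandwidth-delay product (BDP). The sketch below uses assumed example numbers (a 1 Gbit/s target and a 30 ms RTT) and only prints the sysctl commands rather than applying them:

```shell
#!/bin/sh
target_mbps=1000   # assumed target throughput over the Interconnect
rtt_ms=30          # assumed round-trip time in milliseconds

# BDP in bytes = (target_mbps * 10^6 / 8 bits) * (rtt_ms / 1000 s)
#              = target_mbps * 125 * rtt_ms
bdp_bytes=$(( target_mbps * 125 * rtt_ms ))

echo "sudo sysctl -w net.core.rmem_max=${bdp_bytes}"
echo "sudo sysctl -w net.core.wmem_max=${bdp_bytes}"
```

Here a 1 Gbit/s path with 30 ms of RTT needs roughly 3.75 MB of buffer to stay full.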

Latency from other networks to Google APIs over Interconnect

Often it’s easier to take a packet capture from a client on the remote side of the Interconnect to Google’s API.

For latency issues we need to see the time between frames, which is difficult with the default view in Wireshark. As such, you must first add a `Delta Time` column under Wireshark -> Preferences -> Appearance -> Columns.

If the capture device is multi-NIC (e.g., an LACP switch), a discrepancy between the timestamp at the NIC and when the packet is written to file may cause apparently negative time deltas between packets. This may also hint that packet reordering is an issue, in which case you may increase the TCP window size as discussed in the section above.

If packet reordering is not a concern, you can fix the packet order in the capture file with reordercap, which is installed with Wireshark. The usage is as follows:

reordercap $input_file $output_file

Analysis: Inspect wide time gaps between frames

  • Sort by ‘Delta Time’ to find the largest gap
  • Select that, then sort by ‘time’ to inspect surrounding frames
  • In the screenshot below, we see a PSH ACK from the client with the same sequence number in frame #2857.
  • Here, it seems the client was expecting a response from upstream
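The same triage can be scripted with tshark: dump each frame's inter-frame delta and sort to surface the largest gaps. A sketch, again assuming a file named capture.pcap:

```shell
# Print frame number and inter-frame delta, largest gaps first
tshark -r capture.pcap -T fields -e frame.number -e frame.time_delta \
  | sort -k2 -rn | head
```

The frame numbers it prints are the ones worth inspecting in the Wireshark GUI.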

Analysis: TCP sequence number trend

  • The ‘Statistics’ -> ‘TCP Stream Graphs’ -> ‘Time Sequence (Stevens)’ graph shows other plateaus
  • No progression in TCP stream for long periods

Analysis: Next Steps?

  • So far we see evidence of packet drops/delays upstream from the client
  • This may be an opportunity to open a ticket with the data center provider to investigate potential issues with the peer fabric on the Interconnect.

A note on performance measurements with Iperf vs. Netperf

As indicated in the documentation for per-instance maximum egress bandwidth, maximum throughput is partially dependent on the number of CPUs per instance. This is an important consideration when selecting a tool to measure throughput.

Many network administrators use Iperf due to its ease of use and familiarity. However, the documentation reminds us that it “will only be bound to a single CPU”. This is true even if you attempt to run multiple instances of Iperf and indicate different CPUs with the -A flag. After the first Iperf process, subsequent attempts will be blocked with an error similar to the one below:

$ iperf3 -c 10.150.0.12 -t 3600
iperf3: error - the server is busy running a test. try again later

If you try to parallelize the throughput test with the -P option, all streams are still bound to the same CPU. As such, on hosts with multiple CPUs, Iperf will likely not give you an accurate view of available bandwidth.
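A common workaround is to run several independent iperf3 server processes, each listening on its own port, so each client/server pair can land on a different CPU. A sketch, reusing the 10.150.0.12 server from the example above; the port numbers are arbitrary:

```shell
# On the server: start one daemonized iperf3 listener per port
for port in 5201 5202 5203 5204; do
  iperf3 -s -D -p "$port"
done

# On the client: run one test per port in parallel, then wait for all
for port in 5201 5202 5203 5204; do
  iperf3 -c 10.150.0.12 -p "$port" -t 30 &
done
wait
```

Summing the per-port results gives a better estimate of the aggregate bandwidth available than a single CPU-bound stream.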

Netperf is an alternative to Iperf that can exercise multiple CPUs. Please see the manual for installation and usage instructions.


Bernie Ongewe

Passionate technologist helping organizations integrate production workloads in the cloud and on premises. Personal views, not my employer's.