So you experienced a large network outage, great... The networking brain trust took a deep dive into the log files and quickly realized that all they had to work with were generic connection-timeout log messages. Anything beyond that was guesswork. It’s obvious something went wrong, but without detailed host-to-host TCP/IP traces, the usual outcome of such events is a frustrating “let’s add detailed network monitoring and wait for it to happen again”.
The above scenario highlights a general weakness in the ability to troubleshoot network issues. Network traffic events are fleeting and leave few traces. There is a good reason for this: if all the network metrics in a typical Linux server were logged, the volume of data would be overwhelming. However, remote monitoring systems such as Stackdriver (SD) can help by collecting, organizing and aggregating vast amounts of metrics data. SD also provides charts that make outliers easy to spot, and alerts that warn you early.
The SD service uses an agent installed on the monitored VMs, collecting and transmitting vital metrics to the SD servers running on Google Cloud. The agent is a customized version of the widely used collectd daemon. However, not all the standard collectd plugins are included in the SD agent default installation.
In this post we will cover the process of building the collectd protocols plugin and getting it to work with our SD agent.
The protocols plugin captures network statistics from the host and turns them into SD metrics.
We will obtain the SD agent code, build the plugin, add the plugin to the SD agent configuration and observe the resulting network metrics emitted from the plugin. (Note that as of SD agent version 5.5.2-379 the protocols plugin binaries are included in the install package. If you’d like to skip the plugin build process, jump to the “Configure the plugin” section.)
Let’s get started:
Launch a Debian 8 VM on Google Cloud:
Navigate to console.cloud.google.com (sign up for the $300 free credits if you haven’t already) and click on the Cloud Shell button. (BTW, using Cloud Shell is great because it’s preconfigured with your identity, project and the latest copy of the SDK and gcloud CLI. If you prefer, you can use the Cloud Console GUI as well.)
Enter the following command (adjust the Service Account name)
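The exact command is not reproduced in this copy of the post. As a rough sketch only, a VM creation call might look like the helper below. The function name, machine type and scope are assumptions, and the instance name, zone and service account address are placeholders you would replace with your own values.

```shell
# Hypothetical helper: create a Debian 8 VM that the SD agent can report from.
# The machine type and scope are assumptions, not values from the original post.
create_monitoring_vm() {
  local name="$1"
  local zone="$2"
  local service_account="$3"
  gcloud compute instances create "$name" \
    --zone "$zone" \
    --machine-type n1-standard-1 \
    --image-family debian-8 \
    --image-project debian-cloud \
    --service-account "$service_account" \
    --scopes monitoring-write
}
```

For example: create_monitoring_vm sd-plugin-build us-central1-a my-sa@my-project.iam.gserviceaccount.com (all three arguments are made-up values).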
Next, SSH into the newly created VM and install the required components:
sudo apt-get update -y && sudo apt-get upgrade -y
sudo apt-get install -y build-essential git flex bison pkg-config automake libtool-bin
The protocols plugin build depends on the SD agent source. We will proceed by checking out the SD agent code:
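The checkout command is not reproduced in this copy of the post. A minimal sketch, assuming the agent source lives in Stackdriver's public collectd fork on GitHub; the helper name and the default directory name are illustrative. You may also want to check out the branch or tag that matches your installed agent version, which is not shown here.

```shell
# Hypothetical helper: clone the SD agent source (assumed to be Stackdriver's collectd fork).
fetch_agent_source() {
  local dest="${1:-collectd}"
  git clone https://github.com/Stackdriver/collectd.git "$dest"
}
```

After running fetch_agent_source, change into the new directory (cd collectd) before building.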
Prepare and build the agent code which includes our desired plugin.
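The build commands are not reproduced in this copy of the post. The sketch below assumes the agent source follows the usual collectd build flow (build.sh to generate the autotools files, then configure and make); the helper name is illustrative and no configure flags from the original are implied.

```shell
# Hypothetical helper: run the standard collectd build steps inside the source tree.
build_agent_source() {
  local src_dir="${1:-.}"
  (
    # Work in a subshell so the caller's working directory is left unchanged.
    cd "$src_dir" || exit 1
    ./build.sh &&    # generate the configure script (needs automake, libtool, flex, bison)
    ./configure &&   # detect which plugins can be built on this host
    make             # build the daemon and all enabled plugins
  )
}
```

Run it from inside the checked-out source directory with build_agent_source . so that the relative path used in the next step resolves.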
We now have a built version of the protocols plugin that is compatible with the SD agent. You can confirm that with the following command:
ls -la ./src/.libs/protocols.so
Let’s install the agent on the same host to confirm the plugin works correctly. These commands are from the SD install guide:
curl -sSO https://dl.google.com/cloudagents/install-monitoring-agent.sh
sudo bash install-monitoring-agent.sh
We can now copy the plugin binary into the agent directory:
sudo cp ./src/.libs/protocols.so /opt/stackdriver/collectd/lib/x86_64-linux-gnu/collectd/
Configure the plugin:
Create a plugin configuration file that will auto-load when the agent starts:
sudo nano /opt/stackdriver/collectd/etc/collectd.d/protocols.conf
And place the following text into the file:
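The original file contents are not reproduced in this copy of the post. The sketch below is a stand-in, not the author's file: the Value patterns, the rule name, the metric type custom.googleapis.com/protocols/counter and the two label names are all assumptions chosen to illustrate the structure (a “Plugin” block with “Value” filters, plus a “PreCache” chain that sets “stackdriver_metric_type”). The helper only prints the text, so it can be reviewed before being written anywhere.

```shell
# Hypothetical helper: print a candidate protocols.conf.
# Patterns, metric type and label names below are assumptions, not the original values.
protocols_conf() {
  cat <<'EOF'
LoadPlugin protocols
<Plugin "protocols">
  # Each pattern is matched against "Protocol:ValueName", e.g. "TcpExt:TCPTimeouts"
  Value "/Timeout/"
  Value "/Retrans/"
  Value "/ListenOverflows/"
  Value "/ListenDrops/"
  IgnoreSelected false
</Plugin>

LoadPlugin match_regex
LoadPlugin target_set
PreCacheChain "PreCache"
<Chain "PreCache">
  <Rule "protocols_to_custom_metric">
    <Match regex>
      Plugin "^protocols$"
    </Match>
    <Target "set">
      MetaData "stackdriver_metric_type" "custom.googleapis.com/protocols/counter"
      MetaData "label:protocol" "%{plugin_instance}"
      MetaData "label:counter" "%{type_instance}"
    </Target>
  </Rule>
</Chain>
EOF
}
```

To write it in one step instead of pasting into an editor: protocols_conf | sudo tee /opt/stackdriver/collectd/etc/collectd.d/protocols.conf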
Note the various “Value” regex statements inside the “Plugin” tag. These configure the protocols plugin to collect only metrics whose names contain a select set of keywords indicative of a TCP/IP problem. With “IgnoreSelected false”, only the matching metrics are collected and all non-matching ones are ignored (there are many others; see the /proc/net/netstat file).
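To preview which counters a given keyword pattern would pick up on a host, the following sketch reads /proc/net/netstat directly and prints each matching counter in the same “Protocol:ValueName” form the plugin matches against. The function name is illustrative, and awk's regex engine is only an approximation of the one collectd uses, so treat the result as a rough preview.

```shell
# Hypothetical helper: list counters whose "Protocol:ValueName" key matches a pattern.
# Usage: list_counters PATTERN [FILE]   (FILE defaults to /proc/net/netstat)
list_counters() {
  local pattern="$1"
  local file="${2:-/proc/net/netstat}"
  awk -v pat="$pattern" '
    # Lines come in pairs: a header line with counter names, then a line with values.
    NR % 2 == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }
    {
      for (i = 2; i <= NF; i++) {
        key = $1 name[i]            # e.g. "TcpExt:" followed by "TCPTimeouts"
        if (key ~ pat) print key, $i
      }
    }
  ' "$file"
}
```

For example, list_counters 'Timeout|Retrans' shows the current values of the timeout and retransmission counters; passing /proc/net/snmp as the second argument inspects the basic Tcp counters as well.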
The “PreCache” section adds a “stackdriver_metric_type” MetaData tag. This tag tells the SD agent to send these values to our project as custom metrics; without it, the SD agent will simply ignore them. If you’re interested in reading more about this syntax, the SD agent custom metric documentation page has all the details on how to configure custom metrics.
Now restart the agent:
sudo service stackdriver-agent restart
And confirm the plugin was successfully loaded:
sudo journalctl -r -u stackdriver-agent.service | grep "plugin_load: plugin \"protocols\""
At this point all that remains to do is distribute the plugin, configure the remaining monitored VMs and observe the resulting data.
As this is a bespoke version of the plugin, it’s recommended to keep it in a safe place (GCS is a good option) and incorporate the installation into a startup script.
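A startup-script fragment for this could look roughly like the sketch below. It assumes the plugin binary and its configuration file were uploaded to a GCS bucket under the object names protocols.so and protocols.conf, that the SD agent is already installed on the VM, and that the script runs as root (as startup scripts do). The function name and the bucket name are placeholders.

```shell
# Hypothetical startup-script fragment: pull the plugin and its config from GCS.
# Bucket and object names are placeholders; the target paths match the ones used above.
install_protocols_plugin() {
  local bucket="$1"
  local lib_dir="/opt/stackdriver/collectd/lib/x86_64-linux-gnu/collectd"
  local conf_dir="/opt/stackdriver/collectd/etc/collectd.d"
  gsutil cp "gs://$bucket/protocols.so" "$lib_dir/protocols.so" &&
  gsutil cp "gs://$bucket/protocols.conf" "$conf_dir/protocols.conf" &&
  service stackdriver-agent restart   # reload the agent so it picks up the plugin
}
```

Calling install_protocols_plugin my-plugin-bucket at the end of the startup script would then bring each new VM in line with the one configured by hand.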
Back in Google Cloud Console, head to the Monitoring section.
Once inside the SD console, click “Dashboards” then “Create Dashboard”. Give the dashboard a name and click “Add Chart”. Start typing “protocols” and the newly added custom metric will appear. (Another quick way to verify that the networking stats have made it through is Metrics Explorer; give that a try.)
A nice feature in SD charts is the ability to focus on a specific subset of the collected data and filter out the rest. In this case we want to focus only on TCP timeouts, so we add a regex filter that displays all metrics with “timeout” in their name.
And so, the agents are now configured with the protocols plugin, capturing vital network statistics. The data will be retained in Stackdriver for 6 weeks, arming us with historical TCP/IP connection data. We can now travel back in time to the point where “something went wrong with the network” and do a deep inspection of what failed and how.
As an example, in the following case, 3 VMs will cause network congestion on a 4th VM to a point where packets are lost and the server cannot accept new connections. We will use iperf3 for this simulation:
On the server VM, launch multiple servers, one per port from 5201 to 5204: iperf3 -s -p PORT
On the client VMs: iperf3 -c 4th-VM-IP -t 28800 -P 128 -p PORT (where PORT is one of 5201 to 5204)
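Spelled out as a sketch, the simulation could be driven with two small helpers. The function names are illustrative, and pairing each client with a single server port is an assumption (an iperf3 server process handles one test at a time, which is why several servers are started).

```shell
# Hypothetical helper for the server VM: start one iperf3 daemon per port.
start_iperf_servers() {
  local port
  for port in 5201 5202 5203 5204; do
    iperf3 -s -p "$port" -D    # -D runs the server in the background as a daemon
  done
}

# Hypothetical helper for a client VM: 128 parallel streams for 8 hours (28800 s).
run_iperf_client() {
  local server_ip="$1"
  local port="$2"
  iperf3 -c "$server_ip" -p "$port" -t 28800 -P 128
}
```

On the 4th VM you would call start_iperf_servers once; on each client, something like run_iperf_client 4th-VM-IP 5201, using a different port per client.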
To make it more interesting, intense PD (Persistent Disk, a network-based disk service) activity was running alongside the iperf3 servers on that 4th VM.
The resulting chart shows that the three iperf3 clients were having TCP failures (three shallow humps), but it also clearly shows that the bankruptcy of the 4th VM’s network (the huge hump) is the root cause.
In conclusion: including detailed network monitoring for any network-bound service will benefit your future self when it’s time to analyze a network outage. Stackdriver provides you with the option to collect a wide variety of TCP/IP metrics, graph them in an intuitive way and gain insight into prior network events.