Several months ago I was working for a client who had deployed all of their applications as a series of Docker containers on ECS from AWS. When I started contracting for them, they didn’t really have any insight into how their software was running, other than the occasional “finger in the air” metrics taken by looking at a given log file every now and then. I was hired to make noticeable performance improvements to their service, and to assist them with the design and architecture of several future projects.
Straight away, I was absolutely certain that we needed reliable long-term logging and metrics from the existing systems, to understand how they were performing and operating. Having logs and metrics is critical to being able to easily onboard a new team member onto a project and have them contribute meaningful improvements to performance, security and reliability straight away.
After an initial conversation with the CTO of the company, to whom I reported, I received approval to implement metrics collection and improved logging. For logging, it was just a matter of making sure the basics were covered: having standard Apache-style HTTP request logs, and making the logs machine-parseable, such that we could ingest them into a tool such as Kibana / the ELK stack.
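To make “machine-parseable” concrete, here is a minimal sketch of emitting one JSON object per HTTP request. The field names and the `format_request_log` helper are my own illustrative assumptions, not what the client’s services actually logged; the point is only that each line can be parsed without a custom regex.

```python
import json
from datetime import datetime, timezone


def format_request_log(method, path, status, duration_ms, remote_addr, now=None):
    """Return a single JSON log line describing one HTTP request.

    The field names are hypothetical; pick whatever your log pipeline expects.
    """
    now = now or datetime.now(timezone.utc)
    record = {
        "timestamp": now.isoformat(),
        "remote_addr": remote_addr,
        "method": method,
        "path": path,
        "status": status,
        "duration_ms": round(duration_ms, 2),
    }
    # sort_keys keeps the output stable, which makes diffs and tests easier.
    return json.dumps(record, sort_keys=True)
```

Each line written this way can be shipped as-is to a log collector and indexed by field, rather than being re-parsed from free text.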
We had to use push-based metrics, via statsd, as their containers were running inside of ECS, without any form of service discovery, besides having the containers registered to ELBs that handled the incoming HTTP traffic. That is, we had no way to discover all the individual containers and make direct HTTP calls against them, which is what a pull-based metrics system such as Prometheus requires.
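As a rough illustration of what “push-based” means here, the sketch below formats a metric in the plain statsd line format (`name:value|type`) and sends it over UDP. The metric names, the default host and port, and both helper functions are assumptions for the example; in practice you would more likely use an existing statsd client library, and a hosted service may require its own prefix or authentication on top of this.

```python
import socket


def format_statsd(name, value, metric_type):
    """Build a statsd line: 'c' is a counter, 'ms' a timer, 'g' a gauge."""
    if metric_type not in ("c", "ms", "g"):
        raise ValueError(f"unsupported statsd metric type: {metric_type!r}")
    return f"{name}:{value}|{metric_type}"


def send_statsd(line, host="127.0.0.1", port=8125):
    """Push one metric line to a statsd daemon over UDP (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Example usage (hypothetical metric names):
#   send_statsd(format_statsd("api.users.requests", 1, "c"))
#   send_statsd(format_statsd("api.users.response_time", 42, "ms"))
```

Because the application pushes to a known address, nothing needs to discover or connect to the individual containers, which is what made this approach workable without service discovery.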
As for the metrics server, we decided to use HostedGraphite, as it was the path of least resistance to get started with and required the least operational maintenance in the medium term. Unfortunately DataDog was out of the question: despite it being an amazing service, my client didn’t yet feel that the metrics produced enough value to justify its cost.
Among the metrics the CTO requested were the container metrics from ECS, which he’d previously looked at in CloudWatch on occasion to see if something wasn’t right. Luckily, there wasn’t too much work to do here, as the existing docker-stats-statsd project could be deployed with little effort. The main challenge I faced with that project was how to turn the metrics it produced into meaningful dashboards that could drive actionable changes.
Even though I implemented the collection of container-level metrics for CPU and memory, in the long run we never used these metrics for practical purposes. I’ve come to the conclusion that container-level metrics usually aren’t useful, and I now prefer application-level metrics instead.
Container-level metrics seem important only for monitoring resource utilisation and for capacity planning of your container hosts. That is, to understand whether your containers are using, say, 60% of their allocated memory and CPU. However, if I wanted to do that, I’d really want to compare the resources allocated to a Kubernetes service against the resources that the containers for that service regularly use.
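The comparison I have in mind is just used resources divided by allocated resources. A minimal sketch, where the `ResourceSample` type, the `classify` helper and the 30% / 80% thresholds are all arbitrary assumptions chosen for illustration:

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    allocated: float  # e.g. memory limit in MiB, or CPU units reserved
    used: float       # typical usage in the same unit


def utilisation(sample):
    """Fraction of the allocation that is actually used (0.6 means 60%)."""
    if sample.allocated <= 0:
        raise ValueError("allocated must be positive")
    return sample.used / sample.allocated


def classify(sample, low=0.3, high=0.8):
    """Label a sample for capacity planning; thresholds are illustrative only."""
    ratio = utilisation(sample)
    if ratio < low:
        return "over-provisioned"
    if ratio > high:
        return "under-provisioned"
    return "ok"
```

For example, a container with 512 MiB allocated and roughly 307 MiB in regular use sits at about 60% utilisation and would be labelled "ok" under these thresholds.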
Monitoring things like CPU and memory usage for the purposes of “seeing if something is wrong” was commonly done when we had servers or virtual machines, where multiple applications ran on the one set of resources without any form of isolation.
However, in a container environment, you have that resource isolation and allocation, so noisy-neighbour problems are greatly reduced. Instead, you’ll first detect issues through healthchecks, remove the bad container from the load balancer, and then either remove the container or tag it for later post-mortem analysis to determine the root cause of the issue.
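That flow can be sketched as a small decision function. This is not how ECS or any particular scheduler implements it; the action names, the failure threshold and the `next_actions` helper are hypothetical and only mirror the steps described above.

```python
def next_actions(consecutive_failures, unhealthy_threshold=3, keep_for_postmortem=False):
    """Decide what to do with a container based on failed healthchecks.

    Returns a list of (hypothetical) action names in the order they would run.
    """
    if consecutive_failures < unhealthy_threshold:
        # Still considered healthy: leave it serving traffic.
        return ["keep_in_rotation"]
    # Unhealthy: stop sending traffic first, then decide the container's fate.
    actions = ["remove_from_load_balancer"]
    if keep_for_postmortem:
        actions.append("tag_for_postmortem")
    else:
        actions.append("terminate")
    return actions
```

The important ordering is that the container leaves the load balancer before anything else happens, so users stop hitting it while you decide whether it is worth keeping around for analysis.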
This means that resource utilisation monitoring becomes more of a problem for the cluster or container scheduler, rather than a dataset for a DevOps or Operations Engineer to query when trying to diagnose issues.
This is my current thinking on monitoring in container-like environments; I’d definitely be interested in hearing other perspectives on monitoring and containers.
As for the client, after a month of working with me, we’d managed to improve response times for several of their business-critical API endpoints by a factor of 3 to 7.5, and we were able to reduce database queries by a factor of 8 to 10.