Joyent CoPilot: Bringing Application Awareness to Cloud Infrastructure. Part V.
The story so far…
In Part I you were quickly introduced to the goal of the project as well as some of the technological concepts necessary to understand CoPilot. Part II covered major conceptual and structural changes, and Part III focused on the sign-up process, the creation of a deployment group and the deployment of services. In Part IV we broke down some of the major CoPilot components, learning how services can be assessed from the perspective of performance and application architecture, and how they can be scaled.
Up next: metrics and alerting. Ready, set, go!
Metrics, monitoring and alerting
To keep the infrastructure and the application running efficiently, you need to be aware of changes occurring within the system. Triton enables users to measure and monitor a multitude of performance parameters; with CoPilot, however, we had to find a way to process and display these measurements, as well as provide users with a minimal set of tools to analyse and act on them.
Effective monitoring of metrics means that users are able to:
- Identify sudden performance changes
- Identify slow performance changes that follow a certain trend
- Understand the overall performance of a service as well as the performance of its individual instances
It is important to point out that even though multiple instances of the same service are meant to perform equally, that isn't always the case: in certain scenarios, some instances of a service might be underused or running out of resources, whereas the rest perform within what is considered to be optimal limits. Users need to be aware of these deviations.
This brings us to one of the challenges of the service-centric approach — data aggregation. Data aggregation is the process of gathering and summarising raw data. Bear in mind that we have to deal with two types of data aggregation:
- Time aggregation — the process of summarising all data points gathered from a single resource over a specified period of time. The result is a single value that reflects the collected data. For example, imagine you have one instance of an Nginx service and you've been measuring a single performance parameter (let's say CPU usage) every 15 seconds (the highest frequency currently possible on the platform) for one hour. This would result in a data set of 240 data points. Aggregating these data points produces a single value.
- Spatial aggregation — the process of summarising all data points from a group of resources over a specified period of time. Just like time aggregation, this results in a single value. Let's go back to the Nginx example. Imagine you scaled Nginx up from 1 to 20 instances. Monitoring them for an hour will result in 20 data sets of 240 data points each. To better understand the entire service, you might find it useful to summarise the performance measurements of all 20 instances into something more intelligible (see the sketch after this list).
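To make the two steps concrete, here is a minimal TypeScript sketch based on the Nginx CPU example above. It assumes raw samples arrive as plain arrays of numbers per instance and uses a simple mean as the summary function; the type and function names are purely illustrative and not part of Triton or CoPilot.

```typescript
// Minimal sketch of the two aggregation steps from the Nginx CPU example above.
// One sample every 15 seconds for an hour gives 240 data points per instance;
// the mean is used here purely for illustration.

type InstanceSamples = {
  instanceId: string;
  cpuUsage: number[]; // e.g. 240 samples collected over one hour
};

// Time aggregation: collapse one instance's samples into a single value.
function aggregateOverTime(samples: number[]): number {
  const sum = samples.reduce((acc, value) => acc + value, 0);
  return sum / samples.length;
}

// Spatial aggregation: collapse the per-instance values of a whole service
// (e.g. 20 Nginx instances) into a single value.
function aggregateAcrossInstances(instances: InstanceSamples[]): number {
  const perInstance = instances.map((i) => aggregateOverTime(i.cpuUsage));
  return perInstance.reduce((acc, value) => acc + value, 0) / perInstance.length;
}
```

Both functions collapse many measurements into one number, which is exactly where the next challenge appears.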
The challenge here lies in the method of aggregation. Summarising multiple data points into just one value can obscure important information.
Consider this example. Let's say you have three containers: one is using 100MB of RAM, another 200MB and the third 300MB. Let's say that both 100MB and 300MB are undesirable values, meaning that one of the instances is underused, another is overused, and both require attention. Yet if you calculated the average value of the metric, you'd get 200MB and think, "Well, everything seems hunky-dory".
Early user interviews helped us understand two important things: a) choosing the right methods of aggregation is imperative for effective monitoring and b) most users don’t really understand how aggregation works beyond calculating simple averages (which aren’t always helpful). So it was up to us to both find the right way to quantitatively describe sets of data and educate users about it.
With the help of a few very senior interviewees well-versed in descriptive statistics, we decided to try out box plots for spatial aggregation. Box plots are a neat way of graphically describing the distribution of data and identifying outliers, both of which can be indicative of unequal performance across instances. In a box plot, the data is summarised by five values:
- minimum and maximum of all the data
- first quartile — a value that splits off the lowest 25% of data from the highest 75%
- third quartile — a value that splits off the highest 25% of data from the lowest 75%
- median — the value that separates the lower half of the data set from the higher half. The median does not get skewed by extremely small or large values as much as the average (or, in other words, the mean)
Box plots can also include outliers — values that lie an abnormal distance from the rest of the data. A short sketch of how these values can be computed follows below.
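Here is a minimal sketch of how such a five-number summary could be computed. The quartile interpolation and the 1.5 × IQR outlier rule below are common statistical conventions rather than a description of CoPilot's exact implementation.

```typescript
// Minimal sketch of the five-number summary behind a box plot.
// The quartile interpolation and the 1.5 x IQR outlier rule are common
// conventions, not necessarily the exact method CoPilot uses.

function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lower = Math.floor(pos);
  const upper = Math.ceil(pos);
  const weight = pos - lower;
  return sorted[lower] * (1 - weight) + sorted[upper] * weight;
}

function boxPlotSummary(values: number[]) {
  const sorted = [...values].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const median = quantile(sorted, 0.5);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  // Points more than 1.5 IQRs beyond the quartiles are flagged as outliers.
  const outliers = sorted.filter((v) => v < q1 - 1.5 * iqr || v > q3 + 1.5 * iqr);
  return {
    min: sorted[0],
    q1,
    median,
    q3,
    max: sorted[sorted.length - 1],
    outliers,
  };
}

// Memory usage (MB) across ten instances of one service:
console.log(boxPlotSummary([210, 190, 200, 205, 195, 198, 202, 207, 193, 460]));
// The median sits around 200MB, while 460MB is reported as an outlier.
```

Unlike the plain average from the earlier RAM example, this kind of summary surfaces the one instance that is running away from the rest.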
With box plots in place, we can give users a greater and more accurate understanding of what's happening with a service, as well as prompt them to take action. The median informs users of the most typical value of a metric, and the distance between the minimum and maximum of the data set allows them to evaluate the spread of the data. The larger the spread, the less uniform the performance of the service's instances. On top of that, outliers can indicate even more severe performance anomalies.
With aggregation making performance monitoring more intelligible, the next step was to introduce a degree of automation (unless, of course, you love spending hours staring at metrics charts) by designing an alerting system that would automatically notify users of critical changes.
In a nutshell, an alerting system consists of 3 major components:
- A service metric that is being monitored
- Conditions, defined by the user, under which a change in performance is considered critical and requires attention
- An alert — a brief message notifying the user of the critical change and prompting them to take action
It sounds pretty simple on the surface; however, monitoring and alerting are complex subjects with entire products built around them, meaning that whatever we ended up designing had to provide a solid basic toolkit without attempting to compete with mastodon-sized stand-alone monitoring products.
Following an assessment of various monitoring products, user interviews, several collaborative workshops with Joyent folks and a considerable amount of conceptual hacking and slashing, we identified the main objects and variables that would constitute our bare-bones alerting system (sketched as a data structure after the list below):
- A service metric that is being measured and assessed
- The type of behaviour that a metric has to exhibit. Sounds fancy, but in practice it means whether the metric is above or below a certain value
- Threshold — a value above or below which a change in the metric is considered critical
- Time aggregation — a variable describing the behaviour of a metric in relation to the threshold during a specific period of time
- Alert name
- Alert notification — the means by which a user is notified of a critical change. It could be a notification inside the platform, an email or a message received via any other kind of messenger
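To tie these variables together, here is an illustrative sketch of what an alert definition could look like as a data structure. The field names, metric identifier and notification channels are hypothetical and do not reflect CoPilot's actual API.

```typescript
// Illustrative shape of an alert definition built from the variables above.
// Field names and values are hypothetical, not CoPilot's actual API.

type AlertRule = {
  name: string;                      // alert name
  metric: string;                    // the service metric being assessed
  behaviour: 'above' | 'below';      // which side of the threshold is critical
  threshold: number;                 // value beyond which a change is critical
  durationMinutes: number;           // how long the condition must hold (time aggregation)
  notify: Array<'in-app' | 'email' | 'messenger'>; // how the user is notified
};

// Example: alert when the aggregated CPU usage of the Nginx service stays
// above 80% for five minutes, notifying the user in the platform and by email.
const highCpuAlert: AlertRule = {
  name: 'Nginx high CPU',
  metric: 'cpu-usage-percent',
  behaviour: 'above',
  threshold: 80,
  durationMinutes: 5,
  notify: ['in-app', 'email'],
};

console.log(highCpuAlert);
```

Keeping the rule down to a handful of fields is what we mean by a bare-bones toolkit: enough to catch critical changes without recreating a full monitoring product.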
Check out a walk-through video of an alert being created and managed:
So, what’s next?
As the work on CoPilot moves forward, there will be more things to discuss. In the meantime, for those of you who are dying to learn more about the project, let me introduce you to my colleague Alex, who has written some excellent articles on CoPilot.