Iter8: Take a look at the magic under the hood

Sushma Ravichandran
Published in iter8-tools
Oct 1, 2020 · 4 min read

In this article, we will take a peek at the analytic capabilities of iter8.

Iter8 is an open source toolkit for continuous experimentation on Kubernetes. In our previous blog post we said that iter8 is driven by its machine learning capabilities. In this post, we discuss the rich analytics insights surfaced by iter8 that are unique to iter8 experiments.

These analytics insights are exposed to users through our human-in-the-loop experiments, where users can actively participate in driving their experiments forward. You can watch this tutorial to learn more.

Statistical insights surfaced by iter8

Iter8 compares candidate versions of a microservice with each other and with the currently deployed baseline version, and rolls out the best version that emerges from an experiment (or rolls back to the baseline). During an experiment, iter8 makes optimal traffic decisions based on the behavior of the baseline and candidate versions. In addition, iter8 exposes a suite of metrics and assessments generated for each version. These are described below.

Traffic split recommendations

One of the most useful recommendations that iter8 offers is the recommended traffic split to each version of the microservice under experimentation. These numbers are based on advanced Bayesian learning techniques coupled with multi-armed bandit approaches for statistical assessments and decision making. Currently, iter8 supports three traffic split strategies: progressive, top_2 and uniform.
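To get an intuition for the progressive strategy, here is a minimal Python sketch in the spirit of Thompson sampling. This is not iter8's actual algorithm or code; the Beta posteriors, the success-rate metric, and the numbers are all assumptions chosen for illustration.

```python
import random

# Hypothetical Beta posteriors over a success-rate metric for each version
# (alpha = observed successes + 1, beta = observed failures + 1).
posteriors = {
    "baseline":    {"alpha": 91, "beta": 11},
    "candidate-1": {"alpha": 96, "beta": 6},
    "candidate-2": {"alpha": 94, "beta": 8},
}

def route_one_request(posteriors):
    """Draw one plausible success rate per version and route to the best draw."""
    draws = {v: random.betavariate(p["alpha"], p["beta"]) for v, p in posteriors.items()}
    return max(draws, key=draws.get)

# Routing many requests this way yields a progressive split: traffic shifts
# toward versions that look better, while weaker versions still get some traffic.
counts = {v: 0 for v in posteriors}
for _ in range(10_000):
    counts[route_one_request(posteriors)] += 1

split = {v: round(100 * c / 10_000, 1) for v, c in counts.items()}
print(split)  # e.g. most traffic to candidate-1, some to candidate-2, little to baseline
```

In this picture, a top_2 strategy would concentrate traffic on the two versions that win posterior draws most often, while uniform would simply split traffic evenly across all versions.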

If you’d like to learn more about iter8’s traffic split strategies, check this out.

A blog post describing the algorithms behind iter8 is coming soon!

Winner assessment

Iter8 experiments are broken down into iterations. At the end of each iteration, iter8 determines whether a winner can be declared among the competing versions. This is determined by the win_probability assigned to each version participating in the experiment. Win probabilities are posterior probabilities computed from Bayesian estimations of each version's performance so far in the experiment. There can be at most one winner at a time, and iter8 declares a winner only when it is sufficiently confident to do so.
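As a rough illustration of what a win probability could look like, here is a hedged sketch: model each version's metric with a Beta posterior, estimate how often each version comes out on top across posterior draws, and declare a winner only past a confidence threshold. The model, the 0.9 threshold, and the numbers are assumptions made for this example, not iter8 internals.

```python
import random

# Hypothetical Beta posteriors over a success-rate metric; purely illustrative.
posteriors = {
    "baseline":    (91, 11),
    "candidate-1": (96, 6),
    "candidate-2": (94, 8),
}

def win_probabilities(posteriors, num_samples=20_000):
    """Estimate each version's posterior probability of being the best."""
    wins = {v: 0 for v in posteriors}
    for _ in range(num_samples):
        draws = {v: random.betavariate(a, b) for v, (a, b) in posteriors.items()}
        wins[max(draws, key=draws.get)] += 1
    return {v: w / num_samples for v, w in wins.items()}

probs = win_probabilities(posteriors)
leader = max(probs, key=probs.get)

# Only declare a winner when the evidence is strong; the 0.9 threshold is
# an arbitrary value chosen for this sketch.
if probs[leader] >= 0.9:
    print(f"winner: {leader} (win probability {probs[leader]:.2f})")
else:
    print(f"no winner yet; {leader} leads with win probability {probs[leader]:.2f}")
```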

Metric criteria

Adding a metric criterion in iter8.

To help decide the outcome of an experiment, users can set strict SLOs for microservices in the form of business- and performance-based metric criteria. Metric criteria come in two kinds: absolute criteria, where you can say, for example, that the latency of any candidate version should be less than 20 milliseconds; and relative criteria, where you can say, for example, that the latency of any candidate version should be within 1.05 times that of the baseline version (that is, you are willing to tolerate a 5% increase in latency relative to the baseline).

In addition, users can optionally mark one of the metrics used in the criteria as a reward metric, which is interpreted by iter8 as follows: among all the versions that satisfy the SLOs, declare the version that optimizes the reward as the winner. If no version satisfies the SLOs, declare the baseline as the winner. Under the hood, iter8 builds and updates belief distributions for versions based on their metric values and uses them to drive all the assessments described above.

This experiment, for example, is run with three versions of a microservice: one baseline and two candidate versions. Both candidate versions perform better than the baseline according to the reward metric. In fact, Candidate 1 performs much better than Candidate 2 if we look only at the reward metric. But iter8 picks Candidate 2 as the winner: Candidate 1 does not pass the service level constraints set by the user, so no matter how high its reward is, it will not be selected by iter8.
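The decision logic can be sketched in a few lines of Python. The numbers below are invented to mirror the situation just described (Candidate 1 has the best reward but violates an SLO); the metric names, criteria, and values are illustrative only, and in practice criteria are declared in the iter8 experiment spec rather than in code.

```python
# Illustrative metric values for three versions; not real experiment data.
versions = {
    "baseline":    {"latency_ms": 18.0, "reward": 100.0},
    "candidate-1": {"latency_ms": 25.0, "reward": 180.0},
    "candidate-2": {"latency_ms": 17.0, "reward": 140.0},
}

def satisfies_slos(name, metrics, baseline):
    # absolute criterion: mean latency must stay below 20 ms
    if metrics["latency_ms"] >= 20.0:
        return False
    # relative criterion: latency must be within 1.05x the baseline's latency
    if name != "baseline" and metrics["latency_ms"] > 1.05 * baseline["latency_ms"]:
        return False
    return True

baseline = versions["baseline"]
feasible = {v: m for v, m in versions.items() if satisfies_slos(v, m, baseline)}

# Among the versions that satisfy the SLOs, the one with the best reward wins;
# if none qualify, fall back to the baseline.
winner = max(feasible, key=lambda v: feasible[v]["reward"]) if feasible else "baseline"
print(winner)  # candidate-2: candidate-1 has a higher reward but violates the latency SLO
```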

As you can see, the combination of reward and SLOs expands the space of experiments that you can design using iter8. This feature makes it easy for you to roll out the best version of your microservice, where you have the flexibility to express what "best" means to you in an experiment.

Credible interval and metric values

For each microservice version, iter8 exposes the observed value of each metric of interest (the ones that appear in the criteria). Internally, iter8 uses these values to build belief distributions for the metrics. The belief distributions also allow iter8 to expose, for each version, credible intervals for the metrics used in the criteria. A credible interval is the range within which the metric value is most likely to lie.
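For intuition, here is a tiny sketch of how a credible interval can be read off posterior samples. The Beta posterior over a success-rate style metric and the 95% level are assumptions made for this example, not iter8's internal model.

```python
import random

random.seed(0)

# Hypothetical Beta posterior, e.g. 95 successes and 5 failures plus a uniform prior.
alpha, beta = 96, 6

# Sample from the posterior and take the central 95% of the samples as the
# credible interval for the metric.
samples = sorted(random.betavariate(alpha, beta) for _ in range(20_000))
lower = samples[int(0.025 * len(samples))]
upper = samples[int(0.975 * len(samples))]

print(f"95% credible interval for the success rate: [{lower:.3f}, {upper:.3f}]")
```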

The variety of insights shipped by iter8 will help you make informed decisions before rolling out the best version of your microservice. As mentioned previously, you can choose to perform human-in-the-loop experiments where you can play an active role in driving iter8 experiments forward. You can also choose to perform automated experiments where these decisions are made for you.

What are the metrics, assessments, and recommendations you would like to see next as part of iter8’s experiments? Give iter8 a spin and let us know!
