
…h we can control for this behavior by tweaking the model. This is not a fault in our model exactly. Actually it raises a really important question: what does our autoscaling service do when all the servers are unhealthy and it cannot spin up new ones? This actually happened to Buildkite a few years ago when some misconfigured settings triggered fail…
Anyway, let’s say we did all that and we decide on an initial state of 40% utilization. This is where we need to learn how to think about a system of systems as an algorithm instead of an environment. Although in the middle of an outage or complex failure it might seem like things are happening at random, in reality each system has a defined set of rules that govern its behavior and choices, and each system it hands off to has a defined set of rules that govern its behavior and choices. Just because the exact set of inputs is unexpected does not make the behavior random.
…ems (or at least to perfectly solve a single problem). We are misguided to these complex solutions. We’ve created complicated problems by not properly understanding the fundamentals of whatever we’re struggling with.