By working with these companies to get Titus working in their AWS accounts, we learned how we could better prepare Titus for being fully open sourced. Those experiences taught us how to disconnect Titus from internal Netflix systems, the level of documentation needed to get people started with Titus, and what hidden assumptions we relied on in our EC2 configuration.
To this end they wind up investing excessive engineering resources to build out complex CI pipelines and intricate local development environments to meet these goals. Soon enough, sustaining such an elaborate setup ends up requiring a team in its own right to build, maintain, troubleshoot and evolve the infrastructure. Bigger companies can afford this level of sophistication, but for the rest of us treating testing as what it really is — a best effort verification of the system — and making smart bets and tradeoffs given our needs appears to be the best way forward.
Whitebox monitoring is a fantastic source of data. Observability is our ability to easily and quickly find germane information from this data when we need it. Observability is more about what information we might require about the behavior of our system in production and whether we will be able to have access to this information. It doesn’t matter so much if this information is pre-processed or if it’s derived from the data on the fly. It’s also not about how we plan to process and use the raw data. This raw data we’re collecting could have various uses —
One way to think of this is as an inversion of the common pubsub mechanisms. In many queue systems, the topics store no real user data but the messages that are published to those topics contain rich data. For systems like etcd the keys (analogous to topics) store the real data while the messages (notifications of changes) contain no unique rich information. In other words, for queues the topics are simple and the messages rich while systems like etcd are the opposite.
By and large, the biggest advantage of metrics based monitoring over logs is the fact that unlike log generation and storage, metrics transfer and storage has a constant overhead. Unlike logs, the cost of metrics doesn’t increase in lockstep with user traffic or any other system activity that could result in a sharp uptick in observability data.