Blackbox Monitoring at Oscar

Oscar Health
Published in Oscar Tech
4 min read · Nov 19, 2021

By Erin Landau and Young Zhang

Would you notice a minor change in your child’s health? Probably not. Would you want to get notified whenever their health does change so that you could prevent bigger issues from occurring? Absolutely. You might ask yourself, how can I easily monitor my child for small changes in their health and get alerted if something goes wrong?

While we haven’t invented a preventative health bot just yet, Oscar’s Tech team has asked the same question about our engineering systems. We’re always looking for better ways to proactively catch potential issues, which led us to an approach called blackbox monitoring. In blackbox monitoring, teams observe the system’s behavior from the outside to get a jump on problems as they arise.

Opportunities to improve our existing monitoring

Most teams at Oscar implement whitebox and blackbox monitoring using Prometheus, an open-source monitoring system, covering job-level metrics, database metrics, queue metrics, and more. We also rely on Sentry and Splunk for error logging. Even though we follow industry best practices for alerting, we can still miss edge cases: some data discrepancies are too subtle or too rare to trip an alert.

Since Oscar’s architecture is microservice-oriented and we have many queue-based distributed systems, sporadic issues can sometimes happen, such as a failure to dispatch a few member emails or to call gRPC endpoints (e.g. as a result of data inconsistency between services). Unexpected data issues can also be introduced by internal users and stakeholders, resulting in abnormal behavior for downstream systems even in the absence of engineering errors. We realized that blackbox monitoring would help us identify and triage these types of sporadic issues quickly and efficiently, and began searching for a way to put this new monitoring process into practice.

Implementing a solution by adapting an existing homegrown tool

At Oscar, we’re also big proponents of dog-fooding (or, as we like to call it, champagning) our own systems. Instead of investing heavily in external tech to implement blackbox monitoring, we turned to an internal tool called Automat.

Automat is an automated action generator that is used to execute numerous workflows across Oscar, from mass email marketing campaigns to administering zero-dollar incentive grants for our Virtual Primary Care plans. Automat users create a “recipe” in YAML, a human-readable data serialization format, specifying the type of action to be fired, the population to target, the cron schedule to run on, and any relevant context required to customize the action. We now have over 700 recipes running in Automat with actions like email, letters, fax, Slack messages, SMS, gift cards, Jira tickets, and more. We enroll thousands and sometimes hundreds of thousands of members into the automation pipeline and generate a large volume of downstream actions every day (up to millions a day).
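To make the four parts of a recipe concrete, here is a minimal Python sketch of the information a recipe carries. It is not Automat’s actual schema: the `Recipe` class, its field names, the example query, and the `validate` helper are all hypothetical, chosen only to mirror the description above (action type, target population, cron schedule, and context).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Recipe:
    """Hypothetical stand-in for an Automat recipe; field names are illustrative."""

    name: str
    action_type: str        # e.g. "email", "slack_message", "sms"
    population_query: str   # query that selects who (or what) to target
    cron_schedule: str      # standard five-field cron expression
    context: Dict[str, str] = field(default_factory=dict)

    def validate(self) -> List[str]:
        """Return a list of human-readable problems (empty if none found)."""
        problems = []
        if len(self.cron_schedule.split()) != 5:
            problems.append("cron_schedule must have five fields")
        if not self.population_query.strip():
            problems.append("population_query must not be empty")
        return problems


# An illustrative recipe: run every day at 09:00 and email the selected members.
flu_shot_reminder = Recipe(
    name="flu-shot-reminder",
    action_type="email",
    population_query="SELECT member_id FROM members WHERE flu_shot_due = TRUE",
    cron_schedule="0 9 * * *",
    context={"template": "flu_shot_reminder_v1"},
)

print(flu_shot_reminder.validate())
```

In a real recipe these fields would be written in YAML rather than Python; the point of the sketch is only that a recipe is declarative data, which is what makes it easy to repurpose for alerting.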

To stand up our version of blackbox monitoring, we decided to use Automat to run queries against our data lake and send ourselves Slack or email alerts when the results show apparent anomalies.

Let’s look at some examples

The Oscar team responsible for our letter mailing service uses Automat alerting to monitor member mailing volume. For example, their alerting recipe sends a Slack message to the team when a member’s daily mail volume exceeds a certain threshold. This alert not only ensures that members are not being deluged with unexpected letters from Oscar, but also saves the company money by weeding out duplicates.

Letter Service Mail Volume Alert:

Image of an Automat alert to the team when a member’s daily mail volume exceeds a certain threshold.
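The logic behind a volume alert like this one can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the letter team’s recipe: the threshold value, the `(member_id, send_date)` row shape, and the function names are all hypothetical, and the rows stand in for the result of a data lake query.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Assumed threshold; the real value is not given in this post.
DAILY_LETTER_THRESHOLD = 5

MailKey = Tuple[str, str]  # (member_id, send_date)


def members_over_threshold(
    mailings: Iterable[MailKey], threshold: int = DAILY_LETTER_THRESHOLD
) -> Dict[MailKey, int]:
    """Count letters per member per day and keep only counts above the threshold."""
    counts = Counter(mailings)
    return {key: n for key, n in counts.items() if n > threshold}


def format_alert(offenders: Dict[MailKey, int]) -> List[str]:
    """Turn offending (member, day) pairs into one alert line each."""
    return [
        f"Member {member_id} was sent {n} letters on {send_date}"
        for (member_id, send_date), n in sorted(offenders.items())
    ]


# Illustrative rows, as if returned by a query over one day of mailings.
rows = [("m-001", "2021-11-01")] * 7 + [("m-002", "2021-11-01")] * 2

offenders = members_over_threshold(rows)
for line in format_alert(offenders):
    print(line)  # in practice this text would be posted to a Slack channel
```

Because the check only looks at what was actually mailed, it catches duplicates regardless of which upstream service produced them, which is the essence of the blackbox approach.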

Oscar’s Insurance Operations team uses Automat alerting to send emails to internal stakeholders and engineers when provider identifiers, such as Tax Identification Numbers (TINs), are out of sync between different parts of our claims system. Having reliable TINs means we can accurately process claims and pay our in-network providers without delay.

Claims Service TIN Mismatch Alert:

Image of Automat alert to Oscar’s Insurance Operations team indicating that provider identifiers within different parts of our claims system are out of sync.
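A consistency check of this kind boils down to comparing the same key across two sources. The Python sketch below shows one way to do that; the two dictionaries, the provider IDs, the TIN values, and the `find_tin_mismatches` function are hypothetical examples, not data or code from our claims system.

```python
from typing import Dict, Optional, Tuple

Mismatch = Tuple[Optional[str], Optional[str]]  # (TIN in source A, TIN in source B)


def find_tin_mismatches(
    source_a: Dict[str, str], source_b: Dict[str, str]
) -> Dict[str, Mismatch]:
    """Return providers whose TIN differs between, or is missing from, the two sources."""
    mismatches: Dict[str, Mismatch] = {}
    for provider_id in sorted(source_a.keys() | source_b.keys()):
        tin_a = source_a.get(provider_id)  # None if the provider is absent
        tin_b = source_b.get(provider_id)
        if tin_a != tin_b:
            mismatches[provider_id] = (tin_a, tin_b)
    return mismatches


# Made-up provider records from two parts of a hypothetical claims pipeline.
adjudication = {"prov-1": "12-3456789", "prov-2": "98-7654321"}
payments = {"prov-1": "12-3456789", "prov-2": "98-0000000", "prov-3": "55-5555555"}

mismatches = find_tin_mismatches(adjudication, payments)
for provider_id, (tin_a, tin_b) in mismatches.items():
    print(f"{provider_id}: adjudication={tin_a} payments={tin_b}")
```

Treating a missing record as a mismatch is a deliberate choice in this sketch: a provider that exists in only one system is usually as worth investigating as one with conflicting values.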

We even use Automat alerting to surface the feedback members share with us while using our web and mobile apps. Oscar employees use the feedback from this channel to identify emerging issues, share kudos with teammates on recent launches that are well received, and home in on quick wins to help members get care more easily. In the example below, we responded quickly to a member request by spinning up an in-app notification letting our members know where they could get a flu shot at their local pharmacy.

Member Web App Feedback Alert:

Automat alert showing member web app feedback

Oscar Response:

Image of Oscar’s response to member’s question about the flu shot

These are just a few of the many blackbox monitoring use cases we’ve implemented so far. Blackbox monitoring is low overhead and high value, and we’re continuing to roll it out across more teams and systems to make it even easier to keep a pulse on the health of our tech and, most importantly, our members.

Erin Landau is a Senior Product Manager, working on Oscar’s communication and workflow automation platform.

Young Zhang is an Engineering Manager working on member engagement and concierge support.

Want to talk more tech? Send our CEO, Mario, a tweet @mariots
