Implementing HashiCorp Vault at Oscar

johnlouis petitbon · Published in Oscar Tech · Dec 4, 2017

Part of a Site Reliability Engineer’s job is to recognize when a piece of infrastructure needs to be replaced with a new solution. Ideally, we would evaluate each system methodically and plan ahead for its eventual replacement, but what usually drives these decisions is the reliability of a system or, more precisely, the lack thereof. When a technology does not scale, or when it becomes a source of discontent amongst operators and engineers, it is time to look for a better solution.

At Oscar, SREs try to incorporate a third, less reactive and more forward-looking consideration: intrinsic benefit to our members.

Oscar’s business depends on trust

From forging relationships with our members through our Concierge teams to storing their Protected Health Information (PHI) in secure databases, a trust chain weaves through all that we do at Oscar. The Site Reliability Engineering team at Oscar, while seemingly removed from our members, is an important link in that chain.

The infrastructural and platform layer decisions that we make impact the integrity, availability, and security of our services, and ultimately affect all interactions with members and providers. When the time came to evaluate how effectively our secrets management service was strengthening this commitment, we decided that we could do even better.

What is a secrets management service?

A secrets management service is a system for storing and distributing secrets — such as passwords and tokens — needed to access services such as databases and APIs. For example, when a user requires access to a database, a request is sent to an administrator, who then creates the appropriate credentials in the database. Those credentials live on until the person leaves the company, at which point an administrator must remember to remove the user from the database.

The example described above is manual. In small organizations, secrets management is not a hard problem to solve. However, as an engineering organization grows to hundreds of employees or hundreds of services, manual secrets management becomes a growing burden and liability. Accidental data leakage, disgruntled employees, and a larger attack surface are some of the risks associated with the proliferation of secrets. As Oscar’s engineering organization grew, we decided to mitigate these risks ahead of time by upgrading our secrets management solution.

Choosing a new solution

When we decided to upgrade our secrets management system, we looked into a number of options from vendors, and found that HashiCorp was working on a project that looked very promising. They had released an early version of Vault, their secrets management tool, and we knew that even though the product was young, Oscar could benefit from testing and evaluating its capabilities. The lead time gained by early testing would let the technology and our implementation of it reach a higher degree of maturity should we decide to run it in production.

At a high level, Vault acts as an authentication proxy. Clients can ask to be authenticated with specific backends, and Vault will verify their identity and ensure they are authorized to connect to the requested backend. Backends can be databases, cloud providers, SaaS vendors, or any technology that uses username/password or token combinations to control access.
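To make this concrete, here is a rough sketch of what asking Vault for a dynamic credential looks like with the official Go API client (github.com/hashicorp/vault/api). The mount path and role name below are hypothetical placeholders, not our actual configuration.

```go
package main

import (
	"fmt"
	"log"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	// DefaultConfig reads VAULT_ADDR from the environment; NewClient picks up
	// VAULT_TOKEN if it is set.
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatalf("creating Vault client: %v", err)
	}

	// Ask the database secrets backend for a short-lived credential.
	// "database/creds/readonly" is a hypothetical mount path and role name.
	secret, err := client.Logical().Read("database/creds/readonly")
	if err != nil {
		log.Fatalf("reading credentials: %v", err)
	}

	fmt.Printf("username: %v\n", secret.Data["username"])
	fmt.Printf("lease expires in %d seconds\n", secret.LeaseDuration)
}
```

Vault creates the credential on demand, tracks its lease, and revokes it in the backend when the lease expires.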

Our immediate goal was to deprecate the old method of granting users and services access to databases and instead let Vault dispatch short-lived credentials. Ultimately, we knew that building a control plane capable of handling all secrets would be worth the initial cost of implementation. With Vault, we gained dynamic and granular access control and administration; in other words, “superpowers.”

Use cases

The first use case we implemented in production was the dispatch of database credentials to users. Using our existing employee directory, stored in LDAP, we mapped our organization groups to an access control list (ACL) for each of the Vault backends. Users belonging to those ACLs are now able to access their temporary credentials by requesting them via our internal chatbot or a web UI. Their credentials expire in 12 hours. The payoff for this use case is substantial: onboarding and offboarding a team member no longer requires any direct manipulation of databases or other systems, and we can guarantee that employees have access to data only when they actually need it to perform their jobs. Moreover, employee credentials can be easily revoked with one chatbot command.
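Under the hood, this kind of setup maps LDAP groups to Vault policies and defines database roles with a 12-hour TTL. The sketch below shows what that configuration could look like through the Go API client; the group, policy, role, and SQL statements are made-up placeholders rather than our real setup.

```go
// Sketch: map an LDAP group to a Vault policy and define a 12-hour database role.
package main

import (
	"log"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Members of the (hypothetical) "data-analysts" LDAP group receive the
	// "analytics-readonly" policy when they log in.
	_, err = client.Logical().Write("auth/ldap/groups/data-analysts", map[string]interface{}{
		"policies": "analytics-readonly",
	})
	if err != nil {
		log.Fatal(err)
	}

	// The database backend role that policy grants access to; credentials
	// expire after 12 hours.
	_, err = client.Logical().Write("database/roles/analytics-readonly", map[string]interface{}{
		"db_name":             "analytics",
		"creation_statements": "CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT ON ALL TABLES IN SCHEMA public TO \"{{name}}\";",
		"default_ttl":         "12h",
		"max_ttl":             "12h",
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

The chatbot and web UI are then thin wrappers over a read of the corresponding creds endpoint.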

Integrating services with Vault for database backends was a more challenging effort. Traditionally, services are configured at boot time with a variety of environment variables, including static credentials. Vault’s model of dispatching transitory credentials forced us to handle credential rotation at runtime in every library used to manage database connections. Thankfully, Oscar runs a tight ship with respect to database access, and we have only a few libraries handling database connections across our services.

So, with some deft engineering by our Engineering Effectiveness team, we were able to augment the libraries to handle the temporal nature of credentials; holding two sets of credentials at any point in time allows the code to handle failures more gracefully. We manage authorization via the AppRole authentication backend method, and roles are namespaced using role names in our scheduler — another convention that makes things easy to automate. The resulting payoff has been seamless automation and integration with backend databases for all services deployed on our platform.
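Here is a minimal sketch of what such rotation logic could look like with the Go API client: log in via AppRole, fetch credentials, keep the previous set around while the new one takes over, and re-fetch before the lease expires. The role name, environment variables, and connection-pool handling are simplified placeholders, not our production library.

```go
// Sketch of a service library rotating Vault-issued database credentials
// while retaining the previous set as a fallback.
package main

import (
	"log"
	"os"
	"time"

	vault "github.com/hashicorp/vault/api"
)

type dbCreds struct {
	Username string
	Password string
	Lease    time.Duration
}

func fetchCreds(client *vault.Client) (*dbCreds, error) {
	// "database/creds/service-role" is a hypothetical role name.
	secret, err := client.Logical().Read("database/creds/service-role")
	if err != nil {
		return nil, err
	}
	return &dbCreds{
		Username: secret.Data["username"].(string),
		Password: secret.Data["password"].(string),
		Lease:    time.Duration(secret.LeaseDuration) * time.Second,
	}, nil
}

func main() {
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Authenticate with the AppRole backend; here the role/secret IDs arrive
	// via environment variables injected by the scheduler (a placeholder).
	login, err := client.Logical().Write("auth/approle/login", map[string]interface{}{
		"role_id":   os.Getenv("VAULT_ROLE_ID"),
		"secret_id": os.Getenv("VAULT_SECRET_ID"),
	})
	if err != nil {
		log.Fatal(err)
	}
	client.SetToken(login.Auth.ClientToken)

	var current, previous *dbCreds
	for {
		creds, err := fetchCreds(client)
		if err != nil {
			if current == nil {
				log.Fatalf("initial credential fetch failed: %v", err)
			}
			// Keep serving with the existing credentials and retry shortly.
			log.Printf("rotation failed, keeping existing credentials: %v", err)
			time.Sleep(time.Minute)
			continue
		}
		previous, current = current, creds
		// Rebuild the connection pool with `current`; `previous` stays valid
		// until its lease expires, so in-flight connections can drain gracefully.
		log.Printf("rotated to database user %s (previous credential retained: %v)",
			current.Username, previous != nil)

		// Re-fetch well before the current lease runs out.
		time.Sleep(current.Lease / 2)
	}
}
```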

In the same fashion, we were able to seamlessly and securely integrate services and user access to our cloud services like Amazon S3. Management of static secrets has also been moved to Vault: at the moment, we have a simple process of syncing static secrets from a private, encrypted Git repository to Vault.
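Conveniently, the same read pattern covers both cases: short-lived cloud credentials come from the AWS backend, and static secrets come from the key/value backend. A hedged sketch, with hypothetical mount paths and role names:

```go
// Sketch: one Logical().Read pattern for both dynamic cloud credentials and
// static key/value secrets. Paths and role names are placeholders.
package main

import (
	"fmt"
	"log"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Short-lived AWS credentials, e.g. for an S3-scoped IAM role.
	aws, err := client.Logical().Read("aws/creds/s3-uploader")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("access key:", aws.Data["access_key"])

	// A static secret previously synced into the key/value backend.
	kv, err := client.Logical().Read("secret/myapp/third-party-api")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("api token:", kv.Data["token"])
}
```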

Oscar’s new superpowers

Vault gives us some awesome features right out of the box, my favorites being auditing of all access requests and a break-glass procedure that lets us lock down all systems and make them inaccessible in case of a security breach.
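To give a rough idea of how simple that break-glass step is: sealing Vault through the API immediately makes every secret unreadable until operators unseal it again with the required quorum of key shares. The snippet below is illustrative only, not our actual runbook.

```go
// Sketch of a break-glass action: sealing Vault locks down all backends at once.
package main

import (
	"log"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Seal the Vault server: all secrets become inaccessible immediately.
	if err := client.Sys().Seal(); err != nil {
		log.Fatalf("sealing Vault: %v", err)
	}
	log.Println("Vault sealed; secrets are locked down until an unseal quorum is reached")
}
```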

But the biggest win became clear as we kicked off the process of architecting our new infrastructure, which will be able to host a variety of platforms and multiple product tenants. This is not a simple endeavor. Adding to the complexity, we’re going to build two overlapping environmental dimensions: infrastructure environments for promoting changes to the underlying server images and tenant environments for developing, testing, and releasing products. These two overlapping dimensions have to work in conjunction, without duplicating the entire stack for every permutation of the platform, while at the same time ensuring that members’ data, and the systems that access it, remain accessible to authorized personnel only. As a result of our systems integration with Vault, a lot of these features will be much easier to implement.

Thanks to HashiCorp’s other products, we can integrate service discovery (Consul), scheduling (Nomad), and secrets management (Vault) across federated VPCs in a multi-tenant environment. We are now exploring the use of Nomad’s “regions” to manage our platform deployment, with its “data center” concept used to manage tenants. Finally, with Vault as our security control plane, we control access and prevent environmental cross-contamination. Namespacing regions and datacenters, in conjunction with Vault’s X.509 certificates backend, makes for a robust set of controls we can dynamically configure and deploy.
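For example, issuing a short-lived X.509 certificate whose identity encodes the region and tenant could look like the sketch below; the PKI role and naming convention are hypothetical illustrations, not our production scheme.

```go
// Sketch: issue a short-lived certificate from Vault's PKI backend, with the
// common name encoding a hypothetical region/tenant namespace.
package main

import (
	"fmt"
	"log"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// "pki/issue/service" and the naming convention are placeholders; the idea
	// is that region (platform environment) and datacenter (tenant) are baked
	// into the identity Vault signs.
	cert, err := client.Logical().Write("pki/issue/service", map[string]interface{}{
		"common_name": "billing.tenant-a.us-east.internal",
		"ttl":         "24h",
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(cert.Data["certificate"])
}
```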

At Oscar, our members’ security and privacy are paramount. As SREs, we work on technology a few layers beneath the member products and on tools that are invisible to members, but our organization is critical to strengthening the trust that binds us to them.
