Bringing Sexy Back to SRE
Our Seed partnership with OpsLevel
Today I am excited to share the news of Vertex US’s partnership with OpsLevel’s founders, John Laban and Ken Rose. I could wax poetic about their engineering prowess, honed while building innovative products at DevOps GOATs PagerDuty, Amazon, and Shopify. Instead, I’d like to focus on what I believe is one of the biggest problems lurking below the surface in every major engineering team.
How does an organization define who “owns” a service? When a new engineer joins, how will she figure out what components to use, how to operate them, and who to go to with questions? When it’s time to deprecate a component, how is that shared and transitioned to a new service? When rolling out a new tool or a new practice, such as vulnerability scanning, how do you get this requirement in front of every team and incorporated into every service? Often these questions are the province of the Site Reliability Engineer (SRE), a position far more scarce than even a Data Scientist.
SREs and infrastructure teams often juggle 5+ monitoring systems, 3+ logging systems, and 20+ third-party vendors throwing their own data into the mix. The sprawl of tooling creates issues that flow downstream, making critical operations like finding the error in a misconfigured server or a bug in a CI/CD pipeline far more difficult. I saw these problems firsthand when I led Facebook’s infrastructure team. As we grew from 100 to 1000+ engineers, we often found ourselves with multiple tools (and teams) building overlapping solutions. In the early years, our mantra was “Move Fast and Break Things,” which evolved to a less catchy “Move Fast with Stable Infra.” Defining service ownership was a key tenet of changing how our organization worked. As businesses migrate to the cloud, they have an opportunity to improve their engineering throughput, imbue service ownership, and fundamentally improve site reliability!
This spring my colleagues Sandeep Bhadra, Madison Friedman, and I interviewed dozens of practitioners, triangulating market feedback and helping webmonsters cope with the reality of their convolutedness. Of the dozens of companies we spoke with, only two had not yet started their journey to microservices, and both were actively considering it. Large companies with established monoliths are keen to move to microservices, but costs are high, and the transition can take years. Madison summarized our research in Taming Microservice Monsters.
OpsLevel wants to make the Internet better by helping companies build more reliable and secure software. They’re building a Service Ownership Platform, starting with a Microservice Catalog, to make it easier for engineering teams to own and operate software in production. As enterprises reconstruct monolithic applications into cloud-based microservices, the idea of service ownership and wiring up applications in a durable, scalable fashion becomes massively complex. OpsLevel’s first product is akin to service scorecards, making it easy for engineers to track all the information about applications and services inside their organization: what each one does, who owns it, and how to operate it.
With 1 to 10 engineers and 10 to 100 services this may not be a big deal. At scale, managing hundreds to thousands of services across dozens of teams is hard.
Microservice architectures increase the number of dependencies in systems. The concerns (CI/CD, version control, error tracking, failover, authorization, etc.) of SRE, DevOps, and Security teams then multiply, and if ignored, these concerns lead to downtime, lost data, and security gaps. OpsLevel’s catalog is the first, foundational move toward eliminating questions about Service Ownership. Standardizing on a format for describing services and their properties places the catalog at the heart of all other DevOps tools, meaning engineers don’t have to worry about keeping them updated or constantly rewiring when components are upgraded.
Apex tech giants like Atlassian, Facebook, LinkedIn, Microsoft, Slack, Uber, and at least a dozen others resorted to building homegrown microservice catalogs because there was nothing good enough to buy. OpsLevel brings the capabilities of the biggest, baddest internet giants directly to modern enterprise development teams.
Ultimately we want to solve an even more substantial problem: eliminating downtime. Downtime can have innumerable causes. However, it’s easy to understand the outcome of downtime: lost business, lost trust, and, worse still, lost customers. We are thrilled to partner with OpsLevel’s founders John and Ken, and look forward to supporting them in building an enduring company.