Resolve outages with modern incident response
Technology outages burn bridges between businesses and consumers. We experience this every day: from email and social media to banking and airline reservations, service downtime leads to lost revenue, drives away prospective and loyal customers, and gives competitors an opportunity to steal market share with a better digital experience. According to a recent report from Atlassian, over 190,000 outages were reported by StatusPage customers in 2018 alone. While the current crop of monitoring and alerting tools acts as a smoke alarm, there’s a gap in automating post-incident workflows using Site Reliability Engineering best practices to stop the blaze and provide actionable intelligence to keep it from happening again.
Today, we’re thrilled to announce our investment in FireHydrant, founded by Bobby Ross, Daniel Condomitti, and Dylan Nielsen, with a mission to help teams effectively respond to and learn from technology outages using their product, which launched publicly today.
Turbulence in the Cloud
The cloud and emerging application architectures make diagnosing production issues extremely complex. We’re seeing teams break down applications into microservices and transform legacy groups into agile teams with a ‘you build it, you own it’ mentality. While companies encourage code deployment velocity, new changes shipped to production are often the cause of outages. Web-scale companies have learned these lessons the hard way, and the best practices they’ve developed have culminated in the discipline of Site Reliability Engineering. It’s an area of growing interest among startups and corporations alike, and it encompasses the tools, processes, and organizational structure that web-scale companies like Google use to scale their services and ensure uptime.
The legacy way of coordinating a response is to huddle everyone on a conference call, which tends to lead to endless finger-pointing, sleepless nights, and undue stress. It’s not surprising to hear that experienced employees leave companies because of this miserable experience. Meanwhile, the rising use of real-time chat, paired with the increasing velocity of code deployments, has both created the market need and enabled this opportunity for FireHydrant. For example, the CTO of a large media company we spoke with last year talked about deploying Slack to his technology organization and forming its first-ever SRE team. A leading financial services firm is taking many of its operations hires and reskilling them as Site Reliability Engineers. The 2,000+ open SRE roles across industries are growing in number every day. As a result, the need to bring web-scale expertise to the broader enterprise through automation and workflow has never been greater.
A Product for SREs
Built by former SREs at companies like Namely, DigitalOcean, CoreOS, and Paperless Post, FireHydrant delivers a modern incident management process that automates and applies industry best practices, allowing your team to put out fires faster, learn from your incidents, and focus on preventing them in the future. Key to FireHydrant’s approach is how it tracks and traces changes by monitoring deployments, which then points you to the areas where problems started. From there, the tool automatically assigns roles and tasks based on FEMA’s Incident Commander framework, used and proven in tackling real-life emergencies. Last and most important, the platform allows you to learn from your outages with analytics and an easy postmortem process. This process identifies root causes, allowing teams to make the necessary fixes and helping managers understand the overall reliability of their systems.
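To make the idea concrete, here is a minimal sketch of what correlating an incident with recent deployments and handing out response roles might look like. It is purely illustrative and is not FireHydrant’s API or implementation: the `Deployment` type, the `suspect_deployments` and `assign_roles` helpers, the one-hour lookback window, and the role names are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deployment:
    # Hypothetical record of a change shipped to production.
    service: str
    version: str
    deployed_at: datetime


def suspect_deployments(deployments, incident_start, lookback=timedelta(hours=1)):
    """Return deployments shipped within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    recent = [d for d in deployments if window_start <= d.deployed_at <= incident_start]
    return sorted(recent, key=lambda d: d.deployed_at, reverse=True)


def assign_roles(responders, roles=("commander", "communications", "operations")):
    """Pair each role with a responder in order; leftover roles stay unassigned (None)."""
    # The role names are assumptions for illustration, not an official role list.
    padded = list(responders) + [None] * max(0, len(roles) - len(responders))
    return dict(zip(roles, padded))


incident_start = datetime(2019, 1, 15, 14, 30)
deployments = [
    Deployment("billing", "v41", datetime(2019, 1, 15, 9, 0)),
    Deployment("checkout", "v102", datetime(2019, 1, 15, 14, 10)),
    Deployment("search", "v7", datetime(2019, 1, 15, 13, 45)),
]

suspects = suspect_deployments(deployments, incident_start)
assignments = assign_roles(["alice", "bob"])

print([d.service for d in suspects])
print(assignments)
```

In this sketch the morning `billing` deploy falls outside the window, so only the two most recent changes surface as candidates, with the newest listed first. A real system would presumably weigh far more signal than timestamps alone.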
Built by SREs
As we surveyed the landscape and spoke to SREs at startups and corporations, our friend Ed Burnett, formerly VP of Engineering at Namely, introduced us to Bobby Ross, who was at the time building an open source SRE tool during his nights and weekends. As an SRE himself at Namely, Bobby was familiar with all of the problems with current tools and processes, from production databases being dropped to Kubernetes upgrades gone wrong, and was determined to solve them with FireHydrant. Together with Dan Condomitti as Head of Engineering and Dylan Nielsen as founding engineer, the team is passionate about streamlining incident response.
Bobby is well respected and well known in the open source community, having spoken at conferences like the Spinnaker Summit and OSCON and even co-authored a book on gRPC. In addition, in the few months since we started working together, Bobby has already demonstrated incredible hustle, speaking at our Work-Bench Dev Lunch, the NY Enterprise Tech Meetup, and a Chaos Community Day we hosted. We’re excited to watch him and the team continue to lead and shape the SRE space.
The Journey Ahead
At Work-Bench, we’ve invested in companies like Algorithmia, CoreOS, CockroachDB, Scytale, and Dialpad that bring technology expertise and practices from web-scale giants to the masses. We couldn’t be more excited to partner with Bobby, Dan, and Dylan at FireHydrant to help companies of all sizes improve their operations using Site Reliability Engineering.