3 CloudOps Companies That Want You To Destroy Kubernetes in Prod

I want them to work together so we do it well.

Molly Sheets
15 min read · Apr 14, 2022

In the last month, I investigated the portfolios of newer companies in devops and liveops because I had a hunch something interesting was happening in the world of reliability — is chaos engineering improving? And what does that look like? (Hint: Cultural innovation alongside incremental innovation).

I narrowed to 3 mid-stage startups whose blogs, demos, and technical documentation walked through what I believe teams need in 5 years to bring down production, repeatedly, and get it back up as fast as possible. If I searched for an answer and it did not exist, the company never made the list for a deeper dive. My requirements: entities must have containers and Kubernetes references. They must work with Terraform, could not have a “dated stack” (e.g., Jenkins) as a focus, and should be running at difficulty.

To lower the recovery time and recovery points of infrastructure as a function, enterprises need (1) a liveops company that will help them see what’s on fire in real-time (for example, when did our containers start crashing in Kubernetes), (2) a company that will help them put it out through changes in infrastructure as code, and (3) a cloud-agnostic CICD solution that manages stack deployment pipelines to get infrastructure out the door. And reliability engineering teams specifically need that flywheel working well. Sighs in former Amazonian.

A Note on The DevOps Hiring Gap

I focused on ops because retaining specialist devops talent, improving CICD through operational excellence, and observability are fundamentally hard problems affecting the future of reliability. In a 2020 study in AIthority of 175 hiring managers, 93% reported difficulty finding devops talent with open-source skills, with 74% offering to pay for certifications. GitLab in ’21 shared that devops will grow 122% in 5 years. Between August ’20 and August ’21, a 12-month period, this skillset had 300K jobs opened in the US alone. A quick search on LinkedIn shows ~145K devops roles open in the US today. Addressing this gap will be a new wave of students learning the future of real-time 3D while picking up infrastructure as code for the first time, but not in the way one would think. It’s going to be through C#, Python, and Go.

I feel lucky to witness this talent group coming online. It doesn’t surprise me Perforce acquired Puppet no matter how you feel about Perforce or Puppet because the people on both sides are valuable. Try hiring a team of 10 right now with devops skillsets and see what happens. I don’t want to be you because I am you — you’re competing to hire me hiring those people along with Nvidia, AWS, Microsoft, META, GCP, Unity, Unreal, Apple, a bunch of new startups, and the government. And we travel in packs because we’re cynics who don’t trust companies to build cultures of psychological safety.

When I say “If Honeycomb, Pulumi, and Spacelift created their own cloudops roundtable, I would be lucky to be in the room,” I mean it. They see the outcomes on a deep level. They are building that future of cloudops — where we fight the tools less, fix the problems more, and it is fun to put out fires.

Honeycomb ( https://www.honeycomb.io/ )

Meet Honeycomb, an observability solution that cares how engineers approach reliability and problem-solving, not how the data forces them to co-exist. I consider them to fall in liveops. You may be looking at your Elasticsearch-Datadog-PagerDuty situation and saying “I can’t look at another.”

When I met with Honeycomb’s Amy Davis, Head of Games Vertical, at GDC, I remember saying candidly, “I don’t need to be convinced this company will do well. No one wants to build this and everyone hates what they have.” I love everything this company is doing to make liveops easier. The lift to have “full” observability is significant. The moment one realizes 100% visibility into container orchestration is a moving target, tools like Honeycomb make that utopia believable even if good Kubernetes monitoring is early-stage Asgard.

For teams tasked with site reliability and incident management, visibility is never enough. Sometimes engineers do not know where the calls come from, even if they are lucky enough to have raw data structured well enough to be usable in volume. Knowing what to collect and how is an exercise in itself, let alone properly showing business impact. You end up drowning.

Imagine an incident response tool that has service level objective (SLO) alarms built-in and when you investigate the incident, you know what third-party API call raised what alarms from inside Kubernetes. You understand who it was affecting and the business impact. Imagine you can build custom queries against high cardinality and high dimensionality data (data with rich context, complex attributes) in that same platform. The kind of charts you wish you had for your postmortems and RCA retrospectives.
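To make the SLO idea concrete, here is a minimal sketch in plain Python. It is not Honeycomb's API or query language. The event fields (`customer_id`, `upstream_api`, `duration_ms`), the 1000 ms threshold, and the 99% target are all hypothetical, chosen only to show how wide events with high-cardinality fields let one question ("are we breaching?") lead straight to the next ("which call, and who is affected?").

```python
from collections import Counter

# Hypothetical wide events: one dict per request, keeping a
# high-cardinality field (customer_id) next to the timing data.
EVENTS = [
    {"customer_id": "c-101", "upstream_api": "payments", "duration_ms": 120},
    {"customer_id": "c-102", "upstream_api": "payments", "duration_ms": 2300},
    {"customer_id": "c-103", "upstream_api": "inventory", "duration_ms": 90},
    {"customer_id": "c-102", "upstream_api": "payments", "duration_ms": 1900},
    {"customer_id": "c-104", "upstream_api": "inventory", "duration_ms": 110},
]


def slo_report(events, threshold_ms=1000, target=0.99):
    """Summarize compliance with a latency SLO and break down the failures."""
    if not events:
        raise ValueError("no events to evaluate")
    # An event is "bad" when it is slower than the assumed threshold.
    bad = [e for e in events if e["duration_ms"] > threshold_ms]
    compliance = 1 - len(bad) / len(events)
    return {
        "compliance": compliance,
        "breached": compliance < target,
        # Which third-party call is behind the slow requests?
        "bad_by_api": Counter(e["upstream_api"] for e in bad),
        # Who was affected?
        "affected_customers": sorted({e["customer_id"] for e in bad}),
    }


report = slo_report(EVENTS)
```

With this sample data, the report points at the `payments` call and a single affected customer, which is the kind of answer a postmortem chart needs.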

This demo is not gated, because they understand the value of your time and how you make decisions: https://play.honeycomb.io/quickstart/datasets/tracing-tour/

Demos for Honeycomb aren’t behind annoying contact forms — you can debug a real production incident in a browser without having to create an account because they are transparent. They want to show how quickly they triage their own stack through dogfooding. I also recommend debugging a scenario where support opened a ticket for slow API calls and yet no alerts triggered. They didn’t ask me to write this — I’m very happy they naturally understand customers hate giving up so much personal information to decide if their time is worth giving as well.

The Team

I hate building on things when I think a company may fail because of terrible leadership. Honeycomb is the opposite. Imagine my surprise, as a woman who has been working in infrastructure and games, to find that Honeycomb has a diverse leadership team already instead of as an afterthought. CEO Christine Yen and CTO Charity Majors, both former engineers at Parse (acquired by Facebook, now META), bring some serious technical and business acumen, but the rest of their team’s posts ensure my LinkedIn feed is not just a bunch of dudes talking about their latest blockchain venture.

Honeycomb recently raised a $50 million Series C round (Total $96.9M) to grow their customer programs, invest further in OpenTelemetry, and expand their toolchain partnerships with vendors like Cloudflare, LaunchDarkly, and CircleCI. They also added a Staff Engineer as an employee representative to their Board of Directors. Perhaps a company can simply care about being good.

The Mission & The Tech

Their Director of Solutions Architecture, Pierre Tessier, has intentionally brought down prod on a live stage and the team walks around in “I ❤ prod” shirts. If the mission isn’t to make sure you feel like you have visibility into how well your ship runs, then let it be the psychological safety their culture brings a team.

Currently, Honeycomb focuses on serving games customers by building observability features that hit at industry-specific challenges like slow builds, failed load tests, and all the fun that can actually go wrong in game development and liveops, but that is not where they plan on stopping. Honeycomb has customers like CCP, who uses them for distributed tracing and Behaviour Interactive, but I’m mostly impressed by this quote from Sr. Staff Software Engineer, Frank Chen, at Slack, “We implemented our first cross-service Honeycomb trace in the middle of the second day of a multi-day cascading failure incident. Two hours later, the incident was over — and the team could focus on fixing it and moving on.” I’m not sure if I’m more impressed by Honeycomb or Slack in that story or that it even matters.

Honeycomb is available on the AWS Marketplace with deep integration into Amazon CloudFront, AWS Lambda, AWS CloudTrail, Amazon S3, you name it, which is great if AWS is your thing. Honeycomb also supports OpenTelemetry distros for Java and .NET. I’m specifically interested in the open-source SDKs and collection agents for Kubernetes, because if there is one question I still get, it’s “Molly, how can I get more visibility into my Kubernetes stack?” I was excited to learn there is a Kubernetes Agent and you can install it as a Helm package. The agent runs as a daemonset (one agent per node). There are also instructions for how to get log fields from Nginx into Honeycomb, because they knew you’d look for it. Add that to my to-do.
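To show what "log fields from Nginx" means in practice, here is a minimal sketch that turns one access log line into a structured event. It assumes nginx's default `combined` log format and uses only the Python standard library. It is not the Honeycomb agent's parser, and the sample line is made up.

```python
import re

# Regex for nginx's default "combined" access log format:
# $remote_addr - $remote_user [$time_local] "$request" $status
# $body_bytes_sent "$http_referer" "$http_user_agent"
COMBINED = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    event = match.groupdict()
    # Numeric fields are converted so they can be aggregated later.
    event["status"] = int(event["status"])
    event["body_bytes_sent"] = int(event["body_bytes_sent"])
    return event


# Made-up sample line for illustration.
sample = (
    '203.0.113.7 - - [14/Apr/2022:10:15:32 +0000] '
    '"GET /api/v1/players HTTP/1.1" 502 157 "-" "curl/7.79.1"'
)
event = parse_line(sample)
```

Once each line is a dictionary of named fields, questions like "which paths return 502s" become simple filters instead of grep sessions.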

Summary

I’ve barely scratched the surface and included what resonated with me to convince you they are worth your time. Take an hour someday and do one of those demos. Let me know what you think.

Pulumi ( https://www.pulumi.com/ )

I’ve never fallen for a company’s go-to-market trap more than I have Pulumi’s. While I love devops, I learned Python before I ever learned infrastructure as code trying to build a game analytics stack from scratch. There is a whole group of “client-side talent who got trapped by the term full-stack” that knows C#, C++, and other languages but not infrastructure as code. They will more easily pick up IaC if their teams standardize on languages or they see a way to learn it that resonates. Terraform by HashiCorp may now come later for developers that start in game engines before the cloud.

But if you DID start in cloudops and backend development, it doesn’t matter anyway, because Pulumi already knew you would think “But how will I ever migrate all my Terraform projects?” and built tf2pulumi, an open-source command-line tool, for just that. They also built that matrix you’re gonna need for your boss comparing Pulumi and Terraform.

I hate the term disrupting. So instead, I’ll say Pulumi may literally refactor the future of infrastructure as code. I think a lot about standardization across teams. Add multi-hybrid-cloud problems to that and you get fun questions like “how do we get everyone using the same tags when we need to but then also have custom tags specifically for some stacks? And how do we make sure as a central team that observability for IaC deployments is managed in one place?”
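One way to picture the tagging question is a small helper that a central team could share across stacks. This is a plain Python sketch, not Pulumi's API. The required keys (`owner`, `cost-center`, `environment`) and the function name `build_tags` are hypothetical.

```python
# Hypothetical org-wide tag keys that every stack must end up with.
REQUIRED_TAGS = {"owner", "cost-center", "environment"}


def build_tags(org_defaults, stack_tags):
    """Merge org defaults with stack-specific tags and enforce required keys.

    Stack tags may add custom keys or override defaults, but the merged
    result must contain a non-empty value for every required key.
    """
    merged = {**org_defaults, **stack_tags}
    present = {key for key, value in merged.items() if value}
    missing = REQUIRED_TAGS - present
    if missing:
        raise ValueError(f"missing required tags: {sorted(missing)}")
    return merged


org_defaults = {"owner": "platform-team", "cost-center": "cc-42"}

# A stack supplies the remaining required key plus its own custom tag.
tags = build_tags(org_defaults, {"environment": "prod", "game": "farm-sim"})
```

Because IaC in a general-purpose language is just code, a helper like this can live in one shared package, which is where the "managed in one place" part of the question starts to look tractable.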

This means companies working in infrastructure as code must approach solutions from an enterprise problem space that doesn’t force assumptions on customers. Standardization across organizations is hard to achieve, yet without it, centralized observability suffers across devops, liveops, security, and finops teams as a cohesive function. Yikes.

The Team

Pulumi cares a lot about thought leadership — both to and as a community. The biggest barrier to infrastructure as code? Learning to write it. When I looked for more information about their executive leadership, I managed to find more about how to build things than what their c-suite was up to. I tried to google “Pulumi” + “news” and “Joe Duffy” (Pulumi’s CEO, formerly of Microsoft) and Luke Hoban (Pulumi’s CTO, formerly Principal PM on AWS’s EC2 team and Principal Engineer on VS Code at Microsoft). I looked for raises and executive interviews, but it seems that, shocker, serving their actual customers is more important than talking about themselves. Nice!

It does not surprise me that Pulumi focuses a lot on documentation and teaching. They are also a remote-first company that lists their leadership principles at the top of the About Us page followed by, similar to Honeycomb, a list of team members unorganized by hierarchy. Nice!!

This flywheel gave you away, you ex-Amazonians.

I appreciate how granular and deep their team gets on specific examples. For instance, because I love dogfooding, check out the blog post “How We Manage GitHub at Pulumi with Pulumi,” written for Go by Guinevere Saenger, Software Engineer of Platform Integrations.

The Mission & The Tech

I know there are a lot of non-believers: “But Terraform has all the market!” I’m not sure taking over Terraform is a real goal. Pulumi appears to want to unify software engineering, security, and infrastructure teams so they can all talk to each other for once. It’s a people mission as much as it is a suite of build, deploy, and management tooling. They allow you to write infrastructure as code in .NET, JavaScript, Python, and Go, on any cloud (AWS, Azure, GCP) and 60 providers through a consistent SDK.

Screenshot from a talk by Mikhail Shilkov and Paul Stack, both Software Engineers at Pulumi, on Managing Any Cloud (including Azure, AWS, and on-prem Kubernetes) with .NET. You can watch it here.

The fact that you could have both your IaC and your Unit Tests as C# is having your cake and eating it too. And given that Visual Studio IntelliCode is improving its AI-assisted development, I cannot wait to see where this future goes to compress development time. Tack Pulumi under the category where the ROI in 5 years will be giving back time across teams because of recursive innovation when those templates from customers start hitting GitHub.

On top of it, they enforce compliance and drift detection as well as secure access and policies. I’m a HUGE fan of companies with guardrails for IaC that integrate well with multiple SSO providers. Pulumi focuses on Policy as Code (which they call “CrossGuard”) as governance that can be used across stacks to lock down people from doing things they are not supposed to do.
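The idea behind policy as code can be sketched without any vendor SDK. The example below is plain Python and is not the CrossGuard API. The resource shape, the two policy functions, and the "mandatory" and "advisory" labels are assumptions made for illustration. Each policy is just a function that inspects a declared resource and returns a violation message or nothing.

```python
def no_public_buckets(resource):
    """Flag buckets declared with a public-read ACL."""
    if resource["type"] == "bucket" and resource["props"].get("acl") == "public-read":
        return "buckets must not be publicly readable"
    return None


def require_owner_tag(resource):
    """Flag any resource that has no 'owner' tag."""
    if "owner" not in resource["props"].get("tags", {}):
        return "resource is missing an 'owner' tag"
    return None


# Hypothetical enforcement levels: mandatory violations block a deployment,
# advisory ones are only reported.
POLICIES = [
    ("mandatory", no_public_buckets),
    ("advisory", require_owner_tag),
]


def evaluate(resources):
    """Run every policy against every resource and collect violations."""
    violations = []
    for resource in resources:
        for level, policy in POLICIES:
            message = policy(resource)
            if message:
                violations.append((level, resource["name"], message))
    blocked = any(level == "mandatory" for level, _, _ in violations)
    return blocked, violations


# Made-up stack declaration for illustration.
STACK = [
    {"name": "assets", "type": "bucket",
     "props": {"acl": "public-read", "tags": {"owner": "liveops"}}},
    {"name": "logs", "type": "bucket", "props": {"acl": "private"}},
]

blocked, violations = evaluate(STACK)
```

Here the public `assets` bucket blocks the deployment while the untagged `logs` bucket only produces a warning, which is the guardrail behavior the paragraph above describes.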

The final area of note is how much documentation they have on Kubernetes. Everything from adopting Pulumi to examples for AKS, getting started tutorials to deploy Kubernetes with an NGINX web server using Pulumi’s Cloud Native SDK, all the way to the Pulumi Crosswalk for Kubernetes to deploy to any cloud (AWS, Azure, GCP, or private) with a production-ready playbook focused on security, governance, and CICD. NICE!!!

Summary

I feel like this company cares a lot about my time. Their documentation is, on the surface, good. Answers are not hard to find. There is still much I would like to do beyond what remains theoretical and so while I can praise what I’ve discovered, the reality is I need to actually deploy Kubernetes to prod using Pulumi concepts over other IaC solutions. Add that to my to-do.

Spacelift ( https://spacelift.io/ )

Spacelift is a flexible CICD for infrastructure as code. Spacelift’s management platform enables cloudops teams to customize workflows, automate tasks, reduce errors, and improve auditability of infrastructure with a focus on centralizing all of this as a problem-space. My immediate thought was “Wait. Does this help when you are a centralized team serving internal customers who still want to be autonomous?” Yes.

Like Pulumi, they are also big on policy as code (built on Open Policy Agent), but their tooling focuses on declaring rules around access, Git workflow, state changes, and project relationships (e.g., drift). Where Pulumi lets you write your stacks in whatever you want, Spacelift is more about automating the running of Pulumi stacks (e.g., you don’t need to make your own CI pipeline) or connecting several stacks. Ultimately, Spacelift helps with killing drift as it can detect it and help you remediate it.
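Drift is easiest to understand as a diff between what the code declares and what actually exists. The sketch below is plain Python with made-up resource names. It is not how Spacelift implements drift detection, only the underlying comparison.

```python
def detect_drift(desired, actual):
    """Compare declared state with observed state, keyed by resource name."""
    drift = {}
    for name in desired.keys() | actual.keys():
        want, have = desired.get(name), actual.get(name)
        if want is None:
            # Exists in the cloud but is not declared in code.
            drift[name] = "unmanaged"
        elif have is None:
            # Declared in code but no longer exists in the cloud.
            drift[name] = "missing"
        elif want != have:
            changed = sorted(
                key for key in want.keys() | have.keys()
                if want.get(key) != have.get(key)
            )
            drift[name] = "changed: " + ", ".join(changed)
    return drift


# Hypothetical declared state (from code) and observed state (from the cloud).
desired = {
    "web": {"replicas": 3, "image": "nginx:1.21"},
    "db": {"size": "large"},
}
actual = {
    "web": {"replicas": 5, "image": "nginx:1.21"},
    "cache": {"size": "small"},
}

drift = detect_drift(desired, actual)
```

Remediation is then a choice per entry: re-apply the code, import the unmanaged resource, or update the declaration to match reality.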

You don’t have to use Pulumi, but you can. Write your stacks in whatever you want, then manage those CI pipelines in Spacelift, across all your clouds because I’ll never hear the end of how much some of you (who I adore with all my heart) just don’t trust going all-in on any provider no matter how awesome I’ve made AWS out to be…and I’m here to say, you can have your cake and eat it too because, remember, cynics travel in packs.

I need Ryan Cartwright to stop posting architecture diagrams on LinkedIn that fit the narrative I want to believe because they are full-on distracting me from forcing myself to try self-managed Kubernetes.

The Team

I would consider Spacelift the smallest, but still mighty. I’m hoping Pulumi and Spacelift don’t inadvertently cannibalize each other. The problem space for stack governance is, while intertwined with IaC, big enough for a team to focus on. Moving fast to deploy is challenging as stacks are complicated. Plus, Marcin Wyszynski, Spacelift’s CPO (formerly of Google, META, and startups), brings a perspective that only comes from working at companies both giant and small: this is way too complicated. I especially appreciated him listing “creature comforts” as a product direction. We do a lot of invent, not enough simplify.

In a more dynamic environment, where microservices come and go, new environments proliferate and new product teams require their own Terraform workspaces. The need to configure it each and every time become a major nuisance, putting a lot of pressure on your DevOps team. — Marcin Wyszynski

Marcin talks about the vision for Spacelift’s CICD pipeline in a great podcast, with the north star for the company being that Spacelift “centralizes everything about your IaC system: it runs code, deploys within CI/CD pipelines, tracks the progress of your infrastructure, and gives you insight into who made what changes and why.” This is a much-needed promise to deliver.

Spacelift has raised $22.6M to date, with their most recent Series B of $15M in Q3 ’21. Compare this with Pulumi’s total raise of $57.5M ($37.5M Series B in ’20) and you can see interesting directions for these two companies: who is conquering what for customers between enterprise and SMB. It is okay to be the company focused on simplifying CICD usability while another company gets granular on IaC. Personas vary. I hope they work together more.

Spacelift, this diagram is perfect — except we need to have a talk about the stick figure.

The Mission & The Tech

Spacelift created a CloudFormation-like interface that works with multi-cloud problems. Their tooling focuses on building intuitive workflows and does not depend on PRs, but instead on Git push and tag events. Reported outcomes allow developers to do magic with commit status checks, block merging code, and automate triggering via Git push policies. This is all great. I don’t like reinventing wheels, and the world needs a big rig that supports varied cargo.

I really appreciated this free 2-minute demo that doesn’t force me to talk to a person first. It automatically supports GitHub, GitLab, and Google SSO. An example of how they value my time: https://bit.ly/3uFntfG

Spacelift stacks as seen above are core to the product. Conceptually they connect source code for your infrastructure as code, for example, Terraform, to managed resources. You can then apply governance on top of the deployment of these stacks. Worker nodes control the CI pipeline — Spacelift can manage them for you or you can manage them yourself.

A good overview of how stacks work for Spacelift.

Spacelift’s policies, their policy as code features, give you places to start on governance of your CICD pipelines. For example, you may want to govern whether a user can log in from an office VPN and decide who can be let into Spacelift based on groups like devops, contractors, or admins.
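The login example above boils down to a small decision function. Policies built on Open Policy Agent are written in Rego, so the Python below only illustrates the logic and is not Spacelift's policy syntax. The VPN range, the team names, the session shape, and the outcome names ("admin", "allow", "deny") are assumptions for this sketch.

```python
import ipaddress

# Hypothetical office VPN range and team groupings.
OFFICE_VPN = ipaddress.ip_network("10.20.0.0/16")
ADMIN_TEAMS = {"admins"}
STAFF_TEAMS = {"devops", "admins"}
CONTRACTOR_TEAMS = {"contractors"}


def login_decision(session):
    """Decide a login outcome from the caller's IP address and teams."""
    on_vpn = ipaddress.ip_address(session["ip"]) in OFFICE_VPN
    teams = set(session["teams"])
    # Contractors may only log in from the office VPN.
    if teams & CONTRACTOR_TEAMS and not on_vpn:
        return "deny"
    # Admin rights are granted only when an admin is on the VPN.
    if teams & ADMIN_TEAMS and on_vpn:
        return "admin"
    # Any other known team gets regular access.
    if teams & (STAFF_TEAMS | CONTRACTOR_TEAMS):
        return "allow"
    # Unknown teams are rejected by default.
    return "deny"


decisions = [
    login_decision({"ip": "10.20.4.9", "teams": ["admins"]}),
    login_decision({"ip": "198.51.100.3", "teams": ["contractors"]}),
    login_decision({"ip": "198.51.100.3", "teams": ["devops"]}),
]
```

The order of the checks matters: the deny rule for contractors runs first so that no later rule can accidentally let them in from outside the VPN.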

Spacelift’s Resources features are interesting. I’m fascinated by visualization that lets users understand what’s been written. The more we can see ideas for cloud-agnostic or hybrid conditions, and DAG or acyclic-like workflow visualization, the better. There is room for growth in CICD visualization.
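Before a workflow can be drawn as a DAG, it has to be ordered as one. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to group hypothetical stacks into waves that could be deployed in parallel. The stack names and dependencies are made up, and this is not how Spacelift models stack relationships.

```python
from graphlib import TopologicalSorter

# Hypothetical stacks, each mapped to the stacks it depends on.
DEPENDS_ON = {
    "network": set(),
    "cluster": {"network"},
    "database": {"network"},
    "game-servers": {"cluster", "database"},
}


def deployment_waves(depends_on):
    """Group stacks into waves; every stack in a wave can run in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies finished.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = deployment_waves(DEPENDS_ON)
```

Each wave maps naturally to a column in a pipeline diagram, which is one simple starting point for the kind of visualization described above.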

For the containers box, Spacelift can be used with ECS or EKS, but they also recently added documentation for kubectl. Anything that can be run via kubectl can be run within a Spacelift stack. I appreciate their Getting Started guide and hope more examples continue here.

One final callout for Spacelift is that they have both a managed worker pool they host and operate for their CI and a private worker pool solution with asymmetric encryption, where you can manage your private keys. I appreciate this. People in enterprise hate when other people touch their stuff.

When I think about AWS, I think a lot about AWS Organizations and multi-account governance through service control policies (SCPs). I worry specifically about managing governance rules, policies, and roles for stacks outside of AWS Organizations and landing-zone-like tools (Control Tower). I would love to see more material around this, in particular for those using private worker pool models, as managing roles and policies in multiple places increases complexity rather than decreasing it. “Which rules go where” is a question I received a lot as a solutions architect. Pages like this were helpful to me. Clarity on when to use Spacelift over AWS CloudFormation StackSets managed by AWS Organizations, and what I don’t lose by doing so, would also be helpful if an enterprise organization has defined structures for organizational units, SCPs, and roles already. This is critical for enterprise adoption.

Summary

I see promise for Spacelift because I understand the CICD void they are filling and just how uniquely hard it is to get everyone on the same page at multi-billion dollar companies. I also see a lot of adoption challenges as cloud providers try to solve the problem themselves in enterprise-targeted services. Many enterprises have DIYed their way through stack governance and they will not want to let go of what they have. Skepticism grows with abstraction. CICD for cloud infrastructure is a hard space to predict, and perhaps more valuable to simplify than most of the conversations around the metaverse.

In Conclusion

I do not work for the companies, and I hope reading has encouraged you to investigate purely out of interest to understand what they address. I hope to go beyond the theoretical when I find my way out of self-managed Kubernetes to deploy things in these tools soon. It will take a while for teams to evaluate the promised future here, that flywheel, and my DIY enterprise colleagues may not commit for years to the trends unfolding before our eyes because adoption is hard. The volume of challenges is enough to keep us all busy.

It’s been fun, and while I may not end up at one of these companies (who knows?), you can bet I will be keeping an eye on them.

Molly Sheets

Director of Engineering, Kubernetes @Zynga | Former Principal SA, Enterprise Games & Principal PMT, Spatial @AWS | 25 Releases | 15 yrs in tech | ❤s CloudOps