A Year Of Running Kubernetes at MYOB, And The Importance Of Empathy.
Outcomes first: in a year, we had 100 production apps across 26 tribes running on our Kubernetes clusters; we drove ISO 27001 compliance across 12 products, and taught 200 developers how to build secure, production-ready apps. And we built this on a foundation of empathy, mentoring, and a belief in propagating change starting from developers up.
Oh, and we were the finalists for a company team award. I like awards.
This is how we did it.
I’m a big believer in judging the culture and values of a system by the artifacts it produces, so I’m going to outline what we had at the start of the year vs. what we have now. In January 2018, we had a Kubernetes cluster that provided easy TLS through annotations (and 3 apps). By December 2018, we had:
- An internal product known internally as “The Jupiter Platform”
- Single sign-on for metrics, logging, and dashboarding services
- Easy provisioning of databases and queues through operators
- Automated backup and restoration services for developers
- ISO 27001 compliance
- Training programs (synchronous and self-service) for developers
- …and easy TLS through annotations
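To make “easy TLS through annotations” concrete, here is a sketch of what that can look like, assuming a cert-manager-style controller watching Ingress resources — the issuer name, hostname, and annotation key are illustrative, not necessarily what Jupiter uses:

```yaml
# Illustrative only: assumes a cert-manager-style controller is installed.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    # One annotation is enough to request a LetsEncrypt certificate.
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts:
        - my-app.example.com
      secretName: my-app-tls   # the controller provisions and stores the cert here
  rules:
    - host: my-app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
```

The point is the shape of the developer experience: one annotation, and certificate issuance and renewal happen behind the scenes.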
The 5-person crew runs on a couple of foundational principles:
We are not a DevOps crew. While we use DevOpsy tools and paradigms, we primarily operate and surface ourselves to the organization as a product-platform crew. We’ll unpack this later.
Our work is not in clusters, but change. We develop on Kubernetes not as a container scheduler, but as an organizational policy driver. Again, unpacking later.
Our processes start from empathy: both proactive and reactive work begin with actively interrogating what our clients (effectively, development crews) need.
What problem were we trying to solve?
Many organizations have adopted the Spotify model as a way of structuring an organization to optimize for quicker delivery and autonomy (while accepting and compensating for creating silos). And while I agree with its tenets, it produces a couple of problems especially in the DevOps space.
When implementing DevOps across a Spotify-ish organization, organizations usually create a DevOps role per team or tribe. While this creates autonomy by removing coupling from a central team, it introduces practice explosion: you always end up with many different ways of doing the same thing.
This explosion creates problems when the need for an organization-wide initiative arises: something that commonly happens in the infrastructure space. Off the top of my head:
- Standards compliance (e.g., ISO 27001, IRAP, SOX). Compliance requires enumerating how you do things — which becomes a big problem when you have many ways of doing things.
- Infrastructure cost accounting (and reduction). A delivery team’s primary concern is delivery, and it is often hard to relay the importance of efficient software.
- Reducing the friction of moving teams. When you have many ways of doing things, you end up with non-transferable knowledge when people move teams. I find it absurd for developers to have to relearn everything when they have only moved a couple of desks away.
The idea by late 2017 was that we would create a Kubernetes cluster, and have people move to it. I didn’t particularly agree with this, as this sounded like a technological fix to an organizational issue. You do not just build it and have people come.
However, we agreed that practice convergence in the infrastructure space was at least preferable. While I don’t hold any particular religious zeal towards K8s, it was more important that the organization picked a solution, and converged upon it as a deliberate choice.
From here on, we’ll unpack how we attempted to solve this problem in three areas: Adoption, Product Thinking, and Policy.
Adoption
It’s important to highlight one thing: we don’t use standards as a way to drive technology choices — e.g., “Everyone has to use X!”. My crew effectively has no authority or mandate to drive cross-team decisions. We start from the assumption that the most effective way to propagate change is for people to understand why and how we can help them, then they can make their own decisions afterwards.
With this premise, the problem becomes easy: Adoption is a function of cognitive load. To increase adoption, how do we reduce that cognitive load?
Setting up better conversations
We noticed that having conversations with teams about using our platform was difficult: we’d have to explain to them that we use Kubernetes, explain what Kubernetes is, then explain automated provisioning of LetsEncrypt and the fact they get log forwarding through Sumologic, then explain those terms again. It’s hard enough to explain that to developers — imagine explaining it to someone unfamiliar with DevOps.
To remedy this, we rebranded the cluster as the “Jupiter Platform”: a platform that removes the need for teams to set up what previously took them 14 weeks. We talked less about what the platform is composed of, and more about the value it provides.
If we wanted to reduce the cognitive load for developers, then we would need to find out what that load is composed of. So we started by advertising Business Partners from our crew to pair with development teams — with the sell that we would aid them with their infrastructure setup in Jupiter.
A side-mission for our crew members, however, was to interrogate what the onboarding process was like: which parts were easy, and which were hard?
From here, it was easy to assemble a story map of the steps that a developer would need to productionize their application, and how long each block took:
Once we identified what the pain points were, it became easy to improve things. This included allowing for self-service, structuring documentation to cover foundational aspects, and making templates and runsheets for teams to consume.
What ended up being the longest part of the onboarding process was a surprise: it was developers being intimidated by the fact that they had to relearn a new, different thing with an entirely different lexicon. Containers are ‘pods’ now. Load balancers are ‘services’ now.
Note that intimidation is distinct from a lack of want: most developers are keen to learn new things, provided they have someone to help them out.
So we created workshops that covered everything a developer would need to know to develop and deploy on the Jupiter Platform.
We ran these in classes of 10 people each, with team members at hand ready to teach, with iterations based on feedback. By the latest iteration, we taught developers Kubernetes concepts, Dockerization, pipelines, monitoring, alert routing, and rollout strategies in 90 minutes.
And honestly, it’s not that hard. The latest version of it thus far is simple: a couple of markdown files, a couple of YAML files, with no slides. Just whiteboard markers: people don’t want to memorize as much as they want to understand how everything glues together.
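To give a flavour of those workshop YAML files, here is a minimal, illustrative pair of manifests of the kind a developer would meet: a Deployment (which runs your containers, as pods) fronted by a Service (which load-balances across them) — the names and image are made up for the example:

```yaml
# Illustrative workshop-style manifests; names and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-app
spec:
  replicas: 2                 # two pods, scheduled for you
  selector:
    matchLabels:
      app: hello-app
  template:
    metadata:
      labels:
        app: hello-app
    spec:
      containers:
        - name: hello-app
          image: registry.example/hello-app:1.0.0
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: hello-app
spec:
  selector:
    app: hello-app            # routes to any pod with this label
  ports:
    - port: 80
      targetPort: 8080
```

Once developers see that the new lexicon maps onto things they already know — containers to pods, load balancers to services — most of the intimidation evaporates.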
Soon enough, we were teaching 150 developers in a quarter.
Product Thinking
Let me clarify: what I mean by product thinking isn’t holistic so much as a surface-level mimicry of how good products are surfaced to consumers. I’m focusing on understanding and delivering the value you can bring to clients vs. Real Product Management stuff. I haven’t the faintest idea what PMs do.
Understand baseline expectations.
There are minimums that consumers expect of any service: (a) that things work most of the time, (b) that they have someone to talk to if things aren’t working, and (c) that they know what’s happening when things go pear-shaped.
These translate cleanly into SLAs, support mechanisms, and transparency. Overall, this took the shape of a ‘contract’ that we have with development teams:
One of the hardest notions to dispel when running an internal platform is that whole businesses have dedicated themselves to what you’re running, and your version must therefore be of lower quality. With our ‘contracts’ in place, we have at least some degree of clarity about what people can expect of us.
What developers really want.
We took the operator route when developers started wanting specific features. Want a database? Here, have a Postgres operator. Oh wait, did I mention that the backup, maintenance, and alerting are all done for you, that it’s in a private subnet, and that admin, write, and readonly credentials and secrets are already made for you?
Yep. And there’s only one way to do it.
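As a hypothetical sketch of what requesting a database through an operator can look like — the API group, kind, and field names below are illustrative, since the real CRD schema depends on the operator in use:

```yaml
# Hypothetical custom resource; the operator reconciles this into a real
# database in a private subnet, with backups, alerting, and admin/write/
# readonly credentials written into Secrets for the app to consume.
apiVersion: databases.jupiter.example/v1
kind: PostgresDatabase
metadata:
  name: invoices-db
spec:
  version: "10"
  storageGB: 50
  backups:
    schedule: daily
    retentionDays: 30
```

The developer declares *what* they want; the operator owns *how* it is built — which is exactly why there’s only one way to do it.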
The same applied to alert routing, log forwarding, and metrics servers. There’s a mass of undifferentiated heavy work that developers don’t really care for, and making all of that easy won us a lot of hearts and minds.
Delivery Leads need love too.
But the biggest change we’ve been able to effect came when we started catering to DLs and Program Managers. Sure, we can remove a multi-month block that developers used to spend setting up infrastructure (and a lot of people managers appreciate that), but the real love came when we started to take care of compliance evidencing work.
With the Jupiter Platform being a central point for deployments, it became easy to gather information and implement remedial changes en masse — something that is critical for things like ISO 27001, IRAP, etc. This meant that people managers didn’t need to document and allocate sprint space for it, and could focus on what they prefer to focus on: delivery.
Over time, that became part of our product offering. It became critical that key decision makers also got benefits from our platform.
Policy
And so we move into what I find most important in our work: the shift of focus away from creating technological capabilities and towards driving change throughout the organization. We see Kubernetes as less of a scheduler and more of a policy engine — in that key architectural choices are set implicitly by what the platform supports and what it does not.
As our Principal Developer Paul Van De Vreede says, “Make good things easy, and bad things a bit harder.”
Below, we’ll provide a couple of examples of how this is the case.
Case 1: How should people assemble alerts?
The Jupiter Platform provides glueless metrics aggregation (Prometheus) and log forwarding (Sumologic). Amid the divergence of practices, we found that some teams had wired their logging architecture into their alerting — mostly as a function of delivery pressure rather than deliberate choice.
Now this can be a problem, as logging is significantly more expensive than metrics, and is liable to cost and availability problems. While logging is good for forensic analysis, metrics are better for anomaly detection. We wanted to encourage teams to use metrics to drive alerting.
The “solution” was to create a feature that made alert forwarding via metrics as easy as adding a single line of YAML. Just define your API key and/or Slack channel, and off you go.
Organizationally, this resulted in teams making greater use of metrics.
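As a hypothetical sketch of that “single line of YAML” — the annotation keys below are invented for illustration, not the platform’s real ones:

```yaml
# Hypothetical annotation-based alert routing; keys are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  annotations:
    # One line to route metrics-driven alerts to a team's channel:
    alerts.jupiter.example/slack-channel: "#my-team-alerts"
```

When the metrics-driven path is one line and the log-driven path is a pipeline you build yourself, teams converge on metrics without anyone mandating it — the platform makes the good thing easy.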
Case 2: Infrastructure coupling?
I have opinions on infrastructure coupling which can be summarized to “pls no”. It introduces ambiguity in ownership and accountability, which superimposes itself into security problems in the long run.
The “solution” in this particular case was to take a position of no VPC peering at all. Because container schedulers don’t really allow SSH-proper, this forces interactions with databases to be app-only, which makes API development choices more deliberate.
Additionally, the creation of a queue operator puts receiving teams in a position where, if they have a non-Jupiter app that needs to communicate with a database, the only ingress (besides creating an API) is to decouple from the backend application with a queue.
What about all the other things?
Scheduled containers do not suit all workloads, and we make that explicit. Thankfully (credits: Jon Eaves) we have a Solution Options channel where development teams can surface planned architectural choices. Other teams (affected or otherwise) can weigh in — again, in the spirit of making deliberate choices.
We get to ask the question: If you’re not using the platform, then why? Let’s talk.
The importance of organizational nous.
Very few of these actions are technical, if at all. They are not even process solutions. Most of our actions centre around leadership of conscience and influence — if we define that as “the process of influencing others in order to gain their willing consent in ethical pursuits”. These are talky-bits, built with a deliberate focus on transparency, an active interrogation of what other people need, and a want of a better tomorrow.
“It’s not about making software. The real important work is setting up how we work in the future.” — Jason Dwyer, Architect.
I lack the words for it, but I feel that this craft, this skill, is something we vastly underestimate and undervalue. The ability to navigate an organization, know how it operates, and relay your values across it is important, if not critical to success.
That, and leadership that understands why what you do is important, and is willing to support you for it. I’m grateful that we have that at MYOB.
Why do I care about this too much?
Five years ago, I was stuck in a room with a manager who was irked by the fact that I was unhappy all the time. By that point, I was a lone sysadmin who had been oncall for 3 years in a row, always asking for more people to do the job with me. I said that I wanted to quit; he replied that no, people like me were a dime a dozen elsewhere. If I didn’t care, then they would find someone else quickly.
I wept then — frustrated by the fact that people did not understand that yes, I did care. I cared quite a lot about what I did. I just needed them to understand why I did, and why they also should.
It’s been some time since then, and I can only hope that I’ve gotten better at this. This is what we built, and this is why it matters. This is what we care about, and this is why it matters.
This is our team, these are our people, and this is you, and this is why we matter.