Why a Platform Engineer should lead your DevOps transformation

DevOps transformations are cultural transformations.

Mark Shipps
Slalom Build
9 min read · Oct 31, 2022

At Slalom Build, we approach software development with a DevOps mindset. It is central to my role as a Platform Engineer (PE) in our Slalom Japan market, as Platform Engineering itself evolved from companies adapting (or at least trying to adapt) to DevOps ways of working. Recently, I worked on a project that was a mixture of product engineering and DevOps transformation. This scenario is common in markets that are still early in their cloud adoption journeys, such as Japan. Companies rooted in a more traditional IT structure are seeking to modernize and embrace the advantages of cloud computing, and they want to see tangible results quickly in the form of a cloud-native product. So, my Slalom peers and I worked together to lead our client on a hands-on journey to understand DevOps while building its new cloud-native product.

Over months of collaboration, I gave many presentations on best practices and patterns, as well as sprint demonstrations of Platform Engineering concepts such as toolchains, cloud architecture, CI/CD, and Site Reliability Engineering (SRE). During these interactions, I drew links back to the core DevOps principles I had introduced in our initial planning days. The audiences for those sessions spanned roles across the entire business, and I consistently received feedback saying they could finally understand the essence of DevOps, and even how their own roles fit into the picture. Throughout this time, it became increasingly apparent to me that a PE is uniquely situated to be a leader in DevOps transformations. To understand why, we can recall The Three Ways of DevOps by Gene Kim and explore them through the eyes of Platform Engineering.

Gene Kim’s The Three Ways of DevOps

The First Way: Systems Thinking

The First Way tells us to consider the entire system rather than our own departments, so we are not “throwing work over the wall.” We want optimal throughput through the entire system, and it is important that we do not optimize locally in a way that harms other areas of the system. Ensuring that requires us to constantly learn as much about the entire system as we can. So how does Platform Engineering promote systems thinking?

At a high level, the answer might seem obvious, as a key focus of Platform Engineering is providing a good experience for all parties (especially developers) involved in getting code changes to end users. These code changes may come in the form of many different products that share the same platform. Therefore, a PE must know not only the nature of the work itself, but also how work will enter, advance through, and exit the system in all scenarios. Only then can we truly optimize that journey. Digging a little deeper, some perhaps less-obvious examples from my project include:

  • Focus on program management and tooling to find efficiencies. Having knowledge of the system allowed me to make the decision to treat all coding work equally (bugs, application development, infrastructure development, etc.). I could then integrate Jira with GitHub to tie issues to releases and link them to code commits. This helped team leads easily identify which stories or bugs were implemented in any given release or build artifact for change management and troubleshooting purposes. It also helped track the size of each release in terms of actual code changes, which assists in program planning and monitoring.
  • Realize quality and security engineering strategies. These typically manifest themselves in part as implementations in the code repositories, branching strategy, and CI/CD pipeline (e.g., automated testing or pull request controls). I learned new automated testing and security scanning software so that I could understand how and when to integrate them into our system. For instance, Veracode, which was chosen as our SAST tool, did not execute quickly enough to run as an always-run CI/CD pipeline task. As JetBrains points out, even though CI/CD relies heavily on automated tests, you should not aim to run every automated test all the time. Understanding the intent of the SAST scanning, the targeted scope, and the integrity of the code at each phase of the lifecycle enabled me to implement a solution that automated SAST while keeping pipelines fast.
  • Closely collaborate with application architects to design the cloud infrastructure of the platform used by the applications. On this project, I learned about .NET 6 microservices running in containers on AWS. Having a deep understanding of the application itself meant that I could ensure all the infrastructure was well-architected to fit our various use cases. This drove not only the selection of proper services for security and reliability, but also networking decisions, CI/CD decisions, IaC tooling decisions, and more.
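
To make the first example a little more concrete, here is a minimal sketch of the idea behind linking issues to commits and releases. It is not the integration used on the project: it simply assumes that commit messages mention Jira-style issue keys (such as the made-up `PLAT-101`), and the release tags, commit messages, and the `issues_by_release` helper are all hypothetical.

```python
import re
from collections import defaultdict

# Jira-style keys look like "PLAT-123": an uppercase project key, a hyphen, a number.
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def issues_by_release(commits):
    """Map each release tag to the sorted issue keys mentioned in its commits.

    `commits` is an iterable of (release_tag, commit_message) pairs. Both the
    shape and the sample values below are assumptions for illustration only.
    """
    found = defaultdict(set)
    for release, message in commits:
        found[release].update(ISSUE_KEY.findall(message))
    return {release: sorted(keys) for release, keys in found.items()}


# Made-up commit history: bugs, app work, and infrastructure work all look alike.
commits = [
    ("v1.4.0", "PLAT-101 add VPC endpoints for ECR"),
    ("v1.4.0", "APP-57 fix null check in order service (also PLAT-101)"),
    ("v1.4.1", "APP-60 bump logging library"),
    ("v1.4.1", "tidy README"),
]
release_notes = issues_by_release(commits)
```

Because every kind of work carries an issue key in the same way, a team lead can answer "what went into this release?" from one lookup, regardless of whether the change was a bug fix, a feature, or infrastructure.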

The Second Way: Amplify Feedback Loops

The Second Way emphasizes creating feedback loops from right to left in your system. According to Gene Kim, an outcome of this way is “responding to all customers, internal and external.” The idea is that quick feedback can be used to verify and validate the work that was done. When work is not aligned with the customer’s vision, fast feedback creates an opportunity to change course earlier. This saves time, effort, and cost and is a central theme of process optimization theory. An easy example of this in Platform Engineering is continuous integration, but here are some other ways I incorporated this on my project:

  • Collaborated with application and quality engineers to create production-like environments on the developers’ local computers. The fastest feedback a developer can get is when their code can be tested as they write it, without connecting to a remote environment or pushing to a CI/CD pipeline. I created automation that stood up the necessary infrastructure in various containers (using Docker Compose), complete with integration testing. While it is not practical to fully replicate the real production AWS environment this way, by working together we were able to create something that was similar enough to still give valuable feedback. This caught many issues before code was even committed, quickly and cheaply increasing the quality profile of the code repository at the lowest levels.
  • Integrated the system with application monitoring software using New Relic. For example, by working with the development team, I was able to colocate logs and metrics in our New Relic instance for a combined view of performance and error data. From there, I set up alert policies and notifications that contacted us when performance degraded beyond acceptable ranges or when errors were found in logs. Furthermore, I implemented deployment markers, which create release-specific metrics baselines for your application. This allowed us to quickly find anomalies by comparing performance across releases, which of course aids in error or performance-related troubleshooting. These operational feedback tools can be utilized in all environments (not just production) and are crucial pieces of the SRE strategy we employed.
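
The reasoning behind deployment markers can be illustrated with a small sketch. This is not how New Relic computes its baselines. It is only an assumed, simplified version of the comparison a marker makes possible: split a metric at the deployment time and check whether the newer release is noticeably worse. The `regressed` helper, the 20% tolerance, and the sample numbers are all hypothetical.

```python
from statistics import mean


def split_at_marker(samples, deployed_at):
    """Split (timestamp, value) samples into those before and after a deployment."""
    before = [value for ts, value in samples if ts < deployed_at]
    after = [value for ts, value in samples if ts >= deployed_at]
    return before, after


def regressed(samples, deployed_at, tolerance=0.20):
    """Return True when the post-deploy mean exceeds the pre-deploy mean
    by more than `tolerance` (0.20 means 20 percent)."""
    before, after = split_at_marker(samples, deployed_at)
    if not before or not after:
        return False  # not enough data on one side to compare
    baseline = mean(before)
    return mean(after) > baseline * (1 + tolerance)


# Response times in milliseconds, one sample per minute (made-up numbers).
# The hypothetical deployment happens at minute 3.
samples = [(0, 120), (1, 118), (2, 122), (3, 160), (4, 158), (5, 162)]
needs_attention = regressed(samples, deployed_at=3)
```

Without the marker, the same data is just a noisy line on a chart. With it, the question becomes "is this release slower than the last one?", which is much easier to alert on and to troubleshoot.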

The Third Way: Build a Culture of Experimentation and Learning

The Third Way is my personal favorite because it highlights the fact that DevOps is about improving culture. When engineers are empowered to experiment, innovation is accelerated, discoveries are made, and build (and recovery) skills are developed. This ultimately leads to more mature and advanced technologies, but also to the joy of learning, the satisfaction of achievement, and the replacement of fear with trust within the team.

From The Phoenix Project:

“A great team doesn’t mean that they had the smartest people. What made those teams great is that everyone trusted one another. It can be a powerful thing when that magic dynamic exists.”

As we have already covered, Platform Engineering focuses on providing rapid feedback and building reliable systems, which are crucial for allowing experimentation to be performed with minimal risks. Failure becomes tolerable in a sense because the damage is contained and automatically recoverable to the best extent possible. On my project, I built this culture of experimentation and learning with the strategies below:

  • Coach the organization on disaster recovery and introduce Game Days. Game Days are experiments and exercises that test our resiliency by injecting or simulating different failures within the platform. These events helped the organization learn of weak points in our platform architecture or processes while also training us on how to respond to failures as a team.
  • Emphasize ‘small changes’ and ‘fail forward’ mentalities. I often spoke about the importance of failure as a means of learning, even basing technical decisions on the anticipation of failure. For example, our branching strategy and CI/CD process for this platform were deliberately built without bespoke or special workflows, such as the hotfix branches in standard Git Flow. Instead, I coached the organization to be very disciplined with sizing work into smaller pieces. Small pieces of work mean small changes at one time and higher throughput in the development lifecycle. If there is a failure, it will therefore be a small failure that can also be corrected with a small fix (failing forward). Better yet, this small fix can simply come in the form of another piece of work that looks just like all the other work in the system. It does not need a special “hotfix” or “bug fix” workflow, thus keeping our processes simple and consistent!
  • Provide isolation to developers for safer experimentation. One of the key themes of the platform design on this project was to have a means for developers to work on real AWS infrastructure without affecting anyone or anything else. In this spirit, I designed the CI/CD for the platform to build and destroy ephemeral environments at the time of push. These environments were totally unique and isolated to the developer’s working repository and branch.
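
One small piece of the ephemeral-environment idea is giving each repository and branch its own predictable, collision-free name, so that resources created on push can be found and destroyed later. The sketch below is an assumption about how that naming could be done, not the project's actual implementation. The `environment_name` helper, the 40-character limit, and the sample repository and branch names are hypothetical.

```python
import hashlib
import re


def environment_name(repository, branch, max_length=40):
    """Derive a stable, unique, lowercase name from a repository and branch.

    The same inputs always give the same name, so a later pipeline run can
    locate and tear down the environment it created earlier.
    """
    # Lowercase and collapse anything that is not a letter or digit into "-".
    slug = re.sub(r"[^a-z0-9]+", "-", f"{repository}-{branch}".lower()).strip("-")
    # A short hash of the exact inputs keeps names unique even after
    # truncation, or when two branches slugify to the same text.
    digest = hashlib.sha256(f"{repository}:{branch}".encode()).hexdigest()[:8]
    head = slug[: max_length - len(digest) - 1].rstrip("-")
    return f"{head}-{digest}" if head else digest


# Hypothetical repository and branch names.
name = environment_name("orders-service", "feature/PLAT-101-vpc")
other = environment_name("orders-service", "feature/plat-101-vpc")
```

Two branches that differ only by letter case produce the same readable prefix but different hashes, so their environments stay isolated from each other.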

So then…

As we can see, the work involved in Platform Engineering is closely aligned with The Three Ways of DevOps. As larger organizations begin DevOps transformations, they often form Platform Engineering teams (or something similar). Within those teams, PEs are poised to quickly grasp and advocate for DevOps, simply by performing their jobs and sharing what they learn with the rest of the organization. Platform Engineering teams are often described as the glue of tech companies, so as the team moves forward, it brings the organization along with it. We know major transformations of any kind are difficult, but cultural changes like DevOps are extra hard. So, if your organization is starting its move or is stalling out halfway, look no further than your Platform Engineering team for a leader to drive momentum.

Notes:

  • When I use terms like “traditional organization” or “traditional IT,” I am referring to organizations before the cloud-computing and DevOps days. These were organizations that often treated the IT department as an afterthought. They were also primarily waterfall in their development practices.
  • Gene Kim is well known in DevOps communities as one of the authors of perhaps the quintessential DevOps book, The Phoenix Project. The book teaches The Three Ways through the story of an IT Director applying them throughout his organization’s own DevOps journey. I highly recommend reading it if you have not already.
  • For brevity, the examples I have written about here of how a PE’s job aligns with The Three Ways of DevOps are only a select few of many. Along that same line, Platform Engineering itself is a relatively new discipline/term, and the scope of each organization’s Platform Engineering team may still vary to some degree. At Slalom Build, we incorporate DevOps and SRE within our Platform Engineering capability.
  • Over the years, many new flavors of DevOps have come about, such as DevSecOps or DevSecFinBizOps. These reflect the culture moving beyond a focus on integrating only Development and Operations. When I speak about DevOps in this blog, I am implicitly including every group or party that has a hand in product engineering.
  • I am not writing this to imply that only a PE is suited for driving a DevOps transformation, as DevOps transformations should be driven from all positions. This is only my take on it from a PE lens. In fact, the themes in this blog are closely aligned with another cultural concept called Extreme Ownership. My colleagues in Tokyo recently led a DevOpsDays presentation about this topic. In that talk, they discuss how we can lead and take ownership of a system from any role.
