Glass Silos: Scaling infrastructure by empowering development teams

Morgan Hoban · Published in Sage AI · Feb 27, 2023

At Sage AI we ship end-to-end AI/ML products. As anyone who’s fought to train a model on GPUs can tell you, operating ML infrastructure effectively is no simple task, and the road to operational hell is paved with special exceptions that work best for a given team’s product and constraints.

The cloud operations team at Sage AI wrestles with the same trade-off as many other companies’ operations teams: to what extent do you empower development teams, and to what extent do you standardize so that you can scale?

Small-batch, artisanal snowflakes of exquisite deployment topography, springing forth from the improvisational minds of product teams on tight timelines, require a delicate web of tribal knowledge to maintain. Alternatively, imposing a single, monolithic pattern to which all teams must comply allows for rapid scaling with a lean ops crew, but it can leave development teams feeling powerless over the interface between their work and the infrastructure that runs it.

At Sage AI, the ops team confronts this fundamental trade-off by using GitOps, infrastructure as code (IaC), and a “we welcome pull requests (PRs)” approach. This allows us to maintain our capacity to scale with a lean cloud ops team by erring on the side of empowering the development teams using our infrastructure.

Those who live in glass silos

Viewing an ops team as a closed silo that works in isolation from development teams creates a dreaded scenario: sending the infrastructure team a request, then having to wait to see if their ticket is ever picked up. This creates a situation where the ops team can act as an equal-opportunity impediment. As much as the mystique of how infrastructure gets provisioned and configured can be a nice piece of job security, a culture of siloed operations can be toxic. Forcing a team to throw a request over an opaque and ominous wall, hoping to eventually hear back, is a great way to take a ball peen hammer to the knees of a team’s momentum.

This is why, at Sage AI, everyone in our organization can see and open PRs against our infrastructure repositories — which then require an ops team approval. When everyone can see the code and the infrastructure team’s PRs, everyone can see how things are done and open PRs to do those things for themselves — whether that is a one-line change or a new database.

While a one-line change certainly shouldn’t take an inordinate amount of time, it isn’t entirely trivial for the operations team. Yes, it is a small change for the developer’s project, but the developer’s project is only one of the projects that the infrastructure team is working on, and context shifting grinds mental gears. It is more fluid for everyone involved for an ops engineer to review a dev team’s PR to fix a simple configuration problem, or provision a new queue, than it is for the ops engineer to fully context switch to do it themselves. Empowering teams to get their hands dirty if they really want something done quickly lets them create and ride their own momentum of product development by minimizing the external work on which they must wait.

When the code and process are transparent, individual contributors can assess whether they have the skill and knowledge to submit a PR to keep their team’s progress moving forward, or whether they should look but not touch. Dev teams being able to look and see for themselves creates a more collaborative and proactive environment than an infrastructure silo that is completely opaque and hidden away.

The path of least resistance, whoever is doing the work

Whether it’s an ops engineer or an intrepid member of a development team who will be writing the provisioning or configuration code, the process should be as streamlined as possible.

At Sage AI, this means Terraform code that uses higher-level modules to abstract away the heavy lifting (and tedium) of generating all the interdependent hosted services needed for a single product to work securely and reliably. Even if only a single team is asking for a piece of tooling, we front-load the work of packaging it up as a reusable module. This way, the future work of provisioning the tool for a single team, or for everyone using our infrastructure, is similarly straightforward.
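
As a rough illustration of what this looks like from the consumer’s side, here is a minimal sketch of a call to such a wrapper module. The module path, names, and inputs are hypothetical, not our actual modules; the point is that one module call fans out into everything the product needs.

```hcl
# Illustrative only: the module source, names, and inputs are hypothetical.
# A single module call produces the queue, its dead-letter queue, the IAM
# policies, and the alarms a product team needs, all wired together.
module "invoice_matching_queue" {
  source = "../../modules/team-queue"

  team_name   = "invoice-matching"
  environment = "development"

  # The module wires alarms to the team's own alert channel.
  alert_channel = "#invoice-matching-alerts"
}
```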

The lure of clicking buttons in a web console to make something work for a tenant team can be strong. It is easy to justify that the team needs this quickly, and other teams probably won’t have the same specific request, so it is okay to do it “just this once to get it done.” But changes made through IaC are self-documenting, whereas clicking buttons in a browser is not. Automating provisioning by chaining Terraform modules through Terragrunt stacks lets us recreate complex resource deployments any number of times with a single command-line invocation; clicking buttons in a browser takes the same amount of time every time. Getting the automation working before the first internal client gets their first deployment of a resource takes more time up front, but that front loading is well worth it to avoid the trap of drudgery.
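
For a flavor of how that chaining works, here is a minimal sketch of a single Terragrunt unit, assuming a hypothetical repository layout and module source; none of these names, paths, or versions are our actual configuration.

```hcl
# dev/invoice-matching/queue/terragrunt.hcl  (hypothetical layout)
# Each environment folder reuses the same module with different inputs.

include "root" {
  path = find_in_parent_folders()
}

terraform {
  # Hypothetical module registry; pin the module version for repeatability.
  source = "git::https://example.com/infra-modules.git//team-queue?ref=v1.4.0"
}

# Pull outputs from another unit in the stack instead of hard-coding them.
dependency "network" {
  config_path = "../network"
}

inputs = {
  team_name   = "invoice-matching"
  environment = "development"
  vpc_id      = dependency.network.outputs.vpc_id
}
```

From the environment directory, a single `terragrunt run-all apply` then walks the dependency graph and provisions each unit in order.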

If non-infrastructure ICs are going to be opening PRs, they should be able to follow a straightforward pattern. If your infrastructure team is going to be doing the dirty work, they should be able to avoid getting bogged down in rote drudgery. The less time the ops team spends provisioning things they already know how to provision, the more time they can spend working on ways to make the process even easier. This is why at Sage AI we are relentless about all of our provisioning being automatable IaC. The easier the automation behind provisioning and configuration, the more development teams we can support with fewer ops engineers while ensuring everyone gets to keep working on interesting tasks.

You can deploy whatever code you want*, whenever you want

While Sage AI defines its infrastructure in Terraform code, everything that we run on that infrastructure takes the form of containerized workloads orchestrated by Kubernetes. This allows us to create two clear domains of responsibility: the configuration of the K8s clusters themselves, and the code within the figurative corners of the app container. Any operations team needs ultimate control over the former, but the latter we try to leave almost entirely to the development teams.

At Sage AI, this takes the form of packaging applications as Helm charts, then deploying those charts with a neat piece of GitOps tooling called Flux. The manifest deploying the charts on our clusters is, just like the Terraform code, available for everyone to see or open a PR against, but it does require an ops sign-off to merge.
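
For the curious, a Flux HelmRelease in such a manifest looks roughly like the sketch below. The names, namespace, and source repository are illustrative rather than our actual configuration, and the remediation block shows the kind of automatic-rollback behavior described later.

```yaml
# Illustrative sketch; names and sources are hypothetical.
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: invoice-matching
  namespace: invoice-matching
spec:
  interval: 5m
  chart:
    spec:
      chart: ./charts/invoice-matching
      sourceRef:
        kind: GitRepository          # the dev team's own chart repository
        name: invoice-matching-app
        namespace: flux-system
  upgrade:
    remediation:
      retries: 3
      strategy: rollback             # roll back automatically if the upgrade fails its health checks
  values:
    replicaCount: 2
```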

However, the development teams have full control of the Helm charts in Git repositories that they control*. Any changes to the application code that will be baked into images, or to the Helm template code that will be built into K8s objects, are entirely within the purview of the development teams †… mostly. In the interest of full disclosure, and for the sake of the Sage AI backend staying in a healthy operational state, this operational latitude is bracketed by CI checks before feature branches can be merged into main. In addition to the unit and integration tests the teams define for their own repositories, the infrastructure team packages a series of checks on the generated Kubernetes manifests into a reusable CircleCI orb. This way, every team runs the same enforced checks for security best practices and stability against all of their code before it hits the clusters. These checks are further enforced by Kyverno runtime checks on the clusters themselves.
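
The orb itself is internal, but as an example of the kind of guardrail enforced both in CI and at runtime, a Kyverno cluster policy blocking privileged containers might look like the sketch below. It follows Kyverno’s standard policy pattern and is illustrative, not necessarily one of our exact policies.

```yaml
# Illustrative runtime guardrail: reject Pods that ask for privileged mode.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged-containers
spec:
  validationFailureAction: Enforce
  rules:
    - name: deny-privileged
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            containers:
              # The =() anchor means "if the field is present, it must match".
              - =(securityContext):
                  =(privileged): "false"
```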

With these protections in place, teams can deploy whatever version of code they want to whichever cluster they want, whenever they want ‡… for the most part. Our GitOps tooling, Flux, is configured to pick up changes to the Git repositories and to notice when new images are pushed to our container registries. Using Flux Image Policies, tenant teams set their own criteria for when to push which new images out to which environments. By following the rules they set for themselves for tagging Git branches and images, the teams handle releases into development, pre-production, and production on their own schedules and as they see fit. Because Flux is using Helm Releases with proper rolling deployments and pod health checks, releases with problems that were not caught by any of our continuous testing pipelines trigger an automatic rollback instead of creating a service outage. If a release has a problem and is rolled back, the team is armed with access to full logs and monitoring; this empowers development teams to quickly diagnose and resolve the issue themselves and retry the release without needing to wait for anyone else.
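
As a rough sketch of what a team’s criteria can look like, here is a hypothetical Flux ImageRepository and ImagePolicy pair; the registry URL, names, and tag scheme are illustrative, not our actual setup.

```yaml
# Illustrative only: registry, names, and tag scheme are hypothetical.
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: invoice-matching
  namespace: flux-system
spec:
  image: registry.example.com/sage-ai/invoice-matching
  interval: 5m
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: invoice-matching-prod
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: invoice-matching
  policy:
    semver:
      range: ">=1.0.0"   # production follows only the stable semver tags the team cuts
```

With a policy like this, production only ever follows the stable tags the team chooses to push, while a looser policy in a development environment can track every build.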

The exception to teams being able to ship code on their own schedule and initiative is the initial setup and subsequent re-configuration of the code that directly provisions Sage AI’s infrastructure and orchestrates its containerized workloads. Issues at this level affect not just a single team but every team running products on the Sage AI backend. Here the infrastructure team must be responsible for ensuring everything is correct before it ships to the clusters or our hosted services. However, being responsible for things being correct before they ship doesn’t mean that dev teams need to wait for the infrastructure team to do it. As I mentioned before, our infrastructure repositories are open for anyone in the Sage AI organization to see, and we welcome PRs!

Conclusion

Infrastructure operations need a certain level of protection, but being protective to the point of robbing teams of their agency in the name of stability is not the answer. Certain things need to be kept behind glass, but visibility into what we are protecting and why is key. This visibility helps teams appreciate sensitive areas while giving them the agency and knowledge to expedite things for themselves. At a rapidly scaling organization, there is always more work to be done and more code to be written; the more that ops teams can use PRs from dev teams to keep progress moving forward, rather than making dev teams wait on them, the better.
