Scaling MLOps: It takes more than just the right tech

Mac Macoy
Published in chick-fil-atech
Oct 24, 2023

“MLOps” — the capability to reliably deploy ML models at scale — has gained incredible steam recently. Artificial Intelligence is more accessible now than ever, and organizations are hoping to reap the benefits.

Interest in “MLOps” over time

At Chick-fil-A, we often emphasize the importance of considering people, process, data and technology. Joining these four factors together creates a force multiplier, and this idea is core to our ML platform strategy.

In this post, we’ll focus on Chick-fil-A’s ML Platform team. We’ll share how we define success, how we partner with delivery teams across the organization, our core principles, and the 8 Pillars that help us execute our strategy. We won’t focus on the technical components we have chosen (perhaps another time), but rather on the mindset and culture required to make almost any technical decision successful in the context of an MLOps platform.

Photo credit: https://www.railway-technology.com/features/luminous-platform-help-reduce-train-delays-germany/

Defining Success

Success for our ML platform team means:

  • The platform is easy to use: Users shouldn’t need a long list of “gotchas” to manage when using it.
  • The path to production is clear: Equip data scientists and engineers with the right tools and the knowledge of how and when to use them.
  • The platform is self-service: Users can deploy a model to production with little to no assistance from the platform team.
  • Systems are well-engineered: The platform is scalable and easily maintained. When users are developing on the platform, there are clear paths and guardrails so that the right way to deploy a model is the easy way.
  • A high percentage of models make it to production: This is a by-product of the first three goals.
  • The “time-to-production” decreases over time: As the platform improves and users gain comfort using it, the mean time to production decreases, resulting in faster experimentation, learning, and business value.

We measure these success factors through user feedback and automated metrics.
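To make the “automated metrics” piece concrete, here is a minimal sketch, in Python with invented use-case names and field names, of how a mean time-to-production metric could be computed from deployment records. It illustrates the idea only; it is not our actual tooling.

```python
from datetime import datetime
from statistics import mean

# Hypothetical deployment records: when a use case kicked off and when its
# first model reached production. Real records would come from our tooling.
deployments = [
    {"use_case": "example-forecasting", "started": "2023-01-09", "in_production": "2023-03-20"},
    {"use_case": "example-classification", "started": "2023-04-03", "in_production": "2023-05-15"},
]

def days_to_production(record: dict) -> int:
    """Days between project start and first production deployment."""
    started = datetime.fromisoformat(record["started"])
    live = datetime.fromisoformat(record["in_production"])
    return (live - started).days

# The trend we want to see fall over time.
print(f"Mean time-to-production: {mean(map(days_to_production, deployments)):.1f} days")
```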

Team Structure

Chick-fil-A has a central ML platform team focused on enabling the outcomes above. We also have model delivery teams that implement use cases on the platform. We’ve been diligent in ensuring the platform team doesn’t become a bottleneck that all ML projects must go through. If the platform team becomes a blocker, that’s a sign that we need to automate a process or provide better documentation for our users.

At the same time, we strive to develop strong relationships between the platform team and the delivery teams. Each team brings a unique perspective to the other. The platform team’s dedicated focus on tooling and on the big picture helps delivery teams work smarter and faster. The delivery teams’ first-hand experience working on use cases helps the platform team see problems it could never see on its own.

Guiding Principles

In addition to our measures of success, our guiding principles clarify the mission and inspire us.

  • Empower with confidence. Provide easy-to-use technology, clear documentation, and quick support that empowers our users.
  • Keep it healthy. Keeping our systems running and healthy is a top priority and is critical to meeting our business’s needs.
  • Standards built-in. Apply standards automatically in templates and libraries to make the right thing to do the easy thing.
  • Stay in touch. Understand and empathize with our users by using our platform, supporting users, and keeping detailed metadata.
  • Pursue what’s next. Always look ahead to new technologies and ideas in the fast-changing world of MLOps.
  • No tool left behind. The tech stack will evolve, but we will always support the current iteration and the one before, and we will facilitate the migration from the old to the new.

Platform Pillars

We have 8 “Platform Pillars” that are tactical ways the platform team empowers the ML community:

  • Code Templates. We package the bulk of project setup in templates so users can focus on their unique business logic. We include boilerplate code for common use cases such as reading from our enterprise data lake. These templates are internal open source, so anyone can contribute to them.
  • Tutorials. We publish comprehensive tutorials that walk users through building an end-to-end ML project, from data engineering to training to inference. We publish a 101 tutorial, as well as a 201 tutorial for more advanced or edge use cases.
  • Reference Project. The platform team manages an ML project repo that the tutorials are based on. This gives the platform team experience “eating our own dog food”, and it serves as a reference so users can see how to use platform features in practice.
  • User guides. In addition to tutorials, we publish user guides that dive deep into specific topics like gaining access to data, managing package dependencies, and analyzing cost.
  • Workshops. We host workshops where the platform team presents on topics, and participants get hands-on experience solving problems.
  • Support Channel. Each week, one engineer on the platform team is on call to assist users who post questions in the platform support channel. Users receive the help they need, and the platform team stays connected to the needs of the community as a by-product.
  • Communities of Practice. Once a month, communities gather across the enterprise to share work and solve problems together. Two of these communities are the Community of Data Science and the Community of ML Engineering.
  • Model Portfolio Management. We’ve built automation to gather metadata across all ML use cases, such as project names, stakeholders, links to code, and links to monitoring dashboards. This gives the platform team a picture of its users and provides insights to leadership (a sketch of what one of these records might look like follows this list).
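
To give a flavor of the portfolio pillar, here is a minimal sketch, in Python with invented project names and URLs, of the kind of inventory record such automation might assemble for each use case. The sketch only shows the shape of the data, not our actual implementation.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class PortfolioEntry:
    """One row in a hypothetical model portfolio inventory."""
    project_name: str
    stakeholders: list[str]
    code_url: str
    monitoring_dashboard_url: str
    in_production: bool

# Invented example entry; real automation would gather these fields from
# source control, deployment metadata, and monitoring tools.
portfolio = [
    PortfolioEntry(
        project_name="example-forecasting-model",
        stakeholders=["Example Business Team"],
        code_url="https://git.example.com/ml/example-forecasting-model",
        monitoring_dashboard_url="https://dashboards.example.com/example-forecasting-model",
        in_production=True,
    ),
]

# A simple rollup for leadership: how many tracked use cases are live.
live = sum(entry.in_production for entry in portfolio)
print(f"{live}/{len(portfolio)} tracked use cases are in production")
print(json.dumps([asdict(entry) for entry in portfolio], indent=2))
```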

In Summary

In technology, one thing that’s consistent is change. MLOps technologies and patterns are improving rapidly, and we frequently look for new tech to help us deliver models to the business faster. Our strategy, which centers on empowering users with confidence, remains constant amidst new capabilities and the evolving needs of the business.

This approach to scaling ML in our organization has proven useful for us, and I hope you find it useful as well. If you have any questions, feel free to drop a comment. And if you have any nuggets 😆 of wisdom to share about ML or data platform strategy, please share them as well.
