You can build it yourself… Eventually

One thing I’ve heard from people trying to figure out how to run machine learning at scale is “we’ll just build it ourselves.” Certainly, you can build it yourself, but should you? Let’s find out.

In any software system, your primary goal is to satisfy the requirements of all stakeholders. In the case of ML systems, your stakeholders are typically the business intelligence team, data scientists, engineers, and DevOps engineers, who help inform, for instance, a sales pipeline. The first step, therefore, is meeting with these groups and figuring out what problems they face when interacting with models, which can take weeks to get right.

Now that you have an idea of what the system needs to do, you start consulting with software engineers on the scope of the features involved, your MVP if you will. Your engineering team might say something like, “We can just use Kubernetes.” They’re not wrong, you can use Kubernetes as a basis for this system, but that’s not where the road stops. Kubernetes is very much the Lego blocks of such a system, as your engineering team is soon to find out. Engineering teams lean hard on DevOps during this process, to help solidify the infrastructure and work out how the networking has to be set up. Communication between engineering and DevOps becomes even more critical in the lead-up to putting users on your platform. During this time, you are likely to spend a few months working out many of the initial details, implementation decisions, and bugs that creep up in the process.

Congratulations! You have a product. Almost. What you actually have is a maintenance headache: a pile of small scripts that take care of the daily operation of your platform. You’ve run a few models on this platform, and you have feedback coming in from your data scientists telling you they need to be able to see what’s going on without having to click on these 17 links in Kubernetes. You sit your engineers down and figure out a solution that lets your users see their logs easily and quickly. Soon after, they ask for a way to track metrics over time and across different runs of a model. Then they ask about GPU support to make their training jobs run faster. Before you know it, your list of asks is a mile long, and you have to hire more software engineers to build it all.
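Even the “simple” asks hide real scope. A request like “track metrics across runs” sounds like an afternoon of work, but it implies a data model, storage, and a query interface. Here is a minimal in-memory sketch of what the data scientists are really asking for (the `RunTracker` class and its methods are hypothetical illustrations, not any particular product’s API); a production version would also need persistence, multi-user access, and a UI:

```python
import time
from collections import defaultdict


class RunTracker:
    """Toy per-run metric tracker (illustrative sketch only)."""

    def __init__(self):
        # run_id -> list of (timestamp, metric_name, value)
        self.runs = defaultdict(list)

    def log(self, run_id, metric, value):
        """Record one metric observation for a run."""
        self.runs[run_id].append((time.time(), metric, value))

    def history(self, run_id, metric):
        """Return the values logged for one metric of one run, in order."""
        return [v for _, m, v in self.runs[run_id] if m == metric]


tracker = RunTracker()
tracker.log("run-1", "loss", 0.9)
tracker.log("run-1", "loss", 0.5)
tracker.log("run-2", "loss", 0.8)
print(tracker.history("run-1", "loss"))  # [0.9, 0.5]
```

Multiply this by every feature request in the list above, and the headcount math starts to explain itself.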

The business intelligence team hasn’t even had an opportunity to interact with the platform yet, because you haven’t figured out a way to get the data out of your system and into the tools they use. That becomes your next focus. The BI team uses Tableau, which has an extension API, your engineers declare with glee, and soon they’ve figured out a way to get data into the tool. Of course, your BI folks have to deal with the rough-around-the-edges bits, like having to recreate dashboards periodically or refresh data manually every single time. These friction points wear on them, and they grow more frustrated with the experience your platform is providing.

By this time, you’re realizing this is a bigger problem than you thought, and the initial enthusiasm, “this is trivial to do with Kubernetes,” is tempered by the realization that Kubernetes is just your Lego bricks, and it’s up to your master builders to make it work to your advantage. Your team has grown by 400% by this point to support development, and you are starting to look more like a software company and less like whatever you were before. Is this what the leadership group wanted for the company? Meanwhile, costs keep growing.

Development continues over time, as you find more and more areas where you need to shape your bricks into new features to improve the platform, enabling models of different shapes, sizes, and requirements to run.

You’re now a year into this project, and things are still not sufficiently stable, but 90% of the time it’s working how you need it to. A win, I guess? However, your platform won’t run unattended. Issues come up; critical vulnerabilities get posted against the tools you depend on. You have to keep a group of dedicated people in place to address these and other concerns as the weeks turn into months.

The moral of this story is that Parkinson’s Law of Triviality is a real thing: given enough knowledge of how things can run, Engineering and DevOps teams can think of any number of ways to make it happen that all sound easy. But what you’re building is a nuclear power plant, not a bike shed. It takes considerably more time, and far more planning, to ensure the proper operation of the platform. That’s what we’re here for, at Metis Machine. Let us show you how we’ve turned our platform into a scalable, frictionless way to operationalize your machine learning models.