The Origin of MLMon
by Means of Natural Selection, or the Preservation of Favoured Microservices in the Struggle for Life
So this is going to be an experimental format for a blog post. I’m going to describe a problem and its solution, then the problems that came up afterward, then solutions to those, then the new problems after that, and so on. I am not claiming these to be the “best” solutions, or even applicable to anyone else. This post is more about how one particular product evolved over time. One has to keep in mind that this was a natural evolution. It was constrained not only by the problem, but by the resources available to solve it and by the value of that solution. I’ve also included a relative pain meter for each problem and for implementing its solution. To explain this post’s shorthand, MLMon == “Machine Learning Monitor,” which is the system we use to run the machine learning algorithms that power our various automation services. It includes training, batch scoring systems, and other compute-heavy operations.
“If it could be demonstrated that any complex organ existed, which could not possibly have been formed by numerous, successive, slight modifications, my theory would absolutely break down. But I can find no such case.” ― Charles Darwin
What do you mean, constrained by the value of the solution? Shouldn’t you always use the most robust solution available? Sometimes, but often, no. I like the example from Bruce Schneier: the most secure way you can encrypt a document is a truly random one-time pad… but you’d be rather silly to use that to secure your 13-year-old daughter’s diary. So when you decide how to solve a problem, you should of course think about how well it’ll solve it and how robust it’ll be in the future, but you also need to ask whether it’s necessary and whether there will even be a future. Much of the work we do on my team is experimental, and a lot of it will simply be abandoned once we prove it’s not useful or effective. So each time you see one of these solutions and think “Well, that’s suboptimal, you should have used XYZ,” stop and ask, “Would XYZ have been a good use of resources at the time?”
“The difference between a programmer and an engineer is the consideration of development costs, both in the present and future!”
Anyways, enough caveats, on to the evolution! For each step I’ve rated how much pain the problem caused versus how much pain implementing the solution would be, using 🔥.
Problem: We need to run a single machine learning (ML) algorithm on data for all of our clients. The data is large and the ML is complicated, making it difficult to run efficiently. Pain: 🔥🔥🔥🔥🔥
Solution: We created EC2 instance clusters for each client (around half a dozen clients at this point) and a script that starts the clusters, runs the ML, uploads the results, and stops the instances. Each client had a different number of instances in their cluster based on the size of their data. Pain: 🔥🔥🔥🔥🔥
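To make that concrete, here’s roughly the shape that first script takes, sketched with boto3. The instance IDs and the `run_ml_job`/`upload_results` helpers are made-up placeholders, not the real ones:

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical mapping of client -> the EC2 instances in their cluster,
# sized to how much data that client has.
CLIENT_CLUSTERS = {
    "client_a": ["i-0aaa1111", "i-0aaa2222", "i-0aaa3333"],
    "client_b": ["i-0bbb1111"],
}

def run_ml_job(client, instance_ids):
    """Placeholder: kick off the ML run on the client's cluster."""

def upload_results(client):
    """Placeholder: copy the result files somewhere durable (e.g. S3)."""

def run_client(client):
    instance_ids = CLIENT_CLUSTERS[client]
    ec2.start_instances(InstanceIds=instance_ids)
    ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)
    try:
        run_ml_job(client, instance_ids)
        upload_results(client)
    finally:
        # Stop (not terminate) the cluster so it can be reused next run.
        ec2.stop_instances(InstanceIds=instance_ids)

for client in CLIENT_CLUSTERS:
    run_client(client)
```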
Problem: This is expensive! At the time, AWS charged per hour, rounded up. So even if the ML took only 5 minutes, we’d pay for an entire hour on every machine! Also, the available machine sizes had grown, so we no longer needed clusters to run a single client; each could be run on a single powerful instance. Pain: 🔥🔥🔥🔥🔥
Solution: Create a system that can run multiple clients sequentially on a single box, cleaning up after each. This system feeds off a queue of which clients need to run and autoscales the number of boxes based on the queue depth. It writes data about each instance and job to a central DB so we can learn how to bin-pack better in the future and save money. Pain: 🔥🔥🔥🔥
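Here’s a rough sketch of the two halves of that system — the per-box worker loop and the queue-depth autoscaler — assuming SQS and an Auto Scaling Group. The queue URL, ASG name, and the run/record/cleanup helpers are hypothetical stand-ins:

```python
import json
import boto3

sqs = boto3.client("sqs")
autoscaling = boto3.client("autoscaling")

CLIENT_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/111122223333/mlmon-clients"  # hypothetical
ASG_NAME = "mlmon-workers"  # hypothetical

def run_client_job(job):
    """Placeholder: run the ML for this client on the local box."""

def record_job_stats(job):
    """Placeholder: write instance/job stats to the central DB."""

def cleanup_workspace(job):
    """Placeholder: wipe the box clean for the next client."""

def worker_loop():
    """Runs on every box: pull a client off the queue, run it, clean up, repeat."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=CLIENT_QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])
            try:
                run_client_job(job)
                record_job_stats(job)
            finally:
                cleanup_workspace(job)
                sqs.delete_message(
                    QueueUrl=CLIENT_QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
                )

def scale_workers(jobs_per_box=4):
    """Runs on a schedule: size the worker fleet to the backlog."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=CLIENT_QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    desired = max(1, -(-backlog // jobs_per_box))  # ceiling division
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=ASG_NAME, DesiredCapacity=desired
    )
```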
Problem: Using machine learning to improve things is awesome, so let’s create more models/algorithms! Wait… the current system was built to run only one type. Pain: 🔥🔥🔥🔥🔥
Solution: Abstract the class that runs the ML and use a subclass for each new model. The class needs to know how to prepare the environment, run the job, and clean up when all is done (whether the run succeeds or errors out). Pain: 🔥
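The abstraction looked roughly like this (a minimal sketch; the class and method names are illustrative rather than the real ones):

```python
from abc import ABC, abstractmethod

class MLJob(ABC):
    """One subclass per ML model the system knows how to run."""

    @abstractmethod
    def prepare(self):
        """Fetch data, set up the working directory and environment."""

    @abstractmethod
    def run(self):
        """Execute the model: train and/or score, then upload results."""

    @abstractmethod
    def cleanup(self):
        """Remove temporary data and artifacts."""

    def execute(self):
        """Cleanup always runs, whether the job succeeded or errored."""
        self.prepare()
        try:
            self.run()
        finally:
            self.cleanup()

class ExampleModelJob(MLJob):
    """Hypothetical subclass for one specific model."""
    def prepare(self): ...
    def run(self): ...
    def cleanup(self): ...
```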
Problem: Creating a new subclass each time the data science team comes up with something new is a bit of a pain. It’s not a good use of resources to ask the data scientists to create the subclasses, since they’d need to learn how to build, test, and follow the standards of the project. Pain: 🔥🔥
Solution: Remove the subclasses and create a runtime-configured class that reads job definitions from YAML files. Pain: 🔥🔥
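A sketch of what that runtime-configured class might look like, assuming each job is described by a small YAML file (the field names here are hypothetical):

```python
import subprocess
import yaml  # PyYAML

# Example job file (hypothetical fields):
#   prepare_commands:
#     - "pip install -r requirements.txt"
#   run_command: "python score_model.py --client {client_id}"
#   cleanup_commands:
#     - "rm -rf /tmp/mlmon-workdir"

class ConfiguredMLJob:
    """Reads everything it needs to know about a job from a YAML file."""

    def __init__(self, config_path):
        with open(config_path) as f:
            self.config = yaml.safe_load(f)

    def prepare(self):
        for cmd in self.config.get("prepare_commands", []):
            subprocess.run(cmd, shell=True, check=True)

    def run(self):
        subprocess.run(self.config["run_command"], shell=True, check=True)

    def cleanup(self):
        for cmd in self.config.get("cleanup_commands", []):
            subprocess.run(cmd, shell=True, check=False)
```

The idea is that a new job is just a new YAML file, with no project code to build or test.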
Problem: As more and more algorithms are added, each needing slightly different configuration, and as the system itself needs more configuration for each job, the YAML files are becoming complicated and prone to human error. Pain: 🔥🔥
Solution: Create a UI to configure jobs! This also gives us a chance to create a reporting system about the outcomes of each job. So a double win! Pain: 🔥🔥
Problem: The ML jobs are getting more divergent. They require different versions of libraries to run, and no one has time to update the old jobs that are running just fine. Pain: 🔥🔥🔥
Solution: Instead of running the jobs directly on the box, we can encapsulate each environment in a Docker container. We provide skeletons for a few different job types to the data science team and let them customize to their hearts’ content. This allows different dependencies and makes cleanup after each job far easier! Pain: 🔥🔥
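A sketch of the container-based runner using the Docker SDK for Python; the image name and resource limit are hypothetical:

```python
import docker

client = docker.from_env()

def run_containerized_job(image, command):
    """Run one ML job in its own container and return its logs.

    remove=True tells Docker to delete the container once it exits,
    which handles most of the per-job cleanup for us.
    """
    output = client.containers.run(
        image=image,        # a data-science-owned image built from one of our skeletons
        command=command,
        remove=True,
        mem_limit="16g",    # hypothetical per-job limit
        detach=False,       # block until the job finishes and return its output
    )
    return output.decode("utf-8")

# logs = run_containerized_job("mlmon/scoring-job:latest", "python run.py --client 42")
```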
Problem: While all the ML jobs can run on the beefy machines originally set up, it’s a bit wasteful. Some require a GPU, some only a good CPU, some lots of memory but very little CPU, and some need very little at all to run. Keeping everything on the large boxes costs more than it should. Pain: 🔥
Solution: Create multiple queues for the different types of boxes available, with a separate Auto Scaling Group for each queue. Pain: 🔥
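Routing is then just a matter of putting each job on the queue that matches its resource needs — something like this sketch, where the profiles and queue URLs are made up:

```python
import json
import boto3

sqs = boto3.client("sqs")

# Hypothetical mapping of resource profile -> the queue feeding the matching
# Auto Scaling Group of boxes.
QUEUES = {
    "gpu":         "https://sqs.us-east-1.amazonaws.com/111122223333/mlmon-gpu",
    "high_cpu":    "https://sqs.us-east-1.amazonaws.com/111122223333/mlmon-cpu",
    "high_memory": "https://sqs.us-east-1.amazonaws.com/111122223333/mlmon-mem",
    "small":       "https://sqs.us-east-1.amazonaws.com/111122223333/mlmon-small",
}

def enqueue_job(job):
    """job is a dict describing the client and the ML algorithm to run."""
    queue_url = QUEUES[job.get("resource_profile", "small")]
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(job))
```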
Problem: As we grow in the number of clients and ML jobs, the queues are getting backed up a bit. Some jobs depend on others, and some jobs need to run more often than others. Unfortunately, they all run in a more or less FIFO order. Pain: 🔥
Considered Solution: Create a priority queue and assign each job an importance score. This was not chosen, since we are currently using the basic AWS queueing service and writing our own was not worthwhile. Pain: 🔥🔥🔥
Solution: Move from a queue/Auto Scaling Group per resource to one per job type. Pain: 🔥
Problem: Creating new queue/Auto Scaling Groups each time a new ML algorithm comes up is a bit of a hassle. Pain: 🔥
Partial Solution: Use a Terraform module to reduce the work to writing a few lines of code and running a command. Pain: 🔥
Problem: We went from a half dozen clients with a single ML job to thousands of clients with dozens of ML jobs. The single DB is getting hammered and that’s causing job failures. Pain: 🔥🔥🔥🔥
Temporary Solution: Stop writing stats about each instance; they aren’t that useful now that AWS no longer charges for whole hours. Add read replicas, a fairly common way to reduce DB load. This reduces the load but still doesn’t improve scaling. Pain: 🔥
Problem: More DB issues. We now have hundreds of ML boxes running at a time during peak. Pain: 🔥🔥🔥🔥
Considered Solution: Move from a standard SQL DB to Cassandra. This was deemed more work than it was worth, since it’d require rewriting the entire DB layer of both the job runner and the reports. Pain: 🔥🔥🔥🔥🔥
Solution: None of the writes are needed in the DB immediately; they are primarily there for reports. So let’s create a queue of DB writes; we can then reuse the system’s existing “read from queue and perform job” machinery to perform them. This requires a relatively small amount of new code and reuses the throttling already in place. Pain: 🔥🔥
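In sketch form, the job runner enqueues its writes instead of hitting the DB directly, and a worker on the other end of the queue performs the actual inserts at a rate the DB can handle. The queue URL and row schema here are hypothetical:

```python
import json
import boto3

sqs = boto3.client("sqs")
DB_WRITE_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/111122223333/mlmon-db-writes"  # hypothetical

def record_job_result(job_id, status, runtime_seconds):
    """Called by the job runner: defer the DB write instead of doing it now."""
    sqs.send_message(
        QueueUrl=DB_WRITE_QUEUE_URL,
        MessageBody=json.dumps({
            "table": "job_results",
            "row": {
                "job_id": job_id,
                "status": status,
                "runtime_seconds": runtime_seconds,
            },
        }),
    )

# A worker subscribed to this queue — using the same "read from queue and
# perform job" loop as everything else — turns each message into an INSERT.
```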
Possible Future Problem: Creating the new queues/Auto Scaling Groups still requires engineering time, which isn’t ideal. Pain: 🔥
Possible Future Solution: Move this creation into the UI. Of course, having a UI that creates infrastructure is generally a dangerous thing, so perhaps we’ll revisit building a priority queue service instead. Pain: 🔥🔥
Possible Future Problem: This bespoke bin-packing isn’t really the greatest, and as the number of clients and jobs grows, the small inefficiencies will start to add up to real costs. Pain: 🔥🔥
Possible Future Solution: Move this system into Kubernetes, which we already have in place for other services and which already knows how to efficiently pack things together. Pain: 🔥🔥🔥
“Just as smart financial debt can help you reach major life goals faster, not all technical debt is bad, and managing it well can yield tremendous benefits for your company.” — hackernoon.com
As mentioned above, this article is in an experimental format. One thing that became clear while I was writing is the pain scale. When we faced a challenge and brainstormed solutions, we’d analyze how much trouble implementation would be, but we never boiled it down to a simple scale like the one in this article. Looking back, it is obvious, to paraphrase Dr. Henry Cloud, that change happens when the pain of the system staying the same equals or exceeds the pain of changing it. Each decision made is an attempt to minimize pain and cost or, put more positively, to get the most value out of what we spend. We make decisions with trade-offs and with the knowledge that nothing is final; the system shall continue to evolve.