The Life and Times of an ML Organization

Jordan Volz
13 min read · Aug 11, 2021

--

(This is a companion piece to the blog "Is Data-First AI the Next Big Thing?" I recommend reading it before diving in here.)

In this blog, we’ll parody an average data team’s journey through the ML platform market with a fictional example (any similarities to real or fictional people or characters are purely coincidental). In doing so, we’ll see firsthand many of the trials and tribulations of the average enterprise. Hopefully, we can get a model into production before we get fired.

Chapter 1: Congratulations! You’re Now Chief Data Officer

Due to your renowned success in data and your illustrious background at high-tech Silicon Valley companies, you’ve landed a pretty comfy spot as CDO at Rick’s Delicious Pickles in Nowhere, Oklahoma in the mid-to-late 2010s. Rick’s has recently become a social media darling with the onset of the Flamin’ Rick Challenge, in which people post videos of themselves eating ten of the company’s extremely spicy pickles, one after another, culminating with the Ghost of Rick, a pickle so spicy that it’s been banned in over 50 countries. As a result, the CEO, Rick Sanchez, has a new mission to embrace digital marketing and aggressively expand the business: a new app is in the works to help users connect and share their Pickle Stories, Ryan Reynolds has signed on as a spokesman for the company, and a new direct-to-consumer delivery service is launching via partner affiliates. What this means for you: lots of data. And your CEO has lots of questions he wants answered: Which market segments produce the best ROI? Which demographics are most at risk of churn? How do you optimize your marketing budget? Etc. You definitely have your work cut out for you.

So delicious… Who could say no?

Your first task is to assemble a team. You open several data science positions and are sorely disappointed to learn that there are not a lot of people who are willing to move to Oklahoma and take a pay cut to join the company, despite the perk of unlimited free pickles. After a few months, you decide to open the positions to remote work — you have a gut feeling that by the time the 2020s roll around, remote work will be all the rage — and you sell this idea to your CEO as Rick’s being ahead of the curve. He eats it up. Four crack data scientists are hired. Time spent: 6 months.

You give your team a long list of data science problems to crank out and send them off to the races. Content that your team will soon be the crown jewel of the company, you set your sights on your real goal: making Ryan Reynolds your new best friend. Unfortunately, your team quickly returns with the realization that Rick’s data infrastructure leaves a lot to be desired. Furious, you storm into Rick’s office, demanding to know what’s going on — he had, of course, guaranteed that the IT team was “state of the art.” The truth is a bit more … nuanced. The “IT team” is really just this guy Milton, who threw together a very makeshift server that runs in the janitor’s closet and stores all the data that Rick knows about. The problem is: no one has seen Milton in months, and a recent article in the Nowhere Times indicates that Milton may have suffered some serious injuries after a mishap with a Jump to Conclusions mat. You ask how the software engineering team is doing things like running an app with hundreds of thousands of users, and Rick shrugs, “It’s in the cloud.”

Rick apologizes for the “mixup” and gives you a blank check to get whatever your team needs in the cloud: the new digital marketing campaign is crushing it and the app is printing money (literally, they have an in-app cryptocurrency called Schmeckle). Dejected, you realize that you’ll have to pivot your priorities and build out infrastructure and data engineering teams before the data science team can even begin solving the hard problems. With a newfound budget, you begin hiring data architects and data engineers to construct a data platform in the cloud. For ease of use, and in the interest of getting something running quickly, you turn to one of the newfangled cloud data warehouses as your main center of data. However, getting all the data you need from other teams at Rick’s takes time, and you end up spending another 18 months ironing out the kinks before you can really begin making progress on your original goals.

Wow, so much to do before we get to the fun stuff…

Finally, close to two years after you were hired, your data scientists can get to work. Or rather, what’s left of them. Half the team left during your data platform build-out, as they weren’t too keen on spending most of their time writing DAGs in pipelining tools and sought out greener pastures. So you have to find replacements. Rick’s blank check was less carte blanche than you anticipated, and you were forced to hire some local junior data scientists coming straight out of data science boot camp. The team slowly makes progress, but you begin to realize that real progress is largely hindered by the team’s inability to collaborate effectively. Their main tool is Jupyter Notebooks, they share ideas via Git, and they use templatized VM instances to run code, but each use case takes them months upon months to execute, as every step along the path is full of manual work. It’s unlikely they’ll have anything approaching a production process any time soon. Your lead data scientist reminds you that back in San Francisco you had spent several years building out a bespoke system that worked for the company, and that was with a team of dozens of engineers.

This system is easily built… with a dozen engineers and several years of development time…

You don’t have the manpower or the time to go through that process again, so… You start making inquiries into software vendors.

Chapter 2: To Production! And Beyond!

After an additional 6–12 months (RFP + POC + legal and procurement + implementation and integration), the company has bought and implemented a collaborative data science tool. The data science team now has a means to effectively share work, plus easy access to things like GPUs for deep learning models and Apache Spark for working with big data. The data scientists are content. The business is less so, reporting that it takes too long to get results from your team.

Use cases are pretty slow to come out the door. Although the team can send you a notebook or PDF with results rather quickly, turning that into a reliable production process for other teams to consume the insights is time-consuming. By your calculations, it can take the team anywhere from one to six months to really get an ML use case “into production”. This generally requires rigging up complex data pipelines that connect data engineering to feature engineering scripts, running model experiments in containers to train a model, then hosting the winning model behind an API so that predictions can be made. Not to mention monitoring models and endpoints to figure out when things need to be retrained and redeployed. To be honest, they haven’t quite figured out most of the last few steps, but even the first steps take a considerable amount of manual work every time they wish to operationalize a new model or update an existing one. After all, ML is not a “one-and-done” activity.
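To make the "happy path" above concrete, here's a purely illustrative sketch of the train → promote → serve steps, boiled down to a toy scikit-learn example (the dataset, model choice, and in-memory "serving" are all stand-ins for what would really be pipelines, containers, and an API):

```python
# Hypothetical sketch of the operationalization steps described above.
# Real pipelines wrap each step in orchestration, containers, and monitoring.
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# "Data engineering" stand-in: a toy dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Experiment" step: train a candidate model and score it.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
score = model.score(X_test, y_test)

# "Promotion" step: serialize the winning model as an artifact.
blob = pickle.dumps(model)

# "Serving" step: the API process deserializes the artifact and predicts.
served = pickle.loads(blob)
preds = served.predict(X_test[:5])
```

Every line of this is trivial in isolation; the months come from wiring these steps together reliably, repeatedly, and for every model the team ships.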

Interestingly, you also notice that the DS team is doing very little original work in all of this ML experimentation. You learn that the vast majority of models are scikit-learn or XGBoost, with an occasional neural net from TensorFlow or PyTorch. They spend a lot of time experimenting with these different frameworks and trying to templatize use cases to speed up the process, but it’s been slow going, and junior members of the team struggle to manage the process effectively, not to mention going into shock whenever anyone mentions writing a SQL script.
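The "templatized use case" idea the team is chasing might look something like the sketch below: run a handful of off-the-shelf estimators through the same cross-validation loop and keep the best one, instead of hand-tuning each in a notebook. (The candidate list and dataset here are invented for illustration; a real template would also cover feature pipelines and hyperparameters.)

```python
# Illustrative sketch: one evaluation loop over interchangeable estimators.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=8, random_state=1)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=1),
    "boosted": GradientBoostingClassifier(random_state=1),
}

# The template doesn't care which framework an estimator comes from,
# as long as it follows the standard fit/predict interface.
scores = {name: cross_val_score(est, X, y, cv=3).mean()
          for name, est in candidates.items()}
best = max(scores, key=scores.get)
```

This is exactly why AutoML tools get traction: the loop itself is generic, and most of the modeling work is model selection over a small, well-known menu of algorithms.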

Your reaction when learning that yet another ML project has been delayed… Also, why are you editing photos at work? Don’t you have someone to do that for you?

It’s clear that the business requires your team to execute and turn over use cases faster. To do this, you need tools that are more geared toward automation and that allow the team to get use case resolution down into the realm of several weeks, not months. You think more about the bespoke solutions you built back in the day and the lack of engineering expertise at your disposal, and begin making some more calls…

Chapter 3: Democratizing ML

You start your Gen 2 journey by going back to your vendors, and a few months later you have an AutoML tool. Initially, your team uses it to streamline their model training and experiment tracking (previously all done in notebooks), but the AutoML sales rep sells your team on the idea of “democratizing ML”. “Why do all the work yourself?” he asks. “The tool is so simple you can give it to anyone and they can build their own models.” You introduce it to your analysts so they can do exactly that. They don’t know about all this fancy ML stuff, but the weeks your data scientists usually take to figure out how all the data fits together and what they should be predicting take the analysts only a few days. You can now get projects up and running in a couple of months, plus a few more months of integration work to get the AutoML tool hooked into your ML platform.

Eventually, things go wrong. An analyst creates a bad model, it’s pushed into production, and suddenly children are being recommended Sexy Pickles (don’t ask). This is a huge scandal that you didn’t need. Rick is upset, Ryan is upset, your own mother is ashamed of you. You decide that you need more oversight into the ML process. The path to production needs more gatekeeping and accountability. After a bit of research, you decide that you need an MLOps tool and an XAI tool. You spend another six months picking vendors and another three integrating them into whatever else you have (hard to keep track of now, isn’t it?).

The MLOps tool gives your team insight and accountability into what’s running in production. Your team now has a reliable and repeatable process for promotion as well. The XAI tool does a lot of the work of analyzing your models and presenting the data to your DS so they understand the performance of models and the impact of changes to downstream processes. Progress is made, you celebrate by trying to persuade Hugh Jackman to fly into town for dinner. He declines.

Half your DS team already looks like Gandalf, so they were pretty happy when they got to become gatekeepers

Things are working, but it’s not without its difficulties. The stack you’ve created is very fragile and thrown together. You have to employ several expert system architects just to maintain the various components and troubleshoot the inevitable issues that arise. On the surface, you can get results out to the business, but underneath, everything is a messy mash of pipelines, scripts, API calls, and patchwork workflows. You know there must be a better way to do this and wonder if you should have just built out a bespoke solution like you originally intended. Your lead DS keeps asking for a feature store, and you spend a lot of your day hiding in the bathroom and avoiding emails. You fear if you see another junior data scientist try to operationalize a notebook, you’re going to lose it.

Congratulations, your team has acquired and implemented a best-of-breed ML platform with collaborative notebooks, AutoML, MLOps, and XAI. It’s been four, or five, or maybe six years since you started this journey — you’ve definitely lost track of time. What’s certain is that you’ve lost a significant amount of hair, gained 30 pounds, Ryan Reynolds won’t return your calls anymore, the only member of your family who will listen to your rants about artificial intelligence is your dog, and Rick has gotten so wealthy selling black-market supplements that turn people into pickles that you often feel like you have little guiding your day-to-day — but you’re finally beginning to see the light at the end of the tunnel. Plus, Nowhere has a pretty good whiskey distillery, and you are its #1 patron after work hours.

Is this the end? Or just a new beginning?

Chapter 4: Data-First AI

In an alternate universe, your initial inquiry into ML platforms took you not to Gen 1, which all your data scientists were asking for, but instead, to a data-first solution. Perhaps it was a particularly well-informed peer who tipped you off, or a very effective online ad, or maybe even a too-witty-for-its-own-good blog post — but, you somehow made the connection and decided to give it a try. You’re stunned that with the data-first approach you’re able to POC an entire use case in less than a week, with models and predictions being rebuilt automatically as frequently as needed.

You fear it’ll be a hard sell to your notebook-centric data scientists. At first, they are skeptical, but once they start using it, they warm up. Your DS lead comments, “This is just like a system I would build if I had a team of 10 engineers and 3 years.” It allows them to execute use cases with less fuss and at a much quicker pace. The business is happy because they get results faster and in a format that they know how to work with (in their database), and the data science team is happy because they’re no longer under extreme pressure or trying to manage a collection of spinning plates masquerading as an ML platform.

Soon, other teams start using the tool: data engineers, data analysts, even business analysts — heck, even Milton gets hooked. They generally have a much more intimate knowledge of the business’s data and are able to quickly contribute to the platform’s feature store. The platform sees all these new relationships, builds better models, and gets better results back out to the business faster.

Instead of spending most of their day trying to build and maintain a web of data pipelines, the data science team now spends most of its time reviewing the work of others and enhancing existing work via feature engineering. It’s an old cliché that data scientists spend 80% of their time wrangling data — this is meant as a negative, the thinking being that you should be paying data scientists to code instead. In the data-first platform, that is precisely what data scientists do: build better features to better inform models, so they produce better results downstream for the business. This is something they actually like.

Impressed with your swift progress — within a year you’ve bootstrapped a DS team from nothing and are reliably executing on new use cases every week — Rick decides to promote you to Chief Pickle Officer. This requires changing your legal name to end in “.pkl”, but it also comes with a lifetime supply of alcoholic pickle juice, so you gladly accept.

In our current universe, you are impressed with the notion of a data-first AI platform but have to reconcile the years of effort you’ve spent building your existing platform. After being honest about your accomplishments, you realize:

  1. Your system is so complex that you have to staff teams of people to run and maintain the tech stack. This is overhead that you’d like to re-allocate.
  2. Only data scientists can really effectively use the tool.
  3. Only senior data scientists can really effectively use your system.
  4. Despite all the effort, it still takes several weeks to fully operationalize a use case, in the best-case scenario.
  5. Things often go wrong and require lots of manual effort to fix.

You begin to adopt a dual strategy, putting the data-first system next to your existing Frankenstein. Quickly, you notice:

  1. The non-data scientist persona is able to use the data-first AI platform to quickly iterate on problems and operationalize their work.
  2. Your data science team can manage the workflow in the data-first AI platform with minimal time consumed. This allows them to spend more time focusing on hard problems.
  3. Use cases can often be POC’ed and pushed into production the same week. This amazes you.
  4. The system has robustness built in, which means you don’t need a team of experts maintaining the stack, and any issues that arise are easily dealt with and corrected by the end user.
  5. Your expert data science team sometimes still needs to code things by hand. This is OK: they’re able to leverage the older ML platform to help automate things when necessary.

It’s a compromise between the past and present, but you end up being really happy that everyone in your organization can be involved in the ML process and do so on their own terms. You’ve lowered the burden of the ML platform considerably by adopting a data-first approach to handle the majority of your use cases, actually democratized AI in the process, and the business reports that they’ve never been so impressed with how fast your team can get their insights to them. Even your lead data scientist has no complaints: she finally got her feature store and you can now enjoy eating lunch in the cafeteria without fear of being hounded by product requests.

Rick knows this journey has been tough on you. The Pickleverse has recently launched, and he knows the flood of new data is going to wreck your team. As a gift, he says, he’ll get you anything you desire as a bonus. You ask for a flight into space, which he promptly denies, then follow up with a Tesla. He agrees, although he has a perplexed look on his face. The next day you find the following in your parking spot at work.

Meet Tesla, your new ride.

And like all great stories, you ride off into the sunset…

--

Jordan Volz

Jordan primarily writes about AI, ML, and technology. Sometimes with a humorous slant. Opinions here are his own.