Accelerating ML within CNN
At CNN, our mission is to inform, engage, and empower the world in a way that is trusted, timely, and transparent. This mission is more critical than ever as we face some of the most challenging times of our generation. As the world is becoming increasingly digital in nature, we are relentlessly focusing our mission to directly connect with our audience, understand what they care about most, and reach them in a way that is most accessible for their lifestyle. Our Data Intelligence team, in particular, leverages data and machine-learning capabilities to build innovative experiences for our audience and provides scalable solutions to CNN’s operations. As the world’s largest digital news destination, we averaged more than 200 million unique global visitors every month of 2020. Our catalog of raw audio and video footage also goes back several decades. Clearly, we have a lot of data!
Given the depth of our data and the rapid pace of advancements in machine learning, there are always far more interesting ideas to explore than team members to explore them. (We’re hiring!) As we’ve grown our team, it’s become increasingly important that our researchers spend their time on the parts of the project that are most valuable to our audience. Our audience is best served when we provide trusted and personalized news to them, and we want our researchers to be able to focus as much of their time as they can on improving that experience. Yet a significant amount of our research experimentation time has typically gone to coordinating compute jobs and compute clusters, with support from our engineering team.
Initially, our process for training a new experimental model was fairly similar to our process for (re)training a production model. This process was optimized for consistently training production models at scale, and did really well at that. Unfortunately, it was not designed for lightweight experimentation, and because of that, the research iteration process was frustratingly slow.
Partnering with the folks at Hop Labs, we explored a number of platforms and approaches to streamlining research experimentation. The one that most resonated with our team was an open-source project put out by the folks at Netflix: Metaflow. The key concept of Metaflow is the “Flow”, which implements a directed acyclic graph (DAG) in Python. This metaphor (which we were already using in our Airflow data pipelines) and the straightforward Python interface meant that it instantly felt familiar to our ML engineers. Additionally, the seamless integration with AWS Batch gave us simple scalability options.
A Flow is made up of Steps, each of which can have its own separate dependencies and compute requirements. This allows a researcher to write pure Python and declaratively identify requirements on a granular step level, without having to worry about how those requirements are met. Our ML Ops engineers were able to implement a rich compute substrate that met the security/scalability requirements of CNN, and Metaflow was able to leverage that substrate using AWS Batch to scale up (and spin down) resources dynamically. This allowed our researchers to rapidly iterate with a small set of data locally. Then, when they were confident in their pipelines, they could scale up to larger resources as necessary — all without having to re-write code or manually schedule jobs and compute resources.
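To make the Flow-and-Step idea concrete, here is a toy sketch in plain Python. It is not Metaflow's API and not the code described in this post: the Flow class, the step decorator, and the cpu and memory_mb parameters are hypothetical stand-ins, chosen only to show how a step can declare what it needs while the runner decides how to satisfy it. A real Metaflow flow subclasses metaflow.FlowSpec and would hand the declared requirements to a backend such as AWS Batch; this sketch just records them and runs everything locally.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List


@dataclass
class Step:
    """One node of the DAG: a function plus the resources it declares."""
    func: Callable[[dict], None]
    after: List[str] = field(default_factory=list)  # names of upstream steps
    cpu: int = 1
    memory_mb: int = 1024


class Flow:
    """Hypothetical toy flow, used only to illustrate the concept."""

    def __init__(self) -> None:
        self.steps: Dict[str, Step] = {}

    def step(self, after=(), cpu=1, memory_mb=1024):
        """Register a function as a step and record its declared requirements."""
        def register(func):
            self.steps[func.__name__] = Step(func, list(after), cpu, memory_mb)
            return func
        return register

    def run(self) -> dict:
        """Run every step once, in an order that respects the DAG edges."""
        graph = {name: spec.after for name, spec in self.steps.items()}
        state: dict = {"trace": []}
        for name in TopologicalSorter(graph).static_order():
            spec = self.steps[name]
            # A real scheduler would provision spec.cpu and spec.memory_mb here
            # (for example by submitting a batch job); this sketch only logs them.
            state["trace"].append((name, spec.cpu, spec.memory_mb))
            spec.func(state)
        return state


flow = Flow()


@flow.step()
def load(state):
    # Small local sample, standing in for a real dataset.
    state["rows"] = [1.0, 2.0, 3.0, 4.0]


@flow.step(after=["load"], cpu=8, memory_mb=32_000)
def train(state):
    # The "model" is just the mean; only this step asks for more resources.
    state["model"] = sum(state["rows"]) / len(state["rows"])


@flow.step(after=["train"])
def report(state):
    state["summary"] = f"mean={state['model']:.2f}"


result = flow.run()
print(result["summary"])
```

The researcher-facing code stays the same whether the runner executes steps on a laptop or farms them out to larger machines; only the runner's handling of the declared cpu and memory_mb values would change.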
Each run is reproducible by default, and all the code and dependencies are snapshotted to AWS S3 in a way that is transparent to (and effortless for) the researcher.
While we appreciated all that Metaflow had to offer, we discovered some specific challenges along the way. The first was that the only official way to stand up Metaflow’s infrastructure was via a single CloudFormation template. This is likely sufficient for many teams, but our policies required us to make changes to the default infrastructure, and the existing template did not lend itself to that. Within CNN’s Data Intelligence team, we use HashiCorp’s Terraform for managing our infrastructure. Part of our engineering effort in adopting Metaflow was rewriting the provided CloudFormation template in highly modular and extensible Terraform. We are pleased to be able to submit these changes back to the Metaflow community, so hopefully this will not be a challenge for other teams looking to use Metaflow.
Another obstacle for our team was that Metaflow uses Conda for package management. This is a natural and reasonable technical choice that is particularly helpful for snapshotting code and dependencies for reproducibility. Nonetheless, a large percentage of our team had not previously used Conda heavily and faced a bit of a learning curve while switching to it. There are also occasional package version mismatches between PyPI and the Conda repositories. We’ve built some internal solutions to partially address these challenges, but remain on the lookout for more comprehensive ways to streamline things here.
At this point, we’ve switched our research experimentation process over entirely to this Metaflow-powered workflow, with great success. Our researchers are able to spend more of their time improving their models for our audience, and our engineers are able to provide a rich compute substrate without having to manually manage clusters. As an informal estimate, our data science team believes they were able to test twice as many models in Q1 2021 as they did in all of 2020, with simple experiments that would have taken a week now taking half a day.
Although it was not our original intention, we were also able to transition a number of production training workloads to this workflow. Our engineering and operations teams appreciate having one less always-on stack to maintain, which reduces their operational burden and shrinks the surface area of their architecture.
The work doesn’t stop though — we’ve kaizen’d this part of the workflow, but that just means we get to focus on the next improvement to the process. We strongly believe in continuously improving the tooling available to our researchers and engineers, so that they can continuously improve the CNN experience for our audience.
If you’re interested in working on a cutting-edge team with broad societal impact (and continuously improving tools!), we’re hiring and would be excited to talk to you.
Thanks to Deepna Devkar, Gregory Hilston, Roshan Bangera and the rest of the CNN Data Intelligence team for valuable contributions and feedback.