Machine Learning Engineering at Pluralsight

History

Machine Learning Engineers (MLEs) are a new and interesting fixture in the job landscape. To those of us interested in machine learning, the role's very existence signals an exciting phase of ML's history, one where the industrial applications once dreamed of are being realized at scale. This progress has come in fits and starts, including boom and bust periods. The first boom, running from the 1950s through the 1990s (although cracks began appearing as early as the late 1960s), established machine learning as a research discipline and promised to automate previously tedious tasks such as translation or check reading. The bust came when the algorithms borne out of this research failed to live up to what some would argue were over-inflated expectations.

The resurgence of machine learning has been slow and gradual, but the event that caught the attention of the wider world was the performance of AlexNet on the ImageNet challenge. Once the world was attuned to the power of these techniques, in particular "deep learning", the potential was realized far more widely than at the handful of large companies, such as eBay, Yahoo, and Google, that were already applying machine learning successfully in an industrial setting. This event, coinciding with the explosion in data science, meant more companies were looking to leverage their data in new ways. The natural question then becomes: why does the MLE role exist independently of the Data Scientist? One simple answer is that unicorns are rare and market demand for ML is large, because it is something that can make a big difference. There are a few distinct groupings of skills necessary to research, prototype, deliver, and monitor ML at scale. A "unicorn" (a terrible term) would be someone who can do all of the above, but such people are increasingly few and far between. It is easier to subdivide that chain into research and development on one side and delivery and monitoring of products on the other. ML engineering falls into the latter and is a natural result of industrialization.

History at Pluralsight

Up to now, machine learning's potential in industry has largely been realized by automating high-volume, relatively simple tasks like personalization of search rankings, ad placement, metadata creation, and content recommendations. Because we are at a relatively nascent phase with respect to the industrialization of ML, we at Pluralsight have had to chart our own path for shipping ML reliably at scale.

Our first attempts at shipping products which learn from or leverage large amounts of data were simple and predated the hiring of Machine Learning Engineers. The process, a collaboration between data scientists and web developers, relied on Hive queries managed by cron jobs and bash scripts, performing batch-style data preparation whose output could then be served. These jobs were prone to failure, but they were simple and quick to stand up, which got the company started. This process of shipping value to the customer in the quickest way possible and only then iterating is core to how Pluralsight develops products.

As the warts of this system began to show, through intermittent failures that did not always bubble up to the team responsible, scaling issues, and difficulty serving results, we saw that we needed to go back to the drawing board and deliver a system which could better surface the workflow, illustrate data provenance, serve models live, and alert on failures. This coincided with the hiring of our first Machine Learning Engineers.

A New Pattern Emerges

When those first MLEs were hired, we made the conscious decision to focus all MLE talent on the recommendations team. This allowed us to concentrate on a single domain of ML and develop solutions aimed squarely at improving the existing recommendations system. We began by getting on the same page regarding the deficiencies of the current system as well as leaders' asks for the next generation. We knew we needed a system which would:

  • Scale well for training and serving models
  • Ensure data lineage is clearly illustrated and understandable
  • Keep serving as low-latency as possible to increase model usefulness
  • Accept models from different frameworks
  • Be Python-based, as Python is the undisputed lingua franca of ML

This translated into a research and development effort that would replace some of our earlier Hive-based, ETL-style models. To meet the requirements above, the first and foremost component would be task orchestration and job management. There are a number of wonderful frameworks in Python for this: Airflow, Luigi, and Celery are quite popular choices. Luigi was showing signs of age, with indications of dwindling support from the original authors and not much differentiating it (at this point the original authors have moved on from the project). Celery was a technology Pluralsight was quite familiar with, just not for task management or large-scale directed acyclic graph (DAG) oriented work. When Celery was compared side-by-side with Airflow, however, Airflow won out, thanks to its very nice UI/UX, its focus on longer-running, larger tasks, and its native support for many common data tools right out of the box.
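To make the orchestration piece concrete, here is a minimal sketch of the kind of Airflow DAG this setup implies: a daily pipeline that extracts interaction data, trains a model, and exports it for serving. The task names and callables are hypothetical placeholders, not our actual pipeline code.

```python
# Hypothetical daily train-and-export DAG; the callables are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
# (in older Airflow 1.x: from airflow.operators.python_operator import PythonOperator)


def extract_interactions(**context):
    """Pull the latest user-interaction data into a staging location."""
    ...


def train_model(**context):
    """Train the recommendation model on the staged data."""
    ...


def export_for_serving(**context):
    """Write the trained model where the serving layer can pick it up."""
    ...


default_args = {"retries": 2, "retry_delay": timedelta(minutes=10)}

with DAG(
    dag_id="recs_daily_train",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extract = PythonOperator(task_id="extract_interactions", python_callable=extract_interactions)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    export = PythonOperator(task_id="export_for_serving", python_callable=export_for_serving)

    extract >> train >> export
```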

For our machine learning models, we wanted to start with as few frameworks as possible, so that we could establish patterns for automating the training/retraining, deployment, and monitoring processes. At the same time, we wanted a framework expressive enough to allow flexibility in defining new and interesting models. We knew we wanted something built primarily around batched, gradient-based learning, so that we could more easily scale to larger-than-memory datasets and express modern deep learning models. We also knew we wanted serving latency to be as low as possible, because a slow model is far less useful than a very fast one at inference time. As such, we coalesced around TensorFlow (TF) and TensorFlow Serving for modeling and serving. During this evaluation we also looked into clipper.ai and PyTorch more generally, but we found the strongest story for production-facing models in the TensorFlow ecosystem.
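To make the "batched, gradient-based" criterion concrete, here is a minimal sketch of the training style this implies in TensorFlow: a tf.data pipeline streaming mini-batches from sharded files on disk into a small Keras model, so the full dataset never has to fit in memory. The file pattern, features, and model are hypothetical stand-ins, not our production code.

```python
# Sketch of batched, gradient-based training that streams data from disk.
# File paths, feature spec, and the model are illustrative placeholders.
import tensorflow as tf

feature_spec = {
    "user_id": tf.io.FixedLenFeature([], tf.int64),
    "item_id": tf.io.FixedLenFeature([], tf.int64),
    "label": tf.io.FixedLenFeature([], tf.float32),
}


def parse(record):
    example = tf.io.parse_single_example(record, feature_spec)
    features = {"user_id": example["user_id"], "item_id": example["item_id"]}
    # Expand the label to shape (1,) so it matches the model output.
    return features, tf.expand_dims(example["label"], axis=-1)


# Stream shuffled mini-batches from sharded TFRecords rather than loading everything.
dataset = (
    tf.data.TFRecordDataset(tf.io.gfile.glob("data/interactions-*.tfrecord"))
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(100_000)
    .batch(1024)
    .prefetch(tf.data.AUTOTUNE)
)

# A tiny embedding dot-product model as a stand-in for a real recommender.
user_in = tf.keras.Input(shape=(), dtype=tf.int64, name="user_id")
item_in = tf.keras.Input(shape=(), dtype=tf.int64, name="item_id")
user_vec = tf.keras.layers.Embedding(input_dim=200_000, output_dim=32)(user_in)
item_vec = tf.keras.layers.Embedding(input_dim=50_000, output_dim=32)(item_in)
score = tf.keras.layers.Dot(axes=1)([user_vec, item_vec])
output = tf.keras.layers.Activation("sigmoid")(score)

model = tf.keras.Model(inputs={"user_id": user_in, "item_id": item_in}, outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(dataset, epochs=5)
```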

With these two key pieces of the stack in place, it was easy to fill in the rest for our first pass. We connected Airflow and TensorFlow Serving with a network drive provided by Amazon's EFS. This works because TensorFlow Serving, at start time, is pointed at a directory where new model versions arrive as epoch-timestamped subdirectories; it picks the directory with the latest timestamp and serves that version as the default. TensorFlow Serving can be customized to do canary deployments, but we have not done so ourselves, as we have other, more dynamic methods for testing model deployments using multi-armed bandits, which I will discuss later and in more detail in subsequent posts. We also built an API layer over TF Serving so our clients would not need to understand the nuances of feature engineering or be relied upon to construct inputs for the models. Our first pass at this layer leveraged Node.js, which felt like a great choice because asynchronous programming is a first-class citizen there: model response times can vary widely, and blocking on those calls would consume many more resources than necessary. Finally, we leverage the AWS S3 service for saving training data and models from each run, and AWS application load balancers and auto-scaling groups to ensure our service scales elastically and safely with load.
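Here is a minimal sketch of the export step that makes this handoff work: the training job writes a SavedModel into an epoch-timestamped subdirectory under the model's base path (an EFS mount in our case), and TensorFlow Serving, which watches that base path, picks up the newest version automatically. The base path and model below are placeholders.

```python
# Hypothetical export step: write a SavedModel into an epoch-timestamped
# version directory so TensorFlow Serving (watching MODEL_BASE_PATH) will
# discover and serve it as the latest version.
import os
import time

import tensorflow as tf

MODEL_BASE_PATH = "/mnt/efs/models/course_recs"  # placeholder EFS mount

# Stand-in for a real trained model.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])

version = str(int(time.time()))  # e.g. "1580000000"
export_dir = os.path.join(MODEL_BASE_PATH, version)
tf.saved_model.save(model, export_dir)

# TensorFlow Serving would then be started with something like:
#   tensorflow_model_server --model_name=course_recs \
#       --model_base_path=/mnt/efs/models/course_recs --rest_api_port=8501
```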

Our first pass at the architecture looked like so:

Our first projects with this new architecture added two components: AWS Athena and a multi-armed bandit framework for testing models against one another. Athena was selected to help us reproduce a collaborative filtering model being transferred to the team, one that could not be handled with a typical OLTP database like Postgres. Since collaborative filtering can be framed as a map-reduce problem expressible in SQL, we went with Athena, which provides a PrestoDB-backed serverless querying service that handles parallelism within queries out of the box. This predated Airflow working natively with Athena, so we needed to write some custom code to do so, which would eventually grow into an internal Python package the Machine Learning Engineers collectively manage, called ps-airflow. We will talk more about that in a future post on our internal packages.
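For a sense of what that custom glue looked like before native support existed, here is a minimal sketch of an Airflow operator that submits a query to Athena via boto3 and polls for completion. The class and argument names are illustrative, not the actual ps-airflow internals.

```python
# Illustrative custom operator, not the real ps-airflow code: submit a query
# to Athena with boto3 and poll until it finishes, failing the task on error.
import time

import boto3
from airflow.models import BaseOperator


class AthenaQueryOperator(BaseOperator):
    def __init__(self, query, database, output_location, **kwargs):
        super().__init__(**kwargs)
        self.query = query
        self.database = database
        self.output_location = output_location  # e.g. "s3://bucket/athena-results/"

    def execute(self, context):
        client = boto3.client("athena")
        response = client.start_query_execution(
            QueryString=self.query,
            QueryExecutionContext={"Database": self.database},
            ResultConfiguration={"OutputLocation": self.output_location},
        )
        query_id = response["QueryExecutionId"]

        # Poll until the query reaches a terminal state.
        while True:
            state = client.get_query_execution(QueryExecutionId=query_id)[
                "QueryExecution"
            ]["Status"]["State"]
            if state == "SUCCEEDED":
                return query_id
            if state in ("FAILED", "CANCELLED"):
                raise RuntimeError(f"Athena query {query_id} ended in state {state}")
            time.sleep(10)
```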

Very soon after porting this model, we replicated a Variational Autoencoder (VAE) model for collaborative filtering, as described in this paper, and ran into the perennial and tricky problem of deciding which recommender "works better". We realized quickly that the standard offline evaluation metrics do not necessarily track or translate to a better user experience, so we knew we needed a way to test these models with real users in a live setting. We discussed traditional A/B testing as well as dynamic routing via multi-armed bandits and settled on the bandit approach because of its dynamic nature, which helps ensure that as few users as possible are exposed to a less performant treatment. Bandits also provide a framework, with guarantees around regret minimization in some instances, for personalizing or conditioning the serving on features of the user, model, or product needs as well. This has been quite exciting and has provided many benefits, which we will discuss in later posts.
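To give a flavor of how bandit-based routing can work, here is a minimal Thompson sampling sketch for splitting traffic between two models on a binary reward (e.g. click / no click). This is a textbook Beta-Bernoulli bandit, not Pluralsight's production framework.

```python
# Minimal Beta-Bernoulli Thompson sampling sketch for routing requests
# between candidate models; not the production bandit framework.
import random


class ThompsonSamplingRouter:
    def __init__(self, model_names):
        # One Beta(successes + 1, failures + 1) posterior per model.
        self.stats = {name: {"successes": 0, "failures": 0} for name in model_names}

    def choose(self):
        """Sample a plausible reward rate per model; route to the best draw."""
        draws = {
            name: random.betavariate(s["successes"] + 1, s["failures"] + 1)
            for name, s in self.stats.items()
        }
        return max(draws, key=draws.get)

    def update(self, name, reward):
        """Record a binary reward (1 = click, 0 = no click) for a model."""
        if reward:
            self.stats[name]["successes"] += 1
        else:
            self.stats[name]["failures"] += 1


# Usage: pick a model per request, then feed back the observed reward.
router = ThompsonSamplingRouter(["matrix_factorization", "vae_cf"])
chosen = router.choose()
router.update(chosen, reward=1)
```

Because each arm's posterior tightens as evidence accumulates, traffic drifts toward the better model automatically, which is exactly the "expose as few users as possible to the weaker treatment" property we wanted.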

At this point our architecture looked like so:

This represents our 1.0 architecture, which scaled well with the recommendations team, eventually handling many recommendation models and the endpoints to serve them. This architecture, the bandit framework in particular, allowed for rapid testing of models and a very fast research-to-production pipeline, bringing data science and machine learning engineering closer together, a gap that remains a pain point for many organizations today.

Into the Future

As machine learning at Pluralsight has expanded beyond the recommendations team, so has our architecture. As we embedded machine learning engineers and data scientists on the search team, we developed new capabilities around training and serving "Learning to Rank" (LTR) models using XGBoost and the Elasticsearch Learning to Rank plugin. We have also developed click models for search template hyperparameter optimization, trained using this same architecture. These machine learning based features have steadily increased click-through rate and led to noticeable improvements in search relevance.
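For context on the LTR piece, here is a minimal sketch of training a pairwise ranking model with XGBoost on judgment-list-style data, where each query contributes a group of candidate documents. The feature matrix, labels, and group sizes are synthetic placeholders; in practice the trained model would be uploaded to the Elasticsearch LTR plugin for serving.

```python
# Minimal sketch of a pairwise Learning-to-Rank model with XGBoost.
# X, y, and groups are synthetic placeholders for real judgment lists.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)

# 3 queries with 5 candidate documents each, 10 features per document.
X = rng.normal(size=(15, 10))
y = rng.integers(0, 4, size=15)      # graded relevance labels per document
groups = [5, 5, 5]                   # documents per query, in row order

ranker = xgb.XGBRanker(
    objective="rank:pairwise",
    n_estimators=100,
    learning_rate=0.1,
)
ranker.fit(X, y, group=groups)

# Score the candidates for one query and sort best-first.
scores = ranker.predict(X[:5])
ranked = np.argsort(-scores)
```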

This framework has evolved rapidly as we have hired data practitioners onto more teams. We have developed a few internal packages, in particular for working with Airflow, TensorFlow, and text data, which have helped us maintain consistency at scale. However, with so many teams engaging with some or all of this framework, there has been a healthy amount of experimentation with how to accomplish a given task, such as map-reduce style data processing. As such, we have experimented with and expanded this framework to interact with Dask and Apache Beam for data processing, as in the sketch below.
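As an example of that map-reduce style of work in Dask, here is a minimal sketch that aggregates interaction counts across partitioned Parquet files in parallel; the paths and column names are hypothetical.

```python
# Minimal Dask sketch of map-reduce style aggregation over partitioned data.
# Paths and column names are placeholders.
import dask.dataframe as dd

# Lazily read many Parquet part-files as one logical dataframe.
interactions = dd.read_parquet("s3://example-bucket/interactions/*.parquet")

# Map-reduce: count events per (user_id, course_id) partition-wise, then combine.
view_counts = interactions.groupby(["user_id", "course_id"])["event_id"].count()

# Trigger the parallel computation (locally or on a Dask cluster).
result = view_counts.compute()
```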

As our organization adopts Kubernetes, we will be working to move more and more of the framework itself onto that platform. In particular, we already have Kubernetes manifests for our Dask cluster and for a simple Airflow deployment; however, we would like to get to the point where we are using the Kubernetes executor instead of running the scheduler, workers, and web server as separate pods. Our internal API layer, now usually expressed as Flask APIs, is easy to transfer to this environment, and the same holds for the TensorFlow Serving containers. Another area of interest is leveraging distributed training for TensorFlow models over the Kubernetes cluster.
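To illustrate what that API layer looks like, here is a minimal Flask sketch that handles feature construction on behalf of clients and forwards the request to TensorFlow Serving's REST predict endpoint. The model name, serving host, and feature logic are placeholders.

```python
# Minimal sketch of an API layer in front of TensorFlow Serving's REST API.
# Model name, serving host, and feature construction are placeholders.
import requests
from flask import Flask, jsonify

app = Flask(__name__)
TF_SERVING_URL = "http://tf-serving:8501/v1/models/course_recs:predict"


def build_features(user_id):
    """Placeholder feature construction hidden from API clients."""
    return {"user_id": user_id}


@app.route("/recommendations/<int:user_id>")
def recommendations(user_id):
    payload = {"instances": [build_features(user_id)]}
    response = requests.post(TF_SERVING_URL, json=payload, timeout=2)
    response.raise_for_status()
    return jsonify(response.json()["predictions"])


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```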

All in all, this framework has held for a few years now and has allowed us to rapidly scale machine learning as a discipline within Pluralsight, to the point that key product features would be severely diminished without it. We will continue experimenting with new pieces of technology to fill each role in the stack, and we will keep the community updated as we find successes.
