When you think about machine learning, you usually only think about the great models that you can now create. After all, that’s what many of the research papers are focused on. But when you want to take those amazing models and make them available to the world, you need to think about all the things that a production solution requires — monitoring, reliability, validation, etc. That’s why Google created TensorFlow Extended (TFX) — to provide production-grade support for our machine learning (ML) pipelines. We are sharing this with the open source community so that developers everywhere can create and deploy their models on production-grade TFX pipelines.
Google created TFX because we needed it, and there was nothing already available that could meet our needs. Google, and more generally, Alphabet, makes extensive use of ML in most of our products. In fact, TFX wasn’t the first ML pipeline framework that Google created. It evolved out of earlier attempts and is now the default framework for the majority of Google’s ML production solutions. Beyond Google, TFX has also had a deep impact on our partners, including Twitter, Airbnb, and PayPal.
It’s not just about ML
When you start planning for incorporating ML into an application, you have all the normal ML things to think about. This includes getting labeled data if you’re doing supervised learning, and making sure that your dataset covers the space of possible inputs well. You also want to minimize the dimensionality of your feature set while maximizing the predictive information it contains. And you need to think about fairness, and make sure that your application won’t be unfairly biased. You also need to consider rare conditions, especially in applications like healthcare where you might be making predictions for conditions that only occur in rare but important situations. And finally you need to consider that this will be a living solution that will evolve over time as new data flows in and conditions change, and plan for lifecycle management of your data.
But in addition to all that, you need to remember that you’re putting a software application into production. That means that you still have all the requirements that any production software application has, including scalability, consistency, modularity, and testability, as well as safety and security. You’re way beyond just training a model now! By themselves these are challenges for any production software deployment, and you can’t forget about them just because you’re doing ML. How are you going to meet all these needs, and get your amazing new model into production?
To tie this all together Google has created some horizontal layers for things like pipeline storage, configuration, and orchestration. These layers are really important for managing and optimizing your pipelines and the applications that you run on them.
ML in production presents many challenges, and Google doesn’t pretend to have all the answers. This is an evolving field in the ML community, and we welcome contributions. This paper provides a great overview of the challenges of machine learning in production environments.
What are “pipelines” and “components”?
TFX pipelines are created as a sequence of components, each of which performs a different task. Components are organized into Directed Acyclic Graphs, or “DAGs”. But what exactly is a component?
A TFX component has three main parts: a driver, an executor, and a publisher. Two of these parts — the driver and publisher — are mostly boilerplate code that you could change, but probably will never need to. The executor is really where you insert your code and do your customization.
The driver inspects the state of the world and decides what work needs to be done, coordinating job execution and feeding metadata to the executor. The publisher takes the results of your executor and updates the metadata store. But the executor is really where the work is done for each of the components.
So first, you need a configuration for your component, and with TFX that configuration is done using Python. Next, you need some input for your component, and a place to send your results. That’s where the metadata store comes in. We’ll talk more about the metadata store in a bit, but for now just be aware that for most components the input metadata will come from the metadata store, and the resulting metadata will be written back to the metadata store.
So as your data moves through your pipeline, components will read metadata that was produced by an earlier component, and write metadata that will probably be used by a later component farther down the pipeline. There are some exceptions, like at the beginning and end of the pipeline, but for the most part that’s how data flows through a TFX pipeline.
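To make the driver, executor, and publisher roles concrete, here is a minimal sketch in plain Python. It is not the TFX API: the `MetadataStore` and `Component` classes, the artifact keys, and the `/tmp/...` paths are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class MetadataStore:
    """Hypothetical in-memory stand-in for a metadata store."""
    artifacts: Dict[str, dict] = field(default_factory=dict)


@dataclass
class Component:
    """Hypothetical component: driver, then executor, then publisher."""
    name: str
    input_keys: list
    output_key: str
    executor: Callable[[Dict[str, dict]], dict]  # your custom code goes here

    def driver(self, store: MetadataStore) -> Dict[str, dict]:
        # Inspect the store and resolve the input metadata this run needs.
        return {key: store.artifacts[key] for key in self.input_keys}

    def publisher(self, store: MetadataStore, result: dict) -> None:
        # Record the executor's result so later components can read it.
        store.artifacts[self.output_key] = result

    def run(self, store: MetadataStore) -> None:
        inputs = self.driver(store)
        result = self.executor(inputs)
        self.publisher(store, result)


store = MetadataStore(artifacts={"examples": {"uri": "/tmp/examples", "rows": 100}})
stats_gen = Component(
    name="stats_gen",
    input_keys=["examples"],
    output_key="statistics",
    executor=lambda inputs: {"uri": "/tmp/stats", "rows_seen": inputs["examples"]["rows"]},
)
stats_gen.run(store)
print(store.artifacts["statistics"])
```

Only the executor is supplied by the caller here; the driver and publisher stay generic, which mirrors the idea that those two parts are mostly boilerplate.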
Orchestrating a TFX pipeline
To organize all these components and manage these pipelines, you need orchestration. But what is orchestration exactly, and how does it help you?
To put an ML pipeline together, define the sequence of components that make up the pipeline, and manage their execution, you need an orchestrator. An orchestrator provides a management interface that you can use to trigger tasks and monitor your components.
If all that you need to do is kick off the next stage of the pipeline, task-aware architectures are enough. You can simply start the next component as soon as the previous component finishes. But a task- and data-aware architecture is much more powerful, and really almost a requirement for any production system, because it stores all the artifacts from every component over many executions. Having that metadata creates a much more powerful pipeline and enables a lot of things which would be very difficult otherwise, so TFX implements a task- and data-aware pipeline architecture.
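The difference between the two architectures can be sketched with the standard library. This is an illustration only, not how TFX or any particular orchestrator is implemented: the component names, the `dag` dictionary, and the `artifact_log` list are assumptions made for the example.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical pipeline: each component maps to the components it depends on.
dag = {
    "example_gen": set(),
    "statistics_gen": {"example_gen"},
    "schema_gen": {"statistics_gen"},
    "trainer": {"example_gen", "schema_gen"},
}

artifact_log = []  # every artifact from every run, kept across executions


def run_pipeline(run_id: int) -> list:
    order = list(TopologicalSorter(dag).static_order())
    for component in order:
        # Task-aware: all we know is that upstream components have finished.
        # Data-aware: we also record what this execution produced.
        artifact_log.append({
            "run": run_id,
            "component": component,
            "artifact": f"{component}/run_{run_id}",
        })
    return order


first_order = run_pipeline(1)
run_pipeline(2)
print(first_order)
print(len(artifact_log))
```

A purely task-aware runner would stop at computing `order`. Keeping `artifact_log` across runs is what later makes comparisons, caching, and lineage queries possible.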
One of the ways that TFX is open and extendable is with orchestration. Google provides support for Apache Airflow and Kubeflow out of the box, but you can write code to use a different orchestrator if you need to. If you’ve already got a workflow engine that you like, you can build a runner to use it with TFX.
Why should you store metadata?
TFX implements a metadata store using ML-Metadata (MLMD), which is an open source library to define, store, and query metadata for ML pipelines. MLMD stores the metadata in a relational backend. The current implementation supports SQLite and MySQL out of the box, but you can write code to extend ML-Metadata for basically any SQL-compatible database. But what exactly do you store in your metadata store?
First, we store information about the models that you’ve trained, the data that you trained them on, and their evaluation results. We refer to this type of metadata as “artifacts”, and artifacts have properties. The data itself is stored outside the database, but the properties and location of the data are kept in the metadata store.
Next, we keep execution records for every component, each time it was run. Remember that an ML pipeline is often run frequently over a long lifetime as new data comes in or conditions change, so keeping that history becomes important for debugging, reproducibility, and auditing.
Finally, we also include the lineage or provenance of the data objects as they flow through the pipeline. That allows you to track forward and backward through the pipeline to understand the origins and results of running your components as your data and code change. This is really important when you need to optimize or debug your pipeline, which would be quite hard without it.
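The three kinds of records described above (artifacts, executions, and lineage) can be sketched with an in-memory SQLite database. The table layout, column names, and URIs below are assumptions made for illustration and are not MLMD’s actual schema or API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artifacts  (id INTEGER PRIMARY KEY, type TEXT, uri TEXT);
    CREATE TABLE executions (id INTEGER PRIMARY KEY, component TEXT);
    -- Lineage: which execution read (INPUT) or wrote (OUTPUT) which artifact.
    CREATE TABLE events     (execution_id INTEGER, artifact_id INTEGER, kind TEXT);
""")
conn.executemany("INSERT INTO artifacts VALUES (?, ?, ?)", [
    (1, "Examples", "/data/examples/1"),
    (2, "TransformGraph", "/data/transform/1"),
    (3, "Model", "/data/model/1"),
])
conn.executemany("INSERT INTO executions VALUES (?, ?)", [(1, "Transform"), (2, "Trainer")])
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    (1, 1, "INPUT"), (1, 2, "OUTPUT"),
    (2, 1, "INPUT"), (2, 2, "INPUT"), (2, 3, "OUTPUT"),
])


def upstream_artifacts(artifact_id: int) -> set:
    """Walk lineage backward: everything that contributed to this artifact."""
    found, frontier = set(), [artifact_id]
    while frontier:
        current = frontier.pop()
        rows = conn.execute("""
            SELECT i.artifact_id FROM events o
            JOIN events i ON i.execution_id = o.execution_id AND i.kind = 'INPUT'
            WHERE o.artifact_id = ? AND o.kind = 'OUTPUT'
        """, (current,)).fetchall()
        for (parent,) in rows:
            if parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found


print(sorted(upstream_artifacts(3)))
```

Note that only the location (`uri`) of each artifact is stored in the database, matching the point that the data itself lives outside the metadata store.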
Now that you have some idea what’s in your metadata store, let’s look at some of the functionality that you get from it.
First, having the lineage or provenance of all of your data artifacts allows you to trace forward and backward in your pipeline, for example to see what data your model was trained with, or what impact some new feature engineering had on your evaluation metrics. In some use cases, this ability to trace the origins and results of your data may even be a regulatory or legal requirement.
Remember that it’s not just for today’s model or today’s results. You’re likely also interested in understanding how your data and results change over time as you take in new data and retrain your model. You often want to compare to model runs that you ran yesterday, or last week, to understand why your results got better or worse. Production solutions aren’t one-time things, they live for as long as you need them, and that can be months or years.
You can also make your pipeline much more efficient by only rerunning components when necessary, and using a warm start to continue training. Remember that you’re often dealing with large datasets that can take hours or days to run. If you’ve already trained your model for a day and you want to train it some more, you can start from where you left off instead of starting over from the beginning. That’s much easier if you’ve saved information about your model in metadata.
You can also make the other components of your pipeline much more efficient by only rerunning them when the input or code has changed. Instead of rerunning the component, you can just pull the previous results from cache. For example, if a new run of the pipeline only changes parameters of the trainer, then the pipeline can reuse any data-preprocessing artifacts such as vocabularies, and this can save a lot of time given that large data volumes make data preprocessing expensive. With TFX and MLMD this reuse comes out of the box, while you see a simpler “run pipeline” interface and don’t have to worry about manually selecting which components to run. Again, that can save you hours of processing. That’s much easier if you’ve saved your component’s input and results in metadata.
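One way to picture this reuse is a cache keyed by a fingerprint of the component name, its inputs, and its code version. The sketch below is an assumption about how such caching could work in general, not a description of TFX’s internal mechanism. The names `fingerprint`, `run_cached`, and `build_vocab` are hypothetical.

```python
import hashlib
import json

cache = {}        # fingerprint -> previously published result
executions = []   # names of components that actually ran


def fingerprint(component: str, inputs: dict, code_version: str) -> str:
    payload = json.dumps([component, inputs, code_version], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(component: str, inputs: dict, code_version: str, fn):
    key = fingerprint(component, inputs, code_version)
    if key not in cache:
        executions.append(component)
        cache[key] = fn(inputs)
    return cache[key]


def build_vocab(inputs):
    return sorted(set(inputs["tokens"]))


data = {"tokens": ["b", "a", "b"]}
run_cached("transform", data, "v1", build_vocab)  # runs
run_cached("transform", data, "v1", build_vocab)  # served from cache
vocab = run_cached("transform", data, "v2", build_vocab)  # code changed, runs again
print(executions)
```

The second call does no work because neither the input nor the code version changed; the third call reruns because the code version did.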
Components? What kind of components?
So now that you have an orchestrator, let’s talk about the components that come standard with TFX.
But first, let’s talk about Apache Beam
But before we talk about the standard components, let’s talk about Apache Beam. To handle distributed processing of large amounts of data, especially for compute-intensive workloads like ML, you really need a distributed processing pipeline framework like Apache Spark, or Apache Flink, or Google Cloud Dataflow. Most of the TFX components run on top of Apache Beam, which is a unified programming model that can run on several execution engines. Beam allows you to use the distributed processing framework you already have, or choose one that you like, rather than forcing you to use the one that we chose. Currently Beam Python can run on Flink, Spark, and Dataflow runners, but new runners are being added. It also includes a direct runner, which enables you to run a TFX pipeline in development on your local system, like your laptop.
TFX includes a fairly complete set of standard components when you first install it, each designed for a different part of a production ML pipeline. The Transform component for example uses Apache Beam to perform feature engineering transformations, like creating a vocabulary or doing principal component analysis (PCA). Those transformations could be running on your Flink or Spark cluster, or on the Google Cloud using Dataflow. Thanks to the portability of Apache Beam you could migrate between them without changing your code.
The Trainer component really just uses TensorFlow. Remember when all you were thinking about was training your amazing model? That’s the code you’re using here. Note that currently TFX only supports tf.estimators. Other information on compatibility is listed here.
Some components are very simple. The Pusher component for example only needs Python to do its job.
Now let’s look at each of these components in a little more detail.
Read in data
First, you ingest your input data using ExampleGen. ExampleGen is one of the components that runs on Beam. It reads in data from various supported sources and types, splits it into training and eval, and formats it as tf.examples. The configuration for ExampleGen is very simple, just two lines of Python.
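The splitting step can be illustrated with a deterministic hash-based assignment. This is a plain-Python sketch, not ExampleGen’s code or configuration: the 2:1 train-to-eval ratio, the `assign_split` helper, and the record fields are assumptions made for the example.

```python
import zlib


def assign_split(record_id: str, eval_buckets: int = 1, total_buckets: int = 3) -> str:
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    bucket = zlib.crc32(record_id.encode("utf-8")) % total_buckets
    return "eval" if bucket < eval_buckets else "train"


records = [{"id": f"trip-{i}", "fare": 5.0 + i} for i in range(300)]
splits = {"train": [], "eval": []}
for record in records:
    splits[assign_split(record["id"])].append(record)

print(len(splits["train"]), len(splits["eval"]))
```

Hashing a stable identifier means the same record always lands in the same split, even when the pipeline is rerun on a dataset that has grown.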
Next, StatisticsGen makes a full pass over the data using Beam, one full epoch, and calculates descriptive statistics for each of your features. To do that it leverages the TensorFlow Data Validation (TFDV) library, which includes support for some visualization tools that you can run in a Jupyter notebook. That lets you explore and understand your data, and find any issues that you may have. This is typical data wrangling stuff, the same thing we all do when we’re preparing our data to train our model.
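As a rough picture of what “descriptive statistics for each of your features” means, here is a small standard-library sketch. It does not use TFDV, and the `describe` helper and the example rows are hypothetical.

```python
import statistics

examples = [
    {"fare": 5.0, "payment": "cash"},
    {"fare": 12.5, "payment": "card"},
    {"fare": None, "payment": "card"},
    {"fare": 7.5, "payment": "cash"},
]


def describe(rows, feature):
    """One pass over the rows, summarizing a single feature."""
    values = [row[feature] for row in rows if row[feature] is not None]
    summary = {"count": len(values), "missing": len(rows) - len(values)}
    if values and all(isinstance(v, (int, float)) for v in values):
        summary.update(min=min(values), max=max(values), mean=statistics.mean(values))
    else:
        summary["unique"] = sorted(set(values))
    return summary


stats = {feature: describe(examples, feature) for feature in examples[0]}
print(stats)
```

In a real pipeline this pass would be distributed over the whole dataset; the point here is only the shape of the output, one summary per feature.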
The next component, SchemaGen, also uses the TensorFlow Data Validation library. It looks at the statistics which were generated by StatisticsGen and tries to infer the basic properties of your features, including data types of feature values, value ranges, and categories. You should examine and adjust the schema as needed, for example adding new categories that you expect to see.
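Schema inference followed by manual curation might look like the sketch below. The schema format, the `infer_schema` helper, and the `max_categories` threshold are assumptions for illustration; TFDV’s real schema is richer than this.

```python
def infer_schema(stats: dict, max_categories: int = 10) -> dict:
    """Guess a type and a range or domain for each feature from its statistics."""
    schema = {}
    for feature, summary in stats.items():
        if "min" in summary:
            schema[feature] = {"type": "FLOAT", "range": (summary["min"], summary["max"])}
        elif len(summary["unique"]) <= max_categories:
            schema[feature] = {"type": "CATEGORICAL", "domain": set(summary["unique"])}
        else:
            schema[feature] = {"type": "STRING"}
    return schema


stats = {
    "fare": {"count": 3, "missing": 1, "min": 5.0, "max": 12.5},
    "payment": {"count": 4, "missing": 0, "unique": ["card", "cash"]},
}
schema = infer_schema(stats)

# Curate the inferred schema: allow a category you expect to see later.
schema["payment"]["domain"].add("voucher")
print(schema)
```

The last step is the important one: an inferred schema only reflects the data seen so far, so it is a starting point that you adjust with your own domain knowledge.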
The next component, ExampleValidator, takes the statistics from StatisticsGen and the schema (which may be the output of SchemaGen or the result of user curation) and looks for problems. It looks for different classes of anomalies, including missing values or values that don’t match your schema, training-serving skew, and data drift, and produces a report of what it finds. Remember that you’re taking in new data all the time, so you need to be aware of problems when they pop up.
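A simplified version of the schema check is sketched below. It covers only values that fall outside the schema, not training-serving skew or drift, and the `validate` helper and its report format are hypothetical rather than TFDV’s API.

```python
def validate(stats: dict, schema: dict) -> list:
    """Compare new statistics against the schema and list what looks wrong."""
    anomalies = []
    for feature, rules in schema.items():
        if feature not in stats:
            anomalies.append(f"{feature}: missing from the data")
            continue
        summary = stats[feature]
        if "range" in rules:
            low, high = rules["range"]
            if summary["min"] < low or summary["max"] > high:
                anomalies.append(f"{feature}: values outside [{low}, {high}]")
        if "domain" in rules:
            unexpected = set(summary["unique"]) - rules["domain"]
            if unexpected:
                anomalies.append(f"{feature}: unexpected values {sorted(unexpected)}")
    return anomalies


schema = {"fare": {"range": (0.0, 100.0)}, "payment": {"domain": {"cash", "card"}}}
new_stats = {"fare": {"min": -3.0, "max": 40.0}, "payment": {"unique": ["card", "crypto"]}}
report = validate(new_stats, schema)
for line in report:
    print(line)
```

Running a check like this on every new batch of data is what lets you notice problems when they first appear, instead of after the model has already been retrained on bad inputs.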
Transform is one of the more complex components, and requires a bit more configuration as well as additional code. Transform uses Beam to do feature engineering, applying transformations to your features to improve the performance of your model. For example, Transform can create vocabularies, or bucketize values, or run PCA over your input. The code that you write depends on what feature engineering you need to do for your model and dataset.
Transform will make a full pass over your data, one full epoch, and create two different kinds of results. For things like calculating the median or standard deviation of a feature, numbers which are the same for all examples, Transform will output a constant. For things like normalizing a value, values which will be different for different examples, Transform will output TensorFlow Ops.
Transform will then output a TensorFlow graph with those constants and ops. That graph is hermetic, so it contains all of the information you need to apply those transformations, and will form the input stage for your model. That means that the same transformations are applied consistently between training and serving, which eliminates training/serving skew. If instead you’re moving your model from a training environment into a serving environment or application, and trying to apply the same feature engineering in both places, you hope that the transformations are the same but sometimes you find that they’re not. We call that training/serving skew, and Transform eliminates it by using exactly the same code anywhere you run your model.
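The split between constants and per-example operations can be sketched in plain Python. Transform itself emits a TensorFlow graph; here a closure stands in for that graph, and the `analyze` and `make_transform` helpers and the sample values are assumptions made for illustration.

```python
import statistics


def analyze(values):
    """Full pass over the data: compute constants shared by all examples."""
    return {"mean": statistics.mean(values), "stdev": statistics.pstdev(values)}


def make_transform(constants):
    """Return a per-example operation with the constants baked in."""
    def z_score(value):
        return (value - constants["mean"]) / constants["stdev"]
    return z_score


training_fares = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
constants = analyze(training_fares)
scale = make_transform(constants)

# The same function object is used for a training example and a serving request.
training_value = scale(training_fares[-1])
serving_value = scale(9.0)
print(constants, training_value, serving_value)
```

Because `scale` carries its constants with it, there is no second implementation of the feature engineering to keep in sync at serving time, which is the property that removes training/serving skew.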
Training your model
Now you’re finally ready to train your model, the part of the process that you often think about when you think about machine learning. Trainer takes in the transform graph and data from Transform, and the schema from SchemaGen, and trains a model using your modeling code. This is normal model training, but when training is complete Trainer will save two different SavedModels. One is a SavedModel that will be deployed to production, and the other is an EvalSavedModel that will be used for analyzing the performance of your model.
The configuration for Trainer is what you’d expect, things like the number of steps and whether or not to use warm starting. The code that you create for Trainer is your modeling code, so it can be as simple or complex as you need it to be.
To monitor and analyze the training process you can use TensorBoard, just like you would normally. In this case you can look at the current model training run or compare the results from multiple model training runs. This is only possible because of the ML-Metadata store, which was discussed above. TFX makes it fairly easy to do this kind of comparison, which is often revealing.
Now that you’ve trained your model, how do the results look? The Evaluator component will take the EvalSavedModel that Trainer created, and the original input data, and do deep analysis using Beam and the TensorFlow Model Analysis library. It’s not just looking at the top level results across your whole dataset. It’s looking deeper than that, at individual slices of your dataset. That’s important, because the experience that each user of your model has will depend on their individual data point. Your model may do well over your entire dataset, but if it does poorly on the data point that a user gives it, that user’s experience is poor. We’ll talk about this more in a future post.
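To see why slices matter, consider the small sketch below, which computes accuracy overall and per slice. It does not use TensorFlow Model Analysis. The `accuracy_by_slice` helper, the `hour` feature, and the rows are made up for the example.

```python
from collections import defaultdict


def accuracy_by_slice(rows, slice_key):
    totals = defaultdict(lambda: [0, 0])  # slice value -> [correct, seen]
    for row in rows:
        bucket = totals[row[slice_key]]
        bucket[0] += int(row["label"] == row["prediction"])
        bucket[1] += 1
    return {value: correct / seen for value, (correct, seen) in totals.items()}


rows = (
    [{"hour": "day", "label": 1, "prediction": 1}] * 9
    + [{"hour": "day", "label": 1, "prediction": 0}] * 1
    + [{"hour": "night", "label": 1, "prediction": 0}] * 3
    + [{"hour": "night", "label": 1, "prediction": 1}] * 1
)

overall = sum(row["label"] == row["prediction"] for row in rows) / len(rows)
by_hour = accuracy_by_slice(rows, "hour")
print(round(overall, 3), by_hour)
```

In this made-up data the overall accuracy hides the fact that the night slice performs far worse than the day slice, which is exactly the kind of problem a top-level metric cannot show.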
So now that you’ve looked at your model’s performance, should you push it to production? Is it better or worse than what you already have in production? You probably don’t want to push a worse model just because it’s new. So the ModelValidator component uses Beam to do that comparison, using criteria that you define, to decide whether or not to push the new model to production.
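The decision itself can be as simple as comparing a metric against the production baseline. The sketch below is an assumption about what such a criterion could look like, not ModelValidator’s implementation. The `should_push` helper, the metric name, and the numbers are hypothetical.

```python
def should_push(candidate: dict, baseline: dict, metric: str = "accuracy",
                min_improvement: float = 0.0) -> bool:
    """Approve the candidate only if it beats production by the required margin."""
    return candidate[metric] - baseline[metric] >= min_improvement


production = {"accuracy": 0.91}
better = {"accuracy": 0.93}
worse = {"accuracy": 0.89}

print(should_push(better, production))
print(should_push(worse, production))
print(should_push(better, production, min_improvement=0.05))
```

The `min_improvement` parameter is one example of a criterion that you define: it lets you require a clear gain before replacing a model that is already serving users.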
Where do you go from here?
The goal of this post was to give you a basic overview of TFX and ML pipelines in general, and to introduce the main concepts. In the posts to follow we’ll dig deeper into TFX, including discussing ways that you can extend TFX to make it fit your needs. TFX is open source, so Google is also encouraging the software and ML community to help us make it better. A good place to start would be to try the TFX developer tutorial!
TensorFlow Extended: Machine Learning Pipelines and Model Understanding (Google I/O’19)
TFX: A TensorFlow-Based Production-Scale Machine Learning Platform (KDD 2017)