From Research to Production with TFX Pipelines and ML Metadata

Posted by Jarek Wilkiewicz on behalf of the TFX team

If your code runs in production, you are probably already familiar with version control / software configuration management (SCM), continuous integration and continuous deployment (CI/CD), as well as many other software engineering best practices. These took years to develop, and now we often take them for granted. Much like how writing an efficient algorithm implementation is just the beginning of a software engineer’s journey, machine learning (ML) model code typically represents only 5% of the overall system¹ required to deploy it to production. At Google, we’ve also been working on improving the remaining 95% over many years². A fruit of our labour, TensorFlow Extended (TFX³), aims to introduce the benefits of software engineering discipline to the fast-growing space of ML. In an upcoming series of blog posts, we’ll highlight what’s new in TFX and show you how TFX can help you build and deploy your ML models to production environments.

Until recently, only the underlying TFX libraries (TensorFlow Data Validation, TensorFlow Transform, TensorFlow Model Analysis, TensorFlow Serving) were available in open source, which meant that developers still had to build their own ML pipeline components using the libraries. You can now create a complete TFX ML pipeline using several ready-made components, configure them for many typical ML use cases with a high-level Python API, and execute them with an orchestration system of your choice, such as Apache Airflow or Kubeflow, as shown in the figure below.
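To give a feel for the high-level Python API, here is a sketch of a pipeline configuration wired from ready-made components. The paths and pipeline name are placeholders, and the constructor arguments shown follow early TFX releases, so the exact names may differ in the version you install; treat this as an illustration of the shape of a pipeline, not a definitive recipe.

```python
# Illustrative TFX pipeline sketch. Paths are placeholders and the exact
# component/argument names may vary between TFX releases.
from tfx.components import CsvExampleGen, StatisticsGen, SchemaGen, Trainer
from tfx.orchestration import pipeline
from tfx.utils.dsl_utils import csv_input

# Ingest CSV data and split it into training/eval examples.
examples = csv_input("/path/to/data")  # placeholder data location
example_gen = CsvExampleGen(input_base=examples)

# Compute dataset statistics and infer a schema from them.
statistics_gen = StatisticsGen(input_data=example_gen.outputs.examples)
schema_gen = SchemaGen(stats=statistics_gen.outputs.output)

# Train a model using user-supplied training code in a module file.
trainer = Trainer(
    module_file="/path/to/trainer_module.py",  # placeholder user code
    examples=example_gen.outputs.examples,
    schema=schema_gen.outputs.output)
    # further arguments (train/eval args, etc.) omitted for brevity

tfx_pipeline = pipeline.Pipeline(
    pipeline_name="my_pipeline",           # placeholder name
    pipeline_root="/path/to/pipeline_root",  # placeholder artifact root
    components=[example_gen, statistics_gen, schema_gen, trainer])
# The pipeline object is then handed to an orchestrator-specific runner,
# e.g. one for Apache Airflow or for Kubeflow.
```

Because the pipeline is just a Python object, the same definition can be executed by different orchestrators without rewriting the components themselves.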

When the TFX pipeline executes, ML Metadata (MLMD, another Google open source project) keeps track of the artifacts that pipeline components depend upon (e.g. training data) and produce (e.g. vocabularies and models). ML Metadata is available as a standalone library and has also been integrated with TFX components for your convenience. MLMD allows you to discover the lineage of an artifact (for example, what data a model was trained on), find all artifacts created from an artifact (for example, all models trained on a specific dataset), and enables many other use cases.
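To make those two lineage queries concrete, here is a toy, conceptual sketch of the kind of bookkeeping MLMD performs. This is not the ml-metadata API; it uses a made-up two-table schema in stdlib sqlite3 purely to illustrate "what data was this model trained on" and "which models came from this dataset".

```python
import sqlite3

# Conceptual sketch only -- NOT the real ml-metadata (MLMD) API or schema.
# Artifacts (datasets, vocabularies, models) are linked to executions
# (component runs) via input/output events; lineage queries walk these links.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artifact (id INTEGER PRIMARY KEY, type TEXT, uri TEXT);
CREATE TABLE event (execution_id INTEGER, artifact_id INTEGER, kind TEXT);
""")

# Toy history: one dataset feeds a Transform run (producing a vocabulary)
# and two Trainer runs (producing two models).
conn.executemany("INSERT INTO artifact VALUES (?, ?, ?)", [
    (1, "Examples", "/data/train"),
    (2, "Vocabulary", "/vocab/v1"),
    (3, "Model", "/models/a"),
    (4, "Model", "/models/b")])
conn.executemany("INSERT INTO event VALUES (?, ?, ?)", [
    (100, 1, "input"), (100, 2, "output"),   # Transform run
    (101, 1, "input"), (101, 3, "output"),   # Trainer run A
    (102, 1, "input"), (102, 4, "output")])  # Trainer run B

def inputs_of(artifact_id):
    """Lineage: which artifacts fed the execution that produced this one?"""
    rows = conn.execute("""
        SELECT a.uri FROM event out
        JOIN event inp ON inp.execution_id = out.execution_id
                      AND inp.kind = 'input'
        JOIN artifact a ON a.id = inp.artifact_id
        WHERE out.artifact_id = ? AND out.kind = 'output'""",
        (artifact_id,)).fetchall()
    return [r[0] for r in rows]

def downstream_of(artifact_id, type_filter):
    """All artifacts of a given type produced from this artifact."""
    rows = conn.execute("""
        SELECT a.uri FROM event inp
        JOIN event out ON out.execution_id = inp.execution_id
                      AND out.kind = 'output'
        JOIN artifact a ON a.id = out.artifact_id
        WHERE inp.artifact_id = ? AND inp.kind = 'input' AND a.type = ?""",
        (artifact_id, type_filter)).fetchall()
    return [r[0] for r in rows]

print(inputs_of(3))               # data that model /models/a was trained on
print(downstream_of(1, "Model"))  # all models trained on /data/train
```

The real MLMD library records richer metadata (artifact properties, execution state, contexts) and exposes proper APIs for these traversals, but the idea is the same: because every component run logs its inputs and outputs, lineage becomes a query rather than detective work.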

To better understand how this all fits together check out the Google I/O ’19 presentation: “TensorFlow Extended (TFX): Machine Learning Pipelines and Model Understanding”.

In our next TFX blog post, we will describe the TFX pipeline components in more detail. Until then, please try the TFX developer tutorial. You’ll follow a typical ML development process, starting by examining the dataset and ending up with a complete working ML pipeline. If you have TFX questions, please reach out to us on Stack Overflow; bug reports and pull requests are always welcome on GitHub, and we invite general discussion at tfx@tensorflow.org.


[1] Sculley, D., Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo and Dan Dennison. “Hidden Technical Debt in Machine Learning Systems.” NIPS (2015).

[2] Tushar Chandra, “Sibyl: A System for Large Scale Machine Learning at Google”, IEEE DSN (Dependable Systems and Networks) conference keynote, Atlanta, GA, June 25th 2014.

[3] Denis Baylor, Eric Breck, Heng-Tze Cheng, Noah Fiedel, Chuan Yu Foo, Zakaria Haque, Salem Haykal, Mustafa Ispir, Vihan Jain, Levent Koc, Chiu Yuen Koo, Lukasz Lew, Clemens Mewald, Akshay Naresh Modi, Neoklis Polyzotis, Sukriti Ramesh, Sudip Roy, Steven Euijong Whang, Martin Wicke, Jarek Wilkiewicz, Xin Zhang, and Martin Zinkevich. 2017. TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’17). ACM, New York, NY, USA, 1387–1395. DOI: https://doi.org/10.1145/3097983.3098021.