Introducing Fabric for Deep Learning (FfDL)

Animesh Singh
IBM watsonx Assistant
Apr 12, 2018

This post is co-authored by Animesh Singh and Scott Boag, and is an updated version of a post on IBM Developer Works by the same authors.

According to Gartner, the ability to use AI to enhance decision making, reinvent business models and ecosystems, and remake the customer experience will pay off for digital initiatives through 2025. Companies are collecting huge amounts of data; they want to use that data to train deep learning models, and they want those deep learning capabilities offered as a service in an easily consumable way.

Training deep neural network models requires a highly tuned system with the right combination of software, drivers, compute, memory, network, and storage resources. To address the challenges around obtaining and managing these resources, we are happy to announce the launch of Fabric for Deep Learning (FfDL).

FfDL offers a stack that abstracts away these concerns so data scientists can execute training jobs with their choice of deep learning framework at scale in the cloud. It has been built to offer resilience, scalability, multi-tenancy, and security without modifying the deep learning frameworks, and with no or minimal changes to model code.

Jim Zemlin, Executive Director of The Linux Foundation, echoes these sentiments succinctly:

“Just as The Linux Foundation worked with IBM, Google, Red Hat and others to establish the open governance community for Kubernetes with the Cloud Native Computing Foundation, we see IBM’s release of Fabric for Deep Learning, or FfDL, as an opportunity to work with the open source community to align related open source projects, taking one more step toward making deep learning accessible. We think its origin as an IBM product will appeal to open source developers and enterprise end users.”

FfDL architecture

The FfDL platform uses a microservices architecture, with a focus on scalability, resiliency, and fault tolerance. According to one IDC survey, by 2021 enterprise apps will shift toward hyper-agile architectures, with 80% of application development on cloud platforms using microservices and functions, and over 95% of new microservices deployed in containers. And what better cloud-native platform to build on than Kubernetes? The FfDL control-plane microservices are deployed as pods, and we rely on Kubernetes to manage this cluster of GPU- and CPU-enabled machines effectively, to restart microservices when they crash, and to report on their health.
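To make that Kubernetes relationship concrete, here is a minimal sketch (using the official Kubernetes Python client) of how you might check on the FfDL control-plane pods. The "ffdl" namespace is an assumption for illustration, not a required install location.

```python
# Minimal sketch: inspect FfDL control-plane pods with the Kubernetes Python client.
# The namespace ("ffdl") is an illustrative assumption, not FfDL's required namespace.
from kubernetes import client, config

def report_ffdl_pod_health(namespace="ffdl"):
    config.load_kube_config()            # use the local kubeconfig (in-cluster config also works)
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace=namespace)
    for pod in pods.items:
        # Kubernetes reports the phase (Pending/Running/Succeeded/Failed) and per-container restart counts
        restarts = sum(cs.restart_count for cs in (pod.status.container_statuses or []))
        print(f"{pod.metadata.name:40s} {pod.status.phase:10s} restarts={restarts}")

if __name__ == "__main__":
    report_ffdl_pod_health()
```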

REST API

The REST API microservice handles REST-level HTTP requests and acts as a proxy to the lower-level gRPC Trainer service. The service also load-balances requests and is responsible for authentication. Load balancing is implemented by registering the REST API service instances dynamically in a service registry. The interface is specified through a Swagger definition file.
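As a rough sketch of what talking to the REST API could look like, the snippet below submits a training manifest and a zipped model definition with the Python requests library. The endpoint path, multipart field names, auth header, and response field are illustrative assumptions; the Swagger definition shipped with the service is the authoritative contract.

```python
# Minimal sketch: submit a training job through the FfDL REST API.
# Endpoint path, field names, and auth header are illustrative assumptions,
# not the definitive API contract (see the Swagger definition for that).
import requests

FFDL_REST_URL = "http://localhost:8080"  # assumed host:port of the REST API microservice

def submit_training_job(manifest_path, model_zip_path, user="test-user"):
    with open(manifest_path, "rb") as manifest, open(model_zip_path, "rb") as model_def:
        resp = requests.post(
            f"{FFDL_REST_URL}/v1/models",      # hypothetical job-submission endpoint
            files={
                "manifest": manifest,           # training manifest: framework, learners, resources, data stores
                "model_definition": model_def,  # zipped model code, largely unchanged from local development
            },
            headers={"X-Watson-Userinfo": f"bluemix-instance-id={user}"},  # assumed auth header
        )
    resp.raise_for_status()
    return resp.json()["model_id"]              # job identifier assigned by the Trainer (assumed field name)

# job_id = submit_training_job("manifest.yml", "model.zip")
```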

Trainer

The Trainer service admits training job requests, persisting the job metadata and model input configuration in a database (MongoDB). It initiates job deployment, halting, and (user-requested) termination by calling the appropriate gRPC methods on the Lifecycle Manager microservice. The Trainer also assigns a unique identifier to each job, which all other components use to track the job.
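The following is a minimal sketch of the kind of record the Trainer could persist when it admits a job; the collection name and field names are illustrative assumptions rather than FfDL's actual MongoDB schema.

```python
# Minimal sketch of a Trainer-style job record persisted in MongoDB.
# Collection and field names are illustrative assumptions, not FfDL's actual schema.
import uuid
import datetime
from pymongo import MongoClient

def admit_training_job(job_config, mongo_uri="mongodb://localhost:27017"):
    db = MongoClient(mongo_uri)["ffdl"]                    # assumed database name
    training_id = "training-" + uuid.uuid4().hex[:12]      # unique ID used by all other components
    db.training_jobs.insert_one({
        "training_id": training_id,
        "user_id": job_config["user_id"],
        "framework": job_config["framework"],              # e.g. {"name": "tensorflow", "version": "1.5"}
        "learners": job_config.get("learners", 1),
        "status": "PENDING",
        "submitted_at": datetime.datetime.utcnow(),
        "manifest": job_config,                            # full model input configuration
    })
    # Next step (not shown): call the Lifecycle Manager over gRPC to deploy training_id.
    return training_id
```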

Lifecycle Manager and learner pods

The Lifecycle Manager (LCM) deploys training jobs arriving from the Trainer and halts (pauses) or terminates them as needed. The LCM uses the Kubernetes cluster manager to deploy containerized training jobs. A training job is a set of interconnected Kubernetes pods, each containing one or more Docker containers.

The LCM determines the learner pods, parameter servers, and the interconnections among them based on the job configuration, and calls on Kubernetes for deployment. For example, if a user creates a Caffe2 training job with four learners and two CPUs/GPUs per learner, the LCM creates five pods: one for each learner (the learner pods) and one monitoring pod (the job monitor).
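A small sketch of that bookkeeping, with hypothetical pod names and resource fields (the real LCM builds full Kubernetes pod specifications):

```python
# Minimal sketch of how the LCM might derive the pod set for a job from its configuration.
# Pod names and resource fields are illustrative assumptions.
def plan_pods(training_id, learners=4, gpus_per_learner=2):
    pods = []
    for i in range(1, learners + 1):
        pods.append({
            "name": f"learner-{i}-{training_id}",
            "role": "learner",
            "resources": {"nvidia.com/gpu": gpus_per_learner},
        })
    pods.append({"name": f"jobmonitor-{training_id}", "role": "job-monitor"})
    return pods

# For the Caffe2 example above: 4 learner pods + 1 job monitor = 5 pods.
print(len(plan_pods("training-abc123")))  # -> 5
```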

Training Data Service

The Training Data Service (TDS) provides short-lived storage and retrieval for the logs and evaluation data produced by a deep learning training job. As a training job progresses, this information is needed to judge whether the ongoing learning is succeeding or failing. The metrics normally take the form of scalar values and are termed evaluation metrics (sometimes shortened to "emetrics"). Debugging information is also emitted as log lines.

While the training job is running, a process runs as a sidecar to extract the training data from the learner and push it to the TDS, which in turn indexes the data into ElasticSearch. These sidecars are termed log-collectors; depending on the framework and the desired extraction method, different types of log-collectors can be used. The name undersells them somewhat, since they are responsible for both log line collection and evaluation metrics extraction.
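For a feel of what a log-collector does, here is a minimal sketch that follows a learner's log file, forwards each log line, and pulls scalar evaluation metrics out of matching lines. The log format, file path, and push_to_tds() stub are illustrative assumptions, not FfDL's actual collectors.

```python
# Minimal sketch of a log-collector sidecar: tail the learner's log file, forward log lines,
# and extract scalar evaluation metrics ("emetrics") with a regex. All names here are
# illustrative assumptions, not FfDL's actual collector implementation.
import re
import time

METRIC_RE = re.compile(r"step\s+(?P<step>\d+).*loss\s*=\s*(?P<loss>[\d.]+)")

def push_to_tds(record):
    # Hypothetical stand-in for the call into the Training Data Service,
    # which in turn indexes the record into ElasticSearch.
    print("TDS <-", record)

def follow(path):
    with open(path) as f:
        while True:
            line = f.readline()
            if line:
                yield line.rstrip("\n")
            else:
                time.sleep(0.5)                  # wait for the learner to write more output

def collect(log_path="/job/logs/training.log"):  # assumed mount shared with the learner container
    for line in follow(log_path):
        push_to_tds({"type": "logline", "line": line})
        match = METRIC_RE.search(line)
        if match:
            push_to_tds({"type": "emetric",
                         "step": int(match["step"]),
                         "loss": float(match["loss"])})
```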

FfDL forms the core of Watson Studio Deep Learning Service

FfDL, developed in close collaboration with IBM Research and Watson product development teams, forms the core of our newly announced Deep Learning as a Service within Watson Studio. Watson Studio provides tools that support the end-to-end AI workflow in a public-cloud-hosted environment, with best-of-breed support for GPU resources on Kubernetes.

Watson Studio architecture enables flexible machine learning and introduces a new, scalable paradigm for deep learning (both for small teams and enterprises)

Join the revolution and democratize AI

Get started with FfDL today. Deploy it, use it, and extend it with capabilities that you find helpful. We’re waiting for your feedback and pull requests — let’s start the revolution and democratize AI!

Authors:

Animesh Singh is an STSM and lead for IBM Watson and Cloud Platform, currently leading Machine Learning and Deep Learning initiatives on IBM Cloud. He has been with IBM for more than a decade and works with communities and customers to design and implement Deep Learning, Machine Learning, and Cloud Computing frameworks. He has led cutting-edge projects for IBM enterprise customers in the telco, banking, and healthcare industries around cloud and virtualization technologies. He has a proven track record of driving the design and implementation of private and public cloud solutions from concept to production. He also led the design and development of the first IBM public cloud offering, and was the lead architect for Bluemix Local.

Find Animesh on Twitter @AnimeshSingh.

Scott Boag has been working for the past 30 years in client-server applications, document exchange/transformation technologies, and computer language design and compilation. His research interests include data transformation, markup languages, language compilation, language design, microservices, and visual representation architectures.

  • Lead developer, team leader, manager, Northern Telecom Compass, Electronic Documents Search and Delivery, CGM-based (1986–1992)
  • Founding member and major contributor, W3C XSLT/XQuery Working Groups (1998–2008)
  • Inventor of Xalan, a popular open source software library from the Apache Software Foundation; ASF member (1998–2005)
  • Developer of XQuery, JSONiq, and DFDL Compilation Technologies, IBM DataPower (2009–2015)
  • Developer, Fabric for Cloud-based Deep Learning (2015–2017)
