Distributed TensorFlow (translated)

This document shows how to create a cluster of TensorFlow servers, and how to distribute a computation graph across that cluster. We assume that you are familiar with the basic concepts of writing TensorFlow programs.

Hello distributed TensorFlow!

To see a simple TensorFlow cluster in action, execute the following:
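A minimal interactive sketch, assuming the TensorFlow 1.x API (tf.train.Server and tf.Session):

    $ python
    >>> import tensorflow as tf
    >>> c = tf.constant("Hello, distributed TensorFlow!")
    >>> server = tf.train.Server.create_local_server()
    >>> sess = tf.Session(server.target)  # Create a session on the in-process server.
    >>> sess.run(c)
    'Hello, distributed TensorFlow!'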

The tf.train.Server.create_local_server method creates a single-process cluster, with an in-process server.

Create a cluster

A TensorFlow “cluster” is a set of “tasks” that participate in the distributed execution of a TensorFlow graph. Each task is associated with a TensorFlow “server”, which contains a “master” that can be used to create sessions, and a “worker” that executes operations in the graph. A cluster can also be divided into one or more “jobs”, where each job contains one or more tasks.

To create a cluster, you start one TensorFlow server per task in the cluster. Each task typically runs on a different machine, but you can run multiple tasks on the same machine (e.g. to control different GPU devices). In each task, do the following:

  1. Create a tf.train.ClusterSpec that describes all of the tasks in the cluster. This should be the same for each task.
  2. Create a tf.train.Server, passing the tf.train.ClusterSpec to the constructor, and identifying the local task with a job name and task index.

Create a tf.train.ClusterSpec to describe the cluster

The cluster specification is a Python dictionary that maps job names to lists of network addresses. Pass this dictionary to the tf.train.ClusterSpec constructor. For example:
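Two sketches of such a specification; the example.com hostnames are placeholders:

    # A single job named "local" with two tasks.
    tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})

    # Two jobs: three "worker" tasks and two "ps" (parameter server) tasks.
    tf.train.ClusterSpec({
        "worker": [
            "worker0.example.com:2222",
            "worker1.example.com:2222",
            "worker2.example.com:2222"
        ],
        "ps": [
            "ps0.example.com:2222",
            "ps1.example.com:2222"
        ]})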

Create a tf.train.Server instance in each task

A tf.train.Server object contains a set of local devices, a set of connections to other tasks in its tf.train.ClusterSpec, and a tf.Session that can use these to perform a distributed computation. Each server is a member of a specific named job and has a task index within that job. A server can communicate with any other server in the cluster.

For example, to launch a cluster with two servers running on localhost:2222 and localhost:2223, run the following snippets in two different processes on the local machine:
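A sketch of those two snippets, assuming the TF 1.x tf.train API; server.join() simply blocks so each process keeps serving:

    # In task 0:
    import tensorflow as tf
    cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
    server = tf.train.Server(cluster, job_name="local", task_index=0)
    server.join()

    # In task 1:
    import tensorflow as tf
    cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
    server = tf.train.Server(cluster, job_name="local", task_index=1)
    server.join()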

Note: Manually specifying these cluster specifications can be tedious, especially for large clusters. We are working on tools for launching tasks programmatically, e.g. using a cluster manager like Kubernetes. If there are particular cluster managers for which you’d like to see support, please raise a GitHub issue.

Specifying distributed devices in your model

To place operations on a particular process, you can use the same tf.device function that is used to specify whether ops run on the CPU or GPU. For example:
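A sketch of such a placement, assuming a "ps" job with two tasks and a "worker" job; the layer sizes, the grpc:// session target, and the optimizer are illustrative placeholders:

    import tensorflow as tf

    # Model parameters live on the two tasks of the "ps" job.
    with tf.device("/job:ps/task:0"):
      weights_1 = tf.Variable(tf.random_normal([784, 100]))
      biases_1 = tf.Variable(tf.zeros([100]))

    with tf.device("/job:ps/task:1"):
      weights_2 = tf.Variable(tf.random_normal([100, 10]))
      biases_2 = tf.Variable(tf.zeros([10]))

    # The compute-intensive part of the model runs on a "worker" task.
    with tf.device("/job:worker/task:0"):
      inputs = tf.placeholder(tf.float32, [None, 784])
      labels = tf.placeholder(tf.float32, [None, 10])
      layer_1 = tf.nn.relu(tf.matmul(inputs, weights_1) + biases_1)
      logits = tf.matmul(layer_1, weights_2) + biases_2
      loss = tf.reduce_mean(
          tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))
      train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    # Connect to the master service of one of the servers to run the graph.
    with tf.Session("grpc://localhost:2222") as sess:
      sess.run(tf.global_variables_initializer())
      # sess.run(train_op, feed_dict={inputs: ..., labels: ...})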

In the above example, the variables are created on two tasks in the ps job, and the compute-intensive part of the model is created in the worker job. TensorFlow will insert the appropriate data transfers between the jobs (from ps to worker for the forward pass, and from worker to ps for applying gradients).

Replicated training

A common training configuration, called “data parallelism”, involves multiple tasks in a worker job training the same model on different mini-batches of data, updating shared parameters hosted in one or more tasks in a ps job. All tasks typically run on different machines. There are many ways to specify this structure in TensorFlow, and we are building libraries that will simplify the work of specifying a replicated model. Possible approaches include:

  • In-graph replication. In this approach, the client builds a single tf.Graph that contains one set of parameters (in tf.Variable nodes pinned to /job:ps), and multiple copies of the compute-intensive part of the model, each pinned to a different task in /job:worker.
  • Between-graph replication. In this approach, there is a separate client for each /job:worker task, typically in the same process as the worker task. Each client builds a similar graph containing the parameters (pinned to /job:ps as before, using tf.train.replica_device_setter to map them deterministically to the same tasks) and a single copy of the compute-intensive part of the model, pinned to the local task in /job:worker; see the sketch after this list.
  • Asynchronous training. In this approach, each replica of the graph has an independent training loop that executes without coordination. It is compatible with both forms of replication above.
  • Synchronous training. In this approach, all of the replicas read the same values for the current parameters, compute gradients in parallel, and then apply them together. It is compatible with in-graph replication (e.g. using gradient averaging as in the CIFAR-10 multi-GPU trainer), and with between-graph replication (e.g. using the tf.train.SyncReplicasOptimizer).
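A sketch of the between-graph device placement mentioned above, assuming the TF 1.x API; the cluster addresses and task index are placeholders, and the rest of the per-replica training loop is omitted:

    import tensorflow as tf

    cluster = tf.train.ClusterSpec({
        "ps": ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"]})

    # Each worker task runs its own copy of this client code, with its own task_index.
    server = tf.train.Server(cluster, job_name="worker", task_index=0)

    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
      # Variables are assigned round-robin to the /job:ps tasks;
      # all other ops stay on the local worker task.
      weights = tf.Variable(tf.zeros([784, 10]))
      biases = tf.Variable(tf.zeros([10]))
      # ... build the rest of this replica's graph and training op here ...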

Translator’s note: I have changed the ordering here, introducing the meaning of each term first.

(Figure: graphical summary)

Glossary

Client

A client is typically a program that builds a TensorFlow graph and constructs a tensorflow::Session to interact with a cluster. Clients are typically written in Python or C++. A single client process can directly interact with multiple TensorFlow servers (see “Replicated training” above), and a single server can serve multiple clients.

Cluster

A TensorFlow cluster comprises one or more “jobs”, each divided into lists of one or more “tasks”. A cluster is typically dedicated to a particular high-level objective, such as training a neural network, using many machines in parallel. A cluster is defined by a tf.train.ClusterSpec object.

Job

A job comprises a list of “tasks”, which typically serve a common purpose. For example, a job named ps (for “parameter server”) typically hosts nodes that store and update variables, while a job named worker typically hosts stateless nodes that perform compute-intensive tasks. The tasks in a job typically run on different machines. The set of job roles is flexible: for example, a worker may maintain some state.

Master service

An RPC service that provides remote access to a set of distributed devices, and acts as a session target. The master service implements the tensorflow::Session interface, and is responsible for coordinating work across one or more “worker services”. All TensorFlow servers implement the master service.

Task

A task corresponds to a specific TensorFlow server, and typically corresponds to a single process. A task belongs to a particular “job” and is identified by its index within that job’s list of tasks.

TensorFlow server

A process running a tf.train.Server instance, which is a member of a cluster and exports a “master service” and a “worker service”.

Worker service

An RPC service that executes parts of a TensorFlow graph using its local devices. A worker service implements worker_service.proto. All TensorFlow servers implement the worker service.


Code example: to be continued.
