In the figure, we take distributed deep learning as an example to explain the TensorFlow (TF) glossary: client, cluster, job, task, TensorFlow server, master service, and worker service.
The figure adopts model parallelism within each model replica and data parallelism among the replicas for distributed deep learning. An example of mapping physical nodes to the TensorFlow glossary is illustrated:
- The whole system is mapped to a TF cluster.
- The parameter servers are mapped to one job.
- Each model replica is mapped to a job.
- Each physical computing node is mapped to a task within its job.
- Each task runs a TF server, which uses the master service to communicate and coordinate work, and the worker service to compute designated operations of the TF graph on its local devices.
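The mapping above can be sketched as a cluster specification. The dictionary below uses the same format that `tf.train.ClusterSpec` accepts; the host names, ports, and job sizes are hypothetical:

```python
# Hypothetical cluster: 2 parameter servers and 3 workers (one per model replica).
# This dict is the argument format accepted by tf.train.ClusterSpec.
cluster_spec = {
    "ps": ["ps0.example.com:2222",           # job "ps", tasks 0-1
           "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222",   # job "worker", tasks 0-2
               "worker1.example.com:2222",
               "worker2.example.com:2222"],
}

# Each task is identified by its job name and its index within that job's task list.
for job, tasks in cluster_spec.items():
    for index, address in enumerate(tasks):
        print(f"/job:{job}/task:{index} -> {address}")
```

Every process in the cluster is started with this same spec, so each task can resolve any other task's address from its `(job, index)` pair.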
Official Explanation of Glossary in TensorFlow:
- A client is typically a program that builds a TensorFlow graph and constructs a `tensorflow::Session` to interact with a cluster. Clients are typically written in Python or C++. A single client process can directly interact with multiple TensorFlow servers (see “Replicated training” above), and a single server can serve multiple clients.
- A TensorFlow cluster comprises one or more “jobs”, each divided into lists of one or more “tasks”. A cluster is typically dedicated to a particular high-level objective, such as training a neural network, using many machines in parallel. A cluster is defined by a `tf.train.ClusterSpec` object.
- A job comprises a list of “tasks”, which typically serve a common purpose. For example, a job named `ps` (for “parameter server”) typically hosts nodes that store and update variables; while a job named `worker` typically hosts stateless nodes that perform compute-intensive tasks. The tasks in a job typically run on different machines. The set of job roles is flexible: for example, a `worker` may maintain some state.
- Master service
  - An RPC service that provides remote access to a set of distributed devices, and acts as a session target. The master service implements the `tensorflow::Session` interface, and is responsible for coordinating work across one or more “worker services”. All TensorFlow servers implement the master service.
- A task corresponds to a specific TensorFlow server, and typically corresponds to a single process. A task belongs to a particular “job” and is identified by its index within that job’s list of tasks.
- TensorFlow server
  - A process running a `tf.train.Server` instance, which is a member of a cluster, and exports a “master service” and “worker service”.
- Worker service
  - An RPC service that executes parts of a TensorFlow graph using its local devices. A worker service implements `worker_service.proto`. All TensorFlow servers implement the worker service.
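To illustrate how a `ps` job hosts variables, here is a minimal sketch of round-robin variable placement across parameter-server tasks. The variable names and the two-task `ps` job are assumptions; real code would typically delegate this to `tf.train.replica_device_setter`:

```python
# Hedged sketch: assign N variables to "ps" tasks in round-robin order,
# mimicking the default placement strategy of tf.train.replica_device_setter.
num_ps_tasks = 2
variables = ["weights_1", "bias_1", "weights_2", "bias_2", "weights_3"]

placement = {
    var: f"/job:ps/task:{i % num_ps_tasks}"
    for i, var in enumerate(variables)
}
print(placement)
```

Spreading variables across `ps` tasks this way balances the parameter-update load among the parameter servers.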
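As a sketch of how each task's server is identified, the snippet below derives the address a task would bind to from the cluster spec and its (job name, task index) pair. The hostnames are hypothetical, and the commented lines show where `tf.train.Server` would be constructed in real code:

```python
# Hedged sketch: each task's server binds to the address at its own
# (job name, task index) position in the shared cluster spec.
cluster_spec = {"ps": ["localhost:2222"],
                "worker": ["localhost:2223", "localhost:2224"]}

def server_address(job_name, task_index):
    """Return the address this task's server would bind to."""
    return cluster_spec[job_name][task_index]

# In real code, each process would run something like:
#   server = tf.train.Server(cluster_spec, job_name="worker", task_index=1)
#   server.join()
print(server_address("worker", 1))
```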
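The worker service runs the operations placed on its task's local devices, which TensorFlow names with fully qualified device strings. A small sketch of that naming scheme follows; the helper function is illustrative, not a TF API:

```python
# Hedged sketch: build the fully qualified device name for a device owned
# by a given task, e.g. GPU 1 of task 0 in the "worker" job.
def device_name(job, task, device_type="CPU", device_index=0):
    return f"/job:{job}/task:{task}/device:{device_type}:{device_index}"

print(device_name("worker", 0, "GPU", 1))  # /job:worker/task:0/device:GPU:1
```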