Horovod

References:

  • https://eng.uber.com/horovod/ (Uber Engineering blog post introducing Horovod)

Motivation

Problems in the standard distributed TensorFlow technique:

  • Not always clear which code modifications need to be made to distribute the model training code;
  • Many new concepts introduce hard-to-diagnose bugs that slow training.

    • The standard distributed TensorFlow package introduces many new concepts: workers, parameter servers, tf.Server(), tf.ClusterSpec(), tf.train.SyncReplicasOptimizer(), and tf.train.replica_device_setter(), to name a few. While beneficial for certain scenarios, they also introduce hard-to-diagnose bugs that slow training.
  • Does not scale well;

    • Both the Inception V3 and ResNet-101 models were unable to leverage nearly half of our GPU resources.

New insights on parallel optimization:

Horovod is based on Baidu’s draft implementation of the ring-allreduce algorithm for TensorFlow (described below).

Distributed training in steps (a minimal code sketch follows the list):

  1. Run multiple copies of the training script; each copy:
    • a) reads a chunk of the data
    • b) runs it through the model
    • c) computes model updates (Gradients)
  2. Average gradients among those multiple copies;
  3. Update the model;
  4. Repeat (from step 1a).
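As a rough, single-process sketch of these steps on a toy least-squares model (illustrative only; in practice each copy runs as a separate process, so step 2 requires communication between the copies):

```python
import numpy as np

# Toy data for a least-squares model (stand-in for a real training set).
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X = rng.normal(size=(512, 5))
y = X @ w_true

def local_gradient(w, X_chunk, y_chunk):
    # Steps 1a-1c: read a chunk, run it through the model, compute gradients.
    return 2.0 * X_chunk.T @ (X_chunk @ w - y_chunk) / len(X_chunk)

num_copies, lr = 4, 0.1
w = np.zeros(5)
for _ in range(200):
    chunks = zip(np.array_split(X, num_copies), np.array_split(y, num_copies))
    grads = [local_gradient(w, Xc, yc) for Xc, yc in chunks]  # step 1, one per copy
    w -= lr * np.mean(grads, axis=0)                          # step 2 + step 3
print(np.allclose(w, w_true, atol=1e-3))                      # step 4 is just the loop
```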

Parameter Server based approach:

  • Parameter servers average the gradients sent by the workers;
  • Worker servers process the training data, compute gradients, and send them to the parameter servers (a conceptual sketch follows below);
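
A conceptual sketch of that role split (hypothetical `ParameterServer`/`Worker` classes; real parameter servers and workers are separate processes communicating over the network):

```python
import numpy as np

class ParameterServer:
    """Holds the model parameters and averages gradients pushed by workers."""
    def __init__(self, dim, lr=0.1):
        self.w, self.lr = np.zeros(dim), lr
    def push(self, grads):
        # Average the gradients from all workers and update the parameters.
        self.w -= self.lr * np.mean(grads, axis=0)
    def pull(self):
        return self.w.copy()

class Worker:
    """Processes its share of the training data and computes gradients."""
    def __init__(self, X, y):
        self.X, self.y = X, y
    def gradient(self, w):
        return 2.0 * self.X.T @ (self.X @ w - self.y) / len(self.X)

rng = np.random.default_rng(1)
w_true = rng.normal(size=5)
X = rng.normal(size=(512, 5)); y = X @ w_true
workers = [Worker(Xc, yc) for Xc, yc in zip(np.array_split(X, 4), np.array_split(y, 4))]
ps = ParameterServer(dim=5)

for _ in range(200):
    w = ps.pull()                                   # workers fetch current parameters
    ps.push([wk.gradient(w) for wk in workers])     # ...and send back their gradients
print(np.allclose(ps.pull(), w_true, atol=1e-3))
```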

Two challenges for this approach (standard distributed TensorFlow package):

  • Identify the right ratio of worker to parameter servers;
    • With a single parameter server, it is likely to become a networking or computational bottleneck;
    • With multiple parameter servers, the all-to-all connections may saturate the network;
  • Handling increased TensorFlow program complexity: a steep learning curve and a significant amount of code restructuring, taking time away from the actual modeling. Users have to (see the boilerplate sketch after this list):
    • explicitly start each worker and parameter server;
    • pass around service discovery information such as hosts and ports of all the workers and parameter servers;
    • modify the training program to construct tf.Server() with an appropriate tf.ClusterSpec();
    • ensure all the operations are placed appropriately using tf.train.replica_device_setter();
    • modify the code to use towers to leverage multiple GPUs within the server.
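
For a sense of what that looks like, here is a sketch of the kind of setup code involved, using the TF1-era distributed API (hosts, ports, and task indices are made up; every process has to be launched separately with its own job_name and task_index):

```python
import tensorflow as tf

# Service discovery: every process needs the full list of hosts and ports.
cluster = tf.train.ClusterSpec({
    "ps":     ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Each process constructs a server for its own role; parameter-server
# processes would then simply call server.join().
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter places variables on the ps job and ops on this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    x = tf.placeholder(tf.float32, [None, 10])
    w = tf.Variable(tf.zeros([10, 1]))
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# Training then runs through a session pointed at server.target; synchronous
# replicas and multi-GPU towers require yet more code on top of this.
```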

Baidu’s ring-allreduce:

  • Each of the N nodes communicates with two of its peers 2*(N-1) times, sending and receiving chunks of the data buffer (simulated in the sketch after this list).
    • First N-1 iterations: received values are added to the values in the node’s buffer.
    • Second N-1 iterations: received values replace the values held in the node’s buffer.
  • Algorithm is bandwidth-optimal: if the buffer is large enough, it will optimally utilize the available network.
  • Much easier to understand and adopt.
    • Users utilize a Message Passing Interface (MPI), such as OpenMPI, to launch all copies of the TensorFlow program.
    • MPI then transparently sets up the distributed infrastructure necessary for workers to communicate with each other.
    • All the users need to do is to modify their program to average gradients using an allreduce() operation.
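
A single-process NumPy simulation of the ring-allreduce pattern described above (illustrative only; the real algorithm runs these sends and receives concurrently on all nodes):

```python
import numpy as np

def ring_allreduce(buffers):
    """Simulate ring-allreduce: every 'node' ends up with the element-wise sum."""
    n = len(buffers)
    chunks = [np.array_split(b.astype(float).copy(), n) for b in buffers]

    # Scatter-reduce: N-1 iterations, received chunks are ADDED to the local chunk.
    for step in range(n - 1):
        for node in range(n):
            idx = (node - step) % n
            dst = (node + 1) % n          # right-hand neighbour in the ring
            chunks[dst][idx] = chunks[dst][idx] + chunks[node][idx]

    # All-gather: N-1 iterations, received chunks REPLACE the local chunk.
    for step in range(n - 1):
        for node in range(n):
            idx = (node + 1 - step) % n
            dst = (node + 1) % n
            chunks[dst][idx] = chunks[node][idx]

    return [np.concatenate(c) for c in chunks]

# 4 nodes, each holding an 8-element buffer filled with its own rank.
result = ring_allreduce([np.full(8, i) for i in range(4)])
print(result[0])   # every node ends with the sum 0+1+2+3 = 6 in every slot
# For gradient averaging, each node would divide the result by N afterwards.
```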

Horovod:

  • Adopted Baidu’s draft implementation and improved on it:
    1. converted it into a standalone Python package called Horovod;
    2. made it compatible with different versions of TensorFlow;
    3. switched to NVIDIA’s NCCL, a highly optimized implementation of ring-allreduce;
    4. added support for models that fit inside a single server, potentially on multiple GPUs (the original version only supported models that fit on a single GPU);
    5. made several API improvements;
    6. added a broadcast operation that enforces consistent initialization of the model on all workers.
      • These changes cut down the number of operations a user has to add to their single-GPU program to four (sketched under “Use Horovod” below).

Use Horovod
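
A sketch of what those four additions look like in a training script, using the TF1-era horovod.tensorflow API (the model here is a throwaway regression example; exact API details vary across Horovod and TensorFlow versions):

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                               # 1. initialize Horovod

config = tf.ConfigProto()                                # 2. pin one GPU per process
config.gpu_options.visible_device_list = str(hvd.local_rank())

x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
loss = tf.reduce_mean(tf.square(tf.layers.dense(x, 1) - y))

opt = tf.train.AdagradOptimizer(0.01 * hvd.size())       # scale LR by worker count
opt = hvd.DistributedOptimizer(opt)                      # 3. allreduce-average gradients
train_op = opt.minimize(loss, global_step=tf.train.get_or_create_global_step())

hooks = [hvd.BroadcastGlobalVariablesHook(0),            # 4. consistent initialization
         tf.train.StopAtStepHook(last_step=1000)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        xs = np.random.rand(32, 10).astype(np.float32)
        ys = np.random.rand(32, 1).astype(np.float32)
        sess.run(train_op, feed_dict={x: xs, y: ys})
```

The script is then launched with MPI or Horovod’s wrapper, e.g. something like `horovodrun -np 4 -H localhost:4 python train.py` or the equivalent `mpirun` invocation; see the Horovod documentation for the exact flags.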

Reference: https://eng.uber.com/horovod/

