Horovod

References:

  • https://eng.uber.com/horovod/ (Uber Engineering blog post introducing Horovod)

Motivation

Problems in the standard distributed TensorFlow technique:

  • Not always clear which code modifications need to be made to distribute the model training code;
  • Many new concepts introduce hard-to-diagnose bugs that slow training.

    • The standard distributed TensorFlow package introduces many new concepts: workers, parameter servers, tf.Server(), tf.ClusterSpec(), tf.train.SyncReplicasOptimizer(), and tf.train.replica_device_setter(), to name a few. While beneficial for certain scenarios, they also introduce hard-to-diagnose bugs that slow training.
  • Does not scale well;

    • Both the Inception V3 and ResNet-101 models were unable to leverage nearly half of our GPU resources.

New insights on parallel optimization:

Horovod is based on Baidu’s draft implementation of the ring-allreduce algorithm for TensorFlow (described below).

Distributed training in steps (a minimal code sketch follows the list):

  1. Run multiple copies of the training script; each copy:
    • a) reads a chunk of the data
    • b) runs it through the model
    • c) computes model updates (Gradients)
  2. Average gradients among those multiple copies;
  3. Update the model;
  4. Repeat (from step 1a).
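As a rough, single-process sketch of these steps on a toy least-squares model (illustrative only; in practice each copy runs as a separate process, so step 2 requires communication between the copies):

```python
import numpy as np

# Toy data for a least-squares model (stand-in for a real training set).
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X = rng.normal(size=(512, 5))
y = X @ w_true

def local_gradient(w, X_chunk, y_chunk):
    # Steps 1a-1c: read a chunk, run it through the model, compute gradients.
    return 2.0 * X_chunk.T @ (X_chunk @ w - y_chunk) / len(X_chunk)

num_copies, lr = 4, 0.1
w = np.zeros(5)
for _ in range(200):
    chunks = zip(np.array_split(X, num_copies), np.array_split(y, num_copies))
    grads = [local_gradient(w, Xc, yc) for Xc, yc in chunks]  # step 1, one per copy
    w -= lr * np.mean(grads, axis=0)                          # step 2 + step 3
print(np.allclose(w, w_true, atol=1e-3))                      # step 4 is just the loop
```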

Parameter Server based approach:

  • Parameter servers average the gradients sent by the workers;
  • Worker servers process the training data, compute gradients, and send them to the parameter servers (a conceptual sketch follows below);
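
A conceptual sketch of that role split (hypothetical `ParameterServer`/`Worker` classes; real parameter servers and workers are separate processes communicating over the network):

```python
import numpy as np

class ParameterServer:
    """Holds the model parameters and averages gradients pushed by workers."""
    def __init__(self, dim, lr=0.1):
        self.w, self.lr = np.zeros(dim), lr
    def push(self, grads):
        # Average the gradients from all workers and update the parameters.
        self.w -= self.lr * np.mean(grads, axis=0)
    def pull(self):
        return self.w.copy()

class Worker:
    """Processes its share of the training data and computes gradients."""
    def __init__(self, X, y):
        self.X, self.y = X, y
    def gradient(self, w):
        return 2.0 * self.X.T @ (self.X @ w - self.y) / len(self.X)

rng = np.random.default_rng(1)
w_true = rng.normal(size=5)
X = rng.normal(size=(512, 5)); y = X @ w_true
workers = [Worker(Xc, yc) for Xc, yc in zip(np.array_split(X, 4), np.array_split(y, 4))]
ps = ParameterServer(dim=5)

for _ in range(200):
    w = ps.pull()                                   # workers fetch current parameters
    ps.push([wk.gradient(w) for wk in workers])     # ...and send back their gradients
print(np.allclose(ps.pull(), w_true, atol=1e-3))
```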

Two challenges for this approach (standard distributed TensorFlow package):

  • Identify the right ratio of worker to parameter servers;
    • With a single parameter server, it is likely to become a networking or computational bottleneck;
    • With multiple parameter servers, the all-to-all connections may saturate the network;
  • Handling increased TensorFlow program complexity: a steep learning curve and a significant amount of code restructuring, taking time away from the actual modeling. Users have to (see the boilerplate sketch after this list):
    • explicitly start each worker and parameter server;
    • pass around service discovery information such as hosts and ports of all the workers and parameter servers;
    • modify the training program to construct tf.Server() with an appropriate tf.ClusterSpec();
    • ensure all the operations are placed appropriately using tf.train.replica_device_setter();
    • modify the code to use towers to leverage multiple GPUs within the server.
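
For a sense of what that looks like, here is a sketch of the kind of setup code involved, using the TF1-era distributed API (hosts, ports, and task indices are made up; every process has to be launched separately with its own job_name and task_index):

```python
import tensorflow as tf

# Service discovery: every process needs the full list of hosts and ports.
cluster = tf.train.ClusterSpec({
    "ps":     ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Each process constructs a server for its own role; parameter-server
# processes would then simply call server.join().
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter places variables on the ps job and ops on this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    x = tf.placeholder(tf.float32, [None, 10])
    w = tf.Variable(tf.zeros([10, 1]))
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# Training then runs through a session pointed at server.target; synchronous
# replicas and multi-GPU towers require yet more code on top of this.
```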

Baidu’s ring-allreduce:

  • Each of the N nodes communicates with two of its peers 2*(N-1) times, sending and receiving chunks of the data buffer (simulated in the sketch after this list).
    • First N-1 iterations: received values are added to the values in the node’s buffer.
    • Second N-1 iterations: received values replace the values held in the node’s buffer.
  • Algorithm is bandwidth-optimal: if the buffer is large enough, it will optimally utilize the available network.
  • Much easier to understand and adopt.
    • Users utilize a Message Passing Interface (MPI), such as OpenMPI, to launch all copies of the TensorFlow program.
    • MPI then transparently sets up the distributed infrastructure necessary for workers to communicate with each other.
    • All the users need to do is to modify their program to average gradients using an allreduce() operation.
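
A single-process NumPy simulation of the ring-allreduce pattern described above (illustrative only; the real algorithm runs these sends and receives concurrently on all nodes):

```python
import numpy as np

def ring_allreduce(buffers):
    """Simulate ring-allreduce: every 'node' ends up with the element-wise sum."""
    n = len(buffers)
    chunks = [np.array_split(b.astype(float).copy(), n) for b in buffers]

    # Scatter-reduce: N-1 iterations, received chunks are ADDED to the local chunk.
    for step in range(n - 1):
        for node in range(n):
            idx = (node - step) % n
            dst = (node + 1) % n          # right-hand neighbour in the ring
            chunks[dst][idx] = chunks[dst][idx] + chunks[node][idx]

    # All-gather: N-1 iterations, received chunks REPLACE the local chunk.
    for step in range(n - 1):
        for node in range(n):
            idx = (node + 1 - step) % n
            dst = (node + 1) % n
            chunks[dst][idx] = chunks[node][idx]

    return [np.concatenate(c) for c in chunks]

# 4 nodes, each holding an 8-element buffer filled with its own rank.
result = ring_allreduce([np.full(8, i) for i in range(4)])
print(result[0])   # every node ends with the sum 0+1+2+3 = 6 in every slot
# For gradient averaging, each node would divide the result by N afterwards.
```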

Horovod:

  • Adopted Baidu’s draft implementation and improved on it:
    1. converted it into a standalone Python package called Horovod;
    2. made it compatible with different versions of TensorFlow;
    3. switched to NVIDIA’s NCCL, a highly optimized implementation of ring-allreduce;
    4. added support for models that fit inside a single server, potentially on multiple GPUs (the original version only supported models that fit on a single GPU);
    5. made several API improvements;
    6. added a broadcast operation that enforces consistent initialization of the model on all workers.
      • These changes cut down the number of operations a user has to add to their single-GPU program to four (sketched under “Use Horovod” below).

Use Horovod
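
A sketch of what those four additions look like in a training script, using the TF1-era horovod.tensorflow API (the model here is a throwaway regression example; exact API details vary across Horovod and TensorFlow versions):

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                               # 1. initialize Horovod

config = tf.ConfigProto()                                # 2. pin one GPU per process
config.gpu_options.visible_device_list = str(hvd.local_rank())

x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
loss = tf.reduce_mean(tf.square(tf.layers.dense(x, 1) - y))

opt = tf.train.AdagradOptimizer(0.01 * hvd.size())       # scale LR by worker count
opt = hvd.DistributedOptimizer(opt)                      # 3. allreduce-average gradients
train_op = opt.minimize(loss, global_step=tf.train.get_or_create_global_step())

hooks = [hvd.BroadcastGlobalVariablesHook(0),            # 4. consistent initialization
         tf.train.StopAtStepHook(last_step=1000)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        xs = np.random.rand(32, 10).astype(np.float32)
        ys = np.random.rand(32, 1).astype(np.float32)
        sess.run(train_op, feed_dict={x: xs, y: ys})
```

The script is then launched with MPI or Horovod’s wrapper, e.g. something like `horovodrun -np 4 -H localhost:4 python train.py` or the equivalent `mpirun` invocation; see the Horovod documentation for the exact flags.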

Reference: https://eng.uber.com/horovod/

