References: linked inline below (Facebook, Baidu, Uber Horovod).
Problems in the standard distributed TensorFlow technique:
Many new concepts introduced hard-to-diagnose bugs that slowed training.
Does not scale well.
New insights on parallel optimizations:
Facebook: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (linear learning-rate scaling with warmup; see the sketch after this list)
Baidu: ring-allreduce: Bringing HPC Techniques to Deep Learning
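The key recipe from the Facebook paper is the linear learning-rate scaling rule combined with a gradual warmup. A back-of-the-envelope sketch: the base learning rate of 0.1 at batch size 256 and the 5 warmup epochs follow the paper's ImageNet setup, while the worker count and per-worker batch size are made-up values for illustration.
# Sketch of the linear scaling rule with gradual warmup (illustrative numbers).
def scaled_learning_rate(epoch, num_workers, per_worker_batch=32,
                         base_lr=0.1, base_batch=256, warmup_epochs=5):
    """Return the learning rate for the given epoch under linear scaling + warmup."""
    global_batch = per_worker_batch * num_workers
    target_lr = base_lr * global_batch / base_batch      # linear scaling rule
    if epoch < warmup_epochs:
        # Ramp linearly from base_lr up to target_lr over the warmup epochs.
        return base_lr + (target_lr - base_lr) * (epoch + 1) / warmup_epochs
    return target_lr

# Example: 16 workers with batch 32 each -> global batch 512 -> target LR 0.2.
print([round(scaled_learning_rate(e, num_workers=16), 3) for e in range(7)])
# [0.12, 0.14, 0.16, 0.18, 0.2, 0.2, 0.2]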
Distributed training in steps:
Parameter Server based approach:
In the standard distributed TensorFlow package, each process starts a tf.Server() with an appropriate tf.ClusterSpec(), and variables are placed on parameter servers via tf.train.replica_device_setter() (see the sketch below).
Two challenges for this approach: it introduces many new concepts (leading to hard-to-diagnose bugs), and it does not scale well (see the problems above).
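A minimal sketch of that parameter-server wiring, assuming TensorFlow 1.x (tf.train.Server, tf.train.ClusterSpec, tf.train.replica_device_setter); the hostnames, ports, job/task assignment, and the tiny placeholder model are illustrative, not from the original notes.
import tensorflow as tf

# Hypothetical cluster: one parameter server and two workers.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222"],
})

# Each process launches a server for its own job/task (these values would
# normally come from command-line flags or environment variables).
job_name, task_index = "worker", 0
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # parameter servers just host and serve variables
else:
    # replica_device_setter() places variables on the ps tasks and the
    # compute ops on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index,
            cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 10])   # placeholder model
        w = tf.get_variable("w", [10, 1])
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
        train_op = tf.train.AdagradOptimizer(0.01).minimize(loss)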
Baidu's ring-allreduce:
Replaces the parameter servers with an allreduce() operation: workers are arranged in a ring and exchange gradient chunks with their neighbors until every worker holds the fully reduced gradients (see the simulation sketch below).
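A toy, single-process simulation of ring-allreduce to make the data flow concrete; the worker count, chunk sizes, and gradient values are made up, and a real implementation would send the chunks over the network (e.g. via MPI or NCCL) rather than through Python lists.
import numpy as np

def ring_allreduce(grads):
    """grads[i] is worker i's gradient vector; returns each worker's final copy."""
    n = len(grads)
    # Each worker splits its gradient into n chunks.
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Phase 1: scatter-reduce. After n-1 ring steps, worker i holds the
    # fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] += data          # right neighbor accumulates

    # Phase 2: allgather. Another n-1 ring steps circulate the reduced chunks.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] = data           # right neighbor copies

    return [np.concatenate(c) for c in chunks]

# Four workers, each with an 8-element "gradient": every worker ends up with the sum.
workers = [np.arange(8) * (i + 1) for i in range(4)]
reduced = ring_allreduce(workers)
assert all(np.allclose(r, sum(workers)) for r in reduced)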
Horovod:
Reference: https://eng.uber.com/horovod/
The example below shows the four additions needed to make a single-GPU TensorFlow program data-parallel with Horovod:
import tensorflow as tf
import horovod.tensorflow as hvd
##############################
# 1/4. Initialize Horovod
##############################
hvd.init()
##############################
# 2/4. GPU assignment
##############################
# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
# Assigns a GPU to each of the TensorFlow processes
config.gpu_options.visible_device_list = str(hvd.local_rank())
# Build model…
loss = …
opt = tf.train.AdagradOptimizer(0.01)
##############################
# 3/4. Averaging.
##############################
# Add Horovod Distributed Optimizer
# This wraps any regular TensorFlow optimizer with Horovod optimizer which takes care of averaging gradients using ring-allreduce.
opt = hvd.DistributedOptimizer(opt)
##############################
# 4/4. Broadcasting
##############################
# Add hook to broadcast variables from rank 0 to all other processes during
# initialization.
# Note: if the program does not use `MonitoredTrainingSession`, run the
# hvd.broadcast_global_variables(0) operation instead.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
# Make the training operation; a global step is needed for checkpointing.
train_op = opt.minimize(loss, global_step=tf.train.get_or_create_global_step())
# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir="/tmp/train_logs",
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
Run using the mpirun command:
$ mpirun -np 16 -x LD_LIBRARY_PATH -H server1:4,server2:4,server3:4,server4:4 python train.py
The command above distributes train.py across four nodes and runs four GPU processes per node (16 in total).
Horovod can also distribute Keras programs; see the examples on GitHub and the sketch below.
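A minimal sketch of the same four steps for a standalone Keras program, assuming the horovod.keras API; the toy model, optimizer, and hyperparameters are illustrative placeholders, not taken from the original notes.
import tensorflow as tf
import keras
from keras import backend as K
import horovod.keras as hvd

# 1/4. Initialize Horovod.
hvd.init()

# 2/4. Pin one GPU per process, keyed by local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))

# Illustrative model.
model = keras.models.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    keras.layers.Dense(10, activation='softmax'),
])

# 3/4. Wrap the optimizer so gradients are averaged with ring-allreduce.
opt = hvd.DistributedOptimizer(keras.optimizers.Adadelta(1.0))
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])

# 4/4. Broadcast initial weights from rank 0 so all workers start identically.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

# model.fit(x_train, y_train, batch_size=128, epochs=5, callbacks=callbacks)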
If you could revise the fundamental principles of computer system design to improve security... what would you change?