Training tricks for a massive GPU cluster: the key ideas appear to be careful control of the mini-batch size, plus a 2D-Torus all-reduce scheme to synchronize gradient updates across GPUs. Just submitted to arXiv, from a SONY team. The paper title is also catchy: ImageNet/ResNet-50 Training in 224 Seconds.
This work uses Tesla V100 x1088 with Infiniband EDR x2, reaching 91.62% GPU scaling efficiency.
https://arxiv.org/abs/1811.05233
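To make the 2D-Torus idea concrete, here is a minimal sketch (not the authors' code; grid size, chunking, and all names are my own assumptions) that simulates the three logical phases on an X x Y grid of workers with NumPy arrays standing in for per-GPU gradient buffers: reduce-scatter along horizontal rings, all-reduce along vertical rings, then all-gather horizontally.

```python
# Sketch of a 2D-Torus all-reduce (simulation only, no real communication).
import numpy as np

X, Y = 4, 2        # horizontal / vertical torus dimensions (8 "workers")
CHUNKS = X         # gradient is split into X chunks for the reduce-scatter

rng = np.random.default_rng(0)
# grads[y][x] is the local gradient of the worker at grid position (x, y)
grads = [[rng.standard_normal(CHUNKS * 3) for _ in range(X)] for _ in range(Y)]
expected = sum(g for row in grads for g in row)   # ground-truth flat all-reduce

# Phase 1: reduce-scatter along each horizontal ring.
# Afterwards, worker (c, y) conceptually holds the row-sum of chunk c.
row_chunks = []
for y in range(Y):
    parts = [np.array_split(grads[y][x], CHUNKS) for x in range(X)]
    row_chunks.append([sum(parts[x][c] for x in range(X)) for c in range(CHUNKS)])

# Phase 2: all-reduce along each vertical ring, one chunk per column.
# Every worker in column c now holds the global sum of chunk c.
col_sums = [sum(row_chunks[y][c] for y in range(Y)) for c in range(CHUNKS)]

# Phase 3: all-gather along each horizontal ring reassembles the full,
# globally reduced gradient on every worker.
result = np.concatenate(col_sums)

assert np.allclose(result, expected)
print("2D-Torus result matches a flat all-reduce:", np.allclose(result, expected))
```

The point of the two-level scheme is that each ring is short (length X or Y instead of X*Y), so per-step latency and the number of communication steps drop relative to a single flat ring over all 1088 GPUs.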