It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It reaches equivalent test accuracies after the same number of training epochs, but with fewer parameter updates, leading to greater parallelism and shorter training times. We can further reduce the number of parameter updates by increasing the learning rate $\epsilon$ and scaling the batch size $B \propto \epsilon$. Finally, one can increase the momentum coefficient $m$ and scale $B \propto 1/(1-m)$, although this tends to slightly reduce the test accuracy. Crucially, our techniques allow us to repurpose existing training schedules for large batch training with no hyper-parameter tuning. We train ResNet-50 on ImageNet to $76.1\%$ validation accuracy in under 30 minutes.
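To make the proposed repurposing concrete, the sketch below shows how a conventional step-wise learning-rate decay schedule could be converted into an equivalent batch-size schedule. This is a minimal illustration, not the paper's released code; the names `base_lr`, `base_batch`, `decay_factor`, and the milestone epochs are illustrative assumptions.

```python
# Minimal sketch: replace step-wise learning-rate decay with an equivalent
# step-wise batch-size increase. All constants below are assumed for
# illustration, not taken from the paper's experiments.

def schedule(epoch, base_lr=0.1, base_batch=128, decay_factor=5,
             milestones=(30, 60, 80)):
    """Return (learning_rate, batch_size) for the given epoch.

    Conventional schedule: divide the learning rate by `decay_factor` at
    each milestone. Equivalent schedule sketched here: hold the learning
    rate fixed and multiply the batch size by `decay_factor` instead.
    """
    n_decays = sum(epoch >= m for m in milestones)
    lr = base_lr                                   # held constant
    batch = base_batch * decay_factor ** n_decays  # grows instead of lr shrinking
    return lr, batch

# Example: by epoch 65 the batch size has grown by decay_factor**2,
# mimicking two learning-rate decay steps with fewer parameter updates.
print(schedule(65))  # -> (0.1, 3200)
```

The same mapping extends to the other scaling rules in the abstract: multiplying the learning rate $\epsilon$ by some factor while scaling $B \propto \epsilon$, or increasing the momentum coefficient $m$ while scaling $B \propto 1/(1-m)$, keeps the schedule shape but reduces the total number of parameter updates.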