MegatronLM’s Supercharged V1.0

We recently released version 1.0 of Megatron-LM in our GitHub repository. In addition to training support for the world's largest BERT models, which established state-of-the-art results on the RACE leaderboard, we performed several software optimizations that make training large NLP models even faster. As a result, our baseline model with 1.2 billion parameters now achieves 62.4 teraFLOPS, which is 48% of the theoretical peak for a single GPU in a DGX-2H server. This is a 60% improvement over our previously published number of 39 teraFLOPS.
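As a quick sanity check on these figures, here is a minimal sketch. The per-GPU peak below is an assumption inferred from the 48% claim (the Tensor Core peak of the higher-clocked V100s in a DGX-2H), not a number quoted in this post.

```python
# Hedged sketch: back-of-the-envelope check of the reported numbers.
sustained_tflops = 62.4   # achieved throughput reported above
peak_tflops = 130.0       # assumed per-GPU Tensor Core peak on a DGX-2H
prev_tflops = 39.0        # previously published throughput

utilization = sustained_tflops / peak_tflops
improvement = sustained_tflops / prev_tflops - 1.0

print(f"utilization: {utilization:.0%}")  # -> 48%
print(f"improvement: {improvement:.0%}")  # -> 60%
```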

In addition, to test the effect of these optimizations on model parallel scaling, we considered four GPT-2 configurations ranging from 1.2 billion to 8.7 billion parameters with up to eight-way model parallelism. We fixed the batch size to 8 and increased the model parallel size as the model size grew. The scaling results are shown in Table 1. We observed excellent weak scaling across all configurations; for example, the 8.7-billion-parameter model with 8-way model parallelism (8 GPUs) achieved 79.6% of linear scaling.

| Number of Parameters (billions) | Model Parallel GPUs | Iteration Time (ms) | Weak Scaling |
|---|---|---|---|
| 1.2 | 1 | 1288 | Baseline (100%) |
| 2.0 | 2 | 1242 | 90.7% |
| 4.2 | 4 | 1357 | 86.5% |
| 8.7 | 8 | 1508 | 79.6% |

Table 1: Weak model parallel scaling.
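For readers curious about the last column, the sketch below shows one plausible way to compute weak scaling efficiency: sustained per-GPU throughput relative to the single-GPU baseline. The FLOP counts per iteration are left as inputs (our measured values are not listed in this post), so the function is illustrative rather than a reproduction of Table 1.

```python
def weak_scaling_efficiency(flops_per_iter, num_gpus, iter_time_ms,
                            base_flops_per_iter, base_iter_time_ms):
    """Per-GPU sustained throughput relative to the single-GPU baseline.

    Illustrative sketch: assumes weak scaling is defined as the ratio of
    sustained FLOPS per GPU to the baseline's sustained FLOPS. Larger models
    do more work per iteration, so FLOPs per iteration must be supplied.
    """
    per_gpu = flops_per_iter / (num_gpus * iter_time_ms)
    base_per_gpu = base_flops_per_iter / base_iter_time_ms
    return per_gpu / base_per_gpu

# Trivially, the baseline scores 100% against itself for any FLOP count F:
# weak_scaling_efficiency(F, 1, 1288, F, 1288) == 1.0
```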

We are constantly improving the computational efficiency of our codebase and will release the latest advancements in large-scale LM training through our GitHub repository.