Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training

Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used parallelization strategy, but as the number of devices in data-parallel training grows, so does the communication overhead between devices. Additionally, a larger aggregate batch size per step leads to statistical efficiency loss, i.e., more epochs are required to converge to a desired accuracy. These factors affect overall training time, and beyond a certain number of devices, the speedup from DP scales poorly. This work explores hybrid parallelization, where each data-parallel worker comprises more than one device to accelerate each training step by exploiting model parallelism. We show that at scale, hybrid training will be more effective at minimizing end-to-end training time than exploiting DP alone. We project that, for Inception-V3, GNMT, and BigLSTM, the hybrid strategy provides an end-to-end training speedup of at least 26.5%, 8%, and 22%, respectively, compared to what DP alone can achieve at scale.

Authors

Saptadeep Pal (University of California)

Eiman Ebrahimi (NVIDIA)

Arslan Zulfiqar (NVIDIA)

Yaosheng Fu

Victor Zhang (NVIDIA)

Szymon Migacz (NVIDIA)

David Nellans

Puneet Gupta (University of California)

Publication Date

Monday, August 19, 2019

Published in

IEEE MICRO: Special Edition on Machine Learning Acceleration

Research Area

Computer Architecture

External Links

IEEE Digital Library

Copyright

This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to pubs-permissions@ieee.org.