DeLTA: GPU Performance Model for Deep Learning Applications with In-depth Memory System Traffic Analysis

Training convolutional neural networks (CNNs) requires intense compute throughput and high memory bandwidth. Convolution layers in particular account for the majority of CNN training execution time, and GPUs are commonly used to accelerate them. Optimizing GPU designs for efficient CNN training requires accurately modeling how performance improves as compute and memory resources are increased. We present DeLTA, the first analytical model that accurately estimates the traffic at each level of the GPU memory hierarchy while accounting for the complex reuse patterns of a parallel convolution algorithm. We demonstrate that our model is both accurate and robust across different CNNs and GPU architectures. We then show how the model can be used to carefully balance the scaling of different GPU resources for efficient CNN performance improvement.

Authors

Sangkug Lym (University of Texas - Austin)

Donghyuk Lee

Niladrish Chatterjee

Mike O'Connor

Mattan Erez (University of Texas - Austin)

Publication Date

Tuesday, March 26, 2019

Published in

International Symposium on Performance Analysis of Systems and Software (ISPASS)

Research Area

Artificial Intelligence and Machine Learning

Computer Architecture

External Links

IEEE Digital Library

Uploaded Files

Published manuscript (2.99 MB)

Copyright

This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to pubs-permissions@ieee.org.