Despite industrial investment in both on-die GPUs and next generation interconnects, the highest performing parallel accelerators shipping today continue to be discrete GPUs. Connected via PCIe, these GPUs utilize their own privately managed physical memory that is optimized for high bandwidth. These separate memories force GPU programmers to manage the movement of data between the CPU and GPU, in addition to the on-chip GPU memory hierarchy. To simplify this process, GPU vendors are developing software runtimes that automatically page memory in and out of the GPU on-demand, reducing programmer effort and enabling computation across datasets that exceed the GPU memory capacity. Because this memory migration occurs over a high latency and low bandwidth link (compared to GPU memory), these software runtimes may result in significant performance penalties. In this work, we explore the features needed in GPU hardware and software to close the performance gap of GPU paged memory versus legacy programmer directed memory management. Without modifying the GPU execution pipeline, we show it is possible to largely hide the performance overheads of GPU paged memory, converting an average 2× slowdown into a 12% speedup when compared to programmer directed transfers. Additionally, we examine the performance impact that GPU memory oversubscription has on application run times, enabling application designers to make informed decisions on how to shard their datasets across hosts and GPU instances.
This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to email@example.com.