 
 # On the Trend of Resilience for GPU-Dense Systems

  ![](/sites/default/files/styles/wide/public/publications/multi_gpu_resilience.JPG?itok=hXg08Uov)

 Emerging high-performance computing (HPC) systems are trending towards heterogeneous nodes that are dense with accelerators such as GPUs. Such nodes offer higher computational power at lower energy and cost than homogeneous CPU-only nodes. While an accelerator-rich machine reduces the total number of compute nodes required to reach a performance target, each node becomes more susceptible to accelerator failures, and its intra-node resources must be shared among many accelerators. Recovering from such failures requires end-to-end resilience schemes such as checkpoint-restart. However, preserving the large amount of local state held within accelerators for checkpointing incurs significant overhead. This trend reveals a new challenge for resilience in accelerator-dense systems. We study its impact in multi-level checkpointing systems, including those with burst buffers. We quantify the system-level efficiency of resilience while sweeping the failure rate, system scale, and GPU density. Our multi-level checkpoint-restart model shows that efficiency begins to drop at a 16:1 GPU-to-CPU ratio in a 3.6 EFLOP system, and that a ratio of 64:1 degrades overall system efficiency by 5%. Furthermore, we quantify the system-level impact of possible design choices that could mitigate this challenge for resilience in GPU-dense systems.
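 The density trade-off described above can be illustrated with a rough sketch. This is not the paper's multi-level model: it is a single-level checkpoint-restart estimate that uses Young's first-order approximation for the checkpoint interval, and it assumes that all GPUs in a node share one checkpoint path. The function name, every parameter, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def checkpoint_efficiency(total_gpus, gpus_per_node, gpu_mtbf_h,
                          node_mtbf_h, state_gb_per_gpu, node_bw_gb_s):
    """Estimate the fraction of time spent on useful work.

    Single-level checkpoint-restart with Young's first-order
    approximation; restart time is ignored. All inputs are
    hypothetical illustration parameters, not values from the paper.
    """
    nodes = total_gpus / gpus_per_node
    # Failures per hour: every GPU plus every node's non-GPU components.
    failures_per_h = total_gpus / gpu_mtbf_h + nodes / node_mtbf_h
    system_mtbf_s = 3600.0 / failures_per_h
    # Assumption: all GPUs in a node drain their state over one shared path.
    ckpt_s = gpus_per_node * state_gb_per_gpu / node_bw_gb_s
    # Young's approximation for the checkpoint interval.
    interval_s = math.sqrt(2.0 * ckpt_s * system_mtbf_s)
    # Waste = checkpoint overhead + expected rework after a failure.
    waste = ckpt_s / interval_s + interval_s / (2.0 * system_mtbf_s)
    return max(0.0, 1.0 - waste)


def sweep_density(densities):
    """Efficiency for each GPUs-per-node value, with the GPU count fixed."""
    return [
        checkpoint_efficiency(
            total_gpus=65536, gpus_per_node=d,
            gpu_mtbf_h=200_000.0, node_mtbf_h=100_000.0,
            state_gb_per_gpu=16.0, node_bw_gb_s=10.0,
        )
        for d in densities
    ]


if __name__ == "__main__":
    for d, eff in zip([1, 4, 16, 64], sweep_density([1, 4, 16, 64])):
        print(f"{d:3d} GPUs per node: efficiency {eff:.3f}")
```

 In this simplified setting, packing more GPUs into each node lowers the node count and therefore the system failure rate, but the per-node checkpoint time grows with the amount of GPU state behind the shared path, so the estimated efficiency falls as density rises. The absolute numbers depend entirely on the assumed parameters and should not be compared with the results reported in the paper.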



 ## Authors



Kyushick Lee (University of Texas at Austin)

[Michael B. Sullivan](/index.php/person/mike-sullivan)

[Siva Hari](/index.php/person/siva-hari)

Timothy Tsai (NVIDIA)

[Steve Keckler](/index.php/person/stephen-keckler)

Mattan Erez (University of Texas at Austin)

 

 

 ## Publication Date



Monday, June 24, 2019

 

 ## Published in



[International Conference on Dependable Systems and Networks, Supplemental (DSN-…](https://ieeexplore.ieee.org/document/8805794)

 

 ## Research Area



[Computer Architecture](/index.php/research-area/computer-architecture)

[High Performance Computing](/index.php/research-area/high-performance-computing)

[Resilience and Safety](/index.php/research-area/resilience)

 

 

 ## External Links



[IEEE Digital Library](https://ieeexplore.ieee.org/document/8805794)

 

 

 ## Uploaded Files



[Published manuscript](https://d1qx31qr3h6wln.cloudfront.net/publications/DSN_2019_GPU_Resilience.pdf "Open file in new window") (394.71 KB)

 

 

 ## Award



Best of SELSE (Workshop on Silicon Errors in Logic - System Effects)

 

 

 ## Copyright



This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to <pubs-permissions@ieee.org>.