 
 # On the Trend of Resilience for GPU-Dense Systems


Emerging high-performance computing (HPC) systems are trending toward heterogeneous nodes that are dense with accelerators such as GPUs. Such nodes offer higher computational power at lower energy and cost than homogeneous CPU-only nodes. While an accelerator-rich machine reduces the total number of compute nodes required to achieve a performance target, each node becomes more susceptible to accelerator failures and must share its intra-node resources among many accelerators. Such failures must be recovered by end-to-end resilience schemes such as checkpoint-restart. However, preserving a large amount of local state within accelerators for checkpointing incurs significant overhead. This trend poses a new challenge for resilience in accelerator-dense systems. We study its impact in systems with multi-level checkpointing and with burst buffers. We quantify system-level resilience efficiency while sweeping the failure rate, system scale, and GPU density. Our multi-level checkpoint-restart model shows that efficiency begins to drop at a 16:1 GPU-to-CPU ratio in a 3.6 EFLOP system, and that a ratio of 64:1 degrades overall system efficiency by 5%. Furthermore, we quantify the system-level impact of possible design considerations that mitigate this challenge for resilience in GPU-dense systems.
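To give a feel for why GPU density matters for checkpoint-restart, the following is a minimal sketch and not the multi-level model from the paper. It uses the classic single-level, first-order approximation by Young, where the checkpoint interval is `sqrt(2 * checkpoint_cost * MTBF)`, and it assumes that all GPUs in a node drain their state through one shared node-level link. Every function name and every number (GPU count, state size, bandwidth, component MTBF) is a hypothetical placeholder chosen for illustration only.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def young_interval(checkpoint_cost_s, system_mtbf_s):
    """Young's first-order checkpoint interval (valid when cost << MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * system_mtbf_s)


def efficiency(checkpoint_cost_s, system_mtbf_s):
    """Fraction of time spent on useful work under a single-level model.

    Waste = time writing checkpoints + expected rework after a failure
    (on average half an interval is lost per failure).
    """
    tau = young_interval(checkpoint_cost_s, system_mtbf_s)
    waste = checkpoint_cost_s / tau + (tau / 2.0) / system_mtbf_s
    return max(0.0, 1.0 - waste)


def efficiency_at_density(
    gpus_per_node,
    total_gpus=65536,            # hypothetical system size
    state_per_gpu_gb=16.0,       # hypothetical state to save per GPU
    node_bandwidth_gb_s=10.0,    # hypothetical shared checkpoint bandwidth
    gpu_mtbf_s=5 * SECONDS_PER_YEAR,   # hypothetical per-GPU MTBF
    host_mtbf_s=5 * SECONDS_PER_YEAR,  # hypothetical per-host MTBF
):
    """Efficiency for a fixed GPU count packed into fewer, denser nodes."""
    num_nodes = total_gpus / gpus_per_node
    # Assumption: GPUs in a node share one checkpoint link, so the
    # per-node checkpoint time grows linearly with GPU density.
    checkpoint_cost_s = gpus_per_node * state_per_gpu_gb / node_bandwidth_gb_s
    # Assumption: independent failures, so component failure rates add up.
    failure_rate = total_gpus / gpu_mtbf_s + num_nodes / host_mtbf_s
    return efficiency(checkpoint_cost_s, 1.0 / failure_rate)


if __name__ == "__main__":
    for density in (1, 4, 16, 64):
        print(f"{density:>2} GPUs/node: efficiency = "
              f"{efficiency_at_density(density):.3f}")
```

Under these assumptions, efficiency falls as density rises because the checkpoint cost grows with the number of GPUs per node while the system failure rate stays nearly constant. The absolute values depend entirely on the placeholder parameters and should not be compared with the figures reported in the abstract.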



 ## Authors



Kyushick Lee (The University of Texas at Austin)

[Michael B. Sullivan](/index.php/person/mike-sullivan)

[Siva Hari](/index.php/person/siva-hari)

Timothy Tsai (NVIDIA)

[Steve Keckler](/index.php/person/stephen-keckler)

Mattan Erez (The University of Texas at Austin)

 

 

 ## Publication Date



Wednesday, March 27, 2019

 

 ## Published in



[IEEE Workshop on Silicon Errors in Logic – System Effects (SELSE)](https://selse.org/2019-archive/)

 

 ## Research Area



[Computer Architecture](/index.php/research-area/computer-architecture)

[Resilience and Safety](/index.php/research-area/resilience)

 

 

 ## Uploaded Files



[Published manuscript](https://d1qx31qr3h6wln.cloudfront.net/publications/SELSE2019_GPUDenseChkptAnalysis.pdf "Open file in new window") (249.26 KB)

 

 

 ## Award



Award paper