 
 # Characterizing and Mitigating Soft Errors in GPU DRAM


 While graphics processing units (GPUs) are used in high-reliability systems, wide GPU dynamic random-access memory (DRAM) interfaces make error protection difficult, as wide-device correction through error checking and correcting (ECC) is expensive and impractical. This challenge is compounded by worsening relative rates of multibit DRAM errors and increasing GPU memory capacities. This work uses high-energy neutron beam tests to inform the design and evaluation of GPU DRAM error-protection mechanisms. Based on observed locality in multibit error patterns, we propose several novel ECC schemes to decrease the silent data corruption (SDC) risk by up to five orders of magnitude relative to single-bit-error-correcting and double-bit-error-detecting (SEC-DED) ECC, while also reducing the number of uncorrectable errors by up to 7.87x. We compare novel binary and symbol-based ECC organizations that differ in their design complexity and hardware overheads, ultimately recommending two promising organizations. These schemes replace SEC-DED ECC with no additional redundancy, likely no performance degradation, and modest area and complexity costs.



 ## Authors



[Michael B. Sullivan](/person/mike-sullivan)

Nirmal R. Saxena (NVIDIA)

[Mike O'Connor](/person/mike-o-connor)

[Donghyuk Lee](/person/donghyuk-lee)

Paul Racunas (NVIDIA)

Saurabh Hukerikar (NVIDIA)

Timothy Tsai (NVIDIA)

[Siva Kumar Sastry Hari](/person/siva-hari)

[Stephen W. Keckler](/person/stephen-keckler)

 

 

 ## Publication Date



Tuesday, March 29, 2022

 

 ## Published in



[IEEE Micro (Issue: Top Picks of the 2021 Computer Architecture Conferences)](https://ieeexplore.ieee.org/document/9744333)

 

 ## Research Area



[Computer Architecture](/research-area/computer-architecture)

[Resilience and Safety](/research-area/resilience)

 

 

 ## External Links



[IEEE Digital Library](https://ieeexplore.ieee.org/document/9744333)

 

 

 ## Copyright



This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to <pubs-permissions@ieee.org>.