Learning Sparse Matrix Row Permutations for Efficient SpMM on GPU Architectures

Achieving peak performance on sparse operations is challenging. Both the distribution of the non-zero elements and the underlying hardware platform affect execution efficiency. Given the diversity in workloads and architectures, no single solution always wins. In this paper, we improve SpMM efficiency on GPUs. We propose several simple but effective sparse data permutations on the CSR data structure. Picking the right permutation for each of 1,688 datasets improves performance by 1.4x, on average, compared to plain CSR and by 2.6x compared to NVIDIA cuSPARSE. Furthermore, we propose a set of novel features to describe sparsity patterns and their interactions with the kernel and hardware. Using these features, we develop a predictor to select the best permutation for each matrix. On average, the predicted permutations achieve 96% of the oracle's gain.
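As a rough illustration of what a row permutation on CSR involves, the sketch below reorders the rows of a CSR matrix, runs a reference SpMM on the permuted matrix, and scatters the output rows back to their original positions. The abstract does not say which permutations the paper uses; sorting rows by non-zero count is only an assumed example, and the helper names (`permute_rows`, `sort_rows_by_length`, `spmm`, `spmm_with_permutation`) are hypothetical. The code is plain Python for clarity, not a GPU kernel.

```python
from typing import List, Tuple

# CSR layout assumed here: (indptr, indices, data), as in common sparse libraries.
CSR = Tuple[List[int], List[int], List[float]]


def permute_rows(csr: CSR, perm: List[int]) -> CSR:
    """Build a new CSR matrix whose row i is row perm[i] of the input."""
    indptr, indices, data = csr
    new_indptr, new_indices, new_data = [0], [], []
    for r in perm:
        start, end = indptr[r], indptr[r + 1]
        new_indices.extend(indices[start:end])
        new_data.extend(data[start:end])
        new_indptr.append(len(new_indices))
    return new_indptr, new_indices, new_data


def sort_rows_by_length(csr: CSR) -> List[int]:
    """Illustrative permutation (an assumption): longest rows first."""
    indptr = csr[0]
    n_rows = len(indptr) - 1
    return sorted(range(n_rows),
                  key=lambda r: indptr[r + 1] - indptr[r],
                  reverse=True)


def spmm(csr: CSR, dense: List[List[float]]) -> List[List[float]]:
    """Reference sparse-times-dense product, one output row per CSR row."""
    indptr, indices, data = csr
    n_cols = len(dense[0])
    out = []
    for r in range(len(indptr) - 1):
        row = [0.0] * n_cols
        for k in range(indptr[r], indptr[r + 1]):
            a, b_row = data[k], dense[indices[k]]
            for j in range(n_cols):
                row[j] += a * b_row[j]
        out.append(row)
    return out


def spmm_with_permutation(csr: CSR, dense: List[List[float]],
                          perm: List[int]) -> List[List[float]]:
    """Run SpMM on the row-permuted matrix, then undo the permutation."""
    permuted_out = spmm(permute_rows(csr, perm), dense)
    out: List[List[float]] = [[] for _ in perm]
    for new_r, old_r in enumerate(perm):
        out[old_r] = permuted_out[new_r]
    return out
```

Because a row permutation only changes the order in which output rows are produced, the result after scattering is identical to the unpermuted product. Any performance difference would come from how the reordered rows map onto the hardware, which this sketch does not model.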

Authors

Atefeh Mehrabi (Duke University)

Donghyuk Lee

Niladrish Chatterjee

Daniel J. Sorin (Duke University)

Benjamin C. Lee (University of Pennsylvania)

Mike O'Connor

Publication Date

Monday, March 29, 2021

Published in

International Symposium on Performance Analysis of Systems and Software (ISPASS)

Research Area

Artificial Intelligence and Machine Learning

Computer Architecture

High Performance Computing

External Links

IEEE Digital Library

Uploaded Files

Published manuscript (2.45 MB)

Copyright

This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to pubs-permissions@ieee.org.