GAMMA: Exploiting Gustavson’s Algorithm to Accelerate Sparse Matrix Multiplication

Sparse matrix-sparse matrix multiplication (spMspM) is at the heart of a wide range of scientific and machine learning applications. spMspM is inefficient on general-purpose architectures, making accelerators attractive. However, prior spMspM accelerators use inner- or outer-product dataflows that suffer poor input or output reuse, leading to high traffic and poor performance. These prior accelerators have not explored Gustavson’s algorithm, an alterna- tive spMspM dataflow that does not suffer from these problems but features irregular memory access patterns that prior accelerators do not support.

We present Gamma, an spMspM accelerator that uses Gustavson’s algorithm to address the challenges of prior work. Gamma performs spMspM’s computation using specialized processing elements with simple high-radix mergers, and performs many merges in parallel to achieve high throughput. Gamma uses a novel on-chip storage structure that combines features of both caches and explicitly man- aged buffers. This structure captures Gustavson’s irregular reuse patterns and streams thousands of concurrent sparse fibers (i.e., lists of coordinates and values for rows or columns) with explicitly decoupled data movement. Gamma features a new dynamic scheduling algorithm to achieve high utilization despite irregularity. We also present new preprocessing algorithms that boost Gamma’s efficiency and versatility. As a result, Gamma outperforms prior accelerators by gmean 2.1×, and reduces memory traffic by gmean 2.2× and by up to 13×.

Authors

Guowei Zhang (Massachusetts Institute of Technology)

Nithya Attaluri (Massachusetts Institute of Technology)

Joel Emer

Daniel Sanchez (Massachusetts Institute of Technology)

Publication Date

Saturday, April 17, 2021

Published in

International Conference on Architectural Support for Programming Languages and…

Research Area

Artificial Intelligence and Machine Learning

Computer Architecture

External Links

ACM Digital Library

Uploaded Files

Published manuscript673.55 KB

Copyright

Copyright by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org. The definitive version of this paper can be found at ACM's Digital Library http://www.acm.org/dl/.