Sparse deep neural network (DNN) accelerators exploit the intrinsic redundancy in data representation to achieve high performance and energy efficiency. However, sparse weight and input activation arrays are unstructured, and their processing cannot exploit the regular data-access patterns of dense arrays, so it incurs increased complexity in dataflow orchestration and resource management. In this work, we first highlight the importance of the data reduction mechanism, i.e., how partial sums are accumulated spatially or temporally, a perspective that has not been fully explored in the current literature. Motivated by this reduction analysis, we propose Stitch-X, a novel DNN inference accelerator architecture that stitches together sparse weights and input activations for parallel execution. Specifically, Stitch-X employs a novel dataflow that leverages both spatial and temporal reduction to balance energy efficiency against dataflow control complexity. Moreover, Stitch-X adopts a new runtime Parallelism Discovery Unit (PDU) that efficiently extracts fine-grained parallelizable operations from irregular sparse data arrays, enabling higher performance across a wide range of input data densities and a variety of DNN layers. Our evaluations show that Stitch-X consistently achieves a 3.8× speedup and improves energy-delay-squared product (ED2P) by a factor of 10.3× over an efficient, dense DNN accelerator. Compared to a state-of-the-art sparse DNN accelerator, Stitch-X delivers 1.6× better performance. A silicon prototype of the Stitch-X architecture is scheduled for April 2018.
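To make the reduction distinction concrete, the following is a minimal conceptual sketch, not the Stitch-X hardware: temporal reduction models a single accumulator summing partial products over successive cycles, while spatial reduction models an adder tree combining partial sums from parallel lanes. The function names and the toy sparse operands are illustrative assumptions.

```python
# Conceptual sketch of the two reduction styles for partial sums
# (illustrative only; not the actual Stitch-X implementation).

def temporal_reduce(products):
    """Temporal reduction: one accumulator sums partial products
    one at a time, as over successive cycles."""
    acc = 0
    for p in products:  # each iteration models one cycle
        acc += p
    return acc

def spatial_reduce(products):
    """Spatial reduction: partial sums are combined pairwise in
    parallel, as an adder tree would in a single pass per level."""
    vals = list(products)
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] if i + 1 < len(vals) else vals[i]
                for i in range(0, len(vals), 2)]
    return vals[0] if vals else 0

# Sparse dot product: only positions where both the weight and the
# activation are nonzero produce useful work to "stitch" together.
weights = {0: 2, 3: -1, 7: 4}   # index -> nonzero weight
acts    = {0: 5, 2: 3, 7: 1}    # index -> nonzero activation
products = [weights[i] * acts[i] for i in weights.keys() & acts.keys()]

# Both reduction styles yield the same result (2*5 + 4*1 = 14);
# they differ in latency, energy, and control complexity.
assert temporal_reduce(products) == spatial_reduce(products) == 14
```

Either style produces the same sum; the trade-off the abstract alludes to is in hardware cost: temporal reduction needs only one accumulator but takes more cycles, while spatial reduction finishes faster at the cost of an adder tree and more complex orchestration of irregular sparse operands.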