MAESTRO: A Data-Centric Approach to Understand Reuse, Performance, and Hardware Cost of DNN Mappings

The efficiency of an accelerator depends on three factors -- the mapping, the deep neural network (DNN) layers, and the hardware -- which together form an extremely complicated design space for DNN accelerators. To demystify this design space and guide DNN accelerator design toward better efficiency, we propose an analytical cost model, MAESTRO. MAESTRO takes as inputs a DNN model description, a list of hardware resources, and a mapping described in a data-centric representation that we propose.
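
The data-centric representation is the interesting input here: a mapping is described per tensor dimension rather than as an explicit loop nest. As a rough sketch only (the tuple encoding, layer sizes, and helper below are illustrative assumptions, not MAESTRO's actual input format), a mapping can be written as directives that bind each dimension to space (across PEs) or to time, each with a tile size and an offset:

    # Minimal sketch of a data-centric mapping description (illustrative,
    # not MAESTRO's input syntax): each directive binds one dimension of a
    # CONV layer to space or time, with a tile size and an offset (the
    # stride between consecutive tiles).
    from math import ceil

    # Hypothetical CONV layer: K output channels, C input channels,
    # R x S filter, Y x X input feature map.
    dims = {"K": 64, "C": 32, "R": 3, "S": 3, "Y": 56, "X": 56}

    # One possible mapping: parallelize K across PEs, iterate the rest in time.
    mapping = [
        ("SpatialMap",  1, 1, "K"),   # each PE owns one output channel
        ("TemporalMap", 8, 8, "C"),   # input channels in tiles of 8
        ("TemporalMap", 3, 3, "R"),
        ("TemporalMap", 3, 3, "S"),
        ("TemporalMap", 1, 1, "Y"),
        ("TemporalMap", 1, 1, "X"),
    ]

    def temporal_steps(mapping, dims):
        """Number of sequential tile iterations implied by the temporal maps."""
        steps = 1
        for kind, size, offset, dim in mapping:
            if kind == "TemporalMap":
                steps *= ceil(dims[dim] / offset)
        return steps

    print(temporal_steps(mapping, dims))  # sequential steps per spatial pass

Counting sequential steps like this is the simplest question such a model answers; MAESTRO's actual analysis of reuse, latency, and buffer sizing is far more detailed.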

Mind Mappings: Enabling Efficient Algorithm-Accelerator Mapping Space Search

Modern-day computing increasingly relies on specialization to satiate growing performance and efficiency requirements. A core challenge in designing such specialized hardware architectures is how to perform mapping space search, i.e., searching for an optimal mapping from algorithm to hardware. Prior work shows that choosing an inefficient mapping can lead to multiplicative-factor efficiency overheads. Additionally, the search space is not only large but also non-convex and non-smooth, precluding advanced search techniques.
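
A toy construction (ours, not the paper's) makes the two pain points concrete: the mapspace is combinatorially large, and cost varies jaggedly between neighboring mappings, which is what rules out gradient-style search:

    # Toy illustration of why mapping search is hard: even a tiny space of
    # tile sizes and loop orders is large, and the cost landscape is jagged.
    from itertools import permutations

    DIM = 224                                   # hypothetical layer dimension
    tile_sizes = range(1, DIM + 1)
    loop_orders = list(permutations("KCYX"))    # 4! orderings

    # Cost stand-in: utilization loss from the remainder tile of a tiling.
    def utilization(tile):
        full, rem = divmod(DIM, tile)
        issued = (full + (1 if rem else 0)) * tile  # compute slots occupied
        return DIM / issued                         # useful fraction

    # Neighboring tile sizes give very different utilization -> non-smooth.
    for t in (31, 32, 33):
        print(t, round(utilization(t), 3))          # 0.903, 1.0, 0.97

    print(len(tile_sizes) * len(loop_orders), "points in even this toy space")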

ExTensor: An Accelerator for Sparse Tensor Algebra

Generalized tensor algebra is a prime candidate for acceleration via customized ASICs. Modern tensors feature a wide range of data sparsity, with the density of non-zero elements ranging from 10^-6% to 50%. This paper proposes a novel approach to accelerate tensor kernels based on the principle of hierarchical elimination of computation in the presence of sparsity.
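
The principle is easy to sketch at a single level (ExTensor applies it hierarchically, from coarse tiles down to individual elements). With operands represented as sorted coordinate streams, an intersection emits work only where both streams agree; a minimal Python sketch under that simplification:

    # One-level sketch of elimination via intersection: multiplies happen
    # only at coordinates where BOTH sparse operands are nonzero.
    def intersect_dot(a, b):
        """a, b: dicts mapping coordinate -> value."""
        ka, kb = sorted(a), sorted(b)
        i = j = 0
        total = 0
        while i < len(ka) and j < len(kb):
            if ka[i] == kb[j]:
                total += a[ka[i]] * b[kb[j]]   # effectual compute only
                i += 1
                j += 1
            elif ka[i] < kb[j]:
                i += 1                         # eliminated: no match in b
            else:
                j += 1                         # eliminated: no match in a
        return total

    # Very sparse operands mostly skip past each other:
    print(intersect_dot({2: 1.0, 7: 3.0, 9: 2.0}, {7: 4.0, 8: 5.0}))  # 12.0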

Sparseloop: An Analytical Approach to Sparse Tensor Accelerator Modeling

In recent years, many accelerators have been proposed to efficiently process sparse tensor algebra applications (e.g., sparse neural networks). However, these proposals are single points in a large and diverse design space. The lack of systematic description and modeling support for these sparse tensor accelerators prevents hardware designers from performing efficient and effective design space exploration. This paper first presents a unified taxonomy to systematically describe the diverse sparse tensor accelerator design space.
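
To give a flavor of what "analytical" means here, a toy model of our own (far cruder than Sparseloop's statistical density models): with independent, uniformly distributed nonzeros, expected effectual work scales with the product of operand densities, so the benefit of a sparse-acceleration feature can be estimated without enumerating actual nonzero positions:

    # Back-of-the-envelope density model for a sparse GEMM (toy, not
    # Sparseloop's model): a MAC is effectual only if both inputs are
    # nonzero, which happens with probability dA * dB under independence.
    M, K, N = 512, 512, 512      # hypothetical GEMM shape
    dA, dB = 0.10, 0.30          # operand densities (fraction of nonzeros)

    dense_macs = M * K * N
    expected_effectual = dense_macs * dA * dB
    print(dense_macs, int(expected_effectual))  # 134217728, ~4026531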

Ruby: Improving Hardware Efficiency for Tensor Algebra Accelerators Through Imperfect Factorization

Finding high-quality mappings of Deep Neural Network (DNN) models onto tensor accelerators is critical for efficiency. State-of-the-art mapping exploration tools use remainderless (i.e., perfect) factorization to allocate hardware resources, tiling the tensors based on factors of the tensor dimensions. This limits the size of the search space (i.e., the mapspace) but can lead to low resource utilization. We introduce a new mapspace, Ruby, that adds remainders (i.e., imperfect factorization) to expand the mapspace with high-quality mappings for user-defined architectures.
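
The difference between the two mapspaces is easiest to see on a prime-sized dimension. A small sketch of our own, with a hypothetical 197-wide dimension and a 64-wide array:

    # Perfect (remainderless) factorization only allows tile sizes that
    # divide the dimension; imperfect factorization admits any tile size,
    # at the cost of one smaller remainder tile.
    def perfect_tile_sizes(dim):
        return [t for t in range(1, dim + 1) if dim % t == 0]

    def imperfect_tiles(dim, tile):
        full, rem = divmod(dim, tile)
        return [tile] * full + ([rem] if rem else [])

    dim = 197                        # a prime layer dimension
    print(perfect_tile_sizes(dim))   # [1, 197] -- degenerate choices only
    print(imperfect_tiles(dim, 64))  # [64, 64, 64, 5] -- fits a 64-wide array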

Understanding Reuse, Performance, and Hardware Cost of DNN Dataflows: A Data-Centric Approach

The data partitioning and scheduling strategies that DNN accelerators use to leverage reuse and perform staging are collectively known as dataflow, and they directly impact the performance and energy efficiency of DNN accelerator designs. An accelerator's microarchitecture dictates the dataflow(s) that can be employed to execute a layer or network.
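
A minimal way to see a dataflow, assuming nothing beyond a plain matmul: the same computation under two loop orders keeps a different operand stationary in the innermost loop, which is exactly the kind of reuse choice the paper's data-centric approach characterizes:

    # Same matmul, two dataflows: the loop order decides which operand is
    # "stationary" (reused in place) in the innermost loop.
    M = K = N = 4
    A = [[1.0] * K for _ in range(M)]
    B = [[1.0] * N for _ in range(K)]

    # Output-stationary: C[m][n] accumulates in place across the k loop.
    C = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            for k in range(K):        # C[m][n] reused K times here
                C[m][n] += A[m][k] * B[k][n]

    # Weight-stationary flavor: each B[k][n] is held across the m loop.
    C2 = [[0.0] * N for _ in range(M)]
    for k in range(K):
        for n in range(N):
            for m in range(M):        # B[k][n] reused M times here
                C2[m][n] += A[m][k] * B[k][n]

    assert C == C2                    # same result, different reuse pattern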

A Formalism of DNN Accelerator Flexibility

The high efficiency of domain-specific hardware accelerators for machine learning (ML) has come from specialization, with the trade-off of less configurability/flexibility. There is growing interest in developing flexible ML accelerators to make them future-proof to the rapid evolution of Deep Neural Networks (DNNs). However, the notion of accelerator flexibility has always been used in an informal manner, restricting computer architects from conducting systematic apples-to-apples design-space exploration (DSE) across trillions of flexible accelerator design points.

Buffets: An Efficient and Composable Storage Idiom for Explicit Decoupled Data Orchestration

Accelerators spend significant area and effort on custom on-chip buffering. Unfortunately, these solutions are strongly tied to particular designs, hampering reusability across other accelerators or domains. We present buffets, an efficient and composable storage idiom for the needs of accelerators that is independent of any particular design. Buffets have several distinguishing characteristics, including efficient decoupled fills and accesses with fine-grained synchronization, hierarchical composition, and efficient multi-casting.
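
A schematic software model of the idiom (our own simplification; the paper specifies a hardware storage primitive, and we omit its update operation): fills and reads are decoupled, a read is only legal once its slot has been filled, and shrink retires old entries to make room:

    # Toy model of a buffet: producer fills, consumer reads within the
    # filled window and shrinks to free space. In hardware, a read of an
    # unfilled slot blocks; here we assert instead.
    from collections import deque

    class Buffet:
        def __init__(self, capacity):
            self.capacity = capacity
            self.data = deque()          # filled-but-not-yet-shrunk window

        def fill(self, value):           # producer side (e.g., a DMA engine)
            assert len(self.data) < self.capacity, "backpressure: shrink first"
            self.data.append(value)

        def read(self, index):           # consumer side: random access in window
            assert index < len(self.data), "would block: slot not yet filled"
            return self.data[index]

        def shrink(self, n=1):           # consumer retires the oldest n entries
            for _ in range(n):
                self.data.popleft()

    buf = Buffet(capacity=4)
    for v in (10, 20, 30):
        buf.fill(v)
    print(buf.read(0) + buf.read(2))     # 40: reads within the filled window
    buf.shrink(2)                        # frees space so fills may proceed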

SIMD^2: A Generalized Matrix Instruction Set for Accelerating Tensor Computation beyond GEMM

Matrix-multiplication units (MXUs) are now prevalent in every computing platform. The key attribute that makes MXUs so successful is the semiring structure, which allows tiling for both parallelism and data reuse. Nonetheless, matrix multiplication is not the only algorithm with such attributes. We find that many algorithms share the same structure and differ only in the core operation; for example, using add-minimum instead of multiply-add.
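
The shared structure is matrix multiplication over a semiring: keep the three-loop (and tileable) skeleton, swap the (add, multiply) pair and the additive identity. A sketch in plain Python, where (min, +) yields one relaxation step of all-pairs shortest paths:

    # Semiring-generalized matmul: the loop structure is fixed; only the
    # core operation and identity change.
    def semiring_matmul(A, B, add, mul, zero):
        n, m, p = len(A), len(B), len(B[0])
        C = [[zero] * p for _ in range(n)]
        for i in range(n):
            for j in range(p):
                acc = zero
                for k in range(m):
                    acc = add(acc, mul(A[i][k], B[k][j]))
                C[i][j] = acc
        return C

    # Ordinary matmul over (+, *):
    A = [[1, 2], [3, 4]]
    print(semiring_matmul(A, A, lambda x, y: x + y, lambda x, y: x * y, 0))

    # "Add-minimum" over (min, +): one step of all-pairs shortest paths
    # on an edge-weight matrix D.
    INF = float("inf")
    D = [[0, 3, INF],
         [INF, 0, 1],
         [2, INF, 0]]
    print(semiring_matmul(D, D, min, lambda x, y: x + y, INF))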

Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators

A spatial accelerator’s efficiency depends heavily on both its mapper and its cost models to generate optimized mappings for the various operators of DNN models. However, existing cost models lack a formal boundary on their input programs (operators) for accurate and tractable cost analysis of mappings, which makes it hard to adapt the cost models to new operators.
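
As an illustration of what a formal boundary can look like (a toy check of our own, not Marvel's actual formalism): admit an operator only if every tensor subscript is a plain iteration dimension of a perfectly nested loop, so per-dimension reuse analysis stays tractable:

    # Toy admission check for a cost model's input boundary: every tensor
    # must be indexed only by loop iteration dimensions.
    def within_boundary(loop_dims, tensor_subscripts):
        """loop_dims: iteration dims; tensor_subscripts: dims per tensor."""
        dims = set(loop_dims)
        return all(set(subs) <= dims for subs in tensor_subscripts.values())

    # GEMM: C[m][n] += A[m][k] * B[k][n] -- inside the boundary.
    print(within_boundary(["m", "n", "k"],
                          {"A": ["m", "k"], "B": ["k", "n"], "C": ["m", "n"]}))

    # A data-dependent subscript (e.g., a gather's idx[i]) falls outside:
    # its access pattern is not a function of the loop dims alone.
    print(within_boundary(["i"], {"out": ["i"], "in": ["idx_i"]}))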