Ruby: Improving Hardware Efficiency for Tensor Algebra Accelerators Through Imperfect Factorization

Finding high-quality mappings of Deep Neural Network (DNN) models onto tensor accelerators is critical for efficiency. State-of-the-art mapping exploration tools use remainderless (i.e., perfect) factorization to allocate hardware resources, through tiling the tensors, based on factors of the tensor dimensions. This limits the size of the search space (i.e., the mapspace) but can lead to low resource utilization. We introduce a new mapspace, Ruby, that adds remainders (i.e., imperfect factorization) to expand the mapspace with high-quality mappings for user-defined architectures.
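
A minimal sketch in Python of the difference between the two schemes (not Ruby's actual implementation; the dimension and tile sizes are illustrative):

    def perfect_tile_sizes(dim):
        # Remainderless factorization: legal tile sizes are exactly the divisors.
        return [t for t in range(1, dim + 1) if dim % t == 0]

    def imperfect_tiles(dim, tile):
        # Imperfect factorization: any tile size; the last tile carries the remainder.
        full, rem = divmod(dim, tile)
        return [tile] * full + ([rem] if rem else [])

    dim = 224                          # illustrative tensor dimension
    print(perfect_tile_sizes(dim))     # [1, 2, 4, 7, 8, 14, 16, 28, 32, 56, 112, 224]
    print(imperfect_tiles(dim, 48))    # [48, 48, 48, 48, 32] -- 48 need not divide 224

With a hypothetical 48-entry buffer, a remainderless mapping must fall back to the largest divisor that fits (32, i.e., 67% buffer utilization), whereas the imperfect mapping fills the buffer for every tile but the last.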

Understanding Reuse, Performance, and Hardware Cost of DNN Dataflows: A Data-Centric Approach

The data partitioning and scheduling strategies used by DNN accelerators to leverage reuse and perform staging are known as the dataflow, and it directly impacts the performance and energy efficiency of DNN accelerator designs. An accelerator's microarchitecture dictates the dataflow(s) that can be employed to execute a layer or network.
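
As a toy illustration (a 1-D convolution in Python, not an example from the paper), two dataflows for the same computation differ only in loop order and staging, which changes which operand is reused in the innermost loop:

    def conv1d_weight_stationary(inp, wgt):
        # Stage one weight at a time and reuse it across all outputs.
        out = [0.0] * (len(inp) - len(wgt) + 1)
        for k, w in enumerate(wgt):
            for x in range(len(out)):
                out[x] += w * inp[x + k]
        return out

    def conv1d_output_stationary(inp, wgt):
        # Stage one output at a time and accumulate it in place.
        out = []
        for x in range(len(inp) - len(wgt) + 1):
            acc = 0.0
            for k, w in enumerate(wgt):
                acc += w * inp[x + k]
            out.append(acc)
        return out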

A Formalism of DNN Accelerator Flexibility

The high efficiency of domain-specific hardware accelerators for machine learning (ML) has come from specialization, with the trade-off of less configurability/flexibility. There is growing interest in developing flexible ML accelerators to make them future-proof to the rapid evolution of Deep Neural Networks (DNNs). However, the notion of accelerator flexibility has always been used in an informal manner, restricting computer architects from conducting systematic apples-to-apples design-space exploration (DSE) across trillions of

Buffets: An Efficient and Composable Storage Idiom for Explicit Decoupled Data Orchestration

Accelerators spend significant area and effort on custom on-chip buffering. Unfortunately, these solutions are strongly tied to particular designs, hampering reusability across other accelerators or domains. We present buffets, an efficient and composable storage idiom for the needs of accelerators that is independent of any particular design. Buffets have several distinguishing characteristics, including efficient decoupled fills and accesses with fine-grained synchronization, hierarchical composition, and efficient multi-casting.
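
A software sketch of the idiom in Python (a model of the interface only, not the paper's hardware design); the asserts stand in for the fine-grained stalls a hardware buffet performs instead of failing:

    from collections import deque

    class Buffet:
        def __init__(self, capacity):
            self.capacity = capacity
            self.data = deque()

        def fill(self, value):
            # Producer side, decoupled from consumer reads.
            assert len(self.data) < self.capacity, "hardware would stall until shrink"
            self.data.append(value)

        def read(self, offset):
            # Consumer reads by offset relative to the window head.
            assert offset < len(self.data), "hardware would stall until filled"
            return self.data[offset]

        def shrink(self, n):
            # Retire the n oldest entries, freeing space for new fills.
            for _ in range(n):
                self.data.popleft()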

SIMD^2: A Generalized Matrix Instruction Set for Accelerating Tensor Computation beyond GEMM

Matrix-multiplication units (MXUs) are now prevalent in every computing platform. The key attribute that makes MXUs so successful is the semiring structure, which allows tiling for both parallelism and data reuse. Nonetheless, matrix-multiplication is not the only algorithm with such attributes. We find that many algorithms share the same structure and differ in only the core operation; for example, using add-minimum instead of multiply-add.
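
A worked example in Python of that shared structure: the same three-loop kernel computes standard GEMM with (multiply, add) or a shortest-path relaxation with (add, minimum), changing only the core operation (the graph here is illustrative):

    def matmul_like(A, B, mul, add, identity):
        # The classic tiling-friendly loop nest; the semiring is a parameter.
        n = len(A)
        C = [[identity] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    C[i][j] = add(C[i][j], mul(A[i][k], B[k][j]))
        return C

    INF = float("inf")
    D = [[0, 3, INF],
         [INF, 0, 1],
         [2, INF, 0]]               # edge weights of a small graph

    # (add, min) instance: D "squared" gives shortest paths of at most two hops.
    paths = matmul_like(D, D, mul=lambda a, b: a + b, add=min, identity=INF)
    print(paths)                    # [[0, 3, 4], [3, 0, 1], [2, 5, 0]]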

Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators

A spatial accelerator's efficiency depends heavily on both its mapper and its cost models to generate optimized mappings for the various operators of DNN models. However, existing cost models lack a formal boundary on their input programs (operators) for accurate and tractable cost analysis of mappings, which makes it hard to adapt the cost models to new operators.

Demystifying Map Space Exploration for NPUs

Map Space Exploration is the problem of finding optimized mappings of a Deep Neural Network (DNN) model on an accelerator. It is known to be extremely computationally expensive, and there has been active research looking at both heuristics and learning-based methods to make the problem computationally tractable.
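
A toy sketch in Python of why the space explodes, assuming a single GEMM layer, a two-level memory, and a crude off-chip-traffic proxy for cost (all sizes hypothetical; real NPU cost models are far richer):

    import itertools
    import random

    M, N, K = 512, 512, 512            # hypothetical GEMM layer
    BUF = 64 * 1024                    # hypothetical on-chip buffer, in elements

    def fits(tm, tn, tk):
        # Tiles of A (tm x tk), B (tk x tn), and C (tm x tn) must fit on chip.
        return tm * tk + tk * tn + tm * tn <= BUF

    def traffic(tm, tn, tk):
        # Standard tiled-GEMM traffic estimate: MNK * (1/tm + 1/tn + 1/tk).
        return M * N * K * (1.0 / tm + 1.0 / tn + 1.0 / tk)

    sizes = [2 ** i for i in range(1, 10)]   # candidate tile sizes: 2 .. 512
    legal = [t for t in itertools.product(sizes, repeat=3) if fits(*t)]
    best = min(random.sample(legal, min(200, len(legal))), key=lambda t: traffic(*t))
    print(len(legal), "legal tilings for one layer; random search picked", best)

Even this toy space has hundreds of legal points for a single layer; a full mapping additionally chooses loop orders and spatial/temporal splits per memory level, which is what makes exhaustive search intractable.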

Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators

To meet the extreme compute demands of deep learning across commercial and scientific applications, dataflow accelerators are becoming increasingly popular. While these “domain-specific” accelerators are not fully programmable like CPUs and GPUs, they retain varying levels of flexibility with respect to data orchestration, i.e., dataflow and tiling optimizations to enhance efficiency. Designing new algorithms, and mapping approaches to execute those algorithms for a target problem on new hardware, raises several challenges, which previous works have addressed only individually.

Stitch-X: An Accelerator Architecture for Exploiting Unstructured Sparsity in Deep Neural Networks

Sparse deep neural network (DNN) accelerators exploit the intrinsic redundancy in data representation to achieve high performance and energy efficiency. However, sparse weight and input-activation arrays are unstructured, so their processing cannot take advantage of the regular data-access patterns offered by dense arrays and thus incurs increased complexity in dataflow orchestration and resource management.
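
A small Python sketch of that irregularity, assuming the weights are stored in compressed sparse row (CSR) form: the activations gathered per row depend on the data-dependent nonzero positions, defeating the streaming access patterns dense arrays enjoy (values are illustrative):

    values = [2.0, 1.5, 3.0, 0.5]      # nonzero weights only
    cols   = [1, 4, 0, 3]              # their (irregular) column positions
    rowptr = [0, 2, 4]                 # row i owns values[rowptr[i]:rowptr[i+1]]
    acts   = [1.0, 2.0, 3.0, 4.0, 5.0]

    out = []
    for i in range(len(rowptr) - 1):
        acc = 0.0
        for p in range(rowptr[i], rowptr[i + 1]):
            acc += values[p] * acts[cols[p]]   # data-dependent gather
        out.append(acc)
    print(out)                         # [11.5, 5.0]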

QuadStream: A Quad-Based Scene Streaming Architecture for Novel Viewpoint Reconstruction

Cloud rendering is attractive when targeting thin client devices such as phones or VR/AR headsets, or any situation where a high-end GPU is unavailable due to thermal or power constraints. However, it introduces the challenge of streaming rendered data over a network in a manner that is robust to latency and potential dropouts. Current approaches range from streaming the rendered video and correcting it on the client, which fails in the presence of disocclusion events, to solutions where the server sends geometry and all rendering is performed on the client.