A Formalism of DNN Accelerator Flexibility

The high efficiency of domain-specific hardware accelerators for machine learning (ML) has come from specialization, with the trade-off of less configurability/flexibility. There is growing interest in developing flexible ML accelerators to make them future-proof to the rapid evolution of Deep Neural Networks (DNNs). However, the notion of accelerator flexibility has always been used in an informal manner, restricting computer architects from conducting systematic apples-to-apples design-space exploration (DSE) across trillions of

Buffets: An Efficient and Composable Storage Idiom for Explicit Decoupled Data Orchestration

Accelerators spend significant area and effort on custom on-chip buffering. Unfortunately, these solutions are strongly tied to particular designs, hampering re-usability across other accelerators or domains. We present buffets, an efficient and composable storage idiom for the needs of accelerators that is independent of any particular design. Buffets have several distinguishing characteristics, including efficient decoupled fills and accesses with fine-grained synchronization, hierarchical composition, and efficient multi-casting.

SIMD^2: A Generalized Matrix Instruction Set for Accelerating Tensor Computation beyond GEMM

Matrix-multiplication units (MXUs) are now prevalent in every computing platform. The key attribute that makes MXUs so successful is the semiring structure, which allows tiling for both parallelism and data reuse. Nonetheless, matrix-multiplication is not the only algorithm with such attributes. We find that many algorithms share the same structure and differ in only the core operation; for example, using add-minimum instead of multiply-add.
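The shared structure described above can be illustrated with a minimal sketch in plain Python. It is not the SIMD^2 instruction set or its API: the function names (`semiring_matmul`, `gemm`, `min_plus`) are hypothetical and chosen only for illustration. The same triple loop serves both the ordinary multiply-add product and the add-minimum (min-plus) product; only the two core operations and the identity element change.

```python
from typing import Callable, List

Matrix = List[List[float]]
INF = float("inf")


def semiring_matmul(
    a: Matrix,
    b: Matrix,
    add: Callable[[float, float], float],
    mul: Callable[[float, float], float],
    zero: float,
) -> Matrix:
    """Generic matrix product over a semiring (add, mul, zero).

    Hypothetical helper for illustration; `zero` is the identity of `add`.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[zero] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = zero
            for p in range(inner):
                # The loop nest is identical for every semiring;
                # only `add` and `mul` differ.
                acc = add(acc, mul(a[i][p], b[p][j]))
            out[i][j] = acc
    return out


def gemm(a: Matrix, b: Matrix) -> Matrix:
    """Ordinary matrix multiplication: add is +, mul is *, identity is 0."""
    return semiring_matmul(a, b, lambda x, y: x + y, lambda x, y: x * y, 0.0)


def min_plus(a: Matrix, b: Matrix) -> Matrix:
    """Add-minimum product: add is min, mul is +, identity is +infinity."""
    return semiring_matmul(a, b, min, lambda x, y: x + y, INF)
```

Applied to a matrix of edge weights (with `INF` for missing edges and 0 on the diagonal), `min_plus(d, d)` gives the shortest path lengths that use at most two edges, which is one example of an algorithm that shares the tiling-friendly structure of matrix multiplication.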

Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators

A spatial accelerator’s efficiency depends heavily on both its mapper and cost models to generate optimized mappings for various operators of DNN models. However, existing cost models lack a formal boundary over their input programs (operators) for accurate and tractable cost analysis of the mappings, and this results in adaptability challenges to the cost models for new operators.

Demystifying Map Space Exploration for NPUs

Map Space Exploration is the problem of finding optimized mappings of a Deep Neural Network (DNN) model on an accelerator. It is known to be extremely computationally expensive, and there has been active research looking at both heuristics and learning-based methods to make the problem computationally tractable.
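To make the problem concrete, here is a toy sketch of exhaustive map space exploration for a single matrix multiplication. Everything in it is an assumption made for illustration and is not taken from the paper: a mapping is reduced to a tile shape `(tm, tn, tk)`, legality is a single buffer-capacity check, and `dram_traffic` is a deliberately simple cost model. Real mappers also consider loop order, parallelism, and multiple memory levels, which is what makes the space so large.

```python
from itertools import product
from typing import Iterator, List, Tuple

Tile = Tuple[int, int, int]


def divisors(n: int) -> List[int]:
    """All positive divisors of n, used as candidate tile sizes."""
    return [d for d in range(1, n + 1) if n % d == 0]


def enumerate_mappings(M: int, N: int, K: int, buffer_words: int) -> Iterator[Tile]:
    """Yield every tile shape whose A, B and C tiles fit in the buffer."""
    for tm, tn, tk in product(divisors(M), divisors(N), divisors(K)):
        footprint = tm * tk + tk * tn + tm * tn
        if footprint <= buffer_words:
            yield (tm, tn, tk)


def dram_traffic(M: int, N: int, K: int, tile: Tile) -> int:
    """Toy cost model (assumption): words moved to and from off-chip memory.

    A and B tiles are reloaded for every tile iteration; each output
    element is written once.
    """
    tm, tn, tk = tile
    iterations = (M // tm) * (N // tn) * (K // tk)
    return iterations * (tm * tk + tk * tn) + M * N


def best_mapping(M: int, N: int, K: int, buffer_words: int) -> Tile:
    """Brute-force search: evaluate every legal mapping and keep the cheapest."""
    return min(
        enumerate_mappings(M, N, K, buffer_words),
        key=lambda tile: dram_traffic(M, N, K, tile),
    )
```

Even in this reduced form the number of candidates grows with the product of the divisor counts of each dimension, which hints at why heuristics and learning-based methods are attractive once more mapping choices are included.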

Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators

To meet the extreme compute demands for deep learning across commercial and scientific applications, dataflow accelerators are becoming increasingly popular. While these “domain-specific” accelerators are not fully programmable like CPUs and GPUs, they retain varying levels of flexibility with respect to data orchestration, i.e., dataflow and tiling optimizations to enhance efficiency. There are several challenges when designing new algorithms and mapping approaches to execute the algorithms for a target problem on new hardware. Previous works have addressed these challenges individually.

Stitch-X: An Accelerator Architecture for Exploiting Unstructured Sparsity in Deep Neural Networks

Sparse deep neural network (DNN) accelerators exploit the intrinsic redundancy in data representation to achieve high performance and energy efficiency. However, sparse weight and input activation arrays are unstructured, and their processing cannot take advantage of the regular data-access patterns offered by dense arrays, thus the processing incurs increased complexities in dataflow orchestration and resource management.

QuadStream: A Quad-Based Scene Streaming Architecture for Novel Viewpoint Reconstruction

Cloud rendering is attractive when targeting thin client devices such as phones or VR/AR headsets, or any situation where a high-end GPU is not available due to thermal or power constraints. However, it introduces the challenge of streaming rendered data over a network in a manner that is robust to latency and potential dropouts. Current approaches range from streaming transmitted video and correcting it on the client (which fails in the presence of disocclusion events) to solutions where the server sends geometry and all rendering is performed on the client.

CreatureShop: Interactive 3D Character Modeling and Texturing from a Single Color Drawing

Creating 3D shapes from 2D drawings is an important problem with applications in content creation for computer animation and virtual reality. We introduce a new sketch-based system, CreatureShop, that enables amateurs to create high-quality textured 3D character models from 2D drawings with ease and efficiency.

Learning A Continuous and Reconstructible Latent Space for Hardware Accelerator Design

The hardware design space is high-dimensional and discrete. Systematic and efficient exploration of this space has been a significant challenge. Central to this problem is the intractable search complexity that grows exponentially with the design choices and the discrete nature of the search space. This work investigates the feasibility of learning a meaningful low-dimensional continuous representation for hardware designs to reduce such complexity and facilitate the search process.