Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers

Transformers have transformed the field of natural language processing. Their success is largely attributed to the use of stacked self-attention layers, each of which consists of matrix multiplies as well as softmax operations. As a result, unlike other neural networks, the softmax operation accounts for a significant fraction of the total run-time of Transformers. To address this, we propose Softermax, a hardware-friendly softmax design. Softermax consists of base replacement, low-precision softmax computations, and an online normalization calculation.
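The online normalization idea pairs naturally with base replacement: a running maximum and running sum are updated in a single pass, so no separate max pass over the scores is needed. Below is a minimal NumPy sketch of a base-2, online-normalized softmax; it illustrates the algorithmic idea only, not the paper's low-precision hardware datapath, and the function name is our own.

```python
import numpy as np

def online_softmax_base2(scores):
    """One-pass normalization in the spirit of Softermax (illustrative sketch):
    uses 2**x instead of e**x and updates the running max and running sum
    online, so the scores are read only twice in total."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in scores:
        new_max = max(running_max, x)
        # Rescale the accumulated sum whenever the running max changes.
        running_sum = running_sum * 2.0 ** (running_max - new_max) + 2.0 ** (x - new_max)
        running_max = new_max
    # Second pass only normalizes; every exponent reuses the final max.
    return np.array([2.0 ** (x - running_max) for x in scores]) / running_sum

attn = online_softmax_base2(np.array([1.5, -0.3, 2.0, 0.7]))
print(attn, attn.sum())   # probabilities summing to 1
```

Note that a base-2 softmax is not numerically identical to the standard base-e softmax; the sketch only demonstrates the one-pass normalization pattern.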

Simba: scaling deep-learning inference with chiplet-based architecture

Package-level integration using multi-chip-modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically only contain a handful of coarse-grained large chiplets due to the high area, performance, and energy overheads associated with inter-chiplet communication.

Verifying High-Level Latency-Insensitive Designs with Formal Model Checking

Latency-insensitive design mitigates increasing interconnect delay and enables productive component reuse in complex digital systems. This design style has been adopted in high-level design flows because untimed functional blocks connected through latency-insensitive interfaces provide a natural communication abstraction. However, latency-insensitive design with high-level languages also introduces a unique set of verification challenges that jeopardize functional correctness.

Opportunities for RTL and Gate Level Simulation using GPUs

This paper summarizes the opportunities for accelerating simulation on parallel processing hardware platforms such as GPUs. First, we give a summary of prior art. Then, we propose that coding frameworks usually used for popular machine learning (ML) topics, such as PyTorch and DGL.ai, can also be used for simulation. We demo a crude oblivious two-value cycle gate-level simulator built on the high-level ML framework APIs that exhibits >20X speedup despite its simplistic construction.
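As a rough illustration of how gate-level simulation maps onto tensor operations, the sketch below evaluates a levelized netlist with batched PyTorch ops; the netlist encoding, gate-type codes, and toy circuit are our own assumptions, and this is not the DGL.ai-based simulator described in the paper.

```python
import torch

# Hedged sketch of an oblivious two-value gate-level simulator expressed with
# tensor ops. Gate types: 0=AND, 1=OR, 2=XOR, 3=NOT (second fanin ignored for NOT).

def simulate(levels, values):
    """levels: list of (gate_type, fanin_a, fanin_b, out_idx) LongTensors,
    one entry per topological level; values: (batch, num_nets) 0/1 tensor."""
    for gate_type, fanin_a, fanin_b, out_idx in levels:
        a = values[:, fanin_a]                     # gather fanin values
        b = values[:, fanin_b]
        out = torch.where(gate_type == 0, a & b,
              torch.where(gate_type == 1, a | b,
              torch.where(gate_type == 2, a ^ b, 1 - a)))
        values[:, out_idx] = out                   # oblivious: every gate is evaluated
    return values

# Toy netlist: nets 0,1 are primary inputs; net 2 = AND(0,1); net 3 = NOT(2).
levels = [
    (torch.tensor([0]), torch.tensor([0]), torch.tensor([1]), torch.tensor([2])),
    (torch.tensor([3]), torch.tensor([2]), torch.tensor([2]), torch.tensor([3])),
]
vals = torch.zeros(4, 4, dtype=torch.long)         # batch of 4 stimuli, 4 nets
vals[:, 0] = torch.tensor([0, 0, 1, 1])
vals[:, 1] = torch.tensor([0, 1, 0, 1])
print(simulate(levels, vals)[:, 3])                # NAND truth table: 1 1 1 0
```

Because every gate in a level is evaluated with the same tensor expression, the batch and gate dimensions give the GPU plenty of parallel work even for this naive construction.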

Standard Cell Routing with Reinforcement Learning and Genetic Algorithm in Advanced Technology Nodes

Automated standard cell routing in advanced technology nodes with unidirectional metal is challenging because of the exploding number of design rule constraints. Previous approaches leveraged mathematical optimization methods such as SAT and MILP to find an optimal solution under those constraints. Those methods rely on the assumption that all the design rules can be expressed in the optimization framework and that the solver is powerful enough to solve them. In this paper, we propose a machine learning based approach that does not depend on this assumption.
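As a toy illustration of the genetic-algorithm ingredient named in the title, the sketch below evolves a net-to-track assignment on a made-up channel-routing-style problem; the problem setup, fitness function, and GA parameters are ours and are far simpler than real standard cell routing.

```python
import random

# Toy GA: assign nets to routing tracks so nets with overlapping horizontal
# spans do not share a track. Entirely illustrative, not the paper's router.
NETS = [(0, 4), (2, 6), (5, 9), (1, 3), (7, 9)]   # (left, right) span per net
TRACKS = 3

def conflicts(assign):
    return sum(1 for i in range(len(NETS)) for j in range(i + 1, len(NETS))
               if assign[i] == assign[j]
               and NETS[i][0] <= NETS[j][1] and NETS[j][0] <= NETS[i][1])

def evolve(pop_size=40, generations=60, mutation_rate=0.1):
    pop = [[random.randrange(TRACKS) for _ in NETS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=conflicts)                    # lower is better
        survivors = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, len(NETS))   # one-point crossover
            child = a[:cut] + b[cut:]
            child = [random.randrange(TRACKS) if random.random() < mutation_rate
                     else t for t in child]        # random mutation
            children.append(child)
        pop = survivors + children
    return min(pop, key=conflicts)

best = evolve()
print(best, "conflicts:", conflicts(best))
```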

MAVIREC: ML-Aided Vectored IR-Drop Estimation and Classification

Vectored IR drop analysis is a critical step in chip signoff that checks the power integrity of an on-chip power delivery network. Due to the prohibitive runtimes of dynamic IR drop analysis, the large number of test patterns must be whittled down to a small subset of worst-case IR vectors. Unlike traditional slow heuristic methods that select a few vectors with incomplete coverage, MAVIREC uses machine learning techniques (3D convolutions and regression-like layers) to accurately recommend a larger subset of test patterns that exercise worst-case scenarios.
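A rough sketch of the kind of model the abstract describes is shown below: 3D convolutions over time-stacked power maps followed by a regression-like head that outputs a per-tile IR-drop map. The layer counts, kernel sizes, and tensor layout are our assumptions, not MAVIREC's actual architecture.

```python
import torch
import torch.nn as nn

class IRDropNet(nn.Module):
    """Illustrative 3D-conv regressor: power maps over time in, IR-drop map out."""
    def __init__(self, in_channels=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Collapse the time dimension, then regress one IR-drop value per tile.
        self.head = nn.Conv2d(32, 1, kernel_size=1)

    def forward(self, x):               # x: (batch, 1, time, height, width)
        f = self.features(x)            # (batch, 32, time, height, width)
        f = f.mean(dim=2)               # pool over time -> (batch, 32, H, W)
        return self.head(f)             # (batch, 1, H, W) predicted IR drop

model = IRDropNet()
power_maps = torch.randn(2, 1, 8, 64, 64)   # 2 patterns, 8 time steps, 64x64 tiles
print(model(power_maps).shape)              # torch.Size([2, 1, 64, 64])
```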

Parasitic-Aware Analog Circuit Sizing with Graph Neural Networks and Bayesian Optimization

Layout parasitics significantly impact the performance of analog integrated circuits, leading to discrepancies between schematic and post-layout performance and requiring several iterations to achieve design convergence. Prior work has accounted for parasitic effects during the initial design phase but relies on automated layout generation for estimating parasitics. In this work, we leverage recent developments in parasitic prediction using graph neural networks to eliminate the need for in-the-loop layout generation.
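As a hedged sketch of the parasitic-prediction piece, the snippet below implements a small message-passing network in plain PyTorch that regresses a parasitic estimate per node of a circuit graph, standing in for in-the-loop layout extraction; the node features, network sizes, and readout are illustrative choices, not the model from the paper.

```python
import torch
import torch.nn as nn

class ParasiticGNN(nn.Module):
    """Illustrative message-passing regressor over a circuit connectivity graph."""
    def __init__(self, in_dim=4, hidden=32, rounds=3):
        super().__init__()
        self.encode = nn.Linear(in_dim, hidden)
        self.message = nn.Linear(hidden, hidden)
        self.update = nn.GRUCell(hidden, hidden)
        self.readout = nn.Linear(hidden, 1)
        self.rounds = rounds

    def forward(self, node_feats, adj):
        # node_feats: (num_nodes, in_dim); adj: (num_nodes, num_nodes) 0/1 matrix
        h = torch.relu(self.encode(node_feats))
        for _ in range(self.rounds):
            msgs = adj @ self.message(h)       # aggregate neighbor messages
            h = self.update(msgs, h)           # GRU-style node update
        return self.readout(h).squeeze(-1)     # predicted parasitic per node

# Toy 3-node graph; features could encode device width, net fanout, etc.
feats = torch.randn(3, 4)
adj = torch.tensor([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
print(ParasiticGNN()(feats, adj))              # one parasitic estimate per node
```

In a sizing loop, such a predictor would be queried in place of layout generation and extraction, with the optimizer (for example Bayesian optimization) selecting the next candidate sizing from the parasitic-aware performance estimates.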

VS-QUANT: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference

Quantization enables efficient acceleration of deep neural networks by reducing model memory footprint and exploiting low-cost integer math hardware units. Quantization maps floating-point weights and activations in a trained model to low-bitwidth integer values using scale factors. Excessive quantization, reducing precision too aggressively, results in accuracy degradation. When scale factors are shared at a coarse granularity across many dimensions of each tensor, the effective precision of individual elements within the tensor is limited.
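The idea of per-vector scaling can be illustrated directly: each small vector of consecutive elements gets its own scale factor instead of sharing one across a whole tensor or channel. The sketch below is a plain NumPy illustration; the vector size, bit width, and rounding scheme are illustrative choices rather than the paper's exact configuration.

```python
import numpy as np

def quantize_per_vector(x, vector_size=16, bits=4):
    """Quantize each vector of `vector_size` elements with its own scale factor."""
    qmax = 2 ** (bits - 1) - 1                     # e.g. 7 for signed 4-bit
    x = x.reshape(-1, vector_size)
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)       # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale                # integer values + per-vector scales

def dequantize(q, scale):
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 64).astype(np.float32)
q, s = quantize_per_vector(weights)
error = np.abs(dequantize(q, s).reshape(weights.shape) - weights).mean()
print("mean absolute quantization error:", error)
```

Because each scale only has to cover the dynamic range of its own small vector, outliers in one part of the tensor no longer force a coarse scale onto every other element.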

Siddharth Gururani

Siddharth Gururani is a Research Scientist at NVIDIA. Prior to joining NVIDIA, he was an AI Scientist at EA, where he worked on expressive speech synthesis, focusing on low-resource regimes and approaches based on interpretable features to encode prosody. He received his Ph.D.