Arbitrary Modulus Indexing

Modern high-performance processors require memory systems that can supply data at a rate well matched to the processor's computation rate. Such systems commonly organize memory into local high-speed banks that can be accessed in parallel, and they make associative lookup of values efficient through indexing rather than fully associative memories. These techniques lose effectiveness when data locations are not mapped uniformly across the banks or cache locations, creating bottlenecks from excess demand on a subset of locations.
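
The core idea can be illustrated with a small sketch (illustrative only, not the paper's hardware design): mapping addresses onto a non-power-of-two number of banks spreads common access strides far more uniformly than the usual power-of-two mapping.

```python
# Sketch: bank-conflict behavior for power-of-two vs. prime bank counts.
# Illustrative only; real hardware uses efficient modulo circuits, not
# Python arithmetic.

from collections import Counter

def bank_histogram(num_banks, stride, accesses=1024):
    """Count how many accesses land in each bank for a strided stream."""
    return Counter((i * stride) % num_banks for i in range(accesses))

# A stride of 8 maps every access to the same bank when num_banks = 8 ...
print(bank_histogram(num_banks=8, stride=8))   # {0: 1024} -> worst case

# ... but spreads accesses evenly when num_banks = 7 (arbitrary modulus).
print(bank_histogram(num_banks=7, stride=8))   # ~146 accesses per bank
```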

21st Century Digital Design Tools

Most chips today are designed with 20th-century CAD tools. These tools, and the abstractions they are built on, were originally intended for designs of millions of gates or fewer; they are not up to the task of today's billion-gate designs. The result is months of delay and considerable labor from final RTL to tapeout, with frequent surprises in timing closure, global routing congestion, and power consumption. Even porting an existing design to a new process node is time-consuming and laborious.

Convergence and Scalarization for Data-Parallel Architectures

Modern throughput processors such as GPUs achieve high performance and efficiency by exploiting data parallelism in application kernels expressed as threaded code. One drawback of this approach compared to conventional vector architectures is redundant execution of instructions that are common across multiple threads, resulting in energy inefficiency due to excess instruction dispatch, register file accesses, and memory operations.
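
A toy illustration of the opportunity (not the paper's mechanism): any operation whose operands do not, even transitively, depend on the thread index produces the same value in every thread and can be executed once on a scalar unit. The hypothetical IR below exists only for this sketch.

```python
# Toy uniformity analysis: an instruction is scalarizable if none of its
# operands (transitively) depend on the thread ID. Hypothetical IR, for
# illustration only.

def find_uniform(instructions):
    """instructions: list of (dest, op, operands). 'tid' is divergent."""
    divergent = {"tid"}
    for dest, op, srcs in instructions:
        if any(s in divergent for s in srcs):
            divergent.add(dest)
    return [inst for inst in instructions if inst[0] not in divergent]

kernel = [
    ("base",  "load", ["ptr"]),           # same in every thread -> scalar
    ("scale", "mul",  ["base", "c"]),     # still uniform        -> scalar
    ("off",   "mul",  ["tid", "four"]),   # depends on thread ID -> vector
    ("addr",  "add",  ["scale", "off"]),  # tainted by 'off'     -> vector
]
print([d for d, _, _ in find_uniform(kernel)])  # ['base', 'scale']
```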

Visibility Algorithms for Dynamic Dependence Analysis and Distributed Coherence

Implicitly parallel programming systems must solve the joint problems of dependence analysis and coherence to ensure apparently sequential semantics for applications run on distributed-memory machines. Solving these problems in the presence of data-dependent control flow and arbitrary aliasing is a challenge that most existing systems sidestep by compromising the expressivity of their programming models and/or the performance of their implementations. We demonstrate a general class of solutions to these problems via a reduction to the visibility problem from computer graphics.
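
As a loose analogy (not the paper's algorithm), one can picture tasks issued in program order as layers painted onto the data they write: a later task depends on exactly those earlier writers that remain "visible" beneath the region it accesses, much like a depth test over a 1D index space.

```python
# Loose analogy only: dependence analysis as visibility over a 1D index
# space. Each write "paints over" the region it covers; a reader depends on
# whichever earlier writers are still visible under its interval.

def visible_writers(writes, read_lo, read_hi):
    """writes: list of (task, lo, hi) in program order (earliest first).
    Returns the tasks visible somewhere in [read_lo, read_hi)."""
    owner = {}  # index -> last task that wrote it (the front-most layer)
    for task, lo, hi in writes:
        for i in range(lo, hi):
            owner[i] = task
    return {owner[i] for i in range(read_lo, read_hi) if i in owner}

writes = [("t0", 0, 8), ("t1", 4, 6)]   # t1 overwrites part of t0
print(visible_writers(writes, 0, 8))    # {'t0', 't1'}: both partly visible
print(visible_writers(writes, 4, 6))    # {'t1'}: t0 fully occluded here
```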

A 0.297-pJ/bit 50.4-Gb/s/wire Inverter-Based Short-Reach Simultaneous Bidirectional Transceiver for Die-to-Die Interface in 5nm CMOS

This paper presents a clock-forwarded, Inverter-based Short-Reach Simultaneous Bidirectional (ISR-SBD) PHY targeted at die-to-die communication over silicon interposer or similar high-density interconnect. Fabricated in a 5 nm standard CMOS process, the ISR-SBD PHY demonstrates 50.4 Gb/s/wire (25.2 Gb/s in each direction) at 0.297 pJ/bit from a 0.75 V supply over a 1.2 mm on-chip channel.
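
At a block level, simultaneous bidirectional signaling drives both ends of the wire at once, and each receiver subtracts a replica of its own transmitted level from the observed line voltage to recover the far-end bit. The following is an idealized behavioral sketch of that principle only, not a model of this circuit.

```python
# Behavioral model of simultaneous bidirectional signaling on one wire.
# The line voltage superimposes both drivers; each side cancels its own
# contribution with a local replica. Idealized: no noise, perfect replica.

def line_voltage(tx_a, tx_b):
    """Both ends drive +/-0.5 simultaneously; the wire sums them."""
    return 0.5 * tx_a + 0.5 * tx_b

def receive(line, own_tx):
    """Subtract the local replica, then slice the residue."""
    residue = line - 0.5 * own_tx
    return 1 if residue > 0 else 0

for bit_a in (0, 1):
    for bit_b in (0, 1):
        v = line_voltage(2 * bit_a - 1, 2 * bit_b - 1)  # {0,1} -> {-1,+1}
        assert receive(v, 2 * bit_a - 1) == bit_b       # A recovers B's bit
        assert receive(v, 2 * bit_b - 1) == bit_a       # B recovers A's bit
```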

LNS-Madam: Low-Precision Training in Logarithmic Number System Using Multiplicative Weight Update

Representing deep neural networks (DNNs) in low precision is a promising approach to efficient acceleration and memory reduction. Previous methods that train DNNs in low precision typically keep a high-precision copy of the weights during the weight updates, because directly training with low-precision weights leads to accuracy degradation arising from complex interactions between the low-precision number system and the learning algorithm.
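
The appeal of pairing a logarithmic number system with a multiplicative update can be sketched as follows (an illustrative simplification, not the paper's exact update rule): with weights stored as a sign plus a log-domain magnitude, a multiplicative update becomes a simple addition to the stored exponent, so the update itself stays in low precision and no high-precision weight copy is needed.

```python
import numpy as np

# Simplified multiplicative, sign-based update in a logarithmic number
# system (not the exact LNS-Madam rule). Weights are (sign, log2|w|);
# w *= 2^(-lr * sign(w * g)) becomes an addition on the stored exponent.

def lns_madam_step(sign_w, log2_mag, grad, lr=0.01, frac_bits=8):
    """One multiplicative step, quantizing the exponent to frac_bits."""
    direction = np.sign(sign_w * grad)     # grow or shrink each |w|
    log2_mag = log2_mag - lr * direction   # additive in the log domain
    scale = 2.0 ** frac_bits               # keep the exponent low-precision
    return np.round(log2_mag * scale) / scale

sign_w = np.array([1.0, -1.0, 1.0])
log2_mag = np.log2(np.array([0.5, 0.25, 1.0]))
grad = np.array([0.1, -0.2, -0.3])
log2_mag = lns_madam_step(sign_w, log2_mag, grad)
print(sign_w * 2.0 ** log2_mag)            # updated low-precision weights
```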

An Adversarial Active Sampling-based Data Augmentation Framework for Manufacturable Chip Design

Lithography modeling is a crucial problem in chip design: it ensures that a design's mask is manufacturable, and it requires rigorous simulations of optical and chemical models that are computationally expensive. Recent developments in machine learning have offered alternatives that replace the time-consuming lithography simulations with deep neural networks. However, a considerable accuracy drop still impedes their industrial adoption, and most importantly, the quality and quantity of the training dataset directly affect model performance.
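
The general shape of such a framework can be sketched as a loop in which only the inputs where the surrogate is least trustworthy are sent to the expensive simulator. Everything below is a toy stand-in: a polynomial ensemble plays the surrogate, ensemble disagreement plays the adversarial sampler, and a cheap function plays the rigorous simulation.

```python
import numpy as np

# Toy active-sampling loop (illustrative only, not the paper's framework).
def simulator(x):                       # stand-in for rigorous litho models
    return np.sin(3 * x) * np.exp(-x)

def fit_ensemble(xs, ys, degrees=(3, 4, 5)):
    return [np.poly1d(np.polyfit(xs, ys, d)) for d in degrees]

pool = np.linspace(0.0, 3.0, 300)       # candidate "mask patterns"
xs = np.linspace(0.0, 3.0, 8)           # small initial training set
ys = simulator(xs)

for _ in range(4):
    ensemble = fit_ensemble(xs, ys)
    preds = np.stack([m(pool) for m in ensemble])
    disagreement = preds.std(axis=0)            # proxy for expected error
    hardest = pool[np.argsort(disagreement)[-4:]]
    xs = np.concatenate([xs, hardest])          # query simulator only here
    ys = np.concatenate([ys, simulator(hardest)])

print(len(xs), "training samples after augmentation")
```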

TAG: Learning Circuit Spatial Embedding from Layouts

Analog and mixed-signal (AMS) circuit design still relies on human expertise. Machine learning has begun to assist circuit design automation by substituting learned models for human experience. This paper presents TAG, a new paradigm for learning circuit representations from layouts that leverages Text, self-Attention, and Graphs. The embedding network learns spatial information without manual labeling, and we introduce text embedding and a self-attention mechanism to AMS circuit learning.
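
A minimal sketch of the general recipe (hypothetical shapes, device names, and connectivity; the actual TAG architecture differs): per-device text embeddings are refined by a self-attention layer whose attention is restricted to the circuit graph's edges.

```python
import numpy as np

# Sketch: text embeddings + self-attention masked by a circuit graph.
rng = np.random.default_rng(0)
d = 16
devices = ["nmos_in", "pmos_load", "cap_comp", "res_bias"]  # hypothetical

# 1) Text embedding: random vectors stand in for learned name embeddings.
embed = {name: rng.standard_normal(d) for name in devices}
x = np.stack([embed[n] for n in devices])          # (N, d) node features

# 2) Graph structure as an attention mask: attend only to connected nodes.
adj = np.array([[1, 1, 0, 1],
                [1, 1, 1, 0],
                [0, 1, 1, 0],
                [1, 0, 0, 1]], dtype=bool)

# 3) One self-attention layer restricted to the graph's edges.
scores = x @ x.T / np.sqrt(d)
scores[~adj] = -np.inf                             # mask non-neighbors
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
node_embeddings = weights @ x                      # (N, d) spatial embedding
print(node_embeddings.shape)
```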

TransSizer: A Novel Transformer-Based Fast Gate Sizer

Gate sizing is a fundamental netlist optimization move, and researchers have applied supervised learning-based models to build gate sizers. More recently, Reinforcement Learning (RL) has been tried for gate sizing (and other EDA optimization problems), but such approaches are very runtime-intensive. In this work, we explore a novel Transformer-based gate sizer, TransSizer, that directly generates optimized gate sizes for a placed but unoptimized netlist. TransSizer is trained on datasets obtained from real tapeout-quality industrial designs in a foundry 5nm technology node.
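
The inference step can be pictured roughly as follows (a hedged sketch with made-up feature and label dimensions, not the published architecture): gates in the placed netlist become tokens, a Transformer encoder contextualizes them, and a classification head emits one library size choice per gate.

```python
import torch
import torch.nn as nn

# Rough sketch of a Transformer-based gate sizer (hypothetical feature and
# label definitions; the published TransSizer architecture differs).
NUM_SIZES = 8    # library size choices per gate (assumed)
FEAT_DIM = 10    # per-gate features: slew, load, slack, position... (assumed)

class ToySizer(nn.Module):
    def __init__(self, d_model=64, nhead=4, layers=2):
        super().__init__()
        self.proj = nn.Linear(FEAT_DIM, d_model)
        enc = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)
        self.head = nn.Linear(d_model, NUM_SIZES)

    def forward(self, gates):           # gates: (batch, num_gates, FEAT_DIM)
        h = self.encoder(self.proj(gates))
        return self.head(h)             # (batch, num_gates, NUM_SIZES) logits

netlist = torch.randn(1, 500, FEAT_DIM)        # one placed design
sizes = ToySizer()(netlist).argmax(dim=-1)     # one size index per gate
print(sizes.shape)                             # torch.Size([1, 500])
```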