Vinu Joseph

Working on Privacy Preserving Deep Learning using:

Amrita Mazumdar

Amrita Mazumdar joined NVIDIA Research in 2021. Her research interests are at the intersection of computer systems and computer graphics.

She received her PhD (2020) and MS (2017) from the University of Washington, and her BS (2014) from Columbia University. Previously, she founded Vignette AI, where she worked on perception-aware video compression and storage.

Karen Leung

I am a research scientist working with NVIDIA's Autonomous Vehicle Research Group. My research interests include safe and interaction-aware planning and control for autonomous vehicles, and developing structured and interpretable deep learning models grounded by logic.

Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers

Transformers have transformed the field of natural language processing. Their performance is largely attributed to stacked self-attention layers, each of which consists of matrix multiplies as well as softmax operations. As a result, unlike in other neural networks, the softmax operation accounts for a significant fraction of the total run-time of Transformers. To address this, we propose Softermax, a hardware-friendly softmax design. Softermax consists of base replacement, low-precision softmax computations, and an online normalization calculation.
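
As a rough illustration of two of the ingredients named above, the following Python sketch combines a base replacement (assumed here to mean using base 2 in place of e) with an online normalization that tracks a running maximum and a running sum in a single pass over the scores. It is a floating-point sketch under those assumptions, not the Softermax hardware design: the low-precision arithmetic is left out, and the function name `online_softmax_base2` is hypothetical.

```python
def online_softmax_base2(scores):
    """Base-2 softmax with a running max and normalizer (illustrative sketch)."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in scores:
        new_max = max(running_max, x)
        # Rescale the sum accumulated so far to the new maximum.
        # On the first element there is nothing to rescale yet.
        if running_max == float("-inf"):
            scale = 0.0
        else:
            scale = 2.0 ** (running_max - new_max)
        running_sum = running_sum * scale + 2.0 ** (x - new_max)
        running_max = new_max
    # Second pass: normalize each term with the final max and sum.
    return [2.0 ** (x - running_max) / running_sum for x in scores]


if __name__ == "__main__":
    print(online_softmax_base2([1.0, 2.0, 3.0]))
```

Because the running sum is rescaled whenever a larger score appears, the maximum does not have to be known before accumulation starts, which is the point of computing the normalizer online. With base 2, the scores 1, 2, 3 get weights proportional to 1, 2, 4.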

Simba: Scaling Deep-Learning Inference with Chiplet-Based Architecture

Package-level integration using multi-chip-modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically contain only a handful of coarse-grained large chiplets due to the high area, performance, and energy overheads associated with inter-chiplet communication.

Verifying High-Level Latency-Insensitive Designs with Formal Model Checking

Latency-insensitive design mitigates increasing interconnect delay and enables productive component reuse in complex digital systems. This design style has been adopted in high-level design flows because untimed functional blocks connected through latency-insensitive interfaces provide a natural communication abstraction. However, latency-insensitive design with high-level languages also introduces a unique set of verification challenges that jeopardize functional correctness.