Global Context Vision Transformers

We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision. Our method leverages global context self-attention modules, jointly with standard local self-attention, to effectively and efficiently model both long- and short-range spatial interactions, without the need for expensive operations such as computing attention masks or shifting local windows. In addition, we address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.

FasterViT: Fast Vision Transformers with Hierarchical Attention

We design a new family of hybrid CNN-ViT neural networks, named FasterViT, with a focus on high image throughput for computer vision (CV) applications. FasterViT combines the benefits of fast local representation learning in CNNs and global modeling properties in ViT. Our newly introduced Hierarchical Attention (HAT) approach decomposes global self-attention with quadratic complexity into a multi-level attention with reduced computational costs. We benefit from efficient window-based self-attention.
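To give a rough sense of why splitting global attention into levels reduces cost, the sketch below counts query-key pairs for full self-attention and for a simplified two-level scheme. The two-level scheme is an assumption made for illustration, not the paper's exact HAT formulation: each window attends locally over its own tokens plus a few summary tokens, and the summary tokens of all windows then attend to each other. The function names and the `summaries_per_window` parameter are hypothetical.

```python
def full_attention_pairs(num_tokens: int) -> int:
    """Query-key pairs when every token attends to every token."""
    return num_tokens * num_tokens


def two_level_attention_pairs(
    num_tokens: int, window_size: int, summaries_per_window: int = 1
) -> int:
    """Query-key pairs for a simplified two-level attention (illustrative only).

    Level 1: each window attends over its tokens plus its summary tokens.
    Level 2: all summary tokens attend to each other across windows.
    """
    if num_tokens % window_size != 0:
        raise ValueError("num_tokens must be divisible by window_size")
    num_windows = num_tokens // window_size
    local_len = window_size + summaries_per_window
    local_pairs = num_windows * local_len * local_len
    num_summaries = num_windows * summaries_per_window
    global_pairs = num_summaries * num_summaries
    return local_pairs + global_pairs


if __name__ == "__main__":
    tokens = 56 * 56   # a 56x56 feature map, assumed for illustration
    window = 7 * 7     # 7x7 windows, assumed for illustration
    print(full_attention_pairs(tokens))               # 9834496
    print(two_level_attention_pairs(tokens, window))  # 164096
```

Under these assumed sizes, the two-level count is roughly 60 times smaller than the full count, because the quadratic term applies only within a window and across the much smaller set of summary tokens.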

CircuitOps: An ML Infrastructure Enabling Generative AI for VLSI Circuit Optimization

We develop CircuitOps, an ML infrastructure that streamlines dataset generation and model inference for various generative AI (GAI)-based circuit optimization tasks. To address the absence of a shared Intermediate Representation (IR), steep EDA learning curves, and AI-unfriendly data structures, we propose solutions that enable efficient data handling.

Large Language Models are Efficient Learners of Noise-Robust Speech Recognition

Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which leverages the rich linguistic knowledge and powerful reasoning ability of LLMs to improve recognition results. The latest work proposes a GER benchmark built on the HyPoradise dataset, which learns the mapping from ASR N-best hypotheses to the ground-truth transcription through efficient LLM finetuning. This approach is effective but does not specifically address noise-robust ASR.
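As a minimal sketch of how an N-best list might be presented to an LLM for error correction, the snippet below turns a list of ASR hypotheses into a text prompt. The function name `build_ger_prompt` and the prompt wording are assumptions made for illustration; they are not the template used by HyPoradise or by this paper.

```python
from typing import Sequence


def build_ger_prompt(nbest: Sequence[str]) -> str:
    """Format ASR N-best hypotheses as a correction prompt (hypothetical template)."""
    if not nbest:
        raise ValueError("nbest must contain at least one hypothesis")

    # Normalize whitespace and drop exact duplicates, keeping rank order.
    unique = []
    for hypothesis in nbest:
        cleaned = " ".join(hypothesis.split())
        if cleaned and cleaned not in unique:
            unique.append(cleaned)

    numbered = [f"{rank}. {text}" for rank, text in enumerate(unique, start=1)]
    return (
        "Below are the N-best hypotheses from a speech recognizer.\n"
        + "\n".join(numbered)
        + "\nReport the most likely true transcription:"
    )


if __name__ == "__main__":
    print(build_ger_prompt([
        "turn of the light",
        "turn off the light",
        "turn  of the light",  # duplicate after whitespace normalization
    ]))
```

In a finetuning setup, a prompt like this would be paired with the ground-truth transcription as the target, so the model learns the hypotheses-to-transcription mapping described above.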

A 0.190-pJ/bit 25.2-Gb/s/wire Inverter-Based AC-Coupled Transceiver for Short-Reach Die-to-Die Interfaces in 5-nm CMOS

This paper presents an Inverter-based Short-Reach AC-coupled Toggle (ISR-ACT) transceiver targeted for short-reach die-to-die communication over silicon interposer or similar high-density interconnect. The ISR-ACT’s transmitter sends NRZ data through a small on-chip capacitor into the line. The receiver amplifies the low-swing pulses using a 1st-stage TIA to fully toggle the 2nd-stage output, where positive feedback to the input pad maintains the DC level on the line.

Omer Shapira

Omer Shapira joined NVIDIA Research in 2023. His research interests are in Computer Graphics, Virtual and Extended Reality (XR), Human-Computer Interaction, and Haptics, with a current focus on AI-Mediated Human Interaction.
From 2016, he worked in NVIDIA's Simulation Technology group, leading the Omniverse XR group and contributing to Isaac, Drivesim, and NVIDIA's core XR technology and research.

Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition

HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models

Semantic Segmentation with Generative Models: Semi-Supervised Learning and Strong Out-of-Domain Generalization

Training deep networks with limited labeled data while achieving a strong generalization ability is key in the quest to reduce human annotation efforts. This is the goal of semi-supervised learning, which exploits more widely available unlabeled data to complement small labeled data sets. In this paper, we propose a novel framework for discriminative pixel-level tasks using a generative model of both images and labels.
