FastAdaSP: Multitask-Adapted Efficient Inference for Large Speech Language Model

In this study, we explore efficient inference for multitask Speech Language Models (SpeechLMs) via token reduction. Unlike other modalities such as vision or text, speech has unique temporal dependencies, so efficient inference methods developed for other modalities are not directly applicable. Furthermore, methods for efficient SpeechLM inference on long sequences and sparse signals remain largely unexplored. We therefore propose FastAdaSP, a weighted token merging framework specifically designed for various speech-related tasks to improve the trade-off between efficiency and performance.
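As a rough illustration of what weighted token merging can look like, the sketch below greedily merges adjacent token vectors whose cosine similarity exceeds a threshold, averaging them by per-token weights. This is not the FastAdaSP algorithm itself: the function names, the greedy left-to-right policy, the `threshold` parameter, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_adjacent_tokens(tokens, weights, threshold=0.9):
    """Hypothetical helper: greedily merge each token into the previous
    (possibly already merged) token when their cosine similarity is at
    least `threshold`. Merged tokens are weighted averages, and their
    weights are summed so later merges stay correctly weighted."""
    if not tokens:
        return [], []
    merged_tokens = [list(tokens[0])]
    merged_weights = [weights[0]]
    for tok, w in zip(tokens[1:], weights[1:]):
        last, last_w = merged_tokens[-1], merged_weights[-1]
        if cosine(last, tok) >= threshold:
            total = last_w + w
            merged_tokens[-1] = [
                (last_w * x + w * y) / total for x, y in zip(last, tok)
            ]
            merged_weights[-1] = total
        else:
            merged_tokens.append(list(tok))
            merged_weights.append(w)
    return merged_tokens, merged_weights


# Three frames: the first two are nearly parallel, the third is orthogonal.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
frame_weights = [1.0, 3.0, 1.0]
reduced, reduced_weights = merge_adjacent_tokens(frames, frame_weights)
print(len(frames), "->", len(reduced), "tokens")
```

Restricting merges to adjacent tokens keeps the temporal order of the sequence intact, which is one plausible way to respect the temporal dependencies the abstract mentions.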

Large Étendue 3D Holographic Display with Content-adaptive Dynamic Fourier Modulation

Abstract

Emerging holographic display technology offers unique capabilities for next-generation virtual reality systems. Current holographic near-eye displays, however, only support a small étendue, which results in a direct tradeoff between achievable field of view and eyebox size. Étendue expansion has recently been explored, but existing approaches are either fundamentally limited in the image quality that can be achieved or they require extremely high-speed spatial light modulators.

Sanjay Kariyappa

Sanjay Kariyappa is a Sr. Research Scientist at NVIDIA's security and privacy research team. His research is focused on enabling secure and private agentic/compound AI systems.

Sanjay received his PhD from Georgia Tech in 2022, where he worked on developing attacks/defenses for model stealing and data privacy. In addition to his work on trustworthy AI, he has published in the areas of computer architecture, hardware security and AI accelerators.

For more details about my publications and work experience, please visit my personal website:

Deep Learning Approaches to Grasp Synthesis: A Review

Grasping is the process of picking up an object by applying forces and torques at a set of contacts. Recent advances in deep learning methods have allowed rapid progress in robotic object grasping. In this systematic review, we surveyed the publications over the last decade, with a particular interest in grasping an object using all six degrees of freedom of the end-effector pose.

Fugatto 1 - Foundational Generative Audio Transformer Opus 1

Fugatto is a versatile audio synthesis and transformation model capable of following free-form text instructions with optional audio inputs. While large language models (LLMs) trained with text on a simple next-token prediction objective can learn to infer instructions directly from the data, models trained solely on audio data lack this capacity. This is because audio data does not inherently contain the instructions that were used to generate it. To overcome this challenge, we introduce

Conformer without Convolutions

We analyze the weights of a trained speech-to-text neural network and discover a surprising amount of structure in the temporal convolutions. Based on our observations, we propose to completely remove learnable temporal convolutions and replace them with fixed averaging and shift operations, which have no learnable parameters and open the way for significantly faster implementations. In the state-of-the-art models Conformer, Squeezeformer and FastConformer, this improves WER by 0.12%, 0.62%, and 0.20% respectively, while reducing the computational cost.
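To make the idea of parameter-free temporal operations concrete, here is a minimal sketch of a fixed moving average and a fixed time shift applied to a single feature channel. The abstract does not say how these operations are arranged inside the models, so the function names, the zero padding at sequence boundaries, and the odd `kernel_size` are assumptions for illustration only.

```python
def moving_average(seq, kernel_size=3):
    """Centered moving average over time with zero padding at the edges.
    Assumes an odd kernel_size; there are no learnable parameters."""
    half = kernel_size // 2
    length = len(seq)
    out = []
    for t in range(length):
        total = 0.0
        for k in range(-half, half + 1):
            if 0 <= t + k < length:
                total += seq[t + k]
        out.append(total / kernel_size)
    return out


def shift(seq, offset):
    """Shift a sequence in time by `offset` steps, filling with zeros.
    A positive offset delays the signal; a negative one advances it."""
    length = len(seq)
    return [
        seq[t - offset] if 0 <= t - offset < length else 0.0
        for t in range(length)
    ]


channel = [1.0, 2.0, 3.0, 4.0]
print(moving_average(channel, kernel_size=3))
print(shift(channel, 1))
print(shift(channel, -1))
```

Both functions are equivalent to a depthwise convolution with a fixed kernel (uniform for the average, one-hot for the shift), which is why they can stand in for a learned temporal convolution without adding weights.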

One-Shot Transfer of Long-Horizon Extrinsic Manipulation Through Contact Retargeting

Extrinsic manipulation, the use of environment contacts to achieve manipulation objectives, enables strategies that are otherwise impossible with a parallel jaw gripper. However, orchestrating a long-horizon sequence of contact interactions between the robot, object, and environment is notoriously challenging due to the scene diversity, large action space, and difficult contact dynamics. We observe that most extrinsic manipulations are combinations of short-horizon primitives, each of which depends strongly on initializing from a desirable contact configuration to succeed.

Lars Johannsmeier

I am a research scientist at the Seattle Robotics Lab. I obtained my PhD from the Technical University of Munich under the supervision of Prof. Sami Haddadin. Before joining NVIDIA, I was the head of the AI department at Franka Robotics GmbH, the creator of the most popular research robot worldwide. My research at NVIDIA focuses on two main aspects. First, how to design intelligent robotic systems such that they are deployable in the real world. Second, how to model manipulation such that robots can solve complex tasks with performance and robustness similar to those of humans.

TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models

Diffusion models have opened the path to a wide range of text-based image editing frameworks. However, these typically build on the multi-step nature of the diffusion backwards process, and adapting them to distilled, fast-sampling methods has proven surprisingly challenging. Here, we focus on a popular line of text-based editing frameworks: the "edit-friendly" DDPM-noise inversion approach. We analyze its application to fast sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength.