Deep Learning Approaches to Grasp Synthesis: A Review

Grasping is the process of picking up an object by applying forces and torques at a set of contacts. Recent advances in deep learning have enabled rapid progress in robotic object grasping. In this systematic review, we survey the publications of the last decade, with a particular focus on grasping an object using all six degrees of freedom of the end-effector pose.
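
Concretely, a 6-DoF grasp couples a 3-D position with a 3-D orientation of the gripper. As an illustration only (the field uses many encodings, and this dataclass is mine, not taken from any surveyed paper), one minimal Python representation might look like:

```python
from dataclasses import dataclass

@dataclass
class Grasp6DoF:
    # 3 translational DoF: gripper position (x, y, z) in the scene frame
    position: tuple
    # 3 rotational DoF, stored here as a unit quaternion (w, x, y, z)
    orientation: tuple
    # opening width in metres; assumes a parallel-jaw gripper
    width: float = 0.08
```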

Fugatto 1: Foundational Generative Audio Transformer Opus 1

Fugatto is a versatile audio synthesis and transformation model capable of following free-form text instructions with optional audio inputs. While large language models (LLMs) trained with text on a simple next-token prediction objective can learn to infer instructions directly from the data, models trained solely on audio data lack this capacity. This is because audio data does not inherently contain the instructions that were used to generate it. To overcome this challenge, we introduce

Conformer without Convolutions

We analyze the weights of a trained speech-to-text neural network and discover a surprising amount of structure in the temporal convolutions. Based on our observations, we propose to completely remove the learnable temporal convolutions and replace them with fixed averaging and shift operations, which have no learnable parameters and open the way for significantly faster implementations. In the state-of-the-art Conformer, Squeezeformer, and FastConformer models, this improves WER by 0.12%, 0.62%, and 0.20%, respectively, while reducing the computational cost.
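
To make the idea concrete, here is a minimal sketch assuming PyTorch tensors of shape (batch, channels, time); the function names, the window size k, and the zero-padding choices are my assumptions, not the paper's implementation. It shows fixed, parameter-free averaging and shift operations of the kind that could stand in for a learnable temporal convolution:

```python
import torch
import torch.nn.functional as F

def fixed_average(x: torch.Tensor, k: int = 3) -> torch.Tensor:
    # Parameter-free temporal averaging over a k-frame window.
    # k is assumed odd so the sequence length is preserved.
    return F.avg_pool1d(x, kernel_size=k, stride=1, padding=k // 2)

def fixed_shift(x: torch.Tensor, offset: int = 1) -> torch.Tensor:
    # Parameter-free shift along the time axis, zero-padded at the boundary.
    if offset >= 0:
        return F.pad(x, (offset, 0))[..., : x.shape[-1]]
    return F.pad(x, (0, -offset))[..., -x.shape[-1] :]
```

Because the weights are constants, both operations can be fused or specialized at compile time, which is where a speed-up over a learnable convolution would come from.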

One-Shot Transfer of Long-Horizon Extrinsic Manipulation Through Contact Retargeting

Extrinsic manipulation, the use of environment contacts to achieve manipulation objectives, enables strategies that are otherwise impossible with a parallel-jaw gripper. However, orchestrating a long-horizon sequence of contact interactions between the robot, object, and environment is notoriously challenging due to scene diversity, the large action space, and difficult contact dynamics. We observe that most extrinsic manipulations are combinations of short-horizon primitives, each of which depends strongly on initializing from a desirable contact configuration to succeed.
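
The decomposition into contact-dependent primitives can be sketched in a few lines. This is an illustrative structure under my own assumptions; the Primitive fields and the retarget callback are hypothetical, not the paper's API:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Primitive:
    name: str
    # Does the current contact configuration allow this skill to succeed?
    precondition: Callable[[dict], bool]
    # Runs the short-horizon skill and returns the updated state.
    execute: Callable[[dict], dict]

def run_sequence(state: dict, primitives: List[Primitive], retarget) -> dict:
    for p in primitives:
        if not p.precondition(state):
            # Move into a desirable contact configuration before executing.
            state = retarget(state, p)
        state = p.execute(state)
    return state
```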

Lars Johannsmeier

I am a research scientist at the Seattle Robotics Lab. I obtained my PhD from the Technical University of Munich under the supervision of Prof. Sami Haddadin. Before joining NVIDIA, I was the head of the AI department at Franka Robotics GmbH, the creator of the most popular research robot worldwide. My research at NVIDIA focuses on two main aspects. First, how to design intelligent robotic systems such that they are deployable in the real world. Second, how to model manipulation such that robots can solve complex tasks with performance and robustness similar to that of humans.

TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models

Diffusion models have opened the path to a wide range of text-based image editing frameworks. However, these typically build on the multi-step nature of the backward diffusion process, and adapting them to distilled, fast-sampling methods has proven surprisingly challenging. Here, we focus on a popular line of text-based editing frameworks: the "edit-friendly" DDPM-noise inversion approach. We analyze its application to fast sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength.
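
For context, the "edit-friendly" inversion referenced here extracts, for each timestep, a noise map that makes the DDPM sampling equation reproduce the input image. Below is a minimal sketch of that general recipe; predict_mu (the model's posterior-mean estimate), the schedule tensors alphas_bar and sigmas, and the exact indexing are all assumptions of mine rather than this paper's formulation:

```python
import torch

def edit_friendly_inversion(x0, alphas_bar, sigmas, predict_mu, T):
    """Return noise maps z_1..z_T that reproduce x0 when sampling."""
    xs = [x0]
    for t in range(1, T + 1):
        # Sample x_t independently from q(x_t | x0).
        eps = torch.randn_like(x0)
        xs.append(alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * eps)
    zs = []
    for t in range(T, 0, -1):
        # Solve x_{t-1} = mu_t(x_t) + sigma_t * z_t for z_t.
        zs.append((xs[t - 1] - predict_mu(xs[t], t)) / sigmas[t])
    return list(reversed(zs))
```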

LCM-Lookahead for Encoder-based Text-to-Image Personalization

Recent advancements in diffusion models have introduced fast sampling methods that can effectively produce high-quality images in just one or a few denoising steps. Interestingly, when these are distilled from existing diffusion models, they often maintain alignment with the original model, retaining similar outputs for similar prompts and seeds. These properties present opportunities to leverage fast sampling methods as a shortcut mechanism, using them to create a preview of denoised outputs through which we can backpropagate image-space losses.
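
A hedged sketch of that shortcut idea follows, assuming an epsilon-parameterized distilled model; fast_model, image_loss, and the one-step x0 reconstruction below are my assumptions rather than the paper's exact formulation:

```python
import torch

def lookahead_loss(x_t, t, fast_model, alphas_bar, image_loss, target):
    # One-step denoised preview from the distilled model (eps-parameterized):
    # x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps  =>  solve for x0.
    eps = fast_model(x_t, t)
    x0_preview = (x_t - (1 - alphas_bar[t]).sqrt() * eps) / alphas_bar[t].sqrt()
    # Gradients of the image-space loss flow back through x_t.
    return image_loss(x0_preview, target)
```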

Consolidating Attention Features for Multi-view Image Editing

Large-scale text-to-image models enable a wide range of image editing techniques, using text prompts or even spatial controls. However, applying these editing methods to multi-view images depicting a single scene leads to 3D-inconsistent results. In this work, we focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views.
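
One plausible reading of "consolidating attention features" is extended self-attention, where each view attends to keys and values gathered from all views so edits stay consistent in 3D. The sketch below illustrates that general mechanism under my own assumptions about shapes and aggregation; it is not this paper's method:

```python
import torch
import torch.nn.functional as F

def multi_view_attention(q, ks, vs):
    """q: (tokens, dim) for one view; ks, vs: per-view lists of (tokens, dim)."""
    k = torch.cat(ks, dim=0)  # consolidate keys across all views
    v = torch.cat(vs, dim=0)  # consolidate values across all views
    attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```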

Enze Xie

Enze Xie is a Senior Research Scientist at NVIDIA Research. Previously, he was a Principal Researcher and Research Lead at Huawei Noah's Ark Lab (Hong Kong). He obtained his PhD from HKU MMLab in 2022. His current research focuses mainly on multimodal generation, understanding, and acceleration.