Jindong Jiang

Jindong is a research scientist in the Learning and Perception Research (LPR) team at NVIDIA Research. Prior to joining NVIDIA, Jindong was a PhD student at Rutgers University under the supervision of Prof. Sungjin Ahn. His research interests lie at the intersection of representation learning and visual reasoning, with a particular focus on developing novel architectures that improve agents' visual reasoning capabilities.

Zhongzhi Yu

Zhongzhi Yu received his Ph.D. in Computer Science from Georgia Tech in 2025, advised by Dr. Yingyan (Celine) Lin. He holds an M.S. from Columbia University and a B.Eng. from Zhejiang University. His research focuses on two primary areas: (1) developing adaptation techniques to enable hardware-aware AI design and deployment, with recent work in RTL coding and broader interests in hardware design automation; and (2) creating efficient methods to bring advanced AI capabilities to everyday devices, with experience in large language models, vision-language models, and vision transformers.

Factory: Fast Contact for Robotic Assembly

Robotic assembly is one of the oldest and most challenging applications of robotics. In other areas of robotics, such as perception and grasping, simulation has rapidly accelerated research progress, particularly when combined with modern deep learning. However, accurately, efficiently, and robustly simulating the range of contact-rich interactions in assembly remains a longstanding challenge. In this work, we present Factory, a set of physics simulation methods and robot learning tools for such applications.

Multi-student Diffusion Distillation for Better One-step Generators

Diffusion models achieve high-quality sample generation at the cost of a lengthy multistep inference procedure. To overcome this, diffusion distillation techniques produce student generators capable of matching or surpassing the teacher in a single step. However, the student model’s inference speed is limited by the size of the teacher architecture, preventing real-time generation for computationally heavy applications. In this work, we introduce Multi-Student Distillation (MSD), a framework to distill a conditional teacher diffusion model into multiple single-step generators.
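To make the framework concrete, below is a minimal, hypothetical sketch of one way such a distillation could be set up: partition the conditioning space, train one small single-step student per partition against teacher outputs, and route each condition to its student. Every name here (Student, assign_partition, the squared-error stand-in for the distillation loss) is illustrative, not the paper's API.

    # Hypothetical sketch of multi-student distillation: K small one-step
    # generators, each responsible for a slice of the condition space.
    import torch
    import torch.nn as nn

    K, DIM = 4, 16  # number of students; toy condition/sample dimension

    def assign_partition(cond: torch.Tensor) -> int:
        """Toy routing rule: bucket the conditioning vector into one of K slices."""
        return int(cond.abs().sum().item()) % K

    class Student(nn.Module):
        """A small one-step generator: (noise, condition) -> sample."""
        def __init__(self, dim: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
            )

        def forward(self, z: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
            return self.net(torch.cat([z, cond], dim=-1))

    students = [Student(DIM) for _ in range(K)]

    def distill_step(teacher_sample: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        """One training step: only the routed student learns from this condition."""
        k = assign_partition(cond)
        z = torch.randn_like(cond)
        pred = students[k](z, cond)
        loss = (pred - teacher_sample).pow(2).mean()  # stand-in distillation loss
        loss.backward()
        return loss

    # Toy usage: one step against a fake teacher sample.
    loss = distill_step(torch.randn(DIM), torch.randn(DIM))

Because each student only has to match the teacher on its own slice of conditions, it can use a smaller architecture than the teacher, which is what lifts the speed ceiling the abstract describes.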

Score Distillation Sampling for Audio: Source Separation, Synthesis, and Beyond

We introduce Audio-SDS, a generalization of Score Distillation Sampling (SDS) to text-conditioned audio diffusion models. While SDS was initially designed for text-to-3D generation using image diffusion, its core idea of distilling a powerful generative prior into a separate parametric representation extends to the audio domain. Leveraging a single pretrained model, Audio-SDS enables a broad range of tasks without requiring specialized datasets.
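For background, the SDS gradient that Audio-SDS builds on (restated here from the original DreamFusion formulation in our own notation, so the symbols below are not necessarily the paper's) optimizes a parametric representation x = g(theta) by backpropagating the pretrained model's denoising residual while dropping the diffusion network's Jacobian:

    \nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
        = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\big(\hat{\epsilon}_\phi(x_t;\, y, t) - \epsilon\big)\, \frac{\partial x}{\partial \theta} \right],
    \qquad x_t = \alpha_t\, x + \sigma_t\, \epsilon

Here y is the text condition, w(t) a timestep weighting, and \hat{\epsilon}_\phi the pretrained denoiser. On our reading of the abstract, for Audio-SDS g(theta) would be a parametric audio representation (for example, synthesizer parameters, or per-source signals in separation) and \hat{\epsilon}_\phi a pretrained text-conditioned audio diffusion model.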

LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models

This work explores expanding the capabilities of large language models (LLMs) pretrained on text to generate 3D meshes within a unified model. This offers two key advantages: (1) leveraging spatial knowledge already embedded in LLMs, derived from textual sources such as 3D tutorials, and (2) enabling conversational 3D generation and mesh understanding. A primary challenge is effectively tokenizing 3D mesh data into discrete tokens that LLMs can process seamlessly.
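As one hedged illustration of what such a tokenization can look like, the sketch below serializes a mesh as OBJ-style plain text with vertex coordinates quantized to integer bins, so an LLM's existing text tokenizer consumes it directly. The 64-bin count, coordinate range, and helper names are our assumptions for the example, not the paper's specification.

    # Illustrative (not the paper's code): mesh -> OBJ-style text with
    # vertex coordinates quantized to integer bins.
    def quantize(value: float, lo: float = -1.0, hi: float = 1.0, bins: int = 64) -> int:
        """Clamp a coordinate to [lo, hi] and map it to a bin in [0, bins - 1]."""
        value = min(max(value, lo), hi)
        return min(int((value - lo) / (hi - lo) * bins), bins - 1)

    def mesh_to_text(vertices, faces) -> str:
        """Serialize as OBJ-like lines: 'v x y z' for vertices, 'f i j k' for faces."""
        lines = [f"v {quantize(x)} {quantize(y)} {quantize(z)}" for x, y, z in vertices]
        lines += [f"f {i} {j} {k}" for i, j, k in faces]  # faces are 1-indexed, as in OBJ
        return "\n".join(lines)

    # A unit triangle becomes a short string the LLM tokenizes like any other text.
    print(mesh_to_text([(0, 0, 0), (1, 0, 0), (0, 1, 0)], [(1, 2, 3)]))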