SLIM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression

Conventional model compression techniques for LLMs address the challenges of high memory consumption and slow inference, but they typically require computationally expensive retraining to preserve accuracy. One-shot compression methods, in contrast, eliminate retraining costs but struggle to match the accuracy of dense models. This paper presents SLIM, a new one-shot compression framework that holistically integrates hardware-friendly quantization, sparsity, and low-rank approximation into a unified process.
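To make the composition concrete, here is a minimal sketch of one plausible way to combine the three ingredients the abstract names: prune, quantize, then absorb the compression error with a low-rank SVD correction. This is not SLIM's actual algorithm; the function names, 4-bit symmetric quantization, 50% unstructured sparsity, and rank choice are all illustrative assumptions.

```python
# Hedged sketch: prune + quantize a weight matrix in one shot, then fit a
# low-rank term to the residual error. NOT the SLIM algorithm itself.
import numpy as np

def compress_layer(W: np.ndarray, sparsity: float = 0.5, bits: int = 4, rank: int = 16):
    """Return (W_q, scale, U_r, V_r) such that W ~= W_q * scale + U_r @ V_r."""
    # 1) One-shot magnitude pruning: zero out the smallest-magnitude weights.
    thresh = np.quantile(np.abs(W), sparsity)
    W_sparse = np.where(np.abs(W) >= thresh, W, 0.0)

    # 2) Hardware-friendly symmetric quantization of the surviving weights.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W_sparse).max() / qmax
    W_q = np.clip(np.round(W_sparse / scale), -qmax - 1, qmax)

    # 3) Low-rank approximation of the residual via truncated SVD, so the
    #    correction term absorbs both pruning and quantization damage.
    residual = W - W_q * scale
    U, S, Vt = np.linalg.svd(residual, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]
    V_r = Vt[:rank, :]
    return W_q.astype(np.int8), scale, U_r, V_r

def reconstruct(W_q, scale, U_r, V_r):
    return W_q.astype(np.float32) * scale + U_r @ V_r
```

One appeal of this shape of decomposition is that the low-rank term adds only rank * (m + n) parameters to an m x n layer, so it can recover accuracy at a small storage and compute cost.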

Beyond the Buzz: A Pragmatic Take on Inference Disaggregation

As inference scales to multi-node deployments, disaggregation (splitting inference into distinct phases, such as compute-bound prefill and memory-bandwidth-bound decode) offers a promising path to improving the throughput-interactivity Pareto frontier. Despite growing enthusiasm and a surge of open-source efforts, practical deployment of disaggregated serving remains limited because of the complexity of the optimization search space and the system-level coordination it requires. In this paper, we present the first systematic study of disaggregated inference at scale, evaluating hundreds of thousands of design points across diverse workloads and hardware configurations.
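As a toy illustration of what "splitting inference into distinct phases" means, the sketch below runs prefill and decode as separate stages that hand off a KV cache. The `model.prefill` / `model.decode` interface and the in-process queue are hypothetical stand-ins; in a real disaggregated deployment the KV cache is transferred between workers on different nodes.

```python
# Hedged toy sketch of prefill/decode disaggregation, not any system's real API.
from dataclasses import dataclass
from queue import Queue
from typing import Any

@dataclass
class PrefillResult:
    kv_cache: Any      # handed off between phases; crosses nodes in practice
    first_token: int

def prefill_stage(model, prompt_tokens, handoff: Queue) -> None:
    """Compute-bound phase: one large forward pass over the whole prompt."""
    kv_cache, first_token = model.prefill(prompt_tokens)  # hypothetical API
    handoff.put(PrefillResult(kv_cache, first_token))

def decode_stage(model, handoff: Queue, max_new_tokens: int = 128) -> list:
    """Memory-bandwidth-bound phase: one token per step, reusing the cache."""
    res = handoff.get()
    tokens, kv_cache = [res.first_token], res.kv_cache
    for _ in range(max_new_tokens):
        next_token, kv_cache = model.decode(tokens[-1], kv_cache)  # hypothetical API
        tokens.append(next_token)
    return tokens
```

Because the two phases have different bottlenecks, running them on separately sized worker pools is what opens up the throughput-interactivity trade-offs the paper studies.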

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Long-context capability is critical for multi-modal foundation models, especially for long-video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models that co-designs the algorithm and the system. For model training, we upgrade existing VLMs to support long video understanding by incorporating two additional stages, i.e., long-context extension and long-video supervised fine-tuning. However, training on long videos is compute- and memory-intensive.
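For intuition on the long-context extension stage, here is a minimal sketch of one common technique for stretching a transformer's context window: rescaling the RoPE base frequency so distant positions rotate more slowly. This is a generic recipe, not necessarily LongVILA's; the base and scale values are illustrative assumptions.

```python
# Hedged sketch of RoPE base rescaling for long-context extension.
import torch

def rope_frequencies(head_dim: int, base: float = 10000.0, scale: float = 8.0):
    """Inverse frequencies for RoPE. Enlarging the base (base * scale) slows
    the per-position rotation, so positions far beyond the original training
    length stay within phase ranges the model has already seen."""
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    return 1.0 / (base * scale) ** exponents

def apply_rope(x: torch.Tensor, positions: torch.Tensor, inv_freq: torch.Tensor):
    """x: (..., seq, head_dim). Rotate channel pairs by position-dependent angles."""
    angles = positions[:, None].float() * inv_freq[None, :]   # (seq, head_dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```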

Yujun Lin

Yujun Lin is a research scientist at NVIDIA. He completed his PhD at MIT, advised by Prof. Song Han. His research area is efficient deep learning, with a special focus on the co-design of algorithms, systems, and hardware for foundation models (diffusion models, LLMs, etc.). His work has been featured in oral and spotlight presentations at conferences such as ICLR, NeurIPS, MICRO, HPCA, and MLSys.

Yonggan Fu

Yonggan Fu obtained his PhD from the Georgia Institute of Technology in May 2025. Prior to that, he received his Bachelor's degree with a dual major in Applied Physics and Computer Science from the School of the Gifted Young at the University of Science and Technology of China in 2019. He is a recipient of the IBM PhD Fellowship and was selected as a 2023 Machine Learning and Systems Rising Star.

Yukang Chen

Hello! I am a Research Scientist at NVIDIA Research, working with Prof. Song Han. I received my Ph.D. from CUHK. My research focuses on LongAI, that is, boosting AI's long-context abilities while staying efficient. My representative works include VoxelNeXt, LongLoRA, and LongVILA.

Baptiste Nicolet

Baptiste Nicolet is a Senior Research Scientist in the Graphics, Communications, and Machine Learning team at NVIDIA Research. He obtained a Ph.D. in Computer Science from EPFL in 2025, focusing on inverse light transport simulation. In his free time, Baptiste likes to climb, swim, and tinker with 3D printers.

Yuyang Zhao

Dr. Yuyang Zhao is a Research Scientist at NVIDIA Research, working with Prof. Song Han. He obtained his Ph.D. from the National University of Singapore, advised by Assoc. Prof. Gim Hee Lee. His research interests mainly lie in image, video, and 3D generation.