| Research

Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding

As LLMs scale to multi-million-token KV histories, real-time autoregressive decoding under tight Token-to-Token Latency (TTL) constraints faces growing pressure. Two core bottlenecks dominate: accessing Feed-Forward Network (FFN) weights and reading long KV caches. While Tensor Parallelism (TP) helps mitigate the cost of FFN weight reads, it does not scale well for attention. When TP width exceeds the number of KV heads, it leads to inefficient KV duplication, limits parallelism, and constrains batch size.

Read more about Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding

Bertrand Douillard

Bertrand has focused on AI for robotics since his Ph.D. in the field. He brings experience from engineering and research roles at JPL, Zoox, Toyota Research Institute, and Waymo. His hands-on work has spanned the full range of robotics systems, from classical to end-to-end learned stacks, including perception, planning, controls, and the offline ML pipelines that support them. As part of his transition to NVIDIA Research, his current focus is on end-to-end autonomous vehicle models built on Recurrent State Space Models (RSSMs) and refined with Reinforcement Fine Tuning.

Read more about Bertrand Douillard

Thomas Tian

Read more about Thomas Tian

GauRast: Enhancing GPU Triangle Rasterizers to Accelerate 3D Gaussian Splatting

3D intelligence leverages rich 3D features and stands as a promising frontier in AI, with 3D rendering fundamental to many downstream applications. 3D Gaussian Splatting (3DGS), an emerging high-quality 3D rendering method, requires significant computation, making real-time execution on existing GPU-equipped edge devices infeasible. Previous efforts to accelerate 3DGS rely on dedicated accelerators that require substantial integration overhead and hardware costs.

Read more about GauRast: Enhancing GPU Triangle Rasterizers to Accelerate 3D Gaussian Splatting

GEM: GPU-Accelerated Emulator-Inspired RTL Simulation

We present a GPU-accelerated RTL simulator addressing critical challenges in high-speed circuit verification.Traditional CPU-based RTL simulators struggle with scalability and performance, and while FPGA-based emulators offer acceleration, they are costly and less accessible. Previous GPU-based attempts have failed to speed up RTL simulation due to the heterogeneous nature of circuit partitions, which conflicts with the SIMT (Single Instruction, Multiple Thread) paradigm of GPUs.

Read more about GEM: GPU-Accelerated Emulator-Inspired RTL Simulation

Zeyuan Hu

Read more about Zeyuan Hu

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

Read more about VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

Adaptive Algebraic Reuse of Reordering in Cholesky Factorizations with Dynamic Sparsity Patterns

We introduce Parth, a fill-reducing ordering method for sparse Cholesky solvers with dynamic sparsity patterns (e.g., in physics simulations with contact or geometry processing with local remeshing). Parth facilitates the selective reuse of fill-reducing orderings when sparsity patterns exhibit temporal coherence, avoiding full symbolic analysis by localizing the effect of dynamic sparsity changes on the ordering vector.

Read more about Adaptive Algebraic Reuse of Reordering in Cholesky Factorizations with Dynamic Sparsity Patterns

SLIM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression

Conventional model compression techniques for LLMs address high memory consumption and slow inference challenges but typically require computationally expensive retraining to preserve accuracy. In contrast, one-shot compression methods eliminate retraining costs, but struggle to achieve accuracy comparable to dense models. This paper presents SLIM, a new one-shot compression framework that holistically integrates hardware-friendly quantization, sparsity, and low-rank approximation into a unified process.

Read more about SLIM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression

Beyond the Buzz: A Pragmatic Take on Inference Disaggregation

As inference scales to multi-node deployments, disaggregation—splitting inference into distinct phases—offers a promising path to improving the throughput-interactivity Pareto frontier. Despite growing enthusiasm and a surge of open-source efforts, practical deployment of disaggregated serving remains limited due to the complexity of the optimization search space and system-level coordination. In this paper, we present the first systematic study of disaggregated inference at scale, evaluating hundreds of thousands of design points across diverse workloads and hardware configurations.

Read more about Beyond the Buzz: A Pragmatic Take on Inference Disaggregation

Subscribe to