Learning to Compare Hardware Designs for High-Level Synthesis

High-level synthesis (HLS) is an automated design process that transforms high-level code into optimized hardware designs, enabling the rapid development of efficient hardware accelerators for applications such as image processing, machine learning, and signal processing. To achieve optimal performance, HLS tools rely on pragmas: directives inserted into the source code to guide the synthesis process. These pragmas take various settings and values that significantly impact the resulting hardware design.

AssertionForge: Enhancing Formal Verification Assertion Generation with Structured Representation of Specifications and RTL

Generating SystemVerilog Assertions (SVAs) from natural language specifications remains a major challenge in formal verification (FV) due to the inherent ambiguity and incompleteness of specifications. Existing LLM-based approaches, such as ASSERTLLM, focus on extracting information solely from specification documents, often failing to capture essential internal signal interactions and design details present in the RTL code, leading to incomplete or incorrect assertions.

Sanja Fidler

Sanja Fidler is vice president of AI research at NVIDIA, leading the company’s Spatial Intelligence Lab in Toronto. She is also an associate professor at the University of Toronto, and an affiliate faculty member at the Vector Institute, which she co-founded. Previously, she was a research assistant professor at the Toyota Technological Institute at Chicago, a philanthropically endowed academic institute located on the University of Chicago campus.

GRS: Generating robotic simulation tasks from real-world images

Game design hinges on understanding how static rules and content translate into dynamic player behavior---something modern generative systems that inspect only a game's code or assets struggle to capture. We present an automated design iteration framework that closes this gap by pairing a reinforcement learning (RL) agent, which playtests the game, with a large multimodal model (LMM), which revises the game based on what the agent does.

Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding

As LLMs scale to multi-million-token KV histories, real-time autoregressive decoding under tight Token-to-Token Latency (TTL) constraints faces growing pressure. Two core bottlenecks dominate: accessing Feed-Forward Network (FFN) weights and reading long KV caches. While Tensor Parallelism (TP) helps mitigate the cost of FFN weight reads, it does not scale well for attention. When TP width exceeds the number of KV heads, it leads to inefficient KV duplication, limits parallelism, and constrains batch size.

Bertrand Douillard

Bertrand has focused on AI for robotics since his Ph.D. in the field. He brings experience from engineering and research roles at JPL, Zoox, Toyota Research Institute, and Waymo. His hands-on work has spanned the full range of robotics systems, from classical to end-to-end learned stacks, including perception, planning, controls, and the offline ML pipelines that support them. As part of his transition to NVIDIA Research, his current focus is on end-to-end autonomous vehicle models built on Recurrent State Space Models (RSSMs) and refined with Reinforcement Fine Tuning.

GauRast: Enhancing GPU Triangle Rasterizers to Accelerate 3D Gaussian Splatting

3D intelligence leverages rich 3D features and stands as a promising frontier in AI, with 3D rendering fundamental to many downstream applications. 3D Gaussian Splatting (3DGS), an emerging high-quality 3D rendering method, requires significant computation, making real-time execution on existing GPU-equipped edge devices infeasible. Previous efforts to accelerate 3DGS rely on dedicated accelerators that require substantial integration overhead and hardware costs.

GEM: GPU-Accelerated Emulator-Inspired RTL Simulation

We present a GPU-accelerated RTL simulator addressing critical challenges in high-speed circuit verification. Traditional CPU-based RTL simulators struggle with scalability and performance, and while FPGA-based emulators offer acceleration, they are costly and less accessible. Previous GPU-based attempts have failed to speed up RTL simulation due to the heterogeneous nature of circuit partitions, which conflicts with the SIMT (Single Instruction, Multiple Thread) paradigm of GPUs.