SPARR - NVIDIA Research, Seattle Robotics Labs

ICRA 2026

SPARR: Simulation-based Policies with Asymmetric Real-world Residuals for Assembly

Yijie Guo, Iretiayo Akinola, Lars Johannsmeier, Hugo Hadfield, Abhishek Gupta, and Yashraj Narang

Project Overview

The following video explains the pipeline of our real-robot reinforcement learning (RL) method.

SPARR is a hybrid framework that pre-trains a base policy on low-dimensional state observations in simulation and then learns a residual policy on visual observations in the real world. The base policy provides successful demonstrations, a structured prior, and safe early exploration, while the residual policy corrects for discrepancies in physical properties, state estimation errors, and visual or environmental differences. This asymmetric design enables efficient adaptation to real-world environments without reliance on human supervision.
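The asymmetric split between the two policies can be illustrated with a minimal sketch. It assumes the common additive form of residual learning, where a scaled correction is added to the base action and the sum is clipped. The function name `compose_action` and the parameters `residual_scale` and `action_limit` are hypothetical and not taken from the SPARR implementation, which may combine the two policies differently.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def compose_action(
    base_policy: Callable[[Sequence[float]], Vector],
    residual_policy: Callable[[object], Vector],
    state: Sequence[float],
    image: object,
    residual_scale: float = 0.1,  # assumed: limits how far the residual can shift the action
    action_limit: float = 1.0,    # assumed: symmetric per-dimension action bound
) -> Vector:
    """Combine a state-based base action with an image-based residual correction."""
    # The base policy (trained in simulation) sees only low-dimensional state.
    base_action = base_policy(state)
    # The residual policy (trained in the real world) sees only the visual observation.
    residual = residual_policy(image)
    if len(base_action) != len(residual):
        raise ValueError("base action and residual must have the same dimension")
    # Additive residual (an assumption), followed by clipping to the action bounds.
    combined = [b + residual_scale * r for b, r in zip(base_action, residual)]
    return [max(-action_limit, min(action_limit, a)) for a in combined]


# Toy stand-ins for the two policies, for illustration only.
def toy_base(state: Sequence[float]) -> Vector:
    return [0.5 * s for s in state]


def toy_residual(image: object) -> Vector:
    return [1.0, -1.0, 0.0]


action = compose_action(toy_base, toy_residual, state=[0.2, 0.4, 2.4], image=None)
```

Keeping the residual small relative to the base action is one way to preserve the safe early exploration that the base policy provides: at the start of real-world training the combined behavior stays close to the simulation policy.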

Abstract

Robotic assembly presents a long-standing challenge due to its requirement for precise, contact-rich manipulation. While simulation-based learning has enabled the development of robust assembly policies, their performance often degrades when deployed in real-world settings due to the sim-to-real gap. Conversely, real-world reinforcement learning (RL) methods avoid the sim-to-real gap, but rely heavily on human supervision and generalize poorly to environmental changes. In this work, we propose a hybrid approach that combines a simulation-trained base policy with a real-world residual policy to efficiently adapt to real-world variations. The base policy, trained in simulation using low-level state observations and dense rewards, provides strong priors for initial behavior. The residual policy, learned in the real world using visual observations and sparse rewards, compensates for discrepancies in dynamics and sensor noise. Extensive real-world experiments demonstrate that our method, SPARR, achieves near-perfect success rates across diverse two-part assembly tasks. Compared to state-of-the-art zero-shot sim-to-real methods, SPARR improves success rates by 38.4% while reducing cycle time by 29.7%. Moreover, SPARR requires no human expertise, in contrast to state-of-the-art real-world RL approaches that depend heavily on human supervision.

Illustration of our approach, SPARR. (a) A specialist policy is pre-trained in simulation. (b) The simulation policy is deployed zero-shot in the real world, achieving a moderate success rate (e.g., up to 80%). Successful trajectories are collected as demonstrations. (c) A residual policy is trained in the real world on top of the simulation policy, leveraging both the demonstration buffer and the online RL buffer. During training, high-quality trajectories that achieve success quickly are added to the demonstration buffer for further exploitation.
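The two-buffer scheme in steps (b) and (c) can be sketched as follows. This is a minimal illustration, not the SPARR implementation: the class names `Trajectory` and `DualBuffer`, the step-count threshold `fast_success_steps` used to decide what counts as "success achieved quickly", and the 50/50 sampling split are all assumptions.

```python
import random
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Trajectory:
    transitions: List[Any]  # e.g. (obs, action, reward, next_obs, done) tuples
    success: bool
    steps: int              # episode length, used here as a proxy for cycle time


@dataclass
class DualBuffer:
    fast_success_steps: int  # assumed threshold for promoting a trajectory
    demo: List[Any] = field(default_factory=list)
    online: List[Any] = field(default_factory=list)

    def add_demo(self, traj: Trajectory) -> None:
        """Step (b): keep only successful rollouts of the zero-shot base policy."""
        if traj.success:
            self.demo.extend(traj.transitions)

    def add_online(self, traj: Trajectory) -> None:
        """Step (c): store every online rollout; promote fast successes to demos."""
        self.online.extend(traj.transitions)
        if traj.success and traj.steps <= self.fast_success_steps:
            self.demo.extend(traj.transitions)

    def sample(self, batch_size: int, demo_fraction: float = 0.5,
               rng: Optional[random.Random] = None) -> List[Any]:
        """Draw a batch mixing demonstration and online transitions."""
        rng = rng or random.Random()
        if not self.demo and not self.online:
            raise ValueError("both buffers are empty")
        if not self.online:
            n_demo = batch_size      # only demonstrations are available
        elif not self.demo:
            n_demo = 0               # only online data is available
        else:
            n_demo = int(batch_size * demo_fraction)
        batch = rng.choices(self.demo, k=n_demo) if n_demo else []
        n_online = batch_size - n_demo
        if n_online:
            batch += rng.choices(self.online, k=n_online)
        return batch
```

A fixed mixing ratio keeps successful transitions present in every update even when online rewards are sparse, which is the usual motivation for maintaining a separate demonstration buffer.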

Experiments

We investigate 10 real-world robotic assembly tasks from the AutoMate dataset, with base policies pre-trained in Isaac Lab. We select 10 out of 100 tasks, each achieving over 99% success in simulation. We choose them for their diverse geometries and strong simulation performance, which we expect to carry over to the real world. We compare three approaches, SERL, AutoMate, and SPARR, all of which deploy simulation policies in the real world without human demonstrations.

SERL performs poorly because exploration is hard in the sparse-reward setting. With only 20 demonstrations and 0.5 hours of real-world training, it struggles to discover reasonable behaviors and collect positive rewards, as no human intervention is provided during online training. Although a few successful trials are observed during real-world learning, SERL cannot efficiently learn to reproduce them. AutoMate shows a moderate success rate despite strong simulation performance, which reflects the sim-to-real gap. Compared to AutoMate, SPARR achieves a relative improvement of 38.4% in success rate and a relative reduction of 29.7% in cycle time, highlighting the effectiveness of the residual policy in correcting the actions of the simulation policy. Notably, SPARR attains a 95–100% success rate without any human supervision or intervention.

Performance on 10 AutoMate tasks. We evaluate the success rate (higher is better) and cycle time (lower is better) averaged over 20 episodes. SERL, AutoMate, and SPARR (Ours) transfer simulation-trained policies to the real world without human effort, where SPARR achieves substantially higher success rates and shorter cycle times. HIL-SERL (Oracle) serves as an upper bound, assuming access to near-optimal human demonstrations and continuous human supervision.
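For readers unfamiliar with relative metrics, the sketch below shows one conventional way to compute a success rate over evaluation episodes and a relative improvement over a baseline. It assumes the standard definition (change divided by the baseline value); the exact aggregation used for the reported numbers, for example per-task versus pooled averaging, is not specified here. The function names and the numbers in the usage lines are hypothetical placeholders, not results from the experiments.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of successful episodes (e.g. over 20 evaluation episodes)."""
    if not outcomes:
        raise ValueError("no episodes to evaluate")
    return sum(outcomes) / len(outcomes)


def relative_improvement(baseline: float, ours: float,
                         lower_is_better: bool = False) -> float:
    """Relative change with respect to the baseline, signed so that positive is better."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change = (baseline - ours) if lower_is_better else (ours - baseline)
    return change / baseline


# Placeholder values for illustration only (not measured results).
rate = success_rate([True] * 15 + [False] * 5)                       # 15 of 20 episodes succeed
gain = relative_improvement(baseline=0.50, ours=rate)                # success rate: higher is better
speedup = relative_improvement(20.0, 15.0, lower_is_better=True)     # cycle time: lower is better
```

With these placeholder inputs, a success rate rising from 0.50 to 0.75 is a 50% relative improvement, and a cycle time falling from 20 s to 15 s is a 25% relative reduction.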

We evaluate the generalizability of SPARR on NIST assemblies that were not seen during pre-training on AutoMate tasks. We aim to achieve strong real-world performance by adapting the base policy from relevant prior tasks. When the base policy is deployed, differences in dynamics between simulation and reality degrade its performance. However, the policy is largely unaffected by differences in visual observations because it conditions only on low-dimensional state information. As a result, the base policy still generates some successful demonstrations and serves as a functional prior for real-world training. We then train a residual policy using SPARR to improve real-world success. Overall, SPARR achieves a relative improvement of 74.5% in success rate and a relative reduction of 36.5% in cycle time across these NIST tasks.

Adaptation of simulation policies from AutoMate tasks to NIST tasks. On NIST assembly tasks, SPARR outperforms the baseline in success rate (higher is better) and cycle time (lower is better).