IPA 2026: Interactive Physical AI Workshop at CVPR 2026

Introduction

The 1st International Workshop on Interactive Physical AI (IPA 2026) at CVPR 2026 will bring together researchers from computer vision, robotics, and multimodal AI. It provides the first comprehensive forum to address the full scope of interactive physical AI systems, building on prior workshops that have explored subsets of this space. Workshop topics include (but are not limited to):

Human-AI interaction in physical environments
Embodied conversational AI and multimodal learning
Full-duplex multimodal conversational models
Social intelligence and communication for robots and avatars
Egocentric vision and first-person perception
Real-time audio-visual processing for interactive systems
Safe and cooperative human-robot interaction
Personalization and lifelong learning for physical AI
Privacy-aware learning in interactive settings
Physically authentic perception and generation for avatars and agents

We will host invited speakers and accept submissions of full, unpublished papers. Submissions will be peer-reviewed through a double-blind process; accepted papers will be published in the official CVPR 2026 workshop proceedings and presented at the workshop.

What is Interactive Physical AI?

Advances in multimodal learning, embodied intelligence, and conversational AI are transforming how humans interact with AI systems situated alongside us in the physical world. We define such systems as Interactive Physical AI (IPA). IPA systems simultaneously:

Perceive humans and scenes using audio-visual signals
Generate communication signals via verbal and nonverbal behaviors (speech, prosody, backchannels, visual cues such as gaze and gestures)
Act safely and effectively under physical-world constraints in shared spaces

Embodiments of IPA that interact with humans in the physical world include:

Robots (both humanoids and non-humanoids)
Physically grounded and environment-aware avatars (e.g., AR telepresence)
On-device audio-visual agents

Call for Papers

Submission: We invite authors to submit unpublished papers (8-page CVPR format) to our workshop; accepted papers will be presented at a poster session. All submissions will go through a double-blind review process. All contributions must be submitted (along with supplementary materials, if any) on OpenReview (the link will be provided soon).

Accepted papers will be published in the official CVPR Workshops proceedings and the Computer Vision Foundation (CVF) Open Access archive.

Note: Authors of previously rejected main conference submissions are also welcome to submit their work to our workshop. In that case, you must include the previous reviewers' comments (named previous_reviews.pdf) and a letter of changes (named letter_of_changes.pdf) in your supplementary materials, clearly showing how the previous reviewers' comments were addressed.

Important Dates

Paper Submission Deadline	March 10, 2026 (23:59 PST)
Notification to Authors	March 24, 2026
Camera-Ready Deadline	April 10, 2026

Schedule

Wednesday, 3 June 2026 · 8:20 AM – 1:00 PM MDT
Room 203

Time in Denver (MDT)	Start time in UTC	Item
8:20 – 8:30	3 Jun 2026 14:20 UTC	Opening Remarks
8:30 – 9:10	3 Jun 2026 14:30 UTC	Oral Session A: Learning Physical Principles from Interaction with Test-Time Memory; Gesture2Music: A Low-Latency Real-Time Framework for Continuous Gesture-Driven Music Generation; MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents; Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
9:10 – 9:50	3 Jun 2026 15:10 UTC	Keynote Talk by Alexander Richard
9:50 – 10:20	3 Jun 2026 15:50 UTC	Oral Session B: FunRec: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos; TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size; MIBURI: Towards Expressive Interactive Gesture Synthesis
10:20 – 10:30	3 Jun 2026 16:20 UTC	Coffee Break
10:30 – 11:10	3 Jun 2026 16:30 UTC	Keynote Talk by Agon Serifi
11:10 – 11:50	3 Jun 2026 17:10 UTC	Keynote Talk by Maja Matarić
11:50 – 12:00	3 Jun 2026 17:50 UTC	Closing Remarks
12:00 – 13:00	3 Jun 2026 18:00 UTC	Poster Presentation

Keynote Speakers

Alexander Richard

Director, Research Scientist at Meta

Towards Embodied Social Agents in XR

Humans are extraordinarily skilled at reading not just acoustic cues but also facial expressions, body language, and proxemics — abilities honed over a lifetime of social interaction. The most immersive interface with a machine, therefore, is one that is indistinguishable from interacting with another person. We are building toward this vision with immersive social agents, embodied as 3D Codec Avatars, that consume multimodal user input — text, speech, body motion, facial expressions, and gaze — and engage in full-duplex dyadic conversation with spatial understanding of 3D motion and gesture. Our north star is the embodied Turing test: agents that move, speak, and behave so naturally that human-machine interaction becomes indistinguishable from human-human interaction. In this talk, I will present joint generative modeling approaches for producing all relevant output modalities in concert, as well as data strategies for scaling effectively beyond dedicated 3D capture sessions.

Bio

Alexander Richard is a Director and Principal AI Research Scientist at Meta, where he leads research on multimodal generative models for socially intelligent avatars in the Codec Avatars Lab in Pittsburgh. His work focuses on building immersive social agents that can see, hear, speak, and move — combining audio-visual speech synthesis, full-body motion generation, and neural audio codecs to enable natural, full-duplex conversations in virtual reality. Alexander received his PhD from the University of Bonn for his work on temporal action segmentation in video, and his Master's and Bachelor's degrees from RWTH Aachen University with a focus on speech recognition.

Agon Serifi

Associate Research Scientist at Disney Research

From Human Motion to Robot Behavior: Building Blocks for Lifelike and Autonomous Robots

Recent advances in simulation, reinforcement learning, and generative modeling have made it possible to bring expressive, dynamic motions to physical robots. However, as systems grow in complexity, monolithic architectures often struggle with sample inefficiency, lack of interpretability, and the immense difficulty of generalizing across diverse tasks. In this talk, I present modular building blocks that leverage human motion data to make robots move in lifelike ways.

This approach centers on three fundamental blocks: Understanding (perceiving intent and environment), Generation (synthesizing feasible motion plans), and Control (grounding motion in real-world physics). I will detail our latest research in motion tracking and generation, and how tailored learning components scale independently while remaining contextually aware of the requirements and constraints of adjacent modules. Finally, I outline how the interfaces between these blocks provide a robust pathway for world-understanding models to translate high-level intent into grounded action, highlighting why the interface between these components is critical for future robotic intelligence.

Bio

Agon Serifi is an Associate Research Scientist with the Disney Research Robotics team. His work focuses on deep reinforcement learning and human motion to develop control methods that enable robots to generalize and exhibit rich, diverse behaviors. This includes motion imitation, generation, and the reproduction of lifelike movement. Agon received his PhD in Computer Science (2025) from the Computer Graphics Laboratory (CGL) at ETH Zurich under Prof. Markus Gross and Dr. Moritz Bächer, in collaboration with Disney Research. He also holds a Bachelor’s (2019) and a Master’s (2021) degree in Computer Science from ETH Zurich.

Maja Matarić

Professor at University of Southern California

Principal Scientist at Google DeepMind

Founding Director, Robotics and Autonomous Systems Center (RASC)

Founding Director, Interaction Lab

The Challenges of Human-Centered AI and Robotics: What We Want, Need, and are Getting From Human-Machine Interaction

Language-based AI is now ubiquitous, and user expectations for intelligent machines are scaling along with it: we expect machines to understand us, predict our needs and wants, do what we enjoy and prefer, and adapt as we change our moods and minds, learn, grow, and age. Physical AI, in the form of robotics, is the next major AI challenge, and it is not ready to leap into our daily lives yet. While massive investment is focused on functional behavior of humanoid robots (perceiving the world, moving around, and manipulating objects), human-robot interaction (HRI) is relegated to an afterthought. It is assumed that once a robot can move around and do things, it will be useful and wanted, yet over 25 years of research in HRI tells us otherwise. While the needs for human-centered services continue to grow, research and development is minimal. This talk will discuss how bringing together robotics, AI, and machine learning for long-term user modeling, real-time multimodal behavioral signal processing, and affective computing is enabling machines to understand, interact, and adapt to users' specific and ever-changing needs. We will overview methods and challenges of sparse and noisy heterogeneous, multi-modal, personal interaction data and of creating expressive agent and robot behavior toward understanding, coaching, motivating, and supporting a wide variety of user populations across the age span (infants, children, adults, elderly), ability span (typically developing, autism, anxiety, stroke, dementia), contexts (schools, therapy centers, homes), and deployment durations (from weeks to 6 months) through socially assistive robotics. We will discuss the challenges of understanding what we humans want from interactions with machines vs. what we need vs. what we are getting, and how those distinctions are shaping the future of not just AI and ML but society at large.

Bio

Maja Matarić is the Chan Soon-Shiong Chair and Distinguished Professor of Computer Science, with appointments in Neuroscience and Pediatrics, at the University of Southern California (USC), founding director of the USC Robotics and Autonomous Systems Center, and a Principal Scientist at Google DeepMind. She is a member of the NAE and the AMACAD, fellow of AAAS, IEEE, AAAI, and ACM, and recipient of the US Presidential Award for Excellence in Science, Mathematics & Engineering Mentoring, Anita Borg Institute Women of Vision, ACM Athena Lecture, ACM Eugene Lawler, Mass Robotics Medal, NSF Career, MIT TR35 Innovation, and IEEE RAS Early Career Awards.

She is a pioneer of the field of socially assistive robotics. Her USC Interaction Lab's research aims to endow machines with the ability to provide users with personalized motivation and support, empowering them to reach their potential. The lab's research supports users with differences, including children on the autism spectrum, stroke patients, dementia patients, and students and adults with anxiety or depression, among others.

Accepted Full Papers

Peer-reviewed papers accepted to IPA 2026 through our double-blind review process. Each will be presented as a 10-minute oral talk in Oral Session A, followed by a poster during the poster session.

Learning Physical Principles from Interaction with Test-Time Memory
Haoyang Li, Yang You, Hao Su, Leonidas Guibas

CVF arXiv

Gesture2Music: A Low-Latency Real-Time Framework for Continuous Gesture-Driven Music Generation
Rathinaraja Jeyaraj, Barathi Subramanian, Kapilya Gangadharan, Anand Paul

CVF arXiv

Invited CVPR Papers

We are honoured to host a selection of CVPR 2026 main-conference papers whose contributions advance interactive physical AI.

Oral Presentations

MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
Ruoxuan Zhang, Qiyun Zheng, Zhiyu Zhou, Ziqi Liao, Siyu Wu, Jian-Yu Jiang-Lin, Bin Wen, Hongxia Xie, Jianlong Fu, Wen-Huang Cheng

CVF arXiv Project

Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
Xinpeng Li, Bolin Lai, Hardy Chen, Shijian Deng, Cihang Xie, Yuyin Zhou, James Matthew Rehg, Yapeng Tian

CVF arXiv Project

FunRec: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos
Alexandros Delitzas, Chenyangguang Zhang, Alexey Gavryushin, Tommaso Di Mario, Boyang Sun, Rishabh Dabral, Leonidas Guibas, Christian Theobalt, Marc Pollefeys, Francis Engelmann, Daniel Barath

CVF arXiv Project

TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size
Stefan Lionar, Gim Hee Lee

CVF arXiv Project Code

MIBURI: Towards Expressive Interactive Gesture Synthesis
M. Hamza Mughal, Rishabh Dabral, Vera Demberg, Christian Theobalt

CVF arXiv Project

Poster Presentations

UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos
Gu Zhang, Qicheng Xu, Haozhe Zhang, Jianhan Ma, Long He, Yiming Bao, Zeyu Ping, Zhecheng Yuan, Chenhao Lu, Chengbo Yuan, Tianhai Liang, Xiaoyu Tian, Maanping Shao, Feihong Zhang, Mingyu Ding, Yang Gao, Hao Zhao, Hang Zhao, Huazhe Xu

CVF arXiv Project

InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions
Sirui Xu, Samuel Schulter, Morteza Ziyadi, Xialin He, Xiaohan Fei, Yu-Xiong Wang, Liang-Yan Gui

CVF arXiv Project

OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery
Yiwen Zhao, Ce Zheng, Yufu Wang, Hsueh-Han Daniel Yang, Liting Wen, Laszlo A. Jeni

CVF arXiv Project Code

MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images
Chentao Song, He Zhang, Haolei Yuan, Haozhe Lin, Jianhua Tao, Hongwen Zhang, Tao Yu

CVF arXiv Project Code

EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions
Taegyoon Yoon, Yegyu Han, Seojin Ji, Jaewoo Park, Sojeong Kim, Taein Kwon, Hyung-Sin Kim

CVF arXiv Project

Interactive Episodic Memory with User Feedback
Nikesh Subedi, Loris Bazzani, Ziad Al-Halah

CVF arXiv Project

EgoAVU: Egocentric Audio-Visual Understanding
Ashish Seth, Xinhao Mei, Changsheng Zhao, Varun Nagaraja, Ernie Chang, Gregory P. Meyer, Gael Le Lan, Yunyang Xiong, Vikas Chandra, Yangyang Shi, Dinesh Manocha, Zhipeng Cai

CVF arXiv Project Code Dataset

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, Yong Jae Lee

CVF arXiv Project Code

UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking
Xuangeng Chu, Ruicong Liu, Yifei Huang, Yun Liu, Yichen Peng, Bo Zheng

CVF arXiv Project Code

ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and TEst-time Generative Adaptation
Kim Youwang, Lee Hyoseok, Subin Park, Gerard Pons-Moll, Tae-Hyun Oh

CVF arXiv Project Code

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
Juze Zhang, Changan Chen, Xin Chen, Heng Yu, Tiange Xiang, Ali Sartaz Khan, Shrinidhi K. Lakshmikanth, Ehsan Adeli

CVF arXiv Project

PolySLGen: Online Multimodal Speaking–Listening Reaction Generation in Polyadic Interaction
Zhi-Yi Lin, Thomas Markhorst, Jouh Yeong Chew, Xucong Zhang

CVF arXiv Code

DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation
Yichen Peng, Jyun-Ting Song, Siyeol Jung, Ruofan Liu, Haiyang Liu, Xuangeng Chu, Ruicong Liu, Erwin Wu, Hideki Koike, Kris Kitani

CVF arXiv Project

HandX: Scaling Bimanual Motion and Interaction Generation
Zimu Zhang, Yucheng Zhang, Xiyan Xu, Ziyin Wang, Sirui Xu, Kai Zhou, Bing Zhou, Chuan Guo, Jian Wang, Yu-Xiong Wang, Liang-Yan Gui

CVF arXiv Project

Organizers

Seonwook Park

NVIDIA

Amrita Mazumdar

NVIDIA

Shengze Wang

NVIDIA

Leena Mathur

Carnegie Mellon University

Koki Nagano

NVIDIA

Shalini De Mello

NVIDIA
