Wednesday, 3 June 2026
8:25 AM – 1:00 PM MDT
Rooms 210 / 212


Keynote Speakers

Alexander Richard
Director, Research Scientist at Meta
Maja Matarić
Professor at USC
Principal Scientist at Google DeepMind
Founding Director, Interaction Lab
Agon Serifi
Associate Research Scientist at Disney Research Robotics

Introduction

The 1st International Workshop on Interactive Physical AI (IPA 2026) at CVPR 2026 will bring together researchers from computer vision, robotics, and multimodal AI. It provides the first comprehensive forum to address the full scope of interactive physical AI systems, building on prior workshops that have explored subsets of this space. Workshop topics include (but are not limited to):

  • Human-AI interaction in physical environments
  • Embodied conversational AI and multimodal learning
  • Full-duplex multimodal conversational models
  • Social intelligence and communication for robots and avatars
  • Egocentric vision and first-person perception
  • Real-time audio-visual processing for interactive systems
  • Safe and cooperative human-robot interaction
  • Personalization and lifelong learning for physical AI
  • Privacy-aware learning in interactive settings
  • Physically authentic perception and generation for avatars and agents

We will host invited speakers and accept submissions of full, unpublished papers. Submissions will be peer-reviewed via a double-blind process; accepted papers will be published in the official CVPR 2026 workshop proceedings and presented at the workshop itself.

What is Interactive Physical AI?

Advances in multimodal learning, embodied intelligence, and conversational AI are transforming how humans interact with AI systems situated alongside us in the physical world. We define such systems as Interactive Physical AI (IPA). IPA systems simultaneously:

  1. Perceive humans and scenes using audio-visual signals
  2. Generate communication signals via verbal and nonverbal behaviors (speech, prosody, backchannels, visual cues such as gaze and gestures)
  3. Act safely and effectively under physical-world constraints in shared spaces

Embodiments of IPA that interact with humans in the physical world include:

  • Robots (both humanoids and non-humanoids)
  • Physically grounded and environment-aware avatars (e.g., AR telepresence)
  • On-device audio-visual agents


Call for Papers

Submission: We invite authors to submit unpublished papers (8-page CVPR format) to our workshop; accepted papers will be presented at a poster session. All submissions will go through a double-blind review process. All contributions must be submitted (along with supplementary materials, if any) on OpenReview (the link will be provided soon).

Accepted papers will be published in the official CVPR Workshops proceedings and the Computer Vision Foundation (CVF) Open Access archive.

Note: Authors of previously rejected main conference submissions are also welcome to submit their work to our workshop. If you do, you must include the previous reviewers' comments (named previous_reviews.pdf) and a letter of changes (named letter_of_changes.pdf) in your supplementary materials, showing clearly how you addressed those comments.


Important Dates


Paper Submission Deadline: March 10, 2026 (23:59 PST)
Notification to Authors: March 24, 2026
Camera-Ready Deadline: April 10, 2026


Tentative Schedule

  • Wednesday, 3 June 2026 · 8:25 AM – 1:00 PM MDT
  • Rooms 210 / 212

Note: The following schedule is tentative and will likely change before workshop day.

Time in Denver (MDT)    Start time (MDT)            Item
8:25 – 8:30             3 Jun 2026 8:25:00 MDT      Opening Remarks
8:30 – 9:10             3 Jun 2026 8:30:00 MDT      Oral Session A
9:10 – 9:50             3 Jun 2026 9:10:00 MDT      Keynote Talk by Alexander Richard
9:50 – 10:20            3 Jun 2026 9:50:00 MDT      Oral Session B
10:20 – 10:30           3 Jun 2026 10:20:00 MDT     Coffee Break
10:30 – 11:10           3 Jun 2026 10:30:00 MDT     Keynote Talk by Agon Serifi
11:10 – 11:50           3 Jun 2026 11:10:00 MDT     Keynote Talk by Maja Matarić
11:50 – 12:00           3 Jun 2026 11:50:00 MDT     Closing Remarks
12:00 – 13:00           3 Jun 2026 12:00:00 MDT     Poster Presentation



Keynote Speakers


Alexander Richard
Director, Research Scientist at Meta
Towards Embodied Social Agents in XR

Agon Serifi
Associate Research Scientist at Disney Research
From Human Motion to Robot Behavior: Building Blocks for Lifelike and Autonomous Robots

Maja Matarić
Professor at University of Southern California
Principal Scientist at Google DeepMind
Founding Director, Robotics and Autonomous Systems Center (RASC)
Founding Director, Interaction Lab
The Challenges of Human-Centered AI and Robotics: What We Want, Need, and are Getting From Human-Machine Interaction


Accepted Full Papers

Peer-reviewed papers accepted to IPA 2026 through our double-blind review process. Each will be presented as a 10-minute oral talk in Oral Session A, followed by a poster during the poster session.

Learning Physical Principles from Interaction with Test-Time Memory
Haoyang Li, Yang You, Hao Su, Leonidas Guibas
arXiv

Gesture2Music: A Low-Latency Real-Time Framework for Continuous Gesture-Driven Music Generation
Rathinaraja Jeyaraj, Barathi Subramanian, Kapilya Gangadharan, Anand Paul
arXiv


Invited CVPR Papers

We are honoured to host a selection of CVPR 2026 main-conference papers whose contributions advance interactive physical AI.

Oral Presentations

MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
Ruoxuan Zhang, Qiyun Zheng, Zhiyu Zhou, Ziqi Liao, Siyu Wu, Jian-Yu Jiang-Lin, Bin Wen, Hongxia Xie, Jianlong Fu, Wen-Huang Cheng
arXiv Project

Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
Xinpeng Li, Bolin Lai, Hardy Chen, Shijian Deng, Cihang Xie, Yuyin Zhou, James Matthew Rehg, Yapeng Tian
arXiv Project

FunRec: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos
Alexandros Delitzas, Chenyangguang Zhang, Alexey Gavryushin, Tommaso Di Mario, Boyang Sun, Rishabh Dabral, Leonidas Guibas, Christian Theobalt, Marc Pollefeys, Francis Engelmann, Daniel Barath
arXiv Project

TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size
Stefan Lionar, Gim Hee Lee
arXiv Project Code

MIBURI: Towards Expressive Interactive Gesture Synthesis
M. Hamza Mughal, Rishabh Dabral, Vera Demberg, Christian Theobalt
arXiv Project

Poster Presentations

UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking
Xuangeng Chu, Ruicong Liu, Yifei Huang, Yun Liu, Yichen Peng, Bo Zheng
arXiv Project Code

ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and TEst-time Generative Adaptation
Kim Youwang, Lee Hyoseok, Subin Park, Gerard Pons-Moll, Tae-Hyun Oh
arXiv Project Code

PolySLGen: Online Multimodal Speaking–Listening Reaction Generation in Polyadic Interaction
Zhi-Yi Lin, Thomas Markhorst, Jouh Yeong Chew, Xucong Zhang
arXiv Code

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
Juze Zhang, Changan Chen, Xin Chen, Heng Yu, Tiange Xiang, Ali Sartaz Khan, Shrinidhi K. Lakshmikanth, Ehsan Adeli
arXiv Project

DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation
Yichen Peng, Jyun-Ting Song, Siyeol Jung, Ruofan Liu, Haiyang Liu, Xuangeng Chu, Ruicong Liu, Erwin Wu, Hideki Koike, Kris Kitani
arXiv Project

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, Yong Jae Lee
arXiv Project Code

HandX: Scaling Bimanual Motion and Interaction Generation
Zimu Zhang, Yucheng Zhang, Xiyan Xu, Ziyin Wang, Sirui Xu, Kai Zhou, Bing Zhou, Chuan Guo, Jian Wang, Yu-Xiong Wang, Liang-Yan Gui
arXiv Project

InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions
Sirui Xu, Samuel Schulter, Morteza Ziyadi, Xialin He, Xiaohan Fei, Yu-Xiong Wang, Liang-Yan Gui
arXiv Project

UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos
Gu Zhang, Qicheng Xu, Haozhe Zhang, Jianhan Ma, Long He, Yiming Bao, Zeyu Ping, Zhecheng Yuan, Chenhao Lu, Chengbo Yuan, Tianhai Liang, Xiaoyu Tian, Maanping Shao, Feihong Zhang, Mingyu Ding, Yang Gao, Hao Zhao, Hang Zhao, Huazhe Xu
arXiv Project

OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery
Yiwen Zhao, Ce Zheng, Yufu Wang, Hsueh-Han Daniel Yang, Liting Wen, Laszlo A. Jeni
arXiv Project Code

MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images
Chentao Song, He Zhang, Haolei Yuan, Haozhe Lin, Jianhua Tao, Hongwen Zhang, Tao Yu
arXiv Project Code

EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions
Taegyoon Yoon, Yegyu Han, Seojin Ji, Jaewoo Park, Sojeong Kim, Taein Kwon, Hyung-Sin Kim
arXiv Project

EgoAVU: Egocentric Audio-Visual Understanding
Ashish Seth, Xinhao Mei, Changsheng Zhao, Varun Nagaraja, Ernie Chang, Gregory P. Meyer, Gael Le Lan, Yunyang Xiong, Vikas Chandra, Yangyang Shi, Dinesh Manocha, Zhipeng Cai
arXiv Project Code Dataset

Interactive Episodic Memory with User Feedback
Nikesh Subedi, Loris Bazzani, Ziad Al-Halah
arXiv Project


Organizers

Leena Mathur
Carnegie Mellon University
Koki Nagano
NVIDIA


