DuoGen teaser examples

Overview

We present DuoGen, a multimodal model that automatically switches between modalities to generate coherent, interleaved image-text sequences. This enables complex tasks that require synchronized text and visuals, with applications ranging from illustrated tutorials and narratives to robotic manipulation and autonomous navigation.

Key Features

Model Capabilities

Citation

@inproceedings{duogen2026duogen,
  title={DuoGen: Towards General Purpose Interleaved Multimodal Generation},
  author={Shi, Min and Zeng, Xiaohui and Huang, Jiannan and Cui, Yin and Ferroni, Francesco and Li, Jialuo and Pachori, Shubham and Li, Zhaoshuo and Balaji, Yogesh and Wang, Haoxiang and Lin, Tsung-Yi and Fu, Xiao and Zhao, Yue and Chen, Chieh-Yun and Liu, Ming-Yu and Shi, Humphrey},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}