 
 # RegionGPT: Towards Region Understanding Vision Language Model

  ![](/sites/default/files/styles/wide/public/publications/teaser-danny-v4.png?itok=3WPhSx-S)

 Vision language models (VLMs) have advanced rapidly through the integration of large language models (LLMs) with image-text pairs, yet they struggle with detailed regional visual understanding because of the limited spatial awareness of the vision encoder and the use of coarse-grained training data that lacks detailed, region-specific captions. To address this, we introduce RegionGPT (RGPT for short), a novel framework designed for complex region-level captioning and understanding. RGPT enhances the spatial awareness of regional representations with simple yet effective modifications to existing visual encoders in VLMs. We further improve performance on tasks requiring a specific output scope by integrating task-guided instruction prompts during both training and inference, while maintaining the model’s versatility for general-purpose tasks. Additionally, we develop an automated region caption data generation pipeline, enriching the training set with detailed region-level captions. We demonstrate that a universal RGPT model can be effectively applied to, and significantly enhances performance across, a range of region-level tasks, including but not limited to complex region descriptions, reasoning, object classification, and referring expression comprehension. Code will be released at the project page <https://guoqiushan.github.io/regiongpt.github.io/>.


 ## Authors


Qiushan Guo (The University of Hong Kong)

[Shalini De Mello](/index.php/person/shalini-de-mello)

[Hongxu Danny Yin](/index.php/person/danny-yin)

[Wonmin Byeon](/index.php/person/wonmin-byeon)

Ka Chun Cheung (NVIDIA)

Yizhou Yu (The University of Hong Kong)

Ping Luo (The University of Hong Kong)

[Sifei Liu](/index.php/person/sifei-liu)

 
 ## Publication Date


Monday, June 17, 2024

 
 ## Published in


[IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2024](https://openaccess.thecvf.com/content/CVPR2024/papers/Guo_RegionGPT_Towards_Region_Understanding_Vision_Language_Model_CVPR_2024_paper.pdf)

 
 ## Research Area


[Artificial Intelligence and Machine Learning](/index.php/research-area/machine-learning-artificial-intelligence)

[Computer Vision](/index.php/research-area/computer-vision)

[Generative AI](/index.php/research-area/generative-ai)

 
 ## External Links


[Project Page](https://guoqiushan.github.io/regiongpt.github.io/)

[ArXiv](https://arxiv.org/abs/2403.02330)

 
 ## Copyright


This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to <pubs-permissions@ieee.org>.