Region-level Understanding

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks …

Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks

We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of …