
Visual Fact Checker
Enabling High-Fidelity Detailed Caption Generation

Yunhao Ge
Xiaohui Zeng
Jacob Samuel Huffman
Tsung-Yi Lin
Ming-Yu Liu
Yin Cui
NVIDIA Corporation
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Paper (arXiv)

VisualFactChecker is a training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects.




VFC Caption Example



A happy little girl is standing in a green field, wearing a plaid shirt and holding onto a string of pink balloons. The balloons are floating in the air, creating a playful and joyful atmosphere. The girl appears to be enjoying the moment, with a smile on her face. In the background, there is a vast expanse of green grass, stretching out as far as the eye can see. The sky above is a bright blue, with a few white clouds scattered about. The overall mood of the scene is one of carefree happiness and joy, with the pink balloons adding a touch of whimsy and playfulness to the atmosphere.




Abstract

Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) utilizes tools such as object detection and VQA models to fact-check the proposed captions; and 3) captioning, where an LLM generates the final caption by summarizing the caption proposals and the fact-checking results. In this step, VFC can flexibly generate captions in various styles following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score, which measures the image-image similarity between the original image and an image reconstructed from the caption by a text-to-image model; 3) a human study on Amazon Mechanical Turk; and 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-source captioning methods for 2D images on the COCO dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by combining open-source models into a pipeline, we can attain captioning capability comparable to proprietary models such as GPT-4V, despite being over 10× smaller in model size.




Approach


2D Image Captioning




Pipeline of VisualFactChecker for captioning 2D images. The input image is first captioned by two multimodal captioning models (Captioner-1 and Captioner-2) to generate preliminary captions. A large language model (LLM) then fact-checks these captions by calling an object detection model. Finally, the LLM incorporates all results and summarizes the final caption, following the given instruction.
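
The Python sketch below illustrates this propose-verify-summarize flow for a single image. It is a minimal structural sketch under assumed interfaces, not the paper's implementation: the captioning models, the object detector, and the LLM are supplied as placeholder callables, and the prompts are purely illustrative.

# Structural sketch of the 2D VFC pipeline (assumed interfaces, illustrative prompts).
from typing import Callable, List

def vfc_caption_2d(
    image,                              # e.g., a PIL image or a file path
    captioners: List[Callable],         # step 1: caption-proposal models
    detect_objects: Callable,           # step 2 tool: returns True if the object is found
    llm: Callable[[str], str],          # steps 2-3: fact-checking and summarization
    instruction: str = "Write a detailed, factual caption.",
) -> str:
    # Step 1: proposal -- each captioning model proposes an initial caption.
    proposals = [captioner(image) for captioner in captioners]

    # Step 2: verification -- the LLM lists the objects mentioned in the
    # proposals, then the object detector checks which of them are grounded.
    objects = llm(
        "List the objects mentioned in these captions, one per line:\n"
        + "\n".join(proposals)
    ).splitlines()
    verification = "\n".join(
        f"{obj}: {'detected' if detect_objects(image, obj) else 'NOT detected'}"
        for obj in objects
    )

    # Step 3: captioning -- the LLM summarizes the proposals and the
    # fact-check results into one caption, following the instruction.
    return llm(
        f"{instruction}\n\nCaption proposals:\n" + "\n".join(proposals)
        + "\n\nFact-check results:\n" + verification
        + "\nKeep only content consistent with the fact-check results."
    )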


3D Object Captioning




Pipeline of VisualFactChecker for captioning 3D objects. For each rendered view, the input is first captioned by two multimodal captioning models (Captioner-1 and Captioner-2) to generate preliminary captions. A large language model (LLM) then fact-checks these captions by calling VQA models, and summarizes a caption for that view by following the given instruction. Once every view is captioned, the LLM synthesizes the per-view captions into a single, comprehensive caption for the entire 3D object.
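
A similar sketch for the 3D case, again under assumed interfaces: render_views stands in for rendering the asset from several cameras, caption_view for a per-view propose/verify/summarize pipeline (with VQA as the fact-checking tool), and llm for the model that fuses the per-view captions.

# Structural sketch of the 3D extension (assumed interfaces, illustrative prompt).
from typing import Callable

def vfc_caption_3d(
    asset,                                  # a 3D object
    render_views: Callable,                 # renders the asset from several viewpoints
    caption_view: Callable[[object], str],  # per-view propose/verify/summarize
    llm: Callable[[str], str],              # fuses the per-view captions
    num_views: int = 4,
) -> str:
    views = render_views(asset, num_views)
    per_view_captions = [caption_view(view) for view in views]
    # The LLM merges the per-view captions into one consistent description,
    # reconciling view-dependent wording such as "left" or "front".
    return llm(
        "Combine these per-view captions into a single caption describing "
        "the whole 3D object:\n" + "\n".join(per_view_captions)
    )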




Experiments

We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score, which measures the image-image similarity between the original image and an image reconstructed from the caption by a text-to-image model; 3) a human study on Amazon Mechanical Turk; and 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-source captioning methods for 2D images on the COCO dataset and 3D assets on the Objaverse dataset.
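
The two automatic metrics can be sketched as follows. This is an illustrative implementation only: the exact CLIP checkpoint, text-to-image model, and any score scaling used in the paper may differ, and generate_image is a placeholder for a text-to-image model that reconstructs an image from the caption.

# Illustrative CLIP-Score / CLIP-Image-Score computation (model choice is an assumption).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_score(image, caption: str) -> float:
    # Cosine similarity between the image embedding and the caption embedding.
    # (Long captions may need truncation to CLIP's 77-token limit.)
    inputs = processor(text=[caption], images=image, return_tensors="pt")
    img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    return torch.cosine_similarity(img_feat, txt_feat).item()

@torch.no_grad()
def clip_image_score(image, caption: str, generate_image) -> float:
    # Cosine similarity between the original image and an image
    # reconstructed from the caption by a text-to-image model.
    reconstructed = generate_image(caption)
    batch = processor(images=[image, reconstructed], return_tensors="pt")
    feats = model.get_image_features(pixel_values=batch["pixel_values"])
    return torch.cosine_similarity(feats[0:1], feats[1:2]).item()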



CLIP-Score and CLIP-Image-Score (overall)
Captioning Method CLIP-Score (%) ↑ CLIP-Image-Score (%) ↑
Human Label (COCO GT) 30.36 (-2.54) 71.21 (-2.40)
BLIP2 30.11 (-2.79) 70.79 (-2.82)
InstructBLIP 31.45 (-1.45) 72.95 (-0.66)
LLaVA-1.5 32.08 (-0.82) 73.24 (-0.37)
Kosmos-2 32.32 (-0.58) 73.28 (-0.33)
VisualFactChecker (Ours) 32.90 73.61

Table 1. 2D image captioning comparison with different metrics on the 5,000-image COCO test set (Karpathy split). We use the raw image and caption as the input pair for evaluation; numbers in parentheses show the gap to VisualFactChecker.

Captioning Method CLIP-Score (%) ↑ CLIP-Image-Score (%) ↑
Cap3D 33.44 (-0.57) 79.88 (-0.44)
VisualFactChecker (Ours) 34.01 80.32

Table 2. 3D object captioning comparison with different metrics on 1,000 objects from Objaverse; numbers in parentheses show the gap to VisualFactChecker.



CLIP-Score and CLIP-Image-Score (pairwise)


Human and GPT-4V evaluation



Citation
@inproceedings{ge2024visual,
  title={Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation},
  author={Ge, Yunhao and Zeng, Xiaohui and Huffman, Jacob Samuel and Lin, Tsung-Yi and Liu, Ming-Yu and Cui, Yin},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition ({CVPR})},
  year={2024}
}