TexFusion takes a text prompt and mesh geometry as input, and produces a UV-parameterized texture image that matches the prompt and mesh, using Stable Diffusion as the text-to-image diffusion backbone.
Key to TexFusion is our Sequential Interlaced Multiview Sampler (SIMS): SIMS performs denoising diffusion iterations in multiple camera views, while aggregating the trajectories through a shared latent texture map after every denoising step. The output of SIMS is a set of 3D-consistent latent images.
The latent images are decoded by the Stable Diffusion decoder into RGB images, which are fused into a texture map by optimizing an intermediate neural color field.
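The following is a minimal sketch of the interlaced structure of SIMS, not the official implementation. The helpers `denoise_step`, `render_latent_texture`, and `project_to_texture` are hypothetical placeholders for the diffusion UNet update, the latent-space rasterizer, and the view-to-UV back-projection, respectively.

```python
# Sketch of the Sequential Interlaced Multiview Sampler (SIMS).
# All three callables below are hypothetical stand-ins, not a released API.

def sims(latent_texture, cameras, timesteps,
         denoise_step, render_latent_texture, project_to_texture):
    """Run interlaced multiview denoising over a shared latent texture map.

    latent_texture: latent-space texture map, initialized with noise.
    cameras: list of camera poses covering the mesh surface.
    timesteps: diffusion timesteps ordered from high noise to low noise.
    """
    for t in timesteps:
        # Interlace the views: each view takes one denoising step, and its
        # result is immediately aggregated back into the shared latent
        # texture map, so the next view starts from a 3D-consistent state.
        for cam in cameras:
            latent_view = render_latent_texture(latent_texture, cam)
            latent_view = denoise_step(latent_view, t, cam)
            latent_texture = project_to_texture(latent_view, cam, latent_texture)
    # Final renders from the converged texture: a set of 3D-consistent
    # latent images, ready for decoding and fusion into an RGB texture.
    return [render_latent_texture(latent_texture, cam) for cam in cameras]
```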
We compare TexFusion to TEXTure, a text-driven texture generation method that also uses Stable Diffusion with depth conditioning.
In terms of local 3D consistency (consistency within neighborhoods on the mesh surface), textures produced by TexFusion are locally consistent: there are no visible seam lines or stitching artifacts. In contrast, we often find severe artifacts when viewing the top and back sides of TEXTure's outputs. These artifacts are most noticeable when a clean, uniform color is expected, such as when texturing vehicles.
In terms of global consistency (semantic coherence of the entire texture, e.g., a face should have exactly two eyes and one nose), TEXTure performs poorly and suffers from problems similar to DreamFusion's Janus face problem. TexFusion ameliorates this problem through frequent communication between views in SIMS.
Quantitatively, our method achieves lower FID than TEXTure when measured against images generated by depth-conditioned image diffusion models for the same mesh and text prompt. TexFusion is also preferred overall by users in a user study.
TexFusion can be configured to hallucinate more material details by using deterministic DDIM sampling. This is particularly beneficial on smooth or low-poly geometries, but can sometimes reduce the cleanliness of the result.
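For illustration, deterministic DDIM corresponds to sampling with eta = 0. The snippet below shows one way to configure this with the public diffusers library; the model ID, image path, and prompt are placeholders, and this is not the TexFusion codebase.

```python
# Illustration only: deterministic DDIM sampling via diffusers.
# "stabilityai/stable-diffusion-2-depth" and "view_render.png" are stand-ins.
import torch
from diffusers import StableDiffusionDepth2ImgPipeline, DDIMScheduler
from diffusers.utils import load_image

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")
# Swap in a DDIM scheduler while keeping the pipeline's noise schedule.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

init_image = load_image("view_render.png")  # hypothetical rendered view

# eta=0.0 yields the deterministic DDIM trajectory; eta=1.0 recovers
# DDPM-like stochastic sampling.
image = pipe("a weathered leather armchair", image=init_image,
             num_inference_steps=50, eta=0.0).images[0]
```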
By using ControlNet as the diffusion backbone, textures generated by TexFusion adhere more closely to the geometric details of the input mesh, since ControlNet uses high-resolution depth conditioning.
We further compare TexFusion + ControlNet in “normal mode” (classifier-free guidance applied to the text prompt only) and ControlNet's “guess mode” (classifier-free guidance applied to both text and depth) on meshes with fine geometric details. TexFusion produces high-contrast textures with the appearance of strong lighting in “guess mode”, and realistic textures with smooth lighting in “normal mode”; see the sketch below.
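The snippet below illustrates the two conditioning modes with the public diffusers ControlNet API, not the TexFusion code; the model IDs, depth render path, and prompt are placeholders. In diffusers' guess mode, the depth control is applied only to the conditional branch of classifier-free guidance, so guidance amplifies the depth signal as well as the text.

```python
# Illustration: ControlNet "normal mode" vs. "guess mode" in diffusers.
# Model IDs and "mesh_depth_render.png" are hypothetical stand-ins.
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

depth_image = load_image("mesh_depth_render.png")  # hypothetical depth render

# "Normal mode": classifier-free guidance acts on the text prompt only.
normal = pipe("a carved wooden statue", image=depth_image,
              guess_mode=False).images[0]

# "Guess mode": guidance also acts on the depth condition, which tends to
# produce the higher-contrast, strongly lit textures noted above.
guessed = pipe("a carved wooden statue", image=depth_image,
               guess_mode=True).images[0]
```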
@InProceedings{cao2023texfusion,
  title={TexFusion: Synthesizing 3D Textures with Text-Guided Image Diffusion Models},
  author={Cao, Tianshi and Kreis, Karsten and Fidler, Sanja and Sharp, Nicholas and Yin, KangXue},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2023},
}