TexFusion takes a text prompt and mesh geometry as input, and produces a UV-parameterized texture image that matches the prompt and mesh, using Stable Diffusion as the text-to-image diffusion backbone.
Key to TexFusion is our Sequential Interlaced Multiview Sampler (SIMS): SIMS performs denoising diffusion iterations in multiple camera views, while aggregating the trajectories through a shared latent texture map after every denoising step. The output of SIMS is a set of 3D-consistent latent images.
The latent images are decoded by the Stable Diffusion decoder into RGB images, which are fused into a texture map by optimizing an intermediate neural color field.
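The following is a minimal sketch of the interlaced structure of SIMS, not the official implementation. The helpers `denoise_step`, `render_latent_texture`, and `project_to_texture` are hypothetical placeholders for the diffusion UNet update, the latent-space rasterizer, and the view-to-UV back-projection, respectively.

```python
# Sketch of the Sequential Interlaced Multiview Sampler (SIMS).
# All three callables below are hypothetical stand-ins, not a released API.

def sims(latent_texture, cameras, timesteps,
         denoise_step, render_latent_texture, project_to_texture):
    """Run interlaced multiview denoising over a shared latent texture map.

    latent_texture: latent-space texture map, initialized with noise.
    cameras: list of camera poses covering the mesh surface.
    timesteps: diffusion timesteps ordered from high noise to low noise.
    """
    for t in timesteps:
        # Interlace the views: each view takes one denoising step, and its
        # result is immediately aggregated back into the shared latent
        # texture map, so the next view starts from a 3D-consistent state.
        for cam in cameras:
            latent_view = render_latent_texture(latent_texture, cam)
            latent_view = denoise_step(latent_view, t, cam)
            latent_texture = project_to_texture(latent_view, cam, latent_texture)
    # Final renders from the converged texture: a set of 3D-consistent
    # latent images, ready for decoding and fusion into an RGB texture.
    return [render_latent_texture(latent_texture, cam) for cam in cameras]
```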
We compare TexFusion to TEXTure, a text-driven texture generation method that also uses Stable Diffusion with depth conditioning.
In terms of local 3D consistency (consistency within neighborhoods on the mesh surface), textures produced by TexFusion are locally consistent: there are no visible seam lines or stitching artifacts. In contrast, we often find severe artifacts when viewing the top and back sides of TEXTure's outputs. These artifacts are most noticeable when a clean, uniform color is expected, such as when texturing vehicles.
In terms of global consistency (semantic coherence of the entire texture, e.g., a face should have exactly two eyes and one nose), TEXTure performs poorly and suffers from problems similar to DreamFusion's Janus face problem. TexFusion ameliorates this problem through frequent communication between views in SIMS.
Quantitatively, our method achieves lower FID than TEXTure when measured against images generated by depth-conditioned image diffusion models for the same mesh and text prompt. TexFusion is also preferred overall by users in a user study.
TexFusion can be configured to hallucinate more material details by using deterministic DDIM sampling. This is particularly beneficial on smooth or low-poly geometries, but can sometimes reduce the cleanliness of the result.
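For illustration, deterministic DDIM corresponds to sampling with eta = 0. The snippet below shows one way to configure this with the public diffusers library; the model ID, image path, and prompt are placeholders, and this is not the TexFusion codebase.

```python
# Illustration only: deterministic DDIM sampling via diffusers.
# "stabilityai/stable-diffusion-2-depth" and "view_render.png" are stand-ins.
import torch
from diffusers import StableDiffusionDepth2ImgPipeline, DDIMScheduler
from diffusers.utils import load_image

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")
# Swap in a DDIM scheduler while keeping the pipeline's noise schedule.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

init_image = load_image("view_render.png")  # hypothetical rendered view

# eta=0.0 yields the deterministic DDIM trajectory; eta=1.0 recovers
# DDPM-like stochastic sampling.
image = pipe("a weathered leather armchair", image=init_image,
             num_inference_steps=50, eta=0.0).images[0]
```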
By using ControlNet as the diffusion backbone, textures generated by TexFusion adhere more closely to the geometric details of the input mesh, since ControlNet uses high-resolution depth conditioning.
We further compare TexFusion + ControlNet in “normal mode” (classifier-free guidance applied to the text prompt only) and ControlNet's “guess mode” (classifier-free guidance applied to both text and depth) on meshes with fine geometric details. TexFusion produces high-contrast textures with the appearance of strong lighting in “guess mode”, and realistic textures with smooth lighting in “normal mode”; see the sketch below.
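The snippet below illustrates the two conditioning modes with the public diffusers ControlNet API, not the TexFusion code; the model IDs, depth render path, and prompt are placeholders. In diffusers' guess mode, the depth control is applied only to the conditional branch of classifier-free guidance, so guidance amplifies the depth signal as well as the text.

```python
# Illustration: ControlNet "normal mode" vs. "guess mode" in diffusers.
# Model IDs and "mesh_depth_render.png" are hypothetical stand-ins.
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

depth_image = load_image("mesh_depth_render.png")  # hypothetical depth render

# "Normal mode": classifier-free guidance acts on the text prompt only.
normal = pipe("a carved wooden statue", image=depth_image,
              guess_mode=False).images[0]

# "Guess mode": guidance also acts on the depth condition, which tends to
# produce the higher-contrast, strongly lit textures noted above.
guessed = pipe("a carved wooden statue", image=depth_image,
               guess_mode=True).images[0]
```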
@InProceedings{cao2023texfusion,
  title={TexFusion: Synthesizing 3D Textures with Text-Guided Image Diffusion Models},
  author={Cao, Tianshi and Kreis, Karsten and Fidler, Sanja and Sharp, Nicholas and Yin, KangXue},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2023},
}