TexFusion: Synthesizing 3D Textures with Text-Guided Image Diffusion Models

¹NVIDIA  ²University of Toronto  ³Vector Institute
ICCV 2023 (Oral)

We present TexFusion (Texture Diffusion), a new method to synthesize textures for given 3D geometries, using only large-scale text-guided image diffusion models. In contrast to recent works that leverage 2D text-to-image diffusion models to distill 3D objects through a slow and fragile optimization process, TexFusion introduces a new 3D-consistent generation technique specifically designed for texture synthesis that employs regular diffusion model sampling on different 2D rendered views. We achieve state-of-the-art text-guided texture synthesis performance using only image diffusion models, while avoiding the pitfalls of previous distillation-based methods. Text conditioning offers detailed control, and we do not rely on any ground-truth 3D textures for training. This makes our method very versatile and applicable to a broad range of geometries and texture types. We hope that TexFusion will advance AI-based texturing of 3D assets for applications in virtual reality, game design, simulation, and more.


TexFusion takes a text prompt and mesh geometry as input and produces a UV parameterized texture image matching the prompt and mesh using Stable Diffusion as the text-to-image diffusion backbone.

Key to TexFusion is our Sequential Interlaced Multiview Sampler (SIMS): SIMS performs denoising diffusion iterations in multiple camera views, yet the trajectories are aggregated through a latent texture map after every denoising step. The output of SIMS is a set of 3D-consistent latent images.
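The interlacing idea can be illustrated with a toy sketch. Everything below is a hypothetical stand-in, not the paper's implementation: the "latent texture map" is a flat numpy array of texels, each "camera view" is just an index set selecting the texels that view sees, and `denoise_step` is a deterministic shrink toward zero in place of a real reverse-diffusion update. The structure it shows is the SIMS loop: within each denoising iteration, every view renders from the shared texture, takes one denoising step, and writes its result back before the next view renders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (all hypothetical, not the real components):
# - latent texture map: a flat array of texels in latent space
# - views: index sets selecting which texels each camera sees
N_TEXELS, N_STEPS = 64, 10
views = [
    np.arange(0, 48),                                   # view A
    np.arange(16, 64),                                  # view B (overlaps A)
    np.concatenate([np.arange(0, 16), np.arange(48, 64)])  # view C
]

def denoise_step(latents, t, total):
    """Placeholder for one reverse-diffusion update (NOT the real diffusion
    model): deterministically shrink the latents toward zero."""
    return latents * (1.0 - 1.0 / (total - t + 1))

# Initialize the latent texture map with noise.
texture = rng.standard_normal(N_TEXELS)

# SIMS: interlace views WITHIN each denoising iteration. After each view's
# update, the result is projected back into the shared latent texture, so
# the next view renders from an already-updated texture.
for t in range(N_STEPS):
    for view in views:
        rendered = texture[view]                  # "render" this view
        rendered = denoise_step(rendered, t, N_STEPS)
        texture[view] = rendered                  # aggregate into the texture

# Overlapping views now read from the same shared texels, so the final
# "latent images" are 3D-consistent by construction.
final_views = [texture[v] for v in views]
```

The key design point the sketch captures is that aggregation happens after every denoising step, not once at the end, which is what keeps the per-view trajectories from drifting apart.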

The latent images are decoded by the Stable Diffusion decoder into RGB images, and fused into a texture map via optimizing an intermediate neural color field.
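The fusion stage can also be sketched in miniature. In this hypothetical example the neural color field is replaced by a linear field c(x) = W·φ(x) over simple point features, so the gradient step can be written by hand; the real method optimizes an MLP against the decoded RGB renders. The structure is the same: several views each contribute noisy color observations of surface points, and one shared field is fit to all of them by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy setup: surface points observed from several views, each
# view contributing noisy RGB samples. A linear color field c(x) = W @ phi(x)
# stands in for the intermediate neural color field (an MLP in the paper).
n_points = 200
xyz = rng.uniform(-1, 1, size=(n_points, 3))                  # surface samples
phi = np.concatenate([xyz, np.ones((n_points, 1))], axis=1)   # features [x,y,z,1]

true_W = rng.uniform(0, 1, size=(3, 4))                       # unknown "texture"
views = [rng.choice(n_points, 120, replace=False) for _ in range(3)]
observations = [true_W @ phi[v].T + 0.01 * rng.standard_normal((3, len(v)))
                for v in views]                               # per-view RGB samples

# Fuse: minimize the summed squared error between the field's colors and the
# per-view observations, analogous to optimizing the intermediate color field
# against the decoded Stable Diffusion images.
W = np.zeros((3, 4))
lr = 0.05
for step in range(500):
    grad = np.zeros_like(W)
    for v, obs in zip(views, observations):
        pred = W @ phi[v].T
        grad += 2.0 * (pred - obs) @ phi[v] / len(v)
    W -= lr * grad

# The fused field reconciles all views; a final UV texture image would be
# baked by evaluating the field at each texel's surface point.
err = float(np.abs(W - true_W).max())
```

Optimizing one shared field, rather than averaging pixels per texel, is what lets the fusion handle views that disagree slightly after decoding.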

Comparisons to TEXTure

FID and User Study Results.

We compare TexFusion to TEXTure, a text-driven texture generation method that also uses Stable Diffusion with depth conditioning.

In terms of local 3D consistency (consistency in neighborhoods on the surface of the mesh), textures produced by TexFusion are locally consistent: there are no visible seam lines or stitching artifacts. In contrast, we often find severe artifacts when viewing the top and back sides of the outputs of TEXTure. These artifacts are most noticeable when a clean color is expected, such as when texturing vehicles.

In terms of global consistency (semantic coherency of the entire texture, e.g., exactly two eyes and one nose on a face), TEXTure performs poorly and suffers from problems similar to DreamFusion's Janus face problem. This problem is ameliorated in TexFusion due to frequent communication between views in SIMS.

Quantitatively, our method achieves lower FID than TEXTure when measured against images generated by depth-conditioned image diffusion models for the same mesh and text prompt. TexFusion is also overall preferred by users in a user study.


Fine Material Details via deterministic DDIM

TexFusion can be configured to hallucinate more material details by using deterministic DDIM. This is particularly beneficial on smooth/low-poly geometries, but can sometimes reduce cleanliness.
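For reference, a single DDIM update can be written as below. This is a generic sketch of the standard DDIM step (Song et al.), not TexFusion's sampler: with `eta = 0` no fresh noise is injected, so the update is fully deterministic and repeated sampling retraces the same trajectory.

```python
import numpy as np

def ddim_step(x_t, eps_pred, a_bar_t, a_bar_prev, eta=0.0, rng=None):
    """One generic DDIM update. eta interpolates between deterministic DDIM
    (eta=0) and DDPM-like stochastic sampling (eta=1). `rng` is only needed
    when eta > 0."""
    # Predicted clean sample from the noise estimate.
    x0 = (x_t - np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(a_bar_t)
    # Noise scale of the step; exactly zero when eta = 0.
    sigma = (eta * np.sqrt((1.0 - a_bar_prev) / (1.0 - a_bar_t))
                 * np.sqrt(1.0 - a_bar_t / a_bar_prev))
    # Direction pointing back toward x_t.
    dir_xt = np.sqrt(1.0 - a_bar_prev - sigma**2) * eps_pred
    noise = sigma * rng.standard_normal(x_t.shape) if eta > 0 else 0.0
    return np.sqrt(a_bar_prev) * x0 + dir_xt + noise

# With eta = 0, two calls on identical inputs give identical outputs.
x = np.array([0.3, -0.7])
eps = np.array([0.1, 0.2])
out1 = ddim_step(x, eps, a_bar_t=0.5, a_bar_prev=0.8)
out2 = ddim_step(x, eps, a_bar_t=0.5, a_bar_prev=0.8)
```

Determinism is what makes the sampler's per-view trajectories repeatable, which in turn tends to sharpen fine material detail at the cost of some stochastic smoothing.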

Capturing Geometric Details with ControlNet

By using ControlNet as the diffusion backbone for TexFusion, textures generated by TexFusion adhere more closely to the geometric details of the input mesh, since ControlNet uses high-resolution depth conditioning.

We further compare TexFusion + ControlNet in “normal mode” (applying classifier-free guidance to the text prompt only) and ControlNet’s “guess mode” (applying classifier-free guidance to both text and depth) on meshes with fine geometric details. TexFusion produces high-contrast textures with the appearance of strong lighting under “guess mode”, and realistic textures with smooth lighting in “normal mode”.
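The difference between the two modes comes down to which conditions the unconditional branch of classifier-free guidance drops. The sketch below uses a hypothetical stand-in for the ControlNet noise predictor (real usage would call the diffusion UNet); only the two guidance combinations are the point.

```python
import numpy as np

def eps_model(x, text=None, depth=None):
    """Hypothetical stand-in for the ControlNet noise predictor: it just adds
    its conditioning signals so the two modes produce different outputs."""
    out = 0.1 * x
    if text is not None:
        out = out + text
    if depth is not None:
        out = out + depth
    return out

def guided_eps(x, text, depth, w=7.5, guess_mode=False):
    cond = eps_model(x, text=text, depth=depth)
    if guess_mode:
        # "Guess mode": the unconditional branch drops BOTH text and depth,
        # so guidance amplifies the depth signal as well -> higher contrast.
        uncond = eps_model(x)
    else:
        # "Normal mode": depth stays in both branches, so guidance applies
        # to the text prompt only -> smoother, more realistic shading.
        uncond = eps_model(x, depth=depth)
    return uncond + w * (cond - uncond)

x = np.zeros(4)
text, depth = np.full(4, 0.2), np.full(4, 0.5)
normal = guided_eps(x, text, depth, guess_mode=False)
guess = guided_eps(x, text, depth, guess_mode=True)
```

In guess mode the depth term gets multiplied by the guidance weight too, which is consistent with the stronger, higher-contrast look described above.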


@inproceedings{cao2023texfusion,
    title = {TexFusion: Synthesizing 3D Textures with Text-Guided Image Diffusion Models},
    author = {Cao, Tianshi and Kreis, Karsten and Fidler, Sanja and Sharp, Nicholas and Yin, KangXue},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    year = {2023}
}