Magic3D: High-Resolution Text-to-3D Content Creation

Chen-Hsuan Lin*
Jun Gao*
Luming Tang*
Towaki Takikawa*
Xiaohui Zeng*
Xun Huang
Karsten Kreis
Sanja Fidler†
Ming-Yu Liu†
Tsung-Yi Lin
*† : equal contributions
NVIDIA Corporation
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023
HIGHLIGHT
Paper (arXiv)

Magic3D is a new text-to-3D content creation tool that creates 3D mesh models with unprecedented quality. Together with image conditioning techniques and a prompt-based editing approach, we provide users with new ways to control 3D synthesis, opening up new avenues for various creative applications.

For business inquiries, please visit our website and submit the form: NVIDIA Research Licensing.

Abstract

DreamFusion has recently demonstrated the utility of a pre-trained text-to-image diffusion model to optimize Neural Radiance Fields (NeRF), achieving remarkable text-to-3D synthesis results. However, the method has two inherent limitations: (a) extremely slow optimization of NeRF and (b) low-resolution image space supervision on NeRF, leading to low-quality 3D models with a long processing time. In this paper, we address these limitations by utilizing a two-stage optimization framework. First, we obtain a coarse model using a low-resolution diffusion prior and accelerate the optimization with a sparse 3D hash grid structure. Using the coarse representation as the initialization, we further optimize a textured 3D mesh model with an efficient differentiable renderer interacting with a high-resolution latent diffusion model. Our method, dubbed Magic3D, can create high-quality 3D mesh models in 40 minutes, which is 2× faster than DreamFusion (reportedly taking 1.5 hours on average), while also achieving higher resolution. User studies show that 61.7% of raters prefer our approach over DreamFusion. Together with the image-conditioned generation capabilities, we provide users with new ways to control 3D synthesis, opening up new avenues for various creative applications.



High-Resolution 3D Meshes

Magic3D can create high-quality 3D textured mesh models from input text prompts. It utilizes a coarse-to-fine strategy leveraging both low- and high-resolution diffusion priors for learning the 3D representation of the target content. Magic3D synthesizes 3D content with 8× higher-resolution supervision than DreamFusion while also being 2× faster.
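As a quick sanity check on the two headline ratios, the arithmetic can be written out as below. The timings come from the abstract. The supervision resolutions (64×64 for DreamFusion, 512×512 for Magic3D) are not stated on this page and are an assumption here, taken to be the values behind the "8×" figure; all variable names are illustrative.

```python
# Timings quoted in the abstract.
magic3d_minutes = 40
dreamfusion_minutes = 1.5 * 60  # "1.5 hours on average"

# Speedup of Magic3D relative to DreamFusion.
speedup = dreamfusion_minutes / magic3d_minutes

# ASSUMPTION: per-side supervision resolutions, not given on this page.
dreamfusion_res = 64
magic3d_res = 512

# Ratio per image side, and the corresponding ratio in pixel count.
side_ratio = magic3d_res // dreamfusion_res
pixel_ratio = side_ratio ** 2

print(f"speedup: {speedup:.2f}x")
print(f"resolution per side: {side_ratio}x, pixels: {pixel_ratio}x")
```

Under these assumptions the "2× faster" claim is a rounded-down figure, and "8× higher resolution" refers to image side length rather than pixel count.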

[...] indicates helper captions added to improve quality, e.g. "A DSLR photo of".

A beautiful dress made out of garbage bags, on a mannequin. Studio lighting, high quality, high resolution.
A blue poison-dart frog sitting on a water lily.
[...] a car made out of sushi.
[...] a bagel filled with cream cheese and lox.
[...] an ice cream sundae.
[...] a peacock on a surfboard.
[...] a plate piled high with chocolate chip cookies.
[...] Neuschwanstein Castle, aerial view.
[...] the Imperial State Crown of England.
[...] the leaning tower of Pisa, aerial view.
A ripe strawberry.
A silver platter piled high with fruits.
[...] a silver candelabra sitting on a red velvet tablecloth, only one candle is lit.
[...] Sydney opera house, aerial view.
Michelangelo style statue of an astronaut.

Prompt-based Editing

Given a coarse model generated with a base text prompt, we can modify parts of the text in the prompt, and then fine-tune the NeRF and 3D mesh models to obtain an edited high-resolution 3D mesh.

A squirrel wearing a leather jacket riding a motorcycle.
A bunny riding a scooter.
A fairy riding a bike.
A steampunk squirrel riding a horse.
A baby bunny sitting on top of a stack of pancakes.
A lego bunny sitting on top of a stack of books.
A metal bunny sitting on top of a stack of broccoli.
A metal bunny sitting on top of a stack of chocolate cookies.

Other Editing Capabilities

Given input images of a subject instance, we can fine-tune the diffusion models with DreamBooth and optimize the 3D models with the given prompts. The identity of the subject is well preserved in the 3D models.

We can also condition the diffusion model (eDiff-I) on an input image to transfer its style to the output 3D model.


Approach

We utilize a two-stage coarse-to-fine optimization framework for fast and high-quality text-to-3D content creation. In the first stage, we obtain a coarse model using a low-resolution diffusion prior and accelerate the optimization with a hash grid and a sparse acceleration structure. In the second stage, we use a textured mesh model initialized from the coarse neural representation, allowing optimization with an efficient differentiable renderer interacting with a high-resolution latent diffusion model.
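The control flow of such a two-stage scheme can be illustrated with a deliberately small toy. This is not the Magic3D implementation: there is no NeRF, mesh, renderer, or diffusion model here. A 1D list of values stands in for the 3D representation, a sampled sine wave stands in for the supervision signal, and every function name is hypothetical. The sketch only shows why initializing the fine stage from a cheap coarse result leaves less work for the expensive high-resolution stage.

```python
import math

def supervision(n):
    """Hypothetical stand-in for supervision at resolution n (a sampled sine wave).
    In Magic3D this role is played by a diffusion prior, which is NOT modeled here."""
    return [math.sin(2 * math.pi * (i + 0.5) / n) for i in range(n)]

def mse(params, target):
    """Mean squared error between parameters and target."""
    return sum((p - t) ** 2 for p, t in zip(params, target)) / len(params)

def optimize(params, target, steps, lr=0.5):
    """Gradient descent on 0.5 * (p - t)^2 for each element."""
    params = list(params)
    for _ in range(steps):
        params = [p - lr * (p - t) for p, t in zip(params, target)]
    return params

def upsample(params, factor):
    """Nearest-neighbour upsampling: repeat each coarse value `factor` times."""
    return [p for p in params for _ in range(factor)]

COARSE_RES, FINE_RES = 8, 64

# Stage 1: many cheap steps on a low-resolution representation.
coarse = optimize([0.0] * COARSE_RES, supervision(COARSE_RES), steps=20)

# Stage 2: initialize the high-resolution representation from the coarse
# result, then refine it with a few steps of high-resolution supervision.
fine_init = upsample(coarse, FINE_RES // COARSE_RES)
fine = optimize(fine_init, supervision(FINE_RES), steps=5)

loss_from_scratch = mse([0.0] * FINE_RES, supervision(FINE_RES))
loss_from_coarse = mse(fine_init, supervision(FINE_RES))
final_loss = mse(fine, supervision(FINE_RES))

print(f"fine-stage starting loss, zero init:   {loss_from_scratch:.4f}")
print(f"fine-stage starting loss, coarse init: {loss_from_coarse:.4f}")
print(f"fine-stage loss after refinement:      {final_loss:.6f}")
```

In this toy, the fine stage starts much closer to its target when seeded from the coarse result, which mirrors the motivation for initializing the textured mesh from the coarse neural representation.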


Citation
@inproceedings{lin2023magic3d,
  title={Magic3D: High-Resolution Text-to-3D Content Creation},
  author={Lin, Chen-Hsuan and Gao, Jun and Tang, Luming and Takikawa, Towaki and Zeng, Xiaohui and Huang, Xun and Kreis, Karsten and Fidler, Sanja and Liu, Ming-Yu and Lin, Tsung-Yi},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition ({CVPR})},
  year={2023}
}