Magic3D: High-Resolution Text-to-3D Content Creation

Chen-Hsuan Lin*
Jun Gao*
Luming Tang*
Towaki Takikawa*
Xiaohui Zeng*
Xun Huang
Karsten Kreis
Sanja Fidler†
Ming-Yu Liu†
Tsung-Yi Lin
*† : equal contributions
NVIDIA Corporation
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023
HIGHLIGHT
Paper (arXiv)

Magic3D is a new text-to-3D content creation tool that creates 3D mesh models with unprecedented quality. Together with image conditioning techniques and a prompt-based editing approach, we provide users with new ways to control 3D synthesis, opening up new avenues for various creative applications.

Our latest text-to-3D models will be available through NVIDIA Picasso, our generative AI cloud service.
Please sign up to be notified of availability.

Abstract

DreamFusion has recently demonstrated the utility of a pre-trained text-to-image diffusion model to optimize Neural Radiance Fields (NeRF), achieving remarkable text-to-3D synthesis results. However, the method has two inherent limitations: (a) extremely slow optimization of NeRF and (b) low-resolution image space supervision on NeRF, leading to low-quality 3D models with a long processing time. In this paper, we address these limitations by utilizing a two-stage optimization framework. First, we obtain a coarse model using a low-resolution diffusion prior and accelerate it with a sparse 3D hash grid structure. Using the coarse representation as the initialization, we further optimize a textured 3D mesh model with an efficient differentiable renderer interacting with a high-resolution latent diffusion model. Our method, dubbed Magic3D, can create high-quality 3D mesh models in 40 minutes, which is 2× faster than DreamFusion (reportedly taking 1.5 hours on average), while also achieving higher resolution. User studies show that 61.7% of raters prefer our approach over DreamFusion. Together with the image-conditioned generation capabilities, we provide users with new ways to control 3D synthesis, opening up new avenues for various creative applications.


Video

High-Resolution 3D Meshes

Magic3D can create high-quality 3D textured mesh models from input text prompts. It utilizes a coarse-to-fine strategy leveraging both low- and high-resolution diffusion priors for learning the 3D representation of the target content. Magic3D synthesizes 3D content with 8× higher-resolution supervision than DreamFusion while also being 2× faster.
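The two headline factors can be checked with a few lines of arithmetic. The timings (40 minutes and 1.5 hours) are the ones quoted in the abstract. The image resolutions of 64 and 512 pixels per side are assumptions made here for illustration and are not stated on this page.

```python
# Timings quoted in the abstract.
dreamfusion_minutes = 1.5 * 60   # "reportedly taking 1.5 hours on average"
magic3d_minutes = 40

speedup = dreamfusion_minutes / magic3d_minutes
print(f"speedup: {speedup:.2f}x")  # 2.25x, reported as roughly 2x

# ASSUMPTION: supervision resolutions (pixels per side); not given on this page.
dreamfusion_res = 64
magic3d_res = 512

linear_ratio = magic3d_res // dreamfusion_res
pixel_ratio = linear_ratio ** 2
print(f"resolution ratio per side: {linear_ratio}x")
print(f"pixel count ratio: {pixel_ratio}x")
```

Under these assumed resolutions, "8× higher resolution" refers to the side length of the supervised image, which corresponds to 64 times as many pixels per rendered view.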

[...] indicates helper captions added to improve quality, e.g. "A DSLR photo of".

A beautiful dress made out of garbage bags, on a mannequin. Studio lighting, high quality, high resolution.
A blue poison-dart frog sitting on a water lily.
[...] a car made out of sushi.
[...] a bagel filled with cream cheese and lox.
[...] an ice cream sundae.
[...] a peacock on a surfboard.
[...] a plate piled high with chocolate chip cookies.
[...] Neuschwanstein Castle, aerial view.
[...] the Imperial State Crown of England.
[...] the leaning tower of Pisa, aerial view.
A ripe strawberry.
A silver platter piled high with fruits.
[...] a silver candelabra sitting on a red velvet tablecloth, only one candle is lit.
[...] Sydney opera house, aerial view.
Michelangelo style statue of an astronaut.

Prompt-based Editing

Given a coarse model generated with a base text prompt, we can modify parts of the text in the prompt, and then fine-tune the NeRF and 3D mesh models to obtain an edited high-resolution 3D mesh.
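The editing workflow described above can be sketched as a small data structure: a base prompt, an edited prompt obtained by swapping phrases, and the two fine-tuning stages that follow. The names `EditJob` and `make_edit_job` are hypothetical and do not come from the Magic3D code. Treating the pancake prompt as the base prompt is also an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditJob:
    """Hypothetical record describing one prompt-based edit."""
    base_prompt: str
    edited_prompt: str
    stages: tuple


def make_edit_job(base_prompt, replacements):
    """Swap phrases in the base prompt and list the fine-tuning stages."""
    edited = base_prompt
    for old, new in replacements.items():
        if old not in edited:
            raise ValueError(f"{old!r} not found in prompt")
        edited = edited.replace(old, new, 1)
    # Stage order follows the description: NeRF first, then the 3D mesh.
    return EditJob(base_prompt, edited, ("fine-tune NeRF", "fine-tune 3D mesh"))


job = make_edit_job(
    "A baby bunny sitting on top of a stack of pancakes.",
    {"baby bunny": "metal bunny", "pancakes": "broccoli"},
)
print(job.edited_prompt)
print(job.stages)
```

The coarse model trained on the base prompt is reused as the starting point, so only the modified phrases need to be expressed during fine-tuning.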

A squirrel wearing a leather jacket riding a motorcycle.
A bunny riding a scooter.
A fairy riding a bike.
A steampunk squirrel riding a horse.
A baby bunny sitting on top of a stack of pancakes.
A lego bunny sitting on top of a stack of books.
A metal bunny sitting on top of a stack of broccoli.
A metal bunny sitting on top of a stack of chocolate cookies.

Other Editing Capabilities

Given input images for a subject instance, we can fine-tune the diffusion models with DreamBooth and optimize the 3D models with the given prompts. The identity of the subject can be well-preserved in the 3D models.

We can also condition the diffusion model (eDiff-I) on an input image to transfer its style to the output 3D model.


Approach

We utilize a two-stage coarse-to-fine optimization framework for fast and high-quality text-to-3D content creation. In the first stage, we obtain a coarse model using a low-resolution diffusion prior and accelerate this with a hash grid and sparse acceleration structure. In the second stage, we use a textured mesh model initialized from the coarse neural representation, allowing optimization with an efficient differentiable renderer interacting with a high-resolution latent diffusion model.
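The coarse-to-fine schedule can be illustrated with a toy NumPy sketch. This is not the Magic3D implementation. It assumes the following simplifications: the "3D representation" is just a 2D image, the renderer is the identity, the diffusion prior is replaced by a hypothetical oracle noise predictor that knows the target image, and the update is a score-distillation-style step with unit weighting. All function names are invented for this example.

```python
import numpy as np


def toy_noise_predictor(x_t, sigma, target):
    # Hypothetical stand-in for a diffusion model: if all data mass sits on
    # `target`, the ideal noise estimate for x_t = x + sigma * eps is this.
    return (x_t - target) / sigma


def guided_step(image, target, rng, lr=0.1, sigma=0.5):
    # Score-distillation-style update (assumed form, weighting w(t) = 1):
    # perturb the image, predict the noise, step along (eps_hat - eps).
    eps = rng.standard_normal(image.shape)
    x_t = image + sigma * eps
    eps_hat = toy_noise_predictor(x_t, sigma, target)
    return image - lr * (eps_hat - eps)


def downsample(img, factor):
    # Average-pool by an integer factor.
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def upsample(img, factor):
    # Nearest-neighbour upsampling by an integer factor.
    return np.kron(img, np.ones((factor, factor)))


def coarse_to_fine(target_hi, factor=8, coarse_steps=50, fine_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    # Stage 1: optimize a coarse model under low-resolution guidance.
    target_lo = downsample(target_hi, factor)
    coarse = np.zeros_like(target_lo)
    for _ in range(coarse_steps):
        coarse = guided_step(coarse, target_lo, rng)
    # Stage 2: initialize the fine model from the coarse one, then refine
    # under high-resolution guidance.
    fine = upsample(coarse, factor)
    for _ in range(fine_steps):
        fine = guided_step(fine, target_hi, rng)
    return coarse, fine


target = np.random.default_rng(1).random((64, 64))
coarse, fine = coarse_to_fine(target)
print(coarse.shape, fine.shape)
print(float(np.abs(fine - target).max()))
```

In this toy setting the second stage starts from an already reasonable estimate rather than from scratch, which is the role the coarse model plays as the initialization for the textured mesh.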


Presentation

Poster

Citation
@inproceedings{lin2023magic3d,
  title={Magic3D: High-Resolution Text-to-3D Content Creation},
  author={Lin, Chen-Hsuan and Gao, Jun and Tang, Luming and Takikawa, Towaki and Zeng, Xiaohui and Huang, Xun and Kreis, Karsten and Fidler, Sanja and Liu, Ming-Yu and Lin, Tsung-Yi},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition ({CVPR})},
  year={2023}
}