Outputs from our model. The input block worlds are shown as insets.

Abstract

GANcraft aims to solve the world-to-world translation problem. Given a semantically labeled block world, such as those from the popular game Minecraft, GANcraft converts it to a new world that shares the same layout but with added photorealism. The new world can then be rendered from arbitrary viewpoints to produce images and videos that are both view-consistent and photorealistic. GANcraft simplifies the 3D modeling of complex landscape scenes, which would otherwise require years of expertise. GANcraft essentially turns every Minecraft player into a 3D artist!

Semantically Labeled Block World
Photorealistic Rendering
Block world to photorealistic rendering

Summary Videos

ICCV 2021 Oral Presentation Video

Previous Summary Video

Problem & Approach

The "Why don't you just use im2im translation?" Question

As ground truth photorealistic renderings for a user-created block world simply do not exist, we have to train models with indirect supervision. Some existing approaches are strong candidates. For example, one can use image-to-image translation (im2im) methods such as MUNIT and SPADE, originally trained on 2D data only, to convert per-frame segmentation masks projected from the block world into realistic-looking images. One can also use wc-vid2vid, a 3D-aware method, to generate view-consistent images through 2D inpainting and 3D warping, using the voxel surfaces as the 3D geometry. Because these models require paired training data, they have to be trained to translate real segmentation maps to real images, and are then applied to Minecraft-to-real translation. As yet another alternative, one can train a NeRF-W, which learns a 3D radiance field from an image collection that is posed and 3D-consistent but not photometrically consistent.

Comparing the results from different methods, we can immediately notice a few issues:

In the last column, we present results from GANcraft, which are both view-consistent and high-quality.

Technical Innovations

Distribution Mismatch and Pseudo-Ground Truth

Assume that we have a suitable voxel-conditional neural rendering model that is capable of representing the photorealistic world. We still need a way to train it without any ground truth posed images. Adversarial training has achieved some success in small-scale, unconditional neural rendering tasks where posed images are not available. However, for GANcraft the problem is even more challenging.

No pseudo-ground truth


W/ pseudo-ground truth

As shown in the first row, adversarial training using internet photos leads to unrealistic results, due to the complexity of the task. Producing and using pseudo-ground truths for training is one of the main contributions of our work, and significantly improves the result (second row).

Generating pseudo-ground truths

Hybrid Voxel-Conditional Neural Rendering

In GANcraft, we represent the photorealistic scene with a combination of a 3D volumetric renderer and a 2D image-space renderer. We define a voxel-bounded neural radiance field: given a block world, we assign a learnable feature vector to every corner of the blocks, and use trilinear interpolation to define the location code at arbitrary locations within a voxel.
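The trilinear interpolation step can be illustrated with a minimal sketch. This is not the GANcraft implementation: the function name `trilinear_location_code`, the dictionary layout of `corner_features`, and the plain Python lists standing in for learnable feature vectors are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Corner = Tuple[int, int, int]


def trilinear_location_code(
    corner_features: Dict[Corner, List[float]],
    voxel: Corner,
    point: Tuple[float, float, float],
) -> List[float]:
    """Blend the 8 corner feature vectors of one voxel at a query point.

    corner_features maps integer corner coordinates to feature vectors
    (hypothetical stand-ins for the learnable per-corner features).
    voxel is the integer index of the voxel's lowest corner, and point
    is assumed to lie inside that unit-sized voxel.
    """
    i, j, k = voxel
    # Fractional offsets of the query point inside the voxel, each in [0, 1].
    u, v, w = point[0] - i, point[1] - j, point[2] - k

    dim = len(corner_features[(i, j, k)])
    code = [0.0] * dim
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # Each corner's weight is the product of per-axis linear weights.
                weight = (
                    (u if dx else 1.0 - u)
                    * (v if dy else 1.0 - v)
                    * (w if dz else 1.0 - w)
                )
                feature = corner_features[(i + dx, j + dy, k + dz)]
                for c in range(dim):
                    code[c] += weight * feature[c]
    return code


# Toy example: each corner stores its own coordinates as its feature,
# so the interpolated code should reproduce the query point.
toy_features = {
    (dx, dy, dz): [float(dx), float(dy), float(dz)]
    for dx in (0, 1)
    for dy in (0, 1)
    for dz in (0, 1)
}
toy_code = trilinear_location_code(toy_features, (0, 0, 0), (0.25, 0.5, 0.75))
```

Because the eight weights are non-negative and sum to one, the location code varies continuously inside a voxel and matches the stored feature exactly at each corner.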

The complete GANcraft architecture


Capabilities

The generation process of GANcraft is conditional on a style image. During training, we use the pseudo-ground truth as the style image. During evaluation, we can control the output style by providing GANcraft with different style images. In the example below, we linearly interpolate the style code across 6 different style images.

Interpolation between multiple styles
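The style interpolation described above can be sketched as follows. This is an illustrative sketch only: it assumes each style image has already been encoded into a fixed-length style code, represented here by placeholder vectors, and it assumes the interpolation walks through the codes pairwise in order. The function name `interpolate_style_codes` and the parameter `t` are hypothetical.

```python
from typing import List, Sequence


def interpolate_style_codes(
    style_codes: Sequence[Sequence[float]], t: float
) -> List[float]:
    """Piecewise-linear interpolation through an ordered list of style codes.

    t = 0.0 returns the first code and t = 1.0 returns the last one;
    values in between blend the two neighbouring codes linearly.
    """
    n = len(style_codes)
    if n == 0:
        raise ValueError("at least one style code is required")
    if n == 1:
        return list(style_codes[0])

    t = min(max(t, 0.0), 1.0)        # clamp to the valid range
    position = t * (n - 1)           # position along the chain of codes
    index = min(int(position), n - 2)
    alpha = position - index         # blend factor between the two neighbours

    start, end = style_codes[index], style_codes[index + 1]
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(start, end)]


# Six placeholder style codes standing in for six encoded style images.
example_codes = [[float(i), 10.0 * i] for i in range(6)]
midpoint_code = interpolate_style_codes(example_codes, 0.5)
```

Sweeping `t` from 0 to 1 while keeping the camera path fixed would yield one style code per frame, which is one plausible way to produce a smooth transition between styles.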

Results

Summary

Citation

@inproceedings{hao2021GANcraft,
  title={GANcraft: Unsupervised 3D Neural Rendering of Minecraft Worlds},
  author={Hao, Zekun and Mallya, Arun and Belongie, Serge and Liu, Ming-Yu},
  booktitle={IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2021}
}