SceneDreamer: Unbounded 3D Scene Generation from 2D Image Collections
TPAMI 2023

TL;DR: SceneDreamer learns to generate unbounded 3D scenes from in-the-wild 2D image collections.
Our method synthesizes diverse landscapes across different styles, with 3D consistency, well-defined depth, and free camera trajectories.

Abstract

In this work, we present SceneDreamer, an unconditional generative model for unbounded 3D scenes, which synthesizes large-scale 3D landscapes from random noise. Our framework is learned from in-the-wild 2D image collections only, without any 3D annotations. At the core of SceneDreamer is a principled learning paradigm comprising 1) an efficient yet expressive 3D scene representation, 2) a generative scene parameterization, and 3) an effective renderer that can leverage the knowledge from 2D images. Our framework starts from an efficient bird's-eye-view (BEV) representation generated from simplex noise, which consists of a height field and a semantic field. The height field represents the surface elevation of 3D scenes, while the semantic field provides detailed scene semantics. This BEV scene representation enables 1) representing a 3D scene with quadratic complexity, 2) disentangled geometry and semantics, and 3) efficient training. Furthermore, we propose a novel generative neural hash grid to parameterize the latent space given 3D positions and the scene semantics, which aims to encode generalizable features across scenes and align content. Lastly, a neural volumetric renderer, learned from 2D image collections through adversarial training, is employed to produce photorealistic images. Extensive experiments demonstrate the effectiveness of SceneDreamer and its superiority over state-of-the-art methods in generating vivid yet diverse unbounded 3D worlds.
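To make the BEV representation concrete, below is a minimal NumPy sketch of how a height field and a semantic field could be derived from procedural noise and queried at a 3D point. The fractal value noise (a stand-in for the simplex noise used in the paper), the elevation thresholds, and all function names are illustrative assumptions rather than the released implementation; the point is that two 2D maps stand in for a full 3D volume, which is where the quadratic (rather than cubic) complexity comes from.

```python
# Minimal NumPy sketch of the BEV scene representation. All names, thresholds,
# and the fractal value noise (a stand-in for simplex noise) are assumptions.
import numpy as np


def fractal_noise(size: int, octaves: int = 4, seed: int = 0) -> np.ndarray:
    """Cheap fractal value noise used here only as a stand-in for simplex noise."""
    rng = np.random.default_rng(seed)
    noise = np.zeros((size, size), dtype=np.float32)
    for o in range(octaves):
        res = 2 ** (o + 2)                                   # coarse grid resolution per octave
        coarse = rng.random((res, res)).astype(np.float32)
        reps = int(np.ceil(size / res))                      # upsample to full resolution
        layer = np.kron(coarse, np.ones((reps, reps), dtype=np.float32))[:size, :size]
        noise += layer / (2 ** o)                            # higher octaves contribute less
    return (noise - noise.min()) / (np.ptp(noise) + 1e-8)    # normalize to [0, 1]


def build_bev(size: int = 512, seed: int = 0):
    """Return the two 2D maps of the BEV representation: height field and semantic field."""
    height = fractal_noise(size, seed=seed)                  # surface elevation in [0, 1]
    # Toy semantic labels from elevation bands: 0=water, 1=grass, 2=rock, 3=snow.
    semantics = np.digitize(height, bins=[0.3, 0.6, 0.85]).astype(np.int64)
    return height, semantics


def query_point(x: int, y: int, z: float, height: np.ndarray, semantics: np.ndarray):
    """Query a 3D point against two 2D maps: O(n^2) storage instead of an O(n^3) volume."""
    occupied = z <= height[y, x]                             # below the surface -> solid
    return occupied, semantics[y, x]                         # occupancy and scene semantics


if __name__ == "__main__":
    h, s = build_bev(size=256)
    print(query_point(x=100, y=200, z=0.5, height=h, semantics=s))
```

Because occupancy and semantics are read from these 2D maps, enlarging the scene extent only grows the representation quadratically, which keeps training on large landscapes tractable.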


Gallery

We recommend entering full screen for better visual quality.



Framework


Given simplex noise and a style code as input, our model synthesizes large-scale 3D scenes in which the camera can move freely and obtain realistic renderings. We first derive our BEV scene representation, which consists of a height field and a semantic field. Then, we use a generative neural hash grid to parameterize the hyperspace of space-varied and scene-varied latent features given the scene semantics and 3D position. Finally, a style-modulated renderer is employed to blend the latent features and render 2D images via volume rendering. The entire framework is trained end-to-end on in-the-wild 2D images.
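The sketch below illustrates the two generative components in PyTorch: a hash grid that indexes latent features by both 3D position and a scene-conditioned code, and a small style-modulated head that converts those features into density and color and alpha-composites them along rays. Layer widths, hash constants, the nearest-cell lookup, and all module names are assumptions made for brevity, not the authors' released code; in the paper the renderer is trained adversarially on 2D image collections rather than with any direct 3D supervision.

```python
# Minimal PyTorch sketch of a generative hash grid plus a style-modulated
# volume-rendering head. Layer sizes, hash constants, the nearest-cell lookup,
# and all module names are illustrative assumptions, not the released code.
import torch
import torch.nn as nn


class GenerativeHashGrid(nn.Module):
    """Hash (3D position, scene code) jointly so features vary across space and scenes."""

    def __init__(self, n_levels: int = 8, table_size: int = 2 ** 16,
                 feat_dim: int = 2, scene_dim: int = 2):
        super().__init__()
        self.n_levels, self.table_size = n_levels, table_size
        self.tables = nn.Parameter(torch.randn(n_levels, table_size, feat_dim) * 1e-2)
        # One prime per hashed coordinate (3 spatial + scene_dim); enough for scene_dim <= 2.
        primes = torch.tensor([1, 2654435761, 805459861, 3674653429, 2097192037])
        self.register_buffer("primes", primes[: 3 + scene_dim])

    def forward(self, xyz: torch.Tensor, scene_code: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) in [0, 1); scene_code: (N, scene_dim) scene-conditioned features.
        feats = []
        for level in range(self.n_levels):
            res = 16 * 2 ** level                        # resolution doubles per level
            key = torch.cat([(xyz * res).long(),         # integer cell coordinates
                             (scene_code * res).long()], dim=-1)
            idx = (key * self.primes).sum(-1) % self.table_size
            feats.append(self.tables[level][idx])        # (N, feat_dim) per level
        return torch.cat(feats, dim=-1)                  # (N, n_levels * feat_dim)


class StyleModulatedRenderer(nn.Module):
    """Map hashed features plus a style code to density/color, then alpha-composite along rays."""

    def __init__(self, feat_dim: int, style_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim + style_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 4))       # outputs: 1 density + 3 color

    def forward(self, feats, style, deltas):
        # feats: (R, S, F) samples per ray; style: (R, style_dim); deltas: sample spacing.
        style = style[:, None].expand(-1, feats.shape[1], -1)
        out = self.mlp(torch.cat([feats, style], dim=-1))
        sigma, rgb = out[..., 0].relu(), out[..., 1:].sigmoid()
        alpha = 1.0 - torch.exp(-sigma * deltas)         # per-sample opacity
        trans = torch.cumprod(torch.cat([torch.ones_like(alpha[..., :1]),
                                         1.0 - alpha + 1e-10], dim=-1), dim=-1)[..., :-1]
        weights = alpha * trans                          # standard volume-rendering weights
        return (weights[..., None] * rgb).sum(dim=-2)    # (R, 3) composited colors
```

Hashing the position and the scene code jointly is what lets a single feature table encode features that are shared across scenes yet still conditioned on each scene's BEV maps, while the style code only modulates appearance at rendering time.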

Video

Citation

Related Links

Text2Light generates HDR panoramas from text at resolutions up to 4K.
StyleLight generates HDR indoor panoramas from a limited-FOV image.
EVA3D is a 3D human generative model that requires only 2D image collections for training.

Acknowledgements

This work is supported by the National Research Foundation, Singapore under its AI Singapore Programme, NTU NAP, MOE AcRF Tier 2 (T2EP20221-0033), and under the RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).
The website template is borrowed from Mip-NeRF.