MegaScenes: Scene-Level View Synthesis at Scale

ECCV 2024
1Cornell University, 2Stanford University, 3Adobe Research
Equal Contribution

The MegaScenes Dataset is an extensive collection of structure-from-motion reconstructions and Internet images. It includes a diversity of scenes such as minarets, building interiors, statues, bridges, towers, religious buildings, and natural landscapes. The images of these scenes are captured under varying conditions, including different times of day, weather, and illumination, and with different devices that have distinct camera intrinsics.

On the task of single-image novel view synthesis (NVS), we show that training on MegaScenes leads to generalization to in-the-wild scenes. All videos shown here are generated using a single image as input, and none of the categories were seen during training.

Abstract

Scene-level novel view synthesis (NVS) is fundamental to many vision and graphics applications. Recently, pose-conditioned diffusion models have led to significant progress by extracting 3D information from 2D foundation models, but these methods are limited by the lack of scene-level training data. Common dataset choices either consist of isolated objects (Objaverse), or of object-centric scenes with limited pose distributions (DTU, CO3D). In this paper, we create a large-scale scene-level dataset from Internet photo collections, called MegaScenes, which contains over 100K SfM reconstructions from around the world. Internet photos represent a scalable data source but come with challenges such as lighting and transient objects. We address these issues to further create a subset suitable for the task of NVS. Additionally, we analyze failure cases of state-of-the-art NVS methods and significantly improve generation consistency. Through extensive experiments we validate the effectiveness of both our dataset and method on generating in-the-wild scenes.

Dataset Collection

We first identify potential scene categories from WikiData. Next, we download the images and metadata for each scene category. Finally, we reconstruct scenes using Structure from Motion (SfM) and clean the reconstructions using the Doppelgangers pipeline.
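For readers who want to reproduce the reconstruction step, the sketch below runs a minimal SfM pipeline with the pycolmap bindings to COLMAP. The paths are placeholders, exhaustive matching is shown only for brevity (large Internet collections typically require retrieval-based matching), and the Doppelgangers cleaning stage is omitted.

import pycolmap

# Placeholder paths for one downloaded scene; adjust to your layout.
image_dir = "scenes/example_landmark/images"
database_path = "scenes/example_landmark/database.db"
sfm_dir = "scenes/example_landmark/sparse"

# Extract local features and match image pairs.
pycolmap.extract_features(database_path=database_path, image_path=image_dir)
pycolmap.match_exhaustive(database_path=database_path)

# Run incremental SfM; returns a dict of reconstructed models
# (Internet collections often split into several components).
maps = pycolmap.incremental_mapping(
    database_path=database_path,
    image_path=image_dir,
    output_path=sfm_dir,
)
for idx, reconstruction in maps.items():
    print(idx, reconstruction.summary())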

Dataset Statistics

We show the distribution of the MegaScenes Dataset. On the left, we depict the frequency of scenes grouped by WikiData class. This includes only select classes with more than 3,500 scenes; note that a single scene may be an instance of multiple classes. On the right, we visualize the geospatial distribution of collected scenes worldwide.

Application: Single Image Novel View Synthesis

To explore the diversity and scale of the MegaScenes Dataset, we experiment on the task of single image novel view synthesis, where the goal is to take a reference image and generate a plausible image at a target pose. We train and evaluate on image pairs with pseudo-ground-truth relative poses obtained via SfM.
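For concreteness, the pseudo-ground-truth relative pose between two registered images follows directly from their SfM extrinsics. Below is a minimal numpy sketch, assuming COLMAP's world-to-camera convention (x_cam = R x_world + t); the function name is ours.

import numpy as np

def relative_pose(R1, t1, R2, t2):
    """Transform mapping camera-1 coordinates into camera-2 coordinates.

    R1, R2: (3, 3) world-to-camera rotations; t1, t2: (3,) translations,
    following x_cam = R @ x_world + t.
    """
    R_rel = R2 @ R1.T
    t_rel = t2 - R_rel @ t1
    return R_rel, t_rel

Note that t_rel is expressed in the arbitrary units of the SfM reconstruction, which is the source of the scale ambiguity discussed next.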

Conditioning on the Extrinsic Matrix

Simply finetuning pose-conditioned diffusion models, such as ZeroNVS, significantly improves their generalization to in-the-wild scenes. However, the depth and scale of the scene in ZeroNVS are ambiguous and require manual tuning.

These scenes are unseen during training. ZeroNVS finetuned on MegaScenes, denoted ZeroNVS (MS), demonstrates stronger generalizability. However, when there are larger translation changes, such as zooming, ZeroNVS (MS) still fails. See the paper for more examples.
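For context, pose-conditioned models in the ZeroNVS family flatten the relative camera transform, together with intrinsics and a scale term, into a vector that conditions the diffusion model. The snippet below is only a schematic assembly of such a vector, not the released model's exact interface; the fov_deg and scene_scale arguments and their ordering are illustrative assumptions.

import numpy as np

def extrinsic_conditioning(R_rel, t_rel, fov_deg, scene_scale):
    """Schematic pose-conditioning vector (not ZeroNVS's exact interface).

    Dividing the translation by a scene-scale estimate is what makes the
    conditioning sensitive to the otherwise arbitrary SfM scale.
    """
    pose = np.concatenate([R_rel.reshape(-1), t_rel / scene_scale])  # 9 + 3 values
    return np.concatenate([pose, [np.deg2rad(fov_deg)]]).astype(np.float32)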

Conditioning on Warped Images

We find that first warping the reference image into the target pose provides a strong conditioning signal: it encodes how pixels should move and is directly aligned with the scene scale. On our training and evaluation datasets, the scale is based on 3D SfM points. When given a random, in-the-wild image, we can determine the scene scale from estimated monocular depth and use the same extrinsics for both conditioning and warping, which keeps the two signals at a consistent scale.
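Below is a minimal sketch of this warping step, assuming a shared pinhole intrinsics matrix K, a per-pixel depth map for the reference image (from SfM points or a monocular estimator), and the relative pose computed as above. It performs a nearest-pixel forward splat with a z-buffer; the actual implementation may differ in interpolation and hole handling.

import numpy as np

def warp_to_target(image, depth, K, R_rel, t_rel):
    """Forward-warp `image` into the target view given per-pixel depth.

    image: (H, W, 3) uint8; depth: (H, W), in the same units as t_rel;
    K: (3, 3) pinhole intrinsics shared by both views (an assumption).
    Returns the warped image and a validity mask marking covered pixels.
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3)

    # Unproject reference pixels to 3D camera coordinates, then move them
    # into the target camera frame.
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    pts = R_rel @ pts + t_rel.reshape(3, 1)

    # Project into the target image plane.
    proj = K @ pts
    z = proj[2]
    u = np.round(proj[0] / np.maximum(z, 1e-9)).astype(int)
    v = np.round(proj[1] / np.maximum(z, 1e-9)).astype(int)
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    warped = np.zeros_like(image)
    mask = np.zeros((H, W), dtype=bool)
    zbuf = np.full((H, W), np.inf)
    src = image.reshape(-1, 3)

    # Simple (slow) per-pixel splat; keep only the nearest surface.
    for i in np.flatnonzero(valid):
        if z[i] < zbuf[v[i], u[i]]:
            zbuf[v[i], u[i]] = z[i]
            warped[v[i], u[i]] = src[i]
            mask[v[i], u[i]] = True
    return warped, mask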

Evaluation

We evaluate on MegaScenes’ test set, which consists of in-the-wild scenes from Internet photos. Here, we show comparisons between four models:

1. SD-inpainting: a Stable Diffusion inpainting model without any finetuning.
2. ZeroNVS (released): the released ZeroNVS checkpoint.
3. ZeroNVS (MS): ZeroNVS finetuned on MegaScenes.
4. Ours: finetuned from ZeroNVS on MegaScenes, and conditioned on both the extrinsic matrices and the warped images.

See the paper for more evaluations and baselines.

Discussion

MegaScenes is a general large-scale 3D dataset, and we foresee a variety of 3D-related applications that could benefit from it, such as pose estimation, feature matching, and reconstruction. In this paper we focus on NVS as a representative application and find that MegaScenes indeed supports training generalizable 3D models.

Acknowledgments

We thank Brandon Li for building the COLMAP webviewer. This work was funded in part by the National Science Foundation (IIS-2008313, IIS-2211259, IIS-2212084). Gene Chou was funded by an NSF Graduate Research Fellowship.

BibTeX


@inproceedings{tung2024megascenes,
  title={MegaScenes: Scene-Level View Synthesis at Scale},
  author={Tung, Joseph and Chou, Gene and Cai, Ruojin and Yang, Guandao and Zhang, Kai and Wetzstein, Gordon and Hariharan, Bharath and Snavely, Noah},
  booktitle={ECCV},
  year={2024}
}