3D Photo Stylization
Learning to Generate Stylized Novel Views from a Single Image
CVPR 2022 (Oral Presentation)

  • 1 University of Wisconsin-Madison
  • 2 Snap Research
  • co-corresponding authors

Abstract

Visual content creation has spurred soaring interest given its applications in mobile photography and AR/VR. Style transfer and single-image 3D photography, two representative tasks, have so far evolved independently. In this paper, we make a connection between the two and address the challenging task of 3D photo stylization: generating stylized novel views from a single image given an arbitrary style. Our key intuition is that style transfer and view synthesis must be jointly modeled for this task. To this end, we propose a deep model that learns geometry-aware content features for stylization from a point cloud representation of the scene, resulting in high-quality stylized images that are consistent across views. Further, we introduce a novel training protocol that enables learning using only 2D images. We demonstrate the superiority of our method via extensive qualitative and quantitative studies, and showcase key applications of our method in light of the growing demand for 3D content creation from 2D image assets.

Video Presentation

Video Demo

Method Overview


[Figure: overview of the 3D photo stylization pipeline]

Central to our method is a point-cloud-based scene representation that enables geometry-aware feature learning, attention-based feature stylization, and consistent stylized renderings across views. Specifically, our method proceeds as follows (illustrative code sketches for several of these steps appear after the list):

  1. Estimate a dense depth map from the input content image using a monocular depth estimation model.
  2. Synthesize occluded geometry of the scene by layered depth inpainting.
  3. Construct an RGB point cloud from the layered depth image (LDI) via perspective back-projection.
  4. Extract geometry-aware features using an efficient graph convolutional network (GCN).
  5. Modulate point features through point-to-pixel adaptive attention normalization (AdaAttN) given a style image.
  6. Rasterize the modulated point features to a novel view given camera pose and intrinsics.
  7. Decode rendered features into a stylized image using a 2D neural renderer.
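
To make steps 1–3 concrete, here is a minimal sketch of the perspective back-projection that lifts a dense depth map (e.g., from an off-the-shelf monocular estimator such as MiDaS) and its aligned RGB image into a colored point cloud. The function name and the pinhole intrinsics fx, fy, cx, cy are illustrative, and the layered depth inpainting of occluded regions is omitted for brevity.

```python
import numpy as np

def backproject_to_point_cloud(depth, rgb, fx, fy, cx, cy):
    """Lift a dense depth map and its aligned RGB image to a colored 3D point cloud.

    depth: (H, W) per-pixel depth values
    rgb:   (H, W, 3) colors in [0, 1]
    fx, fy, cx, cy: pinhole intrinsics of the (virtual) capture camera
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel coordinates

    # Perspective back-projection: x = (u - cx) * z / fx, y = (v - cy) * z / fy
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy

    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)  # (H*W, 3)
    colors = rgb.reshape(-1, 3)                           # (H*W, 3)
    return points, colors
```

In the full pipeline, the same back-projection is applied to every layer of the inpainted LDI rather than to a single depth map.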
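
Step 4 runs a graph convolutional network over the point cloud. The sketch below shows one EdgeConv-style layer on a k-nearest-neighbor graph; it illustrates the idea of geometry-aware feature learning on points, not the paper's exact architecture (the layer count, channel widths, and graph construction here are assumptions).

```python
import torch
import torch.nn as nn

def knn_indices(points, k):
    """Indices of the k nearest neighbors of every point. points: (N, 3)."""
    dists = torch.cdist(points, points)                      # (N, N) pairwise distances
    return dists.topk(k + 1, largest=False).indices[:, 1:]   # drop self -> (N, k)

class EdgeConvLayer(nn.Module):
    """One EdgeConv-style graph convolution over per-point features."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, feats, neighbors):
        # feats: (N, C) per-point features, neighbors: (N, k) neighbor indices
        center = feats.unsqueeze(1).expand(-1, neighbors.shape[1], -1)  # (N, k, C)
        nbr = feats[neighbors]                                          # (N, k, C)
        edge = torch.cat([center, nbr - center], dim=-1)                # (N, k, 2C)
        return self.mlp(edge).max(dim=1).values                         # (N, out_dim)

# Example usage on a back-projected point cloud (sizes are arbitrary)
points = torch.rand(1024, 3)
neighbors = knn_indices(points, k=16)
feats = EdgeConvLayer(3, 64)(points, neighbors)   # geometry-aware point features
```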
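
Step 5 modulates the point features with statistics drawn from the style image via attention. The following is a minimal sketch of point-to-pixel adaptive attention normalization: per-point queries attend over per-pixel style features, and the attention-weighted mean and standard deviation of the style values re-scale the normalized content features. Class and layer names are placeholders, and the embedding size is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointToPixelAdaAttN(nn.Module):
    """Attention-weighted statistics transfer from style (pixel) features to
    content (point) features, in the spirit of AdaAttN."""
    def __init__(self, content_dim, style_dim, embed_dim):
        super().__init__()
        self.q = nn.Linear(content_dim, embed_dim)  # queries from point features
        self.k = nn.Linear(style_dim, embed_dim)    # keys from style features
        self.v = nn.Linear(style_dim, content_dim)  # values from style features

    def forward(self, point_feats, style_feats):
        # point_feats: (N, Cc) per-point content features
        # style_feats: (M, Cs) flattened per-pixel style features
        q, k, v = self.q(point_feats), self.k(style_feats), self.v(style_feats)
        attn = F.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)   # (N, M)

        # Attention-weighted mean and standard deviation of the style values
        mean = attn @ v                                            # (N, Cc)
        std = (attn @ (v * v) - mean * mean).clamp_min(1e-6).sqrt()

        # Normalize the content features, then modulate with the style statistics
        normed = (point_feats - point_feats.mean(0)) / (point_feats.std(0) + 1e-6)
        return normed * std + mean
```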
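
Steps 6–7 rasterize the stylized point features into the target view and decode them into an image. Below is a simplified sketch that uses a hard per-pixel depth test and a tiny convolutional decoder as a stand-in for the 2D neural renderer; a full implementation would typically use a differentiable point rasterizer (e.g., PyTorch3D's) and a deeper decoder, so treat this only as an illustration of the data flow.

```python
import torch
import torch.nn as nn

def rasterize_point_features(points, feats, K, R, t, H, W):
    """Project per-point features into a (C, H, W) feature image for a target
    camera, keeping the nearest point per pixel (hard depth test, ties arbitrary).

    points: (N, 3) world-space positions, feats: (N, C) stylized point features
    K: (3, 3) intrinsics, R: (3, 3) rotation, t: (3,) translation (world -> camera)
    """
    cam = points @ R.t() + t                         # camera-space coordinates
    z = cam[:, 2].clamp_min(1e-6)
    proj = cam @ K.t()
    u = (proj[:, 0] / z).round().long()
    v = (proj[:, 1] / z).round().long()

    keep = (u >= 0) & (u < W) & (v >= 0) & (v < H)   # points landing in the frame
    u, v, z, f = u[keep], v[keep], z[keep], feats[keep]
    pix = v * W + u                                  # flattened pixel index per point

    # Per-pixel depth test: keep the nearest point that hits each pixel
    nearest = torch.full((H * W,), float("inf")).scatter_reduce(
        0, pix, z, reduce="amin")
    win = z <= nearest[pix] + 1e-6

    canvas = torch.zeros(feats.shape[1], H * W)
    canvas[:, pix[win]] = f[win].t()                 # splat the winning features
    return canvas.view(-1, H, W)

# Tiny stand-in for the 2D neural renderer of step 7 (the real decoder is deeper)
decoder = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
)
# stylized = decoder(rasterize_point_features(pts, feats, K, R, t, 256, 256)[None])
```

Because the stylized features live on a single 3D point cloud, rendering them into different camera poses this way yields views that are consistent by construction.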

Citation

Acknowledgements

The authors thank Shree Nayar, Hsin-Ying Lee, Menglei Chai, Kyle Olszewski and Jian Ren for fruitful discussions. The authors thank the anonymous participants of the user study. FM and YL acknowledge the support from UW VCRGE with funding from WARF. The website template was borrowed from Michaël Gharbi.