3D Photo Stylization
Learning to Generate Stylized Novel Views from a Single Image
arXiv 2021

  • 1University of Wisconsin-Madison
  • 2Snap Research
  • co-corresponding authors


Visual content creation has attracted soaring interest given its applications in mobile photography and AR/VR. Style transfer and single-image 3D photography, two representative tasks, have so far evolved independently. In this paper, we connect the two and address the challenging task of 3D photo stylization: generating stylized novel views from a single image given an arbitrary style. Our key intuition is that style transfer and view synthesis must be jointly modeled for this task. To this end, we propose a deep model that learns geometry-aware content features for stylization from a point cloud representation of the scene, resulting in high-quality stylized images that are consistent across views. Further, we introduce a novel training protocol that enables learning from 2D images alone. We demonstrate the superiority of our method via extensive qualitative and quantitative studies, and showcase key applications of our method in light of the growing demand for 3D content creation from 2D image assets.


Method Overview


Central to our method is a point-cloud-based scene representation that enables geometry-aware feature learning, attention-based feature stylization, and consistent stylized renderings across views. Specifically, our method proceeds as follows:

  1. Estimate a dense depth map from the input content image using a monocular depth estimation model.
  2. Synthesize occluded geometry of the scene by layered depth inpainting.
  3. Construct an RGB point cloud from the layered depth image (LDI) via perspective back-projection.
  4. Extract geometry-aware features using an efficient graph convolutional network (GCN).
  5. Modulate point features through point-to-pixel adaptive attention normalization (AdaAttN) given a style image.
  6. Rasterize the modulated point features to a novel view given camera pose and intrinsics.
  7. Decode rendered features into a stylized image using a 2D neural renderer.
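
The geometric core of steps 3 and 6 can be illustrated with a standard pinhole camera model. The following is a minimal sketch, not the authors' implementation: it assumes a single depth layer rather than a full LDI, and the function names (`backproject`, `project`) and intrinsics parameters (`fx`, `fy`, `cx`, `cy`) are hypothetical names chosen for illustration. It lifts each pixel to a 3D point, then projects the points into a camera with a different pose, which is where a rasterizer would splat the per-point features.

```python
import numpy as np

def backproject(depth, rgb, fx, fy, cx, cy):
    """Lift every pixel (u, v) with depth z to a camera-space 3D point.

    depth: (H, W) array of z-depths; rgb: (H, W, 3) array of colors.
    fx, fy, cx, cy: assumed pinhole intrinsics (focal lengths, principal point).
    Returns (H*W, 3) points and (H*W, 3) colors in row-major pixel order.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u varies along columns
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    return points, colors

def project(points, fx, fy, cx, cy, R, t):
    """Project 3D points into a pinhole camera with rotation R and translation t.

    Returns (N, 2) pixel coordinates and (N,) depths in the target camera.
    Points with non-positive depth lie behind the camera and should be culled
    by the caller before rasterization.
    """
    p = points @ R.T + t
    z = p[:, 2]
    u = fx * p[:, 0] / z + cx
    v = fy * p[:, 1] / z + cy
    return np.stack([u, v], axis=-1), z

# Toy usage: a flat 4x6 depth map, viewed from a camera shifted along x.
depth = np.full((4, 6), 2.0)
rgb = np.random.rand(4, 6, 3)
points, colors = backproject(depth, rgb, fx=5.0, fy=5.0, cx=2.5, cy=1.5)
uv_novel, z_novel = project(points, 5.0, 5.0, 2.5, 1.5,
                            R=np.eye(3), t=np.array([0.1, 0.0, 0.0]))
```

With the identity pose, `project` returns the original pixel grid, which is a quick sanity check on the intrinsics convention. In the pipeline described above, the per-point payload at rasterization time would be the style-modulated features rather than raw RGB.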



We thank Shree Nayar, the Creative Vision team and the Camera Platform team at Snap Research for fruitful discussions and brainstorming. We thank Abrar Majeedi, Chen-Lin Zhang and Yiwu Zhong for helpful comments on the paper draft. We thank the anonymous participants of our user study. The website template was borrowed from Michaël Gharbi.