Real-time 3D-aware Portrait Editing from a Single Image

ECCV 2024

Qingyan Bai1,2     Zifan Shi1      Yinghao Xu3     Hao Ouyang1,2     Qiuyu Wang2
Ceyuan Yang4      Xuan Wang2      Gordon Wetzstein3     Yujun Shen2     Qifeng Chen1
1 HKUST     2 Ant Group    
3 Stanford University     4 Shanghai AI Laboratory     
Overview
This work presents 3DPE, a practical method that efficiently edits a face image in a 3D-aware manner according to given prompts, such as reference images or text descriptions. To this end, a lightweight module is distilled from a 3D portrait generator and a text-to-image model, which provide prior knowledge of face geometry and superior editing capability, respectively. This design brings two compelling advantages over existing approaches. First, our method achieves real-time editing with a feedforward network (i.e., ∼0.04s per image), over 100× faster than the second-fastest competitor. Second, thanks to the powerful priors, our module can focus on learning editing-related variations, which allows it to handle various types of editing simultaneously during training and further supports fast adaptation to user-specified, customized types of editing at inference time.
Method
Motivation. Live3D Portrait (Live3D) proposes a real-time 3D inversion method built on a two-branch structure. The figure below demonstrates the disentanglement in Live3D features: we separately disable the features from the two branches, Ehigh(·) and Elow(·), when inferring the reconstructed image. Without Ehigh(·), the output retains the coarse structure but loses its appearance. Conversely, when Elow(·) is deactivated, the reconstructed portraits preserve the texture (such as the blue and purple reflection on the glasses) but fail to capture the geometry.
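Below is a minimal PyTorch-style sketch of this ablation, purely for illustration; the encoder, decoder, and function names are assumptions rather than the Live3D implementation.

import torch

@torch.no_grad()
def reconstruct(encoder_high, encoder_low, decoder, image, disable="none"):
    # encoder_high / encoder_low stand in for Live3D's Ehigh(.) and Elow(.);
    # decoder maps the two feature sets to a triplane and renders the portrait.
    f_high = encoder_high(image)  # fine texture / appearance details
    f_low = encoder_low(image)    # coarse structure / geometry
    if disable == "high":
        f_high = torch.zeros_like(f_high)  # keeps structure, loses appearance
    elif disable == "low":
        f_low = torch.zeros_like(f_low)    # keeps texture, loses geometry
    return decoder(f_high, f_low)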
Framework. Inspired by the aforementioned feature disentanglement, we propose to distill the priors of the 2D diffusion generative model and the 3D GAN for real-time 3D-aware editing. The proposed model is fine-tuned from Live3D, where the prompt features are fused with those from Ehigh(·) through cross-attention and then used to predict the triplane representation.
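A minimal sketch of such a cross-attention fusion block is shown below (PyTorch; the dimensions, module names, and residual design are assumptions, not the released architecture).

import torch
import torch.nn as nn

class PromptFusion(nn.Module):
    # Fuses prompt embeddings (from an image or text encoder) into the
    # Ehigh(.) feature map via cross-attention before triplane prediction.
    def __init__(self, feat_dim=512, prompt_dim=768, num_heads=8):
        super().__init__()
        self.to_kv = nn.Linear(prompt_dim, feat_dim)  # project prompt tokens
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, f_high, prompt_tokens):
        # f_high: (B, N, C) flattened spatial features from Ehigh(.)
        # prompt_tokens: (B, T, prompt_dim) prompt embeddings
        kv = self.to_kv(prompt_tokens)
        fused, _ = self.attn(query=f_high, key=kv, value=kv)
        return self.norm(f_high + fused)  # residual fusion; fed to the triplane head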
Results
Shown below are input images and the corresponding stylized renderings.


For qualitative comparison, we compare our results against several baselines using both image prompts and text prompts. In each case, we include the edited portraits as well as their novel-view renderings.
The figure below includes (a) testing results of customized prompt adaptation and (b) its learning process. We show the intermediate testing results at 10s, 1min, 2min, and 5min during adaptation to the style "golden statue".
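As a rough illustration of this adaptation stage, the sketch below fine-tunes the editing module for a small number of steps on paired data for the new style; the loop, loss, and names are assumptions, not the exact training recipe.

import itertools
import torch
import torch.nn.functional as F

def adapt_to_custom_prompt(model, prompt_embed, data_loader, steps=500, lr=1e-4):
    # Briefly fine-tune the editing module on (source, edited) pairs for a new
    # style so that, afterwards, the edit runs in a single forward pass.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _, (src, target) in zip(range(steps), itertools.cycle(data_loader)):
        pred = model(src, prompt_embed)    # edited portrait for the new style
        loss = F.l1_loss(pred, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model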
BibTeX
@inproceedings{bai20243dpe,
  title     = {Real-time 3D-aware Portrait Editing from a Single Image},
  author    = {Bai, Qingyan and Shi, Zifan and Xu, Yinghao and Ouyang, Hao and Wang, Qiuyu and Yang, Ceyuan and Wang, Xuan and Wetzstein, Gordon and Shen, Yujun and Chen, Qifeng},
  booktitle = {European Conference on Computer Vision},
  year      = {2024}
}
Related Work
Efficient Geometry-aware 3D Generative Adversarial Networks. Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, Gordon Wetzstein. CVPR 2022.
Comment: Proposes a hybrid explicit-implicit network that synthesizes high-resolution multi-view-consistent images in real time and also produces high-quality 3D geometry.