Motivation.
Live3D Portrait (Live3D) proposes a real-time 3D inversion method built on a two-branch encoder structure.
The figure below demonstrates the disentanglement of the Live3D features.
We separately disable the features from the two branches, E_high(·) and E_low(·), to infer the reconstructed image.
Without E_high(·), the output retains the coarse structure but loses its appearance. Conversely, when E_low(·) is deactivated, the reconstructed portraits preserve the texture (such as the blue and purple reflection on the glasses) but fail to capture the geometry.
Framework.
Inspired by the aforementioned feature disentanglement, we propose to distill the priors of a 2D diffusion generative model and a 3D GAN for real-time 3D-aware editing.
The proposed model is fine-tuned from Live3D, where the prompt features are fused with those from E_high(·) through cross-attention in order to predict the triplane representation.
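The cross-attention fusion above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the token counts, feature dimension, and the names `ehigh_feats` and `prompt_feats` are assumptions chosen for the example. Queries come from the E_high(·) feature map (flattened into spatial tokens), while keys and values come from the prompt features; the fused tokens would then feed the triplane prediction head.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: queries attend to keys/values."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)      # (Nq, Nk) attention logits
    return softmax(scores, axis=-1) @ values    # (Nq, d) fused features

# Hypothetical shapes: 256 spatial tokens from E_high(.),
# 77 prompt tokens (e.g. from a text encoder), feature dim 64.
rng = np.random.default_rng(0)
d = 64
ehigh_feats = rng.standard_normal((256, d))   # queries: encoder branch features
prompt_feats = rng.standard_normal((77, d))   # keys/values: prompt features

fused = cross_attention(ehigh_feats, prompt_feats, prompt_feats)
print(fused.shape)  # (256, 64): one fused token per spatial location
```

In a full model, separate learned projections would map the inputs to queries, keys, and values before the attention step; they are omitted here to keep the sketch short.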
Comment: Proposes a hybrid explicit-implicit network that synthesizes high-resolution multi-view-consistent images in real time and also produces high-quality 3D geometry.