Edicho: Consistent Image Editing in the Wild

Qingyan Bai1, 2     Hao Ouyang2     Yinghao Xu3     Qiuyu Wang2
Ceyuan Yang4      Ka Leong Cheng1, 2      Yujun Shen2     Qifeng Chen1
1 HKUST     2 Ant Group     3 Stanford University     4 CUHK     
Overview
Consistent editing across in-the-wild images is a widely needed capability, yet it remains a technical challenge due to various unmanageable factors, such as object poses, lighting conditions, and photography environments. Edicho offers a training-free solution based on diffusion models, built on the fundamental design principle of using explicit image correspondence to direct editing. Specifically, its key components are an attention manipulation module and a carefully refined classifier-free guidance (CFG) denoising strategy, both of which take the pre-estimated correspondence into account. This inference-time algorithm is plug-and-play and compatible with most diffusion-based editing methods, such as ControlNet and BrushNet. Extensive results demonstrate the efficacy of Edicho in consistent cross-image editing under diverse settings.
Method
Motivation. For in-the-wild image editing, learning-based methods often lack proper regularization and thus produce inconsistent edits, since high-quality training data are hard to obtain and uniformity constraints are hard to enforce. Non-optimization methods instead rely on implicit correspondence derived from attention features for appearance transfer, but they struggle with unstable predictions and intrinsic image variations, leading to inconsistent or distorted edits. The figure below visualizes the correspondence predicted by explicit and attention-based implicit methods, respectively, together with the attention maps used for correspondence prediction (regions with the highest attention weights are outlined with dashed circles).
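To make "explicit correspondence" concrete, below is a minimal sketch of dense matching by nearest-neighbor search over deep features from any off-the-shelf extractor. The feature tensors, their shapes, and the function name are assumptions for illustration only, not the specific extractor used by Edicho.

import torch
import torch.nn.functional as F

def explicit_correspondence(feat_src, feat_tgt):
    """Dense correspondence by nearest-neighbor matching of deep features.

    feat_src, feat_tgt: (C, H, W) features of the two images from any
    off-the-shelf extractor (assumed inputs). Returns an (H, W, 2) tensor
    holding, for each source location, the (y, x) of its best target match.
    """
    C, H, W = feat_src.shape
    src = F.normalize(feat_src.reshape(C, -1), dim=0)  # (C, H*W), unit-norm tokens
    tgt = F.normalize(feat_tgt.reshape(C, -1), dim=0)
    sim = src.t() @ tgt                                # (H*W, H*W) cosine similarity
    best = sim.argmax(dim=1)                           # best target token per source token
    ys = torch.div(best, W, rounding_mode="floor")
    xs = best % W
    return torch.stack((ys, xs), dim=1).reshape(H, W, 2)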
Framework. To achieve consistent editing, we propose a training-free, plug-and-play method that first predicts explicit correspondence for the input images. The pre-computed correspondence is then injected into the pre-trained diffusion model to guide denoising at two levels: (a) attention features and (b) noisy latents in classifier-free guidance (CFG), as illustrated in the figure below.
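The sketch below illustrates the two injection points in stripped-down form: the target branch attends to source keys and values warped by the pre-computed correspondence, and the unconditional noise prediction is blended with its warped source counterpart before the standard CFG extrapolation. The tensor layouts, the correspondence format, and the blending weight alpha are assumptions for illustration, not the exact formulation in the paper.

import torch

def warp_by_correspondence(feat_src, corr):
    """Gather source tokens at the positions matched to each target token.

    feat_src: (B, N, C) token features of the source branch.
    corr:     (N,) long tensor; corr[i] is the source token index matched
              to target token i (assumed pre-computed format).
    """
    return feat_src[:, corr, :]

def corr_attention(q_tgt, k_src, v_src, corr, scale):
    """Correspondence-guided attention (sketch): target queries attend to
    source keys/values warped into the target layout, so matched regions
    share appearance features across the two branches."""
    k = warp_by_correspondence(k_src, corr)
    v = warp_by_correspondence(v_src, corr)
    attn = torch.softmax(q_tgt @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v

def corr_cfg(eps_uncond_tgt, eps_cond_tgt, eps_uncond_src, corr_latent,
             guidance_scale, alpha=0.5):
    """Correspondence-aware CFG (sketch): blend the target's unconditional
    prediction with the source prediction warped at latent resolution,
    then apply the usual CFG extrapolation. `alpha` is a hypothetical
    blending weight, not a value from the paper."""
    B, C, H, W = eps_uncond_tgt.shape
    warped = eps_uncond_src.reshape(B, C, -1)[:, :, corr_latent].reshape(B, C, H, W)
    uncond = (1 - alpha) * eps_uncond_tgt + alpha * warped
    return uncond + guidance_scale * (eps_cond_tgt - uncond)

In practice, such correspondence-guided attention can be hooked into existing diffusion editing pipelines through their attention-processor interfaces, which is what allows the approach to remain plug-and-play with editors like ControlNet and BrushNet.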
Results
Shown below are qualitative comparisons on global and local editing, respectively.


Given outputs from our consistent editing method (upper), customized generation (lower) can be achieved with customization techniques by injecting the edited concepts into the generative model.
We also adopt the neural regressor Dust3R to reconstruct 3D geometry from the edits by matching their 2D points in 3D space:
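As a rough sketch of this step, the snippet below follows the usage pattern published in the Dust3R repository (naver/dust3r) to lift a pair of edited views into a shared point cloud. The checkpoint name, module paths, and settings are taken from that repository's examples and may differ across versions; the two image paths are placeholders.

import torch
from dust3r.inference import inference
from dust3r.model import AsymmetricCroCo3DStereo
from dust3r.utils.image import load_images
from dust3r.image_pairs import make_pairs
from dust3r.cloud_opt import global_aligner, GlobalAlignerMode

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pre-trained Dust3R checkpoint (name as listed in the official repository).
model = AsymmetricCroCo3DStereo.from_pretrained(
    "naver/DUSt3R_ViTLarge_BaseDecoder_512_dpt").to(device)

# The two consistently edited views (placeholder paths).
images = load_images(["edit_view1.png", "edit_view2.png"], size=512)
pairs = make_pairs(images, scene_graph="complete", prefilter=None, symmetrize=True)
output = inference(pairs, model, device, batch_size=1)

# Global alignment turns the pairwise predictions into a shared 3D scene.
scene = global_aligner(output, device=device, mode=GlobalAlignerMode.PointCloudOptimizer)
scene.compute_global_alignment(init="mst", niter=300, schedule="cosine", lr=0.01)

pts3d = scene.get_pts3d()     # per-view 3D points in a common frame
poses = scene.get_im_poses()  # recovered camera poses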
Additional Results
Shown below are additional qualitative results of the proposed method for local (upper three) and global editing (lower three). The inpainted regions for local editing are indicated in light red. "Fixed Seed" denotes editing results produced from the same random seed (i.e., the same initial noise).
BibTeX
@article{bai2024edicho,
  title   = {Edicho: Consistent Image Editing in the Wild},
  author  = {Bai, Qingyan and Ouyang, Hao and Xu, Yinghao and Wang, Qiuyu and Yang, Ceyuan and Cheng, Ka Leong and Shen, Yujun and Chen, Qifeng},
  journal = {arXiv preprint arXiv:2412.21079},
  year    = {2024}
}