Motivation.
In the task of in-the-wild image editing, learning-based methods often lack proper regularization: high-quality training data are difficult to obtain and uniformity constraints are hard to enforce, so their edits are often inconsistent.
Non-optimization methods instead rely on implicit correspondence derived from attention features for appearance transfer, but their predictions are unstable and sensitive to intrinsic image variations, leading to inconsistent or distorted edits. The figure below visualizes the correspondence predicted by explicit and attention-based implicit methods,
accompanied by the attention maps used for correspondence prediction (regions with the highest attention weights are outlined with dashed circles).
Framework.
To achieve consistent editing, we propose a training-free, plug-and-play method that first predicts explicit correspondence for the input
images. The pre-computed correspondence is injected into a pre-trained diffusion model and guides the denoising at two levels: (a)
attention features and (b) noisy latents in classifier-free guidance (CFG), as shown in the figure below.
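As a rough illustration only, the two injection levels can be sketched as below. This is not the actual implementation: the helper names, the dense integer-coordinate correspondence format, and the blending weights `alpha` / `beta` are all hypothetical placeholders chosen for clarity.

```python
import numpy as np

def warp_by_correspondence(feat_src, corr):
    # feat_src: (H, W, C) source feature map.
    # corr: (H, W, 2) integer (y, x) source coordinates for each target pixel
    # (a hypothetical nearest-neighbor encoding of explicit correspondence).
    ys, xs = corr[..., 0], corr[..., 1]
    return feat_src[ys, xs]

def inject_attention(feat_tgt, feat_src, corr, alpha=0.8):
    # Level (a): blend correspondence-aligned source attention features
    # into the target's attention features; alpha is a hypothetical weight.
    return alpha * warp_by_correspondence(feat_src, corr) + (1.0 - alpha) * feat_tgt

def guided_cfg(eps_uncond, eps_cond, latent, latent_ref, corr, scale=7.5, beta=0.2):
    # Level (b): standard CFG noise prediction ...
    eps = eps_uncond + scale * (eps_cond - eps_uncond)
    # ... plus a nudge of the noisy latent toward the correspondence-aligned
    # reference latent (beta is a hypothetical guidance strength).
    latent_aligned = warp_by_correspondence(latent_ref, corr)
    guided_latent = (1.0 - beta) * latent + beta * latent_aligned
    return eps, guided_latent
```

With an identity correspondence map, `warp_by_correspondence` returns the source features unchanged, which is a convenient sanity check when wiring such hooks into a diffusion denoising loop.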