It remains challenging to edit only local objects in a video, such as changing a running “dog” into a “cat” without influencing the environment. This paper proposes a pipeline that can edit a video both locally and globally, as shown in Figs. 1 and 5.
Problem statement: editing only a specific part of a video is hard, so the paper proposes a pipeline that can edit both locally and globally.
In order to edit a real image, this pipeline includes two necessary steps: (1) inverting the image into latent features with a pre-trained diffusion model, and (2) controlling attention maps in the denoising process to edit the corresponding parts of the image.
To edit a real image, two steps are needed: (1) invert the image into latent features with a pre-trained diffusion model; (2) control attention maps during the denoising process to edit the corresponding parts of the image.
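A minimal numpy sketch of step (1), deterministic DDIM inversion and its reverse denoising loop. The schedule values and the constant `eps_model` are placeholders I introduce for illustration (a real pipeline uses a trained text-conditioned U-Net, and step (2) would inject edited attention maps inside the denoising loop):

```python
import numpy as np

# Toy diffusion schedule (stand-in for a pre-trained model's schedule).
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

def eps_model(x, t):
    # Placeholder noise predictor; a real model is a text-conditioned U-Net.
    return 0.1 * np.ones_like(x)

def ddim_invert(x0):
    """Step (1): map a clean image into the diffusion latent space."""
    x = x0.copy()
    for t in range(T):
        a_prev = alphas_cumprod[t - 1] if t > 0 else 1.0
        a_t = alphas_cumprod[t]
        eps = eps_model(x, t)
        x0_pred = (x - np.sqrt(1 - a_prev) * eps) / np.sqrt(a_prev)
        x = np.sqrt(a_t) * x0_pred + np.sqrt(1 - a_t) * eps
    return x

def ddim_denoise(xT):
    """Step (2) runs this loop while controlling attention maps."""
    x = xT.copy()
    for t in reversed(range(T)):
        a_prev = alphas_cumprod[t - 1] if t > 0 else 1.0
        a_t = alphas_cumprod[t]
        eps = eps_model(x, t)
        x0_pred = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1 - a_prev) * eps
    return x

img = np.random.default_rng(0).standard_normal((4, 4))
latent = ddim_invert(img)
recon = ddim_denoise(latent)
```

With this toy noise predictor the round trip is exact; with a real network the inversion is only approximate, which is the motivation for the embedding optimization discussed below.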
In this paper, we aim to build an attention control-based pipeline for video editing. Since no large-scale pre-trained video generation models are publicly available, we propose a novel framework to show that a pre-trained image diffusion model can be adapted for detailed video editing.
Goal: design an attention control-based pipeline for video editing. Since no large-scale pre-trained video generation model is publicly available, the authors adapt a pre-trained image diffusion model for detailed video editing.
While a pre-trained image diffusion model can be utilized for video editing by processing frames individually (Image-P2P), it lacks semantic consistency across frames (the 2nd row of Fig. 2).
Problem 2: as the 2nd row of Fig. 2 shows, per-frame editing does not preserve the penguin's appearance across frames; frame-wise editing is weak on this kind of consistency.
To maintain semantic consistency, we propose performing inversion and attention control jointly on all frames, by transforming the Text-to-Image (T2I) diffusion model into a Text-to-Set (T2S) model.
As in the 3rd row of Fig. 2, to maintain semantics, the authors propose applying inversion and attention control to all frames jointly, converting the T2I model into a T2S model.
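A toy numpy sketch of what a T2I-to-T2S inflation can look like at the attention level, under my own simplifying assumptions: instead of each frame attending only within itself, keys and values are concatenated across frames so every frame's queries see the whole frame set, which ties the frames' semantics together:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def per_frame_attention(q, k, v):
    # T2I behavior: each frame's tokens attend only within that frame.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def inflated_attention(q, k, v):
    # T2S inflation (sketch): keys/values from all frames are concatenated,
    # so every query attends across the entire set of frames.
    f, n, d = q.shape
    scores = q.reshape(f * n, d) @ k.reshape(f * n, d).T / np.sqrt(d)
    return (softmax(scores) @ v.reshape(f * n, d)).reshape(f, n, d)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((3, 4, 8)) for _ in range(3))  # (frames, tokens, dim)
out = inflated_attention(q, k, v)
```

For a single frame the two functions coincide, so the inflation is a strict generalization of the image model's attention.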
Generation quality degrades after this inflation step, but it can be recovered by tuning on the original video.
To improve the inversion quality, we propose to optimize a shared unconditional embedding for all frames to align the denoising latent features with the diffusion latent features.
Generation quality drops with the inflation step but can be recovered by tuning on the original video. To improve inversion quality, a shared unconditional embedding is optimized for all frames so that the denoising latent features align with the diffusion (inversion) latent features.
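The shared-embedding idea can be sketched as a tiny optimization problem. Here `W` stands in for how the frozen denoiser responds to the embedding and each `z` for a frame's inversion latent; both are my own toy substitutes, not the paper's network. The key point is that one embedding `u` is optimized against all frames' latents at once:

```python
import numpy as np

rng = np.random.default_rng(1)
# Well-conditioned stand-in for the denoiser's response to the embedding.
W = np.linalg.qr(rng.standard_normal((8, 8)))[0]
frame_latents = [rng.standard_normal(8) for _ in range(5)]  # per-frame latents

def loss(u):
    # Align the denoising latents (W @ u) with every frame's inversion latent.
    return sum(np.sum((W @ u - z) ** 2) for z in frame_latents)

u = np.zeros(8)                          # shared unconditional embedding
lr = 0.4 / np.linalg.norm(W, 2) ** 2     # step size below 1 / ||W||_2^2
initial = loss(u)
for _ in range(500):
    grad = sum(2 * W.T @ (W @ u - z) for z in frame_latents)
    u -= lr * grad / len(frame_latents)  # one update shared by all frames
final = loss(u)
```

Because `u` is shared, gradient descent drives `W @ u` to the mean of the frame latents: the single embedding trades off reconstruction across all frames rather than fitting each one separately.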
On the other hand, we find that the approximate inversion with an initialized unconditional embedding is editable but cannot reconstruct well. To address this issue, we propose a decoupled-guidance strategy in attention control, utilizing different guidance strategies for the source and target prompts.
Problem 3: approximate inversion with an initialized unconditional embedding is editable but cannot reconstruct well. To address this, a decoupled-guidance strategy is proposed for attention control, using different guidance strategies for the source and target prompts.
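A sketch of the decoupling idea on top of standard classifier-free guidance. The noise predictions and the scale value are placeholders I supply for illustration; the assumption, following the text above, is that the source branch uses the optimized shared embedding's prediction (faithful reconstruction) while the target branch uses the initialized embedding's prediction (better editability):

```python
import numpy as np

def cfg(eps_uncond, eps_text, scale):
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the text-conditioned one.
    return eps_uncond + scale * (eps_text - eps_uncond)

rng = np.random.default_rng(2)
# Stand-ins for the model's noise predictions at one denoising step.
eps_opt, eps_init, eps_src, eps_tgt = (rng.standard_normal(8) for _ in range(4))

# Decoupled guidance (sketch): each branch pairs its prompt with a
# different unconditional embedding; 7.5 is a typical guidance scale.
src_noise = cfg(eps_opt, eps_src, scale=7.5)   # source: optimized embedding
tgt_noise = cfg(eps_init, eps_tgt, scale=7.5)  # target: initialized embedding
```

At `scale=1.0` the guidance reduces to the conditional prediction, and at `scale=0.0` to the unconditional one, which makes the role of each embedding in its branch explicit.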