We argue that the key to controllable image generation relies not only on conditioning, but even more significantly on compositionality (Lake et al., 2017). The latter can exponentially expand the control space by introducing an enormous number of potential combinations.
In short: controllable image generation should not rely on conditioning alone; compositionality brings the larger gain, expanding the control space by introducing the potential of a huge number of combinations.
Composer is capable of decoding novel images from unseen combinations of representations that may come from different sources and be potentially incompatible with one another.
Our framework comprises the decomposition phase, where an image is divided into a set of independent components; and the composition phase, where the components are reassembled utilizing a conditional diffusion model.
In short, the framework splits into two phases: a decomposition phase and a composition phase, where the pieces are merged back together by a conditional diffusion model.
Typically, a simple mean-squared error is used as the denoising objective:
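The equation itself is elided after the colon; a standard ε-prediction objective (reconstructed here from the usual DDPM formulation, not copied from the paper) reads:

```latex
\mathcal{L} = \mathbb{E}_{\mathbf{x}_0,\, c,\, \boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I}),\, t}
\left[ \left\lVert \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, c, t) - \boldsymbol{\epsilon} \right\rVert_2^2 \right],
\qquad
\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}
```

i.e., the network is trained to predict the noise that was mixed into the clean image under condition c.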
Classifier-free guidance
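As a reminder (reconstructed from the standard classifier-free guidance formulation, with ω the guidance weight; not quoted from the paper):

```latex
\hat{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, c, t)
= \omega\, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, c, t)
+ (1 - \omega)\, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)
```

The unconditional branch is obtained by randomly dropping the condition during training.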
Guidance directions
With DDIM, a sample can be deterministically reversed to its latent and regenerated; the rest needs no further explanation. For guidance directions, the guidance is formed from two different conditions.
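Concretely, two-condition guidance presumably takes the form below (a reconstruction following the classifier-free guidance notation above, where c2 takes the conditional slot and c1 the role of the unconditional branch):

```latex
\hat{\boldsymbol{\epsilon}}_\theta
= \omega\, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, c_2, t)
+ (1 - \omega)\, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, c_1, t)
```

so the update is pushed away from c1 and toward c2.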
Bidirectional guidance: By reversing an image x0 to its latent xT using condition c1, and then sampling from xT using another condition c2, we are able to manipulate the image in a disentangled manner using Composer, where the manipulation direction is defined by the difference between c2 and c1.
With bidirectional guidance there are two conditions, c1 and c2: condition c1 is used when reversing x0 to xT, and condition c2 is used when sampling from xT.
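A minimal numerical sketch of the invert-with-c1, sample-with-c2 recipe, assuming a toy linear noise predictor `eps_theta` (hypothetical; Composer's real model is a large conditional diffusion UNet) and plain deterministic DDIM updates:

```python
import numpy as np

def eps_theta(x, t, c):
    # Hypothetical linear noise predictor standing in for Composer's UNet.
    return 0.05 * (x - c)

T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def ddim_step(x, t_from, t_to, c):
    # Deterministic DDIM update (eta = 0) between two timesteps.
    # t_to > t_from inverts toward noise; t_to < t_from samples toward data.
    a_from, a_to = alphas_bar[t_from], alphas_bar[t_to]
    eps = eps_theta(x, t_from, c)
    x0_hat = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_hat + np.sqrt(1.0 - a_to) * eps

def bidirectional_edit(x0, c1, c2):
    # Treat x0 as the (nearly clean) state at t = 0, invert with c1 ...
    x = x0
    for t in range(T - 1):
        x = ddim_step(x, t, t + 1, c1)
    # ... then sample back down to t = 0 with c2.
    for t in range(T - 1, 0, -1):
        x = ddim_step(x, t, t - 1, c2)
    return x

x0 = np.array([1.0, -0.5])
recon = bidirectional_edit(x0, c1=0.0, c2=0.0)   # same condition: near roundtrip
edited = bidirectional_edit(x0, c1=0.0, c2=1.0)  # edit along the direction c2 - c1
```

With c2 == c1 the inversion-plus-sampling roundtrip approximately reconstructs x0; that is what makes the difference c2 − c1 act as a disentangled manipulation direction rather than a full resynthesis.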
We decompose an image into decoupled representations which capture various aspects of it. We describe eight representations we use in this work, where all of them are extracted on-the-fly during training.
The image is decomposed so that various aspects of it are captured; the eight representations are extracted on the fly during training.
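As a tiny illustration of what "extracted on-the-fly" can mean, here is a sketch computing two cheap, decoupled representations (an intensity map and a coarse per-channel color histogram); the function and dictionary keys are hypothetical stand-ins, not Composer's actual extractors:

```python
import numpy as np

def decompose(image):
    # Toy on-the-fly decomposition of an HxWx3 image with values in [0, 1].
    # Illustrative only: Composer's actual eight representations include
    # richer components that need dedicated extractors or pretrained models.
    intensity = image.mean(axis=-1, keepdims=True)  # grayscale component
    # Coarse color statistic: per-channel 4-bin histogram, normalized.
    palette = np.stack(
        [np.histogram(image[..., ch], bins=4, range=(0.0, 1.0))[0]
         for ch in range(3)]
    ).astype(float)
    palette /= palette.sum(axis=1, keepdims=True)
    return {"intensity": intensity, "palette": palette}

img = np.random.default_rng(0).random((8, 8, 3))
reps = decompose(img)
```

Because each representation is a cheap function of the image, they can be recomputed for every training batch instead of being precomputed and stored.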