Motivated by the ability of guided diffusion models to generate photorealistic samples and the ability of text-to-image models to handle free-form prompts, we apply guided diffusion to the problem of text-conditional image synthesis.
Motivated by models that handle free-form prompts and generate photorealistic samples, the authors apply guided diffusion to the problem of text-conditional image synthesis.
While our model can render a wide variety of text prompts zero-shot, it can have difficulty producing realistic images for complex prompts.
The model handles a wide variety of prompts zero-shot, but struggles to produce realistic images for complex prompts.
which allows humans to iteratively improve model samples until they match more complex prompts.
In addition, model samples can be iteratively improved until they match more complex prompts.
A CLIP model consists of two separate pieces: an image encoder f(x) and a caption encoder g(c). During training, batches of (x, c) pairs are sampled from a large dataset, and the model optimizes a contrastive cross-entropy loss that encourages a high dot-product f(x) · g(c) if the image x is paired with the given caption c, or a low dot-product if the image and caption correspond to different pairs in the training data.
Consists of an image encoder f(x) and a caption encoder g(c); trained with a contrastive cross-entropy loss on the dot product f(x)·g(c), which is pushed high for matched (x, c) pairs and low for mismatched ones.
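The contrastive objective above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the `temperature` parameter and the symmetric averaging over both directions are common CLIP-style conventions assumed here.

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive cross-entropy over a batch of
    (image, caption) embedding pairs. Row i of img_emb is assumed
    to be matched with row i of txt_emb."""
    # Normalize so the dot product f(x) . g(c) is a cosine similarity.
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img_emb @ txt_emb.T / temperature  # [B, B] similarity matrix

    def xent(l):
        # Cross-entropy with the diagonal (matched pairs) as targets.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logprobs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logprobs))

    # Average the image->caption and caption->image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned embeddings the loss is near zero; shuffling the captions against the images drives it up.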
To apply the same idea to diffusion models, we can replace the classifier with a CLIP model in classifier guidance. In particular, we perturb the reverse-process mean with the gradient of the dot product of the image and caption encodings with respect to the image:
In classifier guidance, the classifier can be replaced with a CLIP model; the reverse-process mean is perturbed by the gradient of f(x)·g(c) with respect to the image.
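Written out, the perturbed reverse-process mean takes the following form (a sketch in the usual guided-diffusion notation, with s the guidance scale and Σθ the reverse-process covariance):

```latex
\hat{\mu}_\theta(x_t \mid c) = \mu_\theta(x_t \mid c)
  + s \, \Sigma_\theta(x_t \mid c) \, \nabla_{x_t} \big( f(x_t) \cdot g(c) \big)
```

Compared with classifier guidance, the log-probability of a class is simply replaced by the image–caption dot product.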
Throughout our experiments, we use CLIP models that were explicitly trained to be noise-aware, which we refer to as noised CLIP models.
To obtain correct gradients during the reverse process, the CLIP model is trained on noised images.
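The noised inputs for such training come from the standard diffusion forward process, x_t = √(ᾱ_t)·x_0 + √(1−ᾱ_t)·ε. A minimal sketch, assuming a scalar cumulative-product schedule value `alpha_bar_t` (the function name is illustrative, not from the paper):

```python
import numpy as np

def noise_image(x0, alpha_bar_t, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0) from the diffusion forward process.
    A noised CLIP model is trained on x_t rather than on clean x_0,
    so its gradients remain meaningful on noisy reverse-process inputs."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
```

At ᾱ_t = 1 the sample equals the clean image; as ᾱ_t → 0 it approaches pure Gaussian noise.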
We adopt the ADM model architecture proposed by Dhariwal & Nichol (2021), but augment it with text conditioning information.
Uses the ADM model architecture, augmented with text-conditioning information.