
Is it possible to extract the visual knowledge learned by large diffusion models for visual perception tasks?

Question: can the visual knowledge a diffusion model has learned be extracted for visual perception? → Not easy.

Although rich representations are learned in large diffusion models, it is still unknown how to extract this knowledge for various visual perception tasks and whether it can benefit visual perception.

Problem statement: even though rich representations are learned, it is unknown how to extract that knowledge for the various visual perception tasks, and whether it actually benefits visual perception.

To tackle these challenges, we introduce a new framework called VPD to adapt pre-trained diffusion models for visual perception tasks. Instead of using the step-by-step diffusion pipeline, we propose to simply employ the autoencoder as a backbone model to directly consume the natural images without noise and perform a single extra denoising step with designed prompts to extract the semantic information.

To tackle this, the authors introduce a framework called VPD, which adapts a pre-trained diffusion model to visual perception tasks. Instead of the step-by-step diffusion pipeline, they simply use the autoencoder as the backbone, feed natural images without added noise, and perform a single extra denoising step with designed prompts to extract semantic information.
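A rough sketch of this single-step feature extraction, under assumptions: this is my own minimal illustration, not the authors' released code. It uses the Hugging Face diffusers Stable Diffusion pipeline as the pre-trained text-to-image model, collects UNet decoder features with forward hooks, and omits VPD's text adapter and cross-attention maps; the model id `runwayml/stable-diffusion-v1-5` and the function name `extract_features` are illustrative choices.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

features = []  # multi-scale UNet feature maps, later fed to a task-specific head
hooks = [
    blk.register_forward_hook(lambda module, inputs, output: features.append(output))
    for blk in pipe.unet.up_blocks
]

@torch.no_grad()
def extract_features(image, prompts):
    """image: (B, 3, H, W) in [-1, 1]; prompts: list of B text descriptions."""
    features.clear()
    # 1) Autoencoder as backbone: latent code of the clean image, no noise added.
    latents = pipe.vae.encode(image.half().to("cuda")).latent_dist.mean
    latents = latents * pipe.vae.config.scaling_factor
    # 2) Condition on the designed prompts via the CLIP text encoder.
    tokens = pipe.tokenizer(
        prompts, padding="max_length",
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    ).input_ids.to("cuda")
    text_emb = pipe.text_encoder(tokens)[0]
    # 3) A single extra "denoising" step at t = 0 to query the UNet's semantic knowledge.
    t = torch.zeros(latents.shape[0], dtype=torch.long, device="cuda")
    pipe.unet(latents, t, encoder_hidden_states=text_emb)
    return features  # hierarchical features collected by the hooks
```

The denoising UNet is run only once on the clean latent, so the features come out at roughly the cost of a normal backbone forward pass rather than a full sampling trajectory.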


1. Method

Our key idea is to investigate how to fully extract the high-level knowledge in a pre-trained text-to-image diffusion model.

Investigates how to extract the pre-trained high-level knowledge from the diffusion model.

Preliminaries: Diffusion Models

Denoting $z_t$ as the random variable at the $t$-th timestep, the diffusion process is modeled as a Markov chain:

$$
q(z_{1:T} \mid z_0) = \prod_{t=1}^{T} q(z_t \mid z_{t-1}), \qquad
q(z_t \mid z_{t-1}) = \mathcal{N}\!\Big(z_t;\ \sqrt{\tfrac{\alpha_t}{\alpha_{t-1}}}\, z_{t-1},\ \big(1 - \tfrac{\alpha_t}{\alpha_{t-1}}\big) I\Big),
$$

where $\{\alpha_t\}$ are fixed coefficients that determine the noise schedule. The above definition leads to a simple closed form of $p(z_t \mid z_0)$:

$$
p(z_t \mid z_0) = \mathcal{N}\!\big(z_t;\ \sqrt{\alpha_t}\, z_0,\ (1 - \alpha_t)\, I\big).
$$

With proper re-parameterization, the training objective of diffusion models can be derived as [20]:

$$
\mathcal{L}_{\text{simple}} = \mathbb{E}_{z_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}\Big[\ \big\lVert \epsilon - \epsilon_\theta(z_t, t) \big\rVert_2^2\ \Big].
$$
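As a quick sanity check of the two formulas above, a small toy sketch (my own code, with a made-up linear noise schedule and a placeholder `eps_model` standing in for the denoising network $\epsilon_\theta$): it samples $z_t$ from the closed form $p(z_t \mid z_0)$ and evaluates the simple noise-prediction loss.

```python
import torch

T = 1000
# Toy noise schedule: alpha_t decreases from ~1 toward 0 (assumed, not the paper's schedule).
alphas = torch.linspace(1.0, 0.0, T + 1)[1:]

def sample_zt(z0, t):
    """Draw z_t ~ N(sqrt(alpha_t) z0, (1 - alpha_t) I) and return it with the noise used."""
    eps = torch.randn_like(z0)
    a = alphas[t].view(-1, *([1] * (z0.dim() - 1)))
    return a.sqrt() * z0 + (1 - a).sqrt() * eps, eps

def simple_loss(eps_model, z0):
    """L_simple = E || eps - eps_theta(z_t, t) ||^2 over random timesteps."""
    t = torch.randint(0, T, (z0.shape[0],))
    zt, eps = sample_zt(z0, t)
    return torch.mean((eps - eps_model(zt, t)) ** 2)

# Example: a dummy "model" that always predicts zero noise.
loss = simple_loss(lambda zt, t: torch.zeros_like(zt), torch.randn(4, 4, 8, 8))
```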

Known material — skipping the details.

Prompting Text-to-Image Diffusion Model