Therefore, a very natural question arises: is the reliance on the CNN-based U-Net necessary in diffusion models?
Question raised: do diffusion models need to rely on a U-Net-based architecture?
In this paper, we design a simple and general ViT-based architecture called U-ViT (Figure 1). Following the design methodology of transformers, U-ViT treats all inputs including the time, condition and noisy image patches as tokens. Crucially, U-ViT employs long skip connections between shallow and deep layers inspired by U-Net. Intuitively, low-level features are important to the pixel-level prediction objective in diffusion models and such connections can ease the training of the corresponding prediction network. Besides, U-ViT optionally adds an extra 3×3 convolutional block before output for better visual quality.
The authors design a ViT-based architecture. Following the transformer design methodology, U-ViT treats the time, condition, and noisy image patches as tokens and feeds them in. Crucially, U-ViT brings in long skip connections between shallow and deep layers. Low-level features matter for the model's pixel-level prediction objective, and these connections make training the corresponding prediction network easier. U-ViT can optionally add a 3×3 convolutional block to improve visual quality. The results can be seen in Figure 2.
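To make the "everything is a token" idea concrete, here is a minimal numpy sketch (not the authors' code) of how the input sequence could be assembled: the noisy image is split into patches, and the timestep and condition are embedded as extra tokens in front of the patch tokens. The embedding tables and projection are hypothetical random stand-ins for learned parameters.

```python
import numpy as np

def patchify(x, patch_size):
    """Split an image of shape (H, W, C) into flattened patches."""
    H, W, C = x.shape
    p = patch_size
    patches = x.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)
    return patches  # shape: (num_patches, p*p*C)

def build_tokens(noisy_image, t, c, embed_dim, patch_size, rng):
    """Concatenate time, condition, and patch tokens into one sequence."""
    patches = patchify(noisy_image, patch_size)
    # Hypothetical random projections/tables stand in for learned embeddings.
    W_patch = rng.standard_normal((patches.shape[1], embed_dim))
    time_table = rng.standard_normal((1000, embed_dim))  # timestep embeddings
    cond_table = rng.standard_normal((10, embed_dim))    # class embeddings
    patch_tokens = patches @ W_patch          # (L_img, D)
    time_token = time_table[t:t + 1]          # (1, D)
    cond_token = cond_table[c:c + 1]          # (1, D)
    return np.concatenate([time_token, cond_token, patch_tokens], axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 3))  # a 32x32 RGB "noisy image"
tokens = build_tokens(x, t=10, c=7, embed_dim=64, patch_size=4, rng=rng)
print(tokens.shape)  # (2 + (32//4)**2, 64) = (66, 64)
```

The transformer blocks then process this (L, D) sequence uniformly, with no special-casing of time or condition inputs.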
Our results suggest that the long skip connection is crucial while the down/up-sampling operators in CNN-based U-Net are not always necessary for image diffusion models. We believe that U-ViT can provide insights for future research on diffusion model backbones and benefit generative modeling on large scale cross-modality datasets.
The authors conclude that the long skip connection is crucial, while the down/up-sampling operators of the CNN-based U-Net are not always necessary. This can also benefit generative modeling on large-scale datasets.
In particular, U-ViT parameterizes the noise prediction network εθ(x_t, t, c) in Eq. (1).
Following the design methodology of ViT, the image is split into patches, and U-ViT treats all inputs including the time, condition and image patches as tokens (words).
U-ViT parameterizes the model with the condition, as in Eq. (1). Following the ViT approach, the image is split into patches, and U-ViT feeds in the image patches together with the condition and time as tokens.
U-ViT also employs similar long skip connections between shallow and deep layers. Intuitively, the objective in Eq. (1) is a pixel-level prediction task and is sensitive to low-level features. The long skip connections provide shortcuts for the low-level features and therefore ease the training of the noise prediction network.
Additionally, U-ViT optionally adds a 3×3 convolutional block before output.
U-ViT borrows a strength of U-Net: the long skip connection. The objective in Eq. (1) is a pixel-level prediction task and is sensitive to low-level features. Because long skip connections provide shortcuts for low-level features, training the noise prediction network becomes easier. Additionally, a 3×3 convolutional block can optionally be added.
The way to combine the long skip branch.
Let h_m, h_s ∈ R^{L×D} be the embeddings from the main branch and the long skip branch respectively. We consider several ways to combine them before feeding them to the next transformer block:
(1) concatenating them and then performing a linear projection, as illustrated in Figure 1, i.e., Linear(Concat(h_m, h_s));
(2) directly adding them, i.e., h_m + h_s;
(3) performing a linear projection to h_s and then adding them, i.e., h_m + Linear(h_s);
(4) adding them and then performing a linear projection, i.e., Linear(h_m + h_s);
(5) dropping the long skip connection entirely.
As shown in Figure 2 (a), directly adding h_m and h_s does not provide benefits.
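The five variants above can be sketched in a few lines of numpy. This is a hedged illustration (not the paper's code): the weight matrices are random stand-ins for the learned linear projections, and the point is only that every variant maps two (L, D) branches back to a single (L, D) sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 66, 64
h_m = rng.standard_normal((L, D))  # main-branch embeddings
h_s = rng.standard_normal((L, D))  # long-skip-branch embeddings

# Hypothetical learned weights, here random stand-ins.
W_concat = rng.standard_normal((2 * D, D))  # for Linear(Concat(h_m, h_s))
W_proj = rng.standard_normal((D, D))        # for the other projections

combine = {
    "concat_linear": np.concatenate([h_m, h_s], axis=-1) @ W_concat,  # (1)
    "add": h_m + h_s,                                                 # (2)
    "add_linear_skip": h_m + h_s @ W_proj,                            # (3)
    "linear_add": (h_m + h_s) @ W_proj,                               # (4)
    "no_skip": h_m,                                                   # (5)
}

for name, out in combine.items():
    print(name, out.shape)  # every variant keeps the (L, D) token shape
```

Variant (1) is the one shown in Figure 1; per Figure 2 (a), the plain addition in variant (2) is the one that fails to help.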