However, these generative models are designed as bespoke systems, each supporting only a single task. In contrast, humans can generate diverse multi-modal content simultaneously, with arbitrary conditioning types.
Toward a general-purpose generative system for multimodal data, a unified training framework that can cover all types of multi-modal generative tasks (see Figure 1) is a fundamental component.
Problem statement: generative models are designed as bespoke systems, each capable of only one task, whereas humans can accomplish many tasks simultaneously. Toward a multimodal-data system, a unified training framework is a fundamental component that can cover all types of generative tasks.
In contrast, this paper presents a diffusion-based framework (dubbed UniDiffuser) that explicitly fits all relevant distributions in one model without introducing additional training or inference overhead. Our key insight is that learning diffusion models for all distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e., timesteps) can differ across modalities.
This paper is a diffusion-based framework that fits all relevant distributions in one model without additional training or inference overhead. The authors' key insight: learning diffusion models for all distributions can be unified as predicting the noise in the perturbed data.
Formally, suppose we have two modalities of data sampled from a distribution q(x0, y0). We aim to design a diffusion-based model that is able to capture all relevant distributions determined by q(x0, y0), i.e., the marginal distributions q(x0) and q(y0), the conditional distributions q(x0|y0) and q(y0|x0), and the joint distribution q(x0, y0).
We take two modalities sampled from a distribution. All distributions determined by q(x0, y0) must be captured: the marginal distributions, the conditional distributions, and the joint distribution.
In particular, modeling the marginal distribution q(x0) is equivalent to estimating the conditional expectation of the noise injected into x_t, i.e., E[ε^x | x_t], according to Eq. (1). Similarly, the key quantities to be estimated in modeling the conditional distribution q(x0|y0) and the joint distribution q(x0, y0) are E[ε^x | x_t, y_0] (see Eq. (3)) and E[ε^x, ε^y | x_t, y_t], respectively.
Modeling the marginal distribution q(x0) is equivalent to estimating the conditional expectation of the injected noise. Similarly, the quantities to estimate for the conditional distribution q(x0|y0) and the joint distribution q(x0, y0) are, respectively:
E[ε^x | x_t, y_0]   (conditional)
E[ε^x, ε^y | x_t, y_t]   (joint)
A key observation is that all the above conditional expectations can be unified in the general form E[ε^x, ε^y | x_{t^x}, y_{t^y}], where t^x and t^y are two timesteps that can differ, and x_{t^x} and y_{t^y} are the corresponding perturbed data. In particular, a maximum timestep T means marginalizing that modality out.
Even when t^x and t^y differ, all the conditional expectations can be unified into the general form E[ε^x, ε^y | x_{t^x}, y_{t^y}]. Setting a timestep to its maximum value T means marginalizing that modality out.
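The unified objective above can be sketched as a single training step: sample an independent timestep for each modality, perturb both, and regress the injected noise jointly. This is only a minimal numpy sketch; the network `eps_theta`, the linear `alpha_bar` schedule, and all shapes are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000  # assumed maximum timestep

def eps_theta(x_t, y_t, t_x, t_y):
    # Stand-in for the joint noise-prediction network: a real model would be
    # a network conditioned on both timesteps; this placeholder only has the
    # right input/output shapes.
    return np.zeros_like(x_t), np.zeros_like(y_t)

def unidiffuser_step(x0, y0, alpha_bar):
    """One step of the unified objective: each modality gets its own
    independently sampled timestep, and the noise is regressed jointly."""
    t_x, t_y = rng.integers(0, T, size=2)  # independent timesteps per modality
    eps_x = rng.standard_normal(x0.shape)
    eps_y = rng.standard_normal(y0.shape)
    # Standard diffusion perturbation, one noise level per modality.
    x_t = np.sqrt(alpha_bar[t_x]) * x0 + np.sqrt(1 - alpha_bar[t_x]) * eps_x
    y_t = np.sqrt(alpha_bar[t_y]) * y0 + np.sqrt(1 - alpha_bar[t_y]) * eps_y
    pred_x, pred_y = eps_theta(x_t, y_t, t_x, t_y)
    # A single regression target covers the marginal/conditional/joint cases.
    return np.mean((pred_x - eps_x) ** 2) + np.mean((pred_y - eps_y) ** 2)

alpha_bar = np.linspace(1.0, 1e-4, T)  # assumed noise schedule
loss = unidiffuser_step(rng.standard_normal(8), rng.standard_normal(4), alpha_bar)
```

Because the timesteps are drawn independently, one training loop implicitly covers every (t^x, t^y) combination the general form requires.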
Namely, by setting t^y = T, we have E[ε^x | x_{t^x}, y_T] ≈ E[ε^x | x_{t^x}], which corresponds to the marginal distribution q(x0).
In other words, when t^y = T, the approximation above holds and the marginal distribution q(x0) is recovered.
Formally, E[ε^x | x_{t^x}, y_0] corresponds to the conditional distribution q(x0|y0) by setting t^y = 0, and E[ε^x, ε^y | x_t, y_t] corresponds to the joint distribution q(x0, y0) by setting t^x = t^y = t. Moreover, we can characterize q(x0 | y_{t^y}) and q(y0 | x_{t^x}) for all t^y and t^x and generate data conditioned on noisy input, by estimating E[ε^x, ε^y | x_{t^x}, y_{t^y}] in general.
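The correspondence between timestep choices and captured distributions can be summarized as a small dispatch table. The helper below is purely hypothetical (not part of the paper's code); it just encodes which distribution the general estimate E[ε^x, ε^y | x_{t^x}, y_{t^y}] reduces to for a given (t^x, t^y) pair.

```python
T = 1000  # assumed maximum timestep of the diffusion process

def captured_distribution(t_x, t_y):
    """Hypothetical helper: map a timestep pair to the distribution that the
    unified noise estimate captures at that pair."""
    if t_x == T and t_y == T:
        return "both modalities marginalized out"
    if t_y == T:
        return "marginal q(x0)"          # y fully noised -> marginalized out
    if t_x == T:
        return "marginal q(y0)"
    if t_y == 0:
        return "conditional q(x0|y0)"    # clean y given as the condition
    if t_x == 0:
        return "conditional q(y0|x0)"
    if t_x == t_y:
        return "joint q(x0, y0)"         # shared timestep t
    return "conditioned on noisy input"  # q(x0|y_{t^y}) or q(y0|x_{t^x})
```

One model thus serves every task by picking the appropriate timestep pair at sampling time.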