Stable Target Field (STF) Notes

To better understand the origin of this instability, we decompose the score field into three regimes.

In this regime, the sources of the noisy examples generated during the forward process become ambiguous. We illustrate the problem in Figure 1(a): each stochastic update of the score model is then based on disparate targets.

We propose a generalized version of the denoising score-matching objective, termed the Stable Target Field (STF) objective. The idea is to include an additional reference batch of examples that are used to calculate weighted conditional scores as targets. We apply self-normalized importance sampling to aggregate the contribution of each example in the reference batch.
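A minimal sketch of how such a weighted target could be computed, assuming a Gaussian transition kernel; `stf_target` and `sigma_t` are hypothetical names, not from the source:

```python
import numpy as np

def stf_target(x_t, ref_batch, sigma_t):
    """Sketch of an STF-style training target.

    Weights each reference example x_i by the self-normalized likelihood
    p_{t|0}(x_t | x_i), assumed Gaussian here, then averages the
    conditional scores -(x_t - x_i) / sigma_t**2.
    """
    # log p_{t|0}(x_t | x_i) up to an additive constant (Gaussian kernel assumed)
    log_w = -np.sum((x_t - ref_batch) ** 2, axis=-1) / (2 * sigma_t ** 2)
    log_w -= log_w.max()          # subtract max for numerical stability
    w = np.exp(log_w)
    w /= w.sum()                  # self-normalized importance weights
    cond_scores = -(x_t - ref_batch) / sigma_t ** 2
    return (w[:, None] * cond_scores).sum(axis=0)
```

When every reference example agrees, the weighted target reduces to the ordinary single-example conditional score; the averaging only matters when the reference batch straddles multiple modes.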

However, we show that the bias together with the trace-of-covariance of the STF training targets shrinks to zero as we increase the size of the reference batch.

1. UNDERSTANDING THE TRAINING TARGET IN SCORE-MATCHING OBJECTIVE

The vanilla denoising score-matching objective at time t is:

$$\mathcal{L}_{\mathrm{DSM}}(\theta, t) = \mathbb{E}_{x \sim p_0}\,\mathbb{E}_{x(t) \sim p_{t|0}(\cdot\,|\,x)}\Big[\big\|\, s_\theta(x(t), t) - \nabla_{x(t)} \log p_{t|0}(x(t)\,|\,x) \,\big\|_2^2\Big]$$

We can swap the order of the sampling process by first sampling x(t) from pt and then x from p0|t(·|x(t)). Thus, sθ has a closed form minimizer:

$$s_{\theta^*}(x(t), t) = \mathbb{E}_{x \sim p_{0|t}(\cdot\,|\,x(t))}\big[\nabla_{x(t)} \log p_{t|0}(x(t)\,|\,x)\big] = \nabla_{x(t)} \log p_t(x(t))$$

In short, this can be interpreted as follows: x(t) is obtained from log p_t|0(x(t)|x) in Eq. 2, and going back through p0|t(x|x(t)) to recover the sampling process is, in the end, what the minimizer of s_theta approaches.
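For the commonly used Gaussian transition kernel (an assumption here, since the kernel is not written out above), the conditional target has a simple closed form; a sketch with the hypothetical helper `dsm_target`:

```python
import numpy as np

def dsm_target(x_t, x0, sigma_t):
    """Training target grad_{x(t)} log p_{t|0}(x(t)|x0) when
    p_{t|0}(x(t)|x0) = N(x(t); x0, sigma_t^2 I) (assumed kernel)."""
    return -(x_t - x0) / sigma_t ** 2

# The sampling-order swap in practice: draw x0 ~ p0, perturb it to get
# x(t) = x0 + sigma_t * noise, then regress s_theta(x(t), t) onto
# dsm_target(x_t, x0, sigma_t).
```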

In particular, when multiple modes of the data distribution have comparable influences on x(t), p0|t(·|x(t)) is a multi-mode distribution, as also observed in Xiao et al. (2022). Thus the targets ∇x(t) log pt|0(x(t)|x) vary considerably across different x and this can strongly affect the estimated score at (x(t), t), resulting in slower convergence and worse performance in practical stochastic gradient optimization (Wang et al., 2013).

To quantitatively characterize the variation of individual targets at different times, we propose a metric, the average trace-of-covariance of training targets at time t:

$$V_{\mathrm{DSM}}(t) = \mathbb{E}_{x(t) \sim p_t}\Big[\mathrm{Tr}\big(\mathrm{Cov}_{x \sim p_{0|t}(\cdot\,|\,x(t))}\big(\nabla_{x(t)} \log p_{t|0}(x(t)\,|\,x)\big)\big)\Big]$$

In short, V_DSM is proposed as a metric that measures how much the targets vary at each time.
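A Monte Carlo sketch of this metric on a toy 1-D data distribution with a few point modes; the setup, the Gaussian kernel, and the names `v_dsm` and `modes` are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def v_dsm(sigma_t, modes, n_samples=2000):
    """Monte Carlo estimate of the average trace-of-covariance of DSM
    targets at noise scale sigma_t, for equal-weight point modes."""
    traces = []
    for _ in range(n_samples):
        x0 = rng.choice(modes)                      # x ~ p_0
        x_t = x0 + sigma_t * rng.standard_normal()  # x(t) ~ p_{t|0}(.|x)
        # Posterior p_{0|t}(x | x(t)) over the discrete modes, by Bayes' rule
        log_w = -(x_t - modes) ** 2 / (2 * sigma_t ** 2)
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        targets = -(x_t - modes) / sigma_t ** 2     # conditional score per mode
        mean = (w * targets).sum()
        traces.append((w * (targets - mean) ** 2).sum())
    return float(np.mean(traces))
```

On modes at -1 and +1, the estimate is near zero for small sigma_t (the posterior picks one mode), largest at intermediate sigma_t (both modes compete), and small again for large sigma_t (the conditional scores barely differ), matching the three-phase picture below.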


We use VDSM(t) to define three successive phases relating to the behavior of training targets. As shown in Figure 2(a), the three phases partition the score field into near, intermediate, and far regimes (Phase 1∼3 respectively).

In Phase 1, the posterior p0|t concentrates around a single mode, so the targets exhibit low variation. In Phase 3, the targets remain similar across modes, since limt→1 pt|0(x(t)|x) ≈ p1 for commonly used transition kernels.

2. TREATING SCORE AS A FIELD

Since sampling directly from the posterior p0|t is not practical, we first apply importance sampling with the proposal distribution p0. Specifically, we sample a large reference batch B_L = {x_i}_{i=1}^n ∼ p_0^n and get the following approximation:
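Self-normalized importance sampling with proposal p_0 yields an estimator of the following shape (a reconstruction from the setup described above, not quoted from the source):

$$\nabla_{x(t)} \log p_t(x(t)) \;\approx\; \sum_{i=1}^{n} \frac{p_{t|0}(x(t)\,|\,x_i)}{\sum_{j=1}^{n} p_{t|0}(x(t)\,|\,x_j)}\, \nabla_{x(t)} \log p_{t|0}(x(t)\,|\,x_i)$$

The normalizing sum in the denominator plays the role of an n-sample estimate of p_t(x(t)), which is why the weights are self-normalized rather than computed against the intractable marginal.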