We propose a simple and generic approach for enabling continuous state diffusion models to generate discrete data. The key ingredient in our approach is analog bits: real numbers used to model the bits that represent the discrete data. Analog bits can be directly modeled by continuous state diffusion models, without requiring a discrete state space or re-formulation of the continuous diffusion process. At sampling time, the generated analog bits can be decoded into discrete variables by a simple thresholding operation. Our approach, as illustrated in Figure 1, is based on the following high-level conjecture.
To reduce the prediction loss (such as negative log likelihood), the network has to model structures among analog bits that can actually lead to meaningful discrete variables after thresholding.
Analog Bits
A discrete data variable from an alphabet of size K can be represented using n = ⌈log2 K⌉ bits, as {0, 1}^n. Due to the discreteness, existing work has to re-formulate continuous diffusion models by adopting a discrete data space and state space.
Here n is the number of bits needed to encode the variable. Until now, prior work handled the discreteness by re-formulating continuous diffusion models with a discrete data space and state space.
In contrast, we propose to simply cast the binary bits {0, 1}^n into real numbers R^n for the continuous diffusion models. We term these real numbers analog bits since they learn to share the same bimodal values as binary bits but are modeled as real numbers. To draw samples, we follow the same procedure as sampling in a continuous diffusion model, except that we apply a quantization operation at the end by simply thresholding the generated analog bits. This yields binary bits which can then be converted into the original discrete/categorical variables. Notably, there is no hard constraint to force the model to generate exact binary bits, but we expect a strong continuous generative model to generate real numbers that exhibit very clear bimodal concentrations, and this is what happens in our experiments.
The authors differ from the approaches above: they keep everything continuous and work only with the bits. The analog bits learn to share the same bimodal values as binary bits while being modeled as real numbers; sampling proceeds exactly as in a continuous diffusion model, followed by a simple thresholding as quantization. The resulting binary bits are then converted back into discrete/categorical values. In short, the decoding works well only when the model's outputs are clearly bimodal.
That said, being binary bits does not mean a Cross Entropy loss cannot be applied.
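To make the encoding/decoding concrete, here is a minimal sketch in PyTorch. The function names and the {-1, 1} shift/scale are illustrative choices of my own; the text above only specifies bits cast to real numbers and a thresholding decode.

```python
import torch

def int2analog_bits(x: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Encode integers in [0, 2^n_bits) as analog bits in {-1., 1.}."""
    shifts = torch.arange(n_bits - 1, -1, -1, device=x.device)
    bits = (x.unsqueeze(-1) >> shifts) & 1   # binary digits, MSB first
    return bits.float() * 2.0 - 1.0          # shift/scale {0, 1} -> {-1, 1}

def analog_bits2int(b: torch.Tensor) -> torch.Tensor:
    """Decode generated analog bits: threshold at 0, then pack the bits."""
    bits = (b > 0).long()                    # the thresholding step
    shifts = torch.arange(bits.shape[-1] - 1, -1, -1, device=b.device)
    return (bits << shifts).sum(dim=-1)

x = torch.tensor([0, 3, 250])                # e.g. pixel intensities, K = 256
assert torch.equal(analog_bits2int(int2analog_bits(x, 8)), x)
```

The round trip is exact for any integer below 2^n_bits, so the only lossy step at sampling time is the thresholding itself.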
Self-Conditioning
However, as shown in Figure 2a, the previously estimated x̃0 is simply discarded when estimating x0 from a new time step, i.e., the denoising function f(x_t, t) does not directly depend on a previously estimated x̃0. Here we consider a slightly different denoising function f(x_t, x̃0, t) that also takes the previously generated sample as its input, illustrated in Figure 2b.
In the existing formulation, each x̃0 estimate is thrown away after use. For Self-Conditioning, this model instead reuses it as an input to the next estimate, and the extra cost is negligible.
In order to train the denoising function f(x_t, x̃0, t), we make some small changes to the training. With some probability (e.g., 50%), we set x̃0 = 0, which falls back to modeling without Self-Conditioning. At other times, we first estimate x̃0 = f(x_t, 0, t) and then use it for Self-Conditioning. Note that we do not backpropagate through the estimated x̃0, so the overall increase in training time is small (e.g., less than 25%).
To train this, a few changes are made: with some probability the model sets x̃0 = 0 and trains without Self-Conditioning; otherwise it first estimates x̃0 = f(x_t, 0, t) and then uses that estimate for Self-Conditioning.
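A rough sketch of this training step, in the same PyTorch style as above. The cosine schedule gamma, the L2 loss on analog bits, and the exact signature of the denoiser f are assumptions for illustration, not the paper's reference implementation:

```python
import math
import torch

def gamma(t: float) -> float:
    """A simple cosine noise schedule (an assumption, not the paper's exact choice)."""
    return math.cos(0.5 * math.pi * t) ** 2

def train_step(f, x0: torch.Tensor, p_uncond: float = 0.5) -> torch.Tensor:
    """One training step with Self-Conditioning; f(x_t, x0_prev, t) predicts x0."""
    t = torch.rand(()).item()                          # sample a time step in [0, 1]
    g = gamma(t)
    x_t = math.sqrt(g) * x0 + math.sqrt(1 - g) * torch.randn_like(x0)

    x0_prev = torch.zeros_like(x0)                     # default: x̃0 = 0, no Self-Conditioning
    if torch.rand(()).item() >= p_uncond:
        with torch.no_grad():                          # no backprop through the first estimate
            x0_prev = f(x_t, torch.zeros_like(x0), t)

    x0_pred = f(x_t, x0_prev, t)
    return ((x0_pred - x0) ** 2).mean()                # L2 loss on the analog bits
```

The no_grad block is what keeps the overhead small: the first pass only produces an input, not gradients.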
Asymmetric Time Intervals
We identify another factor, the time step t, that can also impact Bit Diffusion models.
During a typical reverse diffusion process, the model takes symmetric time intervals (i.e., ∆ as in t → t − ∆) for both the state transition and the time reduction itself, resulting in the same/shared t for both arguments of f(x_t, t). However, we find that, when taking large reverse steps, using asymmetric time intervals, implemented via a simple manipulation of the time scheduling at generation, can lead to improved sampling quality for Bit Diffusion models.
Until now, t was decreased symmetrically in small steps via t → t − ∆, with the same t fed to the network. The authors find that when ∆ is large, simply decoupling the network's time argument from the state-transition time at generation yields better samples.
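A sketch of what one asymmetric reverse step could look like under a DDIM-style update. The offset xi, the schedule, and the update rule are assumptions; the text above only specifies that the time fed to the network is decoupled from the state-transition time:

```python
import math
import torch

def gamma(t: float) -> float:
    """Same assumed cosine schedule as in the training sketch."""
    return math.cos(0.5 * math.pi * t) ** 2

@torch.no_grad()
def reverse_step(f, x_t, x0_prev, t: float, delta: float, xi: float = 0.0):
    """One DDIM-style step t -> t - delta; the network sees the shifted time t + xi."""
    t_query = min(t + xi, 1.0)               # asymmetric: network time != state time
    x0_pred = f(x_t, x0_prev, t_query)

    g_t, g_s = gamma(t), gamma(max(t - delta, 0.0))
    eps = (x_t - math.sqrt(g_t) * x0_pred) / math.sqrt(1 - g_t)
    x_s = math.sqrt(g_s) * x0_pred + math.sqrt(1 - g_s) * eps
    return x_s, x0_pred                      # x0_pred feeds Self-Conditioning next step
```

With xi = 0 this recovers the usual symmetric step, so the asymmetry acts as a pure sampling-time knob that needs no retraining.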