This paper investigates the general applicability of Palette, our implementation of image-to-image diffusion models, to a suite of distinct and challenging tasks, namely colorization, inpainting, uncropping, and JPEG restoration (see Figs. 1, 2). We show that Palette, with no task-specific architecture customization, nor changes to hyper-parameters or the loss, delivers high-fidelity outputs across all four tasks.
We study key components of Palette, including the denoising loss function and the neural net architecture.
We find that while the 𝐿2 [Ho et al. 2020] and 𝐿1 [Chen et al. 2021a] losses in the denoising objective yield similar sample-quality scores, 𝐿2 leads to a higher degree of diversity in model samples, whereas 𝐿1 produces more conservative outputs. We also find that removing the self-attention layers from Palette's U-Net architecture, yielding a fully convolutional model, hurts performance.
Given a training output image 𝒚, we generate a noisy version 𝒚̃, and train a neural network 𝑓𝜃 to denoise 𝒚̃ given 𝒙 and a noise level indicator 𝛾, for which the loss is

$$\mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})}\,\mathbb{E}_{\boldsymbol{\epsilon}\sim\mathcal{N}(0,\boldsymbol{I})}\,\mathbb{E}_{\gamma}\,\Big\| f_{\theta}\big(\boldsymbol{x},\ \sqrt{\gamma}\,\boldsymbol{y}+\sqrt{1-\gamma}\,\boldsymbol{\epsilon},\ \gamma\big)-\boldsymbol{\epsilon}\Big\|_{p}^{p}.$$
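The training step above can be sketched as follows. This is a minimal illustration, not Palette's actual code: `palette_loss` is a hypothetical helper, the uniform sampling of 𝛾 stands in for whatever noise schedule is used in practice, and `f_theta` is any network predicting the added noise from (𝒙, noisy 𝒚, 𝛾).

```python
import numpy as np

def palette_loss(f_theta, x, y, rng, p=2):
    """Sketch of the denoising objective (hypothetical helper, not the paper's code).

    f_theta(x, y_noisy, gamma) is assumed to predict the added noise eps.
    p=2 gives the L2 loss, p=1 the L1 loss discussed in the text.
    """
    eps = rng.standard_normal(y.shape)                # eps ~ N(0, I)
    gamma = rng.uniform(size=(y.shape[0], 1, 1, 1))   # noise-level indicator (placeholder schedule)
    y_noisy = np.sqrt(gamma) * y + np.sqrt(1.0 - gamma) * eps
    pred = f_theta(x, y_noisy, gamma)
    return np.mean(np.abs(pred - eps) ** p)
```

Setting `p=1` would reproduce the more conservative 𝐿1 variant; only the norm in the final line changes.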
While 𝐿1 may be useful to reduce potential hallucinations in some applications, here we adopt 𝐿2 to capture the output distribution more faithfully.
Architecture. Palette uses a U-Net architecture [Ho et al. 2020] with several modifications inspired by recent work [Dhariwal and Nichol 2021; Saharia et al. 2021; Song et al. 2021]. The network architecture is based on the 256×256 class-conditional U-Net model of [Dhariwal and Nichol 2021]. The two main differences between our architecture and theirs are (i) the absence of class conditioning, and (ii) additional conditioning on the source image via concatenation, following [Saharia et al. 2021].
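Conditioning via concatenation means the source image 𝒙 is stacked with the noisy target along the channel axis before the first U-Net layer, so the network's input has twice as many channels. A minimal sketch of this step, assuming NCHW tensors (the function name is ours, not the paper's):

```python
import numpy as np

def concat_condition(x, y_noisy):
    """Source-image conditioning via channel concatenation (illustrative sketch).

    x, y_noisy: arrays of shape (batch, channels, height, width).
    The U-Net's first convolution then sees 2*channels inputs rather
    than requiring any class-conditioning mechanism.
    """
    return np.concatenate([x, y_noisy], axis=1)
```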