Accelerating Score-based Generative Models for High-Resolution Image Synthesis

Hengyuan Ma¹ Li Zhang^{1 *} Xiatian Zhu² Jingfeng Zhang³ Jianfeng Feng¹

¹Fudan University ²University of Surrey ³RIKEN

Paper Code

Abstract

Score-based generative models (SGMs) have recently emerged as a promising class of generative models. The key idea is to produce high-quality images by recurrently adding Gaussian noises and gradients to a Gaussian sample until converging to the target distribution, a.k.a. the diffusion sampling. To ensure stability of convergence in sampling and generation quality, however, this sequential sampling process has to take a small step size and many sampling iterations (e.g., 2000). Several acceleration methods have been proposed with focus on low-resolution generation. In this work, we consider the acceleration of high-resolution generation with SGMs, a more challenging yet more important problem. We prove theoretically that this slow convergence drawback is primarily due to the ignorance of the target distribution. Further, we introduce a novel Target Distribution Aware Sampling (TDAS) method by leveraging the structural priors in space and frequency domains. Extensive experiments on CIFAR-10, CelebA, LSUN, and FFHQ datasets validate that TDAS can consistently accelerate state-of-the-art SGMs, particularly on more challenging high resolution 1024x1024 image generation tasks by up to 18.4x, whilst largely maintaining the synthesis quality. With fewer sampling iterations, TDAS can still generate good quality images. In contrast, the existing methods degrade drastically or even fails completely.

High resolution facial images (FFHQ) with 1024x1024 resolution generated by NCSN++ with TDAS (top two rows) and original NCSN++ sampling (bottom two rows) under a variety of sampling iterations T.

Samples (LSUN church and bedroom) with 256x256 resolution generated by NCSN++ with TDAS (top) and original NCSN++ sampling (bottom) under 400 iteration.

Facial images (CelebA) with 64x64 resolution generated on CelebA using NCSN++ with and without TDAS.

Fréchet inception distance (FID) score of NCSN++ with and without TDAS generating on CIFAR-10 under a variety of sampling iterations T.

Our TDAS encourages the faster convergence of the diffusion process by regulating the initial sample point and added noises at every iteration in both space and frequency domain. After regulation, both the initial sample point and added noises are more similar to the target distribution, reducing the difficulty of convergence to the target distribution especially in the large step size cases.

Accelerating Score-based Generative Models for High-Resolution Image Synthesis

Hengyuan Ma1 Li Zhang1 * Xiatian Zhu2 Jingfeng Zhang3 Jianfeng Feng1

1Fudan University 2University of Surrey 3RIKEN

Paper Code

Abstract

Hengyuan Ma¹ Li Zhang^{1 *} Xiatian Zhu² Jingfeng Zhang³ Jianfeng Feng¹

¹Fudan University ²University of Surrey ³RIKEN