
Starting with the seminal works of Ho et al. [14] and Song et al. [47], which demonstrated that DMs are powerful generative models for image synthesis, the convolutional UNet [39] has been the dominant architecture for diffusion-based image synthesis. However, with the development of foundational DMs [40, 37, 38], the underlying architecture has constantly evolved: from adding self-attention and improved upscaling layers [5], through cross-attention for text-to-image synthesis [38], to pure transformer-based architectures [33].

Table 1: Comparison of SDXL and older Stable Diffusion models.

In line with this trend, and following Hoogeboom et al. [16], we shift the bulk of the transformer computation to lower-level features in the UNet. In particular, and in contrast to the original Stable Diffusion architecture, we use a heterogeneous distribution of transformer blocks within the UNet: for efficiency reasons, we omit the transformer block at the highest feature level, use 2 and 10 blocks at the lower levels, and remove the lowest level (8× downsampling) in the UNet altogether. See Tab. 1 for a comparison between the architectures of Stable Diffusion 1.x & 2.x and SDXL.

We also opt for a more powerful pre-trained text encoder for text conditioning. Specifically, we use OpenCLIP ViT-bigG [19] in combination with CLIP ViT-L [34], concatenating the penultimate text encoder outputs along the channel axis [1]. Besides using cross-attention layers to condition the model on the text input, we follow [30] and additionally condition the model on the pooled text embedding from the OpenCLIP model. These changes result in a UNet with 2.6B parameters (see Tab. 1). The text encoders have a total size of 817M parameters.
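The text-conditioning step described above can be illustrated with a minimal sketch. It only traces shapes: plain Python lists stand in for tensors, and the helper names (`concat_along_channels`, `build_text_conditioning`) are hypothetical. The hidden widths (768 for CLIP ViT-L, 1280 for OpenCLIP ViT-bigG) and the 77-token sequence length are assumptions not stated in this section, and this is not the authors' implementation.

```python
# Assumed hidden widths of the two text encoders and an assumed token count.
# These values are illustrative and are not given in the section above.
CLIP_VIT_L_DIM = 768
OPENCLIP_BIGG_DIM = 1280
NUM_TOKENS = 77


def concat_along_channels(tokens_a, tokens_b):
    """Concatenate two per-token feature sequences along the channel axis."""
    if len(tokens_a) != len(tokens_b):
        raise ValueError("both encoders must produce the same number of tokens")
    # Each token keeps its position; only its feature vector grows.
    return [a + b for a, b in zip(tokens_a, tokens_b)]


def build_text_conditioning(penultimate_l, penultimate_g, pooled_g):
    """Return the cross-attention context and the pooled conditioning vector.

    penultimate_l / penultimate_g: penultimate-layer outputs of the two
    encoders, one feature vector per token.
    pooled_g: pooled text embedding from the OpenCLIP encoder.
    """
    context = concat_along_channels(penultimate_l, penultimate_g)
    return context, list(pooled_g)


# Toy inputs: constant vectors stand in for real encoder outputs.
penultimate_l = [[0.0] * CLIP_VIT_L_DIM for _ in range(NUM_TOKENS)]
penultimate_g = [[1.0] * OPENCLIP_BIGG_DIM for _ in range(NUM_TOKENS)]
pooled_g = [1.0] * OPENCLIP_BIGG_DIM

context, pooled = build_text_conditioning(penultimate_l, penultimate_g, pooled_g)
print(len(context), len(context[0]), len(pooled))
```

Under these assumed widths, the cross-attention context has 768 + 1280 = 2048 channels per token, while the pooled vector keeps the OpenCLIP width and is passed to the model as a separate conditioning signal.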

:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::
