This report presents a preliminary analysis of improvements to the foundation model Stable Diffusion for text-to-image synthesis. While we achieve significant improvements in synthesized image quality, prompt adherence, and composition, we discuss below a few aspects in which we believe the model may be improved further:

• Single stage: Currently, we generate the best samples from SDXL using a two-stage approach with an additional refinement model. This results in having to load two large models into memory, hampering accessibility and sampling speed. Future work should investigate ways to provide a single stage of equal or better quality.

• Text synthesis: While the scale and the larger text encoder (OpenCLIP ViT-bigG [19]) help to improve the text rendering capabilities over previous versions of Stable Diffusion, incorporating byte-level tokenizers [52, 27] or simply scaling the model to larger sizes [53, 40] may further improve text synthesis.

• Architecture: During the exploration stage of this work, we briefly experimented with transformer-based architectures such as UViT [16] and DiT [33], but found no immediate benefit. We remain, however, optimistic that a careful hyperparameter study will eventually enable scaling to much larger transformer-dominated architectures.

• Distillation: While our improvements over the original Stable Diffusion model are significant, they come at the price of increased inference cost (both in VRAM and sampling speed). Future work will thus focus on decreasing the compute needed for inference and increasing sampling speed, for example through guidance distillation [29], knowledge distillation [6, 22, 24], and progressive distillation [41, 2, 29].

• Our model is trained in the discrete-time formulation of [14], and requires offset-noise [11, 25] for aesthetically pleasing results. The EDM framework of Karras et al. [21] is a promising candidate for future model training, as its formulation in continuous time allows for increased sampling flexibility and does not require noise-schedule corrections.
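
The offset-noise correction mentioned in the last item is commonly described as adding a small constant shift, drawn once per sample and channel, to the Gaussian noise used during training. The snippet below is a minimal NumPy sketch of that idea, not the authors' training code: the function name `offset_noise`, the default `offset_strength` of 0.1, and the (batch, channels, height, width) layout are all assumptions made for illustration.

```python
import numpy as np


def offset_noise(shape, offset_strength=0.1, rng=None):
    """Gaussian noise plus a constant shift per (sample, channel).

    Hypothetical helper for illustration; `offset_strength` is an
    assumed value, not one reported in the source text.
    """
    rng = np.random.default_rng() if rng is None else rng
    batch, channels, _height, _width = shape
    # Standard i.i.d. Gaussian noise, as in ordinary diffusion training.
    base = rng.standard_normal(shape)
    # One scalar per sample and channel, broadcast over all pixels,
    # so the mean of each channel is perturbed as well.
    shift = rng.standard_normal((batch, channels, 1, 1))
    return base + offset_strength * shift


# Usage: noise for a batch of 8 four-channel 64x64 latents.
noise = offset_noise((8, 4, 64, 64), rng=np.random.default_rng(0))
print(noise.shape)
```

With `offset_strength=0` the function reduces to plain Gaussian noise, which makes it straightforward to compare the two settings in an experiment.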

:::info This paper is available on arXiv under the CC BY 4.0 DEED license.

:::
