Diffusion models rely on neural networks to iteratively denoise data, and three architectures are commonly used: U-Nets, Transformer-based networks, and ResNet variants. Each architecture addresses specific challenges in modeling the diffusion process, such as capturing spatial relationships, handling sequential denoising steps, or improving training stability. The choice depends on the data type (e.g., images, text) and computational constraints.
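Whichever architecture is chosen, its job at each step is the same: predict the noise in the current sample so the sample can be partially denoised. The sketch below shows one DDPM-style reverse step in NumPy; the noise prediction `eps_pred` is a stand-in array here, where a real model would produce it with a U-Net or Transformer, and the variance choice is one simple option among several.

```python
import numpy as np

rng = np.random.default_rng(0)

def ddpm_denoise_step(x_t, eps_pred, alpha_t, alpha_bar_t, add_noise=True):
    """One reverse (denoising) step of DDPM.

    x_t:        noisy sample at timestep t
    eps_pred:   the network's noise prediction (a stand-in array here;
                in practice this comes from the denoising network)
    alpha_t:    1 - beta_t for this timestep
    alpha_bar_t: cumulative product of alphas up to t
    """
    coef = (1.0 - alpha_t) / np.sqrt(1.0 - alpha_bar_t)
    mean = (x_t - coef * eps_pred) / np.sqrt(alpha_t)
    if add_noise:
        sigma = np.sqrt(1.0 - alpha_t)  # simple choice: sigma_t^2 = beta_t
        return mean + sigma * rng.standard_normal(x_t.shape)
    return mean

x_t = rng.standard_normal((8, 8))   # toy "noisy image"
eps = rng.standard_normal((8, 8))   # pretend network output
x_prev = ddpm_denoise_step(x_t, eps, alpha_t=0.99, alpha_bar_t=0.5)
print(x_prev.shape)  # (8, 8)
```

Running this loop from t = T down to t = 1, with a trained network supplying `eps_pred`, is what turns pure noise into a sample.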
U-Nets are the most widely used architecture for image-based diffusion models, such as Stable Diffusion and DDPM (Denoising Diffusion Probabilistic Models). U-Nets combine an encoder-decoder structure with skip connections, allowing the model to preserve fine-grained details during the denoising process. The encoder downsamples input data to extract high-level features, while the decoder upsamples to reconstruct the output. Skip connections bridge these stages, ensuring local details are retained. Additionally, many implementations integrate attention mechanisms (similar to those in Transformers) into U-Net blocks. For example, Stable Diffusion uses a U-Net with cross-attention layers to condition image generation on text prompts. This hybrid approach balances spatial awareness with the ability to model long-range dependencies.
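The encoder-decoder-plus-skip pattern can be illustrated without any deep-learning framework. This is a minimal one-level sketch in NumPy, not Stable Diffusion's actual U-Net: `np.tanh` stands in for the convolution and attention blocks, and the skip connection is an additive shortcut for simplicity (real U-Nets typically concatenate along channels).

```python
import numpy as np

def downsample(x):
    """2x average pooling (encoder side): halves spatial resolution."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """2x nearest-neighbor upsampling (decoder side)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def tiny_unet(x):
    """One-level U-Net skeleton: encode, process, decode, add skip."""
    skip = x            # saved for the skip connection
    h = downsample(x)   # encoder: coarse, high-level features
    h = np.tanh(h)      # stand-in for conv/attention blocks
    h = upsample(h)     # decoder: restore spatial resolution
    return h + skip     # skip connection reinjects fine detail

x = np.random.default_rng(0).standard_normal((16, 16))
out = tiny_unet(x)
print(out.shape)  # (16, 16)
```

The key point is visible in `tiny_unet`: whatever detail the downsampling path discards is carried around the bottleneck by `skip`, so the output keeps local structure the coarse features alone could not reconstruct.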
Transformer-based architectures are increasingly used in diffusion models, particularly for tasks involving sequential or high-dimensional data. Transformers excel at capturing global context through self-attention, which is useful for modeling relationships across diffusion timesteps or data modalities. For instance, Google’s Imagen employs a Transformer to encode text prompts, which guide the diffusion process handled by a U-Net. Pure Transformer architectures, like those in some video diffusion models, treat data as sequences of tokens, enabling flexible handling of variable-length inputs. However, their computational cost often limits use to specific components (e.g., conditioning networks) rather than the entire denoising pipeline.
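The global-context property comes from self-attention, where every token attends to every other token. Below is a minimal NumPy sketch of scaled dot-product self-attention; the weight matrices are random placeholders for learned projections. The same machinery becomes cross-attention (as in Stable Diffusion's conditioning) when the keys and values are computed from a different sequence, such as text embeddings.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a token sequence.

    x: (seq_len, d_model) embeddings. Each output row is a weighted
    mix of ALL input rows -- the "global context" Transformers bring.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
d = 4
x = rng.standard_normal((6, d))                      # 6 tokens
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (6, 4)
```

Because the score matrix is seq_len x seq_len, cost grows quadratically with sequence length, which is exactly the computational limit the paragraph above mentions.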
ResNet variants, built with residual blocks, are another common choice. These blocks use skip connections to mitigate vanishing gradients, making them stable for deep networks. In diffusion models, ResNets are often integrated into U-Net architectures. For example, the original DDPM paper uses a U-Net composed of ResNet blocks for image generation. Latent diffusion models, such as Stable Diffusion, combine ResNet-based U-Nets with autoencoders: the autoencoder compresses images into a lower-dimensional latent space, reducing compute demands, while the U-Net operates in this space for efficient training and sampling. This approach balances performance and resource usage, making it practical for large-scale applications.
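The residual idea itself is compact: a block computes output = x + F(x), so the identity path gives gradients a direct route through deep stacks. This sketch uses matrix multiplies plus ReLU as a stand-in for the convolutions a real diffusion U-Net block would use.

```python
import numpy as np

def residual_block(x, w1, w2):
    """ResNet-style block: output = x + F(x).

    The identity path lets gradients flow straight through the block,
    mitigating vanishing gradients in deep denoising networks.
    """
    h = np.maximum(0.0, x @ w1)   # "conv" + ReLU stand-in
    h = h @ w2
    return x + h                  # residual (skip) connection

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((4, d))
w1 = rng.standard_normal((d, d)) * 0.1
w2 = rng.standard_normal((d, d)) * 0.1
out = residual_block(x, w1, w2)
print(out.shape)  # (4, 8)
```

Stacking many such blocks inside each U-Net stage, as DDPM does, deepens the network without the training instability a plain deep stack would suffer.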