What loss functions are typically used when training diffusion models?

When training diffusion models, the most common loss functions focus on measuring the difference between the predicted noise and the actual noise added during the diffusion process. The core idea is to train the model to reverse the gradual noising of data by predicting the noise at each step. The primary loss function used is a mean squared error (MSE) loss between the model’s predicted noise and the true noise applied to the data. This approach is straightforward and aligns with the diffusion process’s iterative denoising objective. For example, in the Denoising Diffusion Probabilistic Models (DDPM) framework, the loss simplifies to minimizing the MSE between the model’s output and the noise added at each timestep. This works because the model learns to reconstruct the original data by progressively refining its noise predictions.
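As a concrete illustration, here is a minimal NumPy sketch of this simplified noise-prediction objective. The model is stood in for by a hypothetical `eps_pred_fn` callable, and the array shapes and names are illustrative assumptions, not from any particular codebase:

```python
import numpy as np

def ddpm_simple_loss(x0, eps_pred_fn, alpha_bars, t, rng):
    """Simplified DDPM training loss: MSE between predicted and true noise.

    x0          -- clean data batch, shape (B, D)
    eps_pred_fn -- hypothetical model callable: (x_t, t) -> predicted noise (B, D)
    alpha_bars  -- cumulative products of (1 - beta_t) over the noise schedule, shape (T,)
    t           -- integer timestep index sampled for this batch
    rng         -- numpy Generator used to draw the true noise
    """
    eps = rng.standard_normal(x0.shape)                      # true noise added to the data
    a_bar = alpha_bars[t]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps   # forward (noising) process
    eps_hat = eps_pred_fn(x_t, t)                            # model's noise prediction
    return np.mean((eps_hat - eps) ** 2)                     # the "L_simple" MSE objective
```

In a real training loop, `t` is sampled uniformly per example and the loss is backpropagated through the model; this sketch only shows how the target and prediction line up.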

Some variations of diffusion models use a weighted version of the MSE loss to account for differences in noise levels across timesteps. During training, timesteps where the data is only lightly corrupted might be assigned lower weights, while heavily corrupted timesteps receive higher weights, pushing the model to focus on the harder denoising steps. Notably, the original DDPM paper found that simply dropping the timestep-dependent weights derived from the likelihood bound—weighting every timestep equally—yields a "simplified" objective that is cheaper to compute and trains stably in practice. Additionally, some implementations use a variational lower bound (VLB) loss, which stems from maximizing the likelihood of the data under the model's probabilistic assumptions. However, in practice, the VLB loss is often approximated or replaced with the simpler MSE loss due to its computational efficiency and stable training behavior.
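One way such a timestep weighting can be implemented is to scale each sample's noise MSE by a function of its signal-to-noise ratio (SNR). The sketch below uses a min-SNR-style clipping rule as one illustrative choice; the `gamma` parameter and the function names are assumptions, not part of the original DDPM formulation:

```python
import numpy as np

def weighted_noise_mse(eps_hat, eps, t, alpha_bars, gamma=5.0):
    """Timestep-weighted noise MSE (min-SNR-style weighting, an illustrative choice).

    SNR_t = alpha_bar_t / (1 - alpha_bar_t); the weight min(SNR_t, gamma) / SNR_t
    downweights very low-noise timesteps, where the task is easy, relative to
    noisier ones.
    """
    snr = alpha_bars[t] / (1.0 - alpha_bars[t])       # signal-to-noise ratio at timestep t
    w = np.minimum(snr, gamma) / snr                  # clipped per-timestep weight, <= 1
    per_sample = np.mean((eps_hat - eps) ** 2, axis=1)  # MSE per example
    return np.mean(w * per_sample)                    # weighted average over the batch
```

With uniform weights (`w = 1` everywhere) this reduces to the plain MSE objective described above.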

In specialized cases, developers might combine the MSE loss with other objectives. For example, when training latent diffusion models (like Stable Diffusion), an L1 or perceptual loss can be added to improve the quality of reconstructions in the latent space. However, these additions are secondary to the core noise-prediction loss. The choice of loss function often depends on the specific architecture and trade-offs between training stability and output quality. For most developers, starting with the standard MSE loss—as implemented in frameworks like Hugging Face’s Diffusers or OpenAI’s codebases—provides a reliable baseline. Practical implementations typically involve scaling the loss by timestep-specific factors or adjusting learning rates to balance convergence speed and stability.
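A combined objective of this kind is typically just a weighted sum of terms. The sketch below pairs the core noise-prediction MSE with an auxiliary L1 reconstruction term; the `l1_weight` value and all names are hypothetical, chosen only to show the structure:

```python
import numpy as np

def combined_latent_loss(eps_hat, eps, z_rec, z_target, l1_weight=0.1):
    """Noise-prediction MSE plus an auxiliary L1 reconstruction term.

    eps_hat / eps     -- predicted and true noise in latent space
    z_rec / z_target  -- reconstructed and reference latents for the auxiliary term
    l1_weight         -- illustrative scalar balancing the secondary objective
    """
    mse = np.mean((eps_hat - eps) ** 2)      # primary noise-prediction loss
    l1 = np.mean(np.abs(z_rec - z_target))   # secondary reconstruction loss
    return mse + l1_weight * l1
```

Keeping the auxiliary weight small preserves the noise-prediction term as the dominant training signal, matching the point above that such additions are secondary.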
