How is the reverse process learned during training?

The reverse process in diffusion models is learned through a training procedure that teaches the model to iteratively remove noise from data. During training, the model observes how noise is progressively added to data samples (the forward process) and learns to predict how to reverse this. The core idea is to train a neural network to estimate the noise added at each step of the forward process; applying those estimates in reverse order lets the model reconstruct the original data. This is typically done using a loss function that compares the model’s predicted noise to the actual noise added during the forward pass. By minimizing this difference across many training examples, the model learns a step-by-step denoising procedure.
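The loop above can be sketched in a few lines. This is a minimal illustration of the standard DDPM-style objective, assuming a linear beta schedule; the names (`q_sample`, `training_loss`) are illustrative, and `model` stands in for a real neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # forward-process noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # cumulative signal retained at step t

def q_sample(x0, t, eps):
    """Forward process: noise x0 to timestep t in closed form."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def training_loss(model, x0):
    """Simplified DDPM loss: MSE between actual and predicted noise."""
    t = int(rng.integers(0, T))          # random timestep for this example
    eps = rng.standard_normal(x0.shape)  # the noise actually added
    x_t = q_sample(x0, t, eps)
    eps_pred = model(x_t, t)             # the network predicts that noise
    return np.mean((eps_pred - eps) ** 2)

# A stand-in "model" that always predicts zero noise; a real model is a
# neural network trained to drive this loss toward zero.
x0 = rng.standard_normal((8, 8))
loss = training_loss(lambda x_t, t: np.zeros_like(x_t), x0)
```

Minimizing this loss over many `(x0, t, eps)` triples is what teaches the network the per-step denoising behavior described above.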

The training process involves two key components: timestep conditioning and noise prediction. Each training example is paired with a randomly selected timestep (t), which corresponds to a specific noise level in the forward process. The model is conditioned on this timestep, allowing it to adjust its predictions based on how much noise needs to be removed at each step. For example, early timesteps might require the model to predict subtle noise patterns, while later timesteps involve larger corrections. Architectures like U-Net are often used because they efficiently process spatial data (e.g., images) and can incorporate timestep information via embeddings. The model doesn’t learn a single-step reversal but instead learns a sequence of small denoising operations that collectively reverse the entire forward process.
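Timestep conditioning is commonly implemented with sinusoidal embeddings, in the style of transformer positional encodings. A minimal sketch (real U-Nets typically pass this vector through a small MLP before injecting it into each block; the function name is illustrative):

```python
import numpy as np

def timestep_embedding(t, dim):
    """Map a scalar timestep t to a `dim`-dimensional sinusoidal vector.

    Uses geometrically spaced frequencies so that nearby timesteps get
    similar embeddings while distant ones remain distinguishable.
    `dim` must be even.
    """
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = timestep_embedding(500, 64)  # a 64-dim vector encoding "step 500"
```

Conditioning on this vector is what lets a single network handle all noise levels instead of training one model per timestep.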

A practical example is training a model to generate images. Suppose the forward process adds Gaussian noise to an image over 1,000 steps. During training, the model receives a noisy image from a random step (e.g., step 500) and predicts the noise component of that image. If the prediction matches the actual noise, the original image can be recovered in closed form by subtracting the appropriately scaled noise from the noisy image. Repeating this across all timesteps teaches the model to “chain” these predictions, enabling it to start from pure noise and iteratively refine it into a coherent image. Techniques like cosine noise scheduling or learned variances can further optimize how noise levels are distributed across timesteps. Because the model learns the underlying data distribution through noise prediction rather than memorizing specific examples, it generalizes to unseen data.
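The cosine schedule mentioned above, and the closed-form recovery at step 500, can be sketched as follows. This assumes the standard cosine formulation for the cumulative noise level (the offset `s=0.008` is the commonly cited default); the random 8×8 array stands in for an image.

```python
import numpy as np

def cosine_alpha_bars(T, s=0.008):
    """Cosine schedule for the cumulative alpha_bar values: noise is added
    gently at early steps and more aggressively toward the end."""
    steps = np.arange(T + 1)
    f = np.cos(((steps / T + s) / (1 + s)) * np.pi / 2) ** 2
    return f[1:] / f[0]

rng = np.random.default_rng(0)
T = 1000
alpha_bars = cosine_alpha_bars(T)

# Forward-noise an "image" to step 500 in one closed-form jump:
x0 = rng.standard_normal((8, 8))
t = 500
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps

# If the model's noise prediction were exact, the clean image comes back
# by inverting that formula (here we substitute the true noise `eps`):
x0_hat = (x_t - np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
```

A trained model only approximates `eps`, so in practice sampling removes the predicted noise a little at a time over many steps rather than in one jump.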
