How are sinusoidal embeddings implemented in diffusion models?

Sinusoidal embeddings in diffusion models encode timesteps as continuous vector representations, helping the model track the progression of the denoising process. These embeddings convert discrete timesteps (e.g., t=0 to t=999 in a 1000-step diffusion process) into high-dimensional vectors using sine and cosine functions. The key idea is to create a smooth, periodic representation of time that the neural network can process more effectively than raw integer timesteps. For example, a timestep t is mapped to a vector whose dimensions alternate between sine and cosine values at frequencies forming a geometric progression, allowing the model to capture both fine-grained and broad temporal patterns.
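As a minimal sketch of this mapping in PyTorch (the function name and the even/odd interleaving convention are illustrative choices, not a specific library's API):

```python
import math
import torch

def sinusoidal_embedding(timesteps: torch.Tensor, dim: int) -> torch.Tensor:
    """Map integer timesteps of shape [B] to embeddings of shape [B, dim] (dim assumed even)."""
    half = dim // 2
    # Geometric progression of frequencies: 1/10000^(2i/dim) for i = 0..half-1
    exponents = torch.arange(half, dtype=torch.float32, device=timesteps.device) / half
    freqs = torch.exp(-math.log(10000.0) * exponents)
    # Outer product of timesteps and frequencies: [B, half]
    args = timesteps.float()[:, None] * freqs[None, :]
    # Sine fills the even indices, cosine the odd indices
    emb = torch.zeros(timesteps.shape[0], dim, device=timesteps.device)
    emb[:, 0::2] = torch.sin(args)
    emb[:, 1::2] = torch.cos(args)
    return emb
```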

The implementation typically involves two steps. First, a scalar timestep t is projected into a higher-dimensional space using a geometric sequence of frequencies: for each of the d/2 frequency indices i, the frequency is 1/10000^(2i/d), where d is the total embedding dimension. The sine of t times each frequency is placed at the even indices of the resulting vector and the cosine at the odd indices. This creates a unique pattern for each timestep while maintaining continuity between adjacent steps. In PyTorch, this is usually implemented with tensor operations that generate the frequency bands and apply element-wise trigonometric functions, as in the sketch above. The final embedding is then passed through a linear layer or MLP to align it with the model's hidden dimensions before being added to feature maps in the U-Net architecture commonly used in diffusion models.
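Building on the sinusoidal_embedding helper sketched above, the second step (the learned projection) might look like the following; the TimeEmbedding class name, SiLU activation, and dimension choices are illustrative assumptions rather than any particular framework's implementation:

```python
import torch
import torch.nn as nn

class TimeEmbedding(nn.Module):
    """Sinusoidal features followed by a small MLP, used to condition U-Net blocks."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.dim = dim
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.SiLU(),  # SiLU/Swish is a common choice in diffusion U-Nets
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, timesteps: torch.Tensor) -> torch.Tensor:
        # Reuses the sinusoidal_embedding helper defined in the earlier sketch
        return self.mlp(sinusoidal_embedding(timesteps, self.dim))

# Usage: embed a batch of timesteps drawn from a 1000-step schedule
t = torch.randint(0, 1000, (8,))
t_emb = TimeEmbedding(dim=128, hidden_dim=512)(t)  # shape [8, 512]
```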

These embeddings are critical because diffusion models rely on conditioning each denoising step on the current timestep. For example, in a text-to-image model like Stable Diffusion, the time embedding helps the U-Net distinguish between early steps (where coarse structure is generated) and later steps (where fine details are refined). The sinusoidal approach avoids abrupt transitions between timesteps, unlike learned embeddings, which might struggle to generalize to unseen timesteps. By providing a structured yet flexible time representation, sinusoidal embeddings enable the model to smoothly adapt its behavior across the entire diffusion process, improving both training stability and generation quality.
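For a concrete picture of where the embedding enters the network, here is a hypothetical, stripped-down residual block that broadcast-adds the projected time signal onto its feature maps; real U-Net blocks (including Stable Diffusion's) add normalization and further layers, so treat this only as a sketch of the conditioning pattern:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Simplified U-Net residual block conditioned on a time embedding."""
    def __init__(self, channels: int, t_dim: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.t_proj = nn.Linear(t_dim, channels)  # align embedding with channel count
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        h = self.act(self.conv1(x))
        # Broadcast the time signal over spatial dimensions: [B, C] -> [B, C, 1, 1]
        h = h + self.t_proj(t_emb)[:, :, None, None]
        return x + self.conv2(self.act(h))
```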
