To implement cosine annealing with warm restarts, you need a learning rate schedule that combines periodic resets with a cosine-shaped decay. This technique adjusts the learning rate (LR) during training by following a cosine curve that restarts to a higher value at predefined intervals. The goal is to help the model escape local minima and converge faster: each restart pushes the LR back up, and the cosine decay then anneals it down again. Libraries like PyTorch and TensorFlow have built-in classes (e.g., `CosineAnnealingWarmRestarts` in PyTorch) to simplify implementation, but you can also create a custom scheduler.
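If you do write a custom scheduler, the underlying schedule is the SGDR formula from Loshchilov & Hutter (2017): within a cycle of length `T_i`, the LR after `T_cur` epochs of that cycle is `eta_min + (eta_max - eta_min) * (1 + cos(pi * T_cur / T_i)) / 2`. Here is a minimal sketch of that formula as a standalone function (the function name and default values are illustrative, not a library API):

```python
import math

def cosine_restart_lr(epoch, base_lr=0.1, eta_min=1e-6, T_0=10, T_mult=2):
    """Illustrative helper: LR at a given epoch under cosine annealing
    with warm restarts (SGDR)."""
    T_i, t = T_0, epoch
    # Skip past completed cycles to find the position within the current one.
    while t >= T_i:
        t -= T_i
        T_i *= T_mult
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * t / T_i)) / 2
```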
For example, in PyTorch, you can use `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`. Initialize it with parameters like `T_0` (the number of epochs until the first restart) and `T_mult` (a multiplier that lengthens the restart interval after each cycle). The learning rate starts at an initial value, decays along a cosine curve until the restart point, then resets and repeats with a longer cycle if `T_mult > 1`. Here's a snippet:
```python
import torch

# `model` is assumed to be defined elsewhere.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2, eta_min=1e-6
)

for epoch in range(100):
    # Training loop...
    scheduler.step()
```
In this example, the first cycle runs for 10 epochs (`T_0`), the next for 20 (`T_0 * T_mult`), and so on. `eta_min` sets the minimum LR the schedule decays to within each cycle. Calling `scheduler.step()` once per epoch updates the LR automatically.
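To sanity-check the cycle boundaries before committing to a long run, you can trace the schedule with a throwaway parameter; a minimal sketch (the dummy parameter just lets the optimizer exist without a real model):

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))  # stand-in for real model parameters
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2, eta_min=1e-6
)

for epoch in range(40):
    optimizer.step()   # no-op here; keeps the scheduler/optimizer step order valid
    scheduler.step()
    print(f"epoch {epoch:2d}: lr = {scheduler.get_last_lr()[0]:.6f}")
# The LR decays toward eta_min over the first 10 epochs, jumps back to 0.1
# at the restart, then repeats over a 20-epoch cycle.
```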
The key considerations are tuning `T_0` and `T_mult` to match your dataset and model size. Shorter cycles (small `T_0`) work well for small datasets or when training time is limited, while longer cycles suit larger models. The warm-restart mechanism helps avoid stagnation: if validation loss plateaus, for instance, the LR reset can push the optimizer to explore new regions of the loss landscape. However, frequent restarts might destabilize training, so monitor performance closely during initial experiments. This approach is particularly effective in scenarios like semi-supervised learning or training with noisy data, where the periodic LR increases can help the optimizer break out of solutions shaped by misleading gradients.