The learning rate schedule used during fine-tuning typically starts from a low rate, adjusts it over the course of training, and often includes a warmup phase. This approach balances stability and adaptability: a low rate early on prevents drastic changes to the pre-trained model’s weights, while the later adjustments help the model converge. Common schedules include linear decay, cosine annealing, and step-based decay, often paired with a warmup period. For example, a linear schedule with warmup might ramp the rate up from near zero to a peak such as 2e-5 during warmup, then decrease it linearly to zero by the end of training. These choices depend on factors like dataset size, task complexity, and how similar the fine-tuning task is to the model’s original pre-training.
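To make the three schedule shapes named above concrete, here is a minimal pure-Python sketch. The function names, the 1,000-step horizon, and the step-decay settings (`drop_every`, `gamma`) are illustrative assumptions, not values from any particular library; warmup is left out so that only the decay shapes are compared.

```python
import math

def linear_decay(step, total_steps, peak_lr):
    """Decrease linearly from peak_lr at step 0 to 0 at total_steps."""
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / total_steps

def cosine_annealing(step, total_steps, peak_lr, min_lr=0.0):
    """Follow half a cosine wave from peak_lr down to min_lr."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

def step_decay(step, peak_lr, drop_every, gamma=0.5):
    """Multiply the rate by gamma once every drop_every steps (hypothetical settings)."""
    return peak_lr * gamma ** (step // drop_every)

if __name__ == "__main__":
    # Print the three schedules side by side at a few checkpoints.
    for s in (0, 250, 500, 750, 1000):
        print(
            s,
            linear_decay(s, 1000, 2e-5),
            cosine_annealing(s, 1000, 2e-5),
            step_decay(s, 2e-5, drop_every=250),
        )
```

Compared with linear decay, the cosine curve stays closer to the peak early on and flattens out near the end, while step decay holds the rate constant between abrupt drops.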
Implementation details vary by framework, but libraries like PyTorch and TensorFlow provide built-in tools. In PyTorch, torch.optim.lr_scheduler.CosineAnnealingLR decays the rate along a cosine curve from the initial rate to a minimum value (eta_min) over T_max steps. The related CosineAnnealingWarmRestarts scheduler periodically resets the rate to its initial value, which may help the optimizer escape poor local minima. Hugging Face’s Transformers library defaults to a linear decay schedule with the AdamW optimizer in its Trainer; warmup is optional there and is commonly set to a fraction of the total steps (e.g., 10%). For instance, when training for 1,000 steps with 100 warmup steps, the learning rate increases linearly from 0 to 2e-5 during the first 100 steps, then decreases linearly to 0 over the remaining 900. This balances early stability (small initial updates let the model adjust gently) and later refinement (slowing updates to stabilize training).
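The 1,000-step example above can be worked through in a few lines of plain Python. This is a sketch under stated assumptions: the helper names (`linear_warmup_decay_factor`, `lr_at`) are hypothetical, and the function returns a multiplier on the peak rate rather than the rate itself.

```python
def linear_warmup_decay_factor(step, warmup_steps, total_steps):
    """Multiplier in [0, 1] applied to the peak learning rate at a given step."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to 1.
        return step / max(1, warmup_steps)
    # Decay: fall linearly from 1 at the end of warmup to 0 at total_steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Values taken from the example in the text.
PEAK_LR = 2e-5
WARMUP_STEPS = 100
TOTAL_STEPS = 1000

def lr_at(step):
    """Learning rate at a given step for the example configuration."""
    return PEAK_LR * linear_warmup_decay_factor(step, WARMUP_STEPS, TOTAL_STEPS)

if __name__ == "__main__":
    for s in (0, 50, 100, 550, 1000):
        print(s, lr_at(s))
```

In PyTorch, a multiplier function of this kind can be passed as lr_lambda to torch.optim.lr_scheduler.LambdaLR, with warmup_steps and total_steps bound beforehand and the optimizer’s lr set to the peak rate. Transformers offers the same shape through get_linear_schedule_with_warmup, so writing it by hand is mainly useful for understanding or customizing the curve.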
The choice of schedule depends on practical considerations. Smaller datasets or tasks closely aligned with the pre-training objective (e.g., refining a text classifier for a similar domain) often work well with simpler schedules like linear decay. For larger datasets or divergent tasks (e.g., adapting a language model for code generation), cosine annealing or compound schedules might perform better. Additionally, very large models (on the scale of GPT-3 or T5-XXL) are often fine-tuned with conservative rates (e.g., on the order of 1e-5 to 5e-5 with AdamW, though the right range depends on the optimizer) and longer warmups to avoid destabilizing pre-trained features. Experimentation is key: developers might start with a linear decay + warmup baseline, then test alternatives if convergence is slow or unstable. Monitoring validation loss during training helps identify whether the schedule needs adjustment; for example, a loss spike early in training suggests extending the warmup.
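One way to turn the "loss spikes early" check into something mechanical is a small heuristic like the one below. It is only a sketch: the function name, the window of three evaluations, and the 1.2x tolerance are arbitrary assumptions that would need tuning for a real run.

```python
def spikes_early(val_losses, early_evals=3, tolerance=1.2):
    """Return True if any of the first few evaluations after the initial one
    exceeds `tolerance` times the initial validation loss (hypothetical heuristic)."""
    if len(val_losses) < 2:
        return False  # Not enough evaluations to compare.
    baseline = val_losses[0]
    window = val_losses[1:early_evals + 1]
    return any(loss > tolerance * baseline for loss in window)

if __name__ == "__main__":
    # Made-up loss values, for illustration only.
    print(spikes_early([2.0, 3.1, 1.8, 1.5]))  # early spike
    print(spikes_early([2.0, 1.7, 1.5, 1.4]))  # smooth descent
```

If the check fires, reasonable next experiments are a longer warmup or a lower peak rate, changing one at a time so the effect is attributable.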