
How do baseline functions reduce variance in policy gradient methods?

Baseline functions reduce variance in policy gradient methods by providing a reference point against which sampled returns are compared. In policy gradient algorithms like REINFORCE, the gradient estimate depends on returns sampled from trajectories, which can vary widely due to stochastic environments or stochastic policy behavior. High variance in these estimates leads to unstable updates and slows learning. A baseline subtracts a learned or fixed value (often an estimate of the expected return from a state) from the observed return, creating a "centered" signal that indicates whether an action performed better or worse than expected. This adjustment shrinks the fluctuations in gradient updates while leaving the expected gradient unbiased, making training more stable.
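To make this concrete, here is a minimal sketch in Python (NumPy) of a REINFORCE-style score-function estimator on a hypothetical 3-armed bandit. The reward means, policy parameterization, and sample count are all made up for illustration; the point is that subtracting a constant baseline leaves the average gradient roughly unchanged while the empirical variance of the per-sample gradients drops.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed bandit: one state, softmax policy over actions.
theta = np.zeros(3)                      # policy parameters (logits)
true_means = np.array([1.0, 1.5, 0.5])   # assumed per-action reward means (illustrative)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    # Gradient of log softmax(theta)[a] w.r.t. the logits: one_hot(a) - pi
    g = -softmax(theta)
    g[a] += 1.0
    return g

def sample_gradients(theta, baseline=0.0, n=5000):
    """Collect n single-sample policy gradient estimates grad_log_pi * (r - baseline)."""
    grads = []
    for _ in range(n):
        pi = softmax(theta)
        a = rng.choice(3, p=pi)
        r = rng.normal(true_means[a], 1.0)           # noisy reward
        grads.append(grad_log_pi(theta, a) * (r - baseline))
    grads = np.array(grads)
    return grads.mean(axis=0), grads.var(axis=0).sum()

mean_no_b, var_no_b = sample_gradients(theta, baseline=0.0)
mean_b, var_b = sample_gradients(theta, baseline=true_means.mean())
print("no baseline:  mean grad =", mean_no_b, " total variance =", var_no_b)
print("with baseline: mean grad =", mean_b, " total variance =", var_b)
```

Running this, both mean gradients point in essentially the same direction, but the baselined estimator shows noticeably lower total variance, which is exactly the effect described above.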

A common example is using a state-value function as the baseline. For instance, in the Advantage Actor-Critic (A2C) algorithm, the critic network estimates the value of a state, V(s), which represents the expected return from that state. The advantage, calculated as the actual return G_t minus V(s), measures how much better or worse an action is than the baseline predicts. Since V(s) accounts for the inherent value of the state, the advantage (G_t - V(s)) has lower variance than G_t alone. This works because the baseline is state-dependent: it adapts to the differing expectations of different states, unlike a global average. Another example is REINFORCE with a simple moving-average baseline, where the baseline is the mean return across recent episodes. While less sophisticated, this still reduces variance by normalizing the scale of updates.
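As a sketch of how the advantage is formed in practice, the snippet below computes Monte Carlo returns for one episode and subtracts critic value estimates. The reward sequence, the value estimates, and the helper name returns_and_advantages are illustrative assumptions, not part of any specific library.

```python
import numpy as np

def returns_and_advantages(rewards, values, gamma=0.99):
    """Compute discounted returns G_t and advantages G_t - V(s_t) for one episode
    (Monte Carlo return targets with a state-value baseline, A2C-style)."""
    G = 0.0
    returns = np.zeros_like(rewards, dtype=float)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G      # accumulate discounted return backward in time
        returns[t] = G
    advantages = returns - values        # baseline subtraction
    return returns, advantages

# Toy episode: made-up rewards and hypothetical critic outputs V(s_t).
rewards = np.array([0.0, 0.0, 1.0])
values  = np.array([0.5, 0.6, 0.9])
G, A = returns_and_advantages(rewards, values)
print("returns:   ", G)
print("advantages:", A)
```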

Implementing a baseline requires balancing computational cost and effectiveness. For instance, a learned state-value baseline (as in A2C) adds complexity by requiring a separate neural network to estimate ( V(s) ), but it provides precise, adaptive variance reduction. In contrast, a fixed baseline (e.g., a running average) is simpler but less effective in environments where states have widely differing reward potentials. Crucially, the baseline must not depend on the current action to avoid introducing bias. By focusing gradient updates on the “surprise” component of rewards (i.e., deviations from the baseline), the policy learns faster with fewer erratic parameter changes. This principle underpins many modern algorithms, making baselines a foundational tool for improving policy gradient efficiency.
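The simpler running-average option mentioned above can be sketched as follows. The class name, smoothing factor, and episode returns are illustrative assumptions rather than a specific API; the "centered" value (return minus baseline) is what would scale the log-probability gradient in a REINFORCE update.

```python
class RunningAverageBaseline:
    """Action-independent baseline: exponential moving average of episode returns,
    a cheap alternative to training a separate critic network."""
    def __init__(self, alpha=0.05):
        self.alpha = alpha
        self.value = 0.0
        self.initialized = False

    def update(self, episode_return):
        if not self.initialized:
            self.value = episode_return
            self.initialized = True
        else:
            self.value += self.alpha * (episode_return - self.value)
        return self.value

baseline = RunningAverageBaseline()
for ep_return in [1.0, 0.0, 2.0, 1.5]:        # made-up episode returns
    b = baseline.update(ep_return)
    centered = ep_return - b                   # signal that scales grad log pi
    print(f"return={ep_return:.2f}  baseline={b:.2f}  centered={centered:+.2f}")
```

Because the running average ignores which state the episode visited, it removes only the global offset in returns; the learned V(s) baseline additionally removes state-to-state differences, which is why it reduces variance more in environments with heterogeneous states.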
