How does Upper Confidence Bound (UCB) work in RL?

Upper Confidence Bound (UCB) is a strategy in reinforcement learning (RL) that balances exploration and exploitation when making decisions under uncertainty. It is widely used in problems like the multi-armed bandit, where an agent must choose between actions with unknown reward distributions. The core idea is to prioritize actions based on both their current estimated value and the uncertainty around that estimate. By adding an “optimism bonus” to under-explored actions, UCB encourages the agent to try options that might yield higher rewards while still exploiting known high-reward actions. This reduces regret—the difference between the optimal cumulative reward and what the agent actually achieves—by systematically addressing uncertainty.

The UCB algorithm computes a score for each action using a formula that combines the average observed reward and an exploration term. For an action \(a\), the score is calculated as:

\[ \text{UCB}(a) = \text{average\_reward}(a) + \sqrt{\frac{2 \ln T}{n(a)}} \]

Here, \(T\) is the total number of steps taken, and \(n(a)\) is how many times action \(a\) has been selected. The first term, \(\text{average\_reward}(a)\), represents exploitation (using current knowledge), while the second, square-root term quantifies uncertainty, favoring actions with fewer trials. As \(n(a)\) increases, the exploration bonus shrinks, shifting focus toward exploitation. For example, in a news recommendation system, articles with high click-through rates (exploitation) are shown more often, but UCB ensures newer articles (with higher uncertainty) are occasionally tested to avoid missing better options.
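To make the scoring rule concrete, here is a minimal sketch of UCB1 on a simulated three-armed bandit with Bernoulli rewards. The reward probabilities, step count, and names such as `true_rates` and `ucb_score` are illustrative assumptions for this example, not part of any particular library.

```python
import math
import random

# Minimal UCB1 sketch on a simulated three-armed bandit with Bernoulli rewards.
# The per-arm success probabilities below are illustrative assumptions.
true_rates = [0.05, 0.08, 0.12]
counts = [0] * len(true_rates)       # n(a): how often each arm has been pulled
totals = [0.0] * len(true_rates)     # cumulative observed reward per arm

def ucb_score(arm: int, step: int) -> float:
    """average_reward(a) + sqrt(2 * ln(T) / n(a)); unpulled arms get priority."""
    if counts[arm] == 0:
        return float("inf")          # ensure every arm is tried at least once
    avg = totals[arm] / counts[arm]
    bonus = math.sqrt(2 * math.log(step) / counts[arm])
    return avg + bonus

T = 10_000
for step in range(1, T + 1):
    # Pick the arm with the highest UCB score, observe a 0/1 reward, update stats.
    arm = max(range(len(true_rates)), key=lambda a: ucb_score(a, step))
    reward = 1.0 if random.random() < true_rates[arm] else 0.0
    counts[arm] += 1
    totals[arm] += reward

print("pull counts per arm:", counts)   # pulls should concentrate on the best arm
```

Running this, the exploration bonus keeps the weaker arms from being abandoned too early, but as their counts grow the bonus shrinks and most pulls shift to the highest-reward arm.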

A practical use case is A/B testing, where UCB dynamically allocates traffic to webpage variants. Early on, it explores all variants equally, but as data accumulates, it shifts traffic to better-performing ones while still testing others. In RL, UCB principles extend to algorithms like Monte Carlo Tree Search (MCTS), where nodes in a decision tree are selected based on their UCB scores to balance exploring new states and exploiting known paths. While UCB is simple to implement, it assumes stationary reward distributions and requires tracking action counts and rewards. For non-stationary environments (e.g., changing user preferences), modifications like sliding windows or decay factors are needed. Despite these limitations, UCB remains a foundational method for managing exploration-exploitation trade-offs.
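As a rough illustration of one such modification, the sketch below uses a sliding-window UCB variant on a two-armed bandit whose reward probabilities change partway through. The window size, drift schedule, and helper names (`sliding_ucb_score`, `true_rate`) are hypothetical choices made for this demonstration.

```python
import math
import random
from collections import deque

# Sketch of a sliding-window UCB variant for non-stationary rewards: only the
# most recent `window` observations of each arm enter its average and count.
# Window size, drift schedule, and arm probabilities are illustrative assumptions.
window = 500
history = [deque(maxlen=window) for _ in range(2)]   # recent rewards per arm

def sliding_ucb_score(arm: int, step: int) -> float:
    n = len(history[arm])
    if n == 0:
        return float("inf")                          # try every arm at least once
    avg = sum(history[arm]) / n
    return avg + math.sqrt(2 * math.log(step) / n)

def true_rate(arm: int, step: int) -> float:
    # Hypothetical drift: arm 0 is better early on, arm 1 overtakes it at step 5,000.
    return [0.10, 0.05][arm] if step < 5_000 else [0.05, 0.15][arm]

late_pulls = [0, 0]                                  # pulls after the drift point
for step in range(1, 10_001):
    arm = max(range(2), key=lambda a: sliding_ucb_score(a, step))
    reward = 1.0 if random.random() < true_rate(arm, step) else 0.0
    history[arm].append(reward)
    if step >= 5_000:
        late_pulls[arm] += 1

print("pulls after the drift:", late_pulls)          # should favor arm 1
```

Because stale observations fall out of the window, the estimate for the formerly best arm drops after the drift, and the exploration bonus pulls traffic back toward the arm that has become better.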
