What is Thompson Sampling?

Thompson Sampling is a probabilistic algorithm used to balance exploration and exploitation in decision-making problems, particularly in scenarios known as multi-armed bandits. The term “multi-armed bandit” refers to a situation where you must repeatedly choose between multiple actions (e.g., different website layouts, ad variants) with uncertain rewards. The goal is to maximize cumulative rewards over time by learning which actions perform best. Thompson Sampling addresses this by maintaining probability distributions over the potential outcomes of each action and using random sampling to guide decisions. Unlike rigid methods like A/B testing, which split traffic evenly, it dynamically allocates more trials to better-performing options while still exploring alternatives.

The algorithm works by assigning a prior probability distribution (e.g., Beta distribution for binary outcomes) to each action’s success rate. For example, if testing three ad variants, each variant starts with a Beta distribution reflecting initial assumptions about its click-through rate. In each round, the algorithm samples a value from each distribution, selects the action with the highest sampled value, and observes the outcome (e.g., a click or no click). The distribution for the chosen action is then updated based on the observed result. Over time, these distributions become more accurate, allowing the algorithm to converge on the optimal action. This approach naturally balances exploration (testing uncertain options) and exploitation (using known good options) without requiring manual tuning of parameters like exploration rates.
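To make the loop concrete, here is a minimal sketch of Beta-Bernoulli Thompson Sampling for three ad variants. The true click-through rates (true_ctr) are hypothetical values used only to simulate feedback; in a real system the reward would come from observed user clicks.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

true_ctr = [0.04, 0.06, 0.05]     # hidden click-through rates (assumed, for simulation only)
n_arms = len(true_ctr)
alpha = np.ones(n_arms)           # Beta prior: 1 + observed successes per variant
beta = np.ones(n_arms)            # Beta prior: 1 + observed failures per variant

for _ in range(10_000):
    # Sample a plausible CTR for each variant from its current posterior.
    samples = rng.beta(alpha, beta)
    arm = int(np.argmax(samples))         # show the variant with the highest sample

    # Observe a (simulated) click or no click and update that variant's posterior.
    clicked = rng.random() < true_ctr[arm]
    alpha[arm] += clicked
    beta[arm] += 1 - clicked

print("Posterior mean CTR per variant:", alpha / (alpha + beta))
```

Because each variant starts with a uniform Beta(1, 1) prior, early rounds spread traffic widely; as evidence accumulates, the posteriors sharpen and most draws go to the best-performing variant, with occasional samples still landing on the others.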

Developers can apply Thompson Sampling in scenarios such as online advertising, recommendation systems, or clinical trials. For instance, a streaming service might use it to test different recommendation algorithms, allocating more users to the algorithm with the highest sampled engagement metrics. Implementation typically involves modeling each action’s rewards (e.g., using Beta-Bernoulli for clicks or Gaussian for continuous metrics) and updating beliefs incrementally. In Python, libraries like numpy can generate samples from Beta distributions, while frameworks like TensorFlow Probability simplify Bayesian updates. Key advantages include scalability, adaptability to changing environments, and efficient use of data. Developers should ensure proper initialization of priors and monitor convergence to avoid biases in long-running systems.
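For continuous metrics such as watch time, the same idea applies with a Gaussian model instead of Beta-Bernoulli. The sketch below assumes Gaussian rewards with a known noise variance and a conjugate Gaussian prior on each arm's mean; the true_means values are hypothetical simulation inputs, not part of any specific library API.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

true_means = [4.2, 5.0, 4.6]      # hidden average engagement per algorithm (assumed)
noise_std = 1.0                   # assumed known reward noise
n_arms = len(true_means)

# Gaussian posterior over each arm's mean, starting from a broad prior.
post_mean = np.zeros(n_arms)
post_var = np.full(n_arms, 100.0)

for _ in range(5_000):
    # Draw one plausible mean per arm and route this user to the largest.
    samples = rng.normal(post_mean, np.sqrt(post_var))
    arm = int(np.argmax(samples))

    # Observe a noisy engagement reward and apply the conjugate Gaussian update.
    reward = rng.normal(true_means[arm], noise_std)
    precision = 1.0 / post_var[arm] + 1.0 / noise_std**2
    post_mean[arm] = (post_mean[arm] / post_var[arm] + reward / noise_std**2) / precision
    post_var[arm] = 1.0 / precision

print("Posterior mean engagement per algorithm:", post_mean)
```

The prior variance of 100.0 here is an arbitrary "broad" choice; in practice, priors should reflect whatever is known about plausible metric ranges, and posteriors should be monitored so stale beliefs do not bias a long-running system.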
