SARSA (State-Action-Reward-State-Action) is an on-policy temporal difference (TD) reinforcement learning algorithm used to learn optimal policies for decision-making in environments with delayed rewards. It updates action-value estimates (Q-values) based on the agent’s experiences as it interacts with the environment. Unlike off-policy methods like Q-learning, SARSA learns the value of the policy it is currently following, including any exploration strategies like ε-greedy. The name reflects the sequence of events used in updates: the agent observes a state (S), takes an action (A), receives a reward (R), transitions to a new state (S’), and selects the next action (A’) under the current policy before updating Q-values.
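Because SARSA evaluates the policy it actually follows, the action-selection step is part of what gets learned. A minimal sketch of ε-greedy selection over a tabular Q array is shown below; the NumPy table layout, integer state indices, and ε value are illustrative assumptions rather than part of any specific library:

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: uniform random action
    return int(np.argmax(Q[state]))           # exploit: best action under current Q estimates
```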
The core of SARSA lies in its update rule:
Q(S, A) ← Q(S, A) + α [R + γ Q(S', A') − Q(S, A)]
Here, α (learning rate) controls how much new information overrides old estimates, and γ (discount factor) weights future rewards. The term R + γ Q(S', A') represents the TD target, combining the immediate reward with the discounted value of the next state-action pair. For example, in a grid-world navigation task, if an agent moves right (action A) from state S, receives a reward R=0, and then selects action A’=up in state S’, SARSA updates Q(S, right) using Q(S’, up). This approach ensures the agent accounts for the actual next action it will take, which depends on its exploration strategy.
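Putting the update rule and the on-policy action selection together, a tabular SARSA training loop might look like the following sketch. It assumes a simplified environment interface where env.reset() returns an integer state and env.step(a) returns (next_state, reward, done); the function name, hyperparameter defaults, and episode structure are illustrative assumptions:

```python
import numpy as np

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA: Q(S,A) += alpha * (R + gamma * Q(S',A') - Q(S,A))."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))

    def policy(s):
        # epsilon-greedy over the current Q estimates
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s = env.reset()
        a = policy(s)                       # choose A from S using the current policy
        done = False
        while not done:
            s_next, r, done = env.step(a)   # take A, observe R and S'
            a_next = policy(s_next)         # choose A' from S' with the SAME policy
            td_target = r + gamma * Q[s_next, a_next] * (not done)
            Q[s, a] += alpha * (td_target - Q[s, a])
            s, a = s_next, a_next           # A' is the action actually taken next
    return Q
```

Note that the bootstrap term Q[s_next, a_next] uses the action the agent will really execute, which is what makes the method on-policy.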
SARSA is particularly useful in scenarios where the policy’s exploration behavior impacts safety or performance. For instance, a robot avoiding obstacles might use ε-greedy exploration, where occasional random actions could lead to collisions. SARSA learns a policy that factors in these exploration risks, leading to more conservative paths than Q-learning, which learns the value of the greedy policy and therefore ignores the cost of exploratory actions. However, SARSA can converge more slowly than off-policy methods in some cases, since its updates depend on the actions the current policy happens to take. Developers often choose SARSA when the agent must learn while accounting for real-time exploration trade-offs, such as in real-world robotics or safety-critical simulations like cliff-walking environments, where the shortest path can be riskier than a slightly longer, safer route.
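The practical difference between the two methods comes down to the TD target. The snippet below contrasts the SARSA target with the Q-learning target using toy, purely illustrative values for the Q table and a single hypothetical transition:

```python
import numpy as np

# Toy values purely for illustration (hypothetical grid-world transition)
Q = np.zeros((4, 2))                      # 4 states, 2 actions
s_next, a_next, r, gamma = 2, 0, -1.0, 0.99

# SARSA target: bootstraps from the action A' the epsilon-greedy policy actually selected
sarsa_target = r + gamma * Q[s_next, a_next]

# Q-learning target: bootstraps from the greedy action in S', ignoring how exploration behaves
q_learning_target = r + gamma * np.max(Q[s_next])
```

In a cliff-walking layout, this distinction is why SARSA tends to keep a margin away from the cliff edge while Q-learning hugs it: SARSA’s target reflects the occasional random step that exploration can produce.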