# Failover
Failover promotes a standby cluster to a standalone primary when the original primary is completely unavailable. It is an availability-first operation and may lose data that was not replicated before the failure.
This guide assumes the original topology is:

```text
cluster-a (primary) -> cluster-b (standby)
```

After failover, cluster-b becomes a standalone primary:

```text
cluster-b (primary)
```
## When to Use Failover
Use failover only when:
- The original primary cannot respond to requests.
- The primary cannot be recovered within an acceptable time.
- Restoring write availability is more important than waiting for the old primary.
If the primary is still reachable, use Switchover instead. Switchover avoids data loss.
## Data Loss Risk
Failover does not wait for the original primary. Any data written to the old primary but not yet replicated to the standby may be lost.
The possible data loss is determined by CDC lag at the time the primary became unavailable.
Before running failover, understand the tradeoff:
| Goal | Switchover | Failover |
|---|---|---|
| Restore writes while primary is unreachable | No | Yes |
| Avoid data loss | Yes | Not guaranteed |
| Requires old primary to respond | Yes | No |
## Before You Begin
Confirm the following:
- The original primary is unavailable.
- You have decided not to wait for primary recovery.
- Application traffic can be redirected to the standby.
- Traffic automation will not send writes back to the old primary if it recovers.
- You have the standby cluster ID, address, token, and pchannels.
The most important safety requirement is to prevent split brain. After failover, only the promoted standby should accept application writes.
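One way to enforce this before promoting is to probe the old primary and proceed only if every probe fails. The sketch below makes no assumption about your client: `probe` can be any callable that raises when the cluster is unreachable, for example `lambda: client_a.list_collections()` with a `MilvusClient` pointed at the old primary (`client_a` and the helper name are illustrative, not part of the Milvus API).

```python
def confirm_primary_unreachable(probe, attempts=3):
    """Return True only if every probe attempt fails.

    `probe` is any callable that raises on failure. If even one
    attempt succeeds, the old primary is still answering and
    failover should not proceed -- use switchover instead.
    """
    for _ in range(attempts):
        try:
            probe()
            return False  # the old primary answered: do not fail over
        except Exception:
            continue
    return True
```

Run this guard from the same network segment as your applications, since a primary that is unreachable from the operator's machine may still be reachable by clients.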
## Build the Failover Configuration

Build a configuration that contains only the standby cluster and no replication topology, and set `force_promote` to `True`.
```python
# If you followed Set Up CDC Replication, cluster B is the original target cluster.
cluster_b_id = target_cluster_id
cluster_b_addr = target_cluster_addr
cluster_b_client_addr = target_client_addr
cluster_b_token = target_cluster_token
cluster_b_pchannels = target_cluster_pchannels

failover_config = {
    "clusters": [
        {
            "cluster_id": cluster_b_id,
            "connection_param": {
                "uri": cluster_b_addr,
                "token": cluster_b_token,
            },
            "pchannels": cluster_b_pchannels,
        }
    ],
    "cross_cluster_topology": [],
    "force_promote": True,
}
```
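Before sending the request, it can help to assert the two properties that make this a failover configuration: exactly one cluster with an empty topology, and `force_promote` enabled. The helper below is an optional sanity check written for this guide, not part of the Milvus API:

```python
def check_failover_config(config):
    """Raise AssertionError if `config` does not describe a
    single-cluster forced promotion."""
    assert len(config["clusters"]) == 1, \
        "failover config must list only the standby cluster"
    assert config["cross_cluster_topology"] == [], \
        "failover config must carry no replication topology"
    assert config["force_promote"] is True, \
        "failover requires force_promote=True"
```

Call `check_failover_config(failover_config)` immediately before submitting the promotion request.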
## Promote the Standby

Send the request to the standby cluster.
```python
from pymilvus import MilvusClient

client_b = MilvusClient(uri=cluster_b_client_addr, token=cluster_b_token)
try:
    client_b.update_replicate_configuration(**failover_config)
finally:
    client_b.close()
```
If the request succeeds, cluster-b becomes a standalone primary and can accept writes.
## Redirect Application Traffic

After promotion:

- Redirect write traffic to `cluster-b`.
- Remove `cluster-a` from write endpoints, load balancers, DNS records, and automation.
- Verify that `cluster-b` accepts writes.
- Keep `cluster-a` isolated until it is decommissioned or explicitly rebuilt.
Example write verification:
```python
client_b = MilvusClient(uri=cluster_b_client_addr, token=cluster_b_token)
try:
    client_b.insert(
        collection_name="test_collection",
        data=[{"id": 1, "vector": [0.1] * 128}],
    )
finally:
    client_b.close()
```
Adjust the collection name and schema fields to match your deployment.
## Verify the Result

Verify the promoted cluster directly:

- Writes succeed on `cluster-b`.
- Reads return expected data.
- No application component writes to `cluster-a`.
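The read check can be scripted against the same `MilvusClient` used above. The sketch below is illustrative: `verify_reads` is not a Milvus API, and the `id >= 0` filter and `expected_ids` values should be adjusted to your own schema and data.

```python
def verify_reads(client, collection_name, expected_ids):
    """Query the promoted cluster and confirm the expected
    primary keys are present in the results."""
    rows = client.query(
        collection_name=collection_name,
        filter="id >= 0",
        output_fields=["id"],
    )
    found = {row["id"] for row in rows}
    return set(expected_ids).issubset(found)
```

For example, after the write verification above, `verify_reads(client_b, "test_collection", [1])` should return `True`.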
## Handling the Old Primary
After failover, treat cluster-a as stale. Do not send application writes to it if it becomes reachable again. It may contain data that was never replicated to cluster-b, and cluster-b may already contain new writes after failover.
Do not reconnect cluster-a to the old topology automatically. Reintroducing the old primary is a separate recovery task that must be planned carefully.
## Minimizing Data Loss
You cannot remove all data-loss risk from failover, but you can reduce it:
- Monitor CDC lag continuously.
- Keep standby clusters provisioned to handle the primary write rate.
- Keep cross-cluster network latency and packet loss low.
- Make application writes idempotent.
- Retry writes whose success is uncertain after failover.
- Prefer switchover whenever the primary can still respond.
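The idempotency and retry points can be combined into a small helper. This is a sketch written for this guide, not part of pymilvus: `write` should be an idempotent operation such as `client_b.upsert(...)` keyed by primary key, so repeating it after an ambiguous failure cannot duplicate data.

```python
import time

def retry_idempotent_write(write, attempts=3, backoff_s=1.0):
    """Retry a write whose success is uncertain, with exponential
    backoff. `write` must be idempotent, since it may execute more
    than once."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return write()
        except Exception as exc:
            last_exc = exc
            time.sleep(backoff_s * (2 ** attempt))
    raise last_exc
```

For example: `retry_idempotent_write(lambda: client_b.upsert(collection_name="test_collection", data=rows))`.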
## FAQ

### Does failover always lose data?

No, but it can. If all writes were already replicated before the primary failed, no data is lost. If CDC lag existed, the lagging data may be lost.

### How long does failover take?

It typically completes within seconds, depending on cluster state and control-plane availability on the standby.

### Can I run failover on the primary?

No. Failover is intended for a standby cluster. If the current primary is available, use switchover.

### Can the old primary rejoin automatically?

No. After failover, the old primary must be treated as stale and decommissioned or rebuilt before it can participate in replication again.

### How do I avoid split brain?

Ensure that only the promoted cluster receives writes. Remove the old primary from all write paths before it can recover and accept traffic.