
Support zero-downtime upgrades (ZDU) with RollingUpdate strategy #170

@markmishaev76

Description


Problem

The chart currently defaults to updateStrategyType: "OnDelete" (values.yaml), which requires operators to manually delete pods in the correct order (standbys first, leader last) during upgrades. This is safe but prevents automated rolling upgrades.
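The default in question is a single values key; a minimal sketch of the relevant values.yaml fragment (surrounding keys elided):

```yaml
server:
  # OnDelete: pods are only replaced when an operator deletes them,
  # which allows manual control of the upgrade order
  # (standbys first, leader last).
  updateStrategyType: "OnDelete"
```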

Switching to RollingUpdate is unsafe today because Kubernetes has no awareness of the HA upgrade constraint:

"Do not fail over from a newer version of OpenBao to an older version."

If a newer-version standby becomes leader during a rolling update while older-version nodes still exist, and that leader then crashes, an older node could claim leadership. Nothing in OpenBao prevents this.

Upstream dependency

openbao/openbao#2858 proposes version-gated leader election in OpenBao core: a node will refuse leadership if its version is older than the last recorded leader version. Once that lands, the primary blocker for RollingUpdate is removed.

What this issue tracks

After version-gated leader election lands in core:

  1. Evaluate switching the default updateStrategyType to RollingUpdate, or at least documenting how to safely enable it.
  2. Consider pod ordering. StatefulSets with RollingUpdate terminate pods in reverse ordinal order (highest first). If the leader happens to be the lowest ordinal (pod-0, which is typical for Raft-based setups), this naturally upgrades standbys first and leader last -- which is the correct order. Document this behavior.
  3. Consider maxUnavailable settings. For HA clusters, maxUnavailable: 1 ensures only one pod is replaced at a time, maintaining quorum.
  4. Leverage the openbao-version-blocked K8s pod label (proposed in openbao/openbao#2858) for readiness gates or monitoring. A version-blocked standby still serves reads but should not receive write-forwarded traffic.
  5. Update chart documentation with a recommended ZDU upgrade procedure.
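Items 1 and 3 above would amount to something like the following values override and the StatefulSet fragment it would need to render. This is a sketch, assuming the chart wires updateStrategyType directly into the StatefulSet; the rollingUpdate block does not exist in the chart today, and maxUnavailable for StatefulSets is gated behind the Kubernetes MaxUnavailableStatefulSet feature gate:

```yaml
# Hypothetical values override, only safe once openbao/openbao#2858 lands:
server:
  updateStrategyType: "RollingUpdate"

# StatefulSet fragment this would need to produce. Without the
# MaxUnavailableStatefulSet feature gate, maxUnavailable is ignored and
# pods are replaced strictly one at a time anyway, which preserves quorum.
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
```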

Current workarounds

  • Use OnDelete strategy and manually delete pods in the correct order (standbys first, leader last).
  • Use external tooling (scripts, Operators) to orchestrate the deletion order.
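The manual OnDelete procedure can be sketched as a small script. Names here are placeholders, and the openbao-active pod label is an assumption about how the chart marks the active (leader) pod:

```shell
# Manual-upgrade sketch for the OnDelete strategy.
# Assumptions: namespace/selector names are placeholders, and the leader
# pod carries an openbao-active=true label (an assumption, not confirmed).
NS=openbao

# Identify the current leader so it can be replaced last.
LEADER=$(kubectl -n "$NS" get pods -l openbao-active=true \
  -o jsonpath='{.items[0].metadata.name}')

# Replace standbys one at a time, waiting for each to rejoin before
# moving on, so quorum is maintained throughout.
for POD in $(kubectl -n "$NS" get pods -l app.kubernetes.io/name=openbao \
  -o jsonpath='{.items[*].metadata.name}'); do
  [ "$POD" = "$LEADER" ] && continue
  kubectl -n "$NS" delete pod "$POD"
  kubectl -n "$NS" wait --for=condition=Ready pod/"$POD" --timeout=300s
done

# Finally replace the leader; a standby already on the new version
# takes over, satisfying the "never fail over newer-to-older" constraint.
kubectl -n "$NS" delete pod "$LEADER"
```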
