[test] Avoid killing the Auditor leader in BookieAutoRecoveryTest #4769

Open
merlimat wants to merge 1 commit into apache:master from merlimat:fix-flaky-bookie-auto-recovery-test


@merlimat
Contributor

Summary

  • Three tests in BookieAutoRecoveryTest were flaking in CI (testOpenLedgers, testClosedLedgers, testEmptyLedgerLosesQuorumEventually) when the bookie they killed happened to be the Auditor leader. The leader-failover chain — AuditorElector.shutdown()'s 10s timeout + ZK session expiry + new election + first audit cycle — routinely exceeded the tests' 20s/90s latch timeouts under CI load.
  • Make the bookie-to-kill choice deterministic w.r.t. the Auditor: pick a non-Auditor bookie from the ensemble. For testEmptyLedgerLosesQuorumEventually, which needs to kill specific ensemble indices, pin the Auditor onto ensemble[0] instead by stopping AutoRecovery on the bookies the test will kill, and do so before the timed assertions begin.
  • Recent example failure: https://github.com/apache/bookkeeper/actions/runs/25072614471/job/73458460684
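
A minimal sketch of the "pick a non-Auditor bookie" idea, for illustration only. It does not use the BookKeeper test APIs: bookies are modelled as plain strings, and `NonAuditorPicker`, `pickNonAuditor`, and the bookie IDs are hypothetical names, not identifiers from this patch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class NonAuditorPicker {
    /**
     * Returns the first ensemble member that is not the current Auditor.
     * Hypothetical helper: bookie identities are modelled as strings here.
     */
    public static Optional<String> pickNonAuditor(List<String> ensemble, String auditorId) {
        return ensemble.stream()
                .filter(bookie -> !bookie.equals(auditorId))
                .findFirst();
    }

    public static void main(String[] args) {
        // Assumed three-bookie ensemble; the IDs are placeholders.
        List<String> ensemble = Arrays.asList("bookie-0", "bookie-1", "bookie-2");
        String auditor = "bookie-0";
        System.out.println("Bookie to kill: " + pickNonAuditor(ensemble, auditor).orElse("none"));
    }
}
```

Returning an `Optional` makes the degenerate case explicit (an ensemble whose only member is the Auditor), which a test would presumably treat as a setup error.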

Test plan

  • mvn -pl bookkeeper-server test -Dtest='BookieAutoRecoveryTest' passes locally on two consecutive runs (7/7 tests).

Three tests in BookieAutoRecoveryTest were flaking in CI because they
sometimes killed the bookie that happened to be the Auditor leader, and
the leader-failover chain (10s AuditorElector shutdown timeout + ZK
session expiry + new election + first audit cycle) routinely exceeded
the tests' 20s/90s latch timeouts under CI load.
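
The failure mode can be sketched with a plain `CountDownLatch`: if the work that counts the latch down takes longer than the `await` timeout, `await` returns `false` and the assertion fails. This is a scaled-down illustration, not the test's code; `LatchTimeoutSketch`, `completesWithin`, and the millisecond values are assumptions standing in for the real failover chain and the 20s/90s timeouts.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchTimeoutSketch {
    /**
     * Starts a worker that counts the latch down after workMillis, then
     * waits up to timeoutMillis. Returns true only if the latch reached
     * zero before the timeout elapsed.
     */
    public static boolean completesWithin(long workMillis, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(workMillis); // stands in for failover + first audit cycle
                done.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Work slower than the timeout: the latch wait times out.
        System.out.println("slow path completed: " + completesWithin(500, 50));
        // Work well inside the timeout: the latch wait succeeds.
        System.out.println("fast path completed: " + completesWithin(50, 5000));
    }
}
```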

Make the bookie-to-kill choice deterministic w.r.t. the Auditor:

- testOpenLedgers, testClosedLedgers, testStopWhileReplicationInProgress,
  testNoSuchLedgerExists: pick a non-Auditor bookie from the ensemble.
- testEmptyLedgerLosesQuorumEventually needs to kill ensemble[2] then
  ensemble[1] specifically (it tests non-write-quorum bookie loss).
  Pin the Auditor onto ensemble[0] by stopping AutoRecovery on the two
  bookies the test will kill. Any slow failover happens in pre-flight,
  outside the timed assertions.
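
The pinning step can be sketched as narrowing the set of Auditor candidates. The sketch assumes a three-bookie cluster in which only bookies still running AutoRecovery can hold the Auditor role; `AuditorPinningSketch` and `auditorCandidates` are hypothetical names, and bookies are again modelled as strings rather than real BookKeeper objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AuditorPinningSketch {
    /**
     * Returns the bookies that can still become Auditor after AutoRecovery
     * has been stopped on the ensemble members at the given indices.
     */
    public static List<String> auditorCandidates(List<String> ensemble, List<Integer> indicesToKill) {
        List<String> candidates = new ArrayList<>(ensemble);
        for (int index : indicesToKill) {
            // Remove by value, looked up in the original ensemble, so that
            // earlier removals do not shift later indices.
            candidates.remove(ensemble.get(index));
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<String> ensemble = Arrays.asList("bookie-0", "bookie-1", "bookie-2");
        // The test kills ensemble[2] and then ensemble[1].
        System.out.println("Auditor candidates: " + auditorCandidates(ensemble, Arrays.asList(2, 1)));
    }
}
```

With a single remaining candidate, any election triggered by stopping AutoRecovery has to settle on ensemble[0] before the kills start, which is what keeps the slow failover out of the timed section.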
@merlimat merlimat changed the title [FIX] Avoid killing the Auditor leader in BookieAutoRecoveryTest [test] Avoid killing the Auditor leader in BookieAutoRecoveryTest Apr 28, 2026