[test] Avoid killing the Auditor leader in BookieAutoRecoveryTest #4769
Open
merlimat wants to merge 1 commit into apache:master
Three tests in BookieAutoRecoveryTest were flaking in CI because they sometimes killed the bookie that happened to be the Auditor leader, and the leader-failover chain (10s AuditorElector shutdown timeout + ZK session expiry + new election + first audit cycle) routinely exceeded the tests' 20s/90s latch timeouts under CI load.

Make the bookie-to-kill choice deterministic w.r.t. the Auditor:
- testOpenLedgers, testClosedLedgers, testStopWhileReplicationInProgress, testNoSuchLedgerExists: pick a non-Auditor bookie from the ensemble.
- testEmptyLedgerLosesQuorumEventually needs to kill ensemble[2] then ensemble[1] specifically (it tests non-write-quorum bookie loss). Pin the Auditor onto ensemble[0] by stopping AutoRecovery on the two bookies the test will kill. Any slow failover happens in pre-flight, outside the timed assertions.
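The two selection rules above can be illustrated with a minimal, self-contained sketch. It is a toy model, not the code in this PR: bookies are plain string IDs, and the class `AuditorAwareKill` and its methods `pickNonAuditor` and `auditorCandidates` are hypothetical names that do not exist in BookKeeper. Nor does it show how the test actually discovers the current Auditor.

```java
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Set;
import java.util.stream.Collectors;

/** Toy model (hypothetical): bookies are string IDs; no BookKeeper classes are used. */
final class AuditorAwareKill {

    /** Returns the first ensemble member that is not the current Auditor. */
    static String pickNonAuditor(List<String> ensemble, String auditorId) {
        return ensemble.stream()
                .filter(id -> !id.equals(auditorId))
                .findFirst()
                .orElseThrow(() ->
                        new NoSuchElementException("ensemble contains only the Auditor"));
    }

    /**
     * Returns the bookies still eligible to host the Auditor once AutoRecovery
     * has been stopped on every bookie the test plans to kill.
     */
    static List<String> auditorCandidates(List<String> ensemble, Set<String> toKill) {
        return ensemble.stream()
                .filter(id -> !toKill.contains(id))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> ensemble = List.of("bookie-0", "bookie-1", "bookie-2");

        // Rule 1: the Auditor sits on bookie-0, so the test kills bookie-1 instead.
        System.out.println(pickNonAuditor(ensemble, "bookie-0")); // prints "bookie-1"

        // Rule 2: the test must kill ensemble[2] then ensemble[1]; with
        // AutoRecovery stopped on both, only ensemble[0] can hold the Auditor.
        Set<String> toKill = Set.of(ensemble.get(2), ensemble.get(1));
        System.out.println(auditorCandidates(ensemble, toKill)); // prints "[bookie-0]"
    }
}
```

In both cases the bookie killed inside the timed section is never the Auditor, so the latch timeouts no longer have to absorb a leader failover.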
Summary
Three tests in BookieAutoRecoveryTest were flaking in CI (testOpenLedgers, testClosedLedgers, testEmptyLedgerLosesQuorumEventually) when the bookie they killed happened to be the Auditor leader. The leader-failover chain (AuditorElector.shutdown()'s 10s timeout + ZK session expiry + new election + first audit cycle) routinely exceeded the tests' 20s/90s latch timeouts under CI load.
For testEmptyLedgerLosesQuorumEventually (which needs to kill specific indices), pin the Auditor onto ensemble[0] by stopping AutoRecovery on the bookies the test will kill before the timed assertions.
Test plan
mvn -pl bookkeeper-server test -Dtest='BookieAutoRecoveryTest' passes locally on two consecutive runs (7/7 tests).