Skip to content

Improve Leader Lock Code#3251

Open
MitchTurner wants to merge 7 commits intomasterfrom
chore/improve-leader-lock-code
Open

Improve Leader Lock Code#3251
MitchTurner wants to merge 7 commits intomasterfrom
chore/improve-leader-lock-code

Conversation

@MitchTurner
Copy link
Copy Markdown
Member

@MitchTurner MitchTurner commented Apr 1, 2026

Linked Issues/PRs

Followup to feedback in #3241

Description

  • Do we need to update the timestamp of the block as well and re-calculate Instant? Liek we do in the case of SynState with the network (8081c81)
  • The same here, I think we should update Isntant accordingly to ensure that next block will be produced correctly (8081c81)
  • Have you tried to build a table for these metrics to ensure that they work?
  • I think we coudl be more descriptive what was implemented=D Plus, we have a lot of docs related files, so we also can mention what information do we have there and how reader can consume its content (eedec99, 85cb842)
  • Do we handle 0 properly?
  • This await can be really long, in the case when we have multiple redis nodes, huge blocks. Especially if we need to load multiple blocks. Maybe we should consider fetching just IDs, and for each ID, we can get them from Redis by ID.
  • This is a really long function, with a lot of logic. Are you sure that it is not possible to write in a smaller way with suing less iteratators?=) Maybe we could use some data structure to get properties which we are looking for. (9734a39)
  • This is a very long adapter code; usually, we try to keep them smaller. Maybe we should consider factoring this adapter into smaller components. Or at least extract pure redis interactions into its own module. Because we have a lot of simialrt code in order to reach quorume, and I suppose it could be hidden behind the traits

Checklist

  • Breaking changes are clearly marked as such in the PR description and changelog
  • New behavior is reflected in tests
  • The specification matches the implemented behavior (link update PR if changes are needed)

Before requesting review

  • I have reviewed the code myself
  • I have created follow-up issues caused by this PR and linked them here

After merging, notify other teams

[Add or remove entries as needed]

@cursor
Copy link
Copy Markdown

cursor bot commented Apr 1, 2026

PR Summary

Medium Risk
Touches PoA leader election and block reconciliation paths (Redis fencing/repair and block-production scheduling), which can impact liveness or fork-safety if logic is wrong; changes are mostly refactors but in consensus-critical code.

Overview
Improves the Redis-backed PoA leader lock adapter by making quorum calculation configurable at construction time (passes quorum_disruption_budget into RedisLeaderLeaseAdapter::new and removes the post-construction setter) and standardizing terminology from lease_owner_token to lease_owner_id across Rust and Lua script call sites.

Refactors reconciliation to be more structured: stream backlog reading is split into read_backlog/group_entries_by_height, metrics updates are isolated, and contiguous-height reconciliation is decomposed into reconcile_height/select_height_winner, with sub-quorum repair now returning the repaired block (or an error) and emitting clearer logs.

Adjusts PoA service state tracking so last_block_created/timestamp progression is updated consistently when acting as a follower and when importing unreconciled blocks, helping keep subsequent block scheduling correct.

Written by Cursor Bugbot for commit 85cb842. This will update automatically on new commits. Configure here.

@MitchTurner MitchTurner added the no changelog Skip the CI check of the changelog modification label Apr 1, 2026
@MitchTurner MitchTurner self-assigned this Apr 1, 2026
@MitchTurner
Copy link
Copy Markdown
Member Author

Have you tried to build a table for these metrics to ensure that they work?

@Voxelot added these. I think he can give more context.

@MitchTurner
Copy link
Copy Markdown
Member Author

MitchTurner commented Apr 1, 2026

Do we handle 0 properly?

Yes, should be safe:

   fn calculate_quorum(redis_nodes_len: usize, quorum_disruption_budget: u32) -> usize {
        let majority = redis_nodes_len
            .checked_div(2)
            .unwrap_or(0)
            .saturating_add(1);
        let disruption_budget = usize::try_from(quorum_disruption_budget).unwrap_or(0);
        majority
            .saturating_add(disruption_budget)
            .min(redis_nodes_len)
    }

@MitchTurner
Copy link
Copy Markdown
Member Author

This await can be really long, in the case when we have multiple redis nodes, huge blocks. Especially if we need to load multiple blocks. Maybe we should consider fetching just IDs, and for each ID, we can get them from Redis by ID.

The idea was we have pagination to minimize this time... but might not be enough as block size gets bigger?

@MitchTurner
Copy link
Copy Markdown
Member Author

I think we coudl be more descriptive what was implemented=D Plus, we have a lot of docs related files, so we also can mention what information do we have there and how reader can consume its content

Added documentation to the adapter here: eedec99

Happy to add more.

@MitchTurner MitchTurner marked this pull request as ready for review April 1, 2026 19:14
@Voxelot
Copy link
Copy Markdown
Member

Voxelot commented Apr 1, 2026

Have you tried to build a table for these metrics to ensure that they work?

@Voxelot added these. I think he can give more context.

I can confirm using these metrics when debugging the failover behavior in devnet and that they worked. I'm not sure what is meant by building a table? Like writing to rocksdb as well as metrics?

Self {
redis_nodes: self.redis_nodes.clone(),
quorum: self.quorum,
quorum_disruption_budget: self.quorum_disruption_budget,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we might be using this setting to ensure we have more than n/2 + 1 threshold in some situations. Will it still be possible to require n/2 + m without this quorum disruption budget setting?

@Voxelot
Copy link
Copy Markdown
Member

Voxelot commented Apr 16, 2026

This await can be really long, in the case when we have multiple redis nodes, huge blocks. Especially if we need to load multiple blocks. Maybe we should consider fetching just IDs, and for each ID, we can get them from Redis by ID.

The idea was we have pagination to minimize this time... but might not be enough as block size gets bigger?

This is partially mitigated by using local block height as a cursor. However, if there are many conflicting blocks in redis, agree that it could be potentially lighter weight to only fetch and compare block ids first to determine the canonical path, before fetching the actual data. In that case we could also limit the data fetch to be from just one redis node, instead of fetching multiple copies of the data from all of them at once.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no changelog Skip the CI check of the changelog modification

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants