
fix: mark blocks as processed only on successful branch walk #350

Open
guillaumemichel wants to merge 1 commit into master from fix/partial-head-processing

Conversation

@guillaumemichel
Contributor

Bug

When a branch walk (DAG traversal) fails partway through (e.g. a network timeout fetching a child block), any nodes already merged and marked processed before the failure were permanently written to disk. On retry, those nodes would be silently skipped (already marked processed), but the walk never completed, so the head was never advanced. The datastore ends up in a stuck, inconsistent state with no path to recovery.

Fix

Instead of writing each node's processed-block key immediately on merge, the walk now accumulates all merged CIDs in a per-walk walkState.pending set. Only when the entire walk succeeds does commitProcessed() write them all to disk. On failure, the pending set is discarded, so the next broadcast can retry the full walk cleanly.

A new inFlightCids set on the store lets isProcessedOrInFlight prevent concurrent walks from duplicating branches that are already being processed. In-flight CIDs are cleared after both success and failure.

Contributor

@gammazero gammazero left a comment


As far as I can tell this all looks good.

Collaborator

@hsanjuan hsanjuan left a comment


You can have very long DAGs running on devices with low memory, so keeping all the CIDs in memory is not great; that is why it wasn't done like that from the beginning. I think there is also an issue here: an irretrievable block will mean that no block in that branch is marked as processed, triggering re-walks of the affected branch on every rebroadcast, regardless of depth, wasting worker time and potentially blocking other work.

> The datastore ends up in a stuck, inconsistent state with no path to recovery.

Is this the case? If the datastore is marked as dirty, the Repair process will trigger on the given interval and should resolve the issues, so even if not optimal, there is a path to recovery. The existing issue is when the store is never marked as dirty (due to the process dying). In #279 (comment) I proposed having a flag that only gets cleared on a successful Close(), and that otherwise triggers Repair().

On the other hand, Repair() is a heavy price to pay for recovering a few blocks. If a block is irretrievable, Repair is triggered on every interval, which is only marginally better than triggering re-walks all the time. Repair also does not stop until it reaches the bottom, though if blocks are already processed it only does DAG traversal (no merging of deltas) and does not use dagWorkers for it.

So:

  • The general assumption is that all blocks will become available at some point, as otherwise convergence never happens.
  • We need to tolerate errors fetching blocks.
  • In a scenario where a block is missing for a long time, it is dangerous to handleBranch from scratch every time, as constantly re-merging deltas is heavy and keeps dagWorkers busy. In this case, an occasional Repair to resolve all issues at once seems better.
  • In a scenario where a block is missing due to a spurious error that would be fixed on a retry soon after, Repair() is overkill and it would be better to reprocess just the affected part of the branch.
  • We don't know how long a block is going to be missing or why, and we also don't know how long a branch is going to be.

It seems to me that repairDAG() is a bit safer in the general scenario, or at least less likely to have very bad side effects in some circumstances, but maybe we could make it DagName-aware?

A dagName-aware dirty flag would limit recovery to specific sub-DAGs, which could yield much better performance when dagNames are used, as repair would only focus on certain sub-DAGs and not the whole thing.

Comment thread crdt.go
Comment on lines +951 to +956
// Drain any jobs that were queued in sendJobs but not yet
// forwarded to jobQueue. session.Add(1) was already called
// for each of these, so we must call session.Done() to
// unblock handleBranch. Without this, session.Wait() would
// block forever, leaking the goroutine and preventing the
// inFlightCids defer from running.
Collaborator

@hsanjuan hsanjuan Apr 20, 2026


I don't fully understand this. sendJobs is not buffered, so there are not going to be any jobs waiting there. Probably the original markDirty is an old leftover. Jobs in the job queue are drained in dagWorker().

Then, hasPending is set when things are left in jobQueue, but the dagWorker drains those pending jobs. Before, we didn't mark the store as "dirty" when there were pending jobs in the work queue, but it seems now we do. I am not sure why, though.

@guillaumemichel
Contributor Author

See #351 for the "bad-shutdown" mark fix, but it doesn't solve the issue, as described in #279 (comment). TL;DR: repairDAG() doesn't help here.

I don't see many alternatives to waiting until the branch is fully traversed before marking its blocks as processed.

> You can have very long DAGs and be running on devices with low memory, so throwing CIDs into the memory is not super good and that is why it wasn't done like that from the beginning.

Yes, this is a valid concern. We can use the datastore (flushing batches if needed) to avoid holding all CIDs in memory 🤷🏻


> I think there is also an issue here: an irretrievable block will mean that no block in that branch is marked as processed, triggering re-walks of the affected branch on every rebroadcast, regardless of depth, wasting worker time and potentially blocking other work.

That seems tricky. Should the head(s) move if some blocks cannot be retrieved?

I doubt we want any blocks to be marked as processed unless all their parents are processed as well. See example below:

A (oldHead) -> ??? -> C (missing) -> D -> E (newHead)

In this scenario, the old head is A, and we want to sync a (linear) branch. We can successfully fetch the new head E and its parent D, but C is missing, which means we cannot walk down to the old head. Hence the head cannot move from A to E, because C and B (the block shown as ???, which cannot even be discovered) are missing.

E and D can be cached so that they are kept in the blockstore and don't need to be fetched again. The DAG won't heal with repairDAG(), since the issue is after the current head.

So a logical fix seems to be to retry walking down the DAG using blocks we already fetched, while trying again to fetch the missing blocks. We could implement a per-CID backoff to reduce retry intensity. Or we could remember where the DAG walk got stuck, with a list of missing/blocking CIDs, to resume the walk for each faulty branch.

Another way could be to track candidate heads on the Datastore so that repairDAG() can try to walk the DAG from the candidate heads.

WDYT @hsanjuan ?

