
fix: mark blocks as processed only on successful branch walk #350

Open
guillaumemichel wants to merge 1 commit into master from fix/partial-head-processing

Conversation

@guillaumemichel
Contributor

Bug

When a branch walk (DAG traversal) fails partway through (e.g. a network timeout fetching a child block), any nodes already merged and marked processed before the failure were permanently written to disk. On retry, those nodes would be silently skipped (already marked processed), but the walk never completed, so the head was never advanced. The datastore ends up in a stuck, inconsistent state with no path to recovery.

Fix

Instead of writing each node's processed-block key immediately on merge, the walk now accumulates all merged CIDs in a per-walk walkState.pending set. Only when the entire walk succeeds does commitProcessed() write them all to disk. On failure, the pending set is discarded, so the next broadcast can retry the full walk cleanly.

A new inFlightCids set on the store lets isProcessedOrInFlight prevent concurrent walks from duplicating branches that are already being processed. In-flight CIDs are cleared after both success and failure.

Contributor

@gammazero gammazero left a comment


As far as I can tell this all looks good.

Collaborator

@hsanjuan hsanjuan left a comment


You can have very long DAGs running on devices with low memory, so keeping all the CIDs in memory is not great; that is why it wasn't done like that from the beginning. I think there is also an issue here: an irretrievable block will mean that no block in that branch is marked as processed, triggering re-walks of the affected branch on every rebroadcast, regardless of depth, wasting worker time and potentially blocking other work.

> The datastore ends up in a stuck, inconsistent state with no path to recovery.

Is this the case? If the datastore is marked as dirty, the Repair process will trigger on the given interval and should resolve the issues, so even if not optimal, there is a path to recovery. The existing issue is when the store is never marked as dirty (due to the process dying). In #279 (comment) I proposed having a flag that only gets cleared on a successful Close(), and that otherwise triggers Repair().

On the other hand, Repair() is a heavy price to pay for recovering a few blocks. If a block is irretrievable, Repair is triggered on every interval, which is only marginally better than triggering re-walks all the time. Repair also does not stop until it reaches the bottom, though if blocks are already processed it only does DAG traversal (no merging of deltas) and does not use dagWorkers for it.

So:

  • The general assumption is that all blocks will become available at some point, as otherwise convergence never happens.
  • We need to tolerate errors fetching blocks.
  • In a scenario where a block is missing for a long time, it is dangerous to handleBranch from scratch every time, as constantly re-merging deltas is heavy and keeps dagWorkers busy. In this case, an occasional Repair to resolve all issues at once seems better.
  • In a scenario where a block is missing due to a spurious error that would be fixed on a retry soon after, Repair() is overkill and it would be better to reprocess just the affected part of the branch.
  • We don't know how long a block is going to be missing or why, and we also don't know how long a branch is going to be.

It seems to me that repairDAG() is a bit safer in the general scenario, or at least less likely to have very bad side effects in some circumstances, but maybe we could make it DagName-aware?

A dagName-aware dirty flag would limit recovery to specific sub-DAGs, which could yield much better performance when dagNames are used, as repair would only focus on certain sub-DAGs and not the whole thing.

Comment thread crdt.go
Comment on lines +951 to +956
// Drain any jobs that were queued in sendJobs but not yet
// forwarded to jobQueue. session.Add(1) was already called
// for each of these, so we must call session.Done() to
// unblock handleBranch. Without this, session.Wait() would
// block forever, leaking the goroutine and preventing the
// inFlightCids defer from running.
Collaborator

@hsanjuan hsanjuan Apr 20, 2026


I don't fully understand this. sendJobs is not buffered, so there are not going to be any jobs waiting there. Probably the original markDirty is an old leftover. Jobs in the job queue are drained in dagWorker().

Then, hasPending is set when things are left in jobQueue, but the dagWorker drains those pending jobs. Before, we didn't mark the store as "dirty" when there were pending jobs in the work queue, but it seems now we do. I am not sure why, though.

@guillaumemichel
Contributor Author

See #351 for the "bad-shutdown" mark fix, but it doesn't solve the issue, as described in #279 (comment). TL;DR: repairDAG() doesn't help here.

I don't see many alternatives to waiting until the branch is fully traversed before marking its blocks as processed.

> You can have very long DAGs and be running on devices with low memory, so throwing CIDs into the memory is not super good and that is why it wasn't done like that from the beginning.

Yes, this is a valid concern. We can use the datastore (flushing batches if needed) to avoid holding all CIDs in memory 🤷🏻


> I think there is also an issue here: an irretrievable block will mean that no block in that branch is marked as processed, triggering re-walks of the affected branch on every rebroadcast, regardless of depth, wasting worker time and potentially blocking other work.

That seems tricky. Should the head(s) move if some blocks cannot be retrieved?

I doubt we want any blocks to be marked as processed unless all their parents are processed as well. See example below:

A (oldHead) -> ??? -> C (missing) -> D -> E (newHead)

In this scenario, the old head is A, and we want to sync a (linear) branch. We can successfully fetch the new head E and its parent D, but C is missing, which means we cannot walk down to the old head. Hence the head cannot move from A to E, because C and B (the block shown as ???, which cannot even be discovered) are missing.

E and D can be cached so that they are kept in the blockstore and don't need to be fetched again. The DAG won't heal with repairDAG(), since the issue is after the current head.

So a logical fix seems to be to retry walking down the DAG using blocks we already fetched, while trying again to fetch the missing blocks. We could implement a per-CID backoff to reduce retry intensity. Or we could remember where the DAG walk got stuck, with a list of missing/blocking CIDs, to resume the walk for each faulty branch.

Another way could be to track candidate heads on the Datastore so that repairDAG() can try to walk the DAG from the candidate heads.

WDYT @hsanjuan ?

