fix: mark blocks as processed only on successful branch walk #350
guillaumemichel wants to merge 1 commit into master from
Conversation
gammazero
left a comment
As far as I can tell, this all looks good.
hsanjuan
left a comment
You can have very long DAGs and be running on devices with low memory, so keeping all the CIDs in memory is not great, and that is why it wasn't done like that from the beginning. I think there is also an issue here: an irretrievable block will cause no block to be marked as processed in that branch, triggering re-walks of the affected branch on every rebroadcast, regardless of depth, wasting worker time and potentially preventing other work.
> The datastore ends up in a stuck, inconsistent state with no path to recovery.
Is this the case? If the datastore is marked as dirty, the Repair process will trigger on the given interval and should resolve the issues, so even if not optimal, there is a path to recovery? The existing issue is when the store is never marked as dirty (due to the process dying). On #279 (comment) I proposed having a flag that only gets cleared on a successful Close(), and otherwise triggers Repair().
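The flag idea from #279 could look roughly like this. This is a minimal in-memory sketch with hypothetical names (`cleanShutdownFlag`, `Open`); a real implementation would persist the flag in the datastore so it survives a crash:

```go
package main

import (
	"fmt"
	"sync"
)

// cleanShutdownFlag sketches the #279 idea: the flag is set on
// startup and cleared only by a successful Close(). If the flag
// is still set when the store opens, the previous run died
// uncleanly and Repair() should be triggered.
type cleanShutdownFlag struct {
	mu    sync.Mutex
	dirty bool
}

// Open reports whether the previous run shut down uncleanly,
// then marks the store dirty until Close() succeeds.
func (f *cleanShutdownFlag) Open() (needsRepair bool) {
	f.mu.Lock()
	defer f.mu.Unlock()
	needsRepair = f.dirty // left set by a crashed previous run
	f.dirty = true        // assume dirty until a clean Close()
	return needsRepair
}

// Close clears the flag; only a clean shutdown reaches here.
func (f *cleanShutdownFlag) Close() error {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.dirty = false
	return nil
}

func main() {
	f := &cleanShutdownFlag{}
	fmt.Println(f.Open()) // false: fresh store, no repair needed
	// Simulate a crash: Close() is never called.
	fmt.Println(f.Open()) // true: unclean shutdown, trigger Repair()
	f.Close()
	fmt.Println(f.Open()) // false: clean shutdown was recorded
}
```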
On the other hand, Repair() is a heavy price to pay for the recovery of a few blocks. If a block is irretrievable, Repair is triggered on every interval, which is only marginally better than triggering re-walks all the time. Repair also does not stop until it reaches the bottom, although if blocks are already processed it only does DAG traversal (no merging of deltas) and does not use the dagWorkers for it.
So:
- The general assumption is that all blocks are going to become available at some point, as otherwise convergence never happens.
- Need to tolerate errors fetching blocks.
- In a scenario where a block is missing for a long time, it is dangerous to handleBranch from scratch every time, as it is heavy to re-merge deltas constantly and it keeps the dagWorkers busy. In this case, an occasional Repair to resolve all issues at once seems better.
- In a scenario where a block is missing due to a spurious error that would be fixed on a retry soon after, Repair() is overkill and it would be better to reprocess only the affected part of the branch.
- We don't know how long a block is going to be missing or why, and we also don't know how long a branch is going to be.
It seems to me that repairDAG() is a bit safer in the general scenario, or at least less likely to have very bad side effects in some circumstances, but maybe we could make it DagName-aware?
A dagName-aware dirty-flag would limit recovery to specific sub-dags, which could result in much better performance when dagNames are used, as repair would only focus on certain subdags and not the whole thing.
```go
// Drain any jobs that were queued in sendJobs but not yet
// forwarded to jobQueue. session.Add(1) was already called
// for each of these, so we must call session.Done() to
// unblock handleBranch. Without this, session.Wait() would
// block forever, leaking the goroutine and preventing the
// inFlightCids defer from running.
```
I don't fully understand this. sendJobs is not buffered, so there are not going to be any jobs waiting there. Probably the original markDirty is an old leftover. Jobs in the job queue are drained in dagWorker().
Then, hasPending is set when things are left in jobQueue, but the dagWorker cleans those pending jobs. Before, we didn't mark the store as "dirty" when there are pending jobs in the work queue, but it seems now we do. I am not sure why, though.
See #351 for the "bad-shutdown" mark fix, but it doesn't solve the issue, as described in #279 (comment). TL;DR: I don't see many alternatives to waiting until the branch is fully traversed before marking its blocks as processed.
Yes, this is valid. We can use the datastore (flushing batches if needed) to avoid holding all CIDs in memory 🤷🏻
That seems tricky. Should the head(s) move if some blocks cannot be retrieved? I doubt we want any blocks to be marked as processed unless all their parents are processed as well. See the example below: in this scenario, the oldHead is
So a logical way to fix this seems to be to retry walking down the DAG using blocks we already fetched, and trying again to fetch the missing blocks. We could implement backoff per CID to reduce retry intensity. Or we could remember where the DAG was stuck, with a list of missing/blocking CIDs, to resume the walk for each faulty branch. Another way could be to track candidate heads on the
WDYT @hsanjuan?
Bug
When a branch walk (DAG traversal) fails partway through (e.g. a network timeout fetching a child block), any nodes already merged and marked processed before the failure were permanently written to disk. On retry, those nodes would be silently skipped (already marked processed), but the walk never completed, so the head was never advanced. The datastore ends up in a stuck, inconsistent state with no path to recovery.
Fix
Instead of writing each node's processed-block key immediately on merge, the walk now accumulates all merged CIDs in a per-walk walkState.pending set. Only when the entire walk succeeds does commitProcessed() write them all to disk. On failure, the pending set is discarded, so the next broadcast can retry the full walk cleanly.
A new inFlightCids set on the store lets isProcessedOrInFlight block concurrent walks from duplicating in-progress branches during a walk. CIDs are cleared on both success and failure.
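Read together, the description amounts to something like the following in-memory sketch. The names follow the PR description (walkState, commitProcessed, isProcessedOrInFlight), but the maps stand in for the real datastore keys, so this is an approximation of the mechanism, not the actual diff:

```go
package main

import (
	"fmt"
	"sync"
)

type store struct {
	mu        sync.Mutex
	processed map[string]bool // persisted in the real datastore
	inFlight  map[string]bool // the new inFlightCids set
}

func newStore() *store {
	return &store{processed: make(map[string]bool), inFlight: make(map[string]bool)}
}

// isProcessedOrInFlight tells a walk to skip a CID that is
// already done or being handled by a concurrent walk;
// otherwise it claims the CID for this walk.
func (s *store) isProcessedOrInFlight(cid string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.processed[cid] || s.inFlight[cid] {
		return true
	}
	s.inFlight[cid] = true
	return false
}

// walkState accumulates merged CIDs for one branch walk.
type walkState struct {
	s       *store
	pending []string
}

func (w *walkState) markMerged(cid string) {
	w.pending = append(w.pending, cid)
}

// commitProcessed flushes the pending set when the walk succeeds.
func (w *walkState) commitProcessed() {
	w.s.mu.Lock()
	defer w.s.mu.Unlock()
	for _, c := range w.pending {
		w.s.processed[c] = true
		delete(w.s.inFlight, c)
	}
	w.pending = nil
}

// abort discards pending marks on failure, releasing the
// in-flight claims so the next broadcast can retry cleanly.
func (w *walkState) abort() {
	w.s.mu.Lock()
	defer w.s.mu.Unlock()
	for _, c := range w.pending {
		delete(w.s.inFlight, c)
	}
	w.pending = nil
}

func main() {
	s := newStore()
	w := &walkState{s: s}
	for _, c := range []string{"a", "b"} {
		if !s.isProcessedOrInFlight(c) {
			w.markMerged(c)
		}
	}
	w.abort() // simulate a failed walk: nothing is marked processed
	fmt.Println(s.processed["a"], s.inFlight["a"]) // both false: clean retry possible
}
```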