
Prevent permanent worker deadlock when cutover times out waiting for binlog sentinel#1637

Merged
meiji163 merged 8 commits into github:master from
VarunChandola:varun/fix-alleventsuptolockprocessed-deadlock
Apr 29, 2026

Conversation

@VarunChandola
Contributor

Fixes #971.

We recently hit a case where gh-ost stalled permanently mid-cutover: heartbeat lag grew
without bound and no further cutover attempts were made despite the process still running.

Background

As part of atomic cutover, gh-ost locks the original table, waits for all in-flight DML
to be applied to the ghost table, then executes the rename. The "wait for in-flight DML"
step works via a sentinel value written to the changelog table. This sentinel travels via
the binlog stream back to gh-ost, which applies binlog events sequentially through a
single-threaded applyEventsQueue. The sentinel is enqueued onto that same queue, so by
the time the applier executes it, all preceding DML is guaranteed to have been applied:

// onChangelogStateEvent — spawns a goroutine to enqueue the sentinel
var applyEventFunc tableWriteFunc = func() error {
    this.allEventsUpToLockProcessed <- &lockProcessedStruct{...}
    return nil
}
go func() {
    this.applyEventsQueue <- newApplyEventStructByFunc(&applyEventFunc)
}()

waitForEventsUpToLock then blocks waiting to receive on allEventsUpToLockProcessed,
which only fires once the applier has worked through the queue and executed the sentinel.

The bug

Under high DML load the queue backlog is large enough that the sentinel isn't executed
before CutOverLockTimeoutSeconds expires. This is subtle: the logs show the sentinel
being intercepted quickly, which is easy to misread as the sentinel having been
delivered. The log line "Handled changelog state AllEventsUpToLockProcessed" only means
that onChangelogStateEvent has spawned the goroutine to push onto applyEventsQueue — not
that the applier has executed the sentinel or that allEventsUpToLockProcessed has been
written to.

From our logs:

09:56:49.616  Writing changelog state: AllEventsUpToLockProcessed:177240580961633980
09:56:49.616  Tables locked
09:56:49.620  Intercepted changelog state AllEventsUpToLockProcessed   ← binlog event seen
09:56:49.620  Handled changelog state AllEventsUpToLockProcessed       ← goroutine spawned
09:56:49.623  Waiting for events up to lock
09:56:52.617  Timeout while waiting for events up to lock              ← 3s later, timed out
09:56:52.617  Dropping magic cut-over table
09:56:53.053  Tables unlocked
09:56:53.058  Looking for magic cut-over table                         ← no further attempts

waitForEventsUpToLock times out and exits, but the sentinel func is still in the queue.
When the applier eventually reaches it and executes it, it tries to send to
allEventsUpToLockProcessed — an unbuffered channel. The only goroutine that reads from
this channel is waitForEventsUpToLock, which has already exited. The applier blocks
permanently on that send.

On subsequent cutover attempts, we never reach waitForEventsUpToLock at all. The
sleepWhileTrue heartbeat lag guard at the top of cutOver() sees high heartbeat lag —
caused by the blocked applier, which has stopped processing all binlog events including
heartbeats — and spins forever waiting for it to recover. It never does. The migration is
permanently stalled with no error, no restart, and no further progress. This matches the
symptoms in #971.
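The guard's shape (inferred from the description above; the real gh-ost signature is not verified here) is a poll-and-sleep loop, which makes the permanent spin easy to see: the condition it polls can only clear if the applier resumes, and the applier can only resume if the loop exits.

```go
package main

import (
	"fmt"
	"time"
)

// sleepWhileTrue approximates the guard loop: poll a condition, sleep while
// it holds. With the applier blocked, the "heartbeat lag too high" condition
// never clears, so the loop never returns.
func sleepWhileTrue(operation func() (bool, error)) error {
	for {
		shouldSleep, err := operation()
		if err != nil {
			return err
		}
		if !shouldSleep {
			return nil
		}
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	// A condition that clears after 3 polls, standing in for lag recovery.
	polls := 0
	err := sleepWhileTrue(func() (bool, error) {
		polls++
		return polls < 3, nil
	})
	fmt.Println(polls, err) // 3 <nil>
}
```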

Fix

Buffer allEventsUpToLockProcessed to MaxRetries(). There are arguably less wasteful
approaches, but this is the simplest: the applier's send always completes immediately
regardless of whether waitForEventsUpToLock is still listening. Stale sentinels from
timed-out attempts accumulate in the buffer and are discarded by the existing
state != challenge skip loop on subsequent calls to waitForEventsUpToLock.
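The two halves of the fix can be sketched together — a buffered channel so sends never block, plus a drain loop that discards sentinels whose state doesn't match the current attempt's challenge. This is illustrative code under the PR's assumptions, with hypothetical names, not the actual migrator:

```go
package main

import "fmt"

type lockProcessedStruct struct{ state string }

// drainUntilChallenge discards stale sentinels left over from timed-out
// cutover attempts until it finds one matching the current challenge,
// mirroring the existing state != challenge skip loop.
func drainUntilChallenge(ch chan *lockProcessedStruct, challenge string) *lockProcessedStruct {
	for s := range ch {
		if s.state == challenge {
			return s // sentinel for the current cutover attempt
		}
		// stale sentinel from an earlier, timed-out attempt; discard
	}
	return nil
}

func main() {
	maxRetries := 3
	ch := make(chan *lockProcessedStruct, maxRetries) // buffered: applier sends never block
	ch <- &lockProcessedStruct{state: "attempt-1"}    // stale, from a timed-out attempt
	ch <- &lockProcessedStruct{state: "attempt-2"}    // current attempt
	fmt.Println(drainUntilChallenge(ch, "attempt-2").state) // attempt-2
}
```

Sizing the buffer to MaxRetries() works because each cutover attempt produces at most one sentinel, so the buffer can never fill before the migration gives up.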

…binlog sentinel

Buffer allEventsUpToLockProcessed to MaxRetries() so the applier's send always
completes immediately even after waitForEventsUpToLock has timed out and exited.
@ad-murray

I think this is the largest ratio of PR description to lines of code I have ever come across

@VarunChandola VarunChandola reopened this Mar 4, 2026
@meiji163 meiji163 added the bug label Mar 9, 2026
@meiji163
Contributor

meiji163 commented Mar 9, 2026

Thanks for the bug fix @VarunChandola. Can you gofmt -s -w go/ the code if the PR is ready for review?

@meiji163 meiji163 marked this pull request as ready for review April 29, 2026 17:12
@meiji163 meiji163 requested a review from rashiq as a code owner April 29, 2026 17:12
Copilot AI review requested due to automatic review settings April 29, 2026 17:12
Contributor

Copilot AI left a comment


Pull request overview

Fixes a gh-ost cutover failure mode where a timed-out waitForEventsUpToLock can leave a stale sentinel that later blocks the single-threaded applier goroutine, permanently stalling the migration.

Changes:

  • Make Migrator.allEventsUpToLockProcessed a buffered channel sized to MaxRetries() to avoid applier deadlock after cutover timeouts.
  • Add inline documentation explaining why the channel is buffered.
Summary per file:
  • go/logic/migrator.go — Buffers the lock-sentinel delivery channel to prevent applier deadlock when cutover retries time out.

Copilot's findings


  • Files reviewed: 1/1 changed files
  • Comments generated: 2

Comment thread go/logic/migrator.go
Comment on lines +114 to +116
// Buffered to MaxRetries() to prevent a deadlock when waitForEventsUpToLock times out.
// (see https://github.com/github/gh-ost/pull/1637)
allEventsUpToLockProcessed: make(chan *lockProcessedStruct, context.MaxRetries()),

Copilot AI Apr 29, 2026


make(chan *lockProcessedStruct, context.MaxRetries()) won’t compile because the channel buffer size parameter must be of type int, but MigrationContext.MaxRetries() returns int64 (go/base/context.go:495). Convert to int (and consider guarding against overflow if you want to be extra defensive).
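The conversion Copilot asks for is mechanical; a hypothetical sketch (maxRetries here stands in for MigrationContext.MaxRetries(), which the comment says returns int64):

```go
package main

import "fmt"

// maxRetries stands in for MigrationContext.MaxRetries(), which returns int64.
func maxRetries() int64 { return 64 }

func main() {
	// Channel buffer sizes must be int, so convert explicitly; clamp
	// defensively in case the configured value is negative.
	n := maxRetries()
	if n < 0 {
		n = 0
	}
	ch := make(chan struct{}, int(n))
	fmt.Println(cap(ch)) // 64
}
```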

Comment thread go/logic/migrator.go
Comment on lines +114 to +116
// Buffered to MaxRetries() to prevent a deadlock when waitForEventsUpToLock times out.
// (see https://github.com/github/gh-ost/pull/1637)
allEventsUpToLockProcessed: make(chan *lockProcessedStruct, context.MaxRetries()),

Copilot AI Apr 29, 2026


This change is meant to prevent a specific deadlock scenario, but there’s no regression test exercising the “timeout then applier executes stale sentinel” behavior. Consider adding a unit test that constructs a Migrator, performs a non-blocking send on allEventsUpToLockProcessed without any receiver, and/or asserts the channel has the expected buffering based on MaxRetries() so future refactors don’t reintroduce the unbuffered-channel deadlock.

@meiji163 meiji163 merged commit 4deeadf into github:master Apr 29, 2026
9 checks passed


Development

Successfully merging this pull request may close these issues.

cut over not retrying ?

4 participants