[Bug][Zeta] Job stuck permanently after master failover, unable to complete (affects BATCH / bounded source / job shutdown phase) #10834

@nzw921rx

Description

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

After a master node failover or restart, a BATCH job or a bounded Streaming job gets permanently stuck in the RUNNING state and never completes. The same issue occurs when a Streaming job triggers a cancel or savepoint and a master failover happens during the shutdown phase.

Affected scenarios:

  • BATCH jobs
  • Streaming jobs with bounded sources (e.g. Kafka consuming up to a specified offset)
  • Streaming jobs undergoing active cancel or savepoint when master failover occurs

Normal flow:

Source finishes reading → SourceSplitEnumeratorTask sends LastCheckpointNotifyOperation to master → CheckpointCoordinator adds task to readyToCloseStartingTask set → once all collected, triggers COMPLETED_POINT_TYPE → job completes normally.
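The normal flow above can be sketched as a minimal Java model. The class and method names (e.g. `CheckpointCoordinatorSketch`, `onLastCheckpointNotify`) are simplified stand-ins for the real Zeta classes, not the actual API:

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the normal completion handshake. Class, field, and
// method names are illustrative stand-ins, not the real Zeta implementation.
class CheckpointCoordinatorSketch {
    // In-memory set of tasks that have sent LastCheckpointNotifyOperation.
    final Set<Long> readyToCloseStartingTask = new HashSet<>();
    private final Set<Long> allStartingTasks;
    boolean completedPointTriggered = false;

    CheckpointCoordinatorSketch(Set<Long> allStartingTasks) {
        this.allStartingTasks = allStartingTasks;
    }

    // Invoked when a source task reports it has finished reading.
    void onLastCheckpointNotify(long taskId) {
        readyToCloseStartingTask.add(taskId);
        // Once every starting task has reported, the coordinator can
        // trigger the final COMPLETED_POINT_TYPE checkpoint.
        if (readyToCloseStartingTask.containsAll(allStartingTasks)) {
            completedPointTriggered = true;
        }
    }
}
```

The key point for the bug below: `readyToCloseStartingTask` exists only in the coordinator's heap, and completion depends entirely on its contents.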

Broken flow after master failover:

  1. readyToCloseStartingTask is purely in-memory state (not persisted). When the old master dies, the set is lost.
  2. SourceSplitEnumeratorTask sends LastCheckpointNotifyOperation exactly once and then transitions to PREPARE_CLOSE. When the new master takes over, the task is already in PREPARE_CLOSE and never re-sends the notification.
  3. A deadlock forms:
    • The coordinator waits for the tasks: readyToCloseStartingTask is empty and can never be refilled, so COMPLETED_POINT_TYPE is never triggered.
    • The tasks wait for the coordinator: they are stuck in PREPARE_CLOSE, waiting for the COMPLETED_POINT_TYPE barrier before they can proceed.
    • Both sides wait on each other indefinitely; the job is permanently stuck.
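The deadlock can be reproduced with a small model of the one-shot notification. Again, all names (`FailoverDeadlockSketch`, `maybeNotify`) are hypothetical stand-ins for illustration, not the real Zeta API:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the broken flow: the readiness set lives only in the master's
// memory, and each task notifies exactly once. Illustrative names only.
class FailoverDeadlockSketch {
    enum TaskState { RUNNING, PREPARE_CLOSE }

    static class SourceTaskSketch {
        TaskState state = TaskState.RUNNING;
        private boolean notifySent = false;

        // One-shot notification: after failover the task is already in
        // PREPARE_CLOSE and stays silent, so the new master never learns
        // that the source has finished.
        boolean maybeNotify(Set<Long> masterReadySet, long taskId) {
            if (notifySent) {
                return false;
            }
            masterReadySet.add(taskId);
            notifySent = true;
            state = TaskState.PREPARE_CLOSE;
            return true;
        }
    }

    public static void main(String[] args) {
        SourceTaskSketch task = new SourceTaskSketch();
        Set<Long> oldMasterSet = new HashSet<>();
        task.maybeNotify(oldMasterSet, 1L);        // delivered to old master

        Set<Long> newMasterSet = new HashSet<>();  // failover: old set is lost
        boolean resent = task.maybeNotify(newMasterSet, 1L);
        // Deadlock: new master's set stays empty, task stays in PREPARE_CLOSE.
        System.out.println(resent + " " + newMasterSet.isEmpty()
                + " " + (task.state == TaskState.PREPARE_CLOSE));
    }
}
```

A plausible fix direction (not implemented here) would be either persisting the readiness state in the checkpoint/IMap, or having tasks in PREPARE_CLOSE re-send the notification when a new master connects.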

Log pattern observed:

Periodic checkpoints keep completing normally, but COMPLETED_POINT_TYPE is never triggered. Checkpoint IDs increment from 30 to 50+ with each round completing successfully:

[seatunnel-coordinator-service-22] - pending checkpoint(30/...) notify finished!
[seatunnel-coordinator-service-22] - start notify checkpoint completed, job id: ..., pipeline id: 1, checkpoint id: 30
[seatunnel-coordinator-service-47] - wait checkpoint completed: 30
[seatunnel-coordinator-service-22] - pending checkpoint(31/...) notify finished!
[seatunnel-coordinator-service-22] - start notify checkpoint completed, job id: ..., pipeline id: 1, checkpoint id: 31
[seatunnel-coordinator-service-47] - wait checkpoint completed: 31
... (continues indefinitely, checkpoint id keeps incrementing)

Normal checkpoints succeed, confirming the engine is running. However, COMPLETED_POINT_TYPE is never triggered and the job never ends.

SeaTunnel Version

2.3.13

SeaTunnel Config

Not applicable (engine-internal issue, not config-dependent).

Running Command

Not applicable.

Error Exception

No exception thrown. The job stays in RUNNING state indefinitely without any error or completion log.
Symptom: periodic checkpoints keep completing normally (checkpoint id keeps incrementing), but COMPLETED_POINT_TYPE is never triggered and the job never ends.

Zeta or Flink or Spark Version

2.3.13

Java or Scala Version

jdk1.8

Screenshots


Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct
