Search before asking
What happened
After master node failover or restart, a BATCH job or a bounded Streaming job gets permanently stuck in the RUNNING state and never completes. The same issue occurs when a cancel or savepoint is triggered on a Streaming job and master failover happens during the shutdown phase.
Affected scenarios:
- BATCH jobs
- Streaming jobs with bounded sources (e.g. Kafka consuming up to a specified offset)
- Streaming jobs undergoing active cancel or savepoint when master failover occurs
Normal flow:
Source finishes reading → `SourceSplitEnumeratorTask` sends `LastCheckpointNotifyOperation` to the master → `CheckpointCoordinator` adds the task to the `readyToCloseStartingTask` set → once all source tasks are collected, the coordinator triggers `COMPLETED_POINT_TYPE` → the job completes normally.
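For illustration, a minimal sketch of the coordinator-side collection logic described above, under simplified assumptions: only `readyToCloseStartingTask`, `LastCheckpointNotifyOperation`, and `COMPLETED_POINT_TYPE` are names from the actual code; the class and method names below are hypothetical, not SeaTunnel's real API.

```java
// Hedged sketch of the coordinator-side collection logic, NOT the actual
// SeaTunnel implementation. readyToCloseStartingTask and COMPLETED_POINT_TYPE
// are real identifiers from this issue; everything else is hypothetical.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class CheckpointCoordinatorSketch {
    // In-memory only: lost when the master dies (the root cause below).
    private final Set<Long> readyToCloseStartingTask = ConcurrentHashMap.newKeySet();
    private final Set<Long> allSourceTaskIds;

    CheckpointCoordinatorSketch(Set<Long> allSourceTaskIds) {
        this.allSourceTaskIds = allSourceTaskIds;
    }

    // Called when a source task sends LastCheckpointNotifyOperation.
    void onLastCheckpointNotify(long taskId) {
        readyToCloseStartingTask.add(taskId);
        if (readyToCloseStartingTask.containsAll(allSourceTaskIds)) {
            triggerCompletedPoint(); // fires the final COMPLETED_POINT_TYPE checkpoint
        }
    }

    private void triggerCompletedPoint() {
        // Emits the final barrier that lets tasks in PREPARE_CLOSE finish.
    }
}
```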
Broken flow after master failover:
- `readyToCloseStartingTask` is pure in-memory state (not persisted). When the old master dies, this set is lost.
- `SourceSplitEnumeratorTask` sends `LastCheckpointNotifyOperation` only once, then transitions to `PREPARE_CLOSE`. After the new master takes over, the task is already in the `PREPARE_CLOSE` state and never re-sends the notification.
- A deadlock forms (see the sketch after this list):
  - Coordinator waiting for tasks: `readyToCloseStartingTask` is empty and can never be filled, so `COMPLETED_POINT_TYPE` is never triggered.
  - Tasks waiting for the coordinator: stuck in `PREPARE_CLOSE`, waiting for the `COMPLETED_POINT_TYPE` barrier to proceed.
  - Both sides wait for each other indefinitely; the job is permanently stuck.
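To make the deadlock concrete, here is a similarly hedged sketch of the task-side behavior: the notification is sent exactly once, so after failover neither side can make progress. `PREPARE_CLOSE`, `COMPLETED_POINT_TYPE`, and `LastCheckpointNotifyOperation` are real names from this issue; the state machine below is an illustrative simplification, not the actual SeaTunnel code.

```java
// Task-side sketch showing why the notification is one-shot. Illustrative
// only; the enum and method names are assumptions for this example.
class SourceSplitEnumeratorTaskSketch {
    enum State { RUNNING, PREPARE_CLOSE, CLOSED }

    private State state = State.RUNNING;

    void onReadingFinished() {
        notifyLastCheckpoint();      // sent exactly once to the current master...
        state = State.PREPARE_CLOSE; // ...then the task only waits for the barrier
    }

    void onBarrier(String checkpointType) {
        if (state == State.PREPARE_CLOSE && "COMPLETED_POINT_TYPE".equals(checkpointType)) {
            state = State.CLOSED; // never reached after failover: the new master's
                                  // readyToCloseStartingTask set starts empty, and
                                  // the task never re-sends its notification
        }
    }

    private void notifyLastCheckpoint() {
        // Sends LastCheckpointNotifyOperation to the master.
    }
}
```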
Log pattern observed:
Periodic checkpoints keep completing normally, but `COMPLETED_POINT_TYPE` is never triggered. Checkpoint IDs increment from 30 to 50+ with each round completing successfully:

```
[seatunnel-coordinator-service-22] - pending checkpoint(30/...) notify finished!
[seatunnel-coordinator-service-22] - start notify checkpoint completed, job id: ..., pipeline id: 1, checkpoint id: 30
[seatunnel-coordinator-service-47] - wait checkpoint completed: 30
[seatunnel-coordinator-service-22] - pending checkpoint(31/...) notify finished!
[seatunnel-coordinator-service-22] - start notify checkpoint completed, job id: ..., pipeline id: 1, checkpoint id: 31
[seatunnel-coordinator-service-47] - wait checkpoint completed: 31
... (continues indefinitely, checkpoint id keeps incrementing)
```

Normal checkpoints succeed, confirming the engine is running. However, `COMPLETED_POINT_TYPE` is never triggered and the job never ends.
SeaTunnel Version
2.3.13
SeaTunnel Config
Not applicable (engine-internal issue, not config-dependent).
Running Command
Error Exception
No exception is thrown. The job stays in the RUNNING state indefinitely without any error or completion log.
Symptom: periodic checkpoints keep completing normally (the checkpoint id keeps incrementing), but `COMPLETED_POINT_TYPE` is never triggered and the job never ends.
Zeta or Flink or Spark Version
2.3.13
Java or Scala Version
JDK 1.8
Screenshots
Are you willing to submit PR?
Code of Conduct