[Bug][Zeta] Job stuck permanently after master failover, unable to complete (affects BATCH / bounded source / job shutdown phase) #10834

@nzw921rx

Description

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

After a master node failover or restart, a BATCH job or a bounded Streaming job gets permanently stuck in the RUNNING state and never completes. The same issue occurs when a Streaming job triggers a cancel or savepoint and a master failover happens during the shutdown phase.

Affected scenarios:

  • BATCH jobs
  • Streaming jobs with bounded sources (e.g. Kafka consuming up to a specified offset)
  • Streaming jobs undergoing active cancel or savepoint when master failover occurs

Normal flow:

Source finishes reading → SourceSplitEnumeratorTask sends LastCheckpointNotifyOperation to master → CheckpointCoordinator adds task to readyToCloseStartingTask set → once all collected, triggers COMPLETED_POINT_TYPE → job completes normally.
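The normal flow above can be sketched as a minimal Java model. The class and method names (e.g. `CheckpointCoordinatorSketch`, `onLastCheckpointNotify`) are simplified stand-ins for the real Zeta classes, not the actual API:

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the normal completion handshake. Class, field, and
// method names are illustrative stand-ins, not the real Zeta implementation.
class CheckpointCoordinatorSketch {
    // In-memory set of tasks that have sent LastCheckpointNotifyOperation.
    final Set<Long> readyToCloseStartingTask = new HashSet<>();
    private final Set<Long> allStartingTasks;
    boolean completedPointTriggered = false;

    CheckpointCoordinatorSketch(Set<Long> allStartingTasks) {
        this.allStartingTasks = allStartingTasks;
    }

    // Invoked when a source task reports it has finished reading.
    void onLastCheckpointNotify(long taskId) {
        readyToCloseStartingTask.add(taskId);
        // Once every starting task has reported, the coordinator can
        // trigger the final COMPLETED_POINT_TYPE checkpoint.
        if (readyToCloseStartingTask.containsAll(allStartingTasks)) {
            completedPointTriggered = true;
        }
    }
}
```

The key point for the bug below: `readyToCloseStartingTask` exists only in the coordinator's heap, and completion depends entirely on its contents.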

Broken flow after master failover:

  1. readyToCloseStartingTask is purely in-memory state (not persisted). When the old master dies, the set is lost.
  2. SourceSplitEnumeratorTask sends LastCheckpointNotifyOperation exactly once and then transitions to PREPARE_CLOSE. When the new master takes over, the task is already in PREPARE_CLOSE and never re-sends the notification.
  3. A deadlock forms:
    • The coordinator waits for the tasks: readyToCloseStartingTask is empty and can never be refilled, so COMPLETED_POINT_TYPE is never triggered.
    • The tasks wait for the coordinator: they are stuck in PREPARE_CLOSE, waiting for the COMPLETED_POINT_TYPE barrier before they can proceed.
    • Both sides wait on each other indefinitely; the job is permanently stuck.
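The deadlock can be reproduced with a small model of the one-shot notification. Again, all names (`FailoverDeadlockSketch`, `maybeNotify`) are hypothetical stand-ins for illustration, not the real Zeta API:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the broken flow: the readiness set lives only in the master's
// memory, and each task notifies exactly once. Illustrative names only.
class FailoverDeadlockSketch {
    enum TaskState { RUNNING, PREPARE_CLOSE }

    static class SourceTaskSketch {
        TaskState state = TaskState.RUNNING;
        private boolean notifySent = false;

        // One-shot notification: after failover the task is already in
        // PREPARE_CLOSE and stays silent, so the new master never learns
        // that the source has finished.
        boolean maybeNotify(Set<Long> masterReadySet, long taskId) {
            if (notifySent) {
                return false;
            }
            masterReadySet.add(taskId);
            notifySent = true;
            state = TaskState.PREPARE_CLOSE;
            return true;
        }
    }

    public static void main(String[] args) {
        SourceTaskSketch task = new SourceTaskSketch();
        Set<Long> oldMasterSet = new HashSet<>();
        task.maybeNotify(oldMasterSet, 1L);        // delivered to old master

        Set<Long> newMasterSet = new HashSet<>();  // failover: old set is lost
        boolean resent = task.maybeNotify(newMasterSet, 1L);
        // Deadlock: new master's set stays empty, task stays in PREPARE_CLOSE.
        System.out.println(resent + " " + newMasterSet.isEmpty()
                + " " + (task.state == TaskState.PREPARE_CLOSE));
    }
}
```

A plausible fix direction (not implemented here) would be either persisting the readiness state in the checkpoint/IMap, or having tasks in PREPARE_CLOSE re-send the notification when a new master connects.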

Log pattern observed:

Periodic checkpoints keep completing normally, but COMPLETED_POINT_TYPE is never triggered. Checkpoint IDs increment from 30 to 50+ with each round completing successfully:

[seatunnel-coordinator-service-22] - pending checkpoint(30/...) notify finished!
[seatunnel-coordinator-service-22] - start notify checkpoint completed, job id: ..., pipeline id: 1, checkpoint id: 30
[seatunnel-coordinator-service-47] - wait checkpoint completed: 30
[seatunnel-coordinator-service-22] - pending checkpoint(31/...) notify finished!
[seatunnel-coordinator-service-22] - start notify checkpoint completed, job id: ..., pipeline id: 1, checkpoint id: 31
[seatunnel-coordinator-service-47] - wait checkpoint completed: 31
... (continues indefinitely, checkpoint id keeps incrementing)

Normal checkpoints succeed, confirming the engine is running. However, COMPLETED_POINT_TYPE is never triggered and the job never ends.

SeaTunnel Version

2.3.13

SeaTunnel Config

Not applicable (engine-internal issue, not config-dependent).

Running Command

Not applicable.

Error Exception

No exception thrown. The job stays in RUNNING state indefinitely without any error or completion log.
Symptom: periodic checkpoints keep completing normally (checkpoint id keeps incrementing), but COMPLETED_POINT_TYPE is never triggered and the job never ends.

Zeta or Flink or Spark Version

2.3.13

Java or Scala Version

jdk1.8

Screenshots


Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct
