Skip to content

[BugFix] Fix get_tasks returns empty task list bug#7557

Open
liyonghua0910 wants to merge 1 commit intoPaddlePaddle:developfrom
liyonghua0910:develop+20260422_fix_get_tasks
Open

[BugFix] Fix get_tasks returns empty task list bug#7557
liyonghua0910 wants to merge 1 commit intoPaddlePaddle:developfrom
liyonghua0910:develop+20260422_fix_get_tasks

Conversation

@liyonghua0910
Copy link
Copy Markdown
Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

💡 如若此PR是Cherry Pick,PR标题需遵循格式,在最开始加上[Cherry-Pick]标签,以及最后面加上原PR ID,例如[Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot
Copy link
Copy Markdown

paddle-bot Bot commented Apr 22, 2026

Thanks for your contribution!

Copy link
Copy Markdown

@PaddlePaddle-bot PaddlePaddle-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review | 2026-04-22 15:45:50

📋 Review 摘要

PR 概述:修复 TP(张量并行)模式下 get_tasks() 因竞态条件返回空任务列表的 Bug。
变更范围fastdeploy/worker/worker_process.py — Worker 事件循环同步逻辑
影响面 Tag[Engine]


📝 PR 规范检查

PR 标题含有效 [BugFix] Tag,但 Motivation / Modifications 正文均为空,未描述变更内容与影响面。建议补充(可直接复制):

## Motivation
在 TP(张量并行)模式下,多 TP rank 共享任务队列。`get_tasks()` 只有在所有客户端
读取完毕后才清空任务列表。若 rank-0 在 rank-N 完成 `get_tasks()` 并重置
`exist_task_signal` 之前回到 while 循环顶部,会再次进入 if 块并调用 `get_tasks()`。
此时任务列表已被清空,导致 `tasks=[]`,触发 `assert len(tasks) > 0`。

## Modifications
在 `get_tasks()` 和信号重置之后、EP barrier 之前,增加一次 `_tp_barrier_wait()`,
确保所有 TP rank 同步后才能继续,消除竞态窗口。

问题

级别 文件 概述
🟡 建议 worker_process.py:551 新增 barrier 缺少注释说明意图
🟡 建议 Bug fix 无回归测试

关于回归测试tests/worker/test_worker_process.py 中已有 _tp_barrier_wait 的 Mock,建议补充一个模拟 rank-0 提前回到循环的竞态场景测试,防止未来引入相同问题。


总体评价

修复逻辑正确,在 get_tasks() 和信号重置之后新增 _tp_barrier_wait() 是消除该竞态的最小代价方案,与文件中其他 barrier 的使用模式一致。建议补充 PR 描述和行内注释,提高可维护性。

self.task_queue.read_finish_flag.set(0)
else:
self.exist_task_signal.value[0] = ExistTaskStatus.EMPTY
self._tp_barrier_wait() if tp_size > 1 else None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 建议 建议补充注释说明此 barrier 的作用

此行是竞态修复的关键点,但缺少说明 为什么 需要在这里同步。建议参考上方第 484-485 行的注释风格,补充如下:

# Synchronize all TP workers after get_tasks() and signal reset to prevent
# rank-0 from looping back and calling get_tasks() again before rank-N resets
# exist_task_signal, which would result in an empty task list.
self._tp_barrier_wait() if tp_size > 1 else None

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@7707be8). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7557   +/-   ##
==========================================
  Coverage           ?   72.46%           
==========================================
  Files              ?      419           
  Lines              ?    57633           
  Branches           ?     9033           
==========================================
  Hits               ?    41765           
  Misses             ?    13051           
  Partials           ?     2817           
Flag Coverage Δ
GPU 72.46% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants