[RL] Remove redundant barrier and optimize model weights signal broadcast #7545
Sunny-bot1 wants to merge 2 commits into PaddlePaddle:develop from
Conversation
PaddlePaddle-bot left a comment
Thanks for your contribution!
🤖 AI Code Review
2026-04-21 21:05:40
📋 Review Summary
PR overview: In the dynamic_load_weight scenario, replace the GPU tensor broadcast in _broadcast_model_weights_signal with broadcast_object_list, and move the redundant unconditional barrier into the != NORMAL branch, reducing unnecessary synchronization overhead.
Scope of changes: fastdeploy/engine/args_utils.py, fastdeploy/worker/worker_process.py
Impact tags: [RL] [Engine]
Issues
| Level | File | Summary |
|---|---|---|
| ❓ Question | args_utils.py:595 | Removes the dynamic_load_weight override that disabled enable_overlap_schedule; compatibility is not explained |
| ❓ Question | worker_process.py:489 | Deferring the barrier relies on the broadcast's consistency guarantee; a clarifying comment is recommended |
Overall Assessment
Both optimizations go in the right direction: replacing the GPU tensor broadcast with broadcast_object_list clearly reduces DtoH synchronization overhead, and deferring the barrier is logically safe under the broadcast's consistency guarantee. The main open question is whether enabling dynamic_load_weight and enable_overlap_schedule together (args_utils.py) has been sufficiently validated; please add a note to the PR description.
```python
    or (self.speculative_config is not None and self.enable_logprob)
    or self.splitwise_role == "prefill"
    or self.dynamic_load_weight
):
```
❓ Question: removal of the dynamic_load_weight override that disabled enable_overlap_schedule
Previously, when dynamic_load_weight=True, enable_overlap_schedule was forced to False so that the two could not be active at the same time. This PR removes that restriction, but the PR description does not state that the compatibility of dynamic_load_weight + enable_overlap_schedule has been verified.
Please confirm:
- Has enabling both at the same time been validated in a production or test environment?
- Is the call order of _tp_barrier_wait and _broadcast_model_weights_signal on the overlap-schedule path in event_loop_normal still correct in overlap mode?
```python
paddle.distributed.barrier()
if self.model_weights_signal[0] != ModelWeightsStatus.NORMAL:
    if self.ranks > 1:
        paddle.distributed.barrier()
```
❓ Question: logic after moving the barrier into the != NORMAL branch
Old logic: an unconditional barrier first (ensuring all ranks see a consistent signal before branching), then decide whether to update weights.
New logic: relies on broadcast_object_list in _broadcast_model_weights_signal having already guaranteed that all ranks hold the same signal, so removing the unconditional barrier is safe in theory: after the broadcast completes, every rank holds the same signal value, the != NORMAL check evaluates identically everywhere, and no deadlock can arise from some ranks entering the barrier while others skip it.
Consider adding a comment here, for example:
```python
# broadcast_object_list guarantees signal consistency across ranks,
# so barrier is only needed when actually entering weight update/clear.
if self.model_weights_signal[0] != ModelWeightsStatus.NORMAL:
    if self.ranks > 1:
        paddle.distributed.barrier()
```
This way future maintainers can quickly understand why the barrier can safely be deferred.
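The no-deadlock argument can be checked with a small rank simulation. This is an illustrative sketch in plain Python (no paddle): `broadcast_object_list` here is a hypothetical stand-in that copies rank 0's object to every rank, and `NORMAL`/`UPDATING` stand in for `ModelWeightsStatus` values.

```python
NORMAL, UPDATING = 0, 1  # hypothetical stand-ins for ModelWeightsStatus


def broadcast_object_list(signals, src=0):
    """Model the collective: every rank ends up holding the src rank's object."""
    value = signals[src]
    return [value for _ in signals]


def branch_decisions(local_signals):
    """Broadcast the signal, then evaluate the `!= NORMAL` branch on each rank."""
    signals = broadcast_object_list(local_signals)  # consistency point
    return [sig != NORMAL for sig in signals]       # per-rank branch decision


# Before the broadcast only rank 0 has observed the update; after it,
# every rank takes the same branch, so a barrier inside the branch is
# entered by either all ranks or none.
decisions = branch_decisions([UPDATING, NORMAL, NORMAL, NORMAL])
assert len(set(decisions)) == 1
```

The key property is that the branch condition depends only on broadcast state, never on rank-local state, which is exactly what makes the deferred barrier safe.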
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```
@@           Coverage Diff            @@
##           develop    #7545   +/-  ##
==========================================
  Coverage         ?   73.10%
==========================================
  Files            ?      419
  Lines            ?    57519
  Branches         ?     9011
==========================================
  Hits             ?    42049
  Misses           ?    12639
  Partials         ?     2831
```

Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
Motivation
In the dynamic_load_weight scenario, _broadcast_model_weights_signal uses a GPU tensor for the broadcast, which triggers NCCL communication-stream operations plus a DtoH synchronization via .numpy(). In addition, paddle.distributed.barrier() is called unconditionally after the signal is read, introducing unnecessary global synchronization overhead even when the signal is NORMAL (i.e. no weight update is needed).
Modifications
1. Optimize _broadcast_model_weights_signal: replace the paddle.distributed.broadcast + .numpy() approach with paddle.distributed.broadcast_object_list.
2. Remove the redundant unconditional paddle.distributed.barrier() by moving it into the model_weights_signal != NORMAL branch, so global synchronization only happens when a weight update/clear is actually performed.
3. Enable overlap schedule in the RL scenario.
Usage or Command
Accuracy Tests
Checklist
- [ ] Add one of the PR tags in the title: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- [ ] Run pre-commit before commit.
- [ ] If the PR targets the release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.