[RL] Remove redundant barrier and optimize model weights signal broadcast #7545
Sunny-bot1 wants to merge 2 commits into PaddlePaddle:develop from
Conversation
PaddlePaddle-bot left a comment
Thanks for your contribution!
🤖 AI Code Review
2026-04-21 21:05:40
📋 Review Summary
PR overview: In the dynamic_load_weight scenario, replace the GPU tensor broadcast in _broadcast_model_weights_signal with broadcast_object_list, and move the redundant unconditional barrier into the != NORMAL branch, reducing unnecessary synchronization overhead.
Scope of changes: fastdeploy/engine/args_utils.py, fastdeploy/worker/worker_process.py
Impact tags: [RL] [Engine]
Issues
| Level | File | Summary |
|---|---|---|
| ❓ Question | args_utils.py:595 | Removes the dynamic_load_weight override that disabled enable_overlap_schedule; compatibility is not explained |
| ❓ Question | worker_process.py:489 | Deferring the barrier relies on the broadcast's consistency guarantee; a clarifying comment is recommended |
Overall Assessment
Both optimizations go in the right direction: replacing the GPU tensor broadcast with broadcast_object_list clearly reduces DtoH synchronization overhead, and deferring the barrier is logically safe under the broadcast's consistency guarantee. The main open question is whether enabling dynamic_load_weight and enable_overlap_schedule together (args_utils.py) has been sufficiently validated; please add a note to the PR description.
```python
    or (self.speculative_config is not None and self.enable_logprob)
    or self.splitwise_role == "prefill"
    or self.dynamic_load_weight
):
```
❓ Question: removal of the dynamic_load_weight override that disabled enable_overlap_schedule
Previously, when dynamic_load_weight=True, enable_overlap_schedule was forced to False so that the two could not be active at the same time. This PR removes that restriction, but the PR description does not state that the compatibility of dynamic_load_weight + enable_overlap_schedule has been verified.
Please confirm:
- Has enabling both at the same time been validated in a production or test environment?
- Is the call order of _tp_barrier_wait and _broadcast_model_weights_signal on the overlap-schedule path in event_loop_normal still correct in overlap mode?
```python
paddle.distributed.barrier()
if self.model_weights_signal[0] != ModelWeightsStatus.NORMAL:
    if self.ranks > 1:
        paddle.distributed.barrier()
```
❓ Question: logic after moving the barrier into the != NORMAL branch
Old logic: an unconditional barrier first (ensuring all ranks see a consistent signal before branching), then decide whether to update weights.
New logic: relies on broadcast_object_list in _broadcast_model_weights_signal having already guaranteed that all ranks hold the same signal, so removing the unconditional barrier is safe in theory: after the broadcast completes, every rank holds the same signal value, the != NORMAL check evaluates identically everywhere, and no deadlock can arise from some ranks entering the barrier while others skip it.
Consider adding a comment here, for example:
```python
# broadcast_object_list guarantees signal consistency across ranks,
# so barrier is only needed when actually entering weight update/clear.
if self.model_weights_signal[0] != ModelWeightsStatus.NORMAL:
    if self.ranks > 1:
        paddle.distributed.barrier()
```
This way future maintainers can quickly understand why the barrier can safely be deferred.
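The no-deadlock argument can be checked with a small rank simulation. This is an illustrative sketch in plain Python (no paddle): `broadcast_object_list` here is a hypothetical stand-in that copies rank 0's object to every rank, and `NORMAL`/`UPDATING` stand in for `ModelWeightsStatus` values.

```python
NORMAL, UPDATING = 0, 1  # hypothetical stand-ins for ModelWeightsStatus


def broadcast_object_list(signals, src=0):
    """Model the collective: every rank ends up holding the src rank's object."""
    value = signals[src]
    return [value for _ in signals]


def branch_decisions(local_signals):
    """Broadcast the signal, then evaluate the `!= NORMAL` branch on each rank."""
    signals = broadcast_object_list(local_signals)  # consistency point
    return [sig != NORMAL for sig in signals]       # per-rank branch decision


# Before the broadcast only rank 0 has observed the update; after it,
# every rank takes the same branch, so a barrier inside the branch is
# entered by either all ranks or none.
decisions = branch_decisions([UPDATING, NORMAL, NORMAL, NORMAL])
assert len(set(decisions)) == 1
```

The key property is that the branch condition depends only on broadcast state, never on rank-local state, which is exactly what makes the deferred barrier safe.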
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```
@@           Coverage Diff            @@
##           develop    #7545   +/-  ##
==========================================
  Coverage         ?   73.10%
==========================================
  Files            ?      419
  Lines            ?    57519
  Branches         ?     9011
==========================================
  Hits             ?    42049
  Misses           ?    12639
  Partials         ?     2831
```

Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
Motivation
In the dynamic_load_weight scenario, _broadcast_model_weights_signal uses a GPU tensor for the broadcast, which triggers NCCL communication-stream operations plus a DtoH synchronization via .numpy(). In addition, paddle.distributed.barrier() is called unconditionally after the signal is read, introducing unnecessary global synchronization overhead even when the signal is NORMAL (i.e. no weight update is needed).
Modifications
1. Optimize _broadcast_model_weights_signal: replace the paddle.distributed.broadcast + .numpy() approach with paddle.distributed.broadcast_object_list.
2. Remove the redundant unconditional paddle.distributed.barrier() by moving it into the model_weights_signal != NORMAL branch, so global synchronization only happens when a weight update/clear is actually performed.
3. Enable overlap schedule in the RL scenario.
Usage or Command
Accuracy Tests
Checklist
- [ ] Add one of the PR tags in the title: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- [ ] Run pre-commit before commit.
- [ ] If the PR targets the release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.