
[RL] Remove redundant barrier and optimize model weights signal broadcast#7545

Open
Sunny-bot1 wants to merge 2 commits into PaddlePaddle:develop from Sunny-bot1:rl_signal

Conversation

@Sunny-bot1
Collaborator

@Sunny-bot1 Sunny-bot1 commented Apr 21, 2026

Motivation

In the dynamic_load_weight scenario, _broadcast_model_weights_signal broadcasts a GPU tensor, which triggers NCCL communication-stream operations and a DtoH synchronization via .numpy(). In addition, paddle.distributed.barrier() is called unconditionally after the signal is read, introducing unnecessary global synchronization overhead even when the signal is NORMAL (i.e. no weight update is needed).

Modifications

1. Optimize _broadcast_model_weights_signal

  • Replace the GPU tensor + paddle.distributed.broadcast + .numpy() pattern
    with paddle.distributed.broadcast_object_list
  • Avoids the cost of GPU tensor creation, NCCL communication, and the DtoH synchronization
  • A pure CPU-side operation with clearer semantics

2. Remove the redundant paddle.distributed.barrier()

  • The original barrier ran unconditionally before the signal value was checked,
    which is pure waste in the NORMAL case
  • Move the barrier into the model_weights_signal != NORMAL branch, so the global
    synchronization only happens when a weight update/clear actually has to run

3. Enable overlap schedule in the RL scenario
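
The broadcast change can be illustrated with a stdlib-only simulation (FakeGroup and the NORMAL/UPDATING constants are hypothetical stand-ins for this sketch, not FastDeploy code). It mimics the in-place semantics of paddle.distributed.broadcast_object_list: the source rank publishes a plain Python object and every rank ends up holding the same signal, with no GPU tensor or DtoH copy involved.

```python
# Stdlib-only sketch; FakeGroup is a hypothetical stand-in for a
# paddle.distributed process group, not FastDeploy code.
NORMAL, UPDATING = 0, 1

class FakeGroup:
    def __init__(self, world_size):
        self.world_size = world_size
        self._slot = None  # the shared "wire" the broadcast travels over

    def broadcast_object_list(self, object_list, src_rank, my_rank):
        # The source rank publishes; every rank overwrites its list in
        # place, mirroring paddle.distributed.broadcast_object_list.
        if my_rank == src_rank:
            self._slot = list(object_list)
        object_list[:] = self._slot

group = FakeGroup(world_size=4)
signals = []
for rank in range(group.world_size):  # rank 0 iterates first, so _slot is set
    sig = [UPDATING] if rank == 0 else [None]  # only rank 0 knows the signal
    group.broadcast_object_list(sig, src_rank=0, my_rank=rank)
    signals.append(sig[0])

print(signals)  # every rank holds the same signal: [1, 1, 1, 1]
```

Because every rank now holds the same plain-Python signal value, the subsequent `!= NORMAL` check takes the same branch on all ranks, which is what makes moving the barrier inside that branch safe.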

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented Apr 21, 2026

Thanks for your contribution!


Collaborator

@liyonghua0910 liyonghua0910 left a comment


LGTM


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 AI Code Review | 2026-04-21 21:05:40

📋 Review Summary

PR overview: In the dynamic_load_weight scenario, replace the GPU-tensor broadcast in _broadcast_model_weights_signal with broadcast_object_list, and move the redundant unconditional barrier into the != NORMAL branch, reducing unnecessary synchronization overhead.
Changed files: fastdeploy/engine/args_utils.py, fastdeploy/worker/worker_process.py
Impact tags: [RL] [Engine]


Issues

Level | File | Summary
❓ Question | args_utils.py:595 | Removes the rule that dynamic_load_weight disables enable_overlap_schedule; compatibility not explained
❓ Question | worker_process.py:489 | Deferring the barrier relies on the consistency guarantee of the broadcast; suggest adding a comment

Overall assessment

Both optimizations go in the right direction: replacing the GPU-tensor broadcast with broadcast_object_list clearly reduces DtoH synchronization overhead, and deferring the barrier is logically safe given the consistency guarantee of the broadcast. The main open question is whether enabling dynamic_load_weight and enable_overlap_schedule together (args_utils.py) has been sufficiently validated; this should be explained in the PR description.

or (self.speculative_config is not None and self.enable_logprob)
or self.splitwise_role == "prefill"
or self.dynamic_load_weight
):

❓ Question: removing the rule that dynamic_load_weight disables enable_overlap_schedule

Previously, when dynamic_load_weight=True, enable_overlap_schedule was forced to False to prevent the two from taking effect at the same time. This PR removes that restriction, but the PR description does not state that the compatibility of dynamic_load_weight + enable_overlap_schedule has been verified.

Please confirm:

  1. Has enabling both at the same time been verified in a production/test environment?
  2. In event_loop_normal, is the call order of _tp_barrier_wait and _broadcast_model_weights_signal still correct under the overlap-schedule path?

Collaborator Author


To be verified.

paddle.distributed.barrier()
if self.model_weights_signal[0] != ModelWeightsStatus.NORMAL:
if self.ranks > 1:
paddle.distributed.barrier()

❓ Question: the logic dependency after moving the barrier into the != NORMAL branch

Old logic: barrier unconditionally first (ensuring all ranks agree on the signal before checking it), then decide whether to update weights.

New logic: relies on broadcast_object_list inside _broadcast_model_weights_signal to guarantee that all ranks hold the same signal, so removing the unconditional barrier is theoretically safe. After the broadcast completes, every rank holds the same signal value, the != NORMAL check yields the same result on every rank, and there is no deadlock scenario where some ranks enter the barrier while others do not.

Consider adding a comment here, for example:

# broadcast_object_list guarantees signal consistency across ranks,
# so barrier is only needed when actually entering weight update/clear.
if self.model_weights_signal[0] != ModelWeightsStatus.NORMAL:
    if self.ranks > 1:
        paddle.distributed.barrier()

so that future maintainers can quickly see why the barrier can safely be deferred.
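
The no-deadlock argument can also be checked with a small stdlib threading simulation (run_ranks and the signal constants are hypothetical illustrations for this sketch, not FastDeploy code): since every simulated rank sees the same signal, either all of them enter the barrier or none do, and both cases run to completion.

```python
import threading

NORMAL, UPDATING = 0, 1

def run_ranks(world_size, signal):
    # Every rank sees the same signal (as broadcast_object_list guarantees),
    # so all ranks take the same branch: no partial-entry deadlock.
    barrier = threading.Barrier(world_size, timeout=5)
    done = []

    def worker(rank):
        if signal != NORMAL:   # same outcome on every rank
            barrier.wait()     # global sync only on weight update/clear
        done.append(rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(done)

print(run_ranks(4, NORMAL))    # [0, 1, 2, 3] -- no rank ever waits
print(run_ranks(4, UPDATING))  # [0, 1, 2, 3] -- all ranks sync, then proceed
```

If ranks could disagree on the signal, the UPDATING case would hang at `barrier.wait()`; the consistency guarantee of the broadcast is exactly what rules that out.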

@codecov-commenter

Codecov Report

❌ Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@8883757). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/worker/worker_process.py 0.00% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7545   +/-   ##
==========================================
  Coverage           ?   73.10%           
==========================================
  Files              ?      419           
  Lines              ?    57519           
  Branches           ?     9011           
==========================================
  Hits               ?    42049           
  Misses             ?    12639           
  Partials           ?     2831           
Flag Coverage Δ
GPU 73.10% <0.00%> (?)
