
[Feature] Support NaiveProposer for most cases #7669

Open
huicongyao wants to merge 5 commits into PaddlePaddle:develop from huicongyao:develop

Conversation

@huicongyao
Contributor

@huicongyao huicongyao commented Apr 29, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

--speculative-config '{"method": "naive"}'
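As a quick sanity check of the flag's value, the JSON string passed to `--speculative-config` can be parsed directly (a minimal illustrative sketch; only the `method` key is taken from the command above — this is not FastDeploy's actual config loader):

```python
import json

# Illustrative only: parse the same JSON string that is passed via
# --speculative-config; FastDeploy's real parsing lives in its config layer.
cfg = json.loads('{"method": "naive"}')
print(cfg["method"])  # naive
```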

Function Tests

| Test scenario | Mode | Status |
|---------------|------|--------|
| Bare FD | Centralized | |
| Bare FD | P/D disaggregated | |
| Bare FD | Centralized + logprob | |
| Bare FD | PD-disaggregated + logprob | |
| Bare FD | overlap | |
| RL scenario | All naive spec features | To be tested |

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented Apr 29, 2026

Thanks for your contribution!

PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot

PaddlePaddle-bot commented Apr 29, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-11 17:29:44

CI report generated from the code below (refreshed every 30 minutes):


1 Task overview

CI is still in progress: 1 required task failed (Approval), with 3 more running and 4 pending. The Approval check requires sign-off from the designated members before it can pass.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 36 (0) | 36 | 24 | 2 | 5 | 5 | 0 |

2 Task status summary

2.1 Required tasks: 2/10 passed

Required tasks block merging; failures must be addressed first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ❌ | Approval | 9s | PR issue: restricted directories modified, 4 designated-member approvals missing | Contact the designated reviewers for approval | Job | - |
| ⏳ | run_ce_cases | - | Running | - | Job | - |
| ⏳ | base_tests | - | Running | - | Job | - |
| ⏳ | stable_tests | - | Running | - | Job | - |
| ⏸️ | run_tests_with_coverage | - | Pending | - | - | - |
| ⏸️ | run_4_cards_tests | - | Pending | - | - | - |
| ⏸️ | run_xpu_4cards_cases | - | Pending | - | - | - |
| ⏸️ | run_xpu_8cards_cases | - | Pending | - | - | - |
| - | The remaining 2 required tasks passed | - | - | - | - | - |

2.2 Optional tasks — 22/26 passed

Optional tasks do not block merging; failures are informational only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| | Check PR Template | 14s | Job | - |
| | run_iluvatar_cases | - | Job | - |
| | Trigger Jenkins for PR | - | Job | - |
| ⏸️ | CI_HPU | - | - | - |
| - | The remaining 22 optional tasks passed | - | - | - |

3 Failure details (required only)

Approval — missing approvals (confidence: high)

Approval

  • Status: ❌ Failed
  • Error type: infrastructure (approval workflow)
  • Confidence: high
  • Root cause summary: the PR modifies restricted directories; check_approval.sh detected 4 missing designated-member approvals
  • Analyzer: generic analysis (fallback)

Root cause details:

The scripts/check_approval.sh script detected that this PR modifies protected directories/feature areas that require approval from specific members. 4 unmet approval requirements were found, and the script exited with code 6 (exit code = number of approval errors).

Key logs:

0. You must have one FastDeploy RD (qingqing01, Jiang-Jia-Jun, heavengate) approval for adding custom op.
1. You must have one PaddlePaddle RD (jeff41404, yongqiangma) approval for adding custom op.
2. You must have one FastDeploy RD gongshaotian approval for modifing [fastdeploy/model_executor/graph_optimization].
3. You must have one FastDeploy RD (freeliuzc, Deleter-D) approval for modifing [fastdeploy/spec_decode,custom_ops/gpu_ops/speculate_decoding].
There are 4 approved errors.
##[error]Process completed with exit code 6.

Suggested fixes:

  1. Ask @freeliuzc / @Deleter-D to review and approve the changes to fastdeploy/spec_decode and custom_ops/gpu_ops/speculate_decoding.
  2. If new custom ops are added, approval is needed from @qingqing01 / @Jiang-Jia-Jun / @heavengate (FastDeploy RD) and @jeff41404 / @yongqiangma (PaddlePaddle RD).
  3. Since fastdeploy/model_executor/graph_optimization is modified, @gongshaotian must also approve.

Fix summary: ask the designated reviewers (@freeliuzc / @Deleter-D, etc.) to complete their approvals.

Related changes: the PR modifies protected directories including fastdeploy/spec_decode, custom_ops/gpu_ops/speculate_decoding, and fastdeploy/model_executor/graph_optimization.

Link: view log

@huicongyao huicongyao changed the title [Feature] Support NaiveProposer for most cases and set to default [Feature] Support NaiveProposer for most cases Apr 29, 2026
@codecov-commenter

codecov-commenter commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 27.27273% with 24 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@ecce6b5). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/spec_decode/naive.py | 0.00% | 9 Missing ⚠️ |
| fastdeploy/worker/gpu_model_runner.py | 46.15% | 6 Missing and 1 partial ⚠️ |
| ...astdeploy/model_executor/layers/sample/logprobs.py | 0.00% | 3 Missing ⚠️ |
| ...cutor/layers/sample/ops/speculate_logprob_utils.py | 25.00% | 3 Missing ⚠️ |
| fastdeploy/spec_decode/types.py | 0.00% | 2 Missing ⚠️ |
Additional details and impacted files

```
@@            Coverage Diff             @@
##             develop    #7669   +/-   ##
==========================================
  Coverage           ?   71.61%
==========================================
  Files              ?      397
  Lines              ?    55733
  Branches           ?     8715
==========================================
  Hits               ?    39914
  Misses             ?    13073
  Partials           ?     2746
```

| Flag | Coverage Δ |
|---|---|
| GPU | 71.61% <27.27%> (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


PaddlePaddle-bot

This comment was marked as outdated.

Contributor

Copilot AI left a comment


Pull request overview

This PR brings SpecMethod.NAIVE into the unified proposer flow of the existing speculative decoding framework, so that most speculative code paths no longer depend on the "proposer=None when NAIVE" special case. It also adds the cu_batch_token_offset computation needed in the NAIVE + logprob scenario.

Changes:

  • SpecMethod.NAIVE now creates a NaiveProposer (no-op), and the GPU runner's dummy/postprocess flow uniformly calls proposer.run()
  • In the PD-disaggregated (prefill/decode) path, NAIVE mode now passes the first token's draft_token_ids and initializes the decode side's first-step draft_tokens/seq_lens_this_time.
  • A new speculate_compute_cu_batch_offset GPU op computes cu_batch_token_offset in the NAIVE logprob build path

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

Show a summary per file
| File | Description |
|---|---|
| fastdeploy/worker/gpu_model_runner.py | NAIVE branch: initialize draft_tokens, call the proposer uniformly in dummy/postprocess, and adjust the infer_seed update logic |
| fastdeploy/spec_decode/types.py | SpecMethod.NAIVE now creates a NaiveProposer instead of returning None |
| fastdeploy/spec_decode/naive.py | New NaiveProposer (no-op proposer) |
| fastdeploy/output/token_processor.py | In the splitwise prefill role, NAIVE also fills draft_token_ids when only 1 token is generated, for decode-side initialization |
| fastdeploy/model_executor/layers/sample/ops/speculate_logprob_utils.py | New speculate_compute_cu_batch_offset Python wrapper |
| fastdeploy/model_executor/layers/sample/ops/__init__.py | Export speculate_compute_cu_batch_offset |
| fastdeploy/model_executor/layers/sample/logprobs.py | NAIVE logprob build uses accept_tokens[:real_bsz] and computes cu_batch_token_offset |
| fastdeploy/model_executor/graph_optimization/cudagraph_piecewise_backend.py | More robust check for whether real_bsz_to_captured_size is empty, avoiding a misjudgment in the empty-dict case |
| fastdeploy/engine/sched/resource_manager_v1.py | PD decode side also copies draft_token_ids for NAIVE when receiving prefill output |
| fastdeploy/engine/common_engine.py | Same as above: copy draft_token_ids on the PD decode side for NAIVE |
| custom_ops/gpu_ops/speculate_decoding/speculate_logprob_utils.cu | New SpeculateComputeCuBatchOffset kernel and static op registration |
| custom_ops/gpu_ops/cpp_extensions.cc | New pybind export for speculate_compute_cu_batch_offset |

Comment on lines 80 to 90

```python
def create_proposer(self, fd_config, **kwargs) -> Optional["Proposer"]:
    """Factory method: create the appropriate Proposer for this method.

    Args:
        fd_config: FDConfig instance.
        **kwargs: Method-specific args forwarded to the Proposer constructor.
            MTP requires: main_model, local_rank, device_id, share_inputs.

    Returns:
        Proposer instance, or None for NAIVE.
    """
```
Comment on lines 80 to +94

```diff
 def create_proposer(self, fd_config, **kwargs) -> Optional["Proposer"]:
     """Factory method: create the appropriate Proposer for this method.

     Args:
         fd_config: FDConfig instance.
         **kwargs: Method-specific args forwarded to the Proposer constructor.
             MTP requires: main_model, local_rank, device_id, share_inputs.

     Returns:
         Proposer instance, or None for NAIVE.
     """
-    if self == SpecMethod.NAIVE:
-        return None
     from fastdeploy.spec_decode.naive import NaiveProposer

     return NaiveProposer(fd_config)
```
Comment on lines +27 to +31

```python
    Proposer for NaiveProposer.

    Not propose draft tokens, simply utilizing the framework
    to place the last autoregressively generated token in
    the first position of draft_tokens.
```
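The excerpt above only shows the docstring; a self-contained sketch of what such a no-op proposer could look like (class and method names mirror the PR's description, but the bodies here are illustrative — the real implementation lives in fastdeploy/spec_decode/naive.py):

```python
# Illustrative sketch of a no-op proposer; the real NaiveProposer inherits
# the framework's Proposer base class in fastdeploy/spec_decode/naive.py.
class Proposer:
    def run(self, share_inputs=None, **kwargs):
        raise NotImplementedError


class NaiveProposer(Proposer):
    """Proposes no draft tokens: the framework simply places the last
    autoregressively generated token in the first draft_tokens slot."""

    def __init__(self, fd_config):
        self.fd_config = fd_config  # kept so callers can still reach the config

    def run(self, share_inputs=None, **kwargs):
        # Intentionally a no-op: nothing to propose in NAIVE mode.
        return None


p = NaiveProposer(fd_config=None)
print(p.run())  # None
```

Returning a real (if inert) proposer is what lets callers drop their `proposer is None` special cases.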
Comment on lines +62 to +67

```python
    Compute cumulative batch offset via inclusive prefix sum of accept_num.
    """
    if current_platform.is_cuda():
        from fastdeploy.model_executor.ops.gpu import speculate_compute_cu_batch_offset

        speculate_compute_cu_batch_offset(cu_batch_token_offset, accept_num, real_bsz)
```
Comment on lines +188 to +191

```python
speculate_compute_cu_batch_offset(
    share_inputs["cu_batch_token_offset"],
    share_inputs["accept_num"],
    max_occupied_slots,
```

Comment on lines +56 to +69

```python
def speculate_compute_cu_batch_offset(
    cu_batch_token_offset: paddle.Tensor,
    accept_num: paddle.Tensor,
    real_bsz: int,
):
    """
    Compute cumulative batch offset via inclusive prefix sum of accept_num.
    """
    if current_platform.is_cuda():
        from fastdeploy.model_executor.ops.gpu import speculate_compute_cu_batch_offset

        speculate_compute_cu_batch_offset(cu_batch_token_offset, accept_num, real_bsz)
    else:
        raise NotImplementedError
```
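The docstring above says "inclusive prefix sum of accept_num"; a plain-Python reference of that semantics (illustrative only — the real op is the CUDA kernel, and the exact output layout of `cu_batch_token_offset` should be checked against the kernel):

```python
def cu_batch_offset_reference(accept_num, real_bsz):
    # Inclusive prefix sum over the first real_bsz entries of accept_num:
    # out[i] = accept_num[0] + ... + accept_num[i]
    out = []
    total = 0
    for n in accept_num[:real_bsz]:
        total += n
        out.append(total)
    return out


print(cu_batch_offset_reference([2, 1, 3], 3))  # [2, 3, 6]
```

Each entry is then the end offset of that request's accepted tokens in the flattened token buffer.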
```diff
 # Get real shape (total num tokens)
-if self.speculative_decoding and all(self.real_bsz_to_captured_size.values()):
+if (
+    self.speculative_decoding
```
Collaborator

What condition is this addition meant to guard against?

Contributor Author

In naive mode real_bsz_to_captured_size is empty, yet all(self.real_bsz_to_captured_size.values()) evaluates to True, which would make this code fail at runtime.
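The pitfall described here is easy to reproduce in isolation: `all()` over an empty iterable is vacuously True, so the branch needs an explicit emptiness check.

```python
real_bsz_to_captured_size = {}  # empty in naive mode

# all() over an empty view is vacuously True, so this alone wrongly passes:
print(all(real_bsz_to_captured_size.values()))  # True

# Guarding with an explicit emptiness check fixes the branch condition:
print(bool(real_bsz_to_captured_size) and all(real_bsz_to_captured_size.values()))  # False
```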

```diff
 # NAIVE mode: one token per request, logits are already correct
 output_logits = logits
-token_ids = share_inputs["accept_tokens"][:max_occupied_slots, 0]
+token_ids = share_inputs["accept_tokens"][:real_bsz, 0]
```
Collaborator

Could this reuse the speculative-decoding branch?

Contributor Author

It should be possible.

Comment thread fastdeploy/output/token_processor.py Outdated
```python
        result.outputs.draft_token_ids = copy.deepcopy(token_ids)
    elif (
        self.cfg.speculative_config.method == SpecMethod.NAIVE
        and self.cfg.scheduler_config.splitwise_role == "prefill"
```
Collaborator

The condition here is a bit tangled; it would be better to branch directly on the method rather than mixing a len check with a method check.

Contributor Author

The condition logic here has been cleaned up.

```python
class NaiveProposer(Proposer):
    """
    Proposer for NaiveProposer.
```

Collaborator

Which variables here actually require a proposer? Consider removing this structure to avoid redundant code.

Contributor Author

In initialize_forwardmeta, `self.proposer.fd_config.model_config.moe_phase.phase = "decode"` is used.

```python
raise ValueError(
    "Expected at least 2 draft tokens for speculative suffix decode, "
    f"but got {len(draft_tokens_to_write)} for request {request.request_id}."
if self.spec_method in (SpecMethod.MTP, SpecMethod.SUFFIX):
```
Collaborator

suffix does not need this token passed for now; and even if it were passed, the value here is dynamic.

Contributor Author

Before Naive support was added, this was also passed for suffix; removing it would break the unit tests, so it is kept here for consistency.

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-11 17:31:44

📋 Review summary

PR overview: fully integrates NaiveProposer into the framework, covering the main scenarios: centralized, PD-disaggregated, logprob, and overlap

Scope of changes: fastdeploy/spec_decode/, custom_ops/gpu_ops/speculate_decoding/, fastdeploy/worker/gpu_model_runner.py, fastdeploy/model_executor/, fastdeploy/engine/, fastdeploy/output/

Impact tags: [Speculative Decoding] [OP] [Graph Optimization] [Engine]

📝 PR template check

The ## Motivation and ## Modifications sections are empty (template placeholders only), and the required ## Accuracy Tests section is missing (the PR body uses a non-standard ## Function Tests instead). Suggested replacement for the PR description:

Suggested PR description (copy-paste ready):

## Motivation
NaiveProposer was not properly integrated into the inference framework in either the PD-disaggregated or centralized scenario: in NAIVE mode `create_proposer` returned `None`, forcing special-case `proposer is None` checks in several places; the logprob path lacked support for computing `cu_batch_token_offset`; and `draft_token_ids` was not propagated correctly in the PD-disaggregated scenario. This PR integrates NaiveProposer across the full framework pipeline, supporting the main scenarios: centralized, PD-disaggregated, logprob, and overlap.

## Modifications
- `fastdeploy/spec_decode/naive.py`: add the `NaiveProposer` class with a no-op `_run_impl`
- `fastdeploy/spec_decode/types.py`: NAIVE's `create_proposer` now returns `NaiveProposer(fd_config)` instead of `None`
- `fastdeploy/worker/gpu_model_runner.py`: `insert_tasks_v1` gains a NAIVE path (write 1 token, seq_len=1); `infer_seed` is updated synchronously in NAIVE mode
- `fastdeploy/engine/common_engine.py` and `resource_manager_v1.py`: extend the PD-disaggregated decode-node condition to NAIVE
- `fastdeploy/output/token_processor.py`: in the prefill phase, NAIVE mode sets `draft_token_ids` so the decode node can initialize its first decode step
- `custom_ops/gpu_ops/speculate_decoding/speculate_logprob_utils.cu`: add the `SpeculateComputeCuBatchOffset` kernel and its registration
- `fastdeploy/model_executor/layers/sample/logprobs.py`: the NAIVE logprob path calls the new kernel to compute `cu_batch_token_offset`
- `fastdeploy/model_executor/graph_optimization/cudagraph_piecewise_backend.py`: fix a potential wrong branch entry when `real_bsz_to_captured_size` is an empty dict

## Usage or Command
`--speculative-config '{"method": "naive"}'`

## Accuracy Tests
N/A (NAIVE mode is a framework-integration change and does not affect model logits accuracy; function tests below)

| Test scenario | Mode | Status |
|---------|------|---------|
| Bare FD | Centralized | |
| Bare FD | P/D disaggregated | |
| Bare FD | Centralized + logprob | |
| Bare FD | PD-disaggregated + logprob | |
| Bare FD | overlap | |
| RL scenario | All naive spec features | To be tested |

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/model_executor/layers/sample/logprobs.py:191 | speculate_compute_cu_batch_offset is passed max_occupied_slots rather than real_bsz, inconsistent with token_ids[:real_bsz] in the same block |
| 🟡 Suggestion | custom_ops/gpu_ops/speculate_decoding/speculate_logprob_utils.cu | The new kernel lacks a unit test under tests/operators/ (required by A3) |

Overall assessment

The implementation approach is clear: the kernel registration, Python bindings, and call sites are all updated in sync, and the cudagraph empty-dict bug fix is welcome. It is recommended to confirm whether max_occupied_slots and real_bsz are always equal in NAIVE mode, and to add a kernel unit test.
