[Feature] Support NaiveProposer for most cases #7669
huicongyao wants to merge 5 commits into PaddlePaddle:develop
Conversation
Thanks for your contribution!
CI report generated from the code below (updated every 30 minutes):

1. Task overview: CI is still in progress; 1 required task failed (
2. Task status summary
   2.1 Required tasks: 2/10 passed
   2.2 Optional tasks — 22/26 passed
3. Failure details (required only): Approval — approval missing (confidence: high)

Root cause details:
Key logs: Suggested fix:
Fix summary: please have the designated reviewers (@freeliuzc / @Deleter-D) complete the approval. Related changes: the PR modifies Link: view logs
Codecov Report

❌ Patch coverage is
Additional details and impacted files:

```
@@           Coverage Diff            @@
##           develop    #7669   +/-  ##
=========================================
  Coverage         ?   71.61%
=========================================
  Files            ?      397
  Lines            ?    55733
  Branches         ?     8715
=========================================
  Hits             ?    39914
  Misses           ?    13073
  Partials         ?     2746
```

Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry.
Pull request overview
This PR brings SpecMethod.NAIVE into the existing speculative decoding framework's unified proposer flow, so most speculative paths no longer depend on the special-case branch "proposer=None when NAIVE". It also adds the cu_batch_token_offset computation needed for the NAIVE + logprob scenario.
Changes:
- Introduce a NaiveProposer (no-op) for SpecMethod.NAIVE, and call proposer.run() uniformly in the GPU runner's dummy/postprocess flows.
- In the PD-disaggregated (prefill/decode) path, pass the first token's draft_token_ids for NAIVE mode, and initialize draft_tokens/seq_lens_this_time for the first step on the decode side.
- Add a speculate_compute_cu_batch_offset GPU op and compute cu_batch_token_offset in the NAIVE logprob build flow.
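The "unified proposer flow" described above can be sketched as follows (hypothetical simplified classes for illustration, not the actual FastDeploy implementation):

```python
class Proposer:
    """Base interface: every speculative method supplies a proposer."""

    def run(self, share_inputs):
        raise NotImplementedError


class NaiveProposer(Proposer):
    """No-op proposer for NAIVE: it proposes no extra draft tokens,
    but its presence removes the `proposer is None` special case."""

    def run(self, share_inputs):
        pass  # nothing to propose


def postprocess(proposer, share_inputs):
    # Before this PR the runner had to guard with `if proposer is not None:`.
    # With NaiveProposer, every method has a proposer, so the call is unconditional.
    proposer.run(share_inputs)


postprocess(NaiveProposer(), {})  # runs without a None check
```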
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| fastdeploy/worker/gpu_model_runner.py | NAIVE branch: complete draft_tokens initialization, unified proposer call in dummy/postprocess, and adjusted infer_seed update logic |
| fastdeploy/spec_decode/types.py | SpecMethod.NAIVE now creates a NaiveProposer instead of returning None |
| fastdeploy/spec_decode/naive.py | New NaiveProposer (no-op proposer) |
| fastdeploy/output/token_processor.py | Under the splitwise prefill role, NAIVE also fills draft_token_ids when only 1 token is generated, for decode-side initialization |
| fastdeploy/model_executor/layers/sample/ops/speculate_logprob_utils.py | New speculate_compute_cu_batch_offset Python wrapper |
| fastdeploy/model_executor/layers/sample/ops/__init__.py | Export speculate_compute_cu_batch_offset |
| fastdeploy/model_executor/layers/sample/logprobs.py | NAIVE logprob build uses accept_tokens[:real_bsz] and computes cu_batch_token_offset |
| fastdeploy/model_executor/graph_optimization/cudagraph_piecewise_backend.py | More robust check for whether real_bsz_to_captured_size is empty, avoiding misjudging the empty-dict case |
| fastdeploy/engine/sched/resource_manager_v1.py | On the PD decode side, also copy draft_token_ids for NAIVE when receiving prefill output |
| fastdeploy/engine/common_engine.py | Same as above: copy draft_token_ids on the NAIVE + PD decode side |
| custom_ops/gpu_ops/speculate_decoding/speculate_logprob_utils.cu | New SpeculateComputeCuBatchOffset kernel and static op registration |
| custom_ops/gpu_ops/cpp_extensions.cc | New pybind export of speculate_compute_cu_batch_offset |
```diff
 def create_proposer(self, fd_config, **kwargs) -> Optional["Proposer"]:
     """Factory method: create the appropriate Proposer for this method.

     Args:
         fd_config: FDConfig instance.
         **kwargs: Method-specific args forwarded to the Proposer constructor.
             MTP requires: main_model, local_rank, device_id, share_inputs.

     Returns:
         Proposer instance, or None for NAIVE.
     """
     if self == SpecMethod.NAIVE:
-        return None
+        from fastdeploy.spec_decode.naive import NaiveProposer
+
+        return NaiveProposer(fd_config)
```
```python
    Proposer for the NAIVE speculative method.

    Does not propose draft tokens; it simply uses the framework
    to place the last autoregressively generated token in
    the first position of draft_tokens.
```
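A toy illustration of what "place the last generated token in the first position of draft_tokens" means (assumed list-based layout; the real code operates on paddle tensors):

```python
def seed_draft_tokens(draft_tokens, last_accepted):
    """Write each request's last accepted token into draft slot 0.

    draft_tokens: per-request draft slots (padded with -1 here);
    last_accepted: the last autoregressively generated token per request.
    """
    for i, tok in enumerate(last_accepted):
        draft_tokens[i][0] = tok
    return draft_tokens


drafts = [[-1, -1], [-1, -1]]
print(seed_draft_tokens(drafts, [7, 9]))  # → [[7, -1], [9, -1]]
```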
```python
speculate_compute_cu_batch_offset(
    share_inputs["cu_batch_token_offset"],
    share_inputs["accept_num"],
    max_occupied_slots,
)
```
```python
def speculate_compute_cu_batch_offset(
    cu_batch_token_offset: paddle.Tensor,
    accept_num: paddle.Tensor,
    real_bsz: int,
):
    """
    Compute cumulative batch offset via inclusive prefix sum of accept_num.
    """
    if current_platform.is_cuda():
        from fastdeploy.model_executor.ops.gpu import speculate_compute_cu_batch_offset

        speculate_compute_cu_batch_offset(cu_batch_token_offset, accept_num, real_bsz)
    else:
        raise NotImplementedError
```
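As a reference for the kernel's semantics, here is a plain-Python equivalent of the inclusive prefix sum described in the docstring (an assumed reading of the op, not the CUDA source):

```python
def cu_batch_offset_ref(accept_num, real_bsz):
    """Inclusive prefix sum over the first real_bsz entries of accept_num."""
    offsets, total = [], 0
    for n in accept_num[:real_bsz]:
        total += n
        offsets.append(total)
    return offsets


# Three active requests accepting 2, 1, and 3 tokens respectively:
print(cu_batch_offset_ref([2, 1, 3, 0], real_bsz=3))  # → [2, 3, 6]
```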
```diff
 # Get real shape (total num tokens)
-if self.speculative_decoding and all(self.real_bsz_to_captured_size.values()):
+if (
+    self.speculative_decoding
```
real_bsz_to_captured_size is empty in NAIVE mode, but all(self.real_bsz_to_captured_size.values()) evaluates to True, which causes this code path to fail at runtime.
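The pitfall the reviewer points out is that `all()` over an empty iterable is vacuously True, so an empty dict slips past the guard:

```python
real_bsz_to_captured_size = {}  # empty in NAIVE mode, per the review comment

# all() of an empty sequence is vacuously True, so this check alone
# does not prove the dict has usable entries:
print(all(real_bsz_to_captured_size.values()))  # → True

# A safer guard also requires the dict to be non-empty:
print(bool(real_bsz_to_captured_size) and all(real_bsz_to_captured_size.values()))  # → False
```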
```diff
 # NAIVE mode: one token per request, logits are already correct
 output_logits = logits
-token_ids = share_inputs["accept_tokens"][:max_occupied_slots, 0]
+token_ids = share_inputs["accept_tokens"][:real_bsz, 0]
```
```python
result.outputs.draft_token_ids = copy.deepcopy(token_ids)
elif (
    self.cfg.speculative_config.method == SpecMethod.NAIVE
    and self.cfg.scheduler_config.splitwise_role == "prefill"
```
The condition here is a bit messy; it would be better to branch directly on the method rather than mixing a len check with a method check.
```python
class NaiveProposer(Proposer):
    """
    Proposer for the NAIVE speculative method.
```
Which variables here actually require a proposer? Suggest removing this structure so we don't add redundant code.
At initialize_forwardmeta, `self.proposer.fd_config.model_config.moe_phase.phase = "decode"` is used.
```python
    raise ValueError(
        "Expected at least 2 draft tokens for speculative suffix decode, "
        f"but got {len(draft_tokens_to_write)} for request {request.request_id}."
    )

if self.spec_method in (SpecMethod.MTP, SpecMethod.SUFFIX):
```
SUFFIX does not need this token passed for now; besides, even if it is passed, it is dynamic here.
Before this NAIVE support, it was also passed for suffix; removing it would break the unit tests, so it is kept here for consistency.
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-11 17:31:44
📋 Review Summary
PR overview: fully integrates NaiveProposer into the framework, supporting the main scenarios (centralized, PD-disaggregated, logprob, overlap)
Scope of changes: fastdeploy/spec_decode/, custom_ops/gpu_ops/speculate_decoding/, fastdeploy/worker/gpu_model_runner.py, fastdeploy/model_executor/, fastdeploy/engine/, fastdeploy/output/
Impact tags: [Speculative Decoding] [OP] [Graph Optimization] [Engine]
📝 PR convention check
The ## Motivation and ## Modifications sections are empty (template placeholders only), and the required ## Accuracy Tests section is missing (the PR body substitutes a non-standard ## Function Tests). Suggested replacement for the full PR description below.
Suggested PR description (can be copied directly):
## Motivation
NaiveProposer was not effectively integrated into the inference framework in either the PD-disaggregated or the centralized scenario: in NAIVE mode `create_proposer` returned `None`, so many places needed a special `proposer is None` check; the logprob path lacked support for computing `cu_batch_token_offset`; and in PD-disaggregated scenarios `draft_token_ids` was not propagated correctly. This PR wires NaiveProposer through the full framework, supporting the centralized, PD-disaggregated, logprob, and overlap scenarios.
## Modifications
- `fastdeploy/spec_decode/naive.py`: add a `NaiveProposer` class whose `_run_impl` is a no-op
- `fastdeploy/spec_decode/types.py`: NAIVE's `create_proposer` now returns `NaiveProposer(fd_config)` instead of `None`
- `fastdeploy/worker/gpu_model_runner.py`: `insert_tasks_v1` gains a NAIVE path (writes 1 token, seq_len=1); `infer_seed` is updated in sync in NAIVE mode
- `fastdeploy/engine/common_engine.py`, `resource_manager_v1.py`: the PD-disaggregated decode-node condition is extended to NAIVE
- `fastdeploy/output/token_processor.py`: in the prefill phase, NAIVE mode sets `draft_token_ids` so the decode node can initialize its first decode step
- `custom_ops/gpu_ops/speculate_decoding/speculate_logprob_utils.cu`: add the `SpeculateComputeCuBatchOffset` kernel and its registration
- `fastdeploy/model_executor/layers/sample/logprobs.py`: the NAIVE logprob path calls the new kernel to compute `cu_batch_token_offset`
- `fastdeploy/model_executor/graph_optimization/cudagraph_piecewise_backend.py`: fix a potential false entry when `real_bsz_to_captured_size` is an empty dict
## Usage or Command
`--speculative-config '{"method": "naive"}'`
## Accuracy Tests
N/A (NAIVE mode is a framework-integration change and does not affect model logits accuracy; function tests below)
| Test scenario | Mode | Status |
|---------|------|---------|
| Bare FD | Centralized | ✅ |
| Bare FD | P/D disaggregated | ✅ |
| Bare FD | Centralized + logprob | ✅ |
| Bare FD | PD disaggregated + logprob | ✅ |
| Bare FD | Overlap | ✅ |
| RL scenario | All naive spec features | To be tested |
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/model_executor/layers/sample/logprobs.py:191 | speculate_compute_cu_batch_offset is passed max_occupied_slots rather than real_bsz, which is inconsistent with token_ids[:real_bsz] in the same block |
| 🟡 Suggestion | custom_ops/gpu_ops/speculate_decoding/speculate_logprob_utils.cu | the new kernel lacks a unit test under tests/operators/ (required by A3) |
Overall assessment
The implementation is clear: the kernel registration, Python bindings, and call sites are all updated in sync, and the cudagraph empty-dict bug fix is welcome. Please confirm whether max_occupied_slots and real_bsz are always equal in NAIVE mode, and add a unit test for the kernel.
Motivation
Modifications
Usage or Command
`--speculative-config '{"method": "naive"}'`
Function Tests
Checklist
- Add at least a tag in the PR title. Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- Format your code, run `pre-commit` before commit.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.