
[Speculative Decoding] Refine ngram kernel signature and adapt ngram proposer#7774

Open
NKNaN wants to merge 1 commit into PaddlePaddle:develop from NKNaN:spec-ngram

Conversation

@NKNaN
Contributor

@NKNaN NKNaN commented May 11, 2026

Motivation

End-to-end result verification of the speculative-decoding ngram method.

Modifications

  1. Test script (runs successfully in a single-card A800 AI Studio environment):

    # test.py
    from fastdeploy import LLM, SamplingParams
    
    # Scenario 1: code generation; variable names, keywords, and structure repeat heavily, so the ngram hit rate is high
    msg1 = [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": (
            "用 Python 写一个 Student 类,包含以下方法:\n"
            "1. __init__(self, name, age, score)\n"
            "2. get_name(self) 返回 self.name\n"
            "3. get_age(self) 返回 self.age\n"
            "4. get_score(self) 返回 self.score\n"
            "5. set_name(self, name) 设置 self.name\n"
            "6. set_age(self, age) 设置 self.age\n"
            "7. set_score(self, score) 设置 self.score\n"
            "8. __repr__(self) 返回 f'Student(name={self.name}, age={self.age}, score={self.score})'\n"
            "请完整实现所有方法。"
        )},
    ]
    
    # Scenario 2: structured list; every item has the same format, so prefix n-grams repeat heavily during generation
    msg2 = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": (
            "请列出20个中国城市,每条格式为:\n"
            "城市名:xxx,省份:xxx,人口:约xxx万,著名景点:xxx\n"
            "请严格按照这个格式输出全部20条,不要省略。"
        )},
    ]
    
    messages = [msg1, msg2]
    
    # Sampling parameters
    sampling_params = SamplingParams(top_p=0.95, max_tokens=6400)
    
    # Load the model
    llm = LLM(
        model="baidu/ERNIE-4.5-0.3B-Paddle",
        tensor_parallel_size=1,
        max_model_len=8192,
        speculative_config={
            "method": "ngram",
            "num_speculative_tokens": 5,   # speculate at most 5 draft tokens per round, range [1, 5]
            "max_ngram_size": 5,           # maximum n-gram window, default 5
        },
       # enable_overlap_schedule=True,
    )
    
    outputs = llm.chat(messages, sampling_params)
    
    # Print the results
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs.text
        print(prompt)
        print(generated_text)
  2. Modify the ngram kernel interface:
    Since input_ids and pre_ids are now fully merged into token_ids_all, input_ids is removed from the original interface; prompt tokens and predicted tokens are recorded entirely by token_ids_all.
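    To make the interface change concrete, here is a hypothetical pure-Python sketch of the lookup the kernel now performs against token_ids_all alone (the function name, signature, and fallback behavior are illustrative assumptions, not the actual CUDA kernel):

```python
def ngram_propose(token_ids_all, seq_len, max_ngram_size=5, num_speculative_tokens=5):
    """Sketch: match the trailing n-gram of token_ids_all (prompt + generated
    tokens in one buffer, no separate input_ids) against earlier positions,
    and propose the tokens that followed the match as draft tokens."""
    context = token_ids_all[:seq_len]
    for n in range(max_ngram_size, 0, -1):  # prefer the longest matching suffix
        if seq_len <= n:
            continue
        suffix = context[-n:]
        # scan backwards so the most recent earlier occurrence wins
        for start in range(seq_len - n - 1, -1, -1):
            if context[start:start + n] == suffix:
                draft = context[start + n:start + n + num_speculative_tokens]
                if draft:
                    return draft
    return []  # no match: fall back to normal one-token decoding
```

    For example, for the buffer `[7, 8, 9, 7, 8]` the trailing bigram `[7, 8]` matches position 0, so the tokens after it, starting with 9, are proposed as drafts.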

  3. Confirm that the modified ngram match kernel executes correctly end to end:

    1. Initialization of token_ids_all and input_ids_cpu:
    # fastdeploy/worker/input_batch.py: 114-115
    self.token_ids_all = paddle.full(
        [max_num_seqs, self.model_config.max_model_len], ...
    )
    # fastdeploy/worker/input_batch.py: 280-281
    self.input_ids_cpu = paddle.full(
        shape=[max_num_seqs, self.model_config.max_model_len], ...
    )
    2. Verify that the prompt portion of token_ids_all is identical between the write (in gpu_model_runner) and the read (in NgramProposer._run_impl), checked via printed logs:
    # fastdeploy/worker/gpu_model_runner.py: 916-919
    # prompt_tokens
    async_set_value(self.share_inputs["token_ids_all"][idx : idx + 1, :prompt_len], prompt_token_ids)
    # generated_token_ids fill -1
    self.share_inputs["token_ids_all"][idx : idx + 1, prompt_len:] = -1
    
    ## Log token_ids_all[i, 0:20] and token_ids_all[i, prompt_len-3:prompt_len+3] here
    logger.info(f"[NGRAM][VERIFY-WRITE] idx={idx} prompt_len={prompt_len} "
            f"token_ids_all[0:20]={self.share_inputs['token_ids_all'][idx, :20].tolist()} "
            f"token_ids_all[pl-3:pl+3]={self.share_inputs['token_ids_all'][idx, prompt_len-3:prompt_len+3].tolist()}")
    # Add at the beginning of ngram.py _run_impl
    def _run_impl(self, share_inputs):
        """
        run
        """
        if not hasattr(self, '_debug_call_count'):
            self._debug_call_count = 0
        if self._debug_call_count < 3:
            pl = share_inputs["prompt_lens"]
            tia = share_inputs["token_ids_all"]
            si = share_inputs["step_idx"]
            for bid in range(pl.shape[0]):
                plen = int(pl[bid].item())
                if plen > 0:
                    logger.info(f"[NGRAM][VERIFY-READ] call={self._debug_call_count} bid={bid} "
                                f"step_idx={int(si[bid].item())} prompt_len={plen} "
                                f"token_ids_all[0:20]={tia[bid, :20].tolist()} "
                                f"token_ids_all[pl-3:pl]={tia[bid, plen-3:plen].tolist()} "
                                f"seq_lens_dec={int(share_inputs['seq_lens_decoder'][bid].item())}")
            self._debug_call_count += 1

        ngram_match(...)
    # View the log
    (base) aistudio@ssh-5453289-10284016-bf48d89cf-ph9f8:~/FastDeploy$ grep '\[NGRAM\]' log/paddle/workerlog.0
    INFO     2026-05-10 13:31:12,504 684126 gpu_model_runner.py[line:920] [NGRAM][VERIFY-WRITE] idx=0 prompt_len=168 token_ids_all[0:20]=[100273, 2520, 524, 274, 20472, 17461, 27963, 93937, 23, 2969, 93963, 16816, 12199, 93919, 94667, 748, 36619, 69716, 93956, 10553] token_ids_all[pl-3:pl+3]=[92267, 93963, 93919, -1, -1, -1]
    INFO     2026-05-10 13:31:12,514 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=0 step_idx=1 prompt_len=168 token_ids_all[0:20]=[100273, 2520, 524, 274, 20472, 17461, 27963, 93937, 23, 2969, 93963, 16816, 12199, 93919, 94667, 748, 36619, 69716, 93956, 10553] token_ids_all[pl-3:pl]=[92267, 93963, 93919]seq_lens_dec=168 
    INFO     2026-05-10 13:31:12,516 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=1 step_idx=13 prompt_len=4096 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,516 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=2 step_idx=13 prompt_len=2048 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,517 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=3 step_idx=13 prompt_len=2048 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,517 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=4 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,517 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=5 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,518 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=6 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,519 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=7 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,522 684126 gpu_model_runner.py[line:920] [NGRAM][VERIFY-WRITE] idx=1 prompt_len=54 token_ids_all[0:20]=[100273, 2969, 93963, 69157, 63191, 5, 3, 94016, 1358, 3671, 93956, 94405, 94525, 14246, 94022, 94035, 23, 3671, 94312, 94035] token_ids_all[pl-3:pl+3]=[92267, 93963, 93919, -1, -1, -1]
    INFO     2026-05-10 13:31:12,533 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=0 step_idx=2 prompt_len=168 token_ids_all[0:20]=[100273, 2520, 524, 274, 20472, 17461, 27963, 93937, 23, 2969, 93963, 16816, 12199, 93919, 94667, 748, 36619, 69716, 93956, 10553] token_ids_all[pl-3:pl]=[92267, 93963, 93919]seq_lens_dec=169 
    INFO     2026-05-10 13:31:12,533 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=1 step_idx=1 prompt_len=54 token_ids_all[0:20]=[100273, 2969, 93963, 69157, 63191, 5, 3, 94016, 1358, 3671, 93956, 94405, 94525, 14246, 94022, 94035, 23, 3671, 94312, 94035] token_ids_all[pl-3:pl]=[92267, 93963, 93919]seq_lens_dec=54 
    INFO     2026-05-10 13:31:12,535 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=2 step_idx=13 prompt_len=2048 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,535 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=3 step_idx=13 prompt_len=2048 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,536 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=4 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,536 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=5 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,536 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=6 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,536 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=7 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,541 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=0 step_idx=3 prompt_len=168 token_ids_all[0:20]=[100273, 2520, 524, 274, 20472, 17461, 27963, 93937, 23, 2969, 93963, 16816, 12199, 93919, 94667, 748, 36619, 69716, 93956, 10553] token_ids_all[pl-3:pl]=[92267, 93963, 93919]seq_lens_dec=170 
    INFO     2026-05-10 13:31:12,542 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=1 step_idx=2 prompt_len=54 token_ids_all[0:20]=[100273, 2969, 93963, 69157, 63191, 5, 3, 94016, 1358, 3671, 93956, 94405, 94525, 14246, 94022, 94035, 23, 3671, 94312, 94035] token_ids_all[pl-3:pl]=[92267, 93963, 93919]seq_lens_dec=55 
    INFO     2026-05-10 13:31:12,542 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=2 step_idx=13 prompt_len=2048 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,542 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=3 step_idx=13 prompt_len=2048 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,542 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=4 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,543 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=5 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,543 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=6 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,543 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=7 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0

    Slots where token_ids_all is all 5 are dummy batches, with seq_lens_decoder=0. For all other slots, the prompt portion of token_ids_all reads back exactly what was written.
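    As a small illustration of that dummy-batch check (the helper name and pad id 5 are assumptions inferred from the logs above, not FastDeploy API):

```python
def is_dummy_slot(token_ids_row, seq_lens_decoder, pad_id=5):
    # A batch slot is a dummy/padding slot when its decoder length is 0
    # and the row holds only the pad token (5 in the logs above).
    return seq_lens_decoder == 0 and all(t == pad_id for t in token_ids_row)
```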

    3. Verify that whenever the ngram_match call in ngram.py finds a match, the resulting draft_token[i, 1:proposed_length] is contained in token_ids_all[:prompt_len+step_idx[i]]:
    ngram_match(...)
    
    # Add at the end of ngram.py _run_impl
    if not hasattr(self, '_debug_call_count'):
        self._debug_call_count = 0
    if self._debug_call_count < 50:
        tia = share_inputs["token_ids_all"]
        pl  = share_inputs["prompt_lens"] 
        si  = share_inputs["step_idx"]
        dt  = share_inputs["draft_tokens"]
        slt = share_inputs["seq_lens_this_time"]
        print(f"[NGRAM-DEBUG] call={self._debug_call_count} "
            f"slt={slt.tolist()} "
            f"step_idx={si.tolist()} "
            f"prompt_lens={pl.tolist()} "
            f"draft_token_num={share_inputs['actual_draft_token_num'].tolist()} "
            f"seq_dec={share_inputs['seq_lens_decoder'].tolist()}")
        for bid in range(slt.shape[0]):
            n_proposed = int(slt[bid].item()) - 1
            if n_proposed <= 0:
                continue
            step = int(si[bid].item())
            plen = int(pl[bid].item())
            context = tia[bid, :plen + step].tolist()
            proposed = dt[bid, 1:1 + n_proposed].tolist()
    
            # Search for the proposed sequence in the context
            found = any(
                context[i:i + n_proposed] == proposed
                for i in range(len(context) - n_proposed + 1)
            )
            logger.info(f"[NGRAM][E2E] call={self._debug_call_count} bid={bid} step_idx={step} "
                        f"proposed={proposed} found_in_context={found}")
        self._debug_call_count += 1
    # View [NGRAM-DEBUG]
    (base) aistudio@ssh-5453289-10284016-bf48d89cf-ph9f8:~/FastDeploy$ grep '\[NGRAM-DEBUG\]' log/paddle/workerlog.0
    [NGRAM-DEBUG] call=0 slt=[1] step_idx=[[1], [13], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [4096], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[168, 0, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=1 slt=[1, 1] step_idx=[[2], [1], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[169, 54, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=2 slt=[1, 1] step_idx=[[3], [2], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[170, 55, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=3 slt=[6, 1] step_idx=[[4], [3], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[171, 56, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=4 slt=[1, 6] step_idx=[[5], [4], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[172, 57, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=5 slt=[6, 1] step_idx=[[6], [5], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[173, 58, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=6 slt=[6, 6] step_idx=[[7], [6], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[174, 59, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=7 slt=[6, 1] step_idx=[[8], [7], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[175, 60, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=8 slt=[6, 6] step_idx=[[10], [8], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[177, 61, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=9 slt=[1, 6] step_idx=[[11], [9], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[178, 62, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=10 slt=[1, 6] step_idx=[[12], [10], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[179, 63, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=11 slt=[6, 6] step_idx=[[13], [11], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[180, 64, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=12 slt=[6, 6] step_idx=[[15], [13], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[182, 66, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=13 slt=[1, 6] step_idx=[[16], [14], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[183, 67, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=14 slt=[1, 1] step_idx=[[17], [16], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[184, 69, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=15 slt=[6, 6] step_idx=[[18], [17], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[185, 70, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=16 slt=[1, 1] step_idx=[[19], [18], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[186, 71, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=17 slt=[6, 1] step_idx=[[20], [19], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[187, 72, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=18 slt=[6, 6] step_idx=[[21], [20], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[188, 73, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=19 slt=[6, 1] step_idx=[[22], [21], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[189, 74, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=20 slt=[1, 1] step_idx=[[23], [22], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[190, 75, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=21 slt=[1, 6] step_idx=[[24], [23], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[191, 76, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=22 slt=[6, 6] step_idx=[[25], [24], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[192, 77, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=23 slt=[6, 6] step_idx=[[31], [25], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[198, 78, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=24 slt=[1, 1] step_idx=[[35], [26], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[202, 79, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=25 slt=[6, 6] step_idx=[[36], [27], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[203, 80, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=26 slt=[1, 1] step_idx=[[37], [28], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[204, 81, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=27 slt=[1, 6] step_idx=[[38], [29], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[205, 82, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=28 slt=[6, 1] step_idx=[[39], [30], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[206, 83, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=29 slt=[4, 6] step_idx=[[40], [31], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[207, 84, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=30 slt=[1, 6] step_idx=[[41], [32], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[208, 85, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=31 slt=[1, 1] step_idx=[[42], [33], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[209, 86, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=32 slt=[1, 1] step_idx=[[43], [34], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[210, 87, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=33 slt=[6, 6] step_idx=[[44], [35], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[211, 88, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=34 slt=[6, 6] step_idx=[[45], [36], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[212, 89, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=35 slt=[6, 1] step_idx=[[46], [38], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[213, 91, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=36 slt=[1, 1] step_idx=[[47], [39], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[214, 92, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=37 slt=[6, 6] step_idx=[[48], [40], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[215, 93, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=38 slt=[6, 1] step_idx=[[49], [41], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[216, 94, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=39 slt=[1, 1] step_idx=[[50], [42], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[217, 95, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=40 slt=[1, 6] step_idx=[[51], [43], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[218, 96, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=41 slt=[6, 4] step_idx=[[52], [44], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[219, 97, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=42 slt=[6, 1] step_idx=[[53], [46], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[220, 99, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=43 slt=[6, 6] step_idx=[[54], [47], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[221, 100, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=44 slt=[6, 1] step_idx=[[56], [48], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[223, 101, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=45 slt=[6, 6] step_idx=[[57], [49], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[224, 102, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=46 slt=[6, 1] step_idx=[[58], [50], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[225, 103, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=47 slt=[1, 1] step_idx=[[59], [51], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[226, 104, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=48 slt=[6, 6] step_idx=[[60], [52], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[227, 105, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=49 slt=[6, 3] step_idx=[[61], [53], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[228, 106, 0, 0, 0, 0, 0, 0]
    # View [NGRAM]
    (base) aistudio@ssh-5453289-10284016-bf48d89cf-ph9f8:~/FastDeploy$ grep '\[NGRAM\]' log/paddle/workerlog.0
    INFO     2026-05-10 15:48:31,030 726574 ngram.py[line:84] [NGRAM][E2E] call=3 bid=0 step_idx=4proposed=[93949, 695, 7858, 804, 93937] found_in_context=True
    INFO     2026-05-10 15:48:31,033 726574 ngram.py[line:84] [NGRAM][E2E] call=4 bid=1 step_idx=4proposed=[23, 3671, 94312, 94035, 14045] found_in_context=True
    INFO     2026-05-10 15:48:31,037 726574 ngram.py[line:84] [NGRAM][E2E] call=5 bid=0 step_idx=6proposed=[93956, 10553, 4923, 1919, 94035] found_in_context=True
    INFO     2026-05-10 15:48:31,041 726574 ngram.py[line:84] [NGRAM][E2E] call=6 bid=0 step_idx=7proposed=[4162, 1919, 93977, 23, 92267] found_in_context=True
    INFO     2026-05-10 15:48:31,043 726574 ngram.py[line:84] [NGRAM][E2E] call=6 bid=1 step_idx=6proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
    INFO     2026-05-10 15:48:31,046 726574 ngram.py[line:84] [NGRAM][E2E] call=7 bid=0 step_idx=8proposed=[10553, 4923, 1919, 94035, 23] found_in_context=True
    INFO     2026-05-10 15:48:31,050 726574 ngram.py[line:84] [NGRAM][E2E] call=8 bid=0 step_idx=10proposed=[1919, 93977, 23, 92267, 93963] found_in_context=True
    INFO     2026-05-10 15:48:31,050 726574 ngram.py[line:84] [NGRAM][E2E] call=8 bid=1 step_idx=8proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
    INFO     2026-05-10 15:48:31,053 726574 ngram.py[line:84] [NGRAM][E2E] call=9 bid=1 step_idx=9proposed=[14045, 94466, 93956, 17340, 33015] found_in_context=True
    INFO     2026-05-10 15:48:31,057 726574 ngram.py[line:84] [NGRAM][E2E] call=10 bid=1 step_idx=10proposed=[93937, 42854, 94035, 3991, 93956] found_in_context=True
    INFO     2026-05-10 15:48:31,060 726574 ngram.py[line:84] [NGRAM][E2E] call=11 bid=0 step_idx=13proposed=[23, 4, 93937, 1377, 1472] found_in_context=True
    INFO     2026-05-10 15:48:31,060 726574 ngram.py[line:84] [NGRAM][E2E] call=11 bid=1 step_idx=11proposed=[3, 94016, 1358, 3671, 93956] found_in_context=True
    INFO     2026-05-10 15:48:31,064 726574 ngram.py[line:84] [NGRAM][E2E] call=12 bid=0 step_idx=15proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
    INFO     2026-05-10 15:48:31,064 726574 ngram.py[line:84] [NGRAM][E2E] call=12 bid=1 step_idx=13proposed=[94016, 1358, 3671, 93956, 94405] found_in_context=True
    INFO     2026-05-10 15:48:31,067 726574 ngram.py[line:84] [NGRAM][E2E] call=13 bid=1 step_idx=14proposed=[93956, 17340, 33015, 94035, 14045] found_in_context=True
    INFO     2026-05-10 15:48:31,074 726574 ngram.py[line:84] [NGRAM][E2E] call=15 bid=0 step_idx=18proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
    INFO     2026-05-10 15:48:31,074 726574 ngram.py[line:84] [NGRAM][E2E] call=15 bid=1 step_idx=17proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
    INFO     2026-05-10 15:48:31,081 726574 ngram.py[line:84] [NGRAM][E2E] call=17 bid=0 step_idx=20proposed=[69716, 93956, 10553, 4923, 1919] found_in_context=True
    INFO     2026-05-10 15:48:31,085 726574 ngram.py[line:84] [NGRAM][E2E] call=18 bid=0 step_idx=21proposed=[16816, 12199, 93919, 94667, 748] found_in_context=True
    INFO     2026-05-10 15:48:31,085 726574 ngram.py[line:84] [NGRAM][E2E] call=18 bid=1 step_idx=20proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
    INFO     2026-05-10 15:48:31,090 726574 ngram.py[line:84] [NGRAM][E2E] call=19 bid=0 step_idx=22proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
    INFO     2026-05-10 15:48:31,098 726574 ngram.py[line:84] [NGRAM][E2E] call=21 bid=1 step_idx=23proposed=[3671, 94312, 94035, 14045, 93956] found_in_context=True
    INFO     2026-05-10 15:48:31,102 726574 ngram.py[line:84] [NGRAM][E2E] call=22 bid=0 step_idx=25proposed=[1472, 6946, 804, 93938, 853] found_in_context=True
    INFO     2026-05-10 15:48:31,102 726574 ngram.py[line:84] [NGRAM][E2E] call=22 bid=1 step_idx=24proposed=[3, 94016, 1358, 3671, 93956] found_in_context=True
    INFO     2026-05-10 15:48:31,106 726574 ngram.py[line:84] [NGRAM][E2E] call=23 bid=0 step_idx=31proposed=[4816, 93938, 10714, 93948, 23] found_in_context=True
    INFO     2026-05-10 15:48:31,106 726574 ngram.py[line:84] [NGRAM][E2E] call=23 bid=1 step_idx=25proposed=[42854, 94035, 3991, 93956, 20932] found_in_context=True
    INFO     2026-05-10 15:48:31,112 726574 ngram.py[line:84] [NGRAM][E2E] call=25 bid=0 step_idx=36proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
    INFO     2026-05-10 15:48:31,113 726574 ngram.py[line:84] [NGRAM][E2E] call=25 bid=1 step_idx=27proposed=[23, 3671, 94312, 94035, 14045] found_in_context=True
    INFO     2026-05-10 15:48:31,120 726574 ngram.py[line:84] [NGRAM][E2E] call=27 bid=1 step_idx=29proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
    INFO     2026-05-10 15:48:31,123 726574 ngram.py[line:84] [NGRAM][E2E] call=28 bid=0 step_idx=39proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
    INFO     2026-05-10 15:48:31,127 726574 ngram.py[line:84] [NGRAM][E2E] call=29 bid=0 step_idx=40proposed=[3099, 23, 283] found_in_context=True
    INFO     2026-05-10 15:48:31,127 726574 ngram.py[line:84] [NGRAM][E2E] call=29 bid=1 step_idx=31proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
    INFO     2026-05-10 15:48:31,132 726574 ngram.py[line:84] [NGRAM][E2E] call=30 bid=1 step_idx=32proposed=[4, 5, 3, 3, 94466] found_in_context=True
    INFO     2026-05-10 15:48:31,142 726574 ngram.py[line:84] [NGRAM][E2E] call=33 bid=0 step_idx=44proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
    INFO     2026-05-10 15:48:31,142 726574 ngram.py[line:84] [NGRAM][E2E] call=33 bid=1 step_idx=35proposed=[94016, 1358, 3671, 93956, 94405] found_in_context=True
    INFO     2026-05-10 15:48:31,146 726574 ngram.py[line:84] [NGRAM][E2E] call=34 bid=0 step_idx=45proposed=[3099, 23, 283, 44055, 934] found_in_context=True
    INFO     2026-05-10 15:48:31,147 726574 ngram.py[line:84] [NGRAM][E2E] call=34 bid=1 step_idx=36proposed=[93956, 73776, 93956, 94112, 96674] found_in_context=True
    INFO     2026-05-10 15:48:31,154 726574 ngram.py[line:84] [NGRAM][E2E] call=35 bid=0 step_idx=46proposed=[16816, 12199, 93919, 94667, 748] found_in_context=True
    INFO     2026-05-10 15:48:31,160 726574 ngram.py[line:84] [NGRAM][E2E] call=37 bid=0 step_idx=48proposed=[93938, 4816, 93938, 10714, 93948] found_in_context=True
    INFO     2026-05-10 15:48:31,161 726574 ngram.py[line:84] [NGRAM][E2E] call=37 bid=1 step_idx=40proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
    INFO     2026-05-10 15:48:31,166 726574 ngram.py[line:84] [NGRAM][E2E] call=38 bid=0 step_idx=49proposed=[16816, 12199, 93919, 94667, 748] found_in_context=True
    INFO     2026-05-10 15:48:31,173 726574 ngram.py[line:84] [NGRAM][E2E] call=40 bid=1 step_idx=43proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
    INFO     2026-05-10 15:48:31,176 726574 ngram.py[line:84] [NGRAM][E2E] call=41 bid=0 step_idx=52proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
    INFO     2026-05-10 15:48:31,176 726574 ngram.py[line:84] [NGRAM][E2E] call=41 bid=1 step_idx=44proposed=[94822, 93956, 97249] found_in_context=True
    INFO     2026-05-10 15:48:31,179 726574 ngram.py[line:84] [NGRAM][E2E] call=42 bid=0 step_idx=53proposed=[3099, 23, 283, 44055, 934] found_in_context=True
    INFO     2026-05-10 15:48:31,183 726574 ngram.py[line:84] [NGRAM][E2E] call=43 bid=0 step_idx=54proposed=[920, 853, 93963, 37993, 28685] found_in_context=True
    INFO     2026-05-10 15:48:31,183 726574 ngram.py[line:84] [NGRAM][E2E] call=43 bid=1 step_idx=47proposed=[3671, 94312, 94035, 14045, 93956] found_in_context=True
    INFO     2026-05-10 15:48:31,186 726574 ngram.py[line:84] [NGRAM][E2E] call=44 bid=0 step_idx=56proposed=[93938, 10714, 93948, 23, 5] found_in_context=True
    INFO     2026-05-10 15:48:31,189 726574 ngram.py[line:84] [NGRAM][E2E] call=45 bid=0 step_idx=57proposed=[16816, 12199, 93919, 94667, 748] found_in_context=True
    INFO     2026-05-10 15:48:31,190 726574 ngram.py[line:84] [NGRAM][E2E] call=45 bid=1 step_idx=49proposed=[42854, 94035, 3991, 93956, 20932] found_in_context=True
    INFO     2026-05-10 15:48:31,193 726574 ngram.py[line:84] [NGRAM][E2E] call=46 bid=0 step_idx=58proposed=[28685, 23, 283, 93963, 920] found_in_context=True
    INFO     2026-05-10 15:48:31,200 726574 ngram.py[line:84] [NGRAM][E2E] call=48 bid=0 step_idx=60proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
    INFO     2026-05-10 15:48:31,200 726574 ngram.py[line:84] [NGRAM][E2E] call=48 bid=1 step_idx=52proposed=[23, 3671, 94312, 94035, 14045] found_in_context=True
    INFO     2026-05-10 15:48:31,204 726574 ngram.py[line:84] [NGRAM][E2E] call=49 bid=0 step_idx=61proposed=[3099, 23, 283, 44055, 934] found_in_context=True
    INFO     2026-05-10 15:48:31,204 726574 ngram.py[line:84] [NGRAM][E2E] call=49 bid=1 step_idx=53proposed=[94035, 10985] found_in_context=True

    After fixing the ngram address-computation bug in the kernel, the log output shows that matches are found, and that after verify multiple tokens can be accepted in a single decode step, e.g.:
    [NGRAM-DEBUG] call=22 slt=[6, 6] step_idx=[[25], [24], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[192, 77, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=23 slt=[6, 6] step_idx=[[31], [25], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[198, 78, 0, 0, 0, 0, 0, 0]
    Between two consecutive proposer.run() calls, step_idx[0] increases from 25 to 31 and seq_len_decoder[0] from 192 to 198, i.e. all 5 draft tokens plus the target token were accepted in one decode step.
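The lookup that produces the `proposed=[...] found_in_context=True` lines above can be sketched in pure Python. This is illustrative only; names such as `ngram_propose`, `ngram_size`, and `max_draft_tokens` are assumptions, not the kernel's actual signature:

```python
def ngram_propose(token_ids, ngram_size=3, max_draft_tokens=5):
    """Sketch of lookup-based ngram proposal: take the last `ngram_size`
    tokens as a pattern, search backwards for an earlier occurrence of it
    in the context, and propose the tokens that followed that occurrence
    as draft tokens."""
    if len(token_ids) < ngram_size:
        return []
    pattern = token_ids[-ngram_size:]
    # Scan from the most recent candidate position backwards,
    # excluding the trailing pattern itself.
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == pattern:
            return token_ids[start + ngram_size:start + ngram_size + max_draft_tokens]
    return []
```

On repetitive output (the structured-list scenario in the test script) the trailing ngram recurs often, which is why the hit rate in the logs is high.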

  4. CUDAGraph adaptation

    1. proposer.run() executes in gpu_runner._postprocess(), which is not captured by CUDAGraph
    2. Verifying draft tokens feeds several tokens in at once, which changes the expected_decode_len and batch_size recorded when decode was captured, so the shapes that may occur have to be captured in advance during the gpu worker's warmup; this requires changes to gpu_runner.capture_model() and the corresponding parts of FDConfig
    3. The FDConfig in the test script already has CUDAGraph enabled by default
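Point 2 above amounts to enumerating, ahead of time, every (batch_size, decode_len) shape a speculative decode step can feed the graph. A rough sketch under assumed names (`spec_capture_shapes` is made up for illustration, not FastDeploy's actual capture API):

```python
def spec_capture_shapes(batch_sizes, num_speculative_tokens):
    """Sketch: with speculation enabled, one decode step may feed up to
    1 target token + num_speculative_tokens draft tokens per request,
    so the warmup must capture every decode_len from 1 to that maximum
    for each candidate batch size."""
    shapes = []
    for bs in batch_sizes:
        for decode_len in range(1, num_speculative_tokens + 2):
            shapes.append((bs, decode_len))
    return shapes
```

With num_speculative_tokens=5 as in the test script, each batch size needs decode lengths 1 through 6 captured.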
  5. Overlap Schedule adaptation

    1. input_ids_cpu is initialized in input_batch.py without pin_memory set, so it does not take part in the overlap
    2. With enable_overlap_schedule=True in the test script, the log still shows correct matches and decode steps in which multiple tokens were accepted
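The multi-token acceptance seen in the logs can be illustrated with a greedy verify sketch. This is purely illustrative; the real verify kernel also handles sampling-based acceptance and bonus-token bookkeeping:

```python
def greedy_verify(draft_tokens, target_tokens):
    """Accept draft tokens left to right while they agree with the target
    model's prediction at that position; the first mismatch is replaced by
    the target's token and acceptance stops. If every draft matches, the
    target model contributes one extra (bonus) token, so a single decode
    step can advance step_idx by len(draft_tokens) + 1."""
    accepted = []
    for i, draft in enumerate(draft_tokens):
        if draft == target_tokens[i]:
            accepted.append(draft)
        else:
            accepted.append(target_tokens[i])  # correction token ends acceptance
            break
    else:
        accepted.append(target_tokens[len(draft_tokens)])  # bonus token
    return accepted
```

This is consistent with the NGRAM-DEBUG trace above, where step_idx[0] advanced by 6 in one step: 5 accepted drafts plus the bonus token.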

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot
paddle-bot Bot commented May 11, 2026

Thanks for your contribution!

@paddle-bot paddle-bot Bot added the contributor External developers label May 11, 2026

@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-11 16:27:40

📋 Review Summary

PR overview: refactors the ngram kernel signature, dropping the redundant input_ids/input_ids_len parameters so that token_ids_all alone carries both the prompt and the generated tokens, and fixes the ngram offset bug (cur_step_idx + 1 - ngram_size → cur_step_idx - ngram_size).
Scope of changes: custom_ops/gpu_ops/speculate_decoding/, fastdeploy/spec_decode/ngram.py, fastdeploy/worker/gpu_model_runner.py, fastdeploy/config.py, tests/spec_decode/
Impact tags: [Speculative Decoding] [OP] [FDConfig]

📝 PR Convention Check

The title contains [Speculative Decoding] (an official tag) ✓; the PR body contains all the required sections Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist ✓; no changes suggested.

Issues

Level File Summary
🟡 Suggestion fastdeploy/worker/gpu_model_runner.py:2100 NGRAM added to CUDAGraph capture; confirm whether the other hardware runners are updated in sync
❓ Question tests/spec_decode/test_ngram_gpu_kernel.py:285 _make_mixed_test_data keeps the old step_idx semantics (gen_len-1), inconsistent with the new semantics in _make_ngram_test_data (gen_len); confirm whether this is intentional

Overall assessment

This PR simplifies the ngram kernel interface from taking separate input_ids/input_ids_len to relying entirely on token_ids_all, and fixes the ngram offset bug at the same time. The logic is clear and correct, and the end-to-end validation is thorough (detailed VERIFY-WRITE/READ and E2E match logs). The main remaining points are keeping the other hardware runners in sync and making the test-data semantics consistent; there are no blocking issues.

elif self.speculative_decoding and self.spec_method in [
    SpecMethod.MTP,
    SpecMethod.SUFFIX,
    SpecMethod.NGRAM,

🟡 Suggestion: sync check for other hardware runners

capture_model() in gpu_model_runner.py now includes SpecMethod.NGRAM. Following the A6 multi-hardware sync principle, if capture_model() in files such as dcu_model_runner.py or iluvatar_model_runner.py also contains a branch like self.spec_method in [SpecMethod.MTP, SpecMethod.SUFFIX], SpecMethod.NGRAM should be added there as well.

Please confirm whether the other hardware runners support speculative decoding, and if so, apply the same change.

pre_ids[b, :gen_len] = input_ids[b, src : src + gen_len]
# step_idx = last valid position (0-based index)
# step_idx = last valid position (0-based index), matches hybrid kernel semantics
step_idx[b] = gen_len - 1

❓ Question: step_idx semantics in _make_mixed_test_data differ from _make_ngram_test_data

This PR changes step_idx in _make_ngram_test_data from gen_len - 1 (the 0-based last valid position) to gen_len (the number of generated tokens), and changes the kernel offset formula from cur_step_idx + 1 - ngram_size to cur_step_idx - ngram_size accordingly.

However, this function, _make_mixed_test_data, still keeps step_idx[b] = gen_len - 1, with a comment saying it matches hybrid kernel semantics. Please confirm:

  1. Does _make_mixed_test_data correspond to a separate kernel path not touched by this PR (i.e. a kernel with the old semantics)?
  2. If it targets the same ngram_match kernel, the step_idx value should also be changed to gen_len; otherwise the test data disagrees with the kernel semantics and the CPU reference implementation may not match the GPU kernel's results.
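For reference, the two conventions address the same ngram window, which a quick sanity check makes explicit (the helper name `window_start` is made up for illustration):

```python
def window_start(step_idx, ngram_size, new_semantics=True):
    """New semantics: step_idx == gen_len, start = step_idx - ngram_size.
    Old semantics: step_idx == gen_len - 1, start = step_idx + 1 - ngram_size.
    Both point at the first token of the trailing ngram."""
    if new_semantics:
        return step_idx - ngram_size
    return step_idx + 1 - ngram_size

gen_len, ngram_size = 10, 3
assert window_start(gen_len, ngram_size) == window_start(gen_len - 1, ngram_size, new_semantics=False)
```

So the bug is not in either formula in isolation but in mixing a formula with the other convention's step_idx value, which is exactly the mismatch risk if the two test-data helpers drive the same kernel.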

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-11 16:53:54

The CI report is generated from the code below (refreshed every 30 minutes):


1 Task Overview

⚠️ 1 required task failed, 1 required task is running, and 1 required task is pending; the failed required task must be resolved before the PR can be merged.

Total runs (reruns) Total tasks ✅ Passed ❌ Failed ⏳ Running ⏸️ Pending Skipped
36(0) 36 28 3 2 2 1

2 Task Status Summary

2.1 Required tasks: 7/10 passed

Required tasks block merging; failures must be handled first.

Status Task Duration Root cause Fix suggestion Log Rerun
❌ Approval 11s PR issue: the change touches a protected directory and needs approval from a designated RD; please have @freeliuzc or @Deleter-D approve this PR Job -
⏳ Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage - running - Job -
⏸️ xpu_4cards_case_test / run_xpu_4cards_cases - pending - - -
The remaining 7 required tasks passed - - - - -

2.2 Optional tasks: 21/26 passed

Optional tasks do not block merging; failures are informational only.

Status Task Duration Log Rerun
Check PR Template 11s Job -
Trigger Jenkins for PR (CI_METAX) 23m36s Job -
Run iluvatar Tests / run_iluvatar_cases - Job -
⏸️ CI_HPU - - -
The remaining 21 optional tasks passed - - -

3 Failure Details (required only)

Approval — process approval (confidence: high)

Approval

  • Status: ❌ failed
  • Error type: process approval
  • Confidence: high
  • Root-cause summary: the PR modifies the protected spec_decode directory and lacks approval from a designated RD member
  • Analyzer: generic analysis (fallback)

Root-cause details:
This PR modifies files under the protected directories fastdeploy/spec_decode and/or custom_ops/gpu_ops/speculate_decoding. According to the repository's approval rules, such changes must be approved by at least one of the designated FastDeploy RD members (freeliuzc(liuzichang01) or Deleter-D(wangyanpeng04)) before the CI approval check passes and the PR can be merged.

Key log:

0. You must have one FastDeploy RD (freeliuzc(liuzichang01), Deleter-D(wangyanpeng04)) approval for modifing [fastdeploy/spec_decode,custom_ops/gpu_ops/speculate_decoding].
There are 1 approved errors.
##[error]Process completed with exit code 6.

Fix suggestion:

  1. Have @freeliuzc or @Deleter-D review this PR and click Approve; CI will then rerun and pass automatically.

Fix suggestion summary: please have @freeliuzc or @Deleter-D approve this PR

Related changes: the PR modifies files under the fastdeploy/spec_decode and/or custom_ops/gpu_ops/speculate_decoding directories
Link: View log

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@d70f33d). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7774   +/-   ##
==========================================
  Coverage           ?   71.66%           
==========================================
  Files              ?      396           
  Lines              ?    55706           
  Branches           ?     8712           
==========================================
  Hits               ?    39921           
  Misses             ?    13040           
  Partials           ?     2745           
Flag Coverage Δ
GPU 71.66% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


Labels

contributor External developers

3 participants