whisper : add --seg-len-hint to discourage progressively shorter segments #3742
Open
lizthegrey wants to merge 2 commits into ggml-org:master from
Conversation
whisper : add --seg-len-hint to discourage progressively shorter segments

When processing long audio, whisper tends to produce progressively shorter segments because timestamp tokens in the decoder prompt context condition the model to insert more frequent segment breaks.

Add a seg_len_hint parameter (in ms) that thins timestamp tokens in the rolling prompt context, keeping at most one per seg_len_hint interval. This breaks the feedback loop while preserving text tokens for continuity. The model can still break on natural boundaries (speaker turns, pauses) — the hint only affects context conditioning, not the actual segment creation.

Usage: --seg-len-hint 2000 (for ~2 second target segments)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
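The thinning described above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code: the timestamp-token layout (a TS_BEGIN id with 20 ms steps) mirrors whisper's vocabulary convention, but the constants, function name, and exact keep/drop policy here are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed constants, mirroring whisper's timestamp token layout (illustrative).
constexpr int32_t TS_BEGIN   = 50364; // first timestamp token id
constexpr int32_t TS_STEP_MS = 20;    // each timestamp token advances 20 ms

// Keep all text tokens; among timestamp tokens, keep at most one per
// seg_len_hint_ms interval. seg_len_hint_ms == 0 disables thinning.
std::vector<int32_t> thin_timestamps(const std::vector<int32_t> & prompt,
                                     int seg_len_hint_ms) {
    if (seg_len_hint_ms <= 0) {
        return prompt;
    }
    std::vector<int32_t> out;
    int64_t next_keep_ms = 0;
    for (int32_t tok : prompt) {
        if (tok < TS_BEGIN) {
            out.push_back(tok); // text token: always preserved for continuity
            continue;
        }
        const int64_t t_ms = int64_t(tok - TS_BEGIN) * TS_STEP_MS;
        if (t_ms >= next_keep_ms) {
            out.push_back(tok); // first timestamp seen in this interval
            next_keep_ms = t_ms + seg_len_hint_ms;
        }
        // later timestamps inside the interval are dropped from the context
    }
    return out;
}
```

With a 2000 ms hint, a context carrying timestamps every few hundred milliseconds collapses to roughly one timestamp per two seconds, so the decoder no longer sees a dense history of segment breaks.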
The initial --seg-len-hint commit wired the flag into whisper-cli but not whisper-server. This commit mirrors the existing best_of / beam_size pattern at server.cpp:221-222 (CLI flag) and :505-511 (POST form field), and assigns the value to wparams.seg_len_hint during inference setup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
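For illustration, the mirrored form-field handling might look like the sketch below, with a std::map standing in for the HTTP library's multipart form fields. The struct and function names here are invented for the example and are not whisper-server's actual API.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for the subset of request parameters this sketch touches.
struct whisper_params_sketch {
    int seg_len_hint = 0; // default 0 = thinning off, per the PR
};

// Mirror of the best_of / beam_size pattern: if the form field is present,
// parse it as an integer and store it in the request parameters.
void apply_form_fields(const std::map<std::string, std::string> & form,
                       whisper_params_sketch & params) {
    auto it = form.find("seg_len_hint");
    if (it != form.end()) {
        params.seg_len_hint = std::stoi(it->second);
    }
}
```

A client would then pass the field the same way as best_of, e.g. an extra multipart form field seg_len_hint=2000 on the POST request.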
Summary
When processing very long audio (multi-hour streams, podcasts, etc.) — particularly content with run-on sentences and few natural pauses — whisper tends to produce progressively shorter segments. This happens because timestamp tokens accumulate in the decoder's rolling prompt context (prompt_past), conditioning the model to insert segment breaks more frequently. Over time this creates a feedback loop where short segments beget shorter segments, eventually degrading to one word per line.

This PR adds a seg_len_hint parameter (in milliseconds) that thins timestamp tokens in the rolling prompt context, keeping at most one per seg_len_hint interval. Text tokens are always preserved for continuity. The model can still break segments on natural boundaries (speaker turns, pauses) — the hint only affects context conditioning, not actual segment creation. Short segments are still produced where genuinely appropriate.

- New seg_len_hint field in whisper_full_params (default 0 = off)
- CLI flag --seg-len-hint N / -slh N
- Compatible with --max-len 1 word-level timestamp mode

Alternatives considered
Post-processing merge of short segments: After decoding completes, merge adjacent short segments until they reach a minimum character count, flushing on punctuation/time gaps/speaker turns. Discarded because it does not address the root cause — the decoder still wastes inference cycles producing one-word segments, and any merging heuristic either prevents genuinely short segments from appearing (e.g. rhetorical pauses) or requires complex rules about when merging is appropriate.
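For concreteness, the discarded merge heuristic might have looked something like this sketch (all names and thresholds are hypothetical). Even this minimal version already needs a gap threshold and a length cutoff, and it still cannot distinguish a genuinely short segment from decoder degeneration:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct segment {
    int64_t t0_ms;
    int64_t t1_ms;
    std::string text;
};

// Merge each segment into its predecessor while the predecessor is still
// shorter than min_chars, flushing early when a time gap exceeds max_gap_ms.
std::vector<segment> merge_short_segments(const std::vector<segment> & in,
                                          size_t min_chars, int64_t max_gap_ms) {
    std::vector<segment> out;
    for (const segment & s : in) {
        const bool can_merge = !out.empty()
            && out.back().text.size() < min_chars        // previous still short
            && s.t0_ms - out.back().t1_ms <= max_gap_ms; // no long pause between
        if (can_merge) {
            out.back().text  += s.text;
            out.back().t1_ms  = s.t1_ms;
        } else {
            out.push_back(s);
        }
    }
    return out;
}
```

Note that the one-word segments have already been decoded by the time this runs, which is exactly the wasted-inference objection raised above.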
Enforcing a minimum segment length: Suppress timestamp tokens during decoding if the current segment is too short. Discarded for similar reasons — it fights the model's output rather than fixing the conditioning that causes the problem, and it prevents the model from producing short segments where the audio genuinely warrants them.
The approach taken here (thinning timestamps in the prompt context) addresses the feedback loop at its source: the model no longer sees a dense history of frequent segment breaks, so it stops being primed to produce more of them.
Test plan
- --seg-len-hint 2000 on long audio — segments stay at natural clause/sentence length
- --seg-len-hint 0 (default) — no behavior change
- --max-len 1 — word-level timestamps still work correctly