
whisper : add --seg-len-hint to discourage progressively shorter segments#3742

Open
lizthegrey wants to merge 2 commits into ggml-org:master from lizthegrey:lizf.seg-len-hint

Conversation


@lizthegrey lizthegrey commented Apr 4, 2026

Summary

When processing very long audio (multi-hour streams, podcasts, etc.) — particularly content with run-on sentences and few natural pauses — whisper tends to produce progressively shorter segments. This happens because timestamp tokens accumulate in the decoder's rolling prompt context (prompt_past), conditioning the model to insert segment breaks more frequently. Over time this creates a feedback loop where short segments beget shorter segments, eventually degrading to one word per line.

This PR adds a seg_len_hint parameter (in milliseconds) that thins timestamp tokens in the rolling prompt context, keeping at most one per seg_len_hint interval. Text tokens are always preserved for continuity. The model can still break segments on natural boundaries (speaker turns, pauses) — the hint only affects context conditioning, not actual segment creation. Short segments are still produced where genuinely appropriate.

  • New field seg_len_hint in whisper_full_params (default 0 = off)
  • CLI flag: --seg-len-hint N / -slh N
  • Does not affect --max-len 1 word-level timestamp mode
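The thinning pass described above can be sketched roughly as follows. This is a minimal illustrative sketch, not the exact code in this PR: the function name `thin_timestamps`, the `ts_beg` token-id boundary, and the 20 ms-per-tick timestamp encoding are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: drop timestamp tokens from the rolling prompt context so that
// at most one survives per seg_len_hint_ms interval. Text tokens (ids
// below ts_beg) are always preserved for continuity.
std::vector<int32_t> thin_timestamps(const std::vector<int32_t> & prompt,
                                     int32_t ts_beg,          // first timestamp token id (assumed layout)
                                     int64_t seg_len_hint_ms) {
    std::vector<int32_t> out;
    int64_t last_kept_ms = -seg_len_hint_ms; // ensures the first timestamp is kept
    for (const int32_t tok : prompt) {
        if (tok < ts_beg) {
            out.push_back(tok); // text token: always keep
            continue;
        }
        // assumed encoding: each timestamp token is (id - ts_beg) * 20 ms
        const int64_t t_ms = int64_t(tok - ts_beg) * 20;
        if (seg_len_hint_ms <= 0 || t_ms - last_kept_ms >= seg_len_hint_ms) {
            out.push_back(tok); // keep at most one timestamp per hint interval
            last_kept_ms = t_ms;
        }
    }
    return out;
}
```

With seg_len_hint at 0 (the default) every token passes through unchanged, which is why the flag is opt-in and a no-op out of the box.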

Alternatives considered

Post-processing merge of short segments: After decoding completes, merge adjacent short segments until they reach a minimum character count, flushing on punctuation/time gaps/speaker turns. Discarded because it does not address the root cause — the decoder still wastes inference cycles producing one-word segments, and any merging heuristic either prevents genuinely short segments from appearing (e.g. rhetorical pauses) or requires complex rules about when merging is appropriate.

Enforcing a minimum segment length: Suppress timestamp tokens during decoding if the current segment is too short. Discarded for similar reasons — it fights the model's output rather than fixing the conditioning that causes the problem, and it prevents the model from producing short segments where the audio genuinely warrants them.

The approach taken here (thinning timestamps in the prompt context) addresses the feedback loop at its source: the model no longer sees a dense history of frequent segment breaks, so it stops being primed to produce more of them.

Test plan

  • --seg-len-hint 2000 on long audio — segments stay at natural clause/sentence length
  • --seg-len-hint 0 (default) — no behavior change
  • --max-len 1 — word-level timestamps still work correctly
  • Short audio (JFK sample) — no change in output
  • MLK "I Have a Dream" (16 min, archive.org) — no regression; rhetorical short phrases like "Go back to Mississippi" and "We cannot turn back" correctly remain as short segments where MLK pauses for effect

lizthegrey and others added 2 commits April 3, 2026 21:29
…ents

When processing long audio, whisper tends to produce progressively
shorter segments because timestamp tokens in the decoder prompt context
condition the model to insert more frequent segment breaks.

Add a seg_len_hint parameter (in ms) that thins timestamp tokens in
the rolling prompt context, keeping at most one per seg_len_hint
interval. This breaks the feedback loop while preserving text tokens
for continuity. The model can still break on natural boundaries
(speaker turns, pauses) — the hint only affects context conditioning,
not the actual segment creation.

Usage: --seg-len-hint 2000 (for ~2 second target segments)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The initial --seg-len-hint commit wired the flag into whisper-cli but not
whisper-server. This commit mirrors the existing best_of / beam_size pattern
at server.cpp:221-222 (CLI) and :505-511 (POST form field) and assigns the
value to wparams.seg_len_hint during inference setup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>