Problem
whisper.cpp's whisper_vad_detect_speech resets LSTM hidden/cell states on every call (ggml_backend_buffer_clear(vctx->buffer, 0) at whisper.cpp:5131). This is by design for whisper.cpp's one-shot file processing use case, but means our streaming usage (calling it repeatedly with 512-sample chunks) loses temporal context between calls.
The upstream Silero VAD model is designed to be stateful — LSTM state should carry across 512-sample chunks, just like TEN-VAD carries state across 256-sample hops.
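To make the loss of context concrete, here is a minimal toy sketch. It is not whisper.cpp code: toy_state, toy_chunk_prob and toy_run are hypothetical names, and a one-value leaky integrator stands in for the LSTM. It only shows that zeroing recurrent state before each 512-sample chunk caps the output at whatever a single chunk can build up.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CHUNK 512  /* chunk size from the text; everything else is made up */

/* Hypothetical one-value recurrent state, standing in for the LSTM h/c tensors. */
typedef struct { float h; } toy_state;

/* Feed one chunk through a leaky integrator and return the updated state. */
float toy_chunk_prob(toy_state *st, const float *x, size_t n) {
    assert(st != NULL && x != NULL);
    for (size_t i = 0; i < n; i++) {
        float mag = x[i] < 0.0f ? -x[i] : x[i];
        st->h = 0.999f * st->h + 0.001f * mag;
    }
    return st->h;
}

/* Run n_chunks chunks of constant-amplitude input; optionally zero the state
 * before every chunk, mimicking a per-call buffer clear. */
float toy_run(size_t n_chunks, int reset_each_call) {
    float chunk[TOY_CHUNK];
    for (size_t i = 0; i < TOY_CHUNK; i++) chunk[i] = 1.0f;

    toy_state st = { 0.0f };
    float out = 0.0f;
    for (size_t c = 0; c < n_chunks; c++) {
        if (reset_each_call) st.h = 0.0f;
        out = toy_chunk_prob(&st, chunk, TOY_CHUNK);
    }
    return out;
}
```

With four chunks of steady input, toy_run(4, 0) keeps climbing across chunks, while toy_run(4, 1) returns the same value as a single chunk would. The real model is far more complex, so this illustrates the mechanism only, not the size of the effect.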
Proposed change
Since we fork whisper.cpp, add a non-breaking way to skip the buffer clear:
- Option A: Add a bool reset_state parameter or flag to whisper_vad_detect_speech
- Option B: Add a separate whisper_vad_reset_state() function and remove the auto-reset from detect_speech
- Option C: Just remove the ggml_backend_buffer_clear line and let callers explicitly reset via a new function when needed
Also need to make VadBackend.reset() for Silero call the new reset function (currently a no-op).
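A rough sketch of the Option B shape, under stated assumptions: sketch_vad_ctx, sketch_vad_reset_state and sketch_vad_detect are hypothetical stand-ins, not whisper.cpp's actual types or signatures, and the state update is a placeholder rather than an LSTM. The point is only the split of responsibilities: the per-chunk call leaves state alone, and a separate function clears it.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_STATE_LEN 4  /* arbitrary; the real state size is not given here */

/* Hypothetical context: h and c stand in for the LSTM hidden/cell buffers. */
typedef struct {
    float h[SKETCH_STATE_LEN];
    float c[SKETCH_STATE_LEN];
} sketch_vad_ctx;

/* Explicit reset, the role whisper_vad_reset_state() would play in Option B. */
void sketch_vad_reset_state(sketch_vad_ctx *ctx) {
    assert(ctx != NULL);
    memset(ctx->h, 0, sizeof ctx->h);
    memset(ctx->c, 0, sizeof ctx->c);
}

/* Per-chunk call that no longer clears state. The update rule is a placeholder. */
float sketch_vad_detect(sketch_vad_ctx *ctx, const float *samples, int n) {
    assert(ctx != NULL && samples != NULL);
    float energy = 0.0f;
    for (int i = 0; i < n; i++) energy += samples[i] * samples[i];
    for (int k = 0; k < SKETCH_STATE_LEN; k++) {
        ctx->c[k] = 0.5f * ctx->c[k] + 0.5f * energy;
        ctx->h[k] = ctx->c[k] / (1.0f + ctx->c[k]);
    }
    return ctx->h[0];
}
```

In this shape, VadBackend.reset() for Silero would map onto the explicit reset call, for example at the start of a new stream, and the per-chunk path would stay a single detect call.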
Impact
- No change to our per-chunk calling code: SileroVad.chunkProbS16 already makes one call per chunk
- Probabilities should improve with temporal context
- Likely improves Silero's accuracy on our regression tests
- No upstream issue/PR exists for this (checked Feb 2026)
Priority
Low — TEN-VAD is our default and already stateful. This only affects --vad silero.