One C++ binary, twenty-one ASR backends, zero Python dependencies.
CrispASR is a fork of whisper.cpp that extends it into a unified speech recognition tool called crispasr, backed by full ggml C++ runtimes for major open-weights ASR architectures. One build, one binary, one consistent CLI — pick the backend at the command line or let CrispASR auto-detect it from your GGUF file.
$ crispasr -m ggml-base.en.bin -f samples/jfk.wav # OpenAI Whisper
$ crispasr -m parakeet-tdt-0.6b.gguf -f samples/jfk.wav # NVIDIA Parakeet
$ crispasr -m canary-1b-v2.gguf -f samples/jfk.wav # NVIDIA Canary
$ crispasr -m voxtral-mini-3b-2507.gguf -f samples/jfk.wav # Mistral Voxtral
$ crispasr --backend qwen3 -m auto -f samples/jfk.wav # -m auto downloadsNo Python. No PyTorch. No separate per-model binary. No pip install. Just one C++ binary and a GGUF file.
| Project | What it does |
|---|---|
| CrispASR | This repo — C++ speech recognition engine. 11 backends, CLI + HTTP server + C-ABI + Python/Rust/Dart bindings. |
| CrisperWeaver | Cross-platform Flutter transcription app built on CrispASR. Desktop + mobile, all 10 backends, model browser with download queue, mic capture, SRT/VTT/JSON export, diarization, batch processing. Fully offline. |
| CrispEmbed | Text embedding engine via ggml — same philosophy as CrispASR but for retrieval. 10 architectures (XLM-R, Qwen3-Embed, Gemma3, ModernBERT, ...), dense + sparse + ColBERT + reranking. 9.5x faster than ONNX on CPU, GPU via CUDA/Metal/Vulkan. Python/Rust/Dart bindings. |
| Susurrus | Python ASR GUI with 9 backends (faster-whisper, mlx-whisper, voxtral, insanely-fast-whisper, ...). The Python counterpart to CrispASR's C++ approach. |
- Supported backends
- Feature matrix
- Install & build
- Quick start
- CLI reference
- Voice Activity Detection (VAD)
- Word-level timestamps via CTC alignment
- Output formats
- Auto-download (
-m auto) - Audio formats
- Architecture
- Adding a new backend
- HOWTO Quantize
- Branch state & roadmap
- GPU backend selection
- Debugging & profiling
- Credits
| Backend | Model | Architecture | Languages | License |
|---|---|---|---|---|
| whisper | ggml-base.en.bin and all OpenAI Whisper variants |
Encoder-decoder transformer | 99 | MIT |
| whisper | distil-whisper/distil-large-v3 |
Distilled Whisper: 32L encoder + 2L decoder (6.3x faster, 513 MB Q5_0) | English | MIT |
| parakeet | nvidia/parakeet-tdt-0.6b-v3 |
FastConformer + TDT | 25 EU (auto-detect) | CC-BY-4.0 |
| parakeet | nvidia/parakeet-tdt_ctc-0.6b-ja |
FastConformer-TDT-CTC, xscaling, 80 mels | Japanese | CC-BY-4.0 |
| canary | nvidia/canary-1b-v2 |
FastConformer + Transformer decoder | 25 EU (explicit -sl/-tl) |
CC-BY-4.0 |
| cohere | CohereLabs/cohere-transcribe-03-2026 |
Conformer + Transformer | 13 | Apache-2.0 |
| granite | ibm-granite/granite-speech-{3.2-8b,3.3-2b,3.3-8b,4.0-1b} |
Conformer + BLIP-2 Q-Former + Granite LLM (μP) | en fr de es pt ja | Apache-2.0 |
| fastconformer-ctc | nvidia/stt_en_fastconformer_ctc_large |
FastConformer + CTC (NeMo family, all sizes) | en | CC-BY-4.0 |
| voxtral | mistralai/Voxtral-Mini-3B-2507 |
Whisper encoder + Mistral 3B LLM, audio-token injection | 8 | Apache-2.0 |
| voxtral4b | mistralai/Voxtral-Mini-4B-Realtime-2602 |
Causal RoPE+SwiGLU encoder + 3.4B LLM with adaptive RMSNorm + sliding window | 13, realtime streaming | Apache-2.0 |
| qwen3 | Qwen/Qwen3-ASR-0.6B |
Whisper-style audio encoder + Qwen3 0.6B LLM | 30 + 22 Chinese dialects | Apache-2.0 |
| wav2vec2 | jonatasgrosman/wav2vec2-large-xlsr-53-english |
CNN + 24-layer transformer + CTC head (any Wav2Vec2ForCTC) | per-model (en, de, multilingual available) | Apache-2.0 |
| wav2vec2 | facebook/data2vec-audio-base-960h |
Data2Vec Audio: 5-layer pos_conv, post-norm (79 MB Q4_K) | English | Apache-2.0 |
| wav2vec2 | facebook/hubert-large-ls960-ft |
HuBERT Large: pre-norm, single pos_conv (212 MB Q4_K) | English | Apache-2.0 |
| glm-asr | zai-org/GLM-ASR-Nano-2512 |
Whisper encoder (partial RoPE) + 4-frame projector + Llama 1.5B LLM (GQA) | 17 (Mandarin, English, Cantonese, ...) | MIT |
| kyutai-stt | kyutai/stt-1b-en_fr |
Mimi neural audio codec (SEANet + 8L transformer + RVQ) + 16L causal LM (SwiGLU, RMSNorm) | en, fr | MIT |
| firered-asr | FireRedTeam/FireRedASR2-AED |
Conformer encoder + CTC + beam search decoder; also LID (120 languages via FireRedLID GGUF) | Mandarin, English, 20+ Chinese dialects | Apache-2.0 |
| moonshine | UsefulSensors/moonshine-{tiny,base} |
Conv stem + 6L transformer encoder + 6L decoder (288–416d, partial RoPE, SiLU); multilingual variants (ja, ko, zh, ar, vi, uk) | English + 6 langs | MIT |
| moonshine-streaming | UsefulSensors/moonshine-streaming-{tiny,small,medium} |
Streaming ASR: raw-waveform frontend + sliding-window encoder + autoregressive decoder (34–245M, designed for edge/low-latency) | English | MIT |
| gemma4-e2b | google/gemma-4-E2B-it |
USM Conformer encoder (12L, 1024d) + Gemma4 LLM decoder (35L, 1536d, GQA, PLE); 128-bin log-mel, 30s max | 140+ langs | Apache-2.0 |
| omniasr | facebook/omniASR-CTC-{300M,1B} |
wav2vec2-style CNN + 24–48L transformer + CTC head | 1600+ | Apache-2.0 |
| omniasr | omniASR-LLM-300M-v2 |
Same encoder + 12L LLaMA decoder (SwiGLU, RoPE); autoregressive, best quality | 1600+ | Apache-2.0 |
| vibevoice | microsoft/VibeVoice-ASR |
σ-VAE ConvNeXt encoders + Qwen2.5-7B decoder; timestamps, diarization, hotwords | 50+ | MIT |
Text-to-Speech models (TTS — --tts flag):
| Backend | Models | Architecture | Languages | License |
|---|---|---|---|---|
| vibevoice-tts | VibeVoice-Realtime-0.5B |
4L base + 20L TTS LM + DPM-Solver++ + σ-VAE decoder; pre-computed voice presets | en | MIT |
| vibevoice-tts | VibeVoice-1.5B |
28L Qwen2 LM + DPM-Solver++ + σ-VAE decoder; voice cloning from audio | en, zh | MIT |
| qwen3-tts | Qwen3-TTS-12Hz-0.6B-Base |
Qwen3 talker LM + 12 Hz RVQ speech tokenizer; baked voice pack GGUF or runtime WAV + --ref-text |
multilingual, per base model | Apache-2.0 |
Post-processing models (work with all backends):
| Model | Task | Architecture | Languages | License | HuggingFace |
|---|---|---|---|---|---|
| FireRedPunc | Punctuation restoration | BERT-base (12L, d=768), 5 classes | Chinese + English | Apache-2.0 | cstr/fireredpunc-GGUF |
| fullstop-punc | Punctuation restoration | XLM-RoBERTa-large (24L, d=1024), 6 classes | EN, DE, FR, IT | MIT | cstr/fullstop-punc-multilang-GGUF |
All runtimes share ggml-based inference. The speech-LLM backends (qwen3, voxtral, voxtral4b, granite, glm-asr, kyutai-stt) inject audio encoder frames directly into an autoregressive language model's input embeddings, instead of using a dedicated CTC/transducer/seq2seq decoder. The fastconformer-ctc backend hosts the NeMo FastConformer-CTC standalone ASR family (small through xxlarge, same architecture as the canary aligner) with greedy CTC decoding.
Run crispasr --list-backends to see it live. Each backend declares capabilities at runtime; if you ask for a feature the selected backend does not support, CrispASR prints a warning and silently ignores the flag.
| Feature | whisper | parakeet | canary | cohere | granite | voxtral | voxtral4b | qwen3 | fc-ctc | wav2vec2 | glm-asr | kyutai-stt | firered | moonshine | omniasr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Native timestamps | ✔ | ✔ | ✔ | ✔ | |||||||||||
| CTC timestamps | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |||||
| Word-level timing | ✔ | ✔ | ✔ | ✔ | -am |
-am |
-am |
-am |
-am |
-am |
-am |
-am |
-am |
-am |
|
| Per-token confidence | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |||||||
| Language auto-detect | ✔ | ✔ | LID | LID | LID | LID | LID | ✔ | LID | LID | ✔ | LID | LID | LID | LID |
| Speech translation | ✔ | ✔ | ✔ | ✔ | ✔ | ||||||||||
| Speaker diarization | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | all | all | all | all | all | all | all |
| Grammar (GBNF) | ✔ | ||||||||||||||
| Temperature sampling | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |||||
| Beam search | ✔ | ✔ | |||||||||||||
Best-of-N (--best-of) |
✔ | ✔ | ✔ | ✔ | ✔ | ||||||||||
| Flash attention | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ||
| KV cache | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | * | |||
| Punctuation toggle | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ||||||||
| Punc restoration | pp | pp | pp | pp | pp | pp | pp | pp | pp | pp | pp | pp | pp | pp | pp |
| Source / target language | ✔ | ✔ | ✔ | ✔ | |||||||||||
Audio Q&A (--ask) |
* | ✔ | * | ||||||||||||
Streaming (--stream/--mic/--live) |
✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Auto-download (-m auto) |
✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Key: ✔ = native/built-in, -am = via CTC forced aligner (-am canary-ctc-aligner.gguf or -am qwen3-forced-aligner.gguf), LID = via external language identification pre-step (-l auto), all = via --diarize post-step (not declared by backend but always available), pp = via --punc-model post-processor (FireRedPunc or fullstop-punc), * = omniasr-LLM has KV cache (CTC variant does not).
Speaker diarization is available for all backends as a post-processing step via --diarize:
--diarize-method energy/xcorr— stereo-only, no extra deps--diarize-method pyannote— native GGUF runtime (no Python, no sherpa-onnx). Pass--sherpa-segment-model pyannote-v3-seg.gguffor the pyannote v3 segmentation model. Falls back to sherpa subprocess for.onnxmodels.--diarize-method sherpa/ecapa— calls an externally-installed sherpa-onnx subprocess with speaker embedding models.--diarize-method vad-turns— mono-friendly, assigns speaker labels at gap boundaries
Language identification for backends without native LID: --lid-backend whisper (default, 75 MB ggml-tiny.bin), --lid-backend silero (native GGUF, 16 MB, 95 languages), or --lid-backend firered (FireRedLID, 1.7 GB, 120 languages — Conformer encoder + Transformer decoder).
Voice activity detection: --vad uses the default Silero VAD (~885 KB, auto-downloaded). Each VAD segment is transcribed independently, producing separate SRT/VTT entries with correct timestamps. Use --vad --split-on-punct for best subtitle output. Four VAD backends: Silero (default), FireRedVAD (-vm firered, recommended), MarbleNet (-vm marblenet, 439 KB, 6 languages), Whisper-VAD-EncDec (-vm whisper-vad, experimental).
Punctuation restoration (--punc-model): CTC-based backends output lowercase without punctuation. Add --punc-model fireredpunc-q8_0.gguf (or fullstop-punc-q4_k.gguf for DE/FR/IT) to restore punctuation and capitalization. See the post-processing table above for model details. Also available via Python/Rust/Dart wrappers (crispasr.PuncModel).
Which backends produce punctuation natively?
| Backend | Punctuation | Capitalization | Notes |
|---|---|---|---|
| whisper | ✔ | ✔ | Full punctuation and casing |
| parakeet | ✔ | ✔ | |
| canary | ✔ | ✔ | |
| cohere | ✔ | ✔ | Toggleable via --no-punctuation |
| granite | ✔ | ✔ | LLM output |
| voxtral | ✔ | ✔ | LLM output |
| voxtral4b | ✔ | ✔ | LLM output |
| qwen3 | ✔ | ✔ | LLM output |
| glm-asr | ✔ | ✔ | LLM output |
| kyutai-stt | ✔ | ✔ | LLM output |
| moonshine | ✔ | ✔ | Encoder-decoder output |
| fastconformer-ctc | no | no | CTC — add --punc-model |
| wav2vec2 | no | no | CTC — add --punc-model |
| firered-asr | no | no | CTC — add --punc-model |
| omniasr (CTC) | no | no | CTC — add --punc-model |
| omniasr (LLM) | ✔ | ✔ | Autoregressive decoder |
Other freely-licensed alternatives that could be added: felflare/bert-restore-punctuation (MIT, English, includes truecasing), xashru/punctuation-restoration (Apache-2.0, 40+ languages, BiLSTM-CRF).
Progressive subtitle output (--flush-after): By default, non-whisper backends buffer all segments and print output at the end. For real-time subtitle consumption (PotPlayer, custom media players), use --flush-after 1 to print each SRT entry to stdout immediately after its VAD segment is transcribed:
crispasr --backend parakeet -m parakeet.gguf --vad --flush-after 1 -osrt -f long_audio.wav
# SRT entries appear progressively as each segment finishesJSON output with language detection: When using -l auto -oj, the JSON output includes detected language info:
{
"crispasr": {
"backend": "cohere",
"language": "en",
"language_detected": "en",
"language_confidence": 0.977,
"language_source": "ecapa"
},
"transcription": [...]
}| Need | Pick |
|---|---|
| Battle-tested, all features exposed | whisper |
| Lowest English WER | cohere |
| Fastest (16x realtime on CPU) | moonshine (tiny), fc-ctc (10x) |
| Multilingual + word timestamps + fast | parakeet (2.9x RT) |
| Multilingual with explicit language control | canary |
| Speech translation (X→en or en→X) | canary, voxtral, qwen3 |
| 30 languages + Chinese dialects | qwen3 |
| 1600+ languages | omniasr (CTC or LLM) |
| Realtime streaming ASR (<500 ms latency) | voxtral4b |
| Highest-quality offline speech-LLM | voxtral |
| Apache-licensed speech-LLM | granite, voxtral, qwen3, omniasr-llm |
| Lightweight CTC-only (fast, no decoder) | wav2vec2, fc-ctc, data2vec, omniasr |
| Mandarin + Chinese dialects | firered-asr, qwen3, glm-asr |
Cohere, canary, granite, voxtral and voxtral4b need an explicit
language code up front. If you don't know the language, pass
-l auto and crispasr runs an optional LID pre-step before the main
transcribe() call:
# Downloads ggml-tiny.bin (75 MB, 99 languages) on first use
crispasr --backend cohere -m $TC/cohere-transcribe-q5_0.gguf \
-f unknown.wav -l auto
# crispasr[lid]: detected 'en' (p=0.977) via whisper-tiny
# crispasr: LID -> language = 'en' (whisper, p=0.977)These LID providers are available:
--lid-backend whisper(default) — uses a small multilingual ggml-*.bin model via the whisper.cpp C API. Auto-downloads ~75 MB on first use. 99 languages.--lid-backend silero— native GGUF port of Silero's 95-language classifier. 16 MB F32, pure C++. Faster and smaller than whisper-tiny but slightly less accurate on long audio (>20s).--lid-backend ecapa— recommended: ECAPA-TDNN (Apache-2.0). Purpose-built for language ID. Very high accuracy on TTS benchmark. Two variants via--lid-model:cstr/ecapa-lid-107-GGUF— VoxLingua107, 43 MB F16, 107 languages, ISO codes (en, de, ...). Default.cstr/ecapa-lid-commonlanguage-GGUF— CommonLanguage, 40 MB F16, 45 languages, full names (English, German, ...).
--lid-backend firered— FireRedLID (Conformer encoder + Transformer decoder). Q4_K (544 MB), 120 languages including Chinese dialects. Slower but covers more languages.
These VAD providers are available:
- Silero VAD (default) — ~885 KB, auto-downloaded via
--vad. Industry-standard, well-tested. - FireRedVAD — DFSMN-based, 2.4 MB, F1=97.57%. Pass
--vad -vm fireredto auto-download. Recommended. - MarbleNet — NVIDIA 1D separable CNN, 439 KB, 6 languages (EN/DE/FR/ES/RU/ZH). Pass
--vad -vm marblenetto auto-download. Smallest model. (cstr/marblenet-vad-GGUF) - Whisper-VAD-EncDec (experimental) — Whisper-base encoder + TransformerDecoder head, 22 MB Q4_K. Trained on Japanese ASMR; may not generalise well to all domains. Pass
--vad -vm whisper-vad. Slower than others (~1s vs ~50ms). (cstr/whisper-vad-encdec-asmr-GGUF)
Pass --lid-backend off to skip LID entirely.
- C++17 compiler (GCC 10+, Clang 12+, MSVC 19.30+)
- CMake 3.14+
- Optional:
libavformat/libavcodec/libavutil/libswresamplefor Opus/M4A ingestion (WHISPER_FFMPEG=ON) - Optional:
libopenblas/MKL/Accelerate — speeds up CPU-side matmuls for the Conformer-based encoders (parakeet, canary, cohere, granite, fastconformer-ctc). The ggml CPU backend picks up BLAS automatically when it's present at build time; no CrispASR configure flag needed. - Optional: CUDA/Metal/Vulkan GPU backend — enabled via ggml's standard flags (
GGML_CUDA=ON,GGML_METAL=ON, etc.). On CUDA you can setGGML_CUDA_ENABLE_UNIFIED_MEMORY=1to allow swapping to system RAM when VRAM is exhausted. curlorwgeton$PATHif you want to use-m autoauto-download- Optional:
sherpa-onnxbinaries on$PATHif you want--diarize-method sherpawith ONNX models
No Python, PyTorch, or pip required at runtime.
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc) --target crispasrThe produced binary is build/bin/crispasr with a build/bin/whisper-cli backward-compatibility alias.
Two batch scripts handle the Windows build without requiring a pre-opened Developer Command Prompt. They use vswhere.exe to locate Visual Studio 2022 automatically, call vcvars64.bat, then drive CMake + Ninja.
build-windows.batProduces build\bin\crispasr.exe. Extra CMake flags can be appended:
build-windows.bat -DWHISPER_CURL=ON :: enable libcurl fallback
build-windows.bat -DGGML_CUDA=ON :: NVIDIA GPU (CUDA must be installed)What it does:
- Locates
vswhere.exeunder%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\ - Finds the latest VS 2022 installation that includes the VC++ toolchain
- Calls
vcvars64.batto initialize the 64-bit MSVC environment - Runs
cmake -G Ninja -B build -DCMAKE_BUILD_TYPE=Release [extra flags] - Builds the
whisper-clitarget →build\bin\crispasr.exe
build-vulkan.batProduces build-vulkan\bin\crispasr.exe with the Vulkan compute backend enabled. In addition to the VS detection above, it:
- Checks
%VULKAN_SDK%. If unset, scansC:\VulkanSDK\for the newest installed version and setsVULKAN_SDKaccordingly. - Adds
-DGGML_VULKAN=ON -DGGML_CUDA=OFFso CUDA is not accidentally pulled in if the CUDA toolkit is also installed. - Writes the build into a separate
build-vulkan\directory so it coexists with a CPU build.
Important:
build-windows.bat -DGGML_CUDA=ONproduces a CUDA build, not a Vulkan build.--gpu-backend vulkanonly works if the binary was actually built with Vulkan support.- On hybrid laptops, Vulkan device
0may be the integrated GPU. Use-dev Nto pin the discrete GPU if needed.
:: Typical usage — VULKAN_SDK is picked up automatically
build-vulkan.bat
:: Override Vulkan SDK location explicitly
set VULKAN_SDK=C:\VulkanSDK\1.4.304.1
build-vulkan.bat
:: Run on Vulkan, pinned to GPU 1 (for example: NVIDIA on a hybrid laptop)
build-vulkan\bin\crispasr.exe --gpu-backend vulkan -dev 1 -m model.gguf -f audio.wavBoth scripts exit with a non-zero code and a [ERROR] message if any step fails (VS not found, CMake configure error, build error).
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON # NVIDIA
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_METAL=ON # Apple Silicon# Install ffmpeg dev libs first:
# apt install libavformat-dev libavcodec-dev libavutil-dev libswresample-dev
cmake -B build-ffmpeg -DCMAKE_BUILD_TYPE=Release -DWHISPER_FFMPEG=ON
cmake --build build-ffmpeg -j$(nproc) --target crispasrUpstream bug warning.
.m4a/.mp4/.webmcontainers currently crashwhisper.cpp's ffmpeg integration. For those formats, pre-convert to WAV:ffmpeg -i input.opus -ar 16000 -ac 1 -c:a pcm_s16le -y /tmp/audio.wavBare-codec
.opusfiles work fine withWHISPER_FFMPEG=ON.
# Download a whisper model (same as upstream whisper.cpp)
./models/download-ggml-model.sh base.en
./build/bin/crispasr -m models/ggml-base.en.bin -f samples/jfk.wav
# [00:00:00.000 --> 00:00:07.940] And so my fellow Americans ask not what your country can do for you
# [00:00:07.940 --> 00:00:10.760] ask what you can do for your country.# Grab the quantized model (~467 MB)
curl -L -o parakeet.gguf \
https://huggingface.co/cstr/parakeet-tdt-0.6b-v3-GGUF/resolve/main/parakeet-tdt-0.6b-v3-q4_k.gguf
./build/bin/crispasr -m parakeet.gguf -f samples/jfk.wav
# Auto-detected backend 'parakeet' from GGUF metadata.
# And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
# Word-level timestamps (one line per word)
./build/bin/crispasr -m parakeet.gguf -f samples/jfk.wav -ml 1# Transcription (source == target)
./build/bin/crispasr --backend canary -m canary-1b-v2-q5_0.gguf -f audio.de.wav -sl de -tl de
# Translation (German speech → English text)
./build/bin/crispasr --backend canary -m canary-1b-v2-q5_0.gguf -f audio.de.wav -sl de -tl en
# ...or use the familiar whisper-cli flag:
./build/bin/crispasr --backend canary -m canary-1b-v2-q5_0.gguf -f audio.de.wav -l de --translate# First run downloads ~2.5 GB to ~/.cache/crispasr/ via curl, then runs
./build/bin/crispasr --backend voxtral -m auto -f samples/jfk.wav
# Subsequent runs use the cached file
./build/bin/crispasr --backend voxtral -m auto -f samples/jfk.wav -l en./build/bin/crispasr --backend qwen3 -m auto -f audio.zh.wav# English (Q4_K quantized, 212 MB — 6x smaller than F16)
curl -L -o wav2vec2-en-q4k.gguf \
https://huggingface.co/cstr/wav2vec2-large-xlsr-53-english-GGUF/resolve/main/wav2vec2-xlsr-en-q4_k.gguf
./build/bin/crispasr -m wav2vec2-en-q4k.gguf -f samples/jfk.wav
# and so my fellow americans ask not what your country can do for you ask what you can do for your country
# German
curl -L -o wav2vec2-de-q4k.gguf \
https://huggingface.co/cstr/wav2vec2-large-xlsr-53-german-GGUF/resolve/main/wav2vec2-xlsr-de-q4_k.gguf
./build/bin/crispasr -m wav2vec2-de-q4k.gguf -f audio.de.wav
# Convert any HuggingFace Wav2Vec2ForCTC model:
python models/convert-wav2vec2-to-gguf.py \
--model-dir jonatasgrosman/wav2vec2-large-xlsr-53-german \
--output wav2vec2-de.gguf --dtype f32
# Then optionally quantize:
./build/bin/crispasr-quantize wav2vec2-de.gguf wav2vec2-de-q4k.gguf q4_k# Pipe audio from ffmpeg, sox, or any tool that outputs raw PCM:
ffmpeg -i audio.wav -f s16le -ar 16000 -ac 1 - | \
crispasr --stream -m model.gguf
# Live microphone transcription (auto-detects arecord/sox/ffmpeg):
crispasr --mic -m model.gguf
# Continuous live mode (prints each chunk as a new line, never stops):
crispasr --live -m model.gguf
# With progress monitor symbols (▶ processing, ✓ got text, · silence):
crispasr --live --monitor -m model.gguf
# Per-token confidence and alternative candidates:
crispasr -m model.gguf -f audio.wav --altStreaming works with all backends. The --stream-step (default 3s), --stream-length (default 10s), and --stream-keep (default 200ms overlap) flags control the sliding window.
CrispASR currently exposes two TTS backends through the unified CLI:
vibevoice-tts and qwen3-tts. Both write 24 kHz mono WAV via
--tts-output.
VibeVoice realtime (vibevoice-tts, preset voice GGUF):
./build/bin/crispasr \
--backend vibevoice-tts \
-m ~/.cache/crispasr/vibevoice-realtime-0.5b-q4_k.gguf \
--voice ~/.cache/crispasr/vibevoice-voice-emma.gguf \
--tts "Hello, how are you today?" \
--tts-output hello.wavQwen3-TTS (qwen3-tts, runtime WAV clone):
./build/bin/crispasr \
--backend qwen3-tts \
-m /Volumes/backups/ai/crispasr-models/qwen3-tts-0.6b-talker.gguf \
--codec-model /Volumes/backups/ai/crispasr-models/qwen3-tts-tokenizer-12hz.gguf \
--voice samples/qwen3_tts/clone.wav \
--ref-text "Okay, yeah. I resent you, I love you, I respect you. But you know what - You blew it, and thanks to you." \
--tts "Hello there" \
--tts-output hello.wavQwen3-TTS (qwen3-tts, baked voice-pack GGUF):
./build/bin/crispasr \
--backend qwen3-tts \
-m /Volumes/backups/ai/crispasr-models/qwen3-tts-0.6b-talker.gguf \
--codec-model /Volumes/backups/ai/crispasr-models/qwen3-tts-tokenizer-12hz.gguf \
--voice /tmp/qwen3-tts-voice-pack.gguf \
--tts "Hello there" \
--tts-output hello.wavNotes:
vibevoice-ttsuses--voicefor its voice prompt or preset. The realtime0.5Bflow is typically driven by a voice GGUF.qwen3-ttsalways needs the separate codec GGUF. Pass it explicitly with--codec-model, or placeqwen3-tts-tokenizer-12hz.ggufnext to the talker model so the CLI can auto-discover it.- When
qwen3-tts --voicepoints to a.wav,--ref-textis required. - When
qwen3-tts --voicepoints to a.gguf, it is treated as a baked voice pack and--ref-textis ignored. qwen3-ttsquantization is not quality-equivalent across variants. Treatf16talker +f16codec as the reference baseline.- Recommended quantized deployment today:
q8_0talker with thef16codec. - Lower-bit talker quants such as
q6_k,q5_k, andq4_kcan still produce usable audio, but they drift noticeably in intermediate tensor checks. Voice similarity, prompt adherence, prosody, and multilingual robustness may change. Use them only when memory matters more than strict fidelity. - Quantizing the tokenizer / codec hurts
qwen3-ttsearlier than talker-only quantization. Prefer keepingqwen3-tts-tokenizer-12hz.ggufatf16.
GGUF downloads: cstr/vibevoice-realtime-0.5b-GGUF, cstr/vibevoice-1.5b-GGUF, cstr/qwen3-tts-0.6b-base-GGUF, cstr/qwen3-tts-tokenizer-12hz-GGUF
# Start server with model loaded once
crispasr --server -m model.gguf --port 8080
# Transcribe via HTTP (model stays loaded between requests):
curl -F "file=@audio.wav" http://localhost:8080/inference
# Returns JSON: {"text": "...", "segments": [...], "backend": "parakeet", "duration": 11.0}
# Hot-swap to a different model at runtime:
curl -F "model=path/to/other-model.gguf" http://localhost:8080/load
# Check server status:
curl http://localhost:8080/health
# {"status": "ok", "backend": "parakeet"}
# List available backends:
curl http://localhost:8080/backends
# {"backends": ["whisper","parakeet","canary",...], "active": "parakeet"}The server loads the model once at startup and keeps it in memory. Subsequent /inference requests reuse the loaded model with no reload overhead. Requests are mutex-serialized. Use --host 0.0.0.0 to accept remote connections.
To require API keys, set the CRISPASR_API_KEYS env var (comma-separated). Do not pass keys as CLI arguments — they would be visible in ps/top. Protected endpoints accept either Authorization: Bearer <key> or X-API-Key: <key>. /health remains public for container health checks.
CRISPASR_API_KEYS=key-one,key-two crispasr --server -m model.gguf
curl -H "Authorization: Bearer key-one" \
-F "file=@audio.wav" \
http://localhost:8080/v1/audio/transcriptionsThe repo includes a root-level docker-compose.yml for running the persistent HTTP server against a mounted model directory.
cp .env.example .env
# Edit CRISPASR_MODEL to point at a file mounted under ./models
docker compose up --build
# Health check
curl http://localhost:8080/health
# OpenAI-compatible transcription API
curl http://localhost:8080/v1/audio/transcriptions \
-H "Authorization: Bearer $CRISPASR_API_KEY" \
-F "file=@audio.wav" \
-F "response_format=verbose_json"By default the compose stack:
- builds from
.devops/main.Dockerfile - mounts
./modelsinto/models - stores auto-downloaded models in the Docker-managed
crispasr-cachevolume at/cache - serves on
http://localhost:8080
If you want /cache to be a host directory instead, replace the crispasr-cache:/cache volume with ./cache:/cache and make it writable by the container user before startup:
mkdir -p cache models
sudo chown -R "$(id -u):$(id -g)" cache modelsYou can cap or raise build parallelism with CRISPASR_BUILD_JOBS:
docker compose build --build-arg CRISPASR_BUILD_JOBS=8For CUDA builds, use the override file:
docker compose -f docker-compose.yml -f docker-compose.cuda.yml up --buildWe publish two CUDA tags on ghcr.io/crispstrobe/crispasr. Pick the one that matches your host driver:
| Tag | CUDA | Min NVIDIA driver | Supported arches | Notes |
|---|---|---|---|---|
main-cuda |
13.0 | R535+ (R580+ for full features) | sm_75…sm_120 incl. RTX 50xx (Blackwell) | Default. Pull this on modern hosts. |
main-cuda-12 |
12.4 | R510+ | sm_75…sm_90 (RTX 20/30/40-series, Hopper) | Legacy compat — use on RHEL 7/8, older Ubuntu LTS, or any host that hasn't updated drivers in a while. RTX 50xx is not supported here. |
Quick check: nvidia-smi shows your driver version in the top-right. If it's R535 or higher, pull main-cuda. If it's R510–R534, pull main-cuda-12. If it's older than R510, update your driver — neither image will work.
docker pull ghcr.io/crispstrobe/crispasr:main-cuda # modern hosts
docker pull ghcr.io/crispstrobe/crispasr:main-cuda-12 # legacy driverThere is also a Gradio-based Hugging Face Space wrapper under hf-space/. It starts the CrispASR HTTP server inside the container and provides a small browser UI on top of the OpenAI-compatible transcription endpoint.
Build it locally with:
docker build -f hf-space/Dockerfile -t crispasr-hf-space .
docker run --rm -p 7860:7860 -p 8080:8080 \
-e CRISPASR_MODEL=/models/ggml-base.en.bin \
-v "$PWD/models:/models" \
crispasr-hf-spaceThe compose files default to local image tags (crispasr-local:*) so they don't depend on pulling a published registry image first.
You can override the loaded model and startup flags through .env:
CRISPASR_MODEL=/models/parakeet-tdt-0.6b-v2.ggufCRISPASR_BACKEND=parakeetCRISPASR_LANGUAGE=autoCRISPASR_AUTO_DOWNLOAD=1CRISPASR_CACHE_DIR=/cacheCRISPASR_API_KEYS=key-one,key-twoCRISPASR_EXTRA_ARGS=--no-punctuation
The service is configured to avoid serving as root by default:
user: "${CRISPASR_UID:-1000}:${CRISPASR_GID:-1000}"security_opt: ["no-new-privileges:true"]
The server exposes POST /v1/audio/transcriptions, a drop-in replacement for the OpenAI Whisper API. Any tool that speaks the OpenAI transcription protocol (LiteLLM, LangChain, custom clients) can point at CrispASR with zero code changes.
# Same curl syntax as the OpenAI API:
curl http://localhost:8080/v1/audio/transcriptions \
-H "Authorization: Bearer $CRISPASR_API_KEY" \
-F "file=@audio.wav" \
-F "response_format=json"
# {"text": "And so, my fellow Americans, ask not what your country can do for you..."}
# Verbose JSON with per-segment timestamps (matches OpenAI's format):
curl http://localhost:8080/v1/audio/transcriptions \
-H "Authorization: Bearer $CRISPASR_API_KEY" \
-F "file=@audio.wav" \
-F "response_format=verbose_json"
# {"task": "transcribe", "language": "en", "duration": 11.0, "text": "...", "segments": [...]}
# SRT subtitles:
curl http://localhost:8080/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "response_format=srt"
# Plain text:
curl http://localhost:8080/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "response_format=text"Supported form fields:
| Field | Description |
|---|---|
file |
Audio file (required) |
model |
Ignored (uses the loaded model) |
language |
ISO-639-1 code (default: server's -l setting) |
prompt |
Initial prompt / context |
response_format |
json (default), verbose_json, text, srt, vtt |
temperature |
Sampling temperature (default: 0.0) |
GET /v1/models returns an OpenAI-compatible model list with the currently loaded model.
crispasr extends upstream whisper-cli's argument set with a handful of backend-dispatch flags. Every historical whisper flag still works — when you don't pass --backend, whisper is the default.
| Flag | Meaning |
|---|---|
-m FNAME, --model FNAME |
Path to a model file, or auto to download a default for the selected backend |
--backend NAME |
Force a specific backend. Default: auto-detected from GGUF metadata + filename heuristics |
-f FNAME, --file FNAME |
Input audio (can repeat; also accepts positional filenames) |
-t N, --threads N |
Thread count (default: min(4, nproc)) |
-l LANG, --language LANG |
ISO-639-1 code (default: en) |
--list-backends |
Print the capability matrix and exit |
| Flag | Output |
|---|---|
-otxt |
Plain text to <audio>.txt |
-osrt |
SubRip (SRT) to <audio>.srt |
-ovtt |
WebVTT to <audio>.vtt |
-ocsv |
CSV (start, end, text) |
-oj, -ojf |
JSON (compact or full with word/token arrays) |
-olrc |
LRC lyrics format |
-of FNAME |
Output file base (no extension) |
-np |
No prints (suppress stderr progress) |
-pc |
Color-code output by token confidence (where supported) |
--no-timestamps |
Plain text only, no timing |
-ml N |
Max chars per display segment. 0=unlimited, 1=per-word, N=split at word boundaries |
-sp, --split-on-punct |
Split subtitle lines at sentence-ending punctuation (. ! ?). Creates readable subtitles even for CTC models that produce long segments |
| Flag | Meaning |
|---|---|
--vad |
Enable Silero VAD. Auto-downloads ggml-silero-v5.1.2.bin (~885 KB) to ~/.cache/crispasr/ on first use |
--vad-model FNAME |
Override the VAD model path (default: auto) |
-vt F |
VAD threshold (default 0.5) |
-vspd N |
VAD min speech duration (ms, default 250) |
-vsd N |
VAD min silence duration (ms, default 100) |
-ck N, --chunk-seconds N |
Fallback chunk size when VAD is off (default 30 s) |
| Flag | Meaning |
|---|---|
-tp F, --temperature F |
Sampling temperature. 0 = pure argmax (default, bit-identical). > 0 enables multinomial sampling for whisper, voxtral, voxtral4b, qwen3, granite |
-bs N, --beam-size N |
Beam search width (whisper only) |
-tpi F, --temperature-inc F |
Whisper temperature-fallback increment |
--grammar FNAME |
GBNF grammar file (whisper only, including --backend whisper) |
--grammar-rule NAME |
Top-level rule name in the grammar |
--prompt STR |
Initial prompt for whisper |
| Flag | Meaning |
|---|---|
-l auto, --detect-language |
Auto-detect the input language. Backends without native lang-detect (cohere, canary, granite, voxtral, voxtral4b) get it via the LID pre-step |
--lid-backend NAME |
LID provider: whisper (default), silero (95 langs, 16 MB), ecapa (107 or 45 langs, 40-43 MB), firered (120 langs, 544 MB), or off |
--lid-model FNAME |
Override the LID model path (default: auto-downloads ggml-tiny.bin ~75 MB on first use) |
| Flag | Meaning |
|---|---|
-am FNAME, --aligner-model FNAME |
CTC aligner GGUF for word-level timestamps |
-n N, --max-new-tokens N |
Max tokens the LLM may generate (default 512) |
| Flag | Meaning |
|---|---|
-sl LANG, --source-lang LANG |
Source language (canary) |
-tl LANG, --target-lang LANG |
Target language (canary; set different from -sl for X→Y translation) |
-tr, --translate |
Translate to English (whisper, canary) |
--no-punctuation |
Disable punctuation in the output. Native for cohere/canary, post-processed for everyone else |
| Flag | Meaning |
|---|---|
-t N, --threads N |
Threads per inference call (default min(4, nproc)) |
-p N, --processors N |
Run N parallel decoder states (whisper only — uses whisper_full_parallel) |
--no-gpu / --device N |
Disable GPU or pin to GPU N |
These work both with the historical default whisper code path AND with --backend whisper. The historical path retains a few extras unique to it (-owts karaoke, full-mode JSON DTW tokens, -di stereo diarize) — pass a ggml-*.bin model without --backend to get them.
--diarize, -tdrz/--tinydiarize, --carry-initial-prompt, -dtw, -fa/-nfa, -suppress-regex, -suppress-nst, and the full upstream whisper-cli --help list.
Every non-whisper backend uses the Silero VAD model to segment long audio into speech regions, stitch them into a single contiguous buffer (with 0.1s silence gaps, matching whisper.cpp's internal approach), transcribe in one pass, and remap timestamps back to original-audio positions. This preserves cross-segment context and avoids boundary artifacts. Short VAD segments (< 3s) are auto-merged, and oversized segments are split at --chunk-seconds boundaries. Whisper handles VAD internally via wparams.vad.
# Just pass --vad — the model is auto-downloaded on first use
./build/bin/crispasr --backend parakeet -m parakeet.gguf -f long_audio.wav \
--vad -osrt
# Or point at an existing GGUF
./build/bin/crispasr --backend parakeet -m parakeet.gguf -f long_audio.wav \
--vad-model ~/models/ggml-silero-v5.1.2.bin -osrtThe cached model lives at ~/.cache/crispasr/ggml-silero-v5.1.2.bin (~885 KB). If you don't provide --vad, CrispASR falls back to fixed 30-second chunking (configurable via -ck). Encoder cost is O(T²) in the frame count, so for multi-minute audio you really want VAD.
Recommended for subtitles:
crispasr --backend parakeet -m parakeet.gguf -f long_audio.wav --vad -osrt --split-on-punctAccurate subtitle timing:
- Best timing quality: use parakeet. Its native TDT timestamps are more accurate and more natural than the forced-aligner fallback used by LLM backends.
- Best default subtitle flags: use
--vad --split-on-punct. VAD segments at natural speech pauses, then CrispASR stitches/remaps timestamps back to the original timeline. This avoids the mid-sentence boundary problems of fixed 30-second chunking. - For backends without native timestamps (
cohere,granite,voxtral,voxtral4b,qwen3): use a CTC aligner together with--vad. Without VAD, leading silence can throw off sentence starts, especially for the qwen3 forced aligner. - If parakeet is too heavy for very long audio: keep parakeet for timing quality, but cap memory use with fixed chunking:
./build/bin/crispasr --backend parakeet -m parakeet.gguf -f long_audio.wav \
--chunk-seconds 180 -osrt --split-on-punctIn practice:
parakeet --vad -osrt --split-on-punctis the best default for subtitle generationcohere/canary/qwen3/... --vad -am <aligner.gguf> -osrt --split-on-punctis the best fallback when you need another backend--chunk-seconds Nis mainly a VRAM-control knob for long audio, not the preferred path for subtitle accuracy
The LLM-based backends (qwen3, voxtral, voxtral4b, granite) don't emit timestamps natively. CrispASR supports a second-pass forced alignment via NVIDIA's canary-ctc-aligner — a 600M-param FastConformer + CTC head that works on any transcript + audio pair in 25+ European languages.
# Grab the aligner once (~400 MB)
curl -L -o canary-ctc-aligner.gguf \
https://huggingface.co/cstr/canary-ctc-aligner-GGUF/resolve/main/canary-ctc-aligner-q5_0.gguf
# Now any LLM backend can produce word-level SRT output
./build/bin/crispasr --backend voxtral -m auto -f samples/jfk.wav \
-am canary-ctc-aligner.gguf -osrt -ml 1
# [00:00:00.240 --> 00:00:00.640] And
# [00:00:00.640 --> 00:00:00.880] so,
# [00:00:00.880 --> 00:00:01.040] my
# ...Alignment granularity is one encoder frame, ~80 ms.
For subtitle output, prefer adding --vad --split-on-punct:
./build/bin/crispasr --backend cohere -m cohere.gguf -f talk.wav \
-am canary-ctc-aligner.gguf --vad -osrt --split-on-punctNotes:
- The aligner path is a fallback for backends that lack native timestamps.
qwen3-forced-aligneris more sensitive to leading silence;--vadis strongly recommended with it.- Parakeet remains the better choice when timestamp quality is the top priority.
CrispASR writes these formats side-by-side with the input audio (e.g. jfk.wav → jfk.srt, jfk.vtt, jfk.json). The JSON layout:
{
"crispasr": {
"backend": "parakeet",
"model": "parakeet-tdt-0.6b-v3-q4_k.gguf",
"language":"en"
},
"transcription": [
{
"timestamps": { "from": "00:00:00,240", "to": "00:00:10,880" },
"offsets": { "from": 240, "to": 10880 },
"text": "And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country."
}
]
}Add -ojf (--output-json-full) to include per-word words[] and per-token tokens[] arrays when the backend populates them.
All three wrappers are thin shells over the same C-ABI surface in
src/crispasr_c_api.cpp. Anything the CLI can do — transcribe, VAD,
diarize, LID, align, download — is one function call in every
language.
from crispasr import (
Session, diarize_segments, detect_language_pcm,
align_words, cache_ensure_file, registry_lookup,
)
# Transcribe (all 10 backends via one session object)
sess = Session("parakeet-tdt-0.6b-v3-q4_k.gguf")
segs = sess.transcribe_vad(pcm, "silero-v5.1.2.bin") # stitched VAD pass
# Run each shared post-step standalone
lang = detect_language_pcm(pcm, model_path="ggml-tiny.bin")
diarize_segments(my_segs, pcm, method=DiarizeMethod.VAD_TURNS)
words = align_words("canary-ctc-aligner.gguf", "hello world", pcm)
# Auto-download a canonical model
entry = registry_lookup("parakeet")
path = cache_ensure_file(entry.filename, entry.url)use crispasr::{
Session, DiarizeMethod, DiarizeOptions, DiarizeSegment,
LidMethod, detect_language_pcm, align_words,
cache_ensure_file, registry_lookup,
};
let sess = Session::open("cohere-transcribe-q4_k.gguf", 4)?;
let segs = sess.transcribe_vad(&pcm, "silero-v5.1.2.bin", None)?;
let entry = registry_lookup("canary")?.unwrap();
let path = cache_ensure_file(&entry.filename, &entry.url, false, None)?;import 'package:crispasr/crispasr.dart' as crispasr;
final sess = crispasr.CrispasrSession.open(modelPath, backend: 'parakeet');
final segs = sess.transcribeVad(pcm, vadModelPath);
final lang = crispasr.detectLanguagePcm(
pcm: pcm, method: crispasr.LidMethod.whisper, modelPath: tinyPath);
final words = crispasr.alignWords(
alignerModel: ctcPath, transcript: text, pcm: pcm);Reference application: CrisperWeaver — a cross-platform Flutter desktop/mobile transcription app built on package:crispasr. Ships with model browser + downloader (all 10 backends + quants), drag-and-drop files, mic capture, SRT/VTT/TXT export, per-run performance metrics, and full en/de i18n. The v0.5.4 release uses the new transcribeVad path so every non-whisper backend benefits from stitched Silero VAD with zero CrisperWeaver-side work.
./build-ios.sh # iOS xcframework with Metal
./build-android.sh --vulkan # Android NDK with Vulkan GPUWhen you pass -m auto (or -m default), CrispASR downloads the default quantized model for the selected backend into ~/.cache/crispasr/ on first use. The registry (kept in sync with src/crispasr_model_registry.cpp):
| Backend | Download | Approx size |
|---|---|---|
| whisper | ggerganov/whisper.cpp/ggml-base.en.bin |
~147 MB |
| parakeet | cstr/parakeet-tdt-0.6b-v3-GGUF |
~467 MB |
| canary | cstr/canary-1b-v2-GGUF |
~600 MB |
| voxtral | cstr/voxtral-mini-3b-2507-GGUF |
~2.5 GB |
| voxtral4b | cstr/voxtral-mini-4b-realtime-GGUF |
~3.3 GB |
| granite | cstr/granite-speech-4.0-1b-GGUF |
~2.94 GB |
| qwen3 | cstr/qwen3-asr-0.6b-GGUF |
~500 MB |
| cohere | cstr/cohere-transcribe-03-2026-GGUF |
~550 MB |
| wav2vec2 | cstr/wav2vec2-large-xlsr-53-english-GGUF |
~212 MB |
| omniasr | cstr/omniASR-CTC-1B-GGUF |
~551 MB |
| omniasr-llm | cstr/omniasr-llm-300m-v2-GGUF |
~580 MB |
| hubert | cstr/hubert-large-ls960-ft-GGUF |
~200 MB |
| data2vec | cstr/data2vec-audio-960h-GGUF |
~60 MB |
Downloads go through curl (preferred) with a wget fallback — no Python, no libcurl link dependency. Works identically on Linux, macOS, and Windows 10+ where curl ships in the base system. Models are cached by filename; re-running is a single stat() check. The same registry + cache helpers are reachable from the wrappers via crispasr.registry_lookup() / crispasr.cache_ensure_file() so Python/Rust callers can drive -m auto-style resolution without re-implementing it.
Every audio path goes through read_audio_data() inherited from upstream whisper.cpp. Two single-header decoders are embedded:
- miniaudio — WAV (any bit depth: 16/24/32 PCM, IEEE float, A-law, μ-law, ADPCM), FLAC, MP3
- stb_vorbis — OGG Vorbis
Out of the box, CrispASR accepts WAV / FLAC / MP3 / OGG Vorbis at any bit depth and any sample rate (auto-resampled to 16 kHz), mono or stereo (auto-mixed to mono).
| Format | Default build | WHISPER_FFMPEG=ON |
|---|---|---|
| WAV / FLAC / MP3 / OGG | ✔ | ✔ |
.opus |
✗ | ✔ |
.m4a / .mp4 / .webm |
✗ | ⚠ upstream crash, pre-convert |
.aiff / .wma / raw PCM |
✗ | pre-convert |
For anything in the bottom half, the reliable path is ffmpeg -i in.X -ar 16000 -ac 1 -c:a pcm_s16le out.wav then pass the WAV.
CrispASR is structured around three layers on top of whisper.cpp. The
split between src/ (library) and examples/cli/ (presentation) is
deliberate: every algorithm — VAD, diarization, LID, CTC alignment,
HF download/cache, model registry — lives in src/ behind a stable
C-ABI (src/crispasr_c_api.cpp), and every consumer (CLI, Dart, Python,
Rust) reaches it through the same symbols. The CLI keeps only
presentation + UX policy.
┌───────────────────────────────────────────────────────────────────┐
│ examples/cli/cli.cpp (the crispasr binary) │
│ Parses CLI args, dispatches to backend when --backend │
│ is set or GGUF arch is non-whisper; otherwise runs whisper_full │
│ unchanged │
├───────────────────────────────────────────────────────────────────┤
│ examples/cli/crispasr_*_cli.{h,cpp} │
│ Thin CLI shims for policy only — auto-download, TTY prompts, │
│ sherpa-ONNX subprocess fallbacks. Delegate the algorithmic │
│ work to the shared library below. │
├───────────────────────────────────────────────────────────────────┤
│ src/crispasr_c_api.cpp — C-ABI (shared with Dart / Python / Rust) │
│ crispasr_vad.{h,cpp} Silero VAD + whisper.cpp-style │
│ stitching, timestamp remap │
│ crispasr_diarize.{h,cpp} energy / xcorr / vad-turns / │
│ native pyannote diarization │
│ crispasr_lid.{h,cpp} whisper-tiny + silero-native LID │
│ crispasr_aligner.{h,cpp} canary-CTC + qwen3-forced-aligner│
│ crispasr_cache.{h,cpp} HF download + ~/.cache/crispasr │
│ crispasr_model_registry.{h,cpp} backend → canonical GGUF URL │
├───────────────────────────────────────────────────────────────────┤
│ src/{whisper,parakeet,canary,canary_ctc,cohere,qwen3_asr, │
│ voxtral,voxtral4b,granite_speech,silero_lid,pyannote_seg}.cpp│
│ Per-model runtimes (public C APIs) │
├───────────────────────────────────────────────────────────────────┤
│ src/core/ — shared model primitives (crispasr-core) │
│ mel.{h,cpp} log-mel spectrogram (NeMo + HF clusters) │
│ ffn.h SwiGLU + SiLU FFN helpers │
│ attention.h Llama-style self-attention + flash-attn │
│ gguf_loader.{h,cpp} Unified GGUF open / weight mmap / lookup │
├───────────────────────────────────────────────────────────────────┤
│ ggml │
└───────────────────────────────────────────────────────────────────┘
Every algorithm listed below is exposed as extern "C" functions
with a crispasr_ prefix. The CLI, Python, Rust, and Dart bindings
all consume the same symbols.
| File | Role |
|---|---|
crispasr_c_api.cpp |
The C-ABI. Exports session open/close/transcribe, VAD, diarize, LID, alignment, cache, registry — everything a wrapper needs. |
crispasr_vad.{h,cpp} |
Silero VAD slicing + whisper.cpp-style stitching with timestamp remapping. Used by crispasr_session_transcribe_vad. |
crispasr_diarize.{h,cpp} |
Four diarizers: energy (stereo), xcorr (stereo, TDOA), vad-turns (mono, timing), pyannote (mono, GGUF). |
crispasr_lid.{h,cpp} |
whisper-tiny + silero-native language ID with process-wide whisper-context cache. |
crispasr_aligner.{h,cpp} |
canary-CTC + Qwen3-ForcedAligner forced alignment behind one entry point; filename-based dispatch. |
crispasr_cache.{h,cpp} |
WinHTTP / curl / wget download into ~/.cache/crispasr/; zombie-file detection. |
crispasr_model_registry.{h,cpp} |
Backend → canonical GGUF URL table; fuzzy filename lookup for "did you mean …?" hints. |
whisper_params.h |
Shared params struct (extracted from cli.cpp, extended). |
| File | Role |
|---|---|
cli.cpp |
crispasr entry point, extended with --backend dispatch branch. |
crispasr_backend.{h,cpp} |
CrispasrBackend abstract class, capability bitmask, factory, GGUF auto-detect. |
crispasr_backend_{parakeet,canary,cohere,granite,voxtral,voxtral4b,qwen3,fastconformer_ctc,wav2vec2,glm_asr,kyutai_stt,firered_asr,moonshine,omniasr,vibevoice}.cpp |
Per-backend thin wrapper over each model's C API. |
crispasr_output.{h,cpp} |
TXT / SRT / VTT / CSV / JSON / LRC writers on crispasr_segment. |
crispasr_vad_cli.{h,cpp} |
Delegates to src/crispasr_vad; adds auto-download for the Silero GGUF. |
crispasr_lid_cli.{h,cpp} |
Delegates to src/crispasr_lid; adds auto-download + sherpa-ONNX subprocess fallback. |
crispasr_diarize_cli.{h,cpp} |
Delegates to src/crispasr_diarize; adds sherpa subprocess fallback + pyannote GGUF auto-download. |
crispasr_model_mgr_cli.{h,cpp} |
Delegates to src/crispasr_model_registry; adds "Download now? [Y/n]" prompt on TTY. |
crispasr_aligner_cli.{h,cpp} |
Adapter converting CrispasrAlignedWord → the CLI's crispasr_word shape. |
crispasr_server.cpp |
HTTP server for the persistent-model mode + OpenAI-compatible endpoints. |
crispasr_llm_pipeline.h |
Templated audio-LLM pipeline (mel → encoder → prompt → KV decode). |
crispasr_run.cpp |
Top-level pipeline dispatch: resolve → detect → load → slice → transcribe → write. |
Duplicated scaffolding is bundled in a single static library, crispasr-core, linked into every non-whisper model target.
| Header | Replaces | Consumers |
|---|---|---|
core/mel.{h,cpp} |
7× copy-pasted STFT + mel filterbank + log + norm | parakeet, canary, canary_ctc, cohere, voxtral, voxtral4b, qwen3 |
core/ffn.h |
4× inline SwiGLU blocks | qwen3, voxtral, voxtral4b, granite |
core/attention.h |
Llama-style self-attention with NEOX RoPE + GQA + flash-attn | voxtral (more coming) |
core/gguf_loader.{h,cpp} |
8× identical two-pass GGUF load + mmap + tensor-map build | all non-whisper models |
core_mel::Params spans both algorithm clusters: the NeMo family (ln + per-mel z-score + (T, n_mels) layout) and the HF/Whisper family (log10 + global clip normalization + (n_mels, T) layout), with knobs for LogGuard (add-epsilon vs max-clip), MatmulPrecision (Float vs Double), FbLayout (MelsFreqs vs FreqsMels), drop_last_frame / drop_first_frame_if_odd, and pad_to_T.
core_gguf::WeightLoad owns the ggml_context, the ggml_backend_buffer_t, and the std::map<std::string, ggml_tensor*> in one struct that models std::move() into their own state. The mmap path has a pread/fseek fallback for filesystems that don't support mmap.
src/whisper.cpp is intentionally not migrated to src/core/ (yet) — it's (for the time being) the battle-tested reference and the crispasr -m ggml-base.en.bin … code path is byte-identical to upstream whisper-cli. This guarantee is a test gate: every CrispASR commit that touches the CLI is checked against it.
Every src/core/ migration commit includes a md5sum-level regression test against samples/jfk.wav:
- mel extraction: bit-identical transcript + SRT on parakeet, canary, canary_ctc, voxtral, voxtral4b, qwen3. Cohere transcript is bit-identical but a single SRT boundary shifts by 80 ms due to the CBLAS→manual-loop matmul accumulator reorder.
- ffn extraction: bit-identical on qwen3, voxtral, voxtral4b, granite.
- gguf_loader extraction: bit-identical on all 8 non-whisper models.
- attention extraction: bit-identical on voxtral (only consumer so far).
Optimization and graph survey (all backends)
| Backend | Arch pattern | ggml graph | Flash attn | KV cache | GPU | Shared core modules |
|---|---|---|---|---|---|---|
| whisper | Enc-dec transformer | ✔ | ✔ | ✔ | CUDA/Metal | (upstream) |
| parakeet | FastConformer + TDT | ✔ | ✔ | partial | CPU | mel, fastconformer |
| canary | FastConformer + Transformer dec | ✔ | ✔ | ✔ | CUDA/Metal | mel, fastconformer |
| cohere | Conformer + Transformer dec | ✔ | ✔ | ✔ | CUDA/Metal | mel |
| granite | Conformer + Q-Former + LLM | ✔ | ✔ | ✔ | CPU | mel, kv_self_attn, swiglu, greedy_decode, bpe |
| voxtral | Whisper enc + Mistral LLM | ✔ | ✔ | ✔ | CUDA/Metal | mel, kv_self_attn, encoder_self_attn, swiglu, greedy_decode, bpe |
| voxtral4b | RoPE enc + 3.4B LLM | ✔ | ✔ | ✔ | CPU | mel, kv_self_attn, encoder_self_attn, swiglu, greedy_decode, bpe |
| qwen3 | Whisper enc + Qwen3 LLM | ✔ | ✔ | ✔ | CUDA/Metal | mel, kv_self_attn, swiglu, greedy_decode, bpe |
| fc-ctc | FastConformer + CTC | ✔ | ✔ | — | CPU | mel, fastconformer |
| wav2vec2 | CNN + Transformer + CTC | ✔ | — | — | CUDA/Metal | gguf_loader |
| glm-asr | Whisper enc + Llama LLM | ✔ | ✔ | ✔ | CPU | mel, kv_self_attn, swiglu, greedy_decode, bpe |
| kyutai-stt | Mimi codec + causal LM | ✔ | ✔ | ✔ | CPU | gguf_loader |
| firered-asr | Conformer + CTC + beam dec | ✔ | ✔ | ✔ | CPU | mel, gguf_loader |
| moonshine | Conv + 6L enc-dec | ✔ | ✔ | ✔ | CPU | (vendored) |
| moonshine-streaming | Sliding-window enc + dec | ✔ | ✔ | ✔ | CPU | (vendored) |
| omniasr | wav2vec2 enc + CTC/LLM | ✔ | ✔ | CTC:— LLM:✔ | CPU | gguf_loader, kv_self_attn, swiglu |
| vibevoice | σ-VAE + Qwen2 7B | ✔ | ✔ | ✔ | CUDA/Metal | gguf_loader |
Architecture families:
- Feedforward CTC (wav2vec2, omniasr-CTC, fc-ctc, firered-asr): No decoder, no KV cache. Fastest. No native punctuation.
- Encoder-decoder (whisper, canary, cohere, moonshine, moonshine-streaming): Cross-attention KV cache, autoregressive text decoder.
- Audio-LLM (granite, voxtral, voxtral4b, qwen3, glm-asr, omniasr-LLM, vibevoice): Audio features injected into LLM embedding space, KV-cached autoregressive decoding.
- Transducer (parakeet): LSTM predictor + joint network, frame-synchronous TDT decoding.
- Codec + LM (kyutai-stt): Neural audio codec (RVQ) → token-based LM.
Optimization opportunities:
- Beam search could be added to all encoder-decoder and Audio-LLM backends (currently only whisper + firered-asr)
- Fused QKV (single matmul for Q/K/V projections) — used in CrispEmbed, applicable to all attention layers
- Temperature sampling could be added to glm-asr, kyutai-stt, firered-asr, moonshine, omniasr-LLM via
core_greedy_decode - GPU offload for CPU-only backends (parakeet, granite, voxtral4b, etc.) — needs ggml_backend_sched with GPU primary
Adding a new ASR model to CrispASR is a focused exercise in five files. The worked examples to copy from are the existing crispasr_backend_*.cpp adapters.
Heads-up on
clang-format— CI pinsclang-format-18. Homebrew's defaultclang-formatkeg is on v22+, which formats some constructs differently and will fail the lint job. Pin it locally withpip install 'clang-format==18.1.8'and put~/Library/Python/3.11/bin(or your equivalent pip user-bin) ahead of/opt/homebrew/binonPATH. One pip install, no LLVM toolchain needed.
Following the established convention:
struct yourmodel_context * yourmodel_init_from_file(const char * path, yourmodel_context_params p);
void yourmodel_free(struct yourmodel_context *);
char * yourmodel_transcribe(struct yourmodel_context *, const float * samples, int n);Use src/core/mel, src/core/ffn, src/core/attention, and src/core/gguf_loader wherever they fit — they cover ~80% of the boilerplate.
Create examples/cli/crispasr_backend_yourmodel.cpp:
#include "crispasr_backend.h"
#include "whisper_params.h"
#include "yourmodel.h"
namespace {
class YourmodelBackend : public CrispasrBackend {
public:
const char * name() const override { return "yourmodel"; }
uint32_t capabilities() const override {
return CAP_TIMESTAMPS_CTC | CAP_AUTO_DOWNLOAD | /* ... */;
}
bool init(const whisper_params & p) override { /* yourmodel_init_from_file(...) */ }
std::vector<crispasr_segment> transcribe(
const float * samples, int n, int64_t t_off,
const whisper_params & p) override { /* call yourmodel_transcribe and return segments */ }
void shutdown() override { /* yourmodel_free(...) */ }
private:
yourmodel_context * ctx_ = nullptr;
};
} // namespace
std::unique_ptr<CrispasrBackend> crispasr_make_yourmodel_backend() {
return std::make_unique<YourmodelBackend>();
}In examples/cli/crispasr_backend.cpp:
std::unique_ptr<CrispasrBackend> crispasr_make_yourmodel_backend();
// ...
if (name == "yourmodel") return crispasr_make_yourmodel_backend();
// ...
std::vector<std::string> crispasr_list_backends() {
return { ..., "yourmodel" };
}Add the architecture string to crispasr_detect_backend_from_gguf() so general.architecture auto-detection works.
In examples/cli/CMakeLists.txt:
add_executable(${TARGET}
# ...
crispasr_backend_yourmodel.cpp
)
target_link_libraries(${TARGET} PRIVATE
# ...
yourmodel_lib
)If your model has a canonical Q4_K HuggingFace release, add it to crispasr_model_mgr.cpp's registry so -m auto works.
For ASR backends, the transcript is the regression target:
./build/bin/crispasr --backend yourmodel -m model.gguf -f samples/jfk.wav -np > before.txt
# ... make changes ...
cmake --build build --target crispasr
./build/bin/crispasr --backend yourmodel -m model.gguf -f samples/jfk.wav -np > after.txt
diff before.txt after.txt && echo BIT-IDENTICALFor TTS backends the output is audio (not text), and diffusion samplers
are stochastic, so a diff of two runs won't compare. The regression
target is "audio cosine similarity vs the official model's output, with
the same Gaussian noise pinned in both runs":
# 1. Run the official PyTorch model with hooks that capture the
# per-frame init noise plus the conditions / latents per frame
HF_HOME=/path/to/hf-cache python tools/run_official_vibevoice.py \
--text "Hello, how are you today?" \
--voice voices_pt/en-Emma_woman.pt \
--output-wav /tmp/ref.wav \
--output-dir /tmp/ref_dump
# 2. Run crispasr with the same noise pinned and per-frame dumps
VIBEVOICE_TTS_NOISE=/tmp/ref_dump/noise.bin \
VIBEVOICE_TTS_DUMP=/tmp/cpp_dump VIBEVOICE_TTS_DUMP_PERFRAME=1 \
./build/bin/crispasr --tts "Hello, how are you today?" \
-m vibevoice-realtime-0.5b-tts-f16.gguf \
--voice vibevoice-voice-emma.gguf \
--tts-output /tmp/cpp.wav -ng
# 3. Audio cos at xcorr peak (accounts for any leading-silence trim)
python -c "import sys; sys.path.insert(0, 'tools'); \
from _audio_diff import cos_report; \
print(cos_report('/tmp/ref.wav', '/tmp/cpp.wav'))"
# OFFICIAL: 182400 samples = 7.60s rms=0.0653
# AFTER_FIX: 171459 samples = 7.14s rms=0.0672
# cos at zero shift = 0.0027
# cos at xcorr peak = 0.9991 lag=7741 samples = 322.5 mscos at xcorr peak ≥ 0.999 is "essentially bit-exact modulo F16
quantization". A drop indicates a real divergence — pair the audio
diff with the per-frame stage diff (next section) to localise.
Bit-identical regression against the previous C++ version proves the change was neutral, but it doesn't tell you the C++ forward pass is correct in the first place. For that, use the ground-truth tools:
# 1. Capture PyTorch reference activations at every named stage
python tools/dump_reference.py --backend voxtral \
--model-dir /path/to/hf/voxtral-mini-3b-2507 \
--audio samples/jfk.wav \
--output /tmp/voxtral-ref.gguf
# 2. Compare your C++ forward pass against the reference, stage by stage
./build/bin/crispasr-diff voxtral \
voxtral-mini-3b-2507-q4_k.gguf \
/tmp/voxtral-ref.gguf \
samples/jfk.wav
#
# [PASS] mel_spectrogram shape=[128,3000] cos_min=0.99998 max_abs=3e-5
# [PASS] projector_output shape=[375,3072] cos_min=0.99985 max_abs=4e-4
# summary: 2 pass, 0 fail, 0 skip (cos threshold 0.999)The Python dumper uses PyTorch forward hooks to capture intermediate
activations (mel, per-encoder-layer output, projector, LLM block output,
logits, argmax) and writes them to a single GGUF tensor archive.
The C++ side loads the archive via core_gguf::load_weights and runs
the backend's public stage helpers (*_compute_mel, *_run_encoder,
etc.) to produce the same tensors, then the shared crispasr_diff::Ref
compares them with cosine similarity per row, max-abs error,
RMS, and — for logits — top-1 argmax match rate.
Adding a new backend to the dumper is a ~60-line file in
tools/reference_backends/<name>.py that registers PyTorch forward
hooks and returns a dict {stage_name: ndarray}. Worked examples:
tools/reference_backends/qwen3.py,voxtral.py,cohere.py,parakeet.py,gemma4.py,omniasr_llm.py,granite.py— encoder-decoder / Audio-LLM ASR backends. Use the_hooks.pyforward_hook helpers (capture_modules,drop_hooks,finalize).tools/reference_backends/qwen3_tts.py,qwen3_tts_codec.py,qwen3_tts_spk.py,qwen3_tts_cenc.py— TTS prefill / encoder backends. Usecapture_modules(..., first_call_only=True)for hooks that fire once per stage (e.g. talker prefill called from insidegenerate()).tools/reference_backends/vibevoice.py,vibevoice_tts.py— VibeVoice σ-VAE encoder + TTS pipeline.
For autoregressive / diffusion-sampler diffs the per-stage capture above isn't enough — bugs that appear only after several AR steps don't show up in a frame-0 cos diff. Two extra helpers:
tools/reference_backends/_iter_capture.py— companion to_hooks.pyfor "monkey-patch a sampler entry point and append one tensor per iteration to a{stage: [...]}dict". Used to capturepos_cond / neg_cond / noise / v_cfg_step0 / latentper frame insidesample_speech_tokensfor vibevoice.tools/_audio_diff.py— sample-wise audio cos at zero shift, cos at the cross-correlation peak (so leading-silence trims and causal-padding offsets don't tank the score), spectral band-power table, and a one-callcos_report(a_path, b_path)for CLI use.
tools/run_official_vibevoice.py is the worked example combining all
three: it loads the upstream model, monkey-patches sample_speech_tokens
via _iter_capture.patch_method, captures acoustic_embed via a
standard forward_hook, and writes both perframe_<stage>_f<NNN>.bin
files (matching the C++ runtime's VIBEVOICE_TTS_DUMP_PERFRAME=1
output) and a noise.bin for the C++ side to replay.
CrispASR includes a unified GGUF re-quantization tool, crispasr-quantize, that works across all supported model families (Whisper, Parakeet, Canary, Cohere, Voxtral, Qwen3, Granite, Wav2Vec2, etc.).
It is a model-agnostic tool that iterates through the GGUF tensor list and re-quantizes eligible 2D weight matrices while preserving metadata and non-quantizable tensors (norms, positional embeddings, biases) in their original types.
./build/bin/crispasr-quantize model-f16.gguf model-quant.gguf <type>| Type | Description |
|---|---|
q4_0 |
4-bit (scale only) |
q4_1 |
4-bit (scale + minimum; slightly higher accuracy than q4_0) |
q5_0 |
5-bit (scale only) |
q5_1 |
5-bit (scale + minimum; slightly higher accuracy than q5_0) |
q8_0 |
8-bit (scale only) |
q2_k |
2-bit K-quant |
q3_k |
3-bit K-quant |
q4_k |
4-bit K-quant (generally preferred over legacy Q4) |
q5_k |
5-bit K-quant (generally preferred over legacy Q5) |
q6_k |
6-bit K-quant |
# Quantize a Parakeet F16 model to Q4_K
./build/bin/crispasr-quantize parakeet-f16.gguf parakeet-q4_k.gguf q4_k
# Quantize a Voxtral model to Q5_0
./build/bin/crispasr-quantize voxtral-f16.gguf voxtral-q5_0.gguf q5_0Note on alignment. K-quants (
q2_kthroughq6_k) require tensor row sizes to be multiples of 256. If a tensor does not meet this requirement (e.g., the 896-wide tensors in some Qwen3-ASR layers), the tool automatically falls back to a compatible legacy quantization type (likeq4_0orq8_0) to ensure the entire model is quantized.
21 ASR backends + 2 punctuation models + VAD/LID/diarization/alignment — all through a unified C-ABI with Python/Rust/Dart wrappers. See PLAN.md for the roadmap, HISTORY.md for completed milestones, and PERFORMANCE.md for benchmarks on Kaggle T4 GPU (21/21 pass, fastest 9.0x RT).
All backends use ggml_backend_init_best() which automatically picks the highest-priority compiled backend: CUDA > Metal > Vulkan > CPU. To force a specific backend:
# Force Vulkan even when CUDA is available
crispasr --gpu-backend vulkan -m model.gguf -f audio.wav
# Pin a specific GPU (useful on Vulkan systems with iGPU + dGPU)
crispasr --gpu-backend vulkan -dev 1 -m model.gguf -f audio.wav
# Force CPU (useful for benchmarking)
crispasr -ng -m model.gguf -f audio.wav
# CUDA unified memory (swap to RAM when VRAM exhausted)
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 crispasr -m model.gguf -f audio.wavBuild flags: -DGGML_CUDA=ON, -DGGML_METAL=ON, -DGGML_VULKAN=ON.
Notes:
--gpu-backend vulkanselects the Vulkan backend, but it does not choose which physical GPU to use. Use-dev Nto select the Vulkan device index.- On some Windows laptops, Vulkan device
0is the Intel iGPU and the NVIDIA GPU is1. If Vulkan looks unexpectedly slow, rerun with-dev 1. - The Windows convenience script
build-vulkan.batcreates a separate Vulkan-capable binary atbuild-vulkan\bin\crispasr.exe.
For most backends, -v / --verbose surfaces per-stage timings and
device picks. For headless / library use (where the CLI flag isn't
plumbed through), set CRISPASR_VERBOSE=1 instead.
# Per-stage timing breakdown (mel / encoder / prefill / decode):
crispasr -v --backend gemma4-e2b -m model.gguf -f audio.wav
# gemma4_e2b: mel 128x1099 (17.2 ms)
# gemma4_e2b: encoder done: 1536x275 (719.0 ms)
# gemma4_e2b: prefill done, first_token=3133 (1464.0 ms)
# gemma4_e2b: decoded 25 tokens (7748.3 ms total)
# crispasr: transcribed 11.0s audio in 7.75s (1.4x realtime)
# Hugging Face access for gated models (Voxtral, Gemma4-E2B, …):
HF_TOKEN=hf_xxx crispasr -m auto --backend gemma4-e2b -f audio.wavThe server has its own auth env: CRISPASR_API_KEYS (see
Server mode).
Per-backend debug / bench / dump-dir env vars (developer)
These are useful when porting a new backend or chasing a regression.
The *_BENCH=1 toggles emit per-stage timings even without -v; the
*_DEBUG=1 toggles emit per-step diagnostic prints; the *_DUMP_DIR=
paths write per-stage F32 tensors for diff-testing against a PyTorch
reference (see Debug a new backend against PyTorch ground truth).
| Env var | Purpose |
|---|---|
CRISPASR_VERBOSE=1 |
Forces verbose mode for any backend (parallel to the -v flag). |
CRISPASR_DUMP_DIR=path/ |
Generic per-stage F32 tensor dump for the crispasr-diff harness. |
GEMMA4_E2B_BENCH=1 |
Per-stage timings for the Gemma-4-E2B backend. |
COHERE_BENCH=1 / COHERE_DEBUG=1 |
Cohere transcribe per-stage timings / per-step diagnostics. |
COHERE_PROF=1 |
Cohere graph-level profiling (per-op timings). |
COHERE_THREADS=N |
Override thread count for the Cohere backend. |
COHERE_DEVICE=cpu|cuda|metal|vulkan |
Force the Cohere backend onto a specific device. |
COHERE_DUMP_ATTN=path/ |
Dump attention activations for Cohere (used by the diff harness). |
FIRERED_BENCH=1 |
Per-stage timings for the FireRedASR backend. |
FIREREDPUNC_DEBUG=1 |
Per-step diagnostics for the FireRed punctuation post-step. |
MOONSHINE_STREAMING_BENCH=1 |
Per-stage timings for moonshine-streaming. |
OMNIASR_BENCH=1 / OMNIASR_DEBUG=1 / OMNIASR_DUMP_DIR= |
OmniASR per-stage timings, diagnostics, and stage dumps. |
PARAKEET_DEBUG=1 |
Parakeet TDT per-step diagnostics (joint network, blank-id sanity). |
QWEN3_TTS_BENCH=1 / QWEN3_TTS_DEBUG=1 / QWEN3_TTS_DUMP_DIR= |
Qwen3-TTS per-stage timings, diagnostics, and stage dumps. |
VIBEVOICE_BENCH=1 / VIBEVOICE_DEBUG=1 / VIBEVOICE_DUMP_DIR= |
VibeVoice ASR per-stage timings, diagnostics, and stage dumps. |
VIBEVOICE_REF_FEATURES=path |
Replace the live encoder with a saved feature tensor (regression harness). |
VIBEVOICE_TTS_DUMP=path/ |
VibeVoice TTS per-stage dumps (token IDs, base/TTS hidden, neg condition, frame-0 noise/v_cfg/latent/acoustic_embed) for the diff harness. |
VIBEVOICE_TTS_DUMP_PERFRAME=1 |
Per-frame VibeVoice TTS dumps written as perframe_<stage>_f<NNN>.bin. Pair with VIBEVOICE_TTS_DUMP=path/ and VIBEVOICE_TTS_NOISE=path for stage-by-stage AR diff against tools/run_official_vibevoice.py. |
VIBEVOICE_TTS_TRACE=1 |
Extra one-line traces (negative-condition prefill rms, scaling/bias factors loaded). Same effect as -vv. |
VIBEVOICE_VOICE_AUDIO=path.wav |
Reference voice WAV for 1.5B-base TTS without a .gguf voice cache. |
VIBEVOICE_TTS_NOISE=path |
Override the per-frame Gaussian init noise. Flat little-endian float32 [N_frames, vae_dim] — typically the noise.bin written by tools/run_official_vibevoice.py. |
VIBEVOICE_VAE_BACKEND=cpu|metal|cuda|vulkan |
Pin the VAE decoder onto a specific backend. |
WAV2VEC2_BENCH=1 / WAV2VEC2_VERBOSE=1 / WAV2VEC2_DUMP_DIR= |
wav2vec2 per-stage timings, verbose graph traces, and stage dumps. |
GRANITE_ENCODER_GRAPH=1 |
Switch the granite_speech encoder from per-layer CPU loops to a single ggml graph. |
CRISPASR_NO_REL_POS=1 |
Ablate the relative-position bias in the Gemma-4 audio encoder (development only). |
ECAPA_REF_FBANK=path |
Reference filterbank tensor for the ECAPA-TDNN LID model (regression harness). |
CRISPASR_SHERPA_LID_BIN=path |
Override the auto-detected sherpa-onnx LID binary. |
WHISPER_ARG_DEVICE=N |
Default GPU device index when -dev isn't passed. |
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 |
Let CUDA swap to RAM when VRAM is exhausted. |
GGML_VK_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES |
Standard ggml/CUDA device-visibility filters. |
HF_TOKEN and HUGGING_FACE_HUB_TOKEN are both honoured for gated-model
downloads (in that order).
- whisper.cpp — the ggml inference engine and the whisper runtime this fork is built on
- ggml — the tensor library everything runs on
- NVIDIA NeMo — parakeet-tdt, canary-1b-v2, canary-ctc aligner, and the FastConformer-CTC family (stt_en_fastconformer_ctc_{large,xlarge,xxlarge})
- Cohere — cohere-transcribe-03-2026
- Qwen team (Alibaba) — Qwen3-ASR-0.6B, Qwen3-ASR-1.7B, Qwen3-ForcedAligner-0.6B
- Mistral AI — Voxtral Mini 3B and 4B Realtime
- IBM Granite team — Granite Speech 3.2-8b, 3.3-2b, 3.3-8b, 4.0-1b
- Meta / wav2vec2 — wav2vec2 CTC models (XLSR-53 English, German, multilingual via any Wav2Vec2ForCTC checkpoint)
- sherpa-onnx — optional diarization via subprocess (ONNX models)
- Silero — VAD (native GGUF) and language identification (native GGUF, 95 languages)
- pyannote — speaker diarization segmentation (native GGUF port)
- miniaudio and stb_vorbis — embedded audio decoders
- Claude Code (Anthropic) — significant portions of the crispasr integration layer, all model converters, and the FastConformer/attention/mel/FFN/BPE core helpers were co-authored with Claude
Same as upstream whisper.cpp: MIT.
Per-model weights are covered by their respective HuggingFace model licenses (see Supported backends). The crispasr binary itself links model runtimes that are mostly permissively licensed (MIT / Apache-2.0 / CC-BY-4.0 for weights).