This repo runs a real-time-ish speech → text → LLM → speech pipeline exposed over WebSocket (FastAPI + Uvicorn).
Core entry points:
- Server: `run_server.py`
- Client (mic + speaker): `s2s_client.py`
- Pipeline implementation: `speech_pipeline/`
High-level flow (default client):
- Mic capture (16 kHz) in the client.
- Client-side VAD (Silero VAD via `torch.hub`) segments speech (see the sketch after this list).
- When a speech segment ends, the client sends the full utterance to the server over WebSocket.
- Server runs:
  - ASR (Whisper via `faster-whisper`) → transcription
  - LLM (Qwen3-8B from local folder) → response text
  - TTS (Kokoro ONNX) → response audio (24 kHz)
- Server streams back:
  - transcription text
  - response text
  - response audio (PCM int16, base64)
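For reference, a minimal sketch of the client-side Silero VAD segmentation loop. The real wiring lives in `s2s_client.py`; `mic_chunks` and `send_utterance` are hypothetical helpers here:

```python
import torch

# Downloads/caches the Silero VAD model on first use.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, VADIterator, _ = utils

vad = VADIterator(model, sampling_rate=16000)

# Feed 512-sample (32 ms) chunks of 16 kHz float32 audio; the iterator
# emits {"start": n} / {"end": n} dicts at speech boundaries.
for chunk in mic_chunks():      # hypothetical: yields torch tensors from the mic
    event = vad(chunk)
    if event is not None and "end" in event:
        send_utterance()        # hypothetical: flush buffered speech to the server
```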
Implements the orchestration layer `SpeechToSpeechPipeline`.
- `process_audio_chunk(audio_chunk)`: feeds the chunk to the server-side VAD (`VADIterator`). When a speech segment completes, it calls `process_speech(...)`.
- `process_speech(audio)`: non-streaming pipeline
  - ASR: `transcribe(...)`
  - LLM: `generate(...)`
  - TTS: `synthesize(text, wav_path)`
  - Returns a `PipelineResult` with transcription, response text, and full audio.
- `process_speech_streaming(audio)`: "semi-streaming" output (see the sketch after this list)
  - ASR first.
  - LLM token streaming (`generate_streaming`).
  - TTS per sentence: whenever a sentence boundary is detected, synthesize that sentence and yield an audio chunk.
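A minimal sketch of that per-sentence scheme, assuming `llm.generate_streaming(...)` yields text pieces and a hypothetical `tts.synthesize_to_array(...)` returns raw samples (the real `EnglishTTS.synthesize` writes a wav file instead):

```python
import re

SENTENCE_END = re.compile(r"[.!?]\s")

def stream_speech(prompt, llm, tts):
    """Yield (sentence, audio) pairs as soon as each sentence completes."""
    buffer = ""
    for piece in llm.generate_streaming(prompt):
        buffer += piece
        while (match := SENTENCE_END.search(buffer)):
            sentence, buffer = buffer[:match.end()], buffer[match.end():]
            yield sentence, tts.synthesize_to_array(sentence)
    if buffer.strip():          # flush the trailing partial sentence
        yield buffer, tts.synthesize_to_array(buffer)
```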
WebSocket API + connection management.
- WebSocket endpoint: `ws://<host>:<port>/ws`
- Two audio input modes (a minimal endpoint sketch follows):
  - `audio`: one complete utterance (client already segmented speech). Server uses `process_speech_streaming` and returns `response_chunk` + audio chunks.
  - `audio_chunk`: streaming chunks (server segments speech using its own VAD). When the server detects end-of-utterance, it runs `process_speech` and returns one response.
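A stripped-down sketch of how the endpoint might dispatch the two modes. The pipeline import path and treating `process_speech_streaming` as an async generator are assumptions; the real server also handles `text`, `reset`, `ping`, and errors:

```python
import base64
import json

import numpy as np
from fastapi import FastAPI, WebSocket

from speech_pipeline import SpeechToSpeechPipeline  # import path is an assumption

app = FastAPI()
pipeline = SpeechToSpeechPipeline(device="cuda")

def decode_pcm(b64: str) -> np.ndarray:
    """base64 PCM int16 → float32 in [-1, 1]."""
    pcm = np.frombuffer(base64.b64decode(b64), dtype=np.int16)
    return pcm.astype(np.float32) / 32768.0

@app.websocket("/ws")
async def ws_endpoint(ws: WebSocket):
    await ws.accept()
    await ws.send_json({"type": "connected", "sample_rate": 16000})
    while True:
        msg = json.loads(await ws.receive_text())
        if msg["type"] == "audio":          # complete utterance → semi-streaming path
            async for out in pipeline.process_speech_streaming(decode_pcm(msg["data"])):
                await ws.send_json(out)     # transcription / response_chunk / audio
        elif msg["type"] == "audio_chunk":  # raw chunk → server-side VAD path
            pipeline.process_audio_chunk(decode_pcm(msg["data"]))
```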
ASR wrapper `WhisperASR` (aliased as `ConformerASR` for backward compatibility).
- Uses `faster_whisper.WhisperModel`.
- Downloads/caches Whisper weights under `pretrained_models/whisper`.
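For orientation, roughly how `faster-whisper` is driven. The model size and `compute_type` here are assumptions; `download_root` matches the cache path above:

```python
from faster_whisper import WhisperModel

model = WhisperModel(
    "base",                                    # actual model size is repo-specific
    device="cuda",
    compute_type="float16",
    download_root="pretrained_models/whisper",
)

# `audio` is 16 kHz float32 mono; `segments` is a lazy generator.
segments, info = model.transcribe(audio, language="en")
text = " ".join(seg.text.strip() for seg in segments)
```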
LLM wrapper `QwenLLM`.
- Loads tokenizer + model from `models/Qwen_Qwen3-8B`.
- Optional 4-bit quantization via `bitsandbytes`.
- Supports:
  - non-streaming: `generate(prompt)`
  - streaming: `generate_streaming(prompt)` (see the sketch below)
- Optional web search augmentation (DuckDuckGo) if `ddgs`/`duckduckgo_search` is installed.
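Token streaming with `transformers` is typically implemented via `TextIteratorStreamer`; a sketch under that assumption (generation parameters are illustrative):

```python
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

tok = AutoTokenizer.from_pretrained("models/Qwen_Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained("models/Qwen_Qwen3-8B", device_map="auto")

def generate_streaming(prompt: str):
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
    # generate() blocks, so it runs on a worker thread while we consume tokens.
    Thread(target=model.generate,
           kwargs={**inputs, "streamer": streamer, "max_new_tokens": 512}).start()
    yield from streamer   # decoded text pieces as they are generated
```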
TTS wrapper `EnglishTTS` using Kokoro ONNX.
- Expects local files:
  - `models/kokoro/kokoro-v1.0.onnx`
  - `models/kokoro/voices-v1.0.bin`
- Produces 24 kHz audio.
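Driving `kokoro-onnx` directly looks roughly like this (the voice name is an assumption; file paths match the ones above):

```python
from kokoro_onnx import Kokoro

kokoro = Kokoro("models/kokoro/kokoro-v1.0.onnx", "models/kokoro/voices-v1.0.bin")

# Returns float32 samples plus the sample rate (24 kHz for Kokoro).
samples, sample_rate = kokoro.create("Hello there!", voice="af_sarah", speed=1.0)
```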
Thin CLI entrypoint.
- `python run_server.py` starts the FastAPI server.
- `--preload` optionally loads all models at startup (otherwise the first connection triggers a lazy load).
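A hypothetical sketch of the CLI wiring; the app import path and preload hook are assumptions, while the flags match the ones documented below:

```python
import argparse

import uvicorn

parser = argparse.ArgumentParser()
parser.add_argument("--host", default="0.0.0.0")
parser.add_argument("--port", type=int, default=8765)
parser.add_argument("--preload", action="store_true")
args = parser.parse_args()

if args.preload:
    from speech_server import pipeline  # hypothetical module/object names
    pipeline.load_models()              # eager load instead of lazy first-connection load

uvicorn.run("speech_server:app", host=args.host, port=args.port)
```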
Interactive client:
- Records microphone audio at 16 kHz (`sounddevice`; a capture sketch follows this list).
- Uses Silero VAD locally to segment speech.
- Sends the full utterance to the server (`type: "audio"`).
- Plays server audio responses and suppresses the mic while playback is active to reduce feedback.
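The capture side in miniature (the queue-based callback pattern is an assumption about `s2s_client.py`'s internals):

```python
import queue

import sounddevice as sd

audio_q: queue.Queue = queue.Queue()

def on_audio(indata, frames, time_info, status):
    audio_q.put(indata.copy())   # int16 mono frames for the VAD loop

stream = sd.InputStream(samplerate=16000, channels=1, dtype="int16",
                        blocksize=512, callback=on_audio)
stream.start()                   # frames now accumulate in audio_q
```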
Connect to: `ws://localhost:8765/ws` (a minimal client sketch follows the message lists).

Client → server messages:
- Full utterance (most common, used by `s2s_client.py`):
  `{ "type": "audio", "data": "<base64 PCM int16>", "sample_rate": 16000 }`
- Streaming chunks (alternative):
  `{ "type": "audio_chunk", "data": "<base64 PCM int16>", "sample_rate": 16000 }`
- Text-only:
  `{ "type": "text", "text": "Hello" }`
- Reset / ping:
  `{ "type": "reset" }`, `{ "type": "ping" }`

Server → client messages:
- On connect:
  `{ "type": "connected", "message": "...", "sample_rate": 16000 }`
- Transcription:
  `{ "type": "transcription", "text": "..." }`
- Response text:
  `{ "type": "response", "text": "..." }`
- Streaming response chunks (when using `audio` mode):
  `{ "type": "response_chunk", "text": "..." }`
- Audio:
  `{ "type": "audio", "data": "<base64 PCM int16>", "sample_rate": 24000 }`
- Errors / done:
  `{ "type": "error", "message": "..." }`, `{ "type": "processing_complete", "processing_time_ms": 1234.5 }`
Recommended (create a venv first):
`pip install -r requirements_speech.txt`
`pip install sounddevice rich kokoro-onnx`

Notes:
- `sounddevice` is required for the client.
- `kokoro-onnx` is required for the TTS implementation in `speech_pipeline/tts.py`.
This repo expects local model files (these are often large and may be gitignored):
- Qwen model folder: `models/Qwen_Qwen3-8B/` (HuggingFace-format directory)
- Kokoro ONNX files: `models/kokoro/kokoro-v1.0.onnx`, `models/kokoro/voices-v1.0.bin`

Whisper weights are downloaded automatically at runtime into `pretrained_models/whisper/`.
Start the server:
`python run_server.py --host 0.0.0.0 --port 8765`

Optional preload (slower startup, faster first request):
`python run_server.py --preload`
Start the client:
`python s2s_client.py --url ws://localhost:8765/ws`

Notes:
- Sample rates:
  - Client mic input: 16 kHz
  - Server TTS output: 24 kHz
- GPU vs CPU:
  - The server defaults to `device="cuda"` in the pipeline. For CPU-only use, adjust where `SpeechToSpeechPipeline(device="cuda")` is constructed (see the sketch at the end of these notes).
- First run can be slow due to model downloads / cache warmup.
- If you see import errors:
  - Whisper: `pip install faster-whisper`
  - WebSocket server: `pip install fastapi uvicorn websockets`
  - TTS: `pip install kokoro-onnx`
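For CPU-only use, something like this at the construction site (import path is an assumption):

```python
from speech_pipeline import SpeechToSpeechPipeline  # import path is an assumption

pipeline = SpeechToSpeechPipeline(device="cpu")     # instead of device="cuda"
```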