
Add Bookworm OS support and Grok Voice AI integration#47

Open
obikata wants to merge 28 commits into nikivanov:master from obikata:feature/bookworm-support

Conversation


@obikata obikata commented Apr 4, 2026

Summary

  • Bookworm OS support: manual setup steps and code fixes for Raspberry Pi OS (Bookworm/Debian 12)
  • Grok Voice Chat: real-time voice conversation via the xAI Realtime API (push-to-talk)
  • Camera image recognition: Grok autonomously captures and analyzes camera frames based on the conversation context
  • UI overhaul: a modern glassmorphism interface

Main changes

Bookworm OS support

  • config.txt moved to /boot/firmware/
  • python-smbus → python3-smbus; pip requires --break-system-packages
  • SSL context fix (Python 3.13 compatibility)
  • raspivid → rpicam-vid
  • mimic1 → espeak (TTS)
  • Prevent crash when the PowerPlant (UPS) is not connected
  • Complete setup instructions added to SETUP.md

Grok Voice Chat

  • WebSocket proxy to the xAI Realtime API (/voiceChat)
  • Push-to-talk (G key starts the session, Space records)
  • Mic capture and audio playback in the browser (Web Audio API)
  • Watney introduces itself in Japanese on startup

Camera image recognition

  • Grok uses the camera autonomously via function calling
  • Snapshots are captured from the browser's video element (no stream interruption)
  • Images are analyzed with the Grok Vision API (grok-4-1-fast-non-reasoning)
  • Analysis results are spoken back naturally

UI

  • Glassmorphism (frosted-glass panels)
  • Rounded buttons and smooth animations
  • Pulse animation while recording
  • Voice chat transcript display

Test plan

  • Clean install on a Raspberry Pi 3A+ (following the steps in SETUP.md)
  • Verified browser access at https://<IP>:5000
  • G key starts voice chat → Space to talk
  • Camera image recognition triggers on questions about what the rover sees

obikata and others added 28 commits March 29, 2026 15:02
- Handle missing PowerPlant (UPS) hardware gracefully on I2C
- Add SETUP.md with full manual setup steps for new Raspberry Pi OS
- Document all Buster→Bookworm compatibility changes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Change TTSCommand from mimic1 to espeak in rover.conf
- Add branch checkout step to SETUP.md
- Minor doc fixes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use ssl.SSLContext(PROTOCOL_TLS_SERVER) instead of
ssl.create_default_context() which creates a client context
and breaks TLS on Python 3.13.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
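The fix above boils down to choosing the right kind of SSL context. A minimal sketch (the certificate paths are placeholders, not the PR's actual files):

```python
import ssl

# Broken on the server side: create_default_context() builds a *client*
# context (hostname checking on, server-cert verification), which breaks
# the TLS handshake on Python 3.13.
client_ctx = ssl.create_default_context()

# Correct server-side context, as the commit describes:
server_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
# server_ctx.load_cert_chain("cert.pem", "key.pem")  # placeholder paths
```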
raspivid is deprecated in Bookworm OS. Use libcamera-vid instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use ipv4.never-default to prevent eth0 from stealing the
default route from WiFi, which breaks internet access.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Command renamed from libcamera-vid to rpicam-vid in newer rpicam-apps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add WebSocket proxy endpoint /voiceChat in server.py
- Add [XAI] config section to rover.conf
- Add voice_chat.js for mic capture, audio playback, and transcript
- Add voice chat button and transcript UI to index.html
- Add CSS styles for voice chat UI
- Keyboard shortcut: G key to toggle voice chat

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
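As a rough sketch of how the proxy might pick up its credentials, the snippet below reads an [XAI] section from rover.conf and builds the connection parameters. The key names (ApiKey, RealtimeUrl) and the default endpoint URL are illustrative assumptions, not the PR's actual values:

```python
import configparser

def xai_connect_params(conf_path="rover.conf"):
    """Read the [XAI] config section and build Realtime API connection params.

    Key names and the default URL are illustrative guesses, not taken
    from the PR's server.py.
    """
    cfg = configparser.ConfigParser()
    cfg.read(conf_path)
    xai = cfg["XAI"]
    url = xai.get("RealtimeUrl", "wss://api.x.ai/v1/realtime")  # assumed default
    headers = {"Authorization": f"Bearer {xai['ApiKey']}"}
    return url, headers
```

The proxy endpoint would then open a client WebSocket to this URL and relay frames in both directions between the browser and xAI.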
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
xAI uses 'response.output_audio.delta' and
'response.output_audio_transcript.delta' instead of
'response.audio.delta' and 'response.audio_transcript.delta'.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
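One way to absorb this naming difference is a small translation table in the proxy, so the rest of the code can keep handling one set of event types. A hedged sketch (the PR may instead branch on both names directly):

```python
# Map xAI's realtime event names onto the older names the handler expects.
XAI_EVENT_ALIASES = {
    "response.output_audio.delta": "response.audio.delta",
    "response.output_audio_transcript.delta": "response.audio_transcript.delta",
}

def normalize_event(event: dict) -> dict:
    """Return a copy of the event with its type renamed if xAI uses an alias."""
    event_type = event.get("type")
    if event_type in XAI_EVENT_ALIASES:
        event = {**event, "type": XAI_EVENT_ALIASES[event_type]}
    return event
```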
- Enable server-side voice activity detection (server_vad)
- Set pcm16 input/output audio format
- Enable input audio transcription
- Set modalities to text and audio

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
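The four settings above would all travel in a single session.update event. The sketch below uses OpenAI-style realtime field names, which the event names in this PR suggest xAI follows; the exact payload in server.py may differ (and a later commit replaces server_vad with manual commit):

```python
import json

# Sketch of the session.update event this commit describes; field names
# are assumptions based on the OpenAI-style realtime schema.
session_update = {
    "type": "session.update",
    "session": {
        "modalities": ["text", "audio"],
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": {},           # enable transcription of mic input
        "turn_detection": {"type": "server_vad"},  # later disabled for push-to-talk
    },
}

wire_message = json.dumps(session_update)  # sent to xAI as a text frame
```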
Lower threshold and add silence/padding duration for
more reliable voice activity detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Reset playback time when it falls behind current time
- Reset playback time on new response to prevent queue buildup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Stop sending audio to xAI while AI response is playing
- Resume mic 500ms after response completes
- Prevents the "うん" ("uh-huh") response loop caused by echo/noise detection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
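The muting behavior is a small half-duplex gate. The real logic lives in the browser-side voice_chat.js; this is a language-neutral sketch of the same state machine, with the 500 ms delay from the commit:

```python
import time

class MicGate:
    """Half-duplex gate: mute the mic while AI audio plays, reopen 500 ms after.

    A sketch of the commit's behavior, not the PR's actual code.
    """
    RESUME_DELAY = 0.5  # seconds of grace after playback, per the commit

    def __init__(self):
        self._resume_at = 0.0  # mic starts open

    def on_playback_start(self):
        self._resume_at = float("inf")  # closed until playback finishes

    def on_playback_done(self, now=None):
        now = time.monotonic() if now is None else now
        self._resume_at = now + self.RESUME_DELAY

    def mic_open(self, now=None):
        now = time.monotonic() if now is None else now
        return now >= self._resume_at
```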
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Space key: hold to record, release to send
- G key: toggle voice chat session on/off
- Disable server VAD, use manual commit instead
- Red button indicator while recording
- Simpler and more reliable than VAD-based detection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add 3x gain node to amplify mic input
- Track if actual audio was captured during recording
- Only commit and request response if audio was detected
- Clear buffer on release if no audio detected

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
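The "only commit if audio was detected" check amounts to scanning the recorded PCM16 buffer for any sample above a noise floor. The actual check runs in voice_chat.js; here is an equivalent Python sketch (the threshold value is illustrative, and `array("h")` assumes a little-endian host, which matches the wire format):

```python
import array

def has_audio(pcm16: bytes, threshold: int = 500) -> bool:
    """True if any 16-bit sample in the buffer exceeds the noise threshold."""
    samples = array.array("h")  # signed 16-bit, native (little-endian) order
    samples.frombytes(pcm16)
    return any(abs(s) >= threshold for s in samples)
```

If this returns False when the Space key is released, the client clears the input buffer instead of committing it and requesting a response.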
- Auto-greet with self-introduction when voice chat starts
- Increase mic gain from 3x to 8x for better pickup
- Set instructions to always respond in Japanese

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Frosted glass panels with backdrop-filter blur
- Rounded corners and smooth transitions
- Flexbox layout for button panel
- Modern range slider and input styling
- Pulse animation on voice recording
- Inter font for cleaner typography
- Custom scrollbar for voice chat transcript

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reset current AI message div after each response completes,
so subsequent responses start on a new line instead of
appending to the same one.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Grok decides when to look through camera via function calling
- Browser captures frame from video element (no camera conflict)
- Server sends image to Grok Vision API for analysis
- Result is fed back to voice conversation naturally
- No video stream interruption

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
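The flow above can be sketched as a tool definition plus a dispatcher. The tool name, schema, and helper signatures below are illustrative stand-ins; the PR does not show its exact definitions:

```python
# Illustrative camera tool Grok can invoke via function calling.
CAMERA_TOOL = {
    "type": "function",
    "name": "look_through_camera",  # assumed name, not from the PR
    "description": "Capture the rover's current camera view for analysis.",
    "parameters": {"type": "object", "properties": {}, "required": []},
}

def handle_function_call(event, capture_frame, analyze_image):
    """Dispatch a realtime function_call event.

    capture_frame: grabs a base64 JPEG from the browser's <video> element,
    so the live stream is never interrupted.
    analyze_image: sends the frame to the Grok Vision API and returns a
    text description, which is fed back into the voice conversation.
    """
    if event.get("name") != "look_through_camera":
        return None
    image_b64 = capture_frame()
    return analyze_image(image_b64)
```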
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
grok-2-vision-latest does not exist in the xAI API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
grok-2-vision models have been retired. Current xAI API uses
grok-4 series models which natively support image input.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No longer needed after Vision API integration confirmed working.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- API key setup steps
- Permission requirements (Realtime + Chat)
- Usage guide (G key, Space for push-to-talk)
- Voice configuration options
- Security note about not committing API keys

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
