- Goal: test how an adaptive engagement speech strategy affects human-robot interaction, using the Elmo robot as the embodiment and an LLM-powered dialogue stack.
- Setting: controlled study in which Elmo opens with the same topic prompt; the engagement tactic (independent variable) is varied between runs.
- Measurement: user engagement time and qualitative behavior are observed while the VAD/STT/LLM/TTS pipeline runs in real time.
- Constraints: low latency from VAD to TTS; reproducible runs with logging suitable for later analysis.
```mermaid
flowchart LR
    A[User] -->|Speech Utterance| B(VAD & Audio Buffer)
    B -->|Speech Audio Chunk| C(Faster-Whisper ASR)
    C -->|Transcribed Text| D(Engagement Scoring LLM)
    D -->|Engagement Score 1-5| E(Dialogue Generation LLM)
    E -->|Generated Response Text| F(Text-to-Speech Model)
    F -->|Generated Audio Response| A
```
```mermaid
stateDiagram
    direction LR
    [*] --> SileroVAD: ~32ms audio segments
    state vad_choice <<choice>>
    SileroVAD --> vad_choice: Detected speech?
    vad_choice --> SileroVAD: No
    vad_choice --> FasterWhisper: Yes
    FasterWhisper --> EngagementLLM: Transcribed Text
    EngagementLLM --> ResponseLLM: Score 1-5 + Text
    ResponseLLM --> [*]: Generated Response
```
```mermaid
stateDiagram
    direction LR
    State_High_Engagement --> State_Low_Engagement: Short answers / Long pauses
    State_Low_Engagement --> Repair_Strategy: Trigger threshold (<40%)
    state Repair_Strategy {
        Explicit_Check --> User_Response
        Topic_Switch --> User_Response
    }
    User_Response --> State_High_Engagement: User elaborates
    User_Response --> End_Interaction: User ignores
```
- ELMO starts the conversation with the same question/topic
- Repair Strategy
- Survival Rate (how long the conversation lasts)
- Response Latency
- Syntactic Complexity (short-term responses)
- Subjective Scores (RoSAS, Godspeed)
- Logs from the LLM thought process
- Engagement score (1-5) is produced per user turn to gauge attention/interest; score plus rationale feed the response LLM.
- Strategies to vary: mirroring affect, asking clarifying questions, adding light humor, summarizing/reframing, prompting elaboration, or keeping neutral baseline.
- Topic management: keep the initial conversation topic fixed; track state (last turns, score trend) to adapt responses without drifting topics.
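One hedged way to sketch the turn-state tracking above: keep a short window of recent engagement scores and pick an adaptation tactic from the trend. The strategy names are taken from the list above; the window size and thresholds are purely illustrative.

```python
# Hypothetical sketch of score-trend tracking for strategy selection.
# Thresholds and window size are illustrative assumptions.

from collections import deque

class TurnState:
    def __init__(self, window: int = 3):
        self.scores: deque[int] = deque(maxlen=window)  # last N scores (1-5)

    def record(self, score: int) -> None:
        self.scores.append(score)

    def strategy(self) -> str:
        if not self.scores:
            return "neutral_baseline"     # no signal yet: stay neutral
        avg = sum(self.scores) / len(self.scores)
        if avg >= 4:
            return "prompt_elaboration"   # engaged: invite the user to go deeper
        if avg >= 2.5:
            return "clarifying_question"  # middling: check understanding
        return "light_humor"              # low: try to re-engage
```

Keeping the decision stateless apart from the score window makes runs easy to log and reproduce; the fixed topic is never part of the decision.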
The source code for the project is located inside the SRProject directory.
To install the dependencies locally, run:

```shell
poetry install
```

To run the project:

```shell
poetry run python src/group-13/session.py
```

Be sure to populate the following environment variables (.env):
- `ENGAGEMENT_API_KEY`: your Gemini API key for the Engagement Evaluator model
- `DIALOGUE_API_KEY`: your Gemini API key for the Dialogue model
- `ENVIRONMENT`: either `ENGAGEMENT` or `CONTROL`, depending on the hypothesis being tested (defaults to `ENGAGEMENT`)
- `ROBOT_IP`: your robot's IP; if absent, the code runs in dry mode (without a robot)
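A minimal example `.env` using the variables above (the key and IP values are placeholders):

```ini
ENGAGEMENT_API_KEY=your-gemini-key-here
DIALOGUE_API_KEY=your-gemini-key-here
# ENGAGEMENT or CONTROL (defaults to ENGAGEMENT)
ENVIRONMENT=ENGAGEMENT
# omit ROBOT_IP to run in dry mode without a robot
ROBOT_IP=192.168.1.42
```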
- `src/group-13/vad.py`: VAD prototype using Silero.
- `src/group-13/stt.py`: STT prototype using Faster-Whisper.
- `src/group-13/main.py`: entry point; will orchestrate audio capture, VAD, STT, engagement scoring, response generation, TTS, and robot output.
- `TODO.md`: high-level project to-do list (setup, pipeline integration, study logistics).
- `assets/`: supporting artifacts (e.g., prompts, audio); fill as needed.
- LLM (Gemini) requires API keys; load them from env vars or a local `.env`/`config.toml` (not checked in). TTS uses gTTS locally (no Gemini TTS). When no Gemini key/library is present, the pipeline falls back to heuristics for engagement scoring and response generation.
- Audio I/O: choose the correct input/output devices; ensure the sample rate is compatible with the VAD/STT pipeline.
- Robot I/O: Elmo integration will need network/serial commands from https://github.com/S-Andrade/Robots/tree/main/Elmo; specify connection details in a config block.
- Prepare consent and briefing scripts; keep conversation topic constant.
- Validate audio path end-to-end (mic → VAD → STT → LLMs → TTS → robot speaker) before each session; heuristic fallback runs when Gemini keys are absent.
- Log timestamps, engagement scores, selected strategy, and user responses for analysis (CSV/JSON).
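A hedged sketch of such a per-turn log record, written as JSON Lines (one JSON object per line, which is easy to append to during a session and to load for analysis). The field names are illustrative, not a fixed schema.

```python
# Hypothetical per-turn logger: appends one JSON object per line.
# Field names are illustrative assumptions.

import json
import time

def log_turn(path: str, score: int, strategy: str, user_text: str) -> dict:
    record = {
        "timestamp": time.time(),    # seconds since epoch
        "engagement_score": score,   # 1-5 from the scoring LLM
        "strategy": strategy,        # adaptation tactic chosen this turn
        "user_text": user_text,      # transcribed user turn
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")  # JSON Lines format
    return record
```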
- Provide a fallback "no-robot" mode for laptop-only dry runs and demos.
- coqui-ai/TTS (optional fallback)
- Maximum of 10 turns
| Variable | Metric (Operationalization) | Supporting Citation |
|---|---|---|
| IV: Repair Strategy | Control: no deviation from script. Explicit: "meta-comment" on silence (e.g., "Are you still there?"). Implicit: "topic switch" or "humor" injection. | Fischer et al. (2019) vs. Niculescu et al. (2013) |
| DV: Survival Rate | Time (seconds) until the user says "Stop" or leaves. | Sidner et al. (2005) |
| DV: Response Latency | Avg. time (ms) between Robot_End and User_Start. | Skantze (2021) |
| DV: Verbal Density | Avg. words per turn (excluding "stop words"). | Ben-Youssef et al. (2017) |
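As a sketch, the three dependent variables in the table above can be computed from per-turn log records roughly as follows. The field names and the stop-word set are assumptions for illustration, not the study's actual schema.

```python
# Hypothetical DV computations over a list of per-turn records.
# Field names ("user_start", "user_end", "robot_end", "user_text")
# and the stop-word set are illustrative assumptions.

STOP_WORDS = {"the", "a", "an", "uh", "um"}  # tiny illustrative set

def survival_rate(turns: list[dict]) -> float:
    """Seconds from the first user turn start to the last turn end."""
    return turns[-1]["user_end"] - turns[0]["user_start"]

def mean_response_latency(turns: list[dict]) -> float:
    """Average gap (seconds) between robot finishing and user starting."""
    gaps = [t["user_start"] - t["robot_end"] for t in turns]
    return sum(gaps) / len(gaps)

def verbal_density(turns: list[dict]) -> float:
    """Average words per user turn, excluding stop words."""
    counts = [
        sum(1 for w in t["user_text"].lower().split() if w not in STOP_WORDS)
        for t in turns
    ]
    return sum(counts) / len(counts)
```

The latency sketch returns seconds; multiply by 1000 to report milliseconds as in the table.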