Summary
Add a squad watch health (or squad watch --health) command that shows the real-time health status of all running Squad Watch instances on the current machine. Today, diagnosing whether a watch is "stuck" vs "actively working" requires manual process tree inspection; this command should automate that.
Background / Problem
When squad watch runs, each round spawns a Copilot CLI session (via agency.exe) that can run background agents (sub-agent pwsh.exe shells). A round that appears "stuck" based on heartbeat age alone may actually be actively processing work through 30+ background agent shells.
Current diagnostic flow (manual, error-prone):
- Read heartbeat JSON -> see round running for 100+ min -> assume stuck
- Actually need to: find the agency session PID -> check its child copilot.exe -> check copilot.exe's children (pwsh.exe background agents) -> check if any have CPU activity
- Also need to check the process log for last non-heartbeat activity timestamp
- Only truly stuck if: copilot.exe exited AND agency.exe still alive (zombie MCP servers), OR zero child processes have any CPU activity
Key insight: A round with active background agents (pwsh.exe children with CPU > 0) is WORKING, not stuck. The heartbeat age alone is misleading.
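The manual walk above can be sketched as a pure function over a process snapshot. This is only an illustration: the `Proc` record, the `inspect_round` helper, and the reading of "CPU > 0" as "CPU time consumed since the previous sample" are assumptions, not part of Squad. It also assumes copilot.exe is a direct child of agency.exe, as the flow above describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    """One row of a process snapshot (hypothetical shape, not a Squad type)."""
    pid: int
    ppid: int
    name: str
    cpu_seconds: float  # assumed: CPU time consumed since the previous sample


def children(procs, parent_pid, name=None):
    """Return direct children of parent_pid, optionally filtered by image name."""
    return [
        p for p in procs
        if p.ppid == parent_pid and (name is None or p.name.lower() == name)
    ]


def inspect_round(procs, watch_pid):
    """Walk watch -> agency.exe -> copilot.exe -> pwsh.exe and summarize the round."""
    agency = children(procs, watch_pid, "agency.exe")
    copilot = [c for a in agency for c in children(procs, a.pid, "copilot.exe")]
    agents = [s for c in copilot for s in children(procs, c.pid, "pwsh.exe")]
    active = [s for s in agents if s.cpu_seconds > 0]
    return {
        "agency_alive": bool(agency),
        "copilot_alive": bool(copilot),
        "agents_total": len(agents),
        "agents_active": len(active),
    }
```

Keeping the walk separate from snapshot collection means the same logic could be fed from Get-CimInstance Win32_Process output on Windows or from /proc on Linux.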
Requirements
squad watch health output
Show a table for each running watch instance on this machine:
Squad Watch Health
==================
Repo             Round  Status   Duration  Agents  Last Activity
tamresearch1     R2     Working  98m       35      2m ago (pwsh CPU active)
tamresearch1-re  R113   Idle     2m        0       2m ago (between rounds)
content-empire   --     Custom   --        --      (no heartbeat)
Health determination logic
For each watch instance:
- Read heartbeat file (~/.squad/ralph-heartbeat-{repo}.json) -> get PID, round, status, lastRun
- Check if PID is alive -> if dead, status = Dead
- Find the agency session child process (child of ralph-watch PID)
- Find copilot.exe grandchild of the agency process
- Count active background agents -- pwsh.exe children of copilot.exe with CPU > 0
- Check process log for last non-heartbeat line timestamp
- Determine status:
  - Working -- copilot.exe alive AND (background agents with CPU > 0 OR last log activity < 5m ago)
  - Slow -- copilot.exe alive but no CPU activity for 10+ min AND no log activity for 10+ min
  - Stuck -- copilot.exe exited but agency.exe still alive (zombie MCP servers), OR PID dead, OR no activity for 30+ min
  - Unknown -- no heartbeat file (custom script, no monitoring)
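A minimal sketch of the status rules above as a pure function, so the thresholds can be unit-tested without touching real processes. The `Snapshot` fields and the `classify` name are hypothetical. The rules leave a few cases open, and the sketch makes explicit choices for them: a dead watch PID is reported as "Dead" (as in the PID check step, although the Stuck rule also lists it), a watch with no agency session is reported as "Idle" (mirroring the "between rounds" row in the sample output), and idle times between 5 and 10 minutes fall back to "Working" to avoid false alarms.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Inputs to the health decision (hypothetical shape)."""
    has_heartbeat: bool = True
    watch_alive: bool = True
    agency_alive: bool = True
    copilot_alive: bool = True
    active_agents: int = 0             # pwsh.exe children with CPU > 0
    cpu_idle_min: float = math.inf     # minutes since any child showed CPU activity
    log_idle_min: float = math.inf     # minutes since the last non-heartbeat log line


def classify(s):
    """Map a snapshot to a status string following the rules in the issue."""
    if not s.has_heartbeat:
        return "Unknown"
    if not s.watch_alive:
        return "Dead"  # assumption: the Stuck rule also covers this case
    if not s.copilot_alive:
        # copilot.exe gone but agency.exe alive is the zombie MCP pattern.
        return "Stuck" if s.agency_alive else "Idle"  # "Idle" is an assumption
    if s.active_agents > 0 or s.log_idle_min < 5:
        return "Working"
    if min(s.cpu_idle_min, s.log_idle_min) >= 30:
        return "Stuck"  # neither CPU nor log activity for 30+ min
    if s.cpu_idle_min >= 10 and s.log_idle_min >= 10:
        return "Slow"
    return "Working"  # assumption: the 5 to 10 minute gap is not defined above
```

With these rules, a round like the one in the real-world example, `classify(Snapshot(active_agents=35, log_idle_min=37))`, is reported as "Working" because active background agents take precedence over log age.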
Zombie MCP detection pattern
The specific zombie pattern discovered in production:
- copilot.exe finishes LLM conversation (log shows str_replace_editor_shutdown + telemetry flush)
- But agency.exe stays alive because MCP server child processes (voicemcp, aspire, configgen, enghub) keep heartbeating every 30s
- The process log tail is 100% heartbeat lines with no real activity
- Detection: Last non-heartbeat log line is 30+ min old while heartbeat lines are < 1 min old
- Filter heartbeat noise with: lines matching heartbeat|gateway|Start processing|Sending HTTP|Received HTTP|End processing
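The detection rule above could be sketched as follows. The noise pattern is taken verbatim from the issue; everything else is an assumption made for illustration: the helper names are hypothetical, each log line is assumed to start with an ISO-8601 timestamp that carries a UTC offset, and a log tail with no non-heartbeat lines at all is treated as a zombie candidate.

```python
import re
from datetime import datetime, timedelta, timezone

# Pattern from the issue, matched case-sensitively as written.
NOISE = re.compile(
    r"heartbeat|gateway|Start processing|Sending HTTP|Received HTTP|End processing"
)


def parse_ts(line):
    """Parse a leading ISO-8601 timestamp (assumed format); None if absent."""
    token = line.split(" ", 1)[0]
    try:
        return datetime.fromisoformat(token)
    except ValueError:
        return None


def last_timestamps(lines):
    """Return (last_real, last_noise) timestamps; either may be None."""
    last_real = last_noise = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip lines without a parseable timestamp
        if NOISE.search(line):
            last_noise = ts
        else:
            last_real = ts
    return last_real, last_noise


def looks_like_zombie(lines, now):
    """True when heartbeats are fresh (< 1 min) but real activity is 30+ min old."""
    last_real, last_noise = last_timestamps(lines)
    if last_noise is None or now - last_noise >= timedelta(minutes=1):
        return False  # no fresh heartbeat, so this is not the zombie pattern
    return last_real is None or now - last_real >= timedelta(minutes=30)
```

A caller would pass the tail of the latest process log and `datetime.now(timezone.utc)` as `now`; taking `now` as a parameter keeps the check deterministic in tests.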
Data sources
| Data | Location |
|---|---|
| Heartbeat | ~/.squad/ralph-heartbeat-{repo}.json |
| Process log | ~/.agency/logs/session_*/process-*.log (latest session dir) |
| Watch log | ~/.squad/ralph-watch-{repo}.log |
| Process tree | Win: Get-CimInstance Win32_Process, Linux: pstree or /proc |
| Scheduled tasks | Win: Get-ScheduledTask -TaskPath "\Squad\" |
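As a rough sketch of how the file-based sources might be located, the helpers below resolve the heartbeat file and the newest process log. The function names are hypothetical, the `home` parameter exists only to make the code testable, and "latest session dir" is assumed to mean the most recently modified `session_*` directory. The heartbeat's JSON key names are not specified in this issue, so the sketch returns the parsed object as-is.

```python
import json
from pathlib import Path


def heartbeat_path(repo, home=None):
    """Build ~/.squad/ralph-heartbeat-{repo}.json under the given home directory."""
    home = Path(home) if home else Path.home()
    return home / ".squad" / f"ralph-heartbeat-{repo}.json"


def read_heartbeat(repo, home=None):
    """Return the parsed heartbeat, or None when the file is absent or not valid JSON."""
    try:
        return json.loads(heartbeat_path(repo, home).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def latest_process_log(home=None):
    """Return the newest process-*.log in the newest session dir, or None."""
    home = Path(home) if home else Path.home()
    logs_root = home / ".agency" / "logs"
    sessions = [d for d in logs_root.glob("session_*") if d.is_dir()]
    if not sessions:
        return None
    # Assumption: "latest" means most recently modified.
    latest = max(sessions, key=lambda d: d.stat().st_mtime)
    logs = list(latest.glob("process-*.log"))
    return max(logs, key=lambda f: f.stat().st_mtime) if logs else None
```

Returning None for a missing heartbeat lets the caller map that case directly to the "no heartbeat" row in the output.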
CLI flags
squad watch health -- show all instances
squad watch health --json -- machine-readable output
squad watch health --repo <name> -- filter to specific repo
squad watch health --kill-stuck -- kill stuck instances and restart via scheduled task
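To pin down how the flags interact, here is a small model of their semantics using Python's argparse. It is not the Squad CLI's actual parser; `build_parser`, `select`, and the instance dictionaries are hypothetical. The one design choice it illustrates is that `--kill-stuck` should only ever target instances classified as Stuck, and only within the `--repo` filter if one is given.

```python
import argparse


def build_parser():
    """Model the flags listed above (illustrative, not the real Squad parser)."""
    parser = argparse.ArgumentParser(prog="squad watch health")
    parser.add_argument("--json", action="store_true", dest="as_json",
                        help="machine-readable output")
    parser.add_argument("--repo", metavar="NAME", default=None,
                        help="filter to a specific repo")
    parser.add_argument("--kill-stuck", action="store_true",
                        help="kill stuck instances and restart via scheduled task")
    return parser


def select(instances, args):
    """Return (rows to display, repos to kill) for the parsed flags."""
    rows = [i for i in instances if args.repo in (None, i["repo"])]
    # Only Stuck instances are candidates; Working and Slow are never killed.
    to_kill = [
        i["repo"] for i in rows
        if args.kill_stuck and i["status"] == "Stuck"
    ]
    return rows, to_kill
```

Separating selection from the kill action keeps a dry run cheap: the same `to_kill` list can be printed for review before anything is terminated.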
Real-world example
From debugging on CPC-tamir-3H7BI (2026-04-04):
- Ralph R2 appeared "stuck" at 98 min based on heartbeat age
- copilot.exe was ALIVE with ~35 pwsh.exe background agent shells as children
- LLM conversation finished at 10:33 UTC but background agents were still running
- Heartbeat-only diagnosis would have incorrectly killed a working session
- Proper diagnosis: check child process count + CPU activity = actively working
Acceptance Criteria
- squad watch health shows status of all watch instances on the machine
- --json flag for programmatic consumption
- --kill-stuck flag to auto-remediate