
feat: squad watch health — show running watch instance status with background agent detection #808

@tamirdresher


Summary

Add a squad watch health (or squad watch --health) command that shows the real-time health status of every running Squad Watch instance on the current machine. Today, telling a "stuck" watch from one that is "actively working" requires manual process-tree inspection; this command should automate that.

Background / Problem

When squad watch runs, each round spawns a Copilot CLI session (via agency.exe) that can run background agents (sub-agent pwsh.exe shells). A round that appears "stuck" based on heartbeat age alone may actually be actively processing work through 30+ background agent shells.

Current diagnostic flow (manual, error-prone):

  1. Read the heartbeat JSON, see that the round has been running for 100+ min, and assume it is stuck
  2. What is actually needed: find the agency session PID -> check its child copilot.exe -> check copilot.exe's children (pwsh.exe background agents) -> check whether any of them have CPU activity
  3. Also check the process log for the timestamp of the last non-heartbeat activity
  4. The round is truly stuck only if copilot.exe has exited while agency.exe is still alive (zombie MCP servers), or if no child process shows any CPU activity

Key insight: A round with active background agents (pwsh.exe children with CPU > 0) is WORKING, not stuck. The heartbeat age alone is misleading.
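
The process-tree walk described above can be sketched as a pure function over a process snapshot. This is only an illustration: the Proc record, its cpu_delta field, and the helper names are hypothetical, and the snapshot itself is assumed to come from Get-CimInstance Win32_Process or /proc, sampled twice so that CPU use since the previous sample is known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    """One row of a process snapshot (hypothetical shape)."""
    pid: int
    ppid: int
    name: str
    cpu_delta: float  # CPU seconds used since the previous sample (assumed input)


def descendants(procs, root_pid):
    """Return every process below root_pid in the tree, in no particular order."""
    found, frontier, seen = [], [root_pid], {root_pid}
    while frontier:
        parent = frontier.pop()
        for p in procs:
            if p.ppid == parent and p.pid not in seen:
                seen.add(p.pid)
                found.append(p)
                frontier.append(p.pid)
    return found


def find_copilot(procs, watch_pid):
    """Walk watch -> agency -> copilot and return the copilot Proc, or None."""
    for agency in (p for p in procs if p.ppid == watch_pid):
        if not agency.name.lower().startswith("agency"):
            continue
        for d in descendants(procs, agency.pid):
            if d.name.lower().startswith("copilot"):
                return d
    return None


def active_agents(procs, copilot_pid):
    """pwsh children of copilot that used any CPU since the last sample."""
    return [
        p for p in procs
        if p.ppid == copilot_pid
        and p.name.lower().startswith("pwsh")
        and p.cpu_delta > 0
    ]
```

Matching on a name prefix rather than the exact .exe name is a choice made here so the same sketch could apply on Linux, where the binaries carry no extension.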

Requirements

squad watch health output

Show a table for each running watch instance on this machine:

Squad Watch Health
==================
Repo              Round  Status       Duration  Agents  Last Activity
tamresearch1      R2     Working      98m       35      2m ago (pwsh CPU active)
tamresearch1-re   R113   Idle         2m        0       2m ago (between rounds)
content-empire    --     Unknown      --        --      (no heartbeat)

Health determination logic

For each watch instance:

  1. Read heartbeat file (~/.squad/ralph-heartbeat-{repo}.json) -> get PID, round, status, lastRun
  2. Check if PID is alive -> if dead, report the instance as Stuck (see step 7)
  3. Find the agency session child process (child of ralph-watch PID)
  4. Find the copilot.exe child of the agency process (a grandchild of the ralph-watch PID)
  5. Count active background agents -- pwsh.exe children of copilot.exe with CPU > 0
  6. Check process log for last non-heartbeat line timestamp
  7. Determine status:
    • Working -- copilot.exe alive AND (background agents with CPU > 0 OR last log activity < 5m ago)
    • Slow -- copilot.exe alive but no CPU activity for 10+ min AND no log activity for 10+ min
    • Stuck -- copilot.exe exited but agency.exe still alive (zombie MCP servers), OR PID dead, OR no activity for 30+ min
    • Unknown -- no heartbeat file (custom script, no monitoring)
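
A minimal sketch of step 7 as a pure function, assuming the earlier steps have already been collected into a single record. The Probe type and its field names are hypothetical. Two cases are not spelled out in the list above and are filled with labeled assumptions: an instance with no agency session is reported as Idle (as in the example table), and the 5-10 minute window falls through to Working.

```python
from dataclasses import dataclass
from typing import Optional

INF = float("inf")


@dataclass
class Probe:
    """Facts gathered in steps 1-6 (hypothetical shape)."""
    has_heartbeat: bool
    watch_alive: bool
    agency_alive: bool
    copilot_alive: bool
    active_agents: int                  # pwsh children with CPU > 0
    minutes_since_cpu: Optional[float]  # None when no CPU activity was ever seen
    minutes_since_log: Optional[float]  # age of the last non-heartbeat log line


def classify(p: Probe) -> str:
    if not p.has_heartbeat:
        return "Unknown"
    if not p.watch_alive:
        return "Stuck"  # step 7 lists a dead PID under Stuck
    if not p.copilot_alive:
        # agency alive without copilot is the zombie MCP pattern.
        # Assumption: no agency session at all means "between rounds".
        return "Stuck" if p.agency_alive else "Idle"

    cpu_idle = INF if p.minutes_since_cpu is None else p.minutes_since_cpu
    log_idle = INF if p.minutes_since_log is None else p.minutes_since_log

    if p.active_agents > 0 or log_idle < 5:
        return "Working"
    if cpu_idle >= 30 and log_idle >= 30:
        return "Stuck"
    if cpu_idle >= 10 and log_idle >= 10:
        return "Slow"
    return "Working"  # assumption: the unspecified 5-10 min window is not yet Slow
```

Checking active agents before any age threshold is what keeps a long-running round with busy background shells from being reported as stuck.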

Zombie MCP detection pattern

The specific zombie pattern discovered in production:

  • copilot.exe finishes LLM conversation (log shows str_replace_editor_shutdown + telemetry flush)
  • But agency.exe stays alive because MCP server child processes (voicemcp, aspire, configgen, enghub) keep heartbeating every 30s
  • The process log tail is 100% heartbeat lines with no real activity
  • Detection: Last non-heartbeat log line is 30+ min old while heartbeat lines are < 1 min old
  • Filter heartbeat noise with: lines matching heartbeat|gateway|Start processing|Sending HTTP|Received HTTP|End processing
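
The detection rule above could look roughly like this. The noise pattern is the one given in the list; everything else is an assumption: the process log line format is not specified in this issue, so the sketch assumes each line starts with an ISO 8601 timestamp, matches case-insensitively, and treats a tail with no real lines at all as a zombie.

```python
import re
from datetime import datetime, timedelta

# Noise pattern taken from the list above; case-insensitive matching is an assumption.
NOISE = re.compile(
    r"heartbeat|gateway|Start processing|Sending HTTP|Received HTTP|End processing",
    re.IGNORECASE,
)
# Assumed line prefix, e.g. "2026-04-04T10:33:00 ..."; the real format may differ.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")


def last_timestamps(lines):
    """Return (last real activity, last heartbeat noise) as datetimes or None."""
    last_real = last_noise = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip lines without a parseable timestamp
        ts = datetime.fromisoformat(match.group(1))
        if NOISE.search(line):
            last_noise = ts
        else:
            last_real = ts
    return last_real, last_noise


def looks_like_zombie(lines, now):
    """True when heartbeats are fresh (< 1 min) but real activity is 30+ min old."""
    last_real, last_noise = last_timestamps(lines)
    if last_noise is None or now - last_noise >= timedelta(minutes=1):
        return False
    # Assumption: a tail made up only of heartbeat lines also counts as a zombie.
    return last_real is None or now - last_real >= timedelta(minutes=30)
```

Passing now in explicitly keeps the function deterministic, which makes it easy to replay against a saved log tail.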

Data sources

Data             Location
Heartbeat        ~/.squad/ralph-heartbeat-{repo}.json
Process log      ~/.agency/logs/session_*/process-*.log (latest session dir)
Watch log        ~/.squad/ralph-watch-{repo}.log
Process tree     Win: Get-CimInstance Win32_Process, Linux: pstree or /proc
Scheduled tasks  Win: Get-ScheduledTask -TaskPath "\Squad\"
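
As a sketch of how the file-based sources might be located, using only the paths listed above. The function names are hypothetical, "latest session dir" is assumed to mean the session_* directory with the newest modification time, and the heartbeat is returned as a plain dict because its exact key names are not given here.

```python
import json
from pathlib import Path


def heartbeat_path(repo, home=None):
    """Path of the heartbeat file for a repo; home defaults to the user's home."""
    base = Path(home) if home else Path.home()
    return base / ".squad" / f"ralph-heartbeat-{repo}.json"


def read_heartbeat(repo, home=None):
    """Return the parsed heartbeat dict, or None when it is missing or unreadable."""
    try:
        return json.loads(heartbeat_path(repo, home).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def latest_process_logs(home=None):
    """Return process-*.log files from the most recently modified session dir."""
    base = Path(home) if home else Path.home()
    sessions = [d for d in (base / ".agency" / "logs").glob("session_*") if d.is_dir()]
    if not sessions:
        return []
    newest = max(sessions, key=lambda d: d.stat().st_mtime)
    return sorted(newest.glob("process-*.log"))
```

A missing heartbeat returns None rather than raising, so the caller can map that case directly to the no-heartbeat status.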

CLI flags

  • squad watch health -- show all instances
  • squad watch health --json -- machine-readable output
  • squad watch health --repo <name> -- filter to specific repo
  • squad watch health --kill-stuck -- kill stuck instances and restart via scheduled task
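
For illustration only, the flag surface above expressed with Python's argparse. This is not how the Squad CLI is implemented; it just pins down the intended semantics: two boolean switches and one optional repo filter, with no filter meaning all instances.

```python
import argparse


def build_parser():
    """Parser for the proposed `squad watch health` flags (illustrative sketch)."""
    parser = argparse.ArgumentParser(prog="squad watch health")
    parser.add_argument("--json", action="store_true",
                        help="emit machine-readable output")
    parser.add_argument("--repo", metavar="NAME", default=None,
                        help="only show the instance for this repo")
    parser.add_argument("--kill-stuck", action="store_true",
                        help="kill stuck instances and restart via scheduled task")
    return parser


def select_instances(instances, repo=None):
    """Apply the --repo filter; instances are dicts with a 'repo' key (assumed)."""
    if repo is None:
        return list(instances)
    return [i for i in instances if i.get("repo") == repo]
```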

Real-world example

From debugging on CPC-tamir-3H7BI (2026-04-04):

  • Ralph R2 appeared "stuck" at 98 min based on heartbeat age
  • copilot.exe was ALIVE with ~35 pwsh.exe background agent shells as children
  • LLM conversation finished at 10:33 UTC but background agents were still running
  • Heartbeat-only diagnosis would have incorrectly killed a working session
  • Proper diagnosis: checking the child process count and CPU activity showed the session was actively working

Acceptance Criteria

  • squad watch health shows status of all watch instances on the machine
  • Correctly distinguishes "working with background agents" from "stuck with zombie MCPs"
  • Shows background agent count and last real activity timestamp
  • --json flag for programmatic consumption
  • --kill-stuck flag to auto-remediate
  • Works on Windows (primary) and Linux (secondary)

Metadata


Labels

enhancement (New feature or improvement)
