feat: squad watch health — show running watch instance status with background agent detection

## Summary

Add a `squad watch health` (or `squad watch --health`) command that shows the real-time health of every running Squad Watch instance on the current machine. Today, telling a "stuck" watch apart from an "actively working" one requires manual process-tree inspection; this command should automate that.

## Background / Problem

When `squad watch` runs, each round spawns a Copilot CLI session (via `agency.exe`) that can run background agents (sub-agent `pwsh.exe` shells). A round that appears "stuck" based on heartbeat age alone may actually be actively processing work through 30+ background agent shells.

**Current diagnostic flow (manual, error-prone):**
1. Read the heartbeat JSON -> see a round running for 100+ min -> assume it is stuck
2. What is actually needed: find the agency session PID -> check its child `copilot.exe` -> check `copilot.exe`'s children (`pwsh.exe` background agents) -> check whether any have CPU activity
3. Also check the process log for the timestamp of the last non-heartbeat activity
4. The round is truly stuck only if: `copilot.exe` exited AND `agency.exe` is still alive (zombie MCP servers), OR zero child processes have any CPU activity

**Key insight:** A round with active background agents (pwsh.exe children with CPU > 0) is WORKING, not stuck. The heartbeat age alone is misleading.

## Requirements

### `squad watch health` output

Show a table for each running watch instance on this machine:

```
Squad Watch Health
==================
Repo              Round  Status       Duration  Agents  Last Activity
tamresearch1      R2     Working      98m       35      2m ago (pwsh CPU active)
tamresearch1-re   R113   Idle         2m        0       2m ago (between rounds)
content-empire    --     Unknown      --        --      (no heartbeat)
```
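As a rough illustration of the layout above, the following sketch renders rows into the same fixed-width columns. The `HealthRow` type, the `render` function, and the column widths (measured from the sample output) are hypothetical and not part of any existing Squad code.

```python
from dataclasses import dataclass


@dataclass
class HealthRow:
    """One line of the health table (hypothetical shape)."""
    repo: str
    round: str
    status: str
    duration: str
    agents: str
    last_activity: str


# Column widths measured from the sample output; the last column is unpadded.
COLUMNS = [("Repo", 18), ("Round", 7), ("Status", 13), ("Duration", 10), ("Agents", 8)]


def render(rows):
    """Return the health table as a single string."""
    header = "".join(name.ljust(width) for name, width in COLUMNS) + "Last Activity"
    lines = ["Squad Watch Health", "==================", header]
    for row in rows:
        cells = [row.repo, row.round, row.status, row.duration, row.agents]
        padded = "".join(cell.ljust(width) for cell, (_, width) in zip(cells, COLUMNS))
        lines.append((padded + row.last_activity).rstrip())
    return "\n".join(lines)
```

Values wider than their column are not truncated in this sketch, so a long repo name would push the rest of its row to the right.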

### Health determination logic

For each watch instance:

1. **Read heartbeat file** (`~/.squad/ralph-heartbeat-{repo}.json`) -> get PID, round, status, lastRun
2. **Check if PID is alive** -> if dead, status = Dead
3. **Find the agency session child process** (child of ralph-watch PID)
4. **Find copilot.exe grandchild** of the agency process
5. **Count active background agents** -- pwsh.exe children of copilot.exe with CPU > 0
6. **Check process log** for last non-heartbeat line timestamp
7. **Determine status:**
   - **Working** -- copilot.exe alive AND (background agents with CPU > 0 OR last log activity < 5m ago)
   - **Slow** -- copilot.exe alive but no CPU activity for 10+ min AND no log activity for 10+ min
   - **Stuck** -- copilot.exe exited but agency.exe still alive (zombie MCP servers), OR PID dead, OR no activity for 30+ min
   - **Unknown** -- no heartbeat file (custom script, no monitoring)
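The status rules in step 7 could be sketched as a pure function over already-collected facts. The `Probe` type and its field names are hypothetical. Three choices here are assumptions rather than requirements: a dead watch PID is reported as `Dead` (step 2), although the Stuck rule also lists it; `Idle` is returned when neither `copilot.exe` nor `agency.exe` is running, based only on the sample output; and the unspecified gap between 5 and 10 minutes of inactivity falls through to `Working`.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """Facts gathered for one watch instance (hypothetical shape)."""
    has_heartbeat: bool
    watch_pid_alive: bool
    agency_alive: bool
    copilot_alive: bool
    active_agents: int     # pwsh.exe children of copilot.exe with CPU > 0
    cpu_idle_min: float    # minutes since any child process showed CPU activity
    log_idle_min: float    # minutes since the last non-heartbeat log line


def classify(p: Probe) -> str:
    if not p.has_heartbeat:
        return "Unknown"
    if not p.watch_pid_alive:
        return "Dead"  # assumption: follows step 2 rather than the Stuck rule
    if not p.copilot_alive:
        # copilot.exe gone but agency.exe alive is the zombie MCP pattern.
        # "Idle" (nothing running between rounds) is an assumption.
        return "Stuck" if p.agency_alive else "Idle"
    if p.active_agents > 0 or p.log_idle_min < 5:
        return "Working"
    if p.cpu_idle_min >= 30 and p.log_idle_min >= 30:
        return "Stuck"
    if p.cpu_idle_min >= 10 and p.log_idle_min >= 10:
        return "Slow"
    return "Working"  # assumption: the 5 to 10 minute gap is not specified
```

Keeping the classification free of I/O makes the thresholds easy to unit-test against recorded incidents.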

### Zombie MCP detection pattern

The specific zombie pattern discovered in production:
- copilot.exe finishes LLM conversation (log shows `str_replace_editor_shutdown` + telemetry flush)
- But agency.exe stays alive because MCP server child processes (voicemcp, aspire, configgen, enghub) keep heartbeating every 30s
- The process log tail is 100% heartbeat lines with no real activity
- **Detection:** Last non-heartbeat log line is 30+ min old while heartbeat lines are < 1 min old
- Filter heartbeat noise with: lines matching `heartbeat|gateway|Start processing|Sending HTTP|Received HTTP|End processing`
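The detection rule above can be sketched as follows. The process log's line format is not described here, so this sketch assumes the caller has already parsed the log tail into `(timestamp, text)` pairs covering at least the last 30 minutes. The function names are hypothetical, and the filter pattern is used exactly as written above (case-sensitive).

```python
import re
from datetime import datetime, timedelta

# Noise filter copied from the issue text.
NOISE = re.compile(
    r"heartbeat|gateway|Start processing|Sending HTTP|Received HTTP|End processing"
)


def last_real_activity(entries):
    """Timestamp of the newest non-heartbeat entry, or None if there is none."""
    real = [ts for ts, text in entries if not NOISE.search(text)]
    return max(real, default=None)


def looks_like_zombie(entries, now):
    """True when heartbeat lines are fresh (< 1 min) but real activity is 30+ min old."""
    noise = [ts for ts, text in entries if NOISE.search(text)]
    last_noise = max(noise, default=None)
    if last_noise is None or now - last_noise >= timedelta(minutes=1):
        return False  # heartbeats are not fresh, so this is not the zombie pattern
    last_real = last_real_activity(entries)
    # Assumption: no real line at all in the supplied window counts as stale.
    return last_real is None or now - last_real >= timedelta(minutes=30)
```

For example, a tail whose only non-heartbeat line is a 45-minute-old `str_replace_editor_shutdown` entry, followed by heartbeats from the last few seconds, would be flagged.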

### Data sources

| Data | Location |
|------|----------|
| Heartbeat | `~/.squad/ralph-heartbeat-{repo}.json` |
| Process log | `~/.agency/logs/session_*/process-*.log` (latest session dir) |
| Watch log | `~/.squad/ralph-watch-{repo}.log` |
| Process tree | Win: `Get-CimInstance Win32_Process`, Linux: `pstree`/`/proc` |
| Scheduled tasks | Win: `Get-ScheduledTask -TaskPath "\Squad\"` |
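Whichever platform source supplies the process tree, the walk itself (watch PID -> `agency.exe` -> `copilot.exe` -> `pwsh.exe`) could be sketched over a plain snapshot. The `Proc` type and `inspect_round` function are hypothetical. `cpu_delta` is assumed to be the CPU time a process consumed between two samples, because cumulative CPU time is greater than zero for nearly every process and would not indicate current activity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    """One row of a process snapshot (hypothetical shape)."""
    pid: int
    ppid: int
    name: str
    cpu_delta: float  # assumption: CPU seconds used between two samples


def find_children(procs, parent_pid, name):
    """Direct children of parent_pid whose executable name matches (case-insensitive)."""
    return [p for p in procs if p.ppid == parent_pid and p.name.lower() == name]


def inspect_round(procs, watch_pid):
    """Walk watch -> agency.exe -> copilot.exe -> pwsh.exe and summarize."""
    result = {"agency_alive": False, "copilot_alive": False,
              "total_agents": 0, "active_agents": 0}
    agency = find_children(procs, watch_pid, "agency.exe")
    if not agency:
        return result
    result["agency_alive"] = True
    copilot = find_children(procs, agency[0].pid, "copilot.exe")
    if not copilot:
        return result  # agency.exe without copilot.exe: possible zombie MCP servers
    result["copilot_alive"] = True
    agents = find_children(procs, copilot[0].pid, "pwsh.exe")
    result["total_agents"] = len(agents)
    result["active_agents"] = sum(1 for a in agents if a.cpu_delta > 0)
    return result
```

Separating the snapshot from the walk keeps the Windows and Linux collectors small: each only has to produce a list of `Proc` records.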

### CLI flags

- `squad watch health` -- show all instances
- `squad watch health --json` -- machine-readable output
- `squad watch health --repo <name>` -- filter to specific repo
- `squad watch health --kill-stuck` -- kill stuck instances and restart via scheduled task
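To make the flag surface concrete, here is an illustrative parser for the flags listed above, written with Python's standard `argparse`. It is only a sketch of the intended interface: the real `squad` CLI may be built with a different framework, and `build_parser` and the `as_json` attribute name are hypothetical.

```python
import argparse


def build_parser():
    """Parser for `squad watch health` and its flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="squad")
    commands = parser.add_subparsers(dest="command", required=True)

    watch = commands.add_parser("watch")
    watch_commands = watch.add_subparsers(dest="watch_command")

    health = watch_commands.add_parser("health", help="show all instances")
    health.add_argument("--json", dest="as_json", action="store_true",
                        help="machine-readable output")
    health.add_argument("--repo", metavar="NAME",
                        help="filter to a specific repo")
    health.add_argument("--kill-stuck", action="store_true",
                        help="kill stuck instances and restart via scheduled task")
    return parser
```

Parsing `["watch", "health", "--repo", "tamresearch1", "--json"]` with this parser yields `repo="tamresearch1"`, `as_json=True`, and `kill_stuck=False`.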

### Real-world example

From debugging on CPC-tamir-3H7BI (2026-04-04):
- Ralph R2 appeared "stuck" at 98 min based on heartbeat age
- copilot.exe was ALIVE with ~35 pwsh.exe background agent shells as children
- LLM conversation finished at 10:33 UTC but background agents were still running
- Heartbeat-only diagnosis would have incorrectly killed a working session
- Proper diagnosis: checking the child process count and CPU activity showed the session was actively working

## Acceptance Criteria

- [ ] `squad watch health` shows status of all watch instances on the machine
- [ ] Correctly distinguishes "working with background agents" from "stuck with zombie MCPs"
- [ ] Shows background agent count and last real activity timestamp
- [ ] `--json` flag for programmatic consumption
- [ ] `--kill-stuck` flag to auto-remediate
- [ ] Works on Windows (primary) and Linux (secondary)
