| 1 | +--- |
| 2 | +title: "Monitoring vSphere cluster health with check_vsphere" |
| 3 | +date: 2026-04-01 |
| 4 | +--- |
| 5 | + |
| 6 | +## What's new? |
| 7 | + |
| 8 | +The `cluster-health` command in **check_vsphere** inspects the members of a |
| 9 | +vSphere cluster, checks their state, and decides whether the cluster as a whole is |
| 10 | +healthy. By default it treats nodes that are *disconnected* or *in maintenance* |
| 11 | +as faulty. Use `--faulty` to change which states count |
| 12 | +as a failure. |
| 13 | + |
| 14 | +## How the threshold works |
| 15 | + |
| 16 | +You tell the command when to raise a warning or a critical alert with the |
| 17 | +`--cluster-threshold` flag: |
| 18 | + |
| 19 | +``` |
| 20 | +[max_members:]warn_threshold:crit_threshold |
| 21 | +``` |
| 22 | + |
| 23 | +* `max_members` (optional) - Apply the rule to clusters with up to this many members. |
| 24 | +* `warn_threshold` - Number or percentage of faulty nodes that triggers a **WARN**. |
| 25 | +* `crit_threshold` - Number or percentage of faulty nodes that triggers a **CRIT**. |
| 26 | + |
| 27 | +You can pass `--cluster-threshold` several times to cover different cluster sizes. |
| 28 | +A rule applies to clusters with at most `max_members` members; if several rules match, the |
| 29 | +rule with the smallest `max_members` wins. One rule must omit `max_members`; that one is the |
| 30 | +fallback. |
| 31 | + |
| 32 | +## Quick examples |
| 33 | + |
| 34 | +* `3:1:1` - For clusters up to 3 nodes: a single faulty node is already critical (warning and critical thresholds are equal). |
| 35 | +* `5:1:3` - For clusters up to 5 nodes: warn at 1 or more faulty nodes, critical at 3 or more. |
| 36 | +* `10:2:5` - For clusters up to 10 nodes: warn at 2 faulty nodes, critical at 5. |
| 37 | +* `50:5:15` - For clusters up to 50 nodes: warn at 5 faulty nodes, critical at 15. |
| 38 | +* `10%:20%` - Fallback for larger clusters: warn at 10% faulty nodes, critical at 20%. |
| 39 | + |
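To make the rule selection concrete, here is a minimal Python sketch of how the example rules above could be evaluated. It is an illustration only, not the actual check_vsphere implementation: the names `Rule`, `parse_rule`, `pick_rule`, `reached` and `evaluate` are hypothetical, and it assumes that a threshold is reached when the faulty count (or the faulty percentage of all members) is greater than or equal to it.

```python
from dataclasses import dataclass
from typing import List, Optional

# Conventional Nagios/Naemon plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


@dataclass(frozen=True)
class Rule:
    max_members: Optional[int]  # None marks the fallback rule
    warn: str                   # e.g. "2" or "10%"
    crit: str                   # e.g. "5" or "20%"


def parse_rule(spec: str) -> Rule:
    """Parse '[max_members:]warn_threshold:crit_threshold'."""
    parts = spec.split(":")
    if len(parts) == 2:
        return Rule(None, parts[0], parts[1])
    if len(parts) == 3:
        return Rule(int(parts[0]), parts[1], parts[2])
    raise ValueError(f"invalid threshold spec: {spec!r}")


def pick_rule(rules: List[Rule], members: int) -> Rule:
    """Smallest matching max_members wins; otherwise use the fallback."""
    sized = [r for r in rules
             if r.max_members is not None and members <= r.max_members]
    if sized:
        return min(sized, key=lambda r: r.max_members)
    for r in rules:
        if r.max_members is None:
            return r
    raise ValueError("no fallback rule (one rule must omit max_members)")


def reached(threshold: str, faulty: int, members: int) -> bool:
    """Assumption: a threshold is reached at 'greater than or equal'."""
    if threshold.endswith("%"):
        return members > 0 and faulty * 100 >= float(threshold[:-1]) * members
    return faulty >= int(threshold)


def evaluate(specs: List[str], members: int, faulty: int) -> int:
    rule = pick_rule([parse_rule(s) for s in specs], members)
    if reached(rule.crit, faulty, members):
        return CRITICAL  # checked first, so it wins when warn == crit
    if reached(rule.warn, faulty, members):
        return WARNING
    return OK


SPECS = ["3:1:1", "5:1:3", "10:2:5", "50:5:15", "10%:20%"]
print(evaluate(SPECS, members=4, faulty=1))     # 1 (WARNING): the 5:1:3 rule applies
print(evaluate(SPECS, members=100, faulty=20))  # 2 (CRITICAL): fallback, 20% faulty
```

Checking the critical threshold first is what makes `3:1:1` go straight to critical on the first faulty node.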
| 40 | +## Usage snippet |
| 41 | + |
| 42 | +```bash |
| 43 | +check_vsphere cluster-health \ |
| 44 | + --host vcenter.example.com \ |
| 45 | + -u naemon@vsphere.local \ |
| 46 | + --cluster-threshold 3:1:1 \ |
| 47 | + --cluster-threshold 5:1:3 \ |
| 48 | + --cluster-threshold 10:2:5 \ |
| 49 | + --cluster-threshold 50:5:15 \ |
| 50 | + --cluster-threshold '10%:20%' \ |
| 51 | + --cluster-name MyCluster |
| 52 | +``` |
| 53 | + |
| 54 | +## Naemon integration |
| 55 | + |
| 56 | +``` |
| 57 | +define command{ |
| 58 | + command_name check_vsphere_cluster_health |
| 59 | + command_line VSPHERE_PASS=$ARG4$ $USER2$/check_vsphere cluster-health \ |
| 60 | + -u $ARG3$ \ |
| 61 | + --host $ARG1$ \ |
| 62 | + --cluster-name $ARG2$ \ |
| 63 | + --cluster-threshold 3:1:1 \ |
| 64 | + --cluster-threshold 5:1:3 \ |
| 65 | + --cluster-threshold 10:2:5 \ |
| 66 | + --cluster-threshold 50:5:15 \ |
| 67 | + --cluster-threshold '10%:20%' |
| 68 | +} |
| 69 | + |
| 70 | +define service{ |
| 71 | + use generic-service |
| 72 | + host_name vcenter.example.com |
| 73 | + service_description vSphere Cluster Health |
| 74 | + check_command check_vsphere_cluster_health!vcenter.example.com!MyCluster!user!pw |
| 75 | +} |
| 76 | +``` |