Problem
etcd auto-recovery (#4503) does not detect pods stuck in init containers during initial HostedCluster bootstrap. The cluster remained stuck for 70+ minutes with no recovery triggered.
- etcd-0 was stuck in 'PodInitializing': the 'ensure-dns' init container blocked on DNS resolution.
- etcd-1 and etcd-2 could not reach etcd-0 (TLS rejected the IP-based connections because the peer certificates carry only DNS SANs).
- A manual 'kubectl delete' of the pod fixed it immediately.
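The TLS rejection in the second bullet follows from standard certificate matching: a peer that connects by bare IP can only be accepted via an IP SAN, and these certificates have none. A minimal sketch of that matching rule (an illustrative simplification, not etcd's actual verification code; the hostname below is a hypothetical example, not the real service name):

```python
import ipaddress

def matches_cert(remote: str, dns_sans: list[str], ip_sans: list[str]) -> bool:
    """Return True if the remote identity is covered by the certificate's SANs.

    A bare IP address must match an IP SAN; DNS SANs never match raw IPs.
    """
    try:
        ipaddress.ip_address(remote)
        return remote in ip_sans      # IP peers need an IP SAN
    except ValueError:
        return remote in dns_sans     # hostnames match DNS SANs

# The situation from the logs: IP peer, cert with DNS SANs only -> rejected.
print(matches_cert("10.130.5.160",
                   dns_sans=["etcd-0.example.svc"],  # hypothetical SAN
                   ip_sans=[]))                      # no IP SANs on the cert
```

This is why the peers cannot fall back to IPs while etcd-0's DNS record is missing: the handshake fails before etcd-level health checking even begins.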
Logs
{"level":"warn","ts":"2026-01-29T11:56:49.557821Z","caller":"etcdserver/cluster_util.go:294","msg":"failed to reach the peer URL","address":"https://etcd-0.etcd-discovery..svc:2380/version","error":"dial tcp: lookup etcd-0.etcd-discovery..svc on 172.30.0.10:53: no such host"}
{"level":"warn","ts":"2026-01-29T11:56:45.128004Z","caller":"embed/config_logging.go:161","msg":"rejected connection on peer endpoint","remote-addr":"10.130.5.160:39932","ip-addresses":[],"dns-names":[".etcd-discovery..svc",".etcd-discovery..svc.cluster.local","127.0.0.1","::1"],"error":"tls: \"10.130.5.160\" does not match any of DNSNames"}
This results in the following pod status:
status:
conditions:
- lastTransitionTime: "2026-01-29T11:56:46Z"
message: 'containers with incomplete status: [ensure-dns reset-member]'
reason: ContainersNotInitialized
status: "False"
type: Initialized
phase: Pending
initContainerStatuses:
- name: ensure-dns
ready: false
state:
waiting:
reason: PodInitializing
Note: 3 other HostedClusters were created simultaneously and succeeded. This appears to be an intermittent race condition during parallel pod bootstrap.
The current recovery logic only monitors etcd endpoint health checks; it does not detect init container hangs.
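The gap could be closed by an additional check that flags pods stuck in init containers past a threshold. A minimal sketch of that check, operating on status data shaped like the YAML above (the threshold and the dict-based pod model are illustrative assumptions; in the real operator this data would come from the Kubernetes API):

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold; the incident above sat unrecovered for 70+ minutes.
STUCK_THRESHOLD = timedelta(minutes=10)

def stuck_in_init(pod: dict, now: datetime) -> bool:
    """True if the pod is Pending with an init container waiting past the threshold."""
    if pod.get("phase") != "Pending":
        return False
    started = datetime.fromisoformat(pod["startTime"])
    if now - started < STUCK_THRESHOLD:
        return False
    return any(
        cs.get("state", {}).get("waiting", {}).get("reason") == "PodInitializing"
        for cs in pod.get("initContainerStatuses", [])
    )

# Pod modelled on the status above.
pod = {
    "phase": "Pending",
    "startTime": "2026-01-29T11:56:46+00:00",
    "initContainerStatuses": [
        {"name": "ensure-dns", "ready": False,
         "state": {"waiting": {"reason": "PodInitializing"}}},
    ],
}
now = datetime(2026, 1, 29, 13, 7, tzinfo=timezone.utc)  # ~70 minutes later
print(stuck_in_init(pod, now))  # True -> recovery should delete/recreate the pod
```

On a positive result, recovery could apply the same remediation that worked manually here: delete the stuck pod and let the StatefulSet recreate it.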
Related
- #4503 - Current fix (doesn't cover this scenario)
- #4354 / #4475 - Reverted approach
- #1985 - Added the 'ensure-dns' init container