
docs: add monitoring and observability guide #1040

Open
zhoward-1 wants to merge 1 commit into main from docs/monitoring-observability

@zhoward-1
Contributor

Summary

Adds docs/operator-guides/monitoring.md covering the full observability setup for Michelangelo deployments.

Covers:

  • Prometheus scrape configuration: ServiceMonitor for the controller manager (port 8091), health probe endpoints (port 8081), API server gRPC metrics, and Envoy admin stats (port 9901)
  • Key metrics organized by subsystem: job scheduling, Temporal workflow engine, model serving (Envoy upstream metrics), and controller-runtime health metrics
  • Five alerting rules: job scheduling backlog, no healthy compute clusters (critical), controller reconcile error rate, inference P99 latency, and inference 5xx error rate
  • Grafana dashboard panel recommendations by row (overview, jobs, serving, controller health) with PromQL queries
  • Structured logging configuration and a table of important log fields to index for log aggregation systems
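
As a rough illustration of the scrape targets listed above, the sketch below builds Prometheus-style `scrape_configs` entries for the controller manager (port 8091) and the Envoy admin endpoint (port 9901). It is not taken from the guide itself: the host names and job names are hypothetical, and the `/metrics` path for the controller manager is an assumption based on the usual controller-runtime default. The ports come from this summary, and `/stats/prometheus` is Envoy's standard admin stats path.

```python
import json

# Ports are taken from the PR summary. Host names, job names, and the
# controller-manager metrics path are hypothetical/assumed for illustration.
SCRAPE_TARGETS = [
    {
        "job": "controller-manager",               # hypothetical job name
        "host": "controller-manager.example.svc",  # hypothetical host
        "port": 8091,
        "path": "/metrics",                        # assumed controller-runtime default
    },
    {
        "job": "envoy-admin",                      # hypothetical job name
        "host": "envoy.example.svc",               # hypothetical host
        "port": 9901,
        "path": "/stats/prometheus",               # Envoy admin Prometheus endpoint
    },
]


def build_scrape_configs(targets):
    """Return a list of dicts shaped like Prometheus `scrape_configs` entries."""
    return [
        {
            "job_name": t["job"],
            "metrics_path": t["path"],
            "static_configs": [{"targets": [f'{t["host"]}:{t["port"]}']}],
        }
        for t in targets
    ]


if __name__ == "__main__":
    # Print the structure as JSON; a real deployment would express this as
    # YAML or as a ServiceMonitor resource, per the guide.
    print(json.dumps({"scrape_configs": build_scrape_configs(SCRAPE_TARGETS)}, indent=2))
```

The static targets here stand in for whatever service discovery the guide actually recommends; the point is only to show how each port in the list maps to one scrape job.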

Part of the operator/contributor guide improvements proposed in #1033.

🤖 Generated with Claude Code

Covers Prometheus scrape configuration (ServiceMonitor for controller
manager, health probes, Envoy admin stats), key metrics for job
scheduling, Temporal, model serving, and controller health, five
alerting rules, Grafana dashboard panel recommendations, and
structured logging configuration with log field indexing guidance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
