docs: add monitoring and observability guide by zhoward-1 · Pull Request #1040 · michelangelo-ai/michelangelo

zhoward-1 · 2026-04-01T22:05:35Z

Summary

Adds docs/operator-guides/monitoring.md covering the full observability setup for Michelangelo deployments.

Covers:

- Prometheus scrape configuration: ServiceMonitor for the controller manager (port 8091), health probe endpoints (port 8081), API server gRPC metrics, and Envoy admin stats (port 9901)
- Key metrics organized by subsystem: job scheduling, Temporal workflow engine, model serving (Envoy upstream metrics), and controller-runtime health metrics
- 5 alerting rules: job scheduling backlog, no healthy compute clusters (critical), controller reconcile error rate, inference latency P99, inference 5xx error rate
- Grafana dashboard panel recommendations by row (overview, jobs, serving, controller health) with PromQL queries
- Structured logging configuration and a table of important log fields to index for log aggregation systems
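
For reviewers unfamiliar with the Prometheus Operator, the ServiceMonitor item above can be illustrated with a minimal sketch. It is not taken from the guide itself: the resource name, namespace, selector label, and the port name `metrics` (assumed here to be the Service port that maps to 8091) are all hypothetical placeholders. Only the `monitoring.coreos.com/v1` ServiceMonitor shape is standard.

```python
import json


def service_monitor(name, namespace, match_labels, port_name,
                    path="/metrics", interval="30s"):
    """Build a Prometheus Operator ServiceMonitor manifest as a plain dict."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Selects the Service objects whose endpoints should be scraped.
            "selector": {"matchLabels": dict(match_labels)},
            # `port` refers to a *named* port on the selected Service.
            "endpoints": [
                {"port": port_name, "path": path, "interval": interval}
            ],
        },
    }


# All values below are hypothetical; substitute the ones from your deployment.
manifest = service_monitor(
    name="michelangelo-controller-manager",      # assumed resource name
    namespace="michelangelo-system",             # assumed namespace
    match_labels={"app": "controller-manager"},  # assumed Service label
    port_name="metrics",                         # assumed name of the 8091 port
)

# JSON is a subset of YAML, so this output can be piped to `kubectl apply -f -`.
print(json.dumps(manifest, indent=2))
```

Generating the manifest from code rather than hand-writing YAML is only one option; it is shown here because it makes the few required fields (selector, named port, path) explicit.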

Part of the operator/contributor guide improvements proposed in #1033.

🤖 Generated with Claude Code

Covers Prometheus scrape configuration (ServiceMonitor for controller manager, health probes, Envoy admin stats), key metrics for job scheduling, Temporal, model serving, and controller health, five alerting rules, Grafana dashboard panel recommendations, and structured logging configuration with log field indexing guidance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

austingreco mentioned this pull request Apr 8, 2026

fix: add pipefail to docs-check build step #1077

Merged

7 tasks

docs: add monitoring and observability guide #1040
zhoward-1 wants to merge 1 commit into main from docs/monitoring-observability
