docs: add operator troubleshooting guide #1038

Open
zhoward-1 wants to merge 4 commits into main from docs/troubleshooting-guide

@zhoward-1
Contributor

Summary

Adds docs/operator-guides/troubleshooting.md, covering 9 common failure scenarios that operators encounter when deploying and running Michelangelo.

Each scenario has:

  • Symptoms: what the operator observes
  • Diagnostics: kubectl commands to run, with expected output explained
  • Likely causes: ordered from most to least common

Scenarios covered:

  1. Jobs not being scheduled (no cluster assignment)
  2. Compute cluster registration failures
  3. Ray pods not starting on the compute cluster
  4. Worker cannot connect to the API server
  5. Temporal/Cadence connectivity issues
  6. InferenceServer not becoming healthy
  7. Model not loading (Deployment stuck in Asset Preparation)
  8. S3/object store errors
  9. UI not loading or API calls failing

Part of the operator/contributor guide improvements proposed in #1033.

🤖 Generated with Claude Code

Covers 9 common failure scenarios with symptoms, kubectl diagnostic
commands, and likely causes: job scheduling, cluster registration,
Ray pods, worker connectivity, Temporal, InferenceServer health,
model loading, S3 errors, and UI issues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kubectl -n ma-system get configmap michelangelo-envoy-config -o yaml

# Verify the API server is reachable through Envoy
curl -v https://michelangelo-envoy.your-domain/healthz
Contributor

We don't currently support a healthcheck endpoint, so there is currently no way to verify that the API server is reachable through Envoy.

Contributor Author

updated/removed. ty

zhoward-1 and others added 3 commits April 10, 2026 15:31
Co-authored-by: Craig Marker <craig@marker.org>
Co-authored-by: Craig Marker <craig@marker.org>
Removed the "# Verify the API server is reachable through Envoy" comment and the "curl -v https://michelangelo-envoy.your-domain/healthz" command, as we don't currently support a healthcheck endpoint and there is currently no way to verify that the API server is reachable through Envoy.

Contributor

@austingreco left a comment

All review feedback addressed. LGTM.

3 participants