| Field | Value |
|---|---|
| Authors | weric@uber.com |
| ERD uPlan URL | d68e1299-1e57-4557-9eaf-04e74b3beb14 |
| Project Summary | Full-stack end-to-end integration test infrastructure for Michelangelo, exercising the Go backend (API server, controllermgr, worker), Python SDK and CLI (ma), JavaScript UI (project, pipeline, run pages), and complete training workflows using UniFLOW, Ray, and Spark — across both Cadence and Temporal workflow engines |
| uOwn Asset | Uber AI |
| Date started | Mar 06, 2026 |
Michelangelo is a full-stack ML platform spanning three language domains: Go (API server, controllermgr, worker), Python (SDK, CLI, pipeline DSL), and JavaScript (React UI). Today each layer has unit and component tests in isolation, but there is no automated test that exercises the full vertical slice — from the ma CLI registering a pipeline, through the API server and controllermgr reconciling it in Kubernetes, to the worker executing a real Ray or Spark training job, and the UI displaying the results.
Without end-to-end coverage, a breaking change in any one component (a CRD schema change, a broken controller reconcile loop, an API response format change) is only caught when a user reports it or an on-call engineer investigates a production incident. This ERD proposes a nightly CI integration test that boots the full Michelangelo stack in a local k3d sandbox, runs representative training pipelines for each executor type (UniFLOW, Ray, Spark), and performs UI smoke tests to verify the project, pipeline, and run pages respond correctly.
Expected impact:
- Catch cross-component regressions (Go ↔ Python ↔ JS) before they reach production
- Reduce time-to-detect pipeline-breaking changes from days to minutes
- Validate that the `ma` CLI, Go services, and UI all agree on the CRD contract
- Give the team a reproducible environment for testing new features end-to-end
- No cross-language E2E coverage: Go, Python, and JavaScript tests run in separate CI jobs with no shared integration surface.
- No UI regression tests: The React UI has unit tests but no automated check that the project list, pipeline list, or run detail pages load real data from the API.
- Sandbox is manual-only: `mactl sandbox create` was designed for local developer use. Running it in CI requires solving image pre-warming, startup timing, and resource constraints on GitHub Actions runners.
- Ray and Spark jobs untested end-to-end: No automated test verifies that a Ray cluster is created, a RayJob completes, or that Spark submits and finishes a job through the Michelangelo controllermgr.
- Two workflow engines, no shared test: Cadence and Temporal are both supported but there is no test that runs the same pipeline through both engines to verify behavioral parity.
- Examples image build time: The task image (bert_cola + Spark + PyTorch) takes 30+ minutes to build from scratch. A pre-build and caching strategy is required for CI to be practical.
- GPU-accelerated training in CI (CPU-only runs; GPU via self-hosted runners is a follow-up)
- Inference pipeline testing (follow-on scope; see resource estimate section)
- Performance or load testing
- Testing Cadence UI or MinIO console
- Testing Uber-internal auth/authFx integrations (sandbox uses minioadmin credentials)
- Full Playwright UI test suite (in-scope: HTTP smoke tests; full UI E2E testing is a follow-up)
┌────────────────────────────────┐
│ build-examples-image.yaml │ Triggered: push / PR / dispatch
│ │ Builds task image (bert_cola + Spark
│ ghcr.io/.../examples:<tag> │ + PyTorch + Ray), pushes to GHCR
│ │ with GHA layer cache [1]
└──────────────┬─────────────────┘
               │ workflow_run (on success)
               ▼
┌─────────────────────────────────────────────────────────────────────┐
│ integration-test-sandbox.yaml │
│ │
│ matrix: workflow_engine: [cadence, temporal] (runs in parallel) │
│ │
│ For each engine: │
│ 1. Boot k3d sandbox (MySQL, workflow engine, MinIO, API server, │
│ controllermgr, worker, kuberay-operator, spark-operator, │
│ envoy, michelangelo-ui) [2][3] │
│ 2. UniFLOW test — bert_cola training pipeline │
│ 3. Ray test — simple Ray training job [4] │
│ 4. Spark test — simple Spark training job [5] │
│ 5. UI smoke test — project / pipeline / run pages return 200 │
│ 6. Tear down sandbox │
│ │
│ Triggers: workflow_run | nightly 03:00 UTC | workflow_dispatch │
└─────────────────────────────────────────────────────────────────────┘
References:
[1] GitHub Actions cache: https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows
[2] k3d — lightweight k3s in Docker: https://k3d.io/
[3] k3s — lightweight Kubernetes: https://k3s.io/
[4] KubeRay operator: https://ray-project.github.io/kuberay/
[5] Spark operator: https://github.com/kubeflow/spark-operator
| Test | Go components | Python CLI | UI |
|---|---|---|---|
| UniFLOW (bert_cola) | API server (CRD CRUD), controllermgr (reconcile), worker (pod scheduling) | `ma pipeline apply`, `ma pipeline run`, poll | — |
| Ray | API server, controllermgr (RayJob), kuberay-operator [4] | `ma pipeline apply`, `ma pipeline run` | — |
| Spark | API server, controllermgr (SparkApplication), spark-operator [5] | `ma pipeline apply`, `ma pipeline run` | — |
| UI smoke | API server (HTTP/JSON) | — | Project list, detail; pipeline list; run detail |
- Michelangelo Python SDK team — owns integration test script and CI workflows
- Michelangelo platform (Go) team — sandbox resource YAMLs and CRD schemas must remain compatible with k3d deployment
- Michelangelo UI team — UI must be reachable at `http://localhost:8090`; all four route patterns must return HTTP 200 after test data is created
- GitHub billing — larger runners required; see Cost section
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Ray/Spark jobs OOM on CI runner | Low (GCP VM ≥64 GB) | Test failure | Size GCP VM to ≥16 vCPU / 64 GB |
| MySQL slow startup on constrained runner | Mitigated | kubectl wait timeout | GCP VM has sufficient CPU; also pre-import mysql:8.0 into k3d and tune probe settings |
| Ray/Spark demo pipeline YAMLs don't exist yet | High (current gap) | Blocked tests | Must be created as part of this work |
| `workflow_run` only fires from default branch | Known | Branch testing harder | Temporary push trigger on feature branches |
| Cadence/Temporal parity gaps surface | Medium | Flaky parallel jobs | Run sequentially if one engine is flaky; fix parity first |
| GCP VM runner registration requires org admin | Known | Blocks initial setup | Org admin must generate runner token at GitHub org settings |
The sandbox runs on GitHub Actions [6] using a self-hosted runner [7] registered on a GCP VM. The Kubernetes cluster is created by k3d [2], which wraps k3s [3] (a lightweight Kubernetes distribution) inside Docker containers on the runner host.
Why k3d? Michelangelo uses Kubernetes CRDs (PipelineRun, Pipeline, Project, RayJob, SparkApplication) that require a real Kubernetes API server with CRD support, RBAC, and pod scheduling. k3d provides this on a single Linux VM in ~10 seconds without cloud credentials or persistent infrastructure.
Why GitHub Actions? The Michelangelo repo is hosted on GitHub (github.com/michelangelo-ai/michelangelo). GitHub Actions provides native integration with the repository event model (push, pull_request, workflow_run, schedule), secret management, and artifact storage.
Why self-hosted GCP VM instead of GitHub-hosted runners?
MySQL 8.0's first-boot data directory initialization (mysqld --initialize) is CPU-intensive and takes 20+ minutes on a 2-CPU ubuntu-latest runner — well beyond any kubectl readiness probe timeout. The standard GitHub-hosted runner (ubuntu-latest) provides only 2 CPU / 7 GB RAM [7]. Larger GitHub-hosted runners (up to 16-core/64 GB) are available but require billing approval and add significant per-minute cost.
A self-hosted GCP VM eliminates both constraints: it provides ample CPU/RAM for MySQL cold start and all Ray/Spark task pods, and runner minutes are not billed by GitHub (only GCP VM cost applies). The workflow targets the GCP runner with runs-on: [self-hosted, linux, gcp] [7].
Runner registration: A GitHub org admin must generate a runner registration token at https://github.com/organizations/michelangelo-ai/settings/actions/runners/new and run `config.sh` on the GCP VM. See the self-hosted runner docs [7] for step-by-step setup.
[6] GitHub Actions: https://docs.github.com/en/actions
[7] GitHub self-hosted runners: https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners
All services run as pods inside the k3d cluster on the GitHub Actions runner:
k3d cluster: michelangelo-sandbox (k3s v1.31, Docker-in-Docker [2])
├── default namespace
│   ├── mysql                       Pod         MySQL 8.0 [8] — API server + workflow engine storage
│   ├── cadence | temporal          Pod         Cadence [9] or Temporal [10] — workflow engine
│   ├── minio                       Pod         MinIO [11] — S3-compatible blob store
│   ├── michelangelo-apiserver      Pod         REST+gRPC CRD API (Go)
│   ├── michelangelo-controllermgr  Pod         Reconciles Pipeline/PipelineRun CRDs (Go)
│   ├── michelangelo-worker         Pod         Executes UniFLOW task pods (Go)
│   ├── envoy                       Pod         HTTP proxy → michelangelo-ui
│   └── michelangelo-ui             Deployment  React UI (TypeScript/React [12])
├── ray-system namespace
│   └── kuberay-operator            Deployment  KubeRay operator v1.4.2 [4]
└── spark-operator namespace
    └── spark-operator              Deployment  Spark Operator v2.x [5]
[8] MySQL 8.0: https://hub.docker.com/_/mysql
[9] Cadence: https://cadenceworkflow.io/
[10] Temporal: https://temporal.io/
[11] MinIO: https://min.io/
[12] React: https://react.dev/
Cadence [9] and Temporal [10] are both supported by Michelangelo as the underlying workflow engine for pipeline runs. The integration test must validate both. We use a GitHub Actions matrix [13] to run two parallel jobs — one per engine — on the same commit, with independent sandboxes.
# integration-test-sandbox.yaml (simplified)
jobs:
  sandbox-e2e:
    runs-on: [self-hosted, linux, gcp]
    strategy:
      matrix:
        workflow_engine: [cadence, temporal]
      fail-fast: false  # one engine failing does not cancel the other
    steps:
      - run: poetry run ma sandbox create --workflow ${{ matrix.workflow_engine }} ...
      - run: integration-test.sh
      - run: poetry run ma sandbox delete
        if: always()  # tear down the sandbox even when an earlier step fails

Parallel vs sequential:
| Mode | Pros | Cons |
|---|---|---|
| Parallel (matrix) | Faster total wall time (~60 min instead of ~120 min) | Requires 2× runner cost; both sandboxes share the runner's Docker daemon |
| Sequential | Lower cost; simpler debugging | Slower; a Temporal failure blocks seeing Cadence result |
Recommendation: parallel using GitHub Actions matrix. Each engine gets its own k3d cluster with a unique name (michelangelo-sandbox-cadence, michelangelo-sandbox-temporal) to avoid port conflicts. If runner resource contention becomes an issue, fall back to sequential by setting max-parallel: 1.
[13] GitHub Actions matrix strategy: https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs
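One way the per-engine cluster names and non-overlapping host ports could be derived is sketched below. The cluster names follow the recommendation above, and the base ports 8090 (UI) and 9091 (MinIO) are the ones used elsewhere in this document. The fixed `PORT_STRIDE` offset scheme, the `SandboxConfig` type, and the `sandbox_config` function are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass

ENGINES = ("cadence", "temporal")

# Host ports this document uses for the single-sandbox case.
BASE_PORTS = {"ui": 8090, "minio": 9091}

# Hypothetical scheme: shift every host port by a fixed stride per engine
# so two sandboxes on one runner never bind the same port.
PORT_STRIDE = 1000


@dataclass(frozen=True)
class SandboxConfig:
    cluster_name: str
    ports: dict


def sandbox_config(engine: str) -> SandboxConfig:
    """Return the k3d cluster name and host port map for one matrix entry."""
    if engine not in ENGINES:
        raise ValueError(f"unknown workflow engine: {engine!r}")
    offset = ENGINES.index(engine) * PORT_STRIDE
    ports = {name: port + offset for name, port in BASE_PORTS.items()}
    return SandboxConfig(f"michelangelo-sandbox-{engine}", ports)
```

Under this scheme the Cadence sandbox keeps the default ports, while the Temporal sandbox would be reached on shifted ports, so any test step that hardcodes `localhost:8090` would need to read the port from the config instead.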
integration-test.sh (runs inside the sandbox-e2e job, after sandbox is up)
│
├── 0. [CI workflow] ma sandbox create --workflow <engine>
│ k3d cluster created, all services running
│
├── 1. Upload bert_local.tar → s3://default/bert_local.tar
│ aws s3 cp --endpoint-url http://localhost:9091
│ (required before demo pipeline registration; uniflowTar references this S3 path)
│
├── 2. ma sandbox demo create pipeline
│ Registers: Project CR, training-pipeline, ray-pipeline, spark-pipeline
│
├── 3. UniFLOW test
│ ma pipeline run -n ma-dev-test --name training-pipeline
│ → poll kubectl get pipelinerun ... -o jsonpath='{.status.state}'
│ → assert PIPELINE_RUN_STATE_SUCCEEDED ✅
│
├── 4. Ray test
│ ma pipeline run -n ma-dev-test --name ray-pipeline
│ → poll → assert PIPELINE_RUN_STATE_SUCCEEDED ✅
│
├── 5. Spark test
│ ma pipeline run -n ma-dev-test --name spark-pipeline
│ → poll → assert PIPELINE_RUN_STATE_SUCCEEDED ✅
│
├── 6. UI smoke test (after test data exists)
│ curl http://localhost:8090/ → 200 ✅
│ curl http://localhost:8090/{projectId} → 200 ✅
│ curl http://localhost:8090/{projectId}/train/pipelines → 200 ✅
│ curl http://localhost:8090/{projectId}/train/runs/{runId} → 200 ✅
│
└── [CI workflow] ma sandbox delete (always, even on failure)
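The poll-and-assert pattern in steps 3 to 5 can be sketched as a small helper. This is a minimal sketch, not the actual `integration-test.sh` logic: `get_state` stands in for whatever wraps the `kubectl get pipelinerun` call shown above, the `FAILED` state name is an assumption that mirrors the success state, and the 1800 s default timeout is borrowed from the Ray/Spark completion assumption stated later in this document.

```python
import time
from typing import Callable

SUCCEEDED = "PIPELINE_RUN_STATE_SUCCEEDED"
# Assumption: the failure state is named symmetrically to the success state.
FAILED = "PIPELINE_RUN_STATE_FAILED"


def wait_for_run(
    get_state: Callable[[], str],
    timeout_s: float = 1800.0,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_state until the run succeeds, fails, or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state == SUCCEEDED:
            return state
        if state == FAILED:
            # Fail fast instead of waiting out the full timeout.
            raise RuntimeError(f"pipeline run failed: {state}")
        if clock() >= deadline:
            raise TimeoutError(f"run still in {state!r} after {timeout_s}s")
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the helper testable without real waiting; in CI both would stay at their defaults.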
Resource sizing is driven by the sum of: baseline services + task pods per pipeline type.
Baseline services (always running):
| Service | CPU request | RAM |
|---|---|---|
| k3s server + agent | 0.5 | 1 GB |
| MySQL 8.0 (first boot init) | 1.0 | 1 GB |
| Cadence or Temporal | 0.5 | 512 MB |
| MinIO | 0.2 | 256 MB |
| API server + controllermgr + worker | 0.5 | 768 MB |
| envoy + michelangelo-ui | 0.2 | 256 MB |
| kuberay-operator | 0.5 | 512 MB |
| spark-operator | 0.5 | 512 MB |
| Baseline total | ~4 CPU | ~5 GB |
Training task pods (sequential execution, peak per job):
| Pipeline type | Peak CPU | Peak RAM | Notes |
|---|---|---|---|
| UniFLOW (bert_cola, CPU-only) | 4 | 8 GB | PyTorch single-process training |
| Ray (1 head + 1 worker) | 4 | 8 GB | KubeRay managed [4]; small model |
| Spark (1 driver + 1 executor) | 4 | 8 GB | Local mode or 1 executor [5] |
| Peak (one job at a time) | ~8 CPU | ~13 GB | Jobs run sequentially |
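The totals in the two tables above can be re-derived mechanically, which helps when a service is added or a request changes. The figures below are copied from the tables (512 MB is written as 0.5 GB, and so on); the dictionary layout and function names are illustrative only.

```python
# (CPU cores, RAM in GB) per baseline service, copied from the table above.
BASELINE = {
    "k3s server + agent": (0.5, 1.0),
    "MySQL 8.0 (first boot init)": (1.0, 1.0),
    "Cadence or Temporal": (0.5, 0.5),
    "MinIO": (0.2, 0.25),
    "API server + controllermgr + worker": (0.5, 0.75),
    "envoy + michelangelo-ui": (0.2, 0.25),
    "kuberay-operator": (0.5, 0.5),
    "spark-operator": (0.5, 0.5),
}

# Peak (CPU cores, RAM in GB) per pipeline type; jobs run one at a time.
TASK_PEAK = {"UniFLOW": (4, 8), "Ray": (4, 8), "Spark": (4, 8)}


def totals(rows):
    """Sum CPU and RAM over all rows."""
    cpu = sum(c for c, _ in rows.values())
    ram = sum(r for _, r in rows.values())
    return cpu, ram


def peak_demand():
    """Baseline plus the single largest job, since jobs are sequential."""
    base_cpu, base_ram = totals(BASELINE)
    job_cpu = max(c for c, _ in TASK_PEAK.values())
    job_ram = max(r for _, r in TASK_PEAK.values())
    return base_cpu + job_cpu, base_ram + job_ram
```

The baseline sums to 3.9 CPU and 4.75 GB, and the peak to 7.9 CPU and 12.75 GB, which the tables round to ~4 CPU / ~5 GB and ~8 CPU / ~13 GB.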
Runner sizing by scope:
| Scope | Runner | CPU | RAM | Cost/run |
|---|---|---|---|---|
| UniFLOW + Ray + Spark (current) | GCP VM self-hosted [7] | 16+ | 64+ GB | GCP VM cost only |
| + Inference (CPU, small model) | GCP VM self-hosted [7] | 16+ | 64+ GB | GCP VM cost only |
| + Inference (GPU, vLLM) | GCP VM self-hosted w/ GPU [7] | 8+ | 32 GB + GPU | GCP VM cost only |
Adding inference scope: A CPU inference job (e.g., HuggingFace model serving) adds ~8–16 GB RAM for the model and ~4 CPU for inference workers. A 16-core/64 GB GCP VM is sufficient for small models (7B or smaller, quantized). GPU inference (vLLM with a 13B+ model [14]) requires a GCP VM with an attached GPU (e.g., L4 or T4).
[14] vLLM: https://docs.vllm.ai/
When a nightly test fails, finding the responsible change quickly is critical. We propose a two-layer approach: step-level attribution (which component owns the failure) and AI-assisted root cause analysis (which PR most likely introduced it).
Step-level attribution:
Each test step maps directly to an owner team. The failing step name in GitHub Actions is sufficient to route the alert:
| Failed step | Owner | First action |
|---|---|---|
| Create sandbox (MySQL/k3d timeout) | Platform (Go) team | Check sandbox.py, resource YAML changes in recent PRs |
| UniFLOW run timeout/fail | Python SDK team | Check worker pod logs (uploaded as artifact) |
| Ray run timeout/fail | Platform (Go) team | Check kuberay CRD, controllermgr Ray handling |
| Spark run timeout/fail | Platform (Go) team | Check spark-operator CRD, controllermgr Spark handling |
| UI smoke (HTTP non-200) | JavaScript UI team | Check envoy config, michelangelo-ui container logs |
| Pull examples image | Python SDK team | Check build-examples-image.yaml run for same commit |
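The routing table above is small enough to encode directly in an alerting script. The sketch below assumes that CI step names begin with the labels used in the table; the `STEP_OWNERS` mapping and `owner_for` helper are hypothetical names, not part of any existing tooling.

```python
# Step-name prefix -> owning team, transcribed from the table above.
STEP_OWNERS = {
    "Create sandbox": "Platform (Go) team",
    "UniFLOW run": "Python SDK team",
    "Ray run": "Platform (Go) team",
    "Spark run": "Platform (Go) team",
    "UI smoke": "JavaScript UI team",
    "Pull examples image": "Python SDK team",
}


def owner_for(failed_step: str, default: str = "unassigned") -> str:
    """Return the team that owns a failed step, matched by name prefix."""
    step = failed_step.lower()
    for prefix, owner in STEP_OWNERS.items():
        if step.startswith(prefix.lower()):
            return owner
    return default
```

A step that matches no prefix falls through to `default`, so new workflow steps surface as unrouted rather than being silently assigned to the wrong team.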
AI-assisted root cause (GenAI):
We propose integrating an LLM step into the CI failure workflow to accelerate root cause identification. When the integration test fails, a post-failure step:
- Collects: failed step name, last 100 lines of failed step log, `kubectl describe pod` output for failed pods, and the git diff of PRs merged since the last passing nightly run (via `gh pr list --state merged --base main`)
- Submits this context to the Claude API [16] (via `claude -p` or a small Python script) with the prompt: "Given this CI failure log and the following recent code changes, which change most likely caused this failure and why?"
- Posts the LLM's response as a comment on the GitHub Actions run summary
This surfaces a ranked list of suspect PRs and likely root cause hypotheses without requiring a human to manually correlate logs and git history. GitHub Copilot Autofix [17] offers comparable AI-generated suggestions natively, but only for code scanning (CodeQL) alerts rather than CI failures.
[16] Claude API: https://docs.anthropic.com/en/api/
[17] GitHub Copilot autofix: https://docs.github.com/en/code-security/code-scanning/managing-code-scanning-alerts/about-autofix-for-codeql-alerts
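The context-collection step described above amounts to assembling one text prompt. A minimal sketch follows; it only builds the string and deliberately leaves out the API call. The prompt question and the 100-line log tail come from the description above, while the section headings, the `build_prompt` name, and the shape of `pr_diffs` (PR title mapped to diff text) are assumptions made for illustration.

```python
PROMPT_QUESTION = (
    "Given this CI failure log and the following recent code changes, "
    "which change most likely caused this failure and why?"
)
LOG_TAIL_LINES = 100


def build_prompt(failed_step, step_log, pod_descriptions, pr_diffs):
    """Assemble the root-cause prompt from the collected CI context.

    pr_diffs maps a PR title to its diff text (hypothetical shape).
    """
    # Keep only the tail of the log to bound the prompt size.
    tail = "\n".join(step_log.splitlines()[-LOG_TAIL_LINES:])
    sections = [
        PROMPT_QUESTION,
        f"## Failed step\n{failed_step}",
        f"## Log tail (last {LOG_TAIL_LINES} lines)\n{tail}",
        f"## kubectl describe pod\n{pod_descriptions}",
    ]
    for title, diff in pr_diffs.items():
        sections.append(f"## Merged PR: {title}\n{diff}")
    return "\n\n".join(sections)
```

Keeping prompt assembly separate from the API call makes it straightforward to log exactly what leaves the runner, which supports the privacy claims made later in this document.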
Artifact collection on failure:
- name: Collect debug artifacts
  if: failure()
  run: |
    kubectl get pods -A -o wide > pods.txt
    kubectl describe pods -n ma-dev-test >> pods.txt
    kubectl logs -n ma-dev-test -l app --tail=200 >> logs.txt
- uses: actions/upload-artifact@v4
  if: failure()
  with:
    name: sandbox-debug-${{ matrix.workflow_engine }}
    path: |
      pods.txt
      logs.txt

| CRD | API Group | Operations |
|---|---|---|
| Project | michelangelo.ai/v2 | create |
| Pipeline | michelangelo.ai/v2 | create, get |
| PipelineRun | michelangelo.ai/v2 | create, get/watch |
| RayJob | ray.io/v1 [4] | create, get (via controllermgr) |
| SparkApplication | sparkoperator.k8s.io/v1beta2 [5] | create, get (via controllermgr) |
| Store | Bucket/DB | Contents |
|---|---|---|
| MinIO [11] | default | bert_local.tar, model artifacts |
| MinIO | logs | Task pod stdout/stderr |
| MySQL [8] | michelangelo | Project, Pipeline, PipelineRun records |
| MySQL | cadence or temporal | Workflow engine state |
- Build image first: `build-examples-image.yaml` must succeed before the sandbox test starts. The `workflow_run` trigger enforces this on `main`. Feature branches use a temporary `push` trigger.
- Image tag synchronization: The `Compute image tag` step in the sandbox workflow derives the GHCR tag from the triggering branch name (sanitizing `/` → `-`, per `docker/metadata-action` [18] `type=ref,event=branch` behavior).
- MinIO upload precondition: `bert_local.tar` must exist at `s3://default/bert_local.tar` before `ma sandbox demo create pipeline`, as `training-pipeline.yaml` references that S3 path in `uniflowTar`.
- Ray/Spark pipeline YAMLs: `ray-pipeline.yaml`, `spark-pipeline.yaml`, and lightweight example modules must be added to the demo set. This is a prerequisite for those test steps.
- UI requires test data: The UI smoke tests run after the pipeline tests to ensure project/pipeline/run data exists in the API for the UI to display.
[18] docker/metadata-action: https://github.com/docker/metadata-action
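The image tag synchronization rule above can be sketched as a one-line transformation. This covers only the `/` → `-` substitution stated in this document; the real `docker/metadata-action` may apply further sanitization that is not modeled here, and the `branch_to_tag` name is hypothetical.

```python
def branch_to_tag(ref: str) -> str:
    """Derive an image tag from a branch ref by replacing '/' with '-'.

    Accepts either a bare branch name or a full 'refs/heads/...' ref.
    Requires Python 3.9+ for str.removeprefix.
    """
    branch = ref.removeprefix("refs/heads/")
    return branch.replace("/", "-")
```

Both workflows must apply the same rule, otherwise the sandbox job would try to pull a tag that the build job never pushed.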
- Ray and Spark demo pipelines don't exist yet: Lightweight `ray_example` and `spark_example` pipeline modules and their `demo/pipeline/*.yaml` manifests must be implemented as a prerequisite for those test steps.
- MySQL startup on constrained runners: First-boot data directory initialization takes 3–8 min. Mitigated by pre-importing `mysql:8.0` into k3d containerd via `k3d image import` [2] before pod scheduling, and setting `initialDelaySeconds: 10`, `failureThreshold: 50`.
- Cadence/Temporal port conflicts in parallel matrix: Two sandbox clusters must use different k3d port mappings to coexist on the same runner. The matrix job uses cluster names `michelangelo-sandbox-cadence` and `michelangelo-sandbox-temporal` with non-overlapping NodePort ranges.
- `workflow_run` limitation: The sandbox workflow must exist on the default branch for `workflow_run` to fire [6]. Feature branch testing requires a temporary `push` trigger (added to the workflow file, removed before merge).
- k3d image import timing: `mysql:8.0` is imported into k3d containerd while `ma sandbox create` runs in the background. The import must complete before the mysql pod's `ContainerCreating` state resolves into a pull — otherwise containerd will still attempt to pull from Docker Hub inside k3d.
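Whether the probe settings in the MySQL item cover a 3–8 min first boot is a quick calculation. The sketch below assumes the probe leaves `periodSeconds` at the Kubernetes default of 10 s, which this document does not state; if the sandbox manifests set a different period, the budget changes accordingly.

```python
# Kubernetes default when a probe does not set periodSeconds.
DEFAULT_PERIOD_SECONDS = 10


def probe_budget_seconds(initial_delay_s, failure_threshold,
                         period_s=DEFAULT_PERIOD_SECONDS):
    """Approximate time a container has before the probe gives up."""
    return initial_delay_s + failure_threshold * period_s


# initialDelaySeconds: 10, failureThreshold: 50 -> 10 + 50 * 10 = 510 s
MYSQL_BUDGET_S = probe_budget_seconds(10, 50)
```

With these values the budget is 510 s (8.5 min), which leaves only about 30 s of headroom over the upper end of the 3–8 min estimate.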
Daily (nightly) is the primary cadence — not per-PR. Full test suite run time (~60 min per engine) makes per-PR execution too slow for developer feedback loops.
| Trigger | Schedule | Purpose |
|---|---|---|
| Nightly schedule | `0 3 * * *` UTC [6] | Primary regression signal on main |
| `workflow_run` | After build-examples-image succeeds on main | Catch image regressions on merge |
| `workflow_dispatch` | Manual | Pre-release validation, failure debugging |
| Temporary `push` on branch | During development | Branch-level testing (removed pre-merge) |
- GitHub Actions native notifications [6]: Failure email to workflow watchers; failure badge on README
- GitHub Actions run summary: AI-assisted root cause comment posted as a step annotation (see Design section)
- Debug artifacts: Pod logs and `kubectl describe` output uploaded on failure via `actions/upload-artifact` [19]
- Owner routing: Failing step name maps directly to component owner (table in Design section)
[19] actions/upload-artifact: https://github.com/actions/upload-artifact
Phase 1 — UniFLOW baseline (current): Land CI workflows, verify bert_cola pipeline succeeds end-to-end on a 16-core runner.
Phase 2 — Ray + Spark: Add lightweight example jobs and demo pipeline YAMLs; activate Ray and Spark test steps.
Phase 3 — UI smoke tests + dual engine: Add curl smoke test step; enable Cadence/Temporal matrix with port conflict resolution.
Phase 4 — AI root cause + inference (optional): Integrate Claude API failure analysis step; add inference pipeline test if GPU self-hosted runner is available.
Not applicable. This is new CI infrastructure. The sandbox tooling (ma sandbox) already exists for local developer use; this work extends it to CI.
This integration test runs entirely within an ephemeral GitHub Actions runner. No personal data (L1/L2/L3) is processed or stored.
- MinIO: `bert_local.tar` is a pre-built ML model tarball with no PII; all artifacts are destroyed at job end
- MySQL: stores Cadence/Temporal workflow state and Michelangelo CRD metadata; no user data
- All storage is destroyed when `ma sandbox delete` runs (or the runner terminates)
- No Databook tables are read or written
- GHCR: `GITHUB_TOKEN` (auto-provisioned per-job, repository-scoped) [6]
- MinIO: `minioadmin`/`minioadmin` — sandbox credentials only, no production connectivity
- MySQL: `root`/`root` — sandbox credentials only, no production connectivity
- k3d cluster is network-isolated to the GitHub Actions runner host; no external ingress
No user input. All configuration is via environment variables defined in the workflow YAML (version-controlled). The Claude API call in the AI root cause step sends only CI logs and git diffs — no user or production data.
The AI-assisted root cause step uses the Claude API [16] (Anthropic's externally-hosted LLM). The input is: CI failure logs + git diffs of recent merged PRs. This data is:
- Not L1/L2 personal data
- Not production data
- Consists entirely of source code changes and CI stderr/stdout
No PII redactor is required. No L8+ exception is needed.
GitHub Actions provides full step-level logs retained for 90 days [6]. Debug artifacts (pod logs) are uploaded on failure with the same retention. No additional logging infrastructure is required.
- A GCP VM self-hosted runner is registered with the `michelangelo-ai` GitHub org with labels `self-hosted, linux, gcp`.
- The GCP VM has ≥16 CPU cores and ≥64 GB RAM so that MySQL cold-boot initialization, all baseline services, and peak Ray/Spark task pods run concurrently without OOM.
- A simple Ray training job (1 head + 1 worker, small model) will complete within 1800s on the GCP runner.
- A simple Spark job (local mode or 1 executor) will complete within 1800s on the GCP runner.
- The UI serves pre-compiled static assets from the examples image; no live JS build in CI.
- Cadence and Temporal sandboxes can coexist on the same runner using separate k3d clusters with non-overlapping ports.
| Alternative | Why discarded |
|---|---|
| Mock/stub Go services in Python tests | Does not exercise real CRD reconciliation or controller logic |
| Shared cloud k8s sandbox (persistent) | Expensive, shared-state flakiness, complex access control |
| Build examples image in every test run | 30+ min build time; GHCR pre-build + GHA layer cache [1] is the right tradeoff |
| Playwright for UI tests | Correct long-term approach; curl HTTP smoke tests are faster to implement as first coverage |
| `docker-compose` instead of k3d | Michelangelo CRDs require a real Kubernetes API; k3d [2] provides this without cloud credentials |
| Sequential Cadence/Temporal runs | Doubles wall time to ~120 min; parallel matrix at 60 min is acceptable |
| Per-PR integration tests | ~60 min run time is too slow for PR feedback; nightly is the right cadence |
| GitHub-hosted larger runners (16-core) | MySQL cold-boot initialization takes 20+ min under load on shared runners; adds ~$3.84/run in runner cost. GCP VM self-hosted runner eliminates both constraints. |
Why not separate test jobs for Go, Python, and JS? The value of this test is precisely that it exercises all three together. Only a full-stack test catches cross-layer contract breakage (e.g., a Go CRD schema change that breaks the Python CLI).
Why k3d and not a real cloud cluster? k3d [2] runs entirely on the GitHub Actions runner — no cloud credentials, no persistent infra, no cost beyond the runner itself. See the platform section for details.
Why is prometheus/grafana excluded from CI? They are not needed for pipeline execution or UI smoke tests. Their images are slow to pull inside k3d and consume RAM needed for Ray/Spark task pods.
When will Ray and Spark tests be active? After lightweight ray_example and spark_example jobs are implemented and added to the demo pipeline YAMLs.
How do UI smoke tests work for a SPA? The envoy proxy serves the compiled React bundle. For a SPA, all routes return the same index.html with HTTP 200. The curl test validates that envoy, the UI container, and the routing config are all healthy.
Can the AI root cause step leak production data? No. It only sends CI logs (stderr/stdout) and source code diffs. No production databases, user data, or secrets are included. See Privacy section.
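The UI smoke test described in the FAQ above checks four route patterns for HTTP 200. A Python equivalent of the curl calls might look like the sketch below. The base URL and route patterns come from this document; the function names are hypothetical, and the project and run identifiers must be supplied by the caller from the test data created earlier.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

UI_BASE = "http://localhost:8090"


def smoke_urls(project_id, run_id, base=UI_BASE):
    """Build the four UI routes the smoke test checks."""
    return [
        f"{base}/",
        f"{base}/{project_id}",
        f"{base}/{project_id}/train/pipelines",
        f"{base}/{project_id}/train/runs/{run_id}",
    ]


def check(url, timeout_s=10.0):
    """Return the HTTP status for url, or None if the host is unreachable."""
    try:
        with urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except HTTPError as err:
        return err.code  # the server answered with a non-2xx status
    except OSError:
        return None  # connection refused, DNS failure, timeout


def failing(urls, fetch=check):
    """Return the URLs that did not answer with HTTP 200."""
    return [u for u in urls if fetch(u) != 200]
```

Because every SPA route serves the same `index.html`, an empty result from `failing` confirms that envoy, the UI container, and routing are healthy, not that the pages rendered real data.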
The integration test runs on a self-hosted GCP VM [7]. GitHub does not bill runner minutes for self-hosted runners — only GCP VM compute cost applies.
| Runner | Approx. GCP cost | Avg run time | Cost/run | Nightly/month |
|---|---|---|---|---|
| GCP VM (16 vCPU, 64 GB) — both engines parallel | ~$0.50–0.80/hr (e2-standard-16) | ~60 min | ~$0.50–0.80 | ~$15–25 |
| GCP VM (8 vCPU, 32 GB) — single engine | ~$0.25–0.40/hr (e2-standard-8) | ~60 min | ~$0.25–0.40 | ~$8–12 |
Compared to GitHub-hosted 16-core runners at ~$3.84/run × 2 engines = ~$7.68/run ($230+/month), the GCP VM self-hosted approach reduces CI cost by ~90% while providing more consistent and controllable resources.
If a pre-existing GCP VM is already allocated for other uses, the marginal CI cost is near zero (the VM is already running).
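The monthly figures and the ~90% saving quoted above can be reproduced with a few lines of arithmetic. The per-run costs are taken from the table and the comparison above; 30 nightly runs per month and a VM that is billed only while the test runs are assumptions of this sketch.

```python
NIGHTS_PER_MONTH = 30  # assumption: one nightly run every day


def monthly_cost(usd_per_run, runs_per_night=1, nights=NIGHTS_PER_MONTH):
    """Monthly cost for a fixed per-run price."""
    return usd_per_run * runs_per_night * nights


def savings_fraction(self_hosted_monthly, hosted_monthly):
    """Fraction of the hosted cost saved by self-hosting."""
    return 1 - self_hosted_monthly / hosted_monthly


# GitHub-hosted 16-core runner: $3.84 per run, one run per engine.
HOSTED = monthly_cost(3.84, runs_per_night=2)

# Self-hosted e2-standard-16, both engines in one ~60 min window.
GCP_LOW = monthly_cost(0.50)
GCP_HIGH = monthly_cost(0.80)
```

Under these assumptions the hosted option comes to about $230 per month against $15 to $24 for the GCP VM, a saving of roughly 90% to 93%. A VM that is left running around the clock would cost more than this estimate.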