fix: separate ServiceAccount for router workloads#433
fix: separate ServiceAccount for router workloads#433ambient-code[bot] wants to merge 8 commits intomainfrom
Conversation
|
Important Review skippedBot user detected. To trigger a single review, invoke the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
✅ Deploy Preview for jumpstarter-docs ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
| kind: Role | ||
| metadata: | ||
| labels: | ||
| app.kubernetes.io/name: jumpstarter-router |
There was a problem hiding this comment.
do not modify any helm, it has been deprecated.
There was a problem hiding this comment.
Agreed -- all Helm chart changes have been reverted (commit b1a519a). Only the operator Go code is modified now.
There was a problem hiding this comment.
Acknowledged -- all Helm chart changes were reverted in commit b1a519a. Only the operator Go code path is modified.
There was a problem hiding this comment.
Acknowledged -- reverted all Helm chart changes in commit b1a519a.
Status UpdateReview feedback addressed
CI failures analysisThe following CI jobs failed:
The Helm chart revert should resolve the helm-path failures. The operator-path changes (separate |
Root cause of CI failureThe router pod ( The router's Fix
This maintains the security goal of separating the router SA from the controller SA (the router SA has far fewer permissions than the controller SA) while providing the minimum permissions the router actually needs. |
CI Fix: Router ServiceAccount needs K8s API accessRoot cause: The router process calls Fix (commit 93ce272):
This maintains the security separation goal — the router SA has far fewer permissions than the controller SA — while granting the minimum permissions the router actually needs to function. |
| APIGroups: []string{""}, | ||
| Resources: []string{"secrets"}, | ||
| Verbs: []string{"get", "list", "watch"}, | ||
| }, | ||
| }, |
There was a problem hiding this comment.
does the router really need to get/list/watch secrets? why?
There was a problem hiding this comment.
Good catch -- it does not. The router only reads a ConfigMap via LoadRouterConfiguration(). Secrets access has been removed from the router Role (commit cdf6b75). The router Role now only grants get/list/watch on configmaps.
There was a problem hiding this comment.
Good catch -- you're right, the router has no need for secrets access. Removed in commit cdf6b75. The router Role now only grants get/list/watch on configmaps.
Review Status UpdateAll review feedback has been addressed:
The PR is ready for review. The operator deployment path now creates a separate |
Addressing review feedback from @raballewPushed f1a546e to fix the three comment nits:
Items acknowledged but deferred:
CI is all green (tests, e2e-test-operator, deploy-kind for both operator and helm paths all passing). |
The controller and router previously shared the same `controller-manager` ServiceAccount, giving the router unnecessary cluster-wide secrets CRUD access. This creates a dedicated `router-sa` ServiceAccount with no RBAC bindings and `automountServiceAccountToken: false`, following the principle of least privilege. Fixes #351 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Helm charts are deprecated; only the operator deployment path should be modified for the separate router ServiceAccount. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The router process needs Kubernetes API access at startup to load its configuration from a ConfigMap (via ctrl.GetConfigOrDie() and LoadRouterConfiguration). Setting AutomountServiceAccountToken: false on both the ServiceAccount and pod spec prevented the router from authenticating, causing the pod to crash and never become ready (180s timeout in CI). Changes: - Remove AutomountServiceAccountToken: false from router ServiceAccount and pod spec so the token is mounted - Add a minimal router Role granting read-only access to configmaps and secrets (the only resources the router needs) - Add a RoleBinding to bind the router Role to the router ServiceAccount This maintains the security goal of separating the router SA from the controller SA while granting only the minimum permissions needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The router only reads a ConfigMap via LoadRouterConfiguration() and does not access any secrets. Remove the secrets PolicyRule from the router Role per review feedback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix misleading "zero RBAC, no token automount" comment to "uses dedicated minimal Role" - Add missing comment explaining why SetControllerReference is not called on router SA - Fix createRouterServiceAccount doc comment to not reference RBAC permissions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Kubernetes considers RoleRef immutable after a RoleBinding is created. Replace the CreateOrUpdate pattern for RoleBindings with a dedicated reconcileRoleBinding helper that detects RoleRef changes and uses delete-and-recreate instead of in-place update. This applies to both the controller and router RoleBindings. Addresses review feedback from @raballew. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4a87d67 to
0e314ec
Compare
Add unit tests for router RBAC factory functions (createRouterServiceAccount, createRouterRole, createRouterRoleBinding) and reconcileRoleBinding covering all four code paths: create, delete-and-recreate, update, and no-op. Remove redundant Info-level log in the "unchanged" path of reconcileRoleBinding to reduce log noise during steady-state reconciliation. The V(1) debug log is sufficient. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addressing review feedback from @raballew (2nd round)Pushed 47d520a with two changes: 1. Unit tests for RBAC functions (HIGH)Added Factory functions (
Tests use the existing envtest infrastructure from 2. Remove noisy Info-level log (LOW)Removed the redundant |
| // RoleRef is immutable -- if it differs we must delete and recreate | ||
| if !equality.Semantic.DeepEqual(existing.RoleRef, desired.RoleRef) { | ||
| log.Info("RoleBinding RoleRef changed, deleting and recreating", | ||
| "name", existing.Name, | ||
| "namespace", existing.Namespace) | ||
| if err := r.Client.Delete(ctx, existing); err != nil { | ||
| log.Error(err, "Failed to delete RoleBinding for recreation", | ||
| "name", existing.Name, | ||
| "namespace", existing.Namespace) | ||
| return err | ||
| } | ||
| if err := controllerutil.SetControllerReference(jumpstarter, desired, r.Scheme); err != nil { | ||
| return err | ||
| } | ||
| if err := r.Client.Create(ctx, desired); err != nil { | ||
| log.Error(err, "Failed to recreate RoleBinding", | ||
| "name", desired.Name, | ||
| "namespace", desired.Namespace) | ||
| return err | ||
| } | ||
| log.Info("RoleBinding reconciled", | ||
| "name", desired.Name, | ||
| "namespace", desired.Namespace, | ||
| "operation", "recreated") | ||
| return nil |
There was a problem hiding this comment.
[low] Non-atomic delete-and-recreate for immutable RoleRef: if Delete succeeds on line 253 but Create fails on line 262, the RoleBinding will be absent until the next reconciliation loop. This is the standard K8s operator pattern for immutable fields and self-heals, so it is acceptable as-is. Just flagging for awareness in case you want to add a targeted integration test for this window.
AI-generated, human reviewed
There was a problem hiding this comment.
Acknowledged. The non-atomic window between Delete and Create is inherent to the immutable RoleRef pattern in Kubernetes. As you noted, the next reconciliation loop will self-heal. The risk is minimal given that this only triggers if RoleRef actually changes (which it does not under normal operation). No code change needed.
There was a problem hiding this comment.
Acknowledged. As noted, this is the standard K8s operator pattern for immutable fields and self-heals on next reconciliation. No code change needed.
There was a problem hiding this comment.
Acknowledged. The self-healing behavior on next reconciliation makes this acceptable. Added a log.Error for the SetControllerReference failure case in commit 74a9e9e to improve traceability during the gap window.
There was a problem hiding this comment.
Acknowledged -- this is the standard K8s operator pattern and self-heals on the next reconciliation loop, so leaving as-is. Good to have it documented here for awareness.
There was a problem hiding this comment.
Acknowledged. As you noted, this is the standard K8s operator pattern and self-heals on next reconciliation. No change needed here.
| "namespace", existingSA.Namespace, | ||
| "operation", op) | ||
|
|
||
| // Router ServiceAccount (uses dedicated minimal Role) | ||
| // Note: We intentionally do NOT set controller reference on ServiceAccount to prevent | ||
| // it from being garbage collected when the Jumpstarter CR is deleted | ||
| desiredRouterSA := r.createRouterServiceAccount(jumpstarter) | ||
|
|
||
| existingRouterSA := &corev1.ServiceAccount{} | ||
| existingRouterSA.Name = desiredRouterSA.Name | ||
| existingRouterSA.Namespace = desiredRouterSA.Namespace | ||
|
|
||
| op, err = controllerutil.CreateOrUpdate(ctx, r.Client, existingRouterSA, func() error { | ||
| if existingRouterSA.CreationTimestamp.IsZero() { | ||
| existingRouterSA.Labels = desiredRouterSA.Labels | ||
| existingRouterSA.Annotations = desiredRouterSA.Annotations | ||
| return nil | ||
| } | ||
|
|
||
| if !serviceAccountNeedsUpdate(existingRouterSA, desiredRouterSA) { | ||
| log.V(1).Info("Router ServiceAccount is up to date, skipping update", | ||
| "name", existingRouterSA.Name, | ||
| "namespace", existingRouterSA.Namespace) | ||
| return nil | ||
| } | ||
|
|
||
| existingRouterSA.Labels = desiredRouterSA.Labels | ||
| existingRouterSA.Annotations = desiredRouterSA.Annotations | ||
| return nil | ||
| }) | ||
|
|
||
| if err != nil { | ||
| log.Error(err, "Failed to reconcile Router ServiceAccount", | ||
| "name", desiredRouterSA.Name, | ||
| "namespace", desiredRouterSA.Namespace) | ||
| return err | ||
| } | ||
|
|
||
| log.Info("Router ServiceAccount reconciled", | ||
| "name", existingRouterSA.Name, | ||
| "namespace", existingRouterSA.Namespace, | ||
| "operation", op) | ||
|
|
||
| // Role | ||
| desiredRole := r.createRole(jumpstarter) | ||
|
|
There was a problem hiding this comment.
[low] The router SA reconciliation block (lines 64-109) follows an identical pattern to the controller SA block above (lines 23-62). Same goes for the two Role reconciliation blocks. Consider extracting a generic SA/Role reconciliation helper in a follow-up to reduce duplication.
AI-generated, human reviewed
There was a problem hiding this comment.
Agreed. A generic reconciliation helper would be a worthwhile follow-up refactor. Keeping it consistent with the existing pattern for now to avoid expanding the scope of this PR.
There was a problem hiding this comment.
Acknowledged. A generic reconciliation helper is a good follow-up refactor. Keeping consistent with the existing pattern for now.
There was a problem hiding this comment.
Agreed, the duplication is notable. A generic helper using Go generics or a shared closure is a good candidate for a follow-up refactoring PR.
There was a problem hiding this comment.
Agreed -- a generic reconciliation helper would reduce duplication. Keeping consistent with the existing pattern in this PR to minimize scope; worth tackling in a dedicated follow-up.
There was a problem hiding this comment.
Agreed -- extracting a generic SA/Role reconciliation helper would reduce duplication significantly. Tracking this as a follow-up refactor to keep the scope of this PR focused.
|
|
||
| // Router ServiceAccount (uses dedicated minimal Role) | ||
| // Note: We intentionally do NOT set controller reference on ServiceAccount to prevent |
There was a problem hiding this comment.
[low] The router SA intentionally skips SetControllerReference, so it persists after the Jumpstarter CR is deleted. This matches the existing controller SA pattern, which is fine. Worth noting that on rollback to a pre-PR operator version the router SA will remain orphaned (harmless, but good to document in release notes).
AI-generated, human reviewed
There was a problem hiding this comment.
Good point about the orphaned SA on rollback. The SA is harmless (no permissions without the corresponding Role/RoleBinding) but worth noting in release notes. Will include this in the release documentation.
There was a problem hiding this comment.
Acknowledged. The orphaned SA is harmless and will be documented in release notes.
There was a problem hiding this comment.
Good point about documenting the orphaned SA in release notes. The behavior is harmless (a ServiceAccount with no permissions) but worth calling out.
There was a problem hiding this comment.
Good point. The orphaned router SA is harmless (no permissions without the accompanying Role/RoleBinding which are owned by the CR), but worth noting in release notes for operators who roll back.
There was a problem hiding this comment.
Good point. The orphaned router SA on rollback is harmless but worth documenting. Will note this in release notes.
| if err := controllerutil.SetControllerReference(jumpstarter, desired, r.Scheme); err != nil { | ||
| return err |
There was a problem hiding this comment.
[low] In the delete-and-recreate path, if SetControllerReference fails here (after Delete already succeeded on line 253), the error is returned without a log entry explaining why the RoleBinding is now absent. Adding a brief log line could help debugging in the unlikely event of a scheme misconfiguration. Self-heals on next reconciliation either way.
AI-generated, human reviewed
There was a problem hiding this comment.
Good catch -- will add a log line before returning the error so the absent RoleBinding state is traceable in operator logs.
There was a problem hiding this comment.
Fixed in commit 74a9e9e -- added log.Error when SetControllerReference fails after the old RoleBinding has already been deleted, making the transient absent state traceable in operator logs.
There was a problem hiding this comment.
Added in commit 74a9e9e -- log.Error now fires when SetControllerReference fails after the deletion, explaining why the RoleBinding is absent until next reconciliation.
There was a problem hiding this comment.
Fixed in commit 74a9e9e -- added a log.Error line when SetControllerReference fails after the delete, so the absent RoleBinding state is logged before the error is returned.
There was a problem hiding this comment.
Fixed in commit 74a9e9e -- added a log.Error when SetControllerReference fails after the old RoleBinding has been deleted, making the transient absent state traceable in operator logs.
| It("should be a no-op when the RoleBinding already matches", func() { | ||
| // Create it first | ||
| desired := r.createRouterRoleBinding(js) | ||
| err := r.reconcileRoleBinding(ctx, js, desired) | ||
| Expect(err).NotTo(HaveOccurred()) | ||
|
|
||
| // Reconcile again with same desired state -- should be a no-op | ||
| desired2 := r.createRouterRoleBinding(js) | ||
| err = r.reconcileRoleBinding(ctx, js, desired2) | ||
| Expect(err).NotTo(HaveOccurred()) | ||
|
|
||
| // Verify it still exists and is unchanged | ||
| existing := &rbacv1.RoleBinding{} | ||
| err = k8sClient.Get(ctx, types.NamespacedName{ | ||
| Name: "rbac-test-router-rolebinding", | ||
| Namespace: "default", | ||
| }, existing) | ||
| Expect(err).NotTo(HaveOccurred()) | ||
| Expect(existing.RoleRef.Name).To(Equal("rbac-test-router-role")) | ||
| }) |
There was a problem hiding this comment.
[low] This no-op test verifies the RoleBinding still exists and has the correct RoleRef, but does not assert that ResourceVersion is unchanged. Capturing the ResourceVersion before and after the second reconcile and comparing them would prove that no unnecessary API write occurred. Nice-to-have improvement.
AI-generated, human reviewed
There was a problem hiding this comment.
Good suggestion -- will capture ResourceVersion before and after reconciliation and assert they are equal to prove no unnecessary API write occurs.
There was a problem hiding this comment.
Fixed in commit 74a9e9e -- the no-op test now captures ResourceVersion before and after reconciliation and asserts they are equal, proving no unnecessary API write occurred.
There was a problem hiding this comment.
Added in commit 74a9e9e -- the no-op test now captures and asserts that ResourceVersion is unchanged after the second reconciliation, proving no unnecessary API write occurred.
There was a problem hiding this comment.
Fixed in commit 74a9e9e -- the no-op test now captures ResourceVersion before and after the second reconcile and asserts they are equal, proving no unnecessary API write occurred.
There was a problem hiding this comment.
Fixed in commit 74a9e9e -- the no-op test now captures ResourceVersion before and after reconciliation and asserts they are equal, proving no unnecessary API write occurred.
| Expect(recreated.RoleRef.Name).To(Equal("different-role")) | ||
| Expect(recreated.UID).NotTo(Equal(originalUID)) | ||
| }) | ||
| }) |
There was a problem hiding this comment.
[medium] The tests cover the happy-path factory functions and reconcileRoleBinding scenarios well. Error handling paths (e.g., Get/Create/Update/Delete failures) are untested since they would require a mock client. Also, the existing controller factory functions (createServiceAccount, createRole, createRoleBinding) have no corresponding unit tests.
AI-generated, human reviewed
There was a problem hiding this comment.
Agreed, error handling paths would require a mock client and are a pre-existing gap that applies equally to the controller RBAC functions. Adding mock-based error path tests would significantly expand this PR. Best addressed in a dedicated follow-up that covers both controller and router RBAC uniformly.
There was a problem hiding this comment.
Acknowledged. Error handling paths requiring a mock client are deferred to a follow-up PR that would cover both controller and router RBAC together.
There was a problem hiding this comment.
Acknowledged. Error-path testing with a mock client and coverage for the existing controller factory functions would be valuable but represents a broader effort beyond this PR's scope. Noted for a follow-up.
There was a problem hiding this comment.
Agreed that error-handling paths and controller factory function tests would improve coverage. However, adding mock client infrastructure is a meaningful increase in scope and complexity for this PR -- the existing controller RBAC has no unit tests either, so introducing a full mock client setup feels like it belongs in a dedicated follow-up.
The current tests cover the factory function outputs and all four reconcileRoleBinding code paths (create, delete-and-recreate, update, no-op) using a real envtest API server, which gives us good confidence in the happy paths. Happy to file a follow-up issue for mock-client-based error path testing if that would be useful.
There was a problem hiding this comment.
Agreed -- error handling paths would require injecting a mock client, which adds complexity beyond the scope of this PR. The existing tests cover the factory functions and all four reconcileRoleBinding code paths (create, no-op, update, delete-and-recreate). Adding mock-based error path tests and covering the controller factory functions would be good follow-up work.
There was a problem hiding this comment.
Acknowledged. Error handling paths (mock client tests) and coverage for existing controller factory functions are deferred to a follow-up PR covering both controller and router RBAC. The scope here is already significant and the happy-path coverage provides a solid baseline.
Fix plan for 3rd round review feedbackTwo actionable changes from @raballew's latest review:
CI failure ( |
…op test - Add log.Error when SetControllerReference fails after RoleBinding deletion, making the absent RoleBinding state traceable in operator logs - Assert ResourceVersion is unchanged in the no-op reconciliation test to prove no unnecessary API write occurred Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addressing review feedback from @raballew (3rd round)Pushed 74a9e9e with two changes: 1. Log entry for absent RoleBinding state (LOW)Added 2. ResourceVersion assertion in no-op test (LOW)The no-op test now captures ResourceVersion before and after reconciliation and asserts they are equal, proving no unnecessary API write occurred. Acknowledged (no code change needed):
CI note:The |
Summary
router-saServiceAccount with minimal RBAC permissions for router workloads in the operator deployment path onlycontroller-managerSA with full RBAC (secrets CRUD, CRD access, leader election)Fixes #351
Changes
Operator (only deployment path modified)
rbac.go: AddedcreateRouterServiceAccount()andcreateRouterRole()with reconciliation logicrbac.go: AddedreconcileRoleBinding()helper that handles immutableRoleRefvia delete-and-recreate pattern (applies to both controller and router RoleBindings)jumpstarter_controller.go: UpdatedcreateRouterDeployment()to use{name}-router-saget/list/watchonconfigmapsonly (no secrets access)Tests
rbac_test.go: Unit tests for factory functions (createRouterServiceAccount,createRouterRole,createRouterRoleBinding) andreconcileRoleBindingcovering all four code paths (create, no-op, update, delete-and-recreate)Helm chart
Review feedback addressed
secretspermission from router Role (commit 8de3816) - router only reads ConfigMapsRoleRef(commit 4a87d67) - per @raballew's reviewreconcileRoleBinding(commit 47d520a) - per @raballew's reviewreconcileRoleBindingunchanged path (commit 47d520a) - per @raballew's reviewTest plan
router-sawith minimal RBACcontroller-managerSA🤖 Generated with Claude Code