
[Feat] Add sandbox_mounts cluster config field with r/w permission support#1350

Open
sgunasekar wants to merge 4 commits into main from custom_sandbox_mount_paths

Conversation

@sgunasekar
Collaborator

@sgunasekar sgunasekar commented Apr 8, 2026

Introduces a new optional sandbox_mounts key in cluster configs that lets users specify exactly which paths are mounted into the sandbox container, with optional read/write mode (src:dst or src:dst:ro / src:dst:rw).
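
For illustration, a config using the new key might look like the following (a sketch only; the paths are hypothetical, not taken from this PR):

```yaml
# Hypothetical cluster config fragment illustrating the sandbox_mounts syntax.
sandbox_mounts:
  - /lustre/datasets:/datasets:ro   # read-only inside the sandbox
  - ${HOME}/scratch:/scratch:rw     # read-write; ${VAR} is expanded at launch
  - /workspace/tools:/tools         # mode omitted -> container default
```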

Mount resolution priority (highest to lowest):

  1. sandbox_mounts in config (non-empty) — used as-is, overrides --keep_mounts_for_sandbox
  2. keep_mounts_for_sandbox=True with no sandbox_mounts — inherit all cluster mounts
  3. keep_mounts_for_sandbox=False with no sandbox_mounts — no mounts (safe default)
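
The three-tier priority above can be sketched as follows. This is an illustrative re-implementation, not the actual nemo_skills code; the real logic lives in `get_sandbox_mounts_from_config` and the call sites this PR updates:

```python
def resolve_sandbox_mounts(cluster_config: dict, keep_mounts_for_sandbox: bool):
    """Sketch of the priority above: list = explicit, None = inherit, [] = none."""
    sandbox_mounts = cluster_config.get("sandbox_mounts")
    if sandbox_mounts:  # 1. non-empty config key wins over the CLI flag
        return list(sandbox_mounts)
    if keep_mounts_for_sandbox:  # 2. inherit all cluster mounts
        return None
    return []  # 3. safe default: nothing mounted into the sandbox
```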

Fully backward compatible: without sandbox_mounts in the cluster config, existing configs and the CLI flags --with_sandbox and --keep_mounts_for_sandbox behave exactly as before.

Code design choices

  • Both the declarative pipeline (Command.prepare_for_execution) and the legacy add_task path in exp.py are updated.
  • Source paths in sandbox_mounts are validated at launch time via run_exp, consistent with how regular mounts are checked.
  • Specifying sandbox_mounts overrides the --keep_mounts_for_sandbox flag as a safer priority.
  • Intentionally kept this simple rather than introducing more fine-grained mount resolution logic; in the future we can add more flags/sub-keys to selectively control mount paths at runtime.

Summary by CodeRabbit

  • New Features

    • Added sandbox_mounts config to explicitly control mounts for sandbox containers (supports src:dst and src:dst:mode with ro/rw).
  • Behavior Changes

    • sandbox_mounts now takes precedence over the previous keep-mounts behavior; when absent, previous fallback remains.
    • Mount entries support environment-variable expansion and stricter validation.
  • Documentation

    • Added docs describing syntax, precedence, and guidance.
  • Tests

    • Added tests for parsing, expansion, validation, and precedence.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 8, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds an optional cluster-config key sandbox_mounts, a resolver that expands ${VAR} and legacy {VAR} and validates optional :mode, changes sandbox-mount precedence (config > keep_mounts flag), integrates mount checks when configured, exposes the resolver, and adds tests covering precedence and expansion.

Changes

Cohort / File(s) Summary
Configuration Examples
cluster_configs/example-local.yaml, cluster_configs/example-slurm.yaml
Added commented examples and syntax notes for sandbox_mounts (src:dst or src:dst:mode with ro/rw) and clarified fallback to keep_mounts_for_sandbox when absent.
User Documentation
docs/basics/cluster-configs.md
Documented mounts vs sandbox_mounts, precedence rules, accepted mount string formats, :mode constraints, and guidance for when to use each option.
Mount Parsing & Helpers
nemo_skills/pipeline/utils/mounts.py
Added _resolve_mount_path_with_env_variables() to expand/validate ${VAR} and {VAR}, enforce src:dst or src:dst:mode formats and ro/rw modes; refactored get_mounts_from_config() to use the helper; added get_sandbox_mounts_from_config() which returns None when key absent and allows :mode.
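A rough sketch of the validation described here (function name, error messages, and the exact order of checks are illustrative; the real helper in mounts.py also handles the legacy {VAR} form, which this sketch omits):

```python
import os

def parse_mount(mount: str):
    """Validate `src:dst` or `src:dst:mode` after env-var expansion (sketch only)."""
    expanded = os.path.expandvars(mount)
    # Reject anything that still looks like an unresolved placeholder.
    if "$" in expanded or "{" in expanded or "}" in expanded:
        raise ValueError(f"Mount `{mount}` has unresolved placeholders")
    parts = expanded.split(":")
    if len(parts) == 2:
        src, dst, mode = parts[0], parts[1], None
    elif len(parts) == 3:
        src, dst, mode = parts
        if mode not in ("ro", "rw"):
            raise ValueError(f"Mount mode must be ro or rw, got `{mode}`")
    else:
        raise ValueError(f"Mount `{mount}` must be src:dst or src:dst:mode")
    # Explicitly reject empty pieces such as ":/dst" or "src:".
    if not src or not dst:
        raise ValueError(f"Mount `{mount}` has an empty source or target")
    return src, dst, mode
```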
Public Exports
nemo_skills/pipeline/utils/__init__.py
Re-exported get_sandbox_mounts_from_config into the nemo_skills.pipeline.utils namespace.
Pipeline Integration
nemo_skills/pipeline/utils/declarative.py, nemo_skills/pipeline/utils/exp.py
Replaced direct keep_mounts mapping with priority-based resolution using get_sandbox_mounts_from_config(); sandbox mounts now resolves to configured list, None (inherit), or [] (none); added pre-run remote-mount directory checks when sandbox_mounts provided.
Tests
tests/test_pipeline_utils.py
Added tests for sandbox mount precedence (config vs keep_mounts_for_sandbox), ${VAR} expansion in sandbox_mounts, and executor mounts selection semantics (None vs [] vs explicit list).

Sequence Diagram(s)

sequenceDiagram
    participant Config as Cluster Config
    participant Resolver as Mount Resolver
    participant Pipeline as Execution Pipeline
    participant RemoteCheck as Remote Mount Checker
    participant Executor as Sandbox Executor

    Pipeline->>Config: read cluster_config
    Pipeline->>Resolver: get_sandbox_mounts_from_config(cluster_config)
    alt sandbox_mounts present
        Resolver->>Resolver: expand ${VAR}, handle {VAR}, validate src:dst(:mode)
        Resolver-->>Pipeline: resolved mounts list
        Pipeline->>RemoteCheck: extract mount sources and verify remote dirs
    else sandbox_mounts absent
        Resolver-->>Pipeline: return None
        Pipeline->>Pipeline: consult keep_mounts_for_sandbox flag
    end
    Pipeline->>Executor: create sandbox executor with mounts (list / None / [])
    Executor-->>Pipeline: executor ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately describes the main feature addition: a new cluster config field sandbox_mounts with read/write permission support for sandbox container mounts.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
nemo_skills/pipeline/utils/mounts.py (1)

515-518: Use direct access for sandbox_mounts after the presence check.

Once the guard above confirms the key exists, .get("sandbox_mounts", []) just reintroduces a silent default on a required field.

Proposed fix
-    mounts = cluster_config.get("sandbox_mounts", [])
+    mounts = cluster_config["sandbox_mounts"]

As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/mounts.py` around lines 515 - 518, The code checks
for the presence of "sandbox_mounts" in cluster_config but then uses
cluster_config.get("sandbox_mounts", []), which reintroduces a silent default;
replace that .get call with direct access cluster_config["sandbox_mounts"]
(assign to mounts) so the code fails loudly if the key is missing and preserves
the earlier presence check; update the variable assignment that sets mounts to
use cluster_config["sandbox_mounts"] instead of .get.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/utils/declarative.py`:
- Around line 249-266: The sandbox-only mounts are being applied to every script
because keep_mounts defaults True and there is no check that the script is a
SandboxScript; update the mount resolution in the block that uses
get_sandbox_mounts_from_config(cluster_config) so sandbox_mounts is only used
when self.script is a SandboxScript (e.g., guard with isinstance(self.script,
SandboxScript) or an equivalent sandbox-identifying attribute) and otherwise
fall back to the original behavior (keep_mounts -> None to inherit cluster
mounts, keep_mounts False -> []), ensuring execution_config["mounts"] is set
from exec_mounts accordingly.

In `@nemo_skills/pipeline/utils/exp.py`:
- Around line 803-812: The preflight mount check currently runs whenever
"sandbox_mounts" exists in cluster_config, causing unrelated SLURM jobs to fail;
modify run_exp (or the code path that handles SLURM jobs) to only call
get_sandbox_mounts_from_config and check_remote_mount_directories when a sandbox
sidecar will actually be launched (e.g., guard the existing block with the
sandbox-enabled condition used elsewhere or a boolean like is_sandbox_requested
derived from the experiment spec), keep the same use of exit_if_failure and
check_remote_mount_directories(mount_sources, cluster_config,
exit_on_failure=...), and ensure you reference the existing functions
sandbox_mounts, get_sandbox_mounts_from_config, and
check_remote_mount_directories so the check is executed only for sandbox runs.

In `@nemo_skills/pipeline/utils/mounts.py`:
- Around line 115-143: The code currently indexes mount_source[0] /
mount_target[0] which causes IndexError for empty pieces (e.g., ":/dst" or
"src:"); update the validation in the mounts parsing logic to explicitly reject
empty mount_source or mount_target before any indexing: after splitting into
mount_source, mount_target, mount_mode (and before the "{...}" handling) check
if mount_source == "" or mount_target == "" and raise a clear ValueError
describing the malformed mount entry; keep the existing mount_mode validation
and the subsequent env-var resolution for mount_source and mount_target
unchanged, but perform the emptiness check first so functions/variables named
mount_source, mount_target, and mount_mode are protected from IndexError.

In `@tests/test_pipeline_utils.py`:
- Around line 516-546: The tests call get_executor without required parameters
so they raise TypeError before reaching mount assertions; update
test_get_executor_empty_mounts_overrides_config and
test_get_executor_explicit_mounts_overrides_config to pass the same required
args used elsewhere (e.g. num_nodes=1, tasks_per_node=1, gpus_per_node=0,
job_name="test", log_dir="/tmp") when calling get_executor so the mounts
behavior is actually exercised.

---

Nitpick comments:
In `@nemo_skills/pipeline/utils/mounts.py`:
- Around line 515-518: The code checks for the presence of "sandbox_mounts" in
cluster_config but then uses cluster_config.get("sandbox_mounts", []), which
reintroduces a silent default; replace that .get call with direct access
cluster_config["sandbox_mounts"] (assign to mounts) so the code fails loudly if
the key is missing and preserves the earlier presence check; update the variable
assignment that sets mounts to use cluster_config["sandbox_mounts"] instead of
.get.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6e7b338f-3e6e-41b0-8fd0-c139d696ec6b

📥 Commits

Reviewing files that changed from the base of the PR and between 74650dd and 96ff3fc.

📒 Files selected for processing (8)
  • cluster_configs/example-local.yaml
  • cluster_configs/example-slurm.yaml
  • docs/basics/cluster-configs.md
  • nemo_skills/pipeline/utils/__init__.py
  • nemo_skills/pipeline/utils/declarative.py
  • nemo_skills/pipeline/utils/exp.py
  • nemo_skills/pipeline/utils/mounts.py
  • tests/test_pipeline_utils.py

Comment thread nemo_skills/pipeline/utils/exp.py
Comment on lines +803 to +812
if "sandbox_mounts" in cluster_config:
    sandbox_mounts = get_sandbox_mounts_from_config(cluster_config)
    mount_sources = [m.split(":")[0] for m in sandbox_mounts]
    LOG.info("Checking sandbox mount paths: %s", mount_sources)
    exit_if_failure = os.environ.get("NEMO_SKILLS_DISABLE_MOUNT_CHECK", "False").lower() not in (
        "1",
        "true",
        "yes",
    )
    check_remote_mount_directories(mount_sources, cluster_config, exit_on_failure=exit_if_failure)
Contributor

@coderabbitai coderabbitai Bot Apr 8, 2026


⚠️ Potential issue | 🟠 Major

Don't make sandbox_mounts a global preflight requirement.

run_exp() now validates these paths for every SLURM run that uses a cluster config containing sandbox_mounts, even if the experiment never launches a sandbox sidecar. That can fail unrelated training/inference jobs on sandbox-only paths, so this check needs to be gated on actual sandbox usage rather than config presence alone.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/exp.py` around lines 803 - 812, The preflight
mount check currently runs whenever "sandbox_mounts" exists in cluster_config,
causing unrelated SLURM jobs to fail; modify run_exp (or the code path that
handles SLURM jobs) to only call get_sandbox_mounts_from_config and
check_remote_mount_directories when a sandbox sidecar will actually be launched
(e.g., guard the existing block with the sandbox-enabled condition used
elsewhere or a boolean like is_sandbox_requested derived from the experiment
spec), keep the same use of exit_if_failure and
check_remote_mount_directories(mount_sources, cluster_config,
exit_on_failure=...), and ensure you reference the existing functions
sandbox_mounts, get_sandbox_mounts_from_config, and
check_remote_mount_directories so the check is executed only for sandbox runs.

Collaborator Author


This is a more defensive check that fails fast before launch. It also mirrors the existing checks for "mounts" and is easier to maintain. The alternative of tracing actual sandbox usage is more complex, with marginal benefit.

Contributor


@sgunasekar, that's a fair point. The symmetry with the existing mounts preflight check is a valid reason to keep this pattern, and the NEMO_SKILLS_DISABLE_MOUNT_CHECK escape hatch already covers cases where the check needs to be bypassed. I'll defer to your judgment here.


✏️ Learnings added
Learnt from: sgunasekar
Repo: NVIDIA-NeMo/Skills PR: 1350
File: nemo_skills/pipeline/utils/exp.py:803-812
Timestamp: 2026-04-08T09:06:24.771Z
Learning: In `nemo_skills/pipeline/utils/exp.py`, `run_exp` intentionally runs `sandbox_mounts` path validation as a preflight check whenever `sandbox_mounts` is present in `cluster_config`, regardless of whether a sandbox sidecar is actually launched in that experiment. This mirrors the existing `mounts` check pattern and is deliberately chosen for simplicity and early failure. The `NEMO_SKILLS_DISABLE_MOUNT_CHECK` env var provides an opt-out.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: ludwig-n
Repo: NVIDIA-NeMo/Skills PR: 1310
File: nemo_skills/inference/eval/swebench.py:415-417
Timestamp: 2026-03-17T18:33:03.834Z
Learning: In `nemo_skills/inference/eval/swebench.py`, the `cp -r {container_repo_dir} /testbed` command inside `_execute_container_command` is intentionally guarded by `mode == "agent"`. Agent frameworks (SWE-agent, OpenHands, mini-swe-agent) all expect the repo at `/testbed`, so the copy is required for agent mode. For eval mode, custom eval harnesses are expected to find the repo at its original `container_repo_dir` location. Each `_execute_container_command` call runs in a fresh `apptainer exec --writable-tmpfs` container, so nothing persists between agent and eval invocations anyway.

Comment thread nemo_skills/pipeline/utils/mounts.py
Comment thread tests/test_pipeline_utils.py
sgunasekar added a commit that referenced this pull request Apr 8, 2026
Fixes raised in #1350:

- declarative.py: guard sandbox_mounts logic with hasattr(script, 'keep_mounts')
  so non-sandbox scripts (server, client) always inherit cluster mounts instead
  of receiving sandbox_mounts — sandbox_mounts was incorrectly applied to all
  scripts after the precedence change
- mounts.py: use direct dict access cluster_config["sandbox_mounts"] after the
  presence check instead of .get() with a silent default
- mounts.py: raise ValueError on empty source or target in a mount string
  (e.g. ":/dst" or "src:") before indexing, avoiding IndexError
- tests: add required positional args to get_executor tests (num_nodes,
  tasks_per_node, gpus_per_node, job_name, log_dir)

Signed-off-by: suriya <sgunasekar@nvidia.com>
@sgunasekar sgunasekar force-pushed the custom_sandbox_mount_paths branch from 54a9f76 to 40e189d Compare April 8, 2026 07:23
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/test_pipeline_utils.py (1)

399-406: ⚠️ Potential issue | 🟡 Minor

Exercise the non-sandbox guard with sandbox_mounts present.

This test never hits the regression fixed in the PR because cluster_config omits sandbox_mounts. Add a sandbox_mounts entry here and keep asserting mounts is None, otherwise Command.prepare_for_execution() could start forwarding sandbox-only mounts to regular scripts without this test failing.

Suggested test adjustment
-    _, exec_config = cmd.prepare_for_execution({"executor": "slurm"})
+    _, exec_config = cmd.prepare_for_execution(
+        {
+            "executor": "slurm",
+            "sandbox_mounts": ["/host/data:/sandbox/data:ro"],
+        }
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_pipeline_utils.py` around lines 399 - 406, The test
test_non_sandbox_command_mounts_unchanged doesn't include a sandbox_mounts entry
so it never exercises the non-sandbox guard; update the test to pass a
cluster_config that includes a sandbox_mounts value (e.g., {"sandbox_mounts":
[...]}) into Command.prepare_for_execution so the code path that ensures
non-SandboxScript commands keep mounts=None is exercised, and continue asserting
exec_config["mounts"] is None; locate the test and the call to
Command.prepare_for_execution and add the sandbox_mounts entry to the provided
cluster_config.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_pipeline_utils.py`:
- Around line 579-673: Add a test that asserts configured sandbox_mounts
override keep_mounts_for_sandbox=True for the legacy add_task path: call
add_task (function add_task) with cluster_config containing "sandbox_mounts":
["/host/data:/sandbox/data:ro"], with_sandbox=True and
keep_mounts_for_sandbox=True, patch get_executor and get_free_port as in the
other tests, then inspect mock_get_executor.call_args_list[-1].kwargs["mounts"]
and assert it equals the provided sandbox_mounts (not None); place this
alongside the existing three tests so the legacy path is covered.

---

Outside diff comments:
In `@tests/test_pipeline_utils.py`:
- Around line 399-406: The test test_non_sandbox_command_mounts_unchanged
doesn't include a sandbox_mounts entry so it never exercises the non-sandbox
guard; update the test to pass a cluster_config that includes a sandbox_mounts
value (e.g., {"sandbox_mounts": [...]}) into Command.prepare_for_execution so
the code path that ensures non-SandboxScript commands keep mounts=None is
exercised, and continue asserting exec_config["mounts"] is None; locate the test
and the call to Command.prepare_for_execution and add the sandbox_mounts entry
to the provided cluster_config.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6f4f4a99-3fa1-4df6-82c1-7d698a4744b4

📥 Commits

Reviewing files that changed from the base of the PR and between 54a9f76 and 40e189d.

📒 Files selected for processing (3)
  • nemo_skills/pipeline/utils/declarative.py
  • nemo_skills/pipeline/utils/mounts.py
  • tests/test_pipeline_utils.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • nemo_skills/pipeline/utils/declarative.py
  • nemo_skills/pipeline/utils/mounts.py

Comment thread tests/test_pipeline_utils.py
sgunasekar added a commit that referenced this pull request Apr 8, 2026
…x=True

Raised in review of PR #1350: the add_task path had no test for the
highest-priority rule — sandbox_mounts must win even when
keep_mounts_for_sandbox=True. Without it, that regression could slip
through while the declarative path stays green.

Signed-off-by: suriya <sgunasekar@nvidia.com>
sgunasekar added a commit that referenced this pull request Apr 13, 2026
Fixes raised in #1350:

- declarative.py: guard sandbox_mounts logic with hasattr(script, 'keep_mounts')
  so non-sandbox scripts (server, client) always inherit cluster mounts instead
  of receiving sandbox_mounts — sandbox_mounts was incorrectly applied to all
  scripts after the precedence change
- mounts.py: use direct dict access cluster_config["sandbox_mounts"] after the
  presence check instead of .get() with a silent default
- mounts.py: raise ValueError on empty source or target in a mount string
  (e.g. ":/dst" or "src:") before indexing, avoiding IndexError
- tests: add required positional args to get_executor tests (num_nodes,
  tasks_per_node, gpus_per_node, job_name, log_dir)

Signed-off-by: suriya <sgunasekar@nvidia.com>
sgunasekar added a commit that referenced this pull request Apr 13, 2026
…x=True

Raised in review of PR #1350: the add_task path had no test for the
highest-priority rule — sandbox_mounts must win even when
keep_mounts_for_sandbox=True. Without it, that regression could slip
through while the declarative path stays green.

Signed-off-by: suriya <sgunasekar@nvidia.com>
@sgunasekar sgunasekar force-pushed the custom_sandbox_mount_paths branch from bf00b0f to db33358 Compare April 13, 2026 16:25
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/test_pipeline_utils.py (1)

409-416: ⚠️ Potential issue | 🟡 Minor

Add the non-sandbox regression case to this matrix.

These new cases only cover SandboxScript. The earlier failure mode here was a normal Command picking up sandbox_mounts when that key existed, and test_non_sandbox_command_mounts_unchanged() still doesn't exercise that configuration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_pipeline_utils.py` around lines 409 - 416, Add a regression test
case covering the non-sandbox (Command) path where sandbox_mounts exists in the
config but should not affect Command mounts: update or extend
test_non_sandbox_command_mounts_unchanged to include a scenario with
sandbox_mounts present (both keep_mounts_for_sandbox true/false as needed) and
assert that the Command object's mounts remain unchanged; also ensure the matrix
comment above the SandboxScript tests documents this non-sandbox case so future
readers see that Commands must ignore sandbox_mounts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/utils/mounts.py`:
- Around line 96-104: The current check after os.path.expandvars in mounts.py
only detects unresolved "${VAR}" patterns but misses bare "$VAR" placeholders;
update the conditional in the block that processes the local variable mount (the
code using os.path.expandvars(mount)) to use the same detection logic as
_resolve_path_placeholders — i.e., treat any leftover "$" or unbalanced "{" or
"}" as an unresolved placeholder — so replace the existing if condition with one
that checks if "$" in mount or "{" in mount or "}" in mount and then raise the
same ValueError when true.

---

Duplicate comments:
In `@tests/test_pipeline_utils.py`:
- Around line 409-416: Add a regression test case covering the non-sandbox
(Command) path where sandbox_mounts exists in the config but should not affect
Command mounts: update or extend test_non_sandbox_command_mounts_unchanged to
include a scenario with sandbox_mounts present (both keep_mounts_for_sandbox
true/false as needed) and assert that the Command object's mounts remain
unchanged; also ensure the matrix comment above the SandboxScript tests
documents this non-sandbox case so future readers see that Commands must ignore
sandbox_mounts.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: e367244c-47e9-4f30-a9db-ec3429697fef

📥 Commits

Reviewing files that changed from the base of the PR and between bf00b0f and db33358.

📒 Files selected for processing (8)
  • cluster_configs/example-local.yaml
  • cluster_configs/example-slurm.yaml
  • docs/basics/cluster-configs.md
  • nemo_skills/pipeline/utils/__init__.py
  • nemo_skills/pipeline/utils/declarative.py
  • nemo_skills/pipeline/utils/exp.py
  • nemo_skills/pipeline/utils/mounts.py
  • tests/test_pipeline_utils.py
✅ Files skipped from review due to trivial changes (4)
  • cluster_configs/example-slurm.yaml
  • cluster_configs/example-local.yaml
  • nemo_skills/pipeline/utils/exp.py
  • docs/basics/cluster-configs.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/pipeline/utils/__init__.py

Comment on lines +96 to +104
# First, expand any ${VAR} style environment variables in the entire mount string
# This allows partial path substitution like /path/${USER}/subdir
if "$" in mount:
    mount = os.path.expandvars(mount)
    if "${" in mount and "}" in mount:
        raise ValueError(
            f"Mount `{mount}` contains env variable placeholders, but the required env var is "
            f"not provided in the environment to resolve."
        )
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import os

for raw in ("${UNSET_SKILLS_VAR}:/dst", "$UNSET_SKILLS_VAR:/dst"):
    print(f"{raw} -> {os.path.expandvars(raw)}")
PY

sed -n '96,104p' nemo_skills/pipeline/utils/mounts.py

Repository: NVIDIA-NeMo/Skills

Length of output: 632


🏁 Script executed:

wc -l nemo_skills/pipeline/utils/mounts.py

Repository: NVIDIA-NeMo/Skills

Length of output: 102


🏁 Script executed:

cat -n nemo_skills/pipeline/utils/mounts.py | head -150

Repository: NVIDIA-NeMo/Skills

Length of output: 7242


🏁 Script executed:

# Search for tests related to mount validation
fd -t f "*test*mount*" -o -name "*mount*test*"

Repository: NVIDIA-NeMo/Skills

Length of output: 292


🏁 Script executed:

# Check for any comments or design notes about $VAR vs ${VAR}
rg -n "VAR|expandvars" nemo_skills/pipeline/utils/mounts.py -A 2 -B 2

Repository: NVIDIA-NeMo/Skills

Length of output: 1441


Fix unresolved $VAR placeholder check to match _resolve_path_placeholders behavior.

The validation at line 100 only rejects ${VAR} form but misses bare $VAR style. Since os.path.expandvars() processes both forms and leaves both unchanged when the env var is unset, a mount like $UNSET_SKILLS_VAR:/dst passes validation—inconsistent with _resolve_path_placeholders (line 63) which checks for any leftover $. This allows typos to become runtime errors instead of config-time errors.

Change the check from:

if "${" in mount and "}" in mount:

to match the pattern in _resolve_path_placeholders:

if "$" in mount or "{" in mount or "}" in mount:
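
The gap can be demonstrated directly with a standalone snippet mirroring the two checks discussed above (function names here are illustrative, not from the PR):

```python
import os

# os.path.expandvars() leaves both `$VAR` and `${VAR}` unchanged when the
# variable is unset, but the braced-only check catches just the second form.
os.environ.pop("UNSET_SKILLS_VAR", None)

def braced_only_check(mount: str) -> bool:
    return "${" in mount and "}" in mount

def strict_check(mount: str) -> bool:
    return "$" in mount or "{" in mount or "}" in mount

bare = os.path.expandvars("$UNSET_SKILLS_VAR:/dst")
assert bare == "$UNSET_SKILLS_VAR:/dst"  # unchanged: env var is unset
print(braced_only_check(bare))  # False -> the typo slips through
print(strict_check(bare))       # True  -> caught at config time
```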
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/mounts.py` around lines 96 - 104, The current
check after os.path.expandvars in mounts.py only detects unresolved "${VAR}"
patterns but misses bare "$VAR" placeholders; update the conditional in the
block that processes the local variable mount (the code using
os.path.expandvars(mount)) to use the same detection logic as
_resolve_path_placeholders — i.e., treat any leftover "$" or unbalanced "{" or
"}" as an unresolved placeholder — so replace the existing if condition with one
that checks if "$" in mount or "{" in mount or "}" in mount and then raise the
same ValueError when true.

Introduces a new optional `sandbox_mounts` key in cluster configs that lets
users specify exactly which paths are mounted into the sandbox container, with
optional read/write mode (`src:dst` or `src:dst:ro` / `src:dst:rw`).

Mount resolution priority (highest to lowest):
  1. `sandbox_mounts` in config (non-empty) — used as-is, overrides keep_mounts_for_sandbox
  2. `keep_mounts_for_sandbox=True` with no sandbox_mounts — inherit all cluster mounts
  3. `keep_mounts_for_sandbox=False` with no sandbox_mounts — no mounts (safe default)

This makes the risky full-cluster-mount inheritance opt-out when sandbox_mounts
is explicitly configured, while remaining fully backward compatible: existing
configs without sandbox_mounts behave identically to before.

Both the declarative pipeline (Command.prepare_for_execution) and the legacy
add_task path in exp.py are updated. Sandbox mount paths are validated at
launch time via run_exp, consistent with how regular mounts are checked.

Signed-off-by: suriya <sgunasekar@nvidia.com>
Fixes raised in #1350:

- declarative.py: guard sandbox_mounts logic with hasattr(script, 'keep_mounts')
  so non-sandbox scripts (server, client) always inherit cluster mounts instead
  of receiving sandbox_mounts — sandbox_mounts was incorrectly applied to all
  scripts after the precedence change
- mounts.py: use direct dict access cluster_config["sandbox_mounts"] after the
  presence check instead of .get() with a silent default
- mounts.py: raise ValueError on empty source or target in a mount string
  (e.g. ":/dst" or "src:") before indexing, avoiding IndexError
- tests: add required positional args to get_executor tests (num_nodes,
  tasks_per_node, gpus_per_node, job_name, log_dir)

Signed-off-by: suriya <sgunasekar@nvidia.com>
…x=True

Raised in review of PR #1350: the add_task path had no test for the
highest-priority rule — sandbox_mounts must win even when
keep_mounts_for_sandbox=True. Without it, that regression could slip
through while the declarative path stays green.

Signed-off-by: suriya <sgunasekar@nvidia.com>
@sgunasekar sgunasekar force-pushed the custom_sandbox_mount_paths branch from db33358 to d003c09 Compare April 15, 2026 23:01
Collaborator

@gwarmstrong gwarmstrong left a comment


@sgunasekar small change requested to the Command construction path.

Also, can we add a functional test for the read-only mounts? Right now it looks like the tests only check for configuration differences.

    # Non-sandbox script: always inherit cluster mounts
    exec_mounts = None
else:
    sandbox_mounts = get_sandbox_mounts_from_config(cluster_config)
Collaborator


Can we do this based on a script attribute instead? I think it would be preferable to reduce the dependency on cluster_config here. The sandbox mounts can be parsed/added at the Command instantiation rather than inside this method. Wdyt?
