
SALM Backend Support and HF ASR Leaderboard Evaluation Recipe #1344

Open
melllinia wants to merge 6 commits into main from salm-backend-support

Conversation

@melllinia
Member

@melllinia melllinia commented Apr 6, 2026

Summary

  • Add a new salm backend for NeMo SALM models (e.g. canary-qwen-2.5b) that use nemo.collections.speechlm2.models.SALM and its chat-style generate() API. This complements the existing nemo_asr backend (for traditional ASR models) and the upcoming vllm_nemo_speechlm backend (for vLLM-served models).
  • Fix a bug in nemo_asr_backend where empty transcriptions caused the raw Hypothesis object repr (~698 words of tensor data) to be returned instead of an empty string, inflating WER scores.
  • Add run_hf_leaderboard.py recipe for running HF Open ASR Leaderboard evaluation on NeMo models with a single command.
  • Improve the ASR leaderboard dataset preparation script for better robustness and maintainability.
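The empty-transcription bug in the second bullet comes from truthiness-based fallbacks: `""` is falsy, so a pattern like `hyp.text or str(hyp)` leaks the object repr into the transcript. A minimal sketch of the failure mode and the None-check fix (class and helper names here are illustrative, not the actual backend code):

```python
class FakeHypothesis:
    """Stand-in for a NeMo Hypothesis whose transcript is legitimately empty."""
    text = ""

    def __repr__(self):
        # Hundreds of words of tensor data in the real object.
        return "Hypothesis(score=..., y_sequence=tensor([...]), ...)"


def parse_text_buggy(hyp):
    # Truthiness fallback: "" is falsy, so the repr becomes the transcript.
    return getattr(hyp, "text", None) or str(hyp)


def parse_text_fixed(hyp):
    # Explicit None checks: only a genuinely missing field triggers a fallback.
    text = getattr(hyp, "text", None)
    if text is None:
        text = getattr(hyp, "transcript", None)
    return text if text is not None else ""
```

With an empty transcript, the buggy variant returns the repr (inflating WER), while the fixed variant returns `""`.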

Summary by CodeRabbit

  • New Features

    • ASR leaderboard CLI to run evaluations with configurable model/backend.
    • New SALM inference backend and selectable backend registry entry.
    • End-to-end SALM slurm test harness and result-checking utilities.
  • Improvements

    • Dataset handling accepts entries without audio; standardized 16 kHz decoding and tolerant failures.
    • Deterministic transcript selection and consolidated CLI messages.
  • Bug Fixes

    • More reliable JSONL message formatting and per-request validation.

…en-2.5b)

Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
@melllinia melllinia force-pushed the salm-backend-support branch from 51e453a to 0f4a275 Compare April 6, 2026 10:28

Copilot AI left a comment


Pull request overview

Adds SALM (Speech-Augmented Language Model) support to the unified inference server and introduces a one-command HuggingFace Open ASR Leaderboard evaluation recipe, alongside robustness improvements to ASR leaderboard dataset preparation and a WER-inflating bug fix in the NeMo ASR backend.

Changes:

  • Added a new salm unified-server backend for NeMo SALM models using the chat-style generate() API.
  • Fixed nemo_asr hypothesis parsing so empty transcriptions don’t fall back to verbose object string representations.
  • Added run_hf_leaderboard.py and updated ASR leaderboard dataset preparation for improved maintainability/robustness.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
recipes/multimodal/server/backends/salm_backend.py New SALM backend implementing audio-to-text via SALM generate() with temp audio files.
recipes/multimodal/server/backends/nemo_asr_backend.py Fixes hypothesis parsing to avoid returning a large str(hyp) when transcript text is empty.
recipes/multimodal/server/backends/__init__.py Registers the new salm backend for lazy loading.
recipes/asr/run_hf_leaderboard.py Adds a recipe script to run HF Open ASR Leaderboard evals against the unified server.
nemo_skills/dataset/asr-leaderboard/prepare.py Refactors dataset download/formatting and audio handling for leaderboard prep.


Comment thread recipes/multimodal/server/backends/salm_backend.py
Comment thread recipes/multimodal/server/backends/salm_backend.py
Comment thread recipes/asr/run_hf_leaderboard.py
Comment thread nemo_skills/dataset/asr-leaderboard/prepare.py
Comment thread recipes/multimodal/server/backends/nemo_asr_backend.py Outdated
@coderabbitai
Contributor

coderabbitai Bot commented Apr 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Split ASR dataset preparation into separate audio extraction and JSONL formatting, removed duration filtering, added a HuggingFace ASR leaderboard runner, introduced and registered a new SALM inference backend (config, model load/warmup, validation, generation), and tightened nemo_asr hypothesis text selection logic.

Changes

Cohort / File(s) Summary
ASR Dataset Preparation
nemo_skills/dataset/asr-leaderboard/prepare.py
Replaced save_audio_and_format_entry() with extract_audio() and format_entry(); removed duration-based filtering; cast dataset audio column to Audio(sampling_rate=16000) when present; standardized HF repo constant; changed JSONL messages shape and id placement; simplified iteration and aggregation; adjusted CLI messages.
ASR Leaderboard Runner
recipes/asr/run_hf_leaderboard.py
Added new executable main() to run HuggingFace Open ASR Leaderboard: CLI args, evaluation context, server entrypoint/args, and call to nemo_skills.pipeline.cli.eval.run_eval.
Backend Registry & Hypothesis Parsing
recipes/multimodal/server/backends/__init__.py, recipes/multimodal/server/backends/nemo_asr_backend.py
Registered "salm" → SALMBackend in backend registry and docs; changed _parse_single_hypothesis() to use explicit None-check fallbacks for transcript selection (prevents treating empty strings as missing).
SALM Inference Backend
recipes/multimodal/server/backends/salm_backend.py
Added SALMConfig and SALMBackend: config parsing, model restore/from_pretrained, device placement, optional warmup, request validation (requires single audio), per-request temp WAV writing, batched generate() calls, token→text decoding, GenerationResult population, error handling, and temp cleanup.
Slurm Tests for SALM
tests/slurm-tests/unified_salm/*, tests/slurm-tests/run_all.sh
Added unified SALM slurm test scripts: run_test.py, check_results.py, test fixture salm_openai.test, and hooked into run_all.sh to execute unified_salm runs.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant SALMBackend
    participant AudioFS as Audio I/O
    participant SALMModel as SALM Model
    participant Tokenizer

    Client->>SALMBackend: generate(requests)
    SALMBackend->>SALMBackend: validate_request(requests)

    alt validation passes
        SALMBackend->>AudioFS: write audio_bytes -> temp .wav files
        SALMBackend->>SALMBackend: build SALM chat prompts (include audio_locator_tag)
        SALMBackend->>SALMModel: generate(prompts, max_new_tokens)
        SALMModel-->>SALMBackend: token sequences per input
        SALMBackend->>Tokenizer: decode tokens -> text
        Tokenizer-->>SALMBackend: decoded text
        SALMBackend->>SALMBackend: assemble GenerationResult per request (text, tokens, timing, metadata)
    else validation fails
        SALMBackend->>SALMBackend: produce per-request error GenerationResult
    end

    SALMBackend->>AudioFS: cleanup temp files
    SALMBackend-->>Client: List[GenerationResult]

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

reviewed

Suggested reviewers

  • Jorjeous
  • karpnv
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.83% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title comprehensively and accurately describes both major components of the changeset: SALM backend support (a new inference backend) and HF ASR leaderboard evaluation recipe (a new executable CLI script), which align with the substantial changes and objectives.



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
recipes/asr/run_hf_leaderboard.py (1)

41-41: Shadowing eval builtin is a library API design issue.

The import from nemo_skills.pipeline.cli import eval shadows Python's built-in eval(). This is flagged by static analysis, but since nemo_skills.pipeline.cli exports this name, the issue is upstream in the library's API design rather than in this script. Consider aliasing if the built-in is needed locally:

from nemo_skills.pipeline.cli import eval as run_eval, wrap_arguments
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/asr/run_hf_leaderboard.py` at line 41, The import shadows Python's
builtin eval; change the import line that brings in
nemo_skills.pipeline.cli.eval to use an alias (e.g., import eval as run_eval)
and update all usages of eval in this module to call run_eval instead; keep
wrap_arguments as-is and ensure any references to the original eval symbol
(function or variable) are replaced with the new alias to avoid builtin
shadowing.
recipes/multimodal/server/backends/salm_backend.py (2)

132-147: Consider extracting shared audio validation to base class.

_get_request_audio_bytes and validate_request are identical to NeMoASRBackend. This duplication could be extracted to a shared mixin or utility in base.py for audio-input backends.

♻️ Potential shared utility in base.py
# In base.py, add a mixin or utility:
class AudioInputMixin:
    """Shared utilities for backends accepting audio input."""
    
    def _get_request_audio_bytes(self, request: GenerationRequest) -> bytes:
        if request.audio_bytes:
            return request.audio_bytes
        if request.audio_bytes_list:
            if len(request.audio_bytes_list) > 1:
                raise ValueError(f"{self.name} backend currently supports one audio input per request.")
            return request.audio_bytes_list[0]
        raise ValueError("Request must contain audio_bytes/audio_bytes_list")

    def validate_request(self, request: GenerationRequest) -> Optional[str]:
        has_audio = request.audio_bytes is not None or (
            request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0
        )
        if not has_audio:
            return f"{self.name} backend requires audio input"
        return None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/multimodal/server/backends/salm_backend.py` around lines 132 - 147,
Extract the duplicated audio-request logic into a shared mixin in base.py (e.g.,
AudioInputMixin) and have both SALM backend and NeMoASRBackend inherit/use it:
move the implementations of _get_request_audio_bytes and validate_request into
AudioInputMixin (using self.name for backend-specific messages) and replace the
local methods in SALM backend (methods named _get_request_audio_bytes and
validate_request) so they defer to the mixin implementation; ensure the mixin
raises the same errors and preserves the single-audio constraint and message
text.

84-104: Add explicit device mapping to SALM model loading.

NeMoASRBackend passes map_location=torch.device(...) to restore_from() and from_pretrained() to control device placement during model loading, while this implementation relies on .to(device) afterward. This may cause temporary memory spikes on the default device before the explicit device transfer. Consider following the same pattern as NeMoASRBackend:

map_location = torch.device(self.config.device)

if Path(model_ref).exists():
    self._model = SALM.restore_from(model_ref, map_location=map_location)
else:
    self._model = SALM.from_pretrained(model_ref, map_location=map_location)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/multimodal/server/backends/salm_backend.py` around lines 84 - 104,
The load_model method currently moves the SALM instance to the target device
only after loading, which can cause memory spikes; change load_model to
construct a torch.device from self.config.device and pass it as map_location to
SALM.restore_from and SALM.from_pretrained (i.e., call
SALM.restore_from(model_ref, map_location=map_location) or
SALM.from_pretrained(model_ref, map_location=map_location)) so the model is
allocated directly on the correct device; ensure torch is imported if not
already and keep the existing .to(self.config.device) and eval() calls as a
safety net.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/dataset/asr-leaderboard/prepare.py`:
- Around line 115-118: The load_dataset call for HF_REPO using variables
hf_config and hf_split must enable remote code execution for custom dataset
scripts; update the load_dataset invocation (the one assigning to dataset) to
pass trust_remote_code=True so datasets from HF_REPO (e.g., hf_config/hf_split)
that require custom loaders load correctly while keeping the surrounding logic
that casts the "audio" column (dataset.cast_column and Audio) unchanged.

In `@recipes/multimodal/server/backends/nemo_asr_backend.py`:
- Around line 236-240: The dict-hypothesis path in
NeMoASRBackend._parse_single_hypothesis lacks test coverage for missing/empty
"text" (it checks "transcript" as a fallback) while the object path is tested;
add a unit test to cover the dict branch that passes a dict like {"text": "",
"words": None} and assert text == "" and words == [] to ensure the same fallback
behavior; create test_nemo_asr_backend_empty_text_dict_hypothesis using
NeMoASRBackend(NeMoASRConfig(model_path="dummy")) and call
_parse_single_hypothesis with the dict to validate the result.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 71b193fa-8ba1-420b-bec7-f19ae7c99b64

📥 Commits

Reviewing files that changed from the base of the PR and between 42aed3a and 51e453a.

📒 Files selected for processing (5)
  • nemo_skills/dataset/asr-leaderboard/prepare.py
  • recipes/asr/run_hf_leaderboard.py
  • recipes/multimodal/server/backends/__init__.py
  • recipes/multimodal/server/backends/nemo_asr_backend.py
  • recipes/multimodal/server/backends/salm_backend.py

Comment thread nemo_skills/dataset/asr-leaderboard/prepare.py
Comment thread recipes/multimodal/server/backends/nemo_asr_backend.py
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (1)
nemo_skills/dataset/asr-leaderboard/prepare.py (1)

115-118: ⚠️ Potential issue | 🟠 Major

Add trust_remote_code=True to load_dataset call.

The hf-audio/esb-datasets-test-only-sorted repository uses custom dataset loading scripts and requires trust_remote_code=True. Without this flag, dataset loading will fail with recent versions of the datasets library.

🔧 Proposed fix
     print(f"Loading {dataset_name} from {HF_REPO} (config={hf_config}, split={hf_split})...")
-    dataset = load_dataset(HF_REPO, hf_config, split=hf_split)
+    dataset = load_dataset(HF_REPO, hf_config, split=hf_split, trust_remote_code=True)
     if with_audio and "audio" in dataset.column_names:
         dataset = dataset.cast_column("audio", Audio(sampling_rate=AUDIO_SAMPLE_RATE))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/asr-leaderboard/prepare.py` around lines 115 - 118, The
load_dataset call in prepare.py fails for repos with custom dataset scripts;
update the call to include trust_remote_code=True so remote code is allowed when
loading HF_REPO (the call that uses load_dataset(HF_REPO, hf_config,
split=hf_split)); keep the rest of the logic (the with_audio check and
dataset.cast_column("audio", Audio(sampling_rate=AUDIO_SAMPLE_RATE))) unchanged.
🧹 Nitpick comments (2)
nemo_skills/dataset/asr-leaderboard/prepare.py (1)

56-69: Docstring claims broader handling than implementation provides.

The docstring states it handles "the newer AudioDecoder object from torchcodec-based datasets library," but the implementation only tries dict-style access. If AudioDecoder uses a different API, this silently returns (None, None) rather than actually extracting the audio. Additionally, IndexError is unusual for dict/object access patterns.

Consider either updating the docstring to reflect current behavior, or adding explicit AudioDecoder handling if that format is expected.
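The second option can be sketched as follows. Note the attribute path for AudioDecoder-like objects is an assumption about that API, not a confirmed torchcodec interface:

```python
def extract_audio(audio_info):
    """Return (array, sampling_rate), or (None, None) if the shape is unsupported.

    Handles HF dict-style entries; the attribute branch for decoder-like
    objects assumes `.array` / `.sampling_rate` attributes exist.
    """
    if isinstance(audio_info, dict):
        try:
            return audio_info["array"], int(audio_info["sampling_rate"])
        except (KeyError, TypeError, ValueError):
            return None, None
    # Attribute-style access for decoder-like objects.
    array = getattr(audio_info, "array", None)
    rate = getattr(audio_info, "sampling_rate", None)
    if array is not None and rate is not None:
        return array, int(rate)
    return None, None
```

This catches only the exceptions a malformed dict can actually raise, instead of the unusual `IndexError`, and falls back to `(None, None)` for anything it does not recognize.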

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/asr-leaderboard/prepare.py` around lines 56 - 69, The
docstring for extract_audio claims support for a torchcodec AudioDecoder but the
implementation only handles dict-style access and swallows mismatches; update
the function to robustly handle both dicts and AudioDecoder-like objects (or
else update the docstring to remove that claim). Concretely, in extract_audio
detect input type: if isinstance(audio_info, dict) use audio_info["array"] and
["sampling_rate"]; otherwise check for attributes or methods commonly exposed by
AudioDecoder (e.g., hasattr(audio_info, "array") and hasattr(audio_info,
"sampling_rate"), or a to_numpy()/to_array()/numpy() method), retrieve the array
and sampling rate with getattr/call as needed, convert the array to np.array and
sampling_rate to int, and only catch appropriate exceptions (AttributeError,
TypeError, ValueError) instead of IndexError; if unsupported, return (None,
None) as before. Ensure the docstring matches the actual supported formats and
reference extract_audio in your changes.
recipes/asr/run_hf_leaderboard.py (1)

70-70: Preserve backend-specific flag passthrough here.

serve_unified already supports backend-specific CLI args, but this wrapper calls parse_args() and only forwards --backend/--batch_size. That makes SALM-only knobs like --user_prompt unavailable from this recipe and forces future backend additions back into this file. parse_known_args() plus appending the unknown args to server_args would keep the recipe backend-agnostic. Based on learnings, in nemo_skills/inference/server/serve_unified.py, the CLI entrypoint uses parse_known_args() intentionally to capture backend-specific arguments in extra_args, which are then processed by parse_extra_args(). This backend-agnostic design allows different backends to have their own CLI flags without hardcoding them in the entrypoint.

Also applies to: 92-92
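The suggested passthrough pattern can be sketched as below (flag names mirror the review comment; the recipe's actual argument set may differ):

```python
import argparse

parser = argparse.ArgumentParser(description="HF ASR leaderboard runner (sketch)")
parser.add_argument("--backend", default="nemo_asr")
parser.add_argument("--batch_size", type=int, default=16)

# parse_known_args() keeps flags this recipe doesn't declare (e.g. --user_prompt)
# so they can be forwarded verbatim to the server's backend-specific parser.
args, extra = parser.parse_known_args(
    ["--backend", "salm", "--user_prompt", "Transcribe the audio."]
)
server_args = ["--backend", args.backend, "--batch_size", str(args.batch_size)] + extra
```

Any backend-only flag survives in `extra` and rides along in `server_args` without this recipe having to know about it.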

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/asr/run_hf_leaderboard.py` at line 70, Replace parser.parse_args()
with parser.parse_known_args() to capture backend-specific flags: call args,
extra = parser.parse_known_args() then keep using args.backend/args.batch_size
but append the extra list to the server_args you build for invoking
serve_unified (e.g., server_args += extra). This preserves SALM-specific flags
like --user_prompt and any future backend flags so serve_unified (and its
parse_extra_args) receives unknown/backend-specific arguments without hardcoding
them in this recipe; update all other parse_args() occurrences similarly (e.g.,
the later call around line 92) to parse_known_args and append unknowns to
server_args.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@recipes/asr/run_hf_leaderboard.py`:
- Line 41: The import of eval from nemo_skills.pipeline.cli shadows Python's
builtin eval; change the import to alias it (e.g., from nemo_skills.pipeline.cli
import eval as pipeline_eval, wrap_arguments) and update all call sites that
currently call eval (notably the call at the former line 72) to use the aliased
name pipeline_eval so you invoke the pipeline entrypoint instead of the builtin.

In `@recipes/multimodal/server/backends/salm_backend.py`:
- Around line 173-176: The code incorrectly reads user_prompt and max_new_tokens
from only the first valid request (via first_extra) and applies them to the
whole batch; update the logic in the block that builds per-request inputs
(referencing requests, valid_indices, first_extra, user_prompt, max_new_tokens,
and GenerationRequest.max_new_tokens) so you compute user_prompt and
max_new_tokens per request by reading each request.extra_params (falling back to
self.salm_config.user_prompt and self.config.max_new_tokens respectively) when
iterating valid_indices; ensure any unsupported extra_params raise or are
ignored per guidelines so per-request overrides are honored rather than using
the first request for the entire batch.
- Around line 127-128: The current broad except block in the SALM warmup (the
except Exception as e logging "SALM warmup skipped due to error: %s") hides
startup failures; remove the blanket catch or change it to re-raise after
logging so warmup errors fail startup. Locate the warmup call in salm_backend.py
(the try/except that logs via logger.warning) and either delete the try/except
so exceptions propagate, or keep the log but immediately raise the exception
(raise) after logging to ensure startup fails on unexpected warmup errors.
- Around line 141-147: The validate_request method currently only checks for
presence of audio but allows audio_bytes_list of any length; update
validate_request(GenerationRequest) to enforce the single-audio constraint by
returning an error when request.audio_bytes_list is not None and
len(request.audio_bytes_list) != 1 (e.g., "SALM backend requires exactly one
audio input"), so invalid multi-audio requests fail early (before
batching/temp-file handling) instead of deferring to _get_request_audio_bytes().


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b8efa982-7144-4d7c-bec5-f4380406e644

📥 Commits

Reviewing files that changed from the base of the PR and between 51e453a and 0f4a275.

📒 Files selected for processing (5)
  • nemo_skills/dataset/asr-leaderboard/prepare.py
  • recipes/asr/run_hf_leaderboard.py
  • recipes/multimodal/server/backends/__init__.py
  • recipes/multimodal/server/backends/nemo_asr_backend.py
  • recipes/multimodal/server/backends/salm_backend.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • recipes/multimodal/server/backends/__init__.py
  • recipes/multimodal/server/backends/nemo_asr_backend.py

Comment thread recipes/asr/run_hf_leaderboard.py Outdated
Comment thread recipes/multimodal/server/backends/salm_backend.py
Comment thread recipes/multimodal/server/backends/salm_backend.py
Comment thread recipes/multimodal/server/backends/salm_backend.py Outdated
@vmendelev
Collaborator

Thank you for the PR! Minor checks. Also, could you please add a slurm test, same as for the asr/tts backends? You can use the same audio as in the ASR backend test.

Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
@melllinia melllinia force-pushed the salm-backend-support branch from 62273ab to 6f4f93b Compare April 8, 2026 13:41
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
recipes/multimodal/server/backends/salm_backend.py (1)

173-191: ⚠️ Potential issue | 🟠 Major

Don't let the first request define the whole batch.

user_prompt and max_new_tokens are still taken from the first valid request and then reused for every prompt in the batch. This also bypasses the typed GenerationRequest.user_prompt field and silently accepts the other generation overrides on GenerationRequest/BackendConfig without applying them. Please either batch only requests with identical settings or reject unsupported per-request overrides up front.

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically."

Also applies to: 208-212
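Option (B), grouping requests into sub-batches with identical settings, can be sketched roughly as follows (the request fields and defaults are placeholders, not the backend's actual dataclass):

```python
from itertools import groupby

DEFAULT_PROMPT = "Transcribe:"   # placeholder backend default
DEFAULT_MAX_NEW_TOKENS = 128     # placeholder backend default


def settings_key(req: dict):
    # Per-request overrides fall back to the backend defaults.
    return (
        req.get("user_prompt", DEFAULT_PROMPT),
        req.get("max_new_tokens", DEFAULT_MAX_NEW_TOKENS),
    )


def sub_batches(requests):
    """Yield (settings, original_indices) groups so generate() runs once per
    distinct settings tuple and results can be scattered back in order."""
    indexed = sorted(enumerate(requests), key=lambda p: settings_key(p[1]))
    for key, group in groupby(indexed, key=lambda p: settings_key(p[1])):
        yield key, [i for i, _ in group]
```

Each group would drive one `generate()` call, and the retained original indices let results be written back to the right slots, so no request's overrides are silently ignored.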

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/multimodal/server/backends/salm_backend.py` around lines 173 - 191,
The current batch uses values from first_req (first_req, user_prompt,
max_new_tokens) for every prompt, silently ignoring per-request
GenerationRequest.user_prompt and per-request generation overrides; update the
handler to either (A) validate that every request in valid_indices has identical
relevant settings (user_prompt, max_new_tokens and any other generation
overrides referenced in the backend) before building prompts and if any mismatch
exists raise/return an error that per-request overrides are unsupported, or (B)
group requests into sub-batches keyed by their identical settings and call
self._model.generate separately per group; ensure both the prompt construction
code that builds prompts from temp_paths and the other similar block (around the
other occurrence noted at lines 208-212) use the same validation/grouping logic
so no request's overrides are ignored.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@recipes/multimodal/server/backends/salm_backend.py`:
- Around line 130-146: validate_request currently allows both
GenerationRequest.audio_bytes and audio_bytes_list to be set and
_get_request_audio_bytes silently prefers audio_bytes; change validation to
reject ambiguous requests by returning an error when both audio_bytes is not
None and audio_bytes_list is not None and len(audio_bytes_list) > 0, e.g. in
validate_request check for the simultaneous presence of audio_bytes and a
non-empty audio_bytes_list and return a clear error like "Specify either
audio_bytes or audio_bytes_list, not both"; then keep _get_request_audio_bytes
as-is (it can assume validate_request enforced exclusivity) or update it to
raise if both are present as a defensive fallback.
- Around line 163-170: The current try/except around calling
_get_request_audio_bytes and writing temp files in the Salm backend is too broad
and hides unexpected failures; change the handler in the block that builds
temp_paths/valid_indices (and the similar block at the other site around lines
216-217) to only catch explicit request-validation errors (e.g., ValueError or
your framework's RequestValidationError) and convert those into
GenerationResult(error=...), while letting other exceptions (IOError,
tokenizer/model errors, etc.) propagate after optionally logging; reference the
_get_request_audio_bytes call, the temp_paths and valid_indices updates, and the
results[idx] = GenerationResult(...) assignment to locate and narrow the
exception handling.

---

Duplicate comments:
In `@recipes/multimodal/server/backends/salm_backend.py`:
- Around line 173-191: The current batch uses values from first_req (first_req,
user_prompt, max_new_tokens) for every prompt, silently ignoring per-request
GenerationRequest.user_prompt and per-request generation overrides; update the
handler to either (A) validate that every request in valid_indices has identical
relevant settings (user_prompt, max_new_tokens and any other generation
overrides referenced in the backend) before building prompts and if any mismatch
exists raise/return an error that per-request overrides are unsupported, or (B)
group requests into sub-batches keyed by their identical settings and call
self._model.generate separately per group; ensure both the prompt construction
code that builds prompts from temp_paths and the other similar block (around the
other occurrence noted at lines 208-212) use the same validation/grouping logic
so no request's overrides are ignored.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f403486b-770f-4011-9f2e-6bb5651d541d

📥 Commits

Reviewing files that changed from the base of the PR and between 0f4a275 and 62273ab.

📒 Files selected for processing (3)
  • recipes/asr/run_hf_leaderboard.py
  • recipes/multimodal/server/backends/nemo_asr_backend.py
  • recipes/multimodal/server/backends/salm_backend.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • recipes/multimodal/server/backends/nemo_asr_backend.py
  • recipes/asr/run_hf_leaderboard.py

Comment on lines +130 to +146
def _get_request_audio_bytes(self, request: GenerationRequest) -> bytes:
if request.audio_bytes:
return request.audio_bytes
if request.audio_bytes_list:
if len(request.audio_bytes_list) > 1:
raise ValueError("SALM backend currently supports one audio input per request.")
return request.audio_bytes_list[0]
raise ValueError("Request must contain audio_bytes/audio_bytes_list")

def validate_request(self, request: GenerationRequest) -> Optional[str]:
has_audio = request.audio_bytes is not None or (
request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0
)
if not has_audio:
return "SALM backend requires audio input"
if request.audio_bytes_list is not None and len(request.audio_bytes_list) > 1:
return "SALM backend currently supports one audio input per request"

⚠️ Potential issue | 🟠 Major

Reject ambiguous requests that set both audio inputs.

validate_request() currently allows audio_bytes and audio_bytes_list together, and _get_request_audio_bytes() then silently prefers audio_bytes. That means one user-supplied audio payload is ignored without any error.

💡 Proposed fix
     def _get_request_audio_bytes(self, request: GenerationRequest) -> bytes:
+        if request.audio_bytes is not None and request.audio_bytes_list is not None:
+            raise ValueError("SALM backend accepts either audio_bytes or audio_bytes_list, not both")
         if request.audio_bytes:
             return request.audio_bytes
         if request.audio_bytes_list:
             if len(request.audio_bytes_list) > 1:
                 raise ValueError("SALM backend currently supports one audio input per request.")
             return request.audio_bytes_list[0]
         raise ValueError("Request must contain audio_bytes/audio_bytes_list")

     def validate_request(self, request: GenerationRequest) -> Optional[str]:
+        if request.audio_bytes is not None and request.audio_bytes_list is not None:
+            return "SALM backend accepts either audio_bytes or audio_bytes_list, not both"
         has_audio = request.audio_bytes is not None or (
             request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0
         )

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/multimodal/server/backends/salm_backend.py` around lines 130 - 146,
validate_request currently allows both GenerationRequest.audio_bytes and
audio_bytes_list to be set and _get_request_audio_bytes silently prefers
audio_bytes; change validation to reject ambiguous requests by returning an
error when both audio_bytes is not None and audio_bytes_list is not None and
len(audio_bytes_list) > 0, e.g. in validate_request check for the simultaneous
presence of audio_bytes and a non-empty audio_bytes_list and return a clear
error like "Specify either audio_bytes or audio_bytes_list, not both"; then keep
_get_request_audio_bytes as-is (it can assume validate_request enforced
exclusivity) or update it to raise if both are present as a defensive fallback.
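
The exclusivity check suggested above can be sketched in isolation; `Request` here is a hypothetical stand-in for the repo's `GenerationRequest` dataclass, with only the two audio fields the review discusses:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    # Hypothetical stand-in for GenerationRequest; field names taken from the review.
    audio_bytes: Optional[bytes] = None
    audio_bytes_list: Optional[List[bytes]] = None


def validate_request(request: Request) -> Optional[str]:
    # Reject ambiguous input up front so no user-supplied payload is silently ignored.
    if request.audio_bytes is not None and request.audio_bytes_list is not None:
        return "Specify either audio_bytes or audio_bytes_list, not both"
    has_audio = request.audio_bytes is not None or bool(request.audio_bytes_list)
    if not has_audio:
        return "SALM backend requires audio input"
    if request.audio_bytes_list is not None and len(request.audio_bytes_list) > 1:
        return "SALM backend currently supports one audio input per request"
    return None
```

With the ambiguous case rejected here, `_get_request_audio_bytes()` can safely assume exclusivity downstream.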

Comment on lines +163 to +170
try:
audio_bytes = self._get_request_audio_bytes(req)
p = tmp_dir / f"req_{idx:04d}.wav"
p.write_bytes(audio_bytes)
temp_paths.append(str(p))
valid_indices.append(idx)
except Exception as e:
results[idx] = GenerationResult(error=str(e), request_id=req.request_id)

⚠️ Potential issue | 🟠 Major

Only catch expected request-validation failures here.

These blanket handlers also swallow temp-file, model, and tokenizer failures, turning backend regressions into plain request errors with no traceback. Catch the expected validation errors explicitly and let unexpected exceptions propagate after logging.

Based on learnings, "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving."

Also applies to: 216-217

🧰 Tools
🪛 Ruff (0.15.9)

[warning] 169-169: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/multimodal/server/backends/salm_backend.py` around lines 163 - 170,
The current try/except around calling _get_request_audio_bytes and writing temp
files in the Salm backend is too broad and hides unexpected failures; change the
handler in the block that builds temp_paths/valid_indices (and the similar block
at the other site around lines 216-217) to only catch explicit
request-validation errors (e.g., ValueError or your framework's
RequestValidationError) and convert those into GenerationResult(error=...),
while letting other exceptions (IOError, tokenizer/model errors, etc.) propagate
after optionally logging; reference the _get_request_audio_bytes call, the
temp_paths and valid_indices updates, and the results[idx] =
GenerationResult(...) assignment to locate and narrow the exception handling.
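
The narrowed handler could look like the sketch below. The names (`get_request_audio_bytes`, the per-index error strings) are illustrative stand-ins, not the repo's actual API; the point is catching only the validation error class while letting everything else propagate:

```python
from typing import Dict, List, Optional, Tuple


def get_request_audio_bytes(audio_bytes: Optional[bytes]) -> bytes:
    # Stand-in for _get_request_audio_bytes: raises ValueError on invalid input.
    if audio_bytes is None:
        raise ValueError("Request must contain audio_bytes")
    return audio_bytes


def collect_audio(requests: List[Optional[bytes]]) -> Tuple[Dict[int, str], List[Tuple[int, bytes]]]:
    results: Dict[int, str] = {}
    valid: List[Tuple[int, bytes]] = []
    for idx, audio in enumerate(requests):
        try:
            payload = get_request_audio_bytes(audio)
        except ValueError as e:  # expected per-request validation failure only
            results[idx] = f"error: {e}"
            continue
        # IOError, filesystem, model, and tokenizer errors are NOT caught here;
        # they propagate so backend regressions fail loudly with a traceback.
        valid.append((idx, payload))
    return results, valid
```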

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (4)
recipes/multimodal/server/backends/salm_backend.py (4)

130-147: ⚠️ Potential issue | 🟠 Major

Reject requests that set both audio fields.

validate_request() still accepts a request with both audio_bytes and audio_bytes_list, and _get_request_audio_bytes() silently prefers one payload. That means caller-supplied input is ignored without an error. As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/multimodal/server/backends/salm_backend.py` around lines 130 - 147,
The validator currently allows requests that set both audio_bytes and
audio_bytes_list while _get_request_audio_bytes silently prefers one; update
validate_request to detect and reject requests where both request.audio_bytes is
not None and request.audio_bytes_list is not None (return a clear error like
"Specify either audio_bytes or audio_bytes_list, not both") and also harden
_get_request_audio_bytes to raise the same ValueError if both are present (so
callers cannot have input ignored); reference validate_request and
_get_request_audio_bytes to locate and change the logic and error messages
accordingly.

163-170: ⚠️ Potential issue | 🟠 Major

Only catch expected validation errors in generate().

These blanket handlers also swallow temp-file, filesystem, model, and tokenizer failures, turning backend regressions into plain request errors with no traceback. If you want per-request validation errors here, catch ValueError from _get_request_audio_bytes() and let the rest propagate. As per coding guidelines, "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving."

Also applies to: 216-217

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/multimodal/server/backends/salm_backend.py` around lines 163 - 170,
In generate(), only catch validation errors from _get_request_audio_bytes()
instead of blanket Exception: change the try/except that populates temp_paths,
valid_indices and sets results[idx] = GenerationResult(...) to catch ValueError
(not Exception) so filesystem/model/tokenizer errors propagate; apply the same
change to the second analogous try/except handling request audio bytes (the
other block around setting results entries) so unexpected errors are not
swallowed.

173-176: ⚠️ Potential issue | 🟠 Major

Don't let the first valid request define the whole batch.

user_prompt and max_new_tokens are read once from valid_indices[0] and then reused for every prompt. Later requests lose their own overrides, and GenerationRequest.user_prompt is never consulted here either. Either build prompts from each request or reject mixed-setting batches explicitly. As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically."

Also applies to: 179-191

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/multimodal/server/backends/salm_backend.py` around lines 173 - 176,
The code currently reads user_prompt and max_new_tokens once from first_req
(variables first_req, first_extra, user_prompt, max_new_tokens) and reuses them
for the whole batch, which ignores per-request overrides
(GenerationRequest.user_prompt and per-request max_new_tokens) and can silently
drop user-provided settings; change the batching logic so you either (A) build
the prompt and token limit per-request inside the loop that processes
valid_indices (use each request's .extra_params / .user_prompt and
.max_new_tokens) or (B) explicitly reject mixed-setting batches by validating
that all requests in valid_indices have identical user_prompt and max_new_tokens
and raise an error if not; update any downstream use sites that reference those
shared variables to consume per-request values or the validated common values
accordingly.
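
Option (B), sub-batching by identical settings, can be sketched as follows; `Req` is a hypothetical stand-in carrying only the fields the review says must be uniform per `generate()` call:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Req:
    # Hypothetical stand-in for GenerationRequest.
    user_prompt: str
    max_new_tokens: int
    audio_path: str


def group_into_sub_batches(requests: List[Req]) -> Dict[Tuple[str, int], List[Tuple[int, str]]]:
    """Group requests by the settings that must be uniform within one generate() call."""
    groups: Dict[Tuple[str, int], List[Tuple[int, str]]] = defaultdict(list)
    for idx, req in enumerate(requests):
        groups[(req.user_prompt, req.max_new_tokens)].append((idx, req.audio_path))
    return groups


# Each group would then be passed to model.generate() separately, so no
# request's per-request overrides are silently ignored.
```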

123-126: ⚠️ Potential issue | 🟠 Major

Let warmup failures fail startup.

Warmup exercises the same generate() path used for real traffic. Swallowing every exception here can leave the backend marked loaded even when generation is broken. As per coding guidelines, "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/multimodal/server/backends/salm_backend.py` around lines 123 - 126,
The warmup currently swallows all exceptions around self._model.generate and
logs a warning via logger.warning, which can mask real startup failures; remove
the broad except or change it to only catch expected, recoverable exceptions and
re-raise any unexpected ones so startup fails on generation errors — locate the
SALM warmup block that calls self._model.generate(...) and the
logger.warning("SALM warmup skipped due to error: %s", e) and either delete the
try/except so exceptions propagate or replace the except with specific exception
types and a re-raise for anything else.
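
A warmup wrapper along these lines would keep startup failing loudly; `WarmupOnlyError` is a hypothetical marker class for the narrow set of failures considered safe to skip:

```python
import logging

logger = logging.getLogger(__name__)


class WarmupOnlyError(RuntimeError):
    """Hypothetical marker for failures known to be safe to skip at warmup."""


def warmup(generate) -> None:
    # Exercise the same generate() path used for real traffic, so startup
    # fails if inference is broken instead of the backend being marked loaded.
    try:
        generate()
    except WarmupOnlyError as e:
        # Only a narrowly expected, recoverable failure is downgraded to a warning.
        logger.warning("SALM warmup skipped due to error: %s", e)
    # Any other exception propagates and fails startup.
```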
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@recipes/multimodal/server/backends/salm_backend.py`:
- Around line 130-147: The validator currently allows requests that set both
audio_bytes and audio_bytes_list while _get_request_audio_bytes silently prefers
one; update validate_request to detect and reject requests where both
request.audio_bytes is not None and request.audio_bytes_list is not None (return
a clear error like "Specify either audio_bytes or audio_bytes_list, not both")
and also harden _get_request_audio_bytes to raise the same ValueError if both
are present (so callers cannot have input ignored); reference validate_request
and _get_request_audio_bytes to locate and change the logic and error messages
accordingly.
- Around line 163-170: In generate(), only catch validation errors from
_get_request_audio_bytes() instead of blanket Exception: change the try/except
that populates temp_paths, valid_indices and sets results[idx] =
GenerationResult(...) to catch ValueError (not Exception) so
filesystem/model/tokenizer errors propagate; apply the same change to the second
analogous try/except handling request audio bytes (the other block around
setting results entries) so unexpected errors are not swallowed.
- Around line 173-176: The code currently reads user_prompt and max_new_tokens
once from first_req (variables first_req, first_extra, user_prompt,
max_new_tokens) and reuses them for the whole batch, which ignores per-request
overrides (GenerationRequest.user_prompt and per-request max_new_tokens) and can
silently drop user-provided settings; change the batching logic so you either
(A) build the prompt and token limit per-request inside the loop that processes
valid_indices (use each request's .extra_params / .user_prompt and
.max_new_tokens) or (B) explicitly reject mixed-setting batches by validating
that all requests in valid_indices have identical user_prompt and max_new_tokens
and raise an error if not; update any downstream use sites that reference those
shared variables to consume per-request values or the validated common values
accordingly.
- Around line 123-126: The warmup currently swallows all exceptions around
self._model.generate and logs a warning via logger.warning, which can mask real
startup failures; remove the broad except or change it to only catch expected,
recoverable exceptions and re-raise any unexpected ones so startup fails on
generation errors — locate the SALM warmup block that calls
self._model.generate(...) and the logger.warning("SALM warmup skipped due to
error: %s", e) and either delete the try/except so exceptions propagate or
replace the except with specific exception types and a re-raise for anything
else.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 90d22a12-dfc9-41e9-bf91-547dec1b878c

📥 Commits

Reviewing files that changed from the base of the PR and between 62273ab and 6f4f93b.

📒 Files selected for processing (3)
  • recipes/asr/run_hf_leaderboard.py
  • recipes/multimodal/server/backends/nemo_asr_backend.py
  • recipes/multimodal/server/backends/salm_backend.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • recipes/multimodal/server/backends/nemo_asr_backend.py

Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/slurm-tests/unified_salm/check_results.py (1)

88-94: Use direct key access for required output fields.

If id and generation are required output schema fields, prefer explicit key checks + row["..."] instead of .get(...) to avoid silent fallback behavior.

Proposed fix
-        sample_id = row.get("id")
+        if "id" not in row:
+            soft_assert(False, "Missing required key 'id' in output row")
+            continue
+        sample_id = row["id"]
@@
-        transcript = (row.get("generation") or "").strip()
+        if "generation" not in row:
+            soft_assert(False, f"Sample {sample_id}: missing required key 'generation'")
+            continue
+        transcript = (row["generation"] or "").strip()

As per coding guidelines "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/unified_salm/check_results.py` around lines 88 - 94,
Replace use of row.get(...) with direct key access for required fields: change
sample_id = row.get("id") to sample_id = row["id"] and transcript =
(row.get("generation") or "").strip() to transcript = row["generation"].strip();
keep the existing soft_assert checks (soft_assert(sample_id in references, ...),
soft_assert(bool(transcript), ...)) and the early continue when sample_id not in
references so missing/invalid keys raise a clear KeyError rather than silently
falling back.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/slurm-tests/run_all.sh`:
- Line 24: The shell invocation uses unquoted variables CLUSTER and RUN_NAME
which can break on whitespace; update the command that calls python
tests/slurm-tests/unified_salm/run_test.py to quote all variable expansions —
replace occurrences of $CLUSTER and $RUN_NAME (both in --cluster, the
--workspace path, and --expname_prefix) with "$CLUSTER" and "$RUN_NAME" so the
values are treated as single arguments even if they contain spaces or special
characters.

In `@tests/slurm-tests/unified_salm/check_results.py`:
- Around line 15-19: The current count-equality check can miss missing or
duplicated sample IDs; update the block that builds and compares output IDs and
reference IDs so it (1) computes sets for both (e.g., output_ids_set and
reference_ids_set) and asserts set equality instead of just equal counts, and
(2) detects duplicates by comparing len(output_ids_list) vs len(output_ids_set)
(and similarly for reference list) and fail with a clear error when duplicates
are found; modify the comparison logic used around the output/reference ID check
to first fail on duplicates and then verify exact set membership (missing/extra
IDs) rather than only comparing totals.

---

Nitpick comments:
In `@tests/slurm-tests/unified_salm/check_results.py`:
- Around line 88-94: Replace use of row.get(...) with direct key access for
required fields: change sample_id = row.get("id") to sample_id = row["id"] and
transcript = (row.get("generation") or "").strip() to transcript =
row["generation"].strip(); keep the existing soft_assert checks
(soft_assert(sample_id in references, ...), soft_assert(bool(transcript), ...))
and the early continue when sample_id not in references so missing/invalid keys
raise a clear KeyError rather than silently falling back.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ef160864-9ad8-4a15-9e6f-7d4e52076298

📥 Commits

Reviewing files that changed from the base of the PR and between 6f4f93b and f461bd9.

📒 Files selected for processing (4)
  • tests/slurm-tests/run_all.sh
  • tests/slurm-tests/unified_salm/check_results.py
  • tests/slurm-tests/unified_salm/run_test.py
  • tests/slurm-tests/unified_salm/salm_openai.test
✅ Files skipped from review due to trivial changes (1)
  • tests/slurm-tests/unified_salm/salm_openai.test

# sleep 10
python tests/slurm-tests/unified_tts/run_test.py --cluster $CLUSTER --workspace /workspace/nemo-skills-slurm-ci/$RUN_NAME/unified_tts --expname_prefix unified_tts_$RUN_NAME
# sleep 10
python tests/slurm-tests/unified_salm/run_test.py --cluster $CLUSTER --workspace /workspace/nemo-skills-slurm-ci/$RUN_NAME/unified_salm --expname_prefix unified_salm_$RUN_NAME

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Check if file exists and examine the relevant section
cd /repo && wc -l tests/slurm-tests/run_all.sh && head -30 tests/slurm-tests/run_all.sh

Repository: NVIDIA-NeMo/Skills

Length of output: 117


🏁 Script executed:

# Find the file and check its contents
find . -name "run_all.sh" -path "*/slurm-tests/*" 2>/dev/null

Repository: NVIDIA-NeMo/Skills

Length of output: 92


🏁 Script executed:

# Examine the file, particularly around line 24
wc -l ./tests/slurm-tests/run_all.sh && echo "---" && cat -n ./tests/slurm-tests/run_all.sh | head -40

Repository: NVIDIA-NeMo/Skills

Length of output: 2308


Quote shell variables in the new SALM invocation.

Line 24 uses unquoted $CLUSTER and $RUN_NAME; this can break on whitespace or special characters since these values come from external inputs (script arguments).

Proposed fix
-python tests/slurm-tests/unified_salm/run_test.py --cluster $CLUSTER --workspace /workspace/nemo-skills-slurm-ci/$RUN_NAME/unified_salm --expname_prefix unified_salm_$RUN_NAME
+python tests/slurm-tests/unified_salm/run_test.py --cluster "$CLUSTER" --workspace "/workspace/nemo-skills-slurm-ci/$RUN_NAME/unified_salm" --expname_prefix "unified_salm_$RUN_NAME"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
python tests/slurm-tests/unified_salm/run_test.py --cluster $CLUSTER --workspace /workspace/nemo-skills-slurm-ci/$RUN_NAME/unified_salm --expname_prefix unified_salm_$RUN_NAME
python tests/slurm-tests/unified_salm/run_test.py --cluster "$CLUSTER" --workspace "/workspace/nemo-skills-slurm-ci/$RUN_NAME/unified_salm" --expname_prefix "unified_salm_$RUN_NAME"
🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 24-24: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 24-24: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 24-24: Double quote to prevent globbing and word splitting.

(SC2086)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/run_all.sh` at line 24, The shell invocation uses unquoted
variables CLUSTER and RUN_NAME which can break on whitespace; update the command
that calls python tests/slurm-tests/unified_salm/run_test.py to quote all
variable expansions — replace occurrences of $CLUSTER and $RUN_NAME (both in
--cluster, the --workspace path, and --expname_prefix) with "$CLUSTER" and
"$RUN_NAME" so the values are treated as single arguments even if they contain
spaces or special characters.

Comment on lines +15 to +19
import argparse
import json
import re
import sys
from pathlib import Path

⚠️ Potential issue | 🟠 Major

Count equality alone can miss missing sample coverage.

Line 85 + Lines 87-101 can still pass when one reference ID is missing but another ID is duplicated. Add explicit set/duplicate checks for output IDs.

Proposed fix
 import argparse
+from collections import Counter
 import json
 import re
 import sys
 from pathlib import Path
@@
-    soft_assert(len(rows) == len(references), f"Expected {len(references)} outputs, found {len(rows)}")
+    soft_assert(len(rows) == len(references), f"Expected {len(references)} outputs, found {len(rows)}")
+    output_ids = [row["id"] for row in rows if "id" in row]
+    missing_ids = sorted(set(references) - set(output_ids))
+    extra_ids = sorted(set(output_ids) - set(references))
+    duplicate_ids = sorted([sid for sid, cnt in Counter(output_ids).items() if cnt > 1])
+    soft_assert(not missing_ids, f"Missing sample ids in output: {', '.join(missing_ids)}")
+    soft_assert(not extra_ids, f"Unexpected sample ids in output: {', '.join(extra_ids)}")
+    soft_assert(not duplicate_ids, f"Duplicate sample ids in output: {', '.join(duplicate_ids)}")

Also applies to: 85-101

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/unified_salm/check_results.py` around lines 15 - 19, The
current count-equality check can miss missing or duplicated sample IDs; update
the block that builds and compares output IDs and reference IDs so it (1)
computes sets for both (e.g., output_ids_set and reference_ids_set) and asserts
set equality instead of just equal counts, and (2) detects duplicates by
comparing len(output_ids_list) vs len(output_ids_set) (and similarly for
reference list) and fail with a clear error when duplicates are found; modify
the comparison logic used around the output/reference ID check to first fail on
duplicates and then verify exact set membership (missing/extra IDs) rather than
only comparing totals.
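
The duplicate-and-coverage check described above can be sketched as a standalone helper (names are illustrative, not the test file's actual functions):

```python
from collections import Counter
from typing import Iterable, List


def check_id_coverage(output_ids: Iterable[str], reference_ids: Iterable[str]) -> List[str]:
    """Return human-readable failures; an empty list means exact ID coverage."""
    output_ids = list(output_ids)
    reference_ids = list(reference_ids)
    failures = []
    # Fail on duplicates first: a count-equality check alone would let a
    # duplicated ID mask a missing one.
    dupes = sorted(i for i, n in Counter(output_ids).items() if n > 1)
    if dupes:
        failures.append(f"Duplicate sample ids in output: {', '.join(dupes)}")
    missing = sorted(set(reference_ids) - set(output_ids))
    if missing:
        failures.append(f"Missing sample ids in output: {', '.join(missing)}")
    extra = sorted(set(output_ids) - set(reference_ids))
    if extra:
        failures.append(f"Unexpected sample ids in output: {', '.join(extra)}")
    return failures
```

Note that `check_id_coverage(["a", "a"], ["a", "b"])` fails on both counts even though the lengths match, which is exactly the case a bare count comparison misses.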

3 participants