
Commit e96fcb5

Merge branch 'main' into lbliii/fern-latest-main-versioning
2 parents: 7642d2d + ed190cd

179 files changed

Lines changed: 12976 additions & 173 deletions


README.md

Lines changed: 131 additions & 126 deletions
Large diffs are not rendered by default.

benchmarks/arena_hard/README.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
# arena_hard

Gym implementation of the
[Arena Hard v0.1](https://github.com/lmarena/arena-hard-auto)
open-ended generation benchmark.

## What it tests

500 hard, open-ended user prompts. Each candidate rollout is judged
pairwise (both A↔B orderings) against a fixed **gpt-4-0314** baseline
via an LLM judge. See
[`resources_servers/arena_judge`](../../resources_servers/arena_judge/README.md)
for the judging protocol and metric details.
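For intuition, here is a minimal sketch of one battle run in both orderings (the function names are hypothetical stand-ins, not the `arena_judge` server's actual API):

```python
# Minimal sketch of one pairwise battle, run in both orderings to wash out
# the judge's position bias. `call_judge` is a hypothetical stand-in, NOT
# the arena_judge server's actual API.
def battle(call_judge, question: str, candidate: str, baseline: str) -> tuple[str, str]:
    verdict_ab = call_judge(question, answer_a=candidate, answer_b=baseline)
    verdict_ba = call_judge(question, answer_a=baseline, answer_b=candidate)
    # Verdicts are strings on the order of "A>B", "A>>B", "A=B"; both count.
    return verdict_ab, verdict_ba
```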
## Data

Runtime download only — benchmark JSONL is not committed. Run
[`prepare.py`](prepare.py) (or `ng_prepare_benchmark`) to populate
`data/arena_hard_benchmark.jsonl`. The prepare script fetches
questions and the baseline directly from the arena-hard-auto GitHub
repo, joins by `uid`, and emits one row per question with `question`,
`baseline_answer`, and `uid` at the top level. Arena-hard v0.1 has no
real sub-categories, so the upstream `category` field is dropped and
`arena_judge` falls through to its `default_category` (`hard_prompt`)
to pick the standard judge prompt.
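For orientation, a prepared row looks roughly like this (the values are invented; only the three field names above are guaranteed by the prepare script):

```json
{"uid": "0123abcd", "question": "Write a bash script that ...", "baseline_answer": "Here is one approach ..."}
```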
## Example usage

```bash
# Prepare benchmark data
ng_prepare_benchmark "+config_paths=[benchmarks/arena_hard/config.yaml]"

# Running servers
config_paths="responses_api_models/vllm_model/configs/vllm_model.yaml,\
benchmarks/arena_hard/config.yaml"
ng_run "+config_paths=[$config_paths]"

# Collecting rollouts
ng_collect_rollouts \
    +agent_name=arena_hard_arena_judge_simple_agent \
    +input_jsonl_fpath=benchmarks/arena_hard/data/arena_hard_benchmark.jsonl \
    +output_jsonl_fpath=results/arena_hard_rollouts.jsonl \
    +prompt_config=benchmarks/prompts/generic_default.yaml \
    +num_repeats=4
```
## Metrics

The headline number is the **Arena-Elo win-rate (%) vs baseline**,
computed by the `arena_judge` resources server as MLE logistic
regression over the pairwise battles with a 100-round bootstrap 95% CI.
Emitted keys:

- `arena_elo/score` — overall win-rate (0-100)
- `arena_elo/ci_lower` / `arena_elo/ci_upper` — bootstrap percentile CI bounds
- `arena_elo/invalid_scores` — count of judge calls that produced no parseable verdict

The server also emits pass@k / pass@1[avg-of-k] / majority@k for a
verdict-type decomposition (`wins`, `strict_wins`, `ties`, `losses`,
`double_wins`, `invalid_gen_base`), so a single run gives both the
Arena-Elo headline and a rollout-level verdict distribution without
extra post-processing.
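For intuition, here is a minimal sketch of that computation in the degenerate two-model case (candidate vs. fixed baseline), assuming numpy and scikit-learn; the `arena_judge` server has its own implementation:

```python
# Illustrative only: MLE logistic regression over pairwise outcomes plus a
# bootstrap percentile CI, mirroring (not reproducing) arena_judge's metric.
import numpy as np
from sklearn.linear_model import LogisticRegression

def win_rate_vs_baseline(outcomes: np.ndarray, n_bootstrap: int = 100, seed: int = 0):
    """outcomes: 1.0 = candidate win, 0.0 = loss, 0.5 = tie (one entry per battle)."""
    rng = np.random.default_rng(seed)

    def fit(y: np.ndarray) -> float:
        # Duplicate every battle, turning each tie into one win + one loss,
        # so the MLE treats ties as exactly 50/50. Assumes both wins and
        # losses occur at least once.
        y_full = np.concatenate([np.ceil(y), np.floor(y)])
        X = np.ones((len(y_full), 1))  # one feature: candidate-vs-baseline strength
        clf = LogisticRegression(fit_intercept=False, C=1e6).fit(X, y_full)
        return 100.0 * clf.predict_proba([[1.0]])[0, 1]  # win-rate in percent

    score = fit(outcomes)
    boots = [fit(rng.choice(outcomes, size=len(outcomes))) for _ in range(n_bootstrap)]
    return score, np.percentile(boots, 2.5), np.percentile(boots, 97.5)
```

With only two models this collapses to the empirical win fraction; the multi-model Bradley-Terry case adds one feature column per model.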

benchmarks/arena_hard/__init__.py

Whitespace-only changes.

benchmarks/arena_hard/config.yaml

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
```yaml
# Chain to the arena_judge resource server + agent config.
config_paths:
- resources_servers/arena_judge/configs/arena_judge.yaml

# Isolate this benchmark's agent wiring from other arena_judge consumers
# via ``_inherit_from`` so overrides here don't leak.
arena_hard_arena_judge_resources_server:
  _inherit_from: arena_judge

arena_hard_arena_judge_simple_agent:
  _inherit_from: arena_judge_simple_agent
  responses_api_agents:
    simple_agent:
      resources_server:
        name: arena_hard_arena_judge_resources_server
      datasets:
      - name: arena_hard
        type: benchmark
        jsonl_fpath: benchmarks/arena_hard/data/arena_hard_benchmark.jsonl
        prompt_config: benchmarks/prompts/generic_default.yaml
        prepare_script: benchmarks/arena_hard/prepare.py
        # NOTE: num_repeats here is NOT honored for type=benchmark. Pass
        # it on the CLI as `+num_repeats=N` — see run_arena_hard_gym.py
        # in the migration recipe directory.
        license: Apache 2.0
```
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
```
*benchmark.jsonl
question.jsonl
baseline_*.jsonl
```

benchmarks/arena_hard/prepare.py

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
```python
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Prepare Arena Hard (v0.1) benchmark data.

Downloads the arena-hard-auto v0.1 question set and the single
``gpt-4-0314`` baseline model-answer file, joins them by ``uid``, and
writes one row per question with the fields the ``arena_judge``
resources server consumes at the top level (``question``,
``baseline_answer``, ``uid``):

- Questions:
  https://github.com/lmarena/arena-hard-auto/blob/main/data/arena-hard-v0.1/question.jsonl
- Baseline (gpt-4-0314):
  https://github.com/lmarena/arena-hard-auto/blob/main/data/arena-hard-v0.1/model_answer/gpt-4-0314.jsonl

Note: arena-hard v0.1 has no real sub-categories — the upstream
``category`` field is just the dataset version string ("arena-hard-v0.1").
We drop it so ``arena_judge`` falls through to its ``default_category``
(``hard_prompt``) and uses the standard arena-hard judge prompt.
"""

import json
import urllib.request
from pathlib import Path


BENCHMARK_DIR = Path(__file__).parent
DATA_DIR = BENCHMARK_DIR / "data"
OUTPUT_FPATH = DATA_DIR / "arena_hard_benchmark.jsonl"

URL_QUESTIONS = "https://raw.githubusercontent.com/lmarena/arena-hard-auto/main/data/arena-hard-v0.1/question.jsonl"
URL_BASELINE = (
    "https://raw.githubusercontent.com/lmarena/arena-hard-auto/main/data/arena-hard-v0.1/model_answer/gpt-4-0314.jsonl"
)


def _extract_answer_text(data: dict) -> str:
    """Extract the assistant answer from a baseline model's JSONL row.

    The arena-hard-auto baseline files use both shapes for the assistant
    ``content``: a plain string or a dict with an ``answer`` key.
    """
    for msg in data["messages"]:
        if msg["role"] == "assistant":
            content = msg["content"]
            return content["answer"] if isinstance(content, dict) else content
    raise ValueError("No assistant message found in the baseline row.")


def prepare() -> Path:
    """Download and write ``arena_hard_benchmark.jsonl``. Returns the path."""
    DATA_DIR.mkdir(parents=True, exist_ok=True)

    print(f"Downloading questions from {URL_QUESTIONS} ...")
    questions_fpath = DATA_DIR / "question.jsonl"
    urllib.request.urlretrieve(URL_QUESTIONS, questions_fpath)

    print(f"Downloading baseline from {URL_BASELINE} ...")
    baseline_fpath = DATA_DIR / "baseline_gpt-4-0314.jsonl"
    urllib.request.urlretrieve(URL_BASELINE, baseline_fpath)

    # uid -> answer_text
    baseline_answers: dict[str, str] = {}
    with open(baseline_fpath, "r", encoding="utf-8") as fin:
        for line in fin:
            row = json.loads(line)
            baseline_answers[row["uid"]] = _extract_answer_text(row)

    count = 0
    with open(questions_fpath, "r", encoding="utf-8") as fin, open(OUTPUT_FPATH, "w", encoding="utf-8") as fout:
        for line in fin:
            row = json.loads(line)
            # arena-hard-auto stores the prompt under ``prompt`` but the
            # resource server + prompt template expect ``question``.
            row["question"] = row.pop("prompt")
            # arena-hard v0.1 has no real sub-categories — the upstream
            # ``category`` field is just the dataset version string. Drop
            # it so ``arena_judge`` falls through to its ``default_category``
            # (``hard_prompt``) and uses the standard arena-hard judge prompt.
            row.pop("category", None)
            # Fail loudly if a question's baseline answer is missing — a
            # silent skip would shrink the evaluation set.
            row["baseline_answer"] = baseline_answers[row["uid"]]
            fout.write(json.dumps(row) + "\n")
            count += 1

    print(f"Wrote {count} problems to {OUTPUT_FPATH}")
    return OUTPUT_FPATH


if __name__ == "__main__":
    prepare()
```
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
# ASR-Leaderboard

The 8-subset HuggingFace Open ASR Leaderboard test set
(`hf-audio/esb-datasets-test-only-sorted`): librispeech-clean,
librispeech-other, voxpopuli, tedlium, gigaspeech, spgispeech,
earnings22, ami. Pairs with the
[`asr_with_pc`](../../resources_servers/asr_with_pc/) resource server's
`task_type=ASR` mode (Whisper-normalized WER).

## Audio handling

Audio FLACs are downloaded by `prepare.py` to the cluster-mounted
`/dataset/asr-leaderboard/data/<dataset>/<id>.flac` path. Each row
references the file via `responses_create_params.metadata.audio_path`,
and `vllm_model`'s audio sidechannel reads the file at request time and
splices it into the user message before forwarding to vLLM Chat
Completions.
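Schematically, a benchmark row wires in the audio like this (all values and the transcript field name are invented for illustration; only the `responses_create_params.metadata.audio_path` location is specified above):

```json
{
  "responses_create_params": {
    "input": [{"role": "user", "content": "Transcribe the audio."}],
    "metadata": {"audio_path": "/dataset/asr-leaderboard/data/librispeech-clean/0001.flac"}
  },
  "expected_transcript": "..."
}
```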
## Prompt

System + user templates live in [`prompts/default.yaml`](prompts/default.yaml).

## Prepare benchmark data

```bash
ng_prepare_benchmark "+config_paths=[benchmarks/asr_leaderboard/config.yaml]"
```

Downloads the 8 ESB subsets (~tens of GB of FLAC) and writes
`benchmarks/asr_leaderboard/data/asr_leaderboard_benchmark.jsonl`.

## Running servers

```bash
config_paths="responses_api_models/vllm_model/configs/vllm_model.yaml,\
benchmarks/asr_leaderboard/config.yaml"
ng_run "+config_paths=[$config_paths]"
```

## Collecting rollouts

```bash
ng_collect_rollouts \
    +agent_name=asr_leaderboard_asr_with_pc_simple_agent \
    +output_jsonl_fpath=results/asr_leaderboard_rollouts.jsonl \
    +num_repeats=1
```

## Verification

Per-rollout: standard WER (Whisper-normalized) and binary
`is_correct = wer < 0.5`. Aggregated: corpus-level `wer` and per-rollout
`pass@k`/`majority@k` are produced by `asr_with_pc.compute_metrics()`.
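A rough sketch of that per-rollout check, assuming the `openai-whisper` text normalizer and `jiwer`; the real scoring lives in the `asr_with_pc` server:

```python
# Illustrative sketch of Whisper-normalized WER scoring, not the actual
# asr_with_pc implementation.
import jiwer
from whisper.normalizers import EnglishTextNormalizer

_normalize = EnglishTextNormalizer()

def score_rollout(reference: str, hypothesis: str) -> tuple[float, bool]:
    ref, hyp = _normalize(reference), _normalize(hypothesis)
    wer = jiwer.wer(ref, hyp)   # word error rate on normalized text
    return wer, wer < 0.5       # per-rollout is_correct, as described above
```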

benchmarks/asr_leaderboard/__init__.py

Whitespace-only changes.
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
```yaml
config_paths:
- resources_servers/asr_with_pc/configs/asr_with_pc.yaml

asr_leaderboard_asr_with_pc_resources_server:
  _inherit_from: asr_with_pc
  resources_servers:
    asr_with_pc:
      task_type: ASR

asr_leaderboard_asr_with_pc_simple_agent:
  _inherit_from: asr_with_pc_simple_agent
  responses_api_agents:
    simple_agent:
      resources_server:
        name: asr_leaderboard_asr_with_pc_resources_server
      datasets:
      - name: asr_leaderboard
        type: benchmark
        jsonl_fpath: benchmarks/asr_leaderboard/data/asr_leaderboard_benchmark.jsonl
        prompt_config: benchmarks/asr_leaderboard/prompts/default.yaml
        prepare_script: benchmarks/asr_leaderboard/prepare.py
        license: Creative Commons Attribution 4.0 International
```
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
```
*benchmark.jsonl
*.flac
*.wav
```
