
Commit e96fcb5

Merge branch 'main' into lbliii/fern-latest-main-versioning
2 parents: 7642d2d + ed190cd

179 files changed

Lines changed: 12976 additions & 173 deletions


README.md

Lines changed: 131 additions & 126 deletions
Large diffs are not rendered by default.

benchmarks/arena_hard/README.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
# arena_hard

Gym implementation of the
[Arena Hard v0.1](https://github.com/lmarena/arena-hard-auto)
open-ended generation benchmark.

## What it tests

500 hard, open-ended user prompts. Each candidate rollout is judged
pairwise (both A↔B orderings) against a fixed **gpt-4-0314** baseline
via an LLM judge. See
[`resources_servers/arena_judge`](../../resources_servers/arena_judge/README.md)
for the judging protocol and metric details.
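For intuition, here is a minimal sketch of one battle run in both orderings (the function names are hypothetical stand-ins, not the `arena_judge` server's actual API):

```python
# Minimal sketch of one pairwise battle, run in both orderings to wash out
# the judge's position bias. `call_judge` is a hypothetical stand-in, NOT
# the arena_judge server's actual API.
def battle(call_judge, question: str, candidate: str, baseline: str) -> tuple[str, str]:
    verdict_ab = call_judge(question, answer_a=candidate, answer_b=baseline)
    verdict_ba = call_judge(question, answer_a=baseline, answer_b=candidate)
    # Verdicts are strings on the order of "A>B", "A>>B", "A=B"; both count.
    return verdict_ab, verdict_ba
```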
## Data

Runtime download only — benchmark JSONL is not committed. Run
[`prepare.py`](prepare.py) (or `ng_prepare_benchmark`) to populate
`data/arena_hard_benchmark.jsonl`. The prepare script fetches
questions and the baseline directly from the arena-hard-auto GitHub
repo, joins by `uid`, and emits one row per question with `question`,
`baseline_answer`, and `uid` at the top level. Arena-hard v0.1 has no
real sub-categories, so the upstream `category` field is dropped and
`arena_judge` falls through to its `default_category` (`hard_prompt`)
to pick the standard judge prompt.
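For orientation, a prepared row looks roughly like this (the values are invented; only the three field names above are guaranteed by the prepare script):

```json
{"uid": "0123abcd", "question": "Write a bash script that ...", "baseline_answer": "Here is one approach ..."}
```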
## Example usage

```bash
# Prepare benchmark data
ng_prepare_benchmark "+config_paths=[benchmarks/arena_hard/config.yaml]"

# Running servers
config_paths="responses_api_models/vllm_model/configs/vllm_model.yaml,\
benchmarks/arena_hard/config.yaml"
ng_run "+config_paths=[$config_paths]"

# Collecting rollouts
ng_collect_rollouts \
    +agent_name=arena_hard_arena_judge_simple_agent \
    +input_jsonl_fpath=benchmarks/arena_hard/data/arena_hard_benchmark.jsonl \
    +output_jsonl_fpath=results/arena_hard_rollouts.jsonl \
    +prompt_config=benchmarks/prompts/generic_default.yaml \
    +num_repeats=4
```
## Metrics

The headline number is the **Arena-Elo win-rate (%) vs baseline**,
computed by the `arena_judge` resources server as MLE logistic
regression over the pairwise battles with a 100-round bootstrap 95% CI.
Emitted keys:

- `arena_elo/score` — overall win-rate (0-100)
- `arena_elo/ci_lower` / `arena_elo/ci_upper` — bootstrap percentile CI bounds
- `arena_elo/invalid_scores` — count of judge calls that produced no parseable verdict

The server also emits pass@k / pass@1[avg-of-k] / majority@k for a
verdict-type decomposition (`wins`, `strict_wins`, `ties`, `losses`,
`double_wins`, `invalid_gen_base`), so a single run gives both the
Arena-Elo headline and a rollout-level verdict distribution without
extra post-processing.
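For intuition, here is a minimal sketch of that computation in the degenerate two-model case (candidate vs. fixed baseline), assuming numpy and scikit-learn; the `arena_judge` server has its own implementation:

```python
# Illustrative only: MLE logistic regression over pairwise outcomes plus a
# bootstrap percentile CI, mirroring (not reproducing) arena_judge's metric.
import numpy as np
from sklearn.linear_model import LogisticRegression

def win_rate_vs_baseline(outcomes: np.ndarray, n_bootstrap: int = 100, seed: int = 0):
    """outcomes: 1.0 = candidate win, 0.0 = loss, 0.5 = tie (one entry per battle)."""
    rng = np.random.default_rng(seed)

    def fit(y: np.ndarray) -> float:
        # Duplicate every battle, turning each tie into one win + one loss,
        # so the MLE treats ties as exactly 50/50. Assumes both wins and
        # losses occur at least once.
        y_full = np.concatenate([np.ceil(y), np.floor(y)])
        X = np.ones((len(y_full), 1))  # one feature: candidate-vs-baseline strength
        clf = LogisticRegression(fit_intercept=False, C=1e6).fit(X, y_full)
        return 100.0 * clf.predict_proba([[1.0]])[0, 1]  # win-rate in percent

    score = fit(outcomes)
    boots = [fit(rng.choice(outcomes, size=len(outcomes))) for _ in range(n_bootstrap)]
    return score, np.percentile(boots, 2.5), np.percentile(boots, 97.5)
```

With only two models this collapses to the empirical win fraction; the multi-model Bradley-Terry case adds one feature column per model.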

benchmarks/arena_hard/__init__.py

Whitespace-only changes.

benchmarks/arena_hard/config.yaml

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
```yaml
# Chain to the arena_judge resource server + agent config.
config_paths:
- resources_servers/arena_judge/configs/arena_judge.yaml

# Isolate this benchmark's agent wiring from other arena_judge consumers
# via ``_inherit_from`` so overrides here don't leak.
arena_hard_arena_judge_resources_server:
  _inherit_from: arena_judge

arena_hard_arena_judge_simple_agent:
  _inherit_from: arena_judge_simple_agent
  responses_api_agents:
    simple_agent:
      resources_server:
        name: arena_hard_arena_judge_resources_server
      datasets:
      - name: arena_hard
        type: benchmark
        jsonl_fpath: benchmarks/arena_hard/data/arena_hard_benchmark.jsonl
        prompt_config: benchmarks/prompts/generic_default.yaml
        prepare_script: benchmarks/arena_hard/prepare.py
        # NOTE: num_repeats here is NOT honored for type=benchmark. Pass
        # it on the CLI as `+num_repeats=N` — see run_arena_hard_gym.py
        # in the migration recipe directory.
        license: Apache 2.0
```
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
```
*benchmark.jsonl
question.jsonl
baseline_*.jsonl
```

benchmarks/arena_hard/prepare.py

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
```python
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Prepare Arena Hard (v0.1) benchmark data.

Downloads the arena-hard-auto v0.1 question set and the single
``gpt-4-0314`` baseline model-answer file, joins them by ``uid``, and
writes one row per question with the fields the ``arena_judge``
resources server consumes at the top level (``question``,
``baseline_answer``, ``uid``):

- Questions:
  https://github.com/lmarena/arena-hard-auto/blob/main/data/arena-hard-v0.1/question.jsonl
- Baseline (gpt-4-0314):
  https://github.com/lmarena/arena-hard-auto/blob/main/data/arena-hard-v0.1/model_answer/gpt-4-0314.jsonl

Note: arena-hard v0.1 has no real sub-categories — the upstream
``category`` field is just the dataset version string ("arena-hard-v0.1").
We drop it so ``arena_judge`` falls through to its ``default_category``
(``hard_prompt``) and uses the standard arena-hard judge prompt.
"""

import json
import urllib.request
from pathlib import Path


BENCHMARK_DIR = Path(__file__).parent
DATA_DIR = BENCHMARK_DIR / "data"
OUTPUT_FPATH = DATA_DIR / "arena_hard_benchmark.jsonl"

URL_QUESTIONS = "https://raw.githubusercontent.com/lmarena/arena-hard-auto/main/data/arena-hard-v0.1/question.jsonl"
URL_BASELINE = (
    "https://raw.githubusercontent.com/lmarena/arena-hard-auto/main/data/arena-hard-v0.1/model_answer/gpt-4-0314.jsonl"
)


def _extract_answer_text(data: dict) -> str:
    """Extract the assistant answer from a baseline model's JSONL row.

    The arena-hard-auto baseline files use both shapes for the assistant
    ``content``: a plain string or a dict with an ``answer`` key.
    """
    for msg in data["messages"]:
        if msg["role"] == "assistant":
            content = msg["content"]
            return content["answer"] if isinstance(content, dict) else content
    raise ValueError("No assistant message found in the baseline row.")


def prepare() -> Path:
    """Download and write ``arena_hard_benchmark.jsonl``. Returns the path."""
    DATA_DIR.mkdir(parents=True, exist_ok=True)

    print(f"Downloading questions from {URL_QUESTIONS} ...")
    questions_fpath = DATA_DIR / "question.jsonl"
    urllib.request.urlretrieve(URL_QUESTIONS, questions_fpath)

    print(f"Downloading baseline from {URL_BASELINE} ...")
    baseline_fpath = DATA_DIR / "baseline_gpt-4-0314.jsonl"
    urllib.request.urlretrieve(URL_BASELINE, baseline_fpath)

    # uid -> answer_text
    baseline_answers: dict[str, str] = {}
    with open(baseline_fpath, "r", encoding="utf-8") as fin:
        for line in fin:
            row = json.loads(line)
            baseline_answers[row["uid"]] = _extract_answer_text(row)

    count = 0
    with open(questions_fpath, "r", encoding="utf-8") as fin, open(OUTPUT_FPATH, "w", encoding="utf-8") as fout:
        for line in fin:
            row = json.loads(line)
            # arena-hard-auto stores the prompt under ``prompt`` but the
            # resource server + prompt template expect ``question``.
            row["question"] = row.pop("prompt")
            # arena-hard v0.1 has no real sub-categories — the upstream
            # ``category`` field is just the dataset version string. Drop
            # it so ``arena_judge`` falls through to its ``default_category``
            # (``hard_prompt``) and uses the standard arena-hard judge prompt.
            row.pop("category", None)
            # Fail loudly if a question's baseline answer is missing — a
            # silent skip would shrink the evaluation set.
            row["baseline_answer"] = baseline_answers[row["uid"]]
            fout.write(json.dumps(row) + "\n")
            count += 1

    print(f"Wrote {count} problems to {OUTPUT_FPATH}")
    return OUTPUT_FPATH


if __name__ == "__main__":
    prepare()
```
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
# ASR-Leaderboard

The 8-subset HuggingFace Open ASR Leaderboard test set
(`hf-audio/esb-datasets-test-only-sorted`): librispeech-clean,
librispeech-other, voxpopuli, tedlium, gigaspeech, spgispeech,
earnings22, ami. Pairs with the
[`asr_with_pc`](../../resources_servers/asr_with_pc/) resource server's
`task_type=ASR` mode (Whisper-normalized WER).

## Audio handling

Audio FLACs are downloaded by `prepare.py` to the cluster-mounted
`/dataset/asr-leaderboard/data/<dataset>/<id>.flac` path. Each row
references the file via `responses_create_params.metadata.audio_path`,
and `vllm_model`'s audio sidechannel reads the file at request time and
splices it into the user message before forwarding to vLLM Chat
Completions.
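Schematically, a benchmark row wires in the audio like this (all values and the transcript field name are invented for illustration; only the `responses_create_params.metadata.audio_path` location is specified above):

```json
{
  "responses_create_params": {
    "input": [{"role": "user", "content": "Transcribe the audio."}],
    "metadata": {"audio_path": "/dataset/asr-leaderboard/data/librispeech-clean/0001.flac"}
  },
  "expected_transcript": "..."
}
```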
## Prompt

System + user templates live in [`prompts/default.yaml`](prompts/default.yaml).

## Prepare benchmark data

```bash
ng_prepare_benchmark "+config_paths=[benchmarks/asr_leaderboard/config.yaml]"
```

Downloads the 8 ESB subsets (~tens of GB of FLAC) and writes
`benchmarks/asr_leaderboard/data/asr_leaderboard_benchmark.jsonl`.

## Running servers

```bash
config_paths="responses_api_models/vllm_model/configs/vllm_model.yaml,\
benchmarks/asr_leaderboard/config.yaml"
ng_run "+config_paths=[$config_paths]"
```

## Collecting rollouts

```bash
ng_collect_rollouts \
    +agent_name=asr_leaderboard_asr_with_pc_simple_agent \
    +output_jsonl_fpath=results/asr_leaderboard_rollouts.jsonl \
    +num_repeats=1
```

## Verification

Per-rollout: standard WER (Whisper-normalized) and binary
`is_correct = wer < 0.5`. Aggregated: corpus-level `wer` and per-rollout
`pass@k`/`majority@k` are produced by `asr_with_pc.compute_metrics()`.
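A rough sketch of that per-rollout check, assuming the `openai-whisper` text normalizer and `jiwer`; the real scoring lives in the `asr_with_pc` server:

```python
# Illustrative sketch of Whisper-normalized WER scoring, not the actual
# asr_with_pc implementation.
import jiwer
from whisper.normalizers import EnglishTextNormalizer

_normalize = EnglishTextNormalizer()

def score_rollout(reference: str, hypothesis: str) -> tuple[float, bool]:
    ref, hyp = _normalize(reference), _normalize(hypothesis)
    wer = jiwer.wer(ref, hyp)   # word error rate on normalized text
    return wer, wer < 0.5       # per-rollout is_correct, as described above
```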

benchmarks/asr_leaderboard/__init__.py

Whitespace-only changes.
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
```yaml
config_paths:
- resources_servers/asr_with_pc/configs/asr_with_pc.yaml

asr_leaderboard_asr_with_pc_resources_server:
  _inherit_from: asr_with_pc
  resources_servers:
    asr_with_pc:
      task_type: ASR

asr_leaderboard_asr_with_pc_simple_agent:
  _inherit_from: asr_with_pc_simple_agent
  responses_api_agents:
    simple_agent:
      resources_server:
        name: asr_leaderboard_asr_with_pc_resources_server
      datasets:
      - name: asr_leaderboard
        type: benchmark
        jsonl_fpath: benchmarks/asr_leaderboard/data/asr_leaderboard_benchmark.jsonl
        prompt_config: benchmarks/asr_leaderboard/prompts/default.yaml
        prepare_script: benchmarks/asr_leaderboard/prepare.py
        license: Creative Commons Attribution 4.0 International
```
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
```
*benchmark.jsonl
*.flac
*.wav
```
