
Commit 2b71315

Add APEX Shortlist benchmark (#1105)
# Add APEX Shortlist benchmark

Migrates the `apex-shortlist` benchmark from NeMo Skills into Gym on top of the existing `math_with_judge` resource server. Verification uses the server's symbolic-only path (math-verify, `should_use_judge: false`) with a new opt-in `parse_reasoning_like_skills` flag that mirrors Skills' `parse_reasoning=True` plus brace-matched `\boxed{…}` extraction; this is needed to avoid spurious mid-reasoning extractions on truncated generations.

## Includes

- `benchmarks/apex_shortlist/`: benchmark config, prepare.py, prompt template
  - Data source: `MathArena/apex-shortlist` on HuggingFace (48 problems, 32 integer + 16 symbolic answers)
- `resources_servers/math_with_judge/`: extended (not new)
  - `_search_boxed` brace-matching extractor (mirrors `nemo_skills.evaluation.math_grader.search_boxed`) that prefers the raw `\boxed{…}` LaTeX over math-verify's normalized form as judge input
  - `_strip_think_tags` and a `skills_parity_mode` flag that routes rollouts through Skills' full judge pipeline (`parse_reasoning` → `search_boxed` → prefill shortcuts → LLM judge) for per-rollout parity
  - `parse_reasoning_like_skills` flag (new on this branch): applies the same extraction to the symbolic-only path (no judge), for benchmarks whose Skills config is `eval_type=math` + `should_use_judge=false`

## Validated against NeMo Skills

Single comparison run on draco-oci: 48 problems × 4 rollouts/task, Nemotron-3-Nano-30B-A3B-BF16, T=1.0, top_p=0.95, max_output_tokens=65536. Skills uses 4× single-node vLLM (one per seed); Gym uses a single 4-node DP vLLM (TP=8, DP=4, Ray-coordinated) with `+num_repeats=4`.

```
===========================================================================
eval_type=math (symbolic-only, math-verify) | 4 rollouts/task | T=1.0 top_p=0.95
Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
===========================================================================
Metric                      Skills    Gym      Delta
---------------------------------------------------------------
pass@1[avg-of-4]            32.3%     34.4%    +2.1%
majority@4                  40.5%     41.7%    +1.2%
pass@4                      56.3%     54.2%    -2.1%
no_answer@1[avg-of-4]       26.6%     29.7%    +3.1%
```

---------

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Co-authored-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
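For reference, a minimal sketch of the brace-matched `\boxed{…}` extraction described above. The name and details here are illustrative stand-ins for `nemo_skills.evaluation.math_grader.search_boxed` and the `_search_boxed` helper this PR adds, not the actual implementations:

```python
def search_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in ``text``.

    A non-greedy regex breaks on nested braces such as
    \\boxed{\\frac{1}{2}}, so walk the string and count brace depth instead.
    """
    start = text.rfind("\\boxed{")
    if start == -1:
        return None  # no boxed answer at all
    depth = 0
    for i in range(start + len("\\boxed"), len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                # slice out everything between the outermost braces
                return text[start + len("\\boxed{"):i]
    return None  # unbalanced braces, e.g. a truncated generation


print(search_boxed(r"... so the answer is \boxed{\frac{1}{2}}."))  # \frac{1}{2}
```

Returning `None` on unbalanced braces is what makes this safe on truncated generations, where a naive regex could latch onto a `\boxed{` deep inside the reasoning trace.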
1 parent 072e13d commit 2b71315

6 files changed: 138 additions & 0 deletions

benchmarks/apex_shortlist/README.md

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
# APEX Shortlist

Math problems from MathArena's APEX Shortlist, sourced from
`MathArena/apex-shortlist` on HuggingFace. Mirrors the NeMo Skills
`apex-shortlist` benchmark (`nemo_skills/dataset/apex-shortlist/`).

## Verification

Reuses the `math_with_judge` resource server in **symbolic-only** mode
(`should_use_judge: false`) to mirror NeMo Skills' `eval_type=math`
default for this benchmark. The HuggingFace `math-verify` library checks
the model-extracted `\boxed{...}` answer for symbolic equivalence against
`expected_answer`.

## Prompt

User-only prompt, a character-for-character match with NeMo Skills'
`generic/math.yaml`:

```
Solve the following math problem. Make sure to put the answer (and only answer) inside \boxed{}.

<question>
```

## Data preparation

```bash
ng_prepare_benchmark '+config_paths=[benchmarks/apex_shortlist/config.yaml]'
```

Writes `data/apex_shortlist_benchmark.jsonl` with one row per problem:
`{"question": "...", "expected_answer": "..."}`.

## Running servers

```bash
config_paths="responses_api_models/vllm_model/configs/vllm_model.yaml,\
benchmarks/apex_shortlist/config.yaml"
ng_run "+config_paths=[$config_paths]"
```

## Collecting rollouts

```bash
ng_collect_rollouts \
  +agent_name=apex_shortlist_math_with_judge_simple_agent \
  +input_jsonl_fpath=benchmarks/apex_shortlist/data/apex_shortlist_benchmark.jsonl \
  +output_jsonl_fpath=results/apex_shortlist_rollouts.jsonl \
  +num_repeats=4
```
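The symbolic check in the Verification section above comes from the HuggingFace `math-verify` package. A minimal sketch of that comparison, assuming the `parse`/`verify` API shown in the package README (the example answers are illustrative, not from the dataset):

```python
# pip install math-verify
from math_verify import parse, verify

gold = parse("$\\frac{1}{2}$")   # the benchmark's expected_answer
pred = parse("$\\boxed{1/2}$")   # extracted from the model's generation

# verify(gold, pred) returns True when the two expressions are
# symbolically equivalent, e.g. 1/2 vs. \frac{1}{2}.
print(verify(gold, pred))
```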

benchmarks/apex_shortlist/__init__.py

Whitespace-only changes.
benchmarks/apex_shortlist/config.yaml

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
# Chain to existing resource server + agent config
config_paths:
  - resources_servers/math_with_judge/configs/math_with_judge.yaml

# We use `_inherit_from` directives to inherit from, rather than directly use,
# the generic config above, so this benchmark config stays isolated.
apex_shortlist_math_with_judge_resources_server:
  _inherit_from: math_with_judge
  resources_servers:
    math_with_judge:
      should_use_judge: false

apex_shortlist_math_with_judge_simple_agent:
  _inherit_from: math_with_judge_simple_agent
  responses_api_agents:
    simple_agent:
      resources_server:
        name: apex_shortlist_math_with_judge_resources_server

datasets:
  - name: apex_shortlist
    type: benchmark
    jsonl_fpath: benchmarks/apex_shortlist/data/apex_shortlist_benchmark.jsonl
    prompt_config: benchmarks/apex_shortlist/prompts/default.yaml
    prepare_script: benchmarks/apex_shortlist/prepare.py
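The `_inherit_from` machinery itself is not part of this diff. As a rough mental model only (an assumption about Gym's config resolution, not its actual code, and with made-up parent values), it behaves like a recursive dict merge in which the derived block's keys override the parent's:

```python
def deep_merge(parent: dict, override: dict) -> dict:
    """Recursively merge override into parent (sketch of assumed _inherit_from semantics)."""
    merged = dict(parent)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# The derived resources server flips should_use_judge while keeping
# everything else inherited from math_with_judge ("port" is illustrative).
parent = {"resources_servers": {"math_with_judge": {"should_use_judge": True, "port": 8000}}}
override = {"resources_servers": {"math_with_judge": {"should_use_judge": False}}}
print(deep_merge(parent, override))
```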
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
*benchmark.jsonl
benchmarks/apex_shortlist/prepare.py

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Prepare the APEX Shortlist benchmark data.

Downloads APEX Shortlist problems from HuggingFace and converts them to the
Gym benchmark JSONL format with `question` and `expected_answer` fields.
Mirrors NeMo Skills' `nemo_skills/dataset/apex-shortlist/prepare.py`.
"""

import json
from pathlib import Path

from datasets import load_dataset


BENCHMARK_DIR = Path(__file__).parent
DATA_DIR = BENCHMARK_DIR / "data"
OUTPUT_FPATH = DATA_DIR / "apex_shortlist_benchmark.jsonl"

HF_REPO_ID = "MathArena/apex-shortlist"


def prepare() -> Path:
    """Download and prepare APEX Shortlist data. Returns the output file path."""
    DATA_DIR.mkdir(parents=True, exist_ok=True)

    print(f"Loading APEX Shortlist data from {HF_REPO_ID}...")
    ds = load_dataset(HF_REPO_ID, split="train")

    count = 0
    with open(OUTPUT_FPATH, "w") as f:
        for row in ds:
            out = {
                "question": row["problem"],
                "expected_answer": str(row["answer"]),
            }
            f.write(json.dumps(out) + "\n")
            count += 1

    print(f"Wrote {count} problems to {OUTPUT_FPATH}")
    return OUTPUT_FPATH


if __name__ == "__main__":
    prepare()
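After running the script, the prepared file can be sanity-checked against the counts quoted in the PR description (48 problems: 32 integer, 16 symbolic answers). The integer test below is a rough illustrative heuristic, not part of the pipeline:

```python
import json
from pathlib import Path

jsonl = Path("benchmarks/apex_shortlist/data/apex_shortlist_benchmark.jsonl")
rows = [json.loads(line) for line in jsonl.read_text().splitlines()]

# Crude integer-vs-symbolic split of expected_answer values.
n_integer = sum(r["expected_answer"].lstrip("+-").isdigit() for r in rows)
print(f"{len(rows)} problems, {n_integer} integer, {len(rows) - n_integer} symbolic")
# Expected per the PR description: 48 problems, 32 integer, 16 symbolic.
```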
benchmarks/apex_shortlist/prompts/default.yaml

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
# Mirrors NeMo Skills' `nemo_skills/prompt/config/generic/math.yaml`.
user: |-
  Solve the following math problem. Make sure to put the answer (and only answer) inside \boxed{{}}.

  {question}
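The doubled braces in `\boxed{{}}` are deliberate: assuming the template is rendered with Python `str.format`-style substitution (an assumption about Gym's prompt rendering, consistent with the single-braced `{question}` placeholder), `{{}}` escapes to a literal `{}` in the prompt the model sees:

```python
template = (
    "Solve the following math problem. Make sure to put the answer "
    "(and only answer) inside \\boxed{{}}.\n\n{question}"
)
# {{}} renders as literal {}, while {question} is substituted.
print(template.format(question="What is 1 + 1?"))
```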
