Commit 2b71315
Add APEX Shortlist benchmark (#1105)
# Add APEX Shortlist benchmark
Migrates the `apex-shortlist` benchmark from NeMo Skills into Gym on top
of the existing `math_with_judge` resource server. Verification uses the
server's symbolic-only path (math-verify, `should_use_judge: false`)
with a new opt-in `parse_reasoning_like_skills` flag. The flag mirrors
Skills' `parse_reasoning=True` plus brace-matched `\boxed{…}` extraction,
which is needed to avoid spurious mid-reasoning extractions on truncated
generations.
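As a rough illustration of what brace-matched `\boxed{…}` extraction means, here is a minimal sketch. The function name `search_boxed_sketch` is hypothetical, and taking the last `\boxed{` occurrence and returning `None` on unbalanced braces are assumptions; the actual `_search_boxed` in this commit may behave differently.

```python
from typing import Optional


def search_boxed_sketch(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    Hypothetical helper for illustration only; not the code in this commit.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)  # assumption: the last box is the final answer
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    for pos in range(begin, len(text)):
        ch = text[pos]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[begin:pos]
    # Braces never closed, e.g. the generation was truncated mid-answer.
    return None
```

Counting brace depth keeps nested LaTeX such as `\frac{1}{2}` intact, where a regex like `\\boxed\{(.*?)\}` would stop at the first `}`. This sketch does not treat escaped braces (`\{`, `\}`) specially.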
## Includes
- `benchmarks/apex_shortlist/`: benchmark config, `prepare.py`, prompt
  template
  - Data source: `MathArena/apex-shortlist` on HuggingFace (48 problems,
    32 integer + 16 symbolic answers)
- `resources_servers/math_with_judge/`: extended (not new)
  - `_search_boxed`: brace-matching extractor (mirrors
    `nemo_skills.evaluation.math_grader.search_boxed`); the raw
    `\boxed{…}` LaTeX is preferred over math-verify's normalized form as
    judge input
  - `_strip_think_tags` + `skills_parity_mode` flag: routes rollouts
    through Skills' full judge pipeline (`parse_reasoning` → `search_boxed`
    → prefill shortcuts → LLM judge) for per-rollout parity
  - `parse_reasoning_like_skills` flag (new on this branch): applies the
    same extraction to the symbolic-only path (no judge), for benchmarks
    whose Skills config is `eval_type=math` + `should_use_judge=false`
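To make the reasoning-stripping step concrete, here is a minimal sketch of one way it could work. The function name `strip_reasoning_sketch`, the `</think>` tag, and the choice to return an empty string when the tag is missing are all assumptions made for illustration; they are not taken from the `_strip_think_tags` implementation in this commit.

```python
def strip_reasoning_sketch(text: str, end_tag: str = "</think>") -> str:
    """Keep only the text after the last end-of-reasoning tag.

    Hypothetical helper. Assumption: if the tag is absent (for example the
    generation hit the token limit mid-reasoning), return an empty string
    so that no answer is extracted from unfinished reasoning.
    """
    _head, sep, tail = text.rpartition(end_tag)
    if not sep:
        # Tag not found: treat the whole generation as reasoning.
        return ""
    return tail.strip()
```

Under these assumptions, a truncated generation that mentions `\boxed{3}` while still reasoning yields no extractable answer, which is the failure mode the commit message describes.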
## Validated against NeMo Skills
Single comparison run on draco-oci: 48 problems × 4 rollouts/task,
Nemotron-3-Nano-30B-A3B-BF16, T=1.0 top_p=0.95 max_output_tokens=65536.
Skills uses 4× single-node vLLM (one per seed); Gym uses a single 4-node
DP vLLM (TP=8, DP=4, Ray-coordinated) with `+num_repeats=4`.
```
===========================================================================
eval_type=math (symbolic-only, math-verify) | 4 rollouts/task | T=1.0 top_p=0.95
Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
===========================================================================
Metric                    Skills      Gym         Delta
---------------------------------------------------------------
pass@1[avg-of-4]          32.3%       34.4%       +2.1%
majority@4                40.5%       41.7%       +1.2%
pass@4                    56.3%       54.2%       -2.1%
no_answer@1[avg-of-4]     26.6%       29.7%       +3.1%
```
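For readers unfamiliar with the metric names, the sketch below shows one plausible way to aggregate them from per-rollout results. The function `aggregate_sketch`, the `(answer, is_correct)` input shape, exact-string majority voting, and first-seen tie-breaking are all assumptions; neither Skills nor Gym is confirmed to compute the metrics this way.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# One rollout: (extracted answer or None if nothing was extracted, is_correct)
Rollout = Tuple[Optional[str], bool]


def aggregate_sketch(tasks: List[List[Rollout]]) -> Dict[str, float]:
    """Aggregate k rollouts per task into benchmark-level fractions."""
    n = len(tasks)
    # pass@1[avg-of-k]: per-task mean correctness, averaged over tasks.
    pass_at_1 = sum(sum(ok for _, ok in r) / len(r) for r in tasks) / n
    # pass@k: a task counts if any of its rollouts is correct.
    pass_at_k = sum(any(ok for _, ok in r) for r in tasks) / n
    # no_answer: fraction of rollouts with no extracted answer.
    no_answer = sum(sum(a is None for a, _ in r) / len(r) for r in tasks) / n
    # majority@k: vote over extracted answers by exact string match
    # (assumption); ties go to the answer seen first.
    majority_hits = 0
    for r in tasks:
        answered = [(a, ok) for a, ok in r if a is not None]
        if not answered:
            continue
        top, _count = Counter(a for a, _ in answered).most_common(1)[0]
        majority_hits += next(ok for a, ok in answered if a == top)
    return {
        "pass@1": pass_at_1,
        "majority@k": majority_hits / n,
        "pass@k": pass_at_k,
        "no_answer": no_answer,
    }
```

With 48 problems and 4 rollouts each, `pass@k` moves in steps of 1/48 (about 2.1 points) and `pass@1` in steps of 1/192, so the deltas in the table correspond to a difference of a few tasks or rollouts.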
---------
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Co-authored-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
1 parent 072e13d commit 2b71315
6 files changed
Lines changed: 138 additions & 0 deletions
File tree
- benchmarks/apex_shortlist
- data
- prompts
Whitespace-only changes.