
Commit 2b71315

Add APEX Shortlist benchmark (#1105)
# Add APEX Shortlist benchmark

Migrates the `apex-shortlist` benchmark from NeMo Skills into Gym on top of the existing `math_with_judge` resource server. Verification uses the server's symbolic-only path (math-verify, `should_use_judge: false`) with a new opt-in `parse_reasoning_like_skills` flag that mirrors Skills' `parse_reasoning=True` plus brace-matched `\boxed{…}` extraction; this is needed to avoid spurious mid-reasoning extractions on truncated generations.

## Includes

- `benchmarks/apex_shortlist/`: benchmark config, prepare.py, prompt template
  - Data source: `MathArena/apex-shortlist` on HuggingFace (48 problems, 32 integer + 16 symbolic answers)
- `resources_servers/math_with_judge/`: extended (not new)
  - `_search_boxed` brace-matching extractor (mirrors `nemo_skills.evaluation.math_grader.search_boxed`) that prefers the raw `\boxed{…}` LaTeX over math-verify's normalized form as judge input
  - `_strip_think_tags` and a `skills_parity_mode` flag that routes rollouts through Skills' full judge pipeline (`parse_reasoning` → `search_boxed` → prefill shortcuts → LLM judge) for per-rollout parity
  - `parse_reasoning_like_skills` flag (new on this branch): applies the same extraction to the symbolic-only path (no judge), for benchmarks whose Skills config is `eval_type=math` + `should_use_judge=false`

## Validated against NeMo Skills

Single comparison run on draco-oci: 48 problems × 4 rollouts/task, Nemotron-3-Nano-30B-A3B-BF16, T=1.0, top_p=0.95, max_output_tokens=65536. Skills uses 4× single-node vLLM (one per seed); Gym uses a single 4-node DP vLLM (TP=8, DP=4, Ray-coordinated) with `+num_repeats=4`.

```
===========================================================================
eval_type=math (symbolic-only, math-verify) | 4 rollouts/task | T=1.0 top_p=0.95
Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
===========================================================================
Metric                      Skills    Gym      Delta
---------------------------------------------------------------
pass@1[avg-of-4]            32.3%     34.4%    +2.1%
majority@4                  40.5%     41.7%    +1.2%
pass@4                      56.3%     54.2%    -2.1%
no_answer@1[avg-of-4]       26.6%     29.7%    +3.1%
```

---------

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Co-authored-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
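For reference, a minimal sketch of the brace-matched `\boxed{…}` extraction described above. The name and details here are illustrative stand-ins for `nemo_skills.evaluation.math_grader.search_boxed` and the `_search_boxed` helper this PR adds, not the actual implementations:

```python
def search_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in ``text``.

    A non-greedy regex breaks on nested braces such as
    \\boxed{\\frac{1}{2}}, so walk the string and count brace depth instead.
    """
    start = text.rfind("\\boxed{")
    if start == -1:
        return None  # no boxed answer at all
    depth = 0
    for i in range(start + len("\\boxed"), len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                # slice out everything between the outermost braces
                return text[start + len("\\boxed{"):i]
    return None  # unbalanced braces, e.g. a truncated generation


print(search_boxed(r"... so the answer is \boxed{\frac{1}{2}}."))  # \frac{1}{2}
```

Returning `None` on unbalanced braces is what makes this safe on truncated generations, where a naive regex could latch onto a `\boxed{` deep inside the reasoning trace.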
1 parent 072e13d commit 2b71315

6 files changed: 138 additions & 0 deletions

benchmarks/apex_shortlist/README.md

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
# APEX Shortlist

Math problems from MathArena's APEX Shortlist, sourced from
`MathArena/apex-shortlist` on HuggingFace. Mirrors the NeMo Skills
`apex-shortlist` benchmark (`nemo_skills/dataset/apex-shortlist/`).

## Verification

Reuses the `math_with_judge` resource server in **symbolic-only** mode
(`should_use_judge: false`) to mirror NeMo Skills' `eval_type=math`
default for this benchmark. The HuggingFace `math-verify` library checks
the model-extracted `\boxed{...}` answer for symbolic equivalence against
`expected_answer`.

## Prompt

User-only prompt, a character-for-character match with NeMo Skills'
`generic/math.yaml`:

```
Solve the following math problem. Make sure to put the answer (and only answer) inside \boxed{}.

<question>
```

## Data preparation

```bash
ng_prepare_benchmark '+config_paths=[benchmarks/apex_shortlist/config.yaml]'
```

Writes `data/apex_shortlist_benchmark.jsonl` with one row per problem:
`{"question": "...", "expected_answer": "..."}`.

## Running servers

```bash
config_paths="responses_api_models/vllm_model/configs/vllm_model.yaml,\
benchmarks/apex_shortlist/config.yaml"
ng_run "+config_paths=[$config_paths]"
```

## Collecting rollouts

```bash
ng_collect_rollouts \
  +agent_name=apex_shortlist_math_with_judge_simple_agent \
  +input_jsonl_fpath=benchmarks/apex_shortlist/data/apex_shortlist_benchmark.jsonl \
  +output_jsonl_fpath=results/apex_shortlist_rollouts.jsonl \
  +num_repeats=4
```
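The symbolic check in the Verification section above comes from the HuggingFace `math-verify` package. A minimal sketch of that comparison, assuming the `parse`/`verify` API shown in the package README (the example answers are illustrative, not from the dataset):

```python
# pip install math-verify
from math_verify import parse, verify

gold = parse("$\\frac{1}{2}$")   # the benchmark's expected_answer
pred = parse("$\\boxed{1/2}$")   # extracted from the model's generation

# verify(gold, pred) returns True when the two expressions are
# symbolically equivalent, e.g. 1/2 vs. \frac{1}{2}.
print(verify(gold, pred))
```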

benchmarks/apex_shortlist/__init__.py

Whitespace-only changes.
benchmarks/apex_shortlist/config.yaml

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
# Chain to existing resource server + agent config
config_paths:
  - resources_servers/math_with_judge/configs/math_with_judge.yaml

# We use `_inherit_from` directives to inherit from, rather than directly use,
# the generic config above, so this benchmark config stays isolated.
apex_shortlist_math_with_judge_resources_server:
  _inherit_from: math_with_judge
  resources_servers:
    math_with_judge:
      should_use_judge: false

apex_shortlist_math_with_judge_simple_agent:
  _inherit_from: math_with_judge_simple_agent
  responses_api_agents:
    simple_agent:
      resources_server:
        name: apex_shortlist_math_with_judge_resources_server

datasets:
  - name: apex_shortlist
    type: benchmark
    jsonl_fpath: benchmarks/apex_shortlist/data/apex_shortlist_benchmark.jsonl
    prompt_config: benchmarks/apex_shortlist/prompts/default.yaml
    prepare_script: benchmarks/apex_shortlist/prepare.py
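The `_inherit_from` machinery itself is not part of this diff. As a rough mental model only (an assumption about Gym's config resolution, not its actual code, and with made-up parent values), it behaves like a recursive dict merge in which the derived block's keys override the parent's:

```python
def deep_merge(parent: dict, override: dict) -> dict:
    """Recursively merge override into parent (sketch of assumed _inherit_from semantics)."""
    merged = dict(parent)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# The derived resources server flips should_use_judge while keeping
# everything else inherited from math_with_judge ("port" is illustrative).
parent = {"resources_servers": {"math_with_judge": {"should_use_judge": True, "port": 8000}}}
override = {"resources_servers": {"math_with_judge": {"should_use_judge": False}}}
print(deep_merge(parent, override))
```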
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
*benchmark.jsonl
benchmarks/apex_shortlist/prepare.py

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Prepare the APEX Shortlist benchmark data.

Downloads APEX Shortlist problems from HuggingFace and converts them to the
Gym benchmark JSONL format with `question` and `expected_answer` fields.
Mirrors NeMo Skills' `nemo_skills/dataset/apex-shortlist/prepare.py`.
"""

import json
from pathlib import Path

from datasets import load_dataset


BENCHMARK_DIR = Path(__file__).parent
DATA_DIR = BENCHMARK_DIR / "data"
OUTPUT_FPATH = DATA_DIR / "apex_shortlist_benchmark.jsonl"

HF_REPO_ID = "MathArena/apex-shortlist"


def prepare() -> Path:
    """Download and prepare APEX Shortlist data. Returns the output file path."""
    DATA_DIR.mkdir(parents=True, exist_ok=True)

    print(f"Loading APEX Shortlist data from {HF_REPO_ID}...")
    ds = load_dataset(HF_REPO_ID, split="train")

    count = 0
    with open(OUTPUT_FPATH, "w") as f:
        for row in ds:
            out = {
                "question": row["problem"],
                "expected_answer": str(row["answer"]),
            }
            f.write(json.dumps(out) + "\n")
            count += 1

    print(f"Wrote {count} problems to {OUTPUT_FPATH}")
    return OUTPUT_FPATH


if __name__ == "__main__":
    prepare()
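After running the script, the prepared file can be sanity-checked against the counts quoted in the PR description (48 problems: 32 integer, 16 symbolic answers). The integer test below is a rough illustrative heuristic, not part of the pipeline:

```python
import json
from pathlib import Path

jsonl = Path("benchmarks/apex_shortlist/data/apex_shortlist_benchmark.jsonl")
rows = [json.loads(line) for line in jsonl.read_text().splitlines()]

# Crude integer-vs-symbolic split of expected_answer values.
n_integer = sum(r["expected_answer"].lstrip("+-").isdigit() for r in rows)
print(f"{len(rows)} problems, {n_integer} integer, {len(rows) - n_integer} symbolic")
# Expected per the PR description: 48 problems, 32 integer, 16 symbolic.
```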
benchmarks/apex_shortlist/prompts/default.yaml

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
# Mirrors NeMo Skills' `nemo_skills/prompt/config/generic/math.yaml`.
user: |-
  Solve the following math problem. Make sure to put the answer (and only answer) inside \boxed{{}}.

  {question}
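The doubled braces in `\boxed{{}}` are deliberate: assuming the template is rendered with Python `str.format`-style substitution (an assumption about Gym's prompt rendering, consistent with the single-braced `{question}` placeholder), `{{}}` escapes to a literal `{}` in the prompt the model sees:

```python
template = (
    "Solve the following math problem. Make sure to put the answer "
    "(and only answer) inside \\boxed{{}}.\n\n{question}"
)
# {{}} renders as literal {}, while {question} is substituted.
print(template.format(question="What is 1 + 1?"))
```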
