
Commit b3c550a

Add Stirrup agent + GDPVal eval/RL environment (#1090)
## Summary

Adds a Stirrup-based agent + a GDPVal benchmark built on the NeMo-Gym benchmark convention (`ng_prepare_benchmark` + `ng_e2e_collect_rollouts`), validated on the full 220-task GDPVal set in both rubric and comparison scoring modes.

### Architecture

Split into three pieces, matching NeMo-Gym's server-type convention:

**Benchmark** — `benchmarks/gdpval/`

- `prepare.py` downloads `openai/gdpval` from HF → `data/gdpval_benchmark.jsonl`
- `config.yaml` wires `gdpval_judge_model` + `gdpval_resources_server` + `gdpval_stirrup_agent`
- Entry point: `ng_e2e_collect_rollouts +config_paths=[benchmarks/gdpval/config.yaml]`

**Resources server** — `resources_servers/gdpval/`

- Owns `verify()` and `aggregate_metrics()` with two modes via `reward_mode`:
  - `rubric` (default) — LLM-judge per-criterion score, reward in `[0, 1]`
  - `comparison` — pairwise vs `reference_deliverables_dir`, reward in `{0, 0.5, 1}`; `aggregate_metrics` reduces W/L/T → ELO anchored at `reference_elo` (default 1000)
- All scoring, pairwise comparison, and Office→PDF preconversion live here. The multimodal judge path is used whenever content blocks are available.

**Agent** — `responses_api_agents/stirrup_agent/`

- `StirrupAgentWrapper` is task-agnostic; task-specific logic lives in a `TaskStrategy` subclass (`GDPValTask`)
- `/run` executes the agent, persists deliverables, POSTs to the resources server's `/verify`, and returns the response. The agent itself is scoring-free.
- `aggregate_metrics` proxies to the resources server so ELO extras flow through; `/verify` errors are caught per-rollout so a single failure can't crash a run
- Optional: Apptainer-backed `code_exec`, Tavily web search

**Dependency:** `stirrup>=0.1.7` (Apache 2.0), declared as an extra of the stirrup_agent server, not of core.

## Validation (Ultra V3 SFT iter16k, full 220-task GDPVal, num_repeats=2)

**Rubric mode** (n=440):

- mean/reward = **0.755**, pass@1 = 0.755, pass@2 = 0.821
- 56% of rollouts score ≥ 0.8
- Pre-refactor port-v4 baseline: 0.24 → **3.1× lift** (dominant contributor: always-visual judge when content blocks are available)

**Comparison mode vs fork baseline** (4 trials per pairing, n=440):

- W/L/T = 147 / 208 / 77
- win_rate = 0.429
- **eval_elo = 950.6** (vs fork = 1000; port-v3 historical = 917 → +34 ELO)

## Running

```bash
ng_prepare_benchmark '+config_paths=[benchmarks/gdpval/config.yaml]'

ng_e2e_collect_rollouts \
    '+config_paths=[benchmarks/gdpval/config.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]' \
    '++split=benchmark' \
    '++output_jsonl_fpath=results/gdpval.jsonl' \
    "++gdpval_stirrup_agent.responses_api_agents.stirrup_agent.persist_deliverables_dir=$PWD/output/gdpval" \
    # ... policy_* overrides as usual

# Add for comparison mode:
    '++gdpval_resources_server.resources_servers.gdpval.reward_mode=comparison' \
    "++gdpval_resources_server.resources_servers.gdpval.reference_deliverables_dir=/path/to/reference"
```

## Test plan

- [x] `pytest resources_servers/gdpval/tests -x` — rubric + comparison unit tests
- [x] 10-task rubric smoke (mean/reward 0.719)
- [x] Full 220-task rubric (mean/reward 0.755)
- [x] 10-task comparison smoke (eval_elo ~1017 on small sample)
- [x] Full 220-task comparison (eval_elo 950.6)

---------

Signed-off-by: Serge Panev <spanev@nvidia.com>
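The W/L/T → ELO reduction described above can be illustrated with the standard expected-score inversion, with ties counted as half-wins. The sketch below is illustrative only — the function name and the clamping guard are made up here, not the resources server's actual code:

```python
# Hedged sketch: reduce W/L/T counts to an ELO rating anchored at reference_elo
# via the standard expected-score inversion (ties count as half-wins).
import math

def elo_from_wlt(wins: int, losses: int, ties: int, reference_elo: float = 1000.0) -> float:
    n = wins + losses + ties
    p = (wins + 0.5 * ties) / n  # expected score of the eval model vs the reference
    p = min(max(p, 1e-6), 1 - 1e-6)  # guard: the log-odds formula diverges at 0 or 1
    return reference_elo + 400.0 * math.log10(p / (1.0 - p))

# With the counts reported above (147 / 208 / 77) this gives ~950.6,
# consistent with the eval_elo from the full comparison run.
print(round(elo_from_wlt(147, 208, 77), 1))
```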
1 parent aaaf0be commit b3c550a

44 files changed

Lines changed: 5718 additions & 0 deletions

Large commits have some file contents hidden by default; only a subset of the 44 changed files is shown below.

benchmarks/gdpval/README.md

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@

# GDPVal benchmark

[GDPVal](https://huggingface.co/datasets/openai/gdpval) — 220 professional
knowledge-work tasks scored by an LLM judge against per-task rubrics. This
benchmark wires the Stirrup-based agent (`responses_api_agents/stirrup_agent`)
to the GDPVal resources server (`resources_servers/gdpval`).

## Prepare data

Downloads `openai/gdpval` from HuggingFace and writes
`data/gdpval_benchmark.jsonl`:

```bash
ng_prepare_benchmark "+config_paths=[benchmarks/gdpval/config.yaml]"
```

## Run rubric mode (default)

Each deliverable is scored 0–1 against the task rubric.

```bash
config_paths="responses_api_models/vllm_model/configs/vllm_model.yaml,\
benchmarks/gdpval/config.yaml"
ng_e2e_collect_rollouts \
    "+config_paths=[${config_paths}]" \
    ++output_jsonl_fpath=results/gdpval_rubric.jsonl \
    ++split=benchmark \
    ++policy_base_url=<vllm_base_url> \
    ++policy_api_key=<vllm_api_key> \
    ++policy_model_name=<served_model_name>
```

Required environment variables for the judge:

- `JUDGE_API_KEY` — sk- key for the judge inference API (nvapi- keys 401 on
  multimodal payloads)
- `JUDGE_BASE_URL` — defaults to NVIDIA's internal inference API
- `JUDGE_MODEL_NAME` — defaults to `gcp/google/gemini-3.1-pro-preview`
- `HF_TOKEN` — for downloading reference files (avoids HF anonymous rate limits)

## Run comparison mode (pairwise ELO vs. a reference model)

Each deliverable is judged against a reference model's deliverable for the
same `task_id`; aggregate metrics include ELO relative to a configurable
anchor (default 1000).

```bash
ng_e2e_collect_rollouts \
    "+config_paths=[${config_paths}]" \
    ++output_jsonl_fpath=results/gdpval_compare.jsonl \
    ++split=benchmark \
    ++gdpval_resources_server.resources_servers.gdpval.reward_mode=comparison \
    ++gdpval_resources_server.resources_servers.gdpval.reference_deliverables_dir=/path/to/reference/output
```

The reference directory must be laid out as
`<reference_deliverables_dir>/task_<task_id>/` with `finish_params.json` and
the deliverable files (the same layout the Stirrup agent persists).

## Aggregate metrics

After `ng_e2e_collect_rollouts` returns, the resources server's
`/aggregate_metrics` endpoint emits headline scores in
`results/<output>_metrics.json`:

- Rubric mode: `mean/reward` (pass@1 equivalent)
- Comparison mode: `comparison/wins`, `comparison/losses`, `comparison/ties`,
  `comparison/win_rate`, `comparison/eval_elo`, `comparison/normalized_elo`
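For a quick look at the headline numbers after a run, a minimal sketch follows; it assumes the metrics file is a flat JSON object keyed by the metric names listed above, and the example path is hypothetical:

```python
# Illustrative only: read the aggregate metrics written next to the rollout JSONL.
# Assumes a flat {"metric_name": value} layout; adjust if the real file nests keys.
import json
from pathlib import Path

metrics_fpath = Path("results/gdpval_rubric_metrics.json")  # <output>_metrics.json
metrics = json.loads(metrics_fpath.read_text())

for key in ("mean/reward", "comparison/win_rate", "comparison/eval_elo"):
    if key in metrics:
        print(f"{key}: {metrics[key]}")
```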

benchmarks/gdpval/__init__.py

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@

# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

benchmarks/gdpval/config.yaml

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@

# GDPVal benchmark — Stirrup agent + GDPVal resources server.
#
# Run:
#   ng_prepare_benchmark "+config_paths=[benchmarks/gdpval/config.yaml]"
#   ng_e2e_collect_rollouts \
#     "+config_paths=[responses_api_models/vllm_model/configs/vllm_model.yaml,benchmarks/gdpval/config.yaml]" \
#     ++split=benchmark \
#     ++output_jsonl_fpath=results/gdpval.jsonl
#
# Comparison mode (pairwise ELO vs a reference model's deliverables):
#   ++gdpval_resources_server.resources_servers.gdpval.reward_mode=comparison \
#   ++gdpval_resources_server.resources_servers.gdpval.reference_deliverables_dir=/path/to/fork

# Judge model — proxy to NVIDIA inference API for Gemini 3.1 Pro.
gdpval_judge_model:
  responses_api_models:
    openai_model:
      entrypoint: app.py
      openai_base_url: ${oc.env:JUDGE_BASE_URL,https://inference-api.nvidia.com/v1}
      openai_api_key: ${oc.env:JUDGE_API_KEY,dummy}
      openai_model: ${oc.env:JUDGE_MODEL_NAME,gcp/google/gemini-3.1-pro-preview}

# GDPVal resources server (rubric scoring by default; switch to comparison via override).
gdpval_resources_server:
  resources_servers:
    gdpval:
      entrypoint: app.py
      domain: other
      verified: false
      reward_mode: rubric
      reference_deliverables_dir: null
      num_comparison_trials: 4
      reference_elo: 1000.0
      preconvert_office_to_pdf: true
      preconvert_max_concurrent: 1
      judge_model_server:
        type: responses_api_models
        name: gdpval_judge_model
      judge_responses_create_params_overrides: {}

# Stirrup agent paired with the resources server above.
gdpval_stirrup_agent:
  responses_api_agents:
    stirrup_agent:
      entrypoint: app.py
      task: gdpval
      agent_max_turns: 100
      concurrency: 32
      temperature: 1.0
      system_prompt_template: ${oc.env:SYSTEM_PROMPT_TEMPLATE,null}
      user_prompt_template: ${oc.env:USER_PROMPT_TEMPLATE,null}
      gdpval_container_path: ${oc.env:GDPVAL_CONTAINER_PATH,null}
      persist_deliverables_dir: ${oc.env:PERSIST_DELIVERABLES_DIR,output/gdpval/deliverables}
      resources_server:
        type: resources_servers
        name: gdpval_resources_server
      model_server:
        type: responses_api_models
        name: policy_model
      datasets:
        - name: gdpval
          type: benchmark
          jsonl_fpath: benchmarks/gdpval/data/gdpval_benchmark.jsonl
          prompt_config: null
          prepare_script: benchmarks/gdpval/prepare.py
          num_repeats: 2
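The `${oc.env:VAR,default}` entries above resolve lazily against the environment. A minimal sketch of how the judge-model defaults fall back when the variables are unset is shown below; it assumes the file loads cleanly with plain OmegaConf outside the NeMo-Gym launcher:

```python
# Illustrative only: show the ${oc.env:VAR,default} fallbacks for the judge model.
import os
from omegaconf import OmegaConf

os.environ.pop("JUDGE_MODEL_NAME", None)  # force the inline default for this demo
cfg = OmegaConf.load("benchmarks/gdpval/config.yaml")
judge = cfg.gdpval_judge_model.responses_api_models.openai_model
print(judge.openai_model)    # gcp/google/gemini-3.1-pro-preview
print(judge.openai_api_key)  # "dummy" unless JUDGE_API_KEY is exported
```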

benchmarks/gdpval/data/.gitignore

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@

*train.jsonl
*validation.jsonl
*benchmark.jsonl
*train_prepare.jsonl
*validation_prepare.jsonl
*example_prepare.jsonl

benchmarks/gdpval/prepare.py

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@

# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Prepare the GDPVal benchmark JSONL.

Downloads the ``openai/gdpval`` HuggingFace dataset and converts it into the
NeMo-Gym benchmark JSONL format: each row has ``responses_create_params`` (an
empty input — the Stirrup agent builds the actual prompt from the top-level
``prompt`` / ``sector`` / ``occupation`` fields) plus task metadata at the
top level so the GDPVal resources server can pick them up via /verify.
"""

from __future__ import annotations

import json
import os
from pathlib import Path


BENCHMARK_DIR = Path(__file__).parent
DATA_DIR = BENCHMARK_DIR / "data"
OUTPUT_FPATH = DATA_DIR / "gdpval_benchmark.jsonl"

HF_DATASET = "openai/gdpval"
HF_SPLIT = "train"


def prepare() -> Path:
    from datasets import load_dataset

    DATA_DIR.mkdir(parents=True, exist_ok=True)
    # Pass HF_TOKEN explicitly — ``load_dataset`` doesn't always pick it up
    # from the env, and GDPVal's bucket aggressively rate-limits anonymous IPs.
    hf_token = os.environ.get("HF_TOKEN") or os.environ.get("HUGGING_FACE_HUB_TOKEN")
    ds = load_dataset(HF_DATASET, split=HF_SPLIT, token=hf_token)

    with OUTPUT_FPATH.open("w") as f:
        for row in ds:
            record = {
                # Empty input: the Stirrup agent constructs the user prompt
                # from the top-level ``prompt`` field at runtime.
                "responses_create_params": {"input": []},
                "task_id": row["task_id"],
                "sector": row.get("sector", ""),
                "occupation": row.get("occupation", ""),
                "prompt": row["prompt"],
                "reference_files": row.get("reference_files", []),
                "reference_file_urls": row.get("reference_file_urls", []),
                "rubric_json": row.get("rubric_json", {}),
                "rubric_pretty": row.get("rubric_pretty", ""),
            }
            f.write(json.dumps(record) + "\n")

    print(f"Wrote {len(ds)} tasks to {OUTPUT_FPATH}")
    return OUTPUT_FPATH


if __name__ == "__main__":
    prepare()
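Once the script has run (directly or via `ng_prepare_benchmark`), each JSONL row carries the fields built in `prepare()`. A small spot-check sketch — field names come from the code above, the rest is illustrative:

```python
# Spot-check the prepared file: one JSON object per line with the task metadata
# fields written by prepare() above.
import json
from pathlib import Path

fpath = Path("benchmarks/gdpval/data/gdpval_benchmark.jsonl")
with fpath.open() as f:
    first = json.loads(f.readline())

assert first["responses_create_params"] == {"input": []}
print(first["task_id"], "|", first["sector"], "|", first["occupation"])
print(first["prompt"][:200])
```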

resources_servers/gdpval/README.md

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@

# GDPVal resources server

Scores deliverables produced by the Stirrup agent on the GDPVal benchmark.

Two modes via `reward_mode` config:

- `rubric` (default) — LLM judge scores each deliverable against a per-task
  rubric, reward in `[0.0, 1.0]`.
- `comparison` — pairwise judge compares eval deliverable vs. a reference
  rollout (`reference_deliverables_dir` must be set), reward in
  `{0.0, 0.5, 1.0}`. `aggregate_metrics` reduces to an ELO rating.

Canonical entry point is the benchmark at `benchmarks/gdpval/`:

```bash
ng_prepare_benchmark "+config_paths=[benchmarks/gdpval/config.yaml]"
ng_e2e_collect_rollouts \
    "+config_paths=[responses_api_models/vllm_model/configs/vllm_model.yaml,benchmarks/gdpval/config.yaml]" \
    ++split=benchmark
```

See `benchmarks/gdpval/README.md` for the full run recipe.
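Before a comparison run, it can help to verify the reference layout described in `benchmarks/gdpval/README.md` (`task_<task_id>/` directories containing `finish_params.json` plus the deliverable files). A minimal pre-flight sketch, with the path purely illustrative:

```python
# Illustrative pre-flight check for comparison mode: every task_<task_id>/
# directory should hold finish_params.json plus at least one deliverable file.
from pathlib import Path

reference_dir = Path("/path/to/reference/output")  # reference_deliverables_dir
for task_dir in sorted(reference_dir.glob("task_*")):
    finish = task_dir / "finish_params.json"
    deliverables = [p for p in task_dir.iterdir() if p.name != "finish_params.json"]
    if not finish.exists() or not deliverables:
        print(f"incomplete reference rollout: {task_dir.name}")
```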
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@

# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
