Multimodel profiling instead of difficulty estimation#1380
Introduce a new `profiling` stage that runs per-model generate->judge->aggregate
chains in parallel for N models and merges results into a single `profiling`
array per problem:
"profiling": [
{"model": "ModelA", "pass_rate": 0.5, "pass_at_n": "2/4"},
{"model": "ModelB", "pass_rate": 0.8, "pass_at_n": "4/5"},
]
Changes:
- `run_pipeline.profiling()` orchestrates: shared prepare -> per-model chains
(generate, judge, aggregate) in parallel -> final merge. Judge kwargs are
copied per-iteration so args don't leak across models; `num_random_seeds`
is inherited from generation if not explicitly set.
- `aggregate_profiling_model.py` (new): per-model aggregator over per-seed
`output-rs*.jsonl` files. Streams inputs — keeps only the BASE_FIELDS
projection + a small counters dict per `(id, problem)` key — so aggregation
fits in memory at 1M+ problem scale. Falls back to `(_lineno, line_number)`
when neither `id` nor `problem` is present on a record.
- `merge_profiling.py` (new): merges per-model result files. Asserts every
per-model file contains the same `(id, problem)` key set so row-alignment
mismatches fail loudly instead of silently dropping problems. After a
successful merge, removes the per-model `result.jsonl` intermediates
(folders — generation/, judgement/, logs/ — are retained for debugging).
- `filter_solutions.py`: replaces the scalar `difficulty_model_pass_rate`
bounds with a per-model dict `profiling_pass_rate_ranges:
{model_name: [min, max]}` (min exclusive, max inclusive).
- `validate_pipeline.py` and `scripts/utils/constants.py`: update stage-name
and field-set checks (`PROFILING_FIELDS`, required `profiling` key,
row-count equality for the new stage).
- Base + settings YAMLs: renamed stage (`difficulty_estimation` -> `profiling`)
and directory (`step-3-difficulty-estimation` -> `step-3-profiling`); new
`profiling.models: [...]` list with per-model `generation_kwargs` and an
optional per-model `judge_kwargs` override.
- SLURM test: references `stages.profiling.models.0.generation_kwargs...`
instead of the old top-level path.
- README: updates the stage list + filter-parameter description.
- `aggregate_difficulty.py` is removed (replaced by the new aggregator +
merger).
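The streaming pattern in the aggregator bullet above can be sketched as follows. This is illustrative, not the actual `aggregate_profiling_model.py`; the `BASE_FIELDS` projection and the `judgement` field name are assumptions:

```python
import json

BASE_FIELDS = ("id", "problem")  # projection kept per key (assumed)

def aggregate_stream(lines):
    """Stream per-seed JSONL records, keeping only a small projection plus
    pass/total counters per (id, problem) key, so memory stays bounded
    regardless of how many seeds or records are read."""
    stats = {}
    for lineno, line in enumerate(lines):
        rec = json.loads(line)
        key = (rec.get("id"), rec.get("problem"))
        # Fall back to the line number when neither id nor problem exists.
        if key == (None, None):
            key = ("_lineno", lineno)
        entry = stats.setdefault(
            key, {**{f: rec.get(f) for f in BASE_FIELDS}, "passed": 0, "total": 0}
        )
        entry["total"] += 1
        entry["passed"] += int(rec.get("judgement") == "pass")  # assumed field
    return stats
```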
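The loud-failure merge behavior described for `merge_profiling.py` amounts to asserting key-set equality before zipping rows together. A minimal in-memory sketch (the real script also handles file I/O and intermediate cleanup):

```python
def merge_profiling(per_model: dict) -> dict:
    """Merge per-model result dicts keyed by (id, problem).

    Asserts all models cover an identical key set, so a row-alignment
    mismatch raises instead of silently dropping problems.
    """
    models = list(per_model)
    key_sets = {m: set(rows) for m, rows in per_model.items()}
    reference = key_sets[models[0]]
    for m in models[1:]:
        assert key_sets[m] == reference, (
            f"key-set mismatch between {models[0]} and {m}"
        )
    # One merged row per key, with one profiling entry per model.
    return {
        key: {"profiling": [dict(model=m, **per_model[m][key]) for m in models]}
        for key in reference
    }
```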
- `profiling_example` settings overlay (… Super + Gemma-4): shows the three-model profiling pattern with a shared GPT-OSS-120B judge. Uses HF Hub identifiers instead of absolute paths so the file is adaptable to any cluster. Each model entry documents the recommended sampling parameters from its model card and, for Gemma, the vLLM-image caveat (override `server_container` if the default image predates `gemma4` architecture support). Usage: `ns run ... --settings profiling_example`.
Replaces the single-model `difficulty_estimation` stage in the OpenScienceReasoning SDG pipeline with a multi-model `profiling` stage that runs per-model generate → judge → aggregate chains in parallel and merges the results into a single `profiling` array per problem. Each problem now carries per-model pass-rate metrics instead of a single scalar, making it straightforward to filter on agreement/disagreement across a panel of reference models.
Output schema
Per-row field produced by the new stage: the `profiling` array shown above.
`filter_solutions` accepts a corresponding dict bound, `profiling_pass_rate_ranges: {model_name: [min, max]}` (bounds are min-exclusive, max-inclusive, consistent with the existing `generation_model_pass_rate_range`).
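The min-exclusive, max-inclusive semantics can be sketched with a hypothetical predicate (the helper name and ranges values are illustrative; the dict shape follows `profiling_pass_rate_ranges` as described):

```python
def keep_problem(profiling: list, ranges: dict) -> bool:
    """Return True iff every model named in `ranges` has a pass_rate in
    (min, max] for this problem: min exclusive, max inclusive."""
    rates = {entry["model"]: entry["pass_rate"] for entry in profiling}
    return all(
        lo < rates[model] <= hi
        for model, (lo, hi) in ranges.items()
    )

profiling = [
    {"model": "ModelA", "pass_rate": 0.5},
    {"model": "ModelB", "pass_rate": 0.8},
]
keep_problem(profiling, {"ModelA": [0.0, 0.5]})  # True: max is inclusive
keep_problem(profiling, {"ModelA": [0.5, 1.0]})  # False: min is exclusive
```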