Multimodel profiling instead of difficulty estimation #1380

Draft
tatevik-t wants to merge 2 commits into main from tatevik-t/multi-model-profiling

@tatevik-t
Collaborator

Replaces the single-model difficulty_estimation stage in the
OpenScienceReasoning SDG pipeline with a multi-model profiling stage that
runs per-model generate → judge → aggregate chains in parallel and merges
the results into a single profiling array per problem.

Each problem now carries per-model pass-rate metrics instead of a single
scalar, making it straightforward to filter on agreement/disagreement
across a panel of reference models.

Output schema

Per-row field produced by the new stage:

"profiling": [
  {"model": "qwen3-30b-a3b",   "pass_rate": 0.5, "pass_at_n": "2/4"},
  {"model": "nemotron-super",  "pass_rate": 0.8, "pass_at_n": "4/5"},
  {"model": "gemma-4-31b-it",  "pass_rate": 0.25, "pass_at_n": "1/4"}
]
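
As a rough illustration of filtering on cross-model disagreement, the sketch below computes the spread between the highest and lowest `pass_rate` in a row's `profiling` array. The helper names and the `min_spread` threshold are hypothetical and are not part of this PR; only the `profiling` field shape is taken from the schema above.

```python
def pass_rate_spread(row):
    """Return max minus min pass_rate across the models in row["profiling"]."""
    rates = [entry["pass_rate"] for entry in row["profiling"]]
    return max(rates) - min(rates)


def models_disagree(row, min_spread=0.5):
    """True when the panel's pass rates differ by at least min_spread.

    min_spread is an illustrative threshold, not a pipeline setting.
    """
    return pass_rate_spread(row) >= min_spread


# Hypothetical row using the same field shape as the stage output.
row = {
    "profiling": [
        {"model": "model-a", "pass_rate": 0.5, "pass_at_n": "2/4"},
        {"model": "model-b", "pass_rate": 0.8, "pass_at_n": "4/5"},
    ]
}
disagreeing = models_disagree(row, min_spread=0.25)
```
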

filter_solutions accepts a corresponding dict bound:

profiling_pass_rate_range:
  qwen3-30b-a3b:  [0.0, 0.9]
  nemotron-super: [0.1, 1.0]

(bounds are min-exclusive, max-inclusive — consistent with the existing
generation_model_pass_rate_range.)
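
The bound semantics can be sketched as follows. This is a minimal stand-in, not the code in `filter_solutions.py`: the function name is hypothetical, and rejecting a row when a configured model is missing from its `profiling` array is an assumption made here for illustration.

```python
def passes_profiling_ranges(row, ranges):
    """Check every configured model's pass_rate against (min, max]."""
    rates = {entry["model"]: entry["pass_rate"] for entry in row["profiling"]}
    for model, (low, high) in ranges.items():
        rate = rates.get(model)
        # Assumption: a model with no profiling entry fails the filter.
        if rate is None:
            return False
        # Min is exclusive and max is inclusive, so 0.0 fails a [0.0, 0.9] bound.
        if not (low < rate <= high):
            return False
    return True


ranges = {"qwen3-30b-a3b": [0.0, 0.9], "nemotron-super": [0.1, 1.0]}
row = {
    "profiling": [
        {"model": "qwen3-30b-a3b", "pass_rate": 0.5, "pass_at_n": "2/4"},
        {"model": "nemotron-super", "pass_rate": 0.8, "pass_at_n": "4/5"},
    ]
}
kept = passes_profiling_ranges(row, ranges)
```

With a lower bound of 0.0, a problem that a model never solves is excluded, while a pass rate exactly equal to the upper bound is kept.
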

…culty_estimation

Introduce a new `profiling` stage that runs per-model generate->judge->aggregate
chains in parallel for N models and merges results into a single `profiling`
array per problem:

    "profiling": [
        {"model": "ModelA", "pass_rate": 0.5, "pass_at_n": "2/4"},
        {"model": "ModelB", "pass_rate": 0.8, "pass_at_n": "4/5"},
    ]

Changes:
- `run_pipeline.profiling()` orchestrates: shared prepare -> per-model chains
  (generate, judge, aggregate) in parallel -> final merge. Judge kwargs are
  copied per-iteration so args don't leak across models; `num_random_seeds`
  is inherited from generation if not explicitly set.
- `aggregate_profiling_model.py` (new): per-model aggregator over per-seed
  `output-rs*.jsonl` files. Streams inputs — keeps only the BASE_FIELDS
  projection + a small counters dict per `(id, problem)` key — so aggregation
  fits in memory at 1M+ problem scale. Falls back to `(_lineno, line_number)`
  when neither `id` nor `problem` is present on a record.
- `merge_profiling.py` (new): merges per-model result files. Asserts every
  per-model file contains the same `(id, problem)` key set so row-alignment
  mismatches fail loudly instead of silently dropping problems. After a
  successful merge, removes the per-model `result.jsonl` intermediates
  (folders — generation/, judgement/, logs/ — are retained for debugging).
- `filter_solutions.py`: replaces the scalar `difficulty_model_pass_rate`
  bounds with a per-model dict `profiling_pass_rate_ranges:
  {model_name: [min, max]}` (min exclusive, max inclusive).
- `validate_pipeline.py` and `scripts/utils/constants.py`: update stage-name
  and field-set checks (`PROFILING_FIELDS`, required `profiling` key,
  row-count equality for the new stage).
- Base + settings YAMLs: renamed stage (`difficulty_estimation` -> `profiling`)
  and directory (`step-3-difficulty-estimation` -> `step-3-profiling`); new
  `profiling.models: [...]` list with per-model `generation_kwargs` and an
  optional per-model `judge_kwargs` override.
- SLURM test: references `stages.profiling.models.0.generation_kwargs...`
  instead of the old top-level path.
- README: updates the stage list + filter-parameter description.
- `aggregate_difficulty.py` is removed (replaced by the new aggregator +
  merger).
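
A simplified sketch of the aggregate and merge steps described in the list above, for readers who want the idea without opening the scripts. It is not the code in `aggregate_profiling_model.py` or `merge_profiling.py`: the function names, the `is_correct` judgement field, and the use of `ValueError` in place of an assertion are assumptions, and the BASE_FIELDS projection is omitted.

```python
import json
from collections import defaultdict


def record_key(rec, lineno):
    """Key a record by (id, problem); fall back to the line number if both are absent."""
    if "id" in rec or "problem" in rec:
        return (rec.get("id"), rec.get("problem"))
    return ("_lineno", lineno)


def aggregate_model(model, seed_streams, correct_field="is_correct"):
    """Stream per-seed JSONL lines, keeping only [passed, total] counters per key.

    correct_field is a hypothetical name for the judge's verdict.
    """
    counters = defaultdict(lambda: [0, 0])
    for stream in seed_streams:  # one iterable of JSON lines per seed file
        for lineno, line in enumerate(stream):
            rec = json.loads(line)
            counts = counters[record_key(rec, lineno)]
            counts[0] += bool(rec.get(correct_field))
            counts[1] += 1
    return {
        key: {"model": model, "pass_rate": passed / total, "pass_at_n": f"{passed}/{total}"}
        for key, (passed, total) in counters.items()
    }


def merge_profiling(per_model_results):
    """Merge per-model results, failing loudly if their key sets differ."""
    expected = set(per_model_results[0])
    for result in per_model_results[1:]:
        if set(result) != expected:
            diff = sorted(set(result) ^ expected, key=repr)
            raise ValueError(f"key mismatch between per-model files: {diff}")
    return {key: [result[key] for result in per_model_results] for key in per_model_results[0]}
```

Because only two integers are held per key, memory grows with the number of distinct problems rather than with the number of seeds or the size of each generation.
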
… Super + Gemma-4)

Settings overlay showing the three-model profiling pattern with a shared
GPT-OSS-120B judge. Uses HF hub identifiers instead of absolute paths so
the file is adaptable to any cluster; each model entry documents the
recommended sampling parameters from its model card and, for Gemma, the
vLLM-image caveat (override `server_container` if the default image
predates `gemma4` architecture support).

Usage: `ns run ... --settings profiling_example`.