Multimodel profiling instead of difficulty estimation#1380
Introduce a new `profiling` stage that runs per-model generate->judge->aggregate
chains in parallel for N models and merges results into a single `profiling`
array per problem:
"profiling": [
{"model": "ModelA", "pass_rate": 0.5, "pass_at_n": "2/4"},
{"model": "ModelB", "pass_rate": 0.8, "pass_at_n": "4/5"},
]
Changes:
- `run_pipeline.profiling()` orchestrates: shared prepare -> per-model chains
(generate, judge, aggregate) in parallel -> final merge. Judge kwargs are
copied per-iteration so args don't leak across models; `num_random_seeds`
is inherited from generation if not explicitly set.
- `aggregate_profiling_model.py` (new): per-model aggregator over per-seed
`output-rs*.jsonl` files. Streams inputs — keeps only the BASE_FIELDS
projection + a small counters dict per `(id, problem)` key — so aggregation
fits in memory at 1M+ problem scale. Falls back to `(_lineno, line_number)`
when neither `id` nor `problem` is present on a record.
- `merge_profiling.py` (new): merges per-model result files. Asserts every
per-model file contains the same `(id, problem)` key set so row-alignment
mismatches fail loudly instead of silently dropping problems. After a
successful merge, removes the per-model `result.jsonl` intermediates
(folders — generation/, judgement/, logs/ — are retained for debugging).
- `filter_solutions.py`: replaces the scalar `difficulty_model_pass_rate`
bounds with a per-model dict `profiling_pass_rate_ranges:
{model_name: [min, max]}` (min exclusive, max inclusive).
- `validate_pipeline.py` and `scripts/utils/constants.py`: update stage-name
and field-set checks (`PROFILING_FIELDS`, required `profiling` key,
row-count equality for the new stage).
- Base + settings YAMLs: renamed stage (`difficulty_estimation` -> `profiling`)
and directory (`step-3-difficulty-estimation` -> `step-3-profiling`); new
`profiling.models: [...]` list with per-model `generation_kwargs` and an
optional per-model `judge_kwargs` override.
- SLURM test: references `stages.profiling.models.0.generation_kwargs...`
instead of the old top-level path.
- README: updates the stage list + filter-parameter description.
- `aggregate_difficulty.py` is removed (replaced by the new aggregator +
merger).
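The streaming pattern in the aggregator bullet above can be sketched as follows. This is illustrative, not the actual `aggregate_profiling_model.py`; the `BASE_FIELDS` projection and the `judgement` field name are assumptions:

```python
import json

BASE_FIELDS = ("id", "problem")  # projection kept per key (assumed)

def aggregate_stream(lines):
    """Stream per-seed JSONL records, keeping only a small projection plus
    pass/total counters per (id, problem) key, so memory stays bounded
    regardless of how many seeds or records are read."""
    stats = {}
    for lineno, line in enumerate(lines):
        rec = json.loads(line)
        key = (rec.get("id"), rec.get("problem"))
        # Fall back to the line number when neither id nor problem exists.
        if key == (None, None):
            key = ("_lineno", lineno)
        entry = stats.setdefault(
            key, {**{f: rec.get(f) for f in BASE_FIELDS}, "passed": 0, "total": 0}
        )
        entry["total"] += 1
        entry["passed"] += int(rec.get("judgement") == "pass")  # assumed field
    return stats
```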
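The loud-failure merge behavior described for `merge_profiling.py` amounts to asserting key-set equality before zipping rows together. A minimal in-memory sketch (the real script also handles file I/O and intermediate cleanup):

```python
def merge_profiling(per_model: dict) -> dict:
    """Merge per-model result dicts keyed by (id, problem).

    Asserts all models cover an identical key set, so a row-alignment
    mismatch raises instead of silently dropping problems.
    """
    models = list(per_model)
    key_sets = {m: set(rows) for m, rows in per_model.items()}
    reference = key_sets[models[0]]
    for m in models[1:]:
        assert key_sets[m] == reference, (
            f"key-set mismatch between {models[0]} and {m}"
        )
    # One merged row per key, with one profiling entry per model.
    return {
        key: {"profiling": [dict(model=m, **per_model[m][key]) for m in models]}
        for key in reference
    }
```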
- `profiling_example` settings overlay (… Super + Gemma-4): shows the three-model profiling pattern with a shared GPT-OSS-120B judge. Uses HF Hub identifiers instead of absolute paths so the file is adaptable to any cluster. Each model entry documents the recommended sampling parameters from its model card and, for Gemma, the vLLM-image caveat (override `server_container` if the default image predates `gemma4` architecture support). Usage: `ns run ... --settings profiling_example`.
Replaces the single-model `difficulty_estimation` stage in the OpenScienceReasoning SDG pipeline with a multi-model `profiling` stage that runs per-model generate → judge → aggregate chains in parallel and merges the results into a single `profiling` array per problem. Each problem now carries per-model pass-rate metrics instead of a single scalar, making it straightforward to filter on agreement/disagreement across a panel of reference models.
Output schema
Per-row field produced by the new stage: the `profiling` array shown above.
`filter_solutions` accepts a corresponding dict bound, `profiling_pass_rate_ranges: {model_name: [min, max]}` (bounds are min-exclusive, max-inclusive, consistent with the existing `generation_model_pass_rate_range`).
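The min-exclusive, max-inclusive semantics can be sketched with a hypothetical predicate (the helper name and ranges values are illustrative; the dict shape follows `profiling_pass_rate_ranges` as described):

```python
def keep_problem(profiling: list, ranges: dict) -> bool:
    """Return True iff every model named in `ranges` has a pass_rate in
    (min, max] for this problem: min exclusive, max inclusive."""
    rates = {entry["model"]: entry["pass_rate"] for entry in profiling}
    return all(
        lo < rates[model] <= hi
        for model, (lo, hi) in ranges.items()
    )

profiling = [
    {"model": "ModelA", "pass_rate": 0.5},
    {"model": "ModelB", "pass_rate": 0.8},
]
keep_problem(profiling, {"ModelA": [0.0, 0.5]})  # True: max is inclusive
keep_problem(profiling, {"ModelA": [0.5, 1.0]})  # False: min is exclusive
```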