Add ContextASR-Bench benchmark (contextual ASR with NE-WER/NE-FNR metrics) #1365
KunalDhawan wants to merge 6 commits into NVIDIA-NeMo:main
Conversation
Signed-off-by: Kunal Dhawan <kunaldhawan97@gmail.com>
…if pre downloaded data doesn't exist Signed-off-by: Kunal Dhawan <kunaldhawan97@gmail.com>
…ssing contractions dep, fix prompt typo Signed-off-by: Kunal Dhawan <kunaldhawan97@gmail.com>
📝 Walkthrough

Adds ContextASR-Bench: dataset package and prepare script, three evaluation modes (contextless/coarse/fine), ContextASR evaluator and corpus metrics, score aggregation, docs, test registration, and a new dependency (`contractions`).
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant Prepare as "prepare.py"
    participant HF as "HuggingFace Hub"
    participant Evaluator as "ContextASREvaluator"
    participant Metrics as "ContextASRMetrics"
    participant Scorer as "contextasr_score.py"
    User->>Prepare: main(--data_dir, --audio-prefix)
    Prepare->>HF: download JSONL + audio tars
    HF-->>Prepare: JSONL + tar files
    Prepare->>Prepare: extract WAVs, build per-mode JSONL
    Prepare-->>User: write contextless/coarse/fine test.jsonl
    User->>Evaluator: eval_single(data_point)
    Evaluator->>Evaluator: normalize text, expand contractions
    Evaluator->>Evaluator: extract_entities (exact & fuzzy)
    Evaluator->>Evaluator: calculate_wer, compute ne_wer/ne_fnr
    Evaluator-->>User: sample-level metrics
    User->>Metrics: update(predictions)
    Metrics->>Metrics: accumulate corpus totals, compute pass@k/majority
    Metrics-->>User: corpus-level metrics
    User->>Scorer: compute_score(combined_metrics)
    Scorer->>Scorer: aggregate weighted metrics across modes
    Scorer-->>User: aggregated per-mode results
```
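For readers unfamiliar with the metrics named in the diagram, the sketch below illustrates the general idea behind fuzzy entity matching and NE-FNR (named-entity false negative rate, i.e. 1 − entity recall). All names here (`fuzzy_entity_hits`, the 0.8 threshold, `SequenceMatcher`-based similarity) are illustrative assumptions, not the PR's actual implementation.

```python
from difflib import SequenceMatcher


def fuzzy_entity_hits(entities, hypothesis, threshold=0.8):
    """Count reference entities recovered in the hypothesis, allowing
    near matches (e.g. minor spelling differences) via a similarity ratio."""
    hyp = hypothesis.lower()
    hits = 0
    for entity in entities:
        ent = entity.lower()
        if ent in hyp:  # exact substring match
            hits += 1
            continue
        # Fuzzy fallback: best similarity against any window of the same word count.
        words, span = hyp.split(), len(ent.split())
        windows = [" ".join(words[i:i + span]) for i in range(max(1, len(words) - span + 1))]
        if any(SequenceMatcher(None, ent, w).ratio() >= threshold for w in windows):
            hits += 1
    return hits


def ne_fnr(entities, hypothesis):
    """Named-entity false negative rate: 1 - entity recall."""
    if not entities:
        return 0.0
    return 1.0 - fuzzy_entity_hits(entities, hypothesis) / len(entities)
```

Under this toy definition, `ne_fnr(["NVIDIA", "NeMo"], "the nvidia nemo toolkit")` is `0.0`, and a slightly misspelled entity like "kunal dhawann" still counts as a hit via the fuzzy fallback.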
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes

🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/test_datasets.py (1)
58-58: Consider adding a SLURM/GPU evaluation smoke test for this benchmark.

Given the new audio modality plus the custom evaluator/metrics path, a lightweight SLURM/GPU smoke test would improve regression detection beyond discovery checks.
Based on learnings: When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_datasets.py` at line 58, Add "contextasr-bench" to the SLURM/GPU smoke test set so the new audio modality and custom evaluator/metrics path are exercised in CI; locate the tuple ("contextasr-bench", ["test"]) in tests/test_datasets.py and add it to the list/collection used for SLURM or GPU smoke runs (e.g., the SLURM_TESTS or GPU_SMOKE dataset list), ensuring the entry is included in the code path that triggers SLURM/GPU evaluation (so the benchmark runs with the GPU/SLURM runner in CI).

nemo_skills/dataset/contextasr-bench/prepare.py (1)
88-88: Remove extraneous f-prefix from strings without placeholders.Static analysis correctly flags several f-strings that have no placeholders. These should be regular strings.
♻️ Proposed fix
```diff
- print(f"Total download size: ~22 GB (JSONL + 8 audio tar files)")
+ print("Total download size: ~22 GB (JSONL + 8 audio tar files)")
```

Also apply to lines 123 and 254:

```diff
- print(f" Extracted. Removing tar file to save space...")
+ print(" Extracted. Removing tar file to save space...")
- print(f"\nWriting JSONL splits...")
+ print("\nWriting JSONL splits...")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/contextasr-bench/prepare.py` at line 88, The code uses unnecessary f-strings with no placeholders (e.g., the print call in prepare.py that currently is print(f"Total download size: ~22 GB (JSONL + 8 audio tar files)")); change these to ordinary strings by removing the leading "f" and do the same for the other occurrences noted (lines around the other similar prints at the locations referenced in your review, e.g., the ones you mentioned near 123 and 254) so that no f-prefix is used when there are no interpolations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/dataset/contextasr-bench/contextasr_score.py`:
- Around line 71-76: The aggregation currently uses truthiness checks (if
weighted_wer / weighted_ne_wer / weighted_ne_fnr) which will skip valid zero
values; change those conditionals to presence checks (e.g., "is not None") so
zero metrics are included when computing agg["wer"], agg["ne_wer"], and
agg["ne_fnr"] using weighted_wer, weighted_ne_wer, weighted_ne_fnr divided by
total_entries; update the conditional expressions around the assignments to agg
in contextasr_score.py accordingly and ensure total_entries is used as before.
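The presence-check fix described in this prompt can be sketched as a toy before/after (the function name and shape are illustrative, not the actual `contextasr_score.py` code):

```python
def aggregate(weighted_wer, total_entries):
    """Include a metric in the aggregate whenever it is present,
    even when its value is exactly 0.0 (a perfect score)."""
    agg = {}
    # Presence check: `if weighted_wer:` would silently drop a valid 0.0.
    if weighted_wer is not None:
        agg["wer"] = weighted_wer / total_entries
    return agg
```

With the truthiness version, `aggregate(0.0, 10)` would return `{}` and a perfect-score mode would vanish from the report; with the presence check it correctly returns `{"wer": 0.0}`.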
---
Nitpick comments:
In `@nemo_skills/dataset/contextasr-bench/prepare.py`:
- Line 88: The code uses unnecessary f-strings with no placeholders (e.g., the
print call in prepare.py that currently is print(f"Total download size: ~22 GB
(JSONL + 8 audio tar files)")); change these to ordinary strings by removing the
leading "f" and do the same for the other occurrences noted (lines around the
other similar prints at the locations referenced in your review, e.g., the ones
you mentioned near 123 and 254) so that no f-prefix is used when there are no
interpolations.
In `@tests/test_datasets.py`:
- Line 58: Add "contextasr-bench" to the SLURM/GPU smoke test set so the new
audio modality and custom evaluator/metrics path are exercised in CI; locate the
tuple ("contextasr-bench", ["test"]) in tests/test_datasets.py and add it to the
list/collection used for SLURM or GPU smoke runs (e.g., the SLURM_TESTS or
GPU_SMOKE dataset list), ensuring the entry is included in the code path that
triggers SLURM/GPU evaluation (so the benchmark runs with the GPU/SLURM runner
in CI).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 4755148e-f349-4ec6-9b91-5b6fab70c7ef
📒 Files selected for processing (13)
- core/requirements.txt
- docs/evaluation/speech-audio.md
- nemo_skills/dataset/contextasr-bench/__init__.py
- nemo_skills/dataset/contextasr-bench/coarse/__init__.py
- nemo_skills/dataset/contextasr-bench/contextasr_score.py
- nemo_skills/dataset/contextasr-bench/contextless/__init__.py
- nemo_skills/dataset/contextasr-bench/fine/__init__.py
- nemo_skills/dataset/contextasr-bench/prepare.py
- nemo_skills/evaluation/evaluator/__init__.py
- nemo_skills/evaluation/evaluator/contextasr.py
- nemo_skills/evaluation/metrics/contextasr_metrics.py
- nemo_skills/evaluation/metrics/map_metrics.py
- tests/test_datasets.py
> This can take 30-60 minutes depending on network speed. If you already have the data
> downloaded, use `--data_dir` to skip the download.
>
> To download to a specific directory, or to use pre-downloaded data:
Is it stored in team cache by any chance?
(Not a comment to PR itself)
Not currently in team cache. The dataset is downloaded from HuggingFace on demand (~22 GB). Happy to upload it to team cache if that would be useful, let me know the preferred location.
@KunalDhawan let's fix lint and format: `pre-commit run --all-files`
```python
import argparse
import json
import subprocess
```
Good catch, removed. It was left over from an earlier iteration that used subprocess for tar extraction. I switched to Python's `tarfile` module but forgot to clean up the import.
```python
current = []
result = []
for word in words:
    if not word:
```
About `if not word:`: if we're already inside the for-loop, doesn't that mean `word` exists?
You're right, `str.split()` never produces empty strings, so the `if not word:` branch was dead code. Removed it.
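A quick interpreter check of this point:

```python
# `str.split()` with no separator collapses runs of whitespace and strips
# leading/trailing whitespace, so it can never yield empty strings.
words = "  hello   world\t\n".split()
print(words)  # ['hello', 'world']

# Contrast: splitting on an explicit separator CAN yield empty strings.
print("a,,b".split(","))  # ['a', '', 'b']
```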
Thanks for the review! All addressed in the latest push:
- Fixed the contextasr_score truthiness bug (now uses presence checks)
- Ran `pre-commit run --all-files`, all hooks pass
- Also addressed the non-mandatory items: removed the unused `import subprocess`, removed the dead `if not word:` guard, fixed f-strings without placeholders, and added docstrings for 80%+ coverage.
…, clean up dead code, add docstrings Signed-off-by: Kunal Dhawan <kunaldhawan97@gmail.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/evaluation/metrics/contextasr_metrics.py`:
- Around line 43-45: The _get_score_dict function should return the standard
"correct" key and not silently default missing fields; change the return to use
direct indexing (e.g., prediction["correct"]) so a missing field raises an error
instead of returning False, ensuring get_metrics() can derive success_rate from
"correct" and failures in evaluator schema surface immediately.
- Around line 60-75: The update method currently sums per-generation WER/NE
accumulators (wer_total_errors, wer_total_ref_words, ne_wer_total_errors,
ne_wer_total_ref_words, ne_fnr_total_hits, ne_fnr_total_entities) across all
generations, which is incorrect when max_k > 1; either fail fast for unsupported
multi-generation aggregation or aggregate WER/NE from the selected hypothesis
per aggregation mode. Add a guard at the start of update that checks self.max_k
(or the equivalent attribute) and raises an error if > 1 to enforce
single-generation scoring, OR change the logic: call _compute_pass_at_k and
_compute_majority_at_k first to determine the chosen hypothesis per sample, then
compute and accumulate WER/NE counts only for those chosen hypotheses (instead
of summing every pred entry), updating wer_total_errors etc. accordingly;
reference update, _compute_pass_at_k, _compute_majority_at_k, and the wer_* /
ne_* accumulator names when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: a5104ffa-9363-4a7c-8f4f-6ba06b645ced
📒 Files selected for processing (8)
- nemo_skills/dataset/contextasr-bench/coarse/__init__.py
- nemo_skills/dataset/contextasr-bench/contextasr_score.py
- nemo_skills/dataset/contextasr-bench/contextless/__init__.py
- nemo_skills/dataset/contextasr-bench/fine/__init__.py
- nemo_skills/dataset/contextasr-bench/prepare.py
- nemo_skills/evaluation/evaluator/contextasr.py
- nemo_skills/evaluation/metrics/contextasr_metrics.py
- nemo_skills/evaluation/metrics/map_metrics.py
✅ Files skipped from review due to trivial changes (3)
- nemo_skills/dataset/contextasr-bench/contextless/__init__.py
- nemo_skills/dataset/contextasr-bench/fine/__init__.py
- nemo_skills/dataset/contextasr-bench/coarse/__init__.py
🚧 Files skipped from review as they are similar to previous changes (3)
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/dataset/contextasr-bench/contextasr_score.py
- nemo_skills/dataset/contextasr-bench/prepare.py
```python
def _get_score_dict(self, prediction):
    """Extract the binary correctness score from a prediction (WER < 0.5)."""
    return {"is_correct": prediction.get("is_correct", False)}
```
Return the standard score key here.
`get_metrics()` only derives success_rate from `correct` on Line 82, so returning `is_correct` here leaves the primary success metric unset. Also, defaulting missing fields to False will silently skew results if the evaluator output schema ever regresses.
Suggested fix

```diff
 def _get_score_dict(self, prediction):
     """Extract the binary correctness score from a prediction (WER < 0.5)."""
-    return {"is_correct": prediction.get("is_correct", False)}
+    return {"correct": prediction["is_correct"]}
```

As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/evaluation/metrics/contextasr_metrics.py` around lines 43 - 45,
The _get_score_dict function should return the standard "correct" key and not
silently default missing fields; change the return to use direct indexing (e.g.,
prediction["correct"]) so a missing field raises an error instead of returning
False, ensuring get_metrics() can derive success_rate from "correct" and
failures in evaluator schema surface immediately.
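A toy contrast of the two access styles the guideline refers to (the dict here is a stand-in, not the real prediction schema):

```python
prediction = {"is_correct": True}  # toy stand-in for an evaluator output

# Direct indexing fails loudly when a field is absent:
try:
    value = prediction["correct"]  # not present in this toy dict
except KeyError as err:
    print(f"schema error surfaced immediately: {err}")

# .get() with a default silently reports a wrong score instead:
value = prediction.get("correct", False)
print(value)  # False, even though the sample was actually correct
```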
```python
def update(self, predictions):
    """Accumulate per-sample error counts for corpus-level metric computation."""
    super().update(predictions)

    predicted_answers = [pred.get("generation", "").strip() or None for pred in predictions]

    for pred in predictions:
        self.wer_total_errors += pred.get("wer_errors", 0)
        self.wer_total_ref_words += pred.get("wer_ref_words", 0)
        self.ne_wer_total_errors += pred.get("ne_wer_errors", 0)
        self.ne_wer_total_ref_words += pred.get("ne_wer_ref_words", 0)
        self.ne_fnr_total_hits += pred.get("ne_fnr_hits", 0)
        self.ne_fnr_total_entities += pred.get("ne_fnr_total", 0)

    self._compute_pass_at_k(predictions=predictions, predicted_answers=predicted_answers)
    self._compute_majority_at_k(predictions=predictions, predicted_answers=predicted_answers)
```
Fail fast on max_k > 1 until corpus WER is aggregated per selection mode.
Right now these accumulators add counts from every generation, while get_metrics() exposes one corpus wer/ne_wer/ne_fnr value for each eval mode. With multiple generations, those numbers no longer correspond to pass@k or majority@k; they’re just averages over all raw candidates.
Minimal guard if multi-generation ASR scoring is not intended yet

```diff
 def __init__(self, compute_no_answer: bool = True, max_k: int = 1):
     """Initialize accumulators for corpus-level WER, NE-WER, and NE-FNR."""
     super().__init__(compute_no_answer=compute_no_answer)
+    if max_k != 1:
+        raise ValueError("ContextASRMetrics currently supports max_k=1 only.")
     self.max_k = max_k
```

If you do want max_k > 1, the WER/NE metrics need to be recomputed from the hypothesis chosen by each aggregation mode instead of summing all entries in predictions.
As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/evaluation/metrics/contextasr_metrics.py` around lines 60 - 75,
The update method currently sums per-generation WER/NE accumulators
(wer_total_errors, wer_total_ref_words, ne_wer_total_errors,
ne_wer_total_ref_words, ne_fnr_total_hits, ne_fnr_total_entities) across all
generations, which is incorrect when max_k > 1; either fail fast for unsupported
multi-generation aggregation or aggregate WER/NE from the selected hypothesis
per aggregation mode. Add a guard at the start of update that checks self.max_k
(or the equivalent attribute) and raises an error if > 1 to enforce
single-generation scoring, OR change the logic: call _compute_pass_at_k and
_compute_majority_at_k first to determine the chosen hypothesis per sample, then
compute and accumulate WER/NE counts only for those chosen hypotheses (instead
of summing every pred entry), updating wer_total_errors etc. accordingly;
reference update, _compute_pass_at_k, _compute_majority_at_k, and the wer_* /
ne_* accumulator names when making the change.
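One illustrative shape for the "aggregate from the chosen hypothesis" option above: pick a single candidate per sample first, then accumulate corpus WER counts only from that choice. The function name, dict keys, and the pass@k-style lowest-WER selection are assumptions for the sketch, not the repository's API.

```python
def accumulate_selected(samples):
    """Accumulate corpus WER counts from one chosen hypothesis per sample.

    `samples` is a list of samples; each sample is a list of candidate
    generations carrying 'wer_errors' and 'wer_ref_words'. The selection
    here is pass@k-style (lowest per-sample WER); a majority@k variant
    would choose differently but accumulate the same way.
    """
    totals = {"wer_errors": 0, "wer_ref_words": 0}
    for generations in samples:
        best = min(generations, key=lambda g: g["wer_errors"] / max(g["wer_ref_words"], 1))
        totals["wer_errors"] += best["wer_errors"]
        totals["wer_ref_words"] += best["wer_ref_words"]
    return totals
```

With this shape, corpus WER (`wer_errors / wer_ref_words` over the totals) stays well-defined for max_k > 1 because exactly one hypothesis per sample contributes.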
Summary
- Adds the ContextASR-Bench benchmark with three evaluation modes: `contextless`, `coarse` (domain context), and `fine` (domain + entity list context), evaluating how contextual information improves ASR transcription quality.
- Adds an evaluator (`ContextASREvaluator`) and metrics class (`ContextASRMetrics`) that compute WER, Named Entity WER (NE-WER via fuzzy matching), and Named Entity False Negative Rate (NE-FNR) following the paper's evaluation methodology.
- The `prepare.py` script is self-contained: it auto-downloads ~22 GB of audio data from HuggingFace when not already present, with `--data_dir` serving as both "use existing data" and "download here".

New files
- nemo_skills/dataset/contextasr-bench/__init__.py
- nemo_skills/dataset/contextasr-bench/prepare.py
- nemo_skills/dataset/contextasr-bench/contextasr_score.py
- nemo_skills/dataset/contextasr-bench/{contextless,coarse,fine}/__init__.py
- nemo_skills/evaluation/evaluator/contextasr.py
- nemo_skills/evaluation/metrics/contextasr_metrics.py

Modified files
- nemo_skills/evaluation/evaluator/__init__.py: register the `contextasr` evaluator
- nemo_skills/evaluation/metrics/map_metrics.py: register the `contextasr` metrics
- core/requirements.txt: add the `contractions` dependency
- docs/evaluation/speech-audio.md
- tests/test_datasets.py

Verified against
Qwen3-Omni baseline (ad-hoc evaluation): exact match on WER, NE-WER, and NE-FNR across all three modes.
Test plan
- `pytest tests/test_datasets.py` passes (benchmark is discoverable)
- `ns prepare_data contextasr-bench --data_dir=<path>` works with pre-downloaded data
- `ns prepare_data contextasr-bench --data_dir=<empty_dir>` downloads data to that directory

Summary by CodeRabbit
New Features
Documentation
Tests
Chores