Skip to content

refactor(dataset): extract MultimodalContent base for SingleTurn/RandomPool#876

Draft
furionw wants to merge 1 commit intomainfrom
qiwa/multimodal_content
Draft

refactor(dataset): extract MultimodalContent base for SingleTurn/RandomPool#876
furionw wants to merge 1 commit intomainfrom
qiwa/multimodal_content

Conversation

@furionw
Copy link
Copy Markdown
Contributor

@furionw furionw commented May 2, 2026

Summary

  • Extract MultimodalContent base class in dataset/loader/models.py holding the 8-field modality surface (text/texts, image/images, audio/audios, video/videos) plus the two shared validators (mutually-exclusive pairs and at-least-one-modality).
  • SingleTurn and RandomPool inherit from it; each retains its type discriminator and class-specific fields/validators (SingleTurn keeps timestamp/delay/role/session_id/output_length and a SingleTurn-only validate_timestamp_delay_exclusive validator).
  • No behavior change; all existing tests pass unchanged.
  • Sets up future modality additions to touch one class instead of two.

Picked option (A) for the timestamp/delay rule — keeping it as a separate validate_timestamp_delay_exclusive validator on SingleTurn rather than overriding the base validator. Pydantic v2 runs @model_validator(mode="after") validators independently, so the behavior is identical, the base validator's name stays truthful, and there's no super().validate_mutually_exclusive_fields() plumbing.

Test plan

  • `uv run pytest tests/unit/dataset/loader/ tests/unit/dataset/ -n auto` — green (1284 passed, identical to baseline)
  • `uv run pytest tests/unit/ -n auto` — same 8740 passed / 18 failed / 8 skipped / 20 errors as the unmodified branch tip; pre-existing flakes (timing/ZMQ/loop scheduler) unrelated to this refactor
  • `pre-commit run --files src/aiperf/dataset/loader/models.py` — all hooks pass; the only failure (`test-imports` flagging missing `trustme`) is pre-existing and reproduces on `main`

Summary by CodeRabbit

  • Refactor
    • Consolidated multimodal validation logic across dataset entry types to reduce code duplication and improve system maintainability while preserving all existing functionality.

…omPool

SingleTurn and RandomPool shared identical 8-field multimodal surface
(text/texts/image/images/audio/audios/video/videos) and two identical
validators (mutually-exclusive pairs, at-least-one-modality). Pull both
into a `MultimodalContent` base; the entry classes inherit and add
their type-specific fields on top.

Picked option (A): the base validator covers only the 4 modality pairs,
and SingleTurn adds a separate `validate_timestamp_delay_exclusive`
validator for its own timestamp/delay exclusivity rule. Pydantic v2
runs validators independently, so behavior is preserved without
inheritance gymnastics, and the base validator's name stays truthful
(it really does only check modality pairs).

No behavior change. All existing tests pass unchanged. Future modality
additions touch one class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Qi Wang <qiwa@nvidia.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 2, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@5b6911719e3f0d0a5a4ec2a10352d788f1a99ff0

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@5b6911719e3f0d0a5a4ec2a10352d788f1a99ff0

Last updated for commit: 5b69117Browse code

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 2, 2026

Walkthrough

A new MultimodalContent base model consolidates shared multimodal fields (text/texts, image/images, audio/audios, video/videos) and validation logic. SingleTurn and RandomPool now inherit from this base, eliminating duplicate field declarations and validators while retaining their specific logic.

Changes

Multimodal Base Model Inheritance Refactoring

Layer / File(s) Summary
Base Model Definition
src/aiperf/dataset/loader/models.py (lines 4–77)
New MultimodalContent class introduced with shared multimodal fields (text, texts, image, images, audio, audios, video, videos) and two validators: validate_mutually_exclusive_fields and validate_at_least_one_modality. Import of Self from typing_extensions added.
Model Inheritance Updates
src/aiperf/dataset/loader/models.py (lines 79–158)
SingleTurn and RandomPool refactored to inherit from MultimodalContent. Duplicate modality fields and shared validators removed from both classes. SingleTurn retains validate_timestamp_delay_exclusive; RandomPool retains its dataset-type discriminator.

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

Hopping through code, I spotted the strain—
Fields repeated again and again.
One base to rule them, with validators true,
Now SingleTurn and RandomPool shine like new! 🐰✨
Duplication vanquished, inheritance reigns supreme.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely describes the main refactoring: extracting a new MultimodalContent base class used by SingleTurn and RandomPool to eliminate duplication.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 60 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/aiperf/dataset/loader/models.py (1)

13-76: 💤 Low value

Clean extraction — MultimodalContent base is well-structured.

All modality fields carry Field(description="..."), both validators are correct, and the -> Self return types replace the old forward-reference string literals cleanly. The execution order (exclusivity check before at-least-one check) is preserved by Pydantic v2's definition-order guarantee for mode="after" validators.

One optional note: MultimodalContent is directly instantiable. Since its docstring already calls out that concrete classes must add their own type discriminator, and it is not present in the CustomDatasetT TypeVar bound, the misuse risk is low. If you want stronger enforcement you can add a @model_validator(mode="after") guard or mark it with an abstract type field — but it's not necessary here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/loader/models.py` around lines 13 - 76, MultimodalContent
can be instantiated directly; add a post-model validator on MultimodalContent
(e.g., def validate_has_type_discriminator(self) decorated with
`@model_validator`(mode="after")) that checks for a concrete type discriminator
(for example ensure getattr(self, "type", None) is not None or that
self.__class__ is not MultimodalContent) and raise a ValueError if missing, so
concrete subclasses must provide a `type` field or the base class cannot be used
directly; implement the validator in the MultimodalContent class (name it like
validate_has_type_discriminator) to enforce this.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/aiperf/dataset/loader/models.py`:
- Around line 113-118: The exclusivity check in
validate_timestamp_delay_exclusive currently uses truthiness (if self.timestamp
and self.delay) which incorrectly treats 0/0.0 as False; change the condition to
explicitly test presence with "is not None" (e.g., if self.timestamp is not None
and self.delay is not None) so timestamp=0 or delay=0 are correctly considered
set, and keep the rest of the method (raising ValueError and returning self)
unchanged.

---

Nitpick comments:
In `@src/aiperf/dataset/loader/models.py`:
- Around line 13-76: MultimodalContent can be instantiated directly; add a
post-model validator on MultimodalContent (e.g., def
validate_has_type_discriminator(self) decorated with
`@model_validator`(mode="after")) that checks for a concrete type discriminator
(for example ensure getattr(self, "type", None) is not None or that
self.__class__ is not MultimodalContent) and raise a ValueError if missing, so
concrete subclasses must provide a `type` field or the base class cannot be used
directly; implement the validator in the MultimodalContent class (name it like
validate_has_type_discriminator) to enforce this.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 574ea801-bd9a-4f42-889c-435565af6ba7

📥 Commits

Reviewing files that changed from the base of the PR and between 8763c57 and 5b69117.

📒 Files selected for processing (1)
  • src/aiperf/dataset/loader/models.py

Comment on lines +113 to +118
@model_validator(mode="after")
def validate_timestamp_delay_exclusive(self) -> Self:
"""Ensure timestamp and delay are not set together"""
if self.timestamp and self.delay:
raise ValueError("timestamp and delay cannot be set together")
return self
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Pre-existing: timestamp=0 / delay=0 bypass the exclusivity check.

if self.timestamp and self.delay: evaluates to False when either value is 0 or 0.0. A payload with {"timestamp": 0, "delay": 5, "text": "..."} would pass validation silently. If timestamp=0 (start of trace) is a valid business value, consider using is not None checks instead.

🛡️ Suggested fix
-        if self.timestamp and self.delay:
+        if self.timestamp is not None and self.delay is not None:

This behaviour predates this PR; flagging here as it becomes visible in review of the changed lines.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/loader/models.py` around lines 113 - 118, The exclusivity
check in validate_timestamp_delay_exclusive currently uses truthiness (if
self.timestamp and self.delay) which incorrectly treats 0/0.0 as False; change
the condition to explicitly test presence with "is not None" (e.g., if
self.timestamp is not None and self.delay is not None) so timestamp=0 or delay=0
are correctly considered set, and keep the rest of the method (raising
ValueError and returning self) unchanged.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 2, 2026

Codecov Report

❌ Patch coverage is 88.23529% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/aiperf/dataset/loader/models.py 88.23% 1 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant