fix(sarm): handle BaseModelOutputWithPooling from transformers 5.x in…#3419

Open
masato-ka wants to merge 3 commits into huggingface:main from masato-ka:fix/sarm-adapt-to-transformer5.x

Conversation

@masato-ka
Collaborator

Summary / Motivation

In transformers 5.x, CLIPModel.get_image_features() and get_text_features()
return a BaseModelOutputWithPooling object instead of a plain torch.FloatTensor.
This caused an AttributeError: 'BaseModelOutputWithPooling' object has no attribute 'detach' when running SARM policies with transformers 5.x.

This PR adds an isinstance check in SARMEncodingProcessorStep to extract
pooler_output when the return value is not a plain tensor, maintaining full
backward compatibility with transformers 4.x.
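
The guard described above can be sketched in isolation. This is a minimal illustration, not the code from processor_sarm.py: `FakeTensor`, `FakeOutputWithPooling`, and `unwrap_features` are hypothetical stand-ins so the snippet runs without torch or transformers installed. In the real code the type checked against is torch.Tensor.

```python
from dataclasses import dataclass
from typing import Any


class FakeTensor:
    """Hypothetical stand-in for torch.Tensor (only .detach() is modeled)."""

    def __init__(self, values):
        self.values = list(values)

    def detach(self):
        return FakeTensor(self.values)


@dataclass
class FakeOutputWithPooling:
    """Hypothetical stand-in for BaseModelOutputWithPooling (one field only)."""

    pooler_output: Any = None


def unwrap_features(output, tensor_type=FakeTensor):
    """Return the tensor whether the encoder gave a tensor or a wrapper object."""
    if not isinstance(output, tensor_type):
        # 5.x-style return: the pooled features live on the wrapper.
        output = output.pooler_output
    return output


old_style = FakeTensor([0.1, 0.2])                         # 4.x-style return
new_style = FakeOutputWithPooling(FakeTensor([0.1, 0.2]))  # 5.x-style return

# Both paths end in an object that supports .detach(), which is the call
# that raised AttributeError on the wrapper before the guard.
features_old = unwrap_features(old_style).detach()
features_new = unwrap_features(new_style).detach()
```

Checking for "is a tensor" rather than "is a BaseModelOutputWithPooling" keeps the 4.x path untouched and avoids importing the output class.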

Related issues

What changed

  • src/lerobot/policies/sarm/processor_sarm.py: In both _encode_images() and
    _encode_text(), added a guard to unwrap BaseModelOutputWithPooling.pooler_output
    when get_image_features() / get_text_features() does not return a plain
    torch.Tensor. No behavioral change under transformers 4.x.

How was this tested (or how to run locally)

  • Manual verification: reproduced the AttributeError with transformers 5.x,
    confirmed it is resolved after this fix.
  • No dedicated unit test added (SARM tests require hardware/large model downloads);
    the fix is a two-line guard and straightforward to inspect.
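
If a test were wanted later, one possible approach (an assumption, not something this PR includes) is to stub the encoder so no model download is needed. In the sketch below, `StubTensor`, `encode_images`, and `make_stub_model` are hypothetical names; `encode_images` only mirrors the shape of the guarded call and is not the actual SARM method.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


class StubTensor:
    """Hypothetical stand-in for torch.Tensor; no torch or model weights needed."""


def encode_images(clip_model, inputs, tensor_type=StubTensor):
    """Illustrative copy of the guarded call, not the actual SARM method."""
    output = clip_model.get_image_features(**inputs)
    if not isinstance(output, tensor_type):
        output = output.pooler_output
    return output


def make_stub_model(return_value):
    """Build a mock encoder whose get_image_features returns a fixed value."""
    model = MagicMock()
    model.get_image_features.return_value = return_value
    return model


expected = StubTensor()

# 4.x-style: the encoder hands back the tensor directly.
plain = encode_images(make_stub_model(expected), {"pixel_values": "dummy"})

# 5.x-style: the encoder hands back a wrapper exposing pooler_output.
wrapped_model = make_stub_model(SimpleNamespace(pooler_output=expected))
wrapped = encode_images(wrapped_model, {"pixel_values": "dummy"})

assert plain is expected
assert wrapped is expected
wrapped_model.get_image_features.assert_called_once_with(pixel_values="dummy")
```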

To reproduce the original error:

pip install "transformers>=5.0"
# Run any SARM encode step that calls get_image_features / get_text_features        

Checklist (required before merge)

  • Linting/formatting run (pre-commit run -a)
  • All tests pass locally (pytest)
  • [ ] Documentation updated (not needed)
  • [ ] CI is green
  • [ ] Community Review: I have reviewed another contributor's open PR and linked it
    here: # (insert PR number/link)

Reviewer notes

  • The only changed file is src/lerobot/policies/sarm/processor_sarm.py, two
    symmetric hunks (one for image, one for text encoding).
  • The fix relies on pooler_output, the attribute that BaseModelOutputWithPooling
    uses for pooled features. This PR treats it as equivalent to what the old
    plain-tensor return contained.
  • Anyone in the community is free to review the PR.

… CLIP encoding

In transformers 5.x, CLIPModel.get_image_features() and get_text_features()
return BaseModelOutputWithPooling instead of a plain torch.FloatTensor.
Added isinstance check to extract pooler_output when the return value is not
a tensor, maintaining backward compatibility with transformers 4.x.

Fixes AttributeError: 'BaseModelOutputWithPooling' object has no attribute 'detach'

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the policies (Items related to robot policies) label on Apr 20, 2026
@pkooij pkooij self-requested a review April 20, 2026 17:24
# transformers 5.x returns BaseModelOutputWithPooling instead of a plain tensor
output = self.clip_model.get_image_features(**inputs)
if not isinstance(output, torch.Tensor):
    output = output.pooler_output

nit: maybe we should assert output is not None to please mypy.
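
One way to read this suggestion, sketched under assumptions: if pooler_output is typed as optional on the output class, an assert narrows it to a non-optional type for mypy and fails loudly at runtime. `FakeOutputWithPooling` and `pooled_or_fail` are hypothetical names, and a plain list stands in for the tensor so the snippet runs without torch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeOutputWithPooling:
    # Hypothetical stand-in: the field is optional, as the nit implies.
    pooler_output: Optional[list] = None


def pooled_or_fail(output: FakeOutputWithPooling) -> list:
    """Return pooler_output, refusing to pass None downstream."""
    pooled = output.pooler_output
    # Narrows Optional[list] to list for the type checker and raises
    # AssertionError at runtime if the encoder produced no pooled features.
    assert pooled is not None, "encoder output has no pooler_output"
    return pooled


features = pooled_or_fail(FakeOutputWithPooling([1.0, 2.0]))
```

Note that assert statements are skipped when Python runs with -O, so an explicit `if pooled is None: raise ValueError(...)` would be the stricter variant.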

Labels

policies Items related to robot policies

Development

Successfully merging this pull request may close these issues.

[Bug] SARM training fails with AttributeError: 'BaseModelOutputWithPooling' object has no attribute 'detach' on transformers 5.x

3 participants