
fix(omni): GLM-Image noise in dynamo disaggregated path #8679

Draft

ptarasiewiczNV wants to merge 3 commits into main from ptarasiewicz/glm-image-dynamo-fix

Conversation

@ptarasiewiczNV (Contributor) commented Apr 24, 2026

Summary

Three commits fix GLM-Image in dynamo's disaggregated omni path:

  • refactor(omni) — colocate glm_image.yaml under examples/backends/vllm/launch/stage_configs/ (matching single_stage_llm.yaml, qwen2_5_omni_pd.yaml) instead of reaching into the installed vllm_omni package. Removes a leaky abstraction that breaks the launch script against vllm-omni builds that don't ship that specific yaml.
  • fix(omni) — attach mm_processor_kwargs={target_h, target_w} to the stage-0 OmniTextPrompt for image/video generation requests, so OmniInputPreprocessor._process_text routes through the HF multimodal processor path. Without this, AR-based image-gen models (e.g. GLM-Image) never emit their image-generation scaffold and the DiT stage denoises a collapsed token stream into textured noise.
  • fix(omni) — place target h/w on the original_prompt dict (both as mm_processor_kwargs for the post-#3034 ar2diffusion and as top-level height/width for the 0.19.0rc1 ar2diffusion shipped in the dynamo runtime). Stage processors read this to upsample AR-generated prior tokens; without it they fall back to the 1024x1024 default, decoupling output size from the requested size.

Evidence

zai-org/GLM-Image, prompt "a red apple on a white table", 2×A6000:

Default size (1024×1024) — MD5-identical output across the native and fixed disaggregated paths:

| Path | Setup | Output MD5 |
| --- | --- | --- |
| dynamo disagg (before fix, 3 runs) | same container, same mount stack | 36421ed1d1cfb07499fd166141f7998c — red striated noise |
| vllm-omni serve --omni (native) | same container, same local vllm-omni install | fa91343423d032e053327eb6047459b4 — coherent apple |
| dynamo disagg (after fix) | byte-identical to the native path | fa91343423d032e053327eb6047459b4 — coherent apple |

Non-default size (512×512):

| Path | Before first fix | After commit 2 (route image-gen through multimodal processor) | After commit 3 (pass target h/w to stage processor) |
| --- | --- | --- | --- |
| dynamo disagg | noise | RuntimeError at glm_image_transformer.py:883 (tensor a (1024) vs tensor b (4096) — DiT at requested 512 but AR prior upsampled to 1024-scale) | ✅ coherent apple, ~60s |

Root cause

The dynamo disagg path is a set of separate worker processes glued together by a custom router. It bypasses vllm-omni's OpenAI chat entrypoint entirely (it goes through dynamo.vllm.omni.stage_router → dynamo.vllm.omni.stage_worker → AsyncOmni directly), so the upstream vllm-omni#3034 fix doesn't reach it. Two separate pieces of size metadata have to make it across the stage boundary in the dynamo path, and neither did:

  1. Stage 0 (AR) — needs mm_processor_kwargs={target_h, target_w} on the engine prompt so OmniInputPreprocessor._process_text takes the multimodal branch and the HF processor emits GLM-Image's scaffold (<|image|>PROMPT<sop>H W<eop><sop>h w<eop><|dit_token_N|>). Without the scaffold, AR produces a handful of repeated VQ codes and DiT denoises them into noise (the prompt shape is sketched after this list).
  2. Stage 1 (DiT via the ar2diffusion custom processor) — needs the target size on original_prompt (as mm_processor_kwargs["target_h"/"target_w"] on post-#3034 vllm-omni, or top-level height/width on 0.19.0rc1) so it slices and upsamples the AR prior token grid to the right latent shape. Without it, it defaults to 1024×1024 and produces a 64×64 prior regardless of the requested size, which then mismatches the DiT hidden-state shape for any non-1024 latent.
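
A minimal sketch of the stage-0 fix, using the names from the description above (the exact construction in dynamo's parse_omni_request may differ; request_text, height, and width are placeholder inputs):

```python
# Placeholder request values for illustration.
request_text = "a red apple on a white table"
height, width = 512, 512

# Stage-0 engine prompt: the non-empty mm_processor_kwargs is what makes
# OmniInputPreprocessor._process_text take the multimodal branch, so the
# HF processor can emit GLM-Image's image-generation scaffold.
stage0_prompt = {
    "prompt": request_text,
    "mm_processor_kwargs": {"target_h": height, "target_w": width},
}
```

The stage-1 counterpart (the original_prompt dual-write) is sketched under the third commit message below.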

Scope

Minimal: two files touched (utils.py for the preprocessing plumbing, disagg_omni_glm_image.sh for the colocation refactor) plus the new colocated yaml. Image/video-generation paths only; the chat / text / audio branches of parse_omni_request are untouched. Models whose HF processor and stage processor ignore target_h/target_w and height/width are unaffected.

Test plan

  • Manual: disagg_omni_glm_image.sh + /v1/images/generations at 1024x1024 (request shape sketched after this list) — output MD5-identical to vllm-omni serve.
  • Manual: same at 512x512 — coherent image (was RuntimeError before).
  • Regression: 1024x1024 re-tested after the 512 fix — still MD5-identical to the native baseline.
  • Regression (reviewer): qwen omni agg / single-stage paths — confirm the chat/text branch of parse_omni_request and the audio handler are unchanged (not touched in the diff).
  • CI: pre-commit hooks pass locally (black / ruff / codespell / yaml / shebangs all clean across all three commits).
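
For reference, a sketch of the manual request used above. The endpoint path and model name come from this PR; the port, payload fields, and size string are assumptions based on the OpenAI-style images API, so adjust them to whatever frontend the launch script actually binds:

```python
import requests

# Assumed frontend address; the disagg launch script may bind elsewhere.
resp = requests.post(
    "http://localhost:8000/v1/images/generations",
    json={
        "model": "zai-org/GLM-Image",
        "prompt": "a red apple on a white table",
        "size": "512x512",  # also exercised at the 1024x1024 default
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json())
```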

🤖 Generated with Claude Code

ptarasiewiczNV and others added 2 commits April 24, 2026 16:17
disagg_omni_glm_image.sh resolved its STAGE_CONFIG from
vllm_omni/model_executor/stage_configs/glm_image.yaml inside the
installed vllm-omni package — a leaky abstraction that made the script
fragile to vllm-omni version drift (local vllm-omni branches without
that particular file break the script).

Move the yaml under examples/backends/vllm/launch/stage_configs/, matching
the pattern already used by agg_omni.sh (single_stage_llm.yaml) and the
PD disagg launch scripts (qwen2_5_omni_pd.yaml). Default STAGE_CONFIG to
the colocated path so the script works with any vllm-omni build.

No behavior change — the yaml content is copied verbatim from vllm-omni
0.19.0rc1; the launch script produces MD5-identical output to the
previous vllm_omni-package-resolved path.

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
GLM-Image served via dynamo's disaggregated path (/v1/images/generations
→ stage_router → AR → DiT) produced noisy / striated images: the AR
stage never entered image-generation mode and emitted a handful of
repeated VQ codes which the DiT denoised into incoherent textures.

Root cause mirrors vllm-omni issue #3034 on the standalone serving-chat
path: OmniInputPreprocessor._process_text only routes through the
multimodal processor when the prompt carries mm_processor_kwargs.
Dynamo's parse_omni_request built the stage-0 OmniTextPrompt with just
{prompt}, so the preprocessor fell back to plain _tokenize_prompt,
skipping the HF processor that would otherwise emit GLM-Image's
image-generation scaffold.

Fix: attach mm_processor_kwargs={target_h, target_w} to the stage-0
OmniTextPrompt for IMAGE_GENERATION / VIDEO_GENERATION requests. The
non-empty dict triggers the multimodal processor path; target_h/target_w
feed the HF processor so it can size the scaffold. Models whose HF
processor ignores these kwargs are unaffected.
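
An illustrative reconstruction of the dispatch described above (not vllm-omni's actual code; the stubs stand in for the real HF processor and tokenizer):

```python
def _run_hf_processor(prompt: dict) -> list[int]:
    """Stub: HF multimodal processor; emits the image-generation
    scaffold, sized by target_h / target_w."""
    ...

def _tokenize_prompt(text: str) -> list[int]:
    """Stub: plain tokenization; no scaffold is emitted."""
    ...

def _process_text(prompt: dict) -> list[int]:
    # Only a truthy mm_processor_kwargs takes the multimodal branch;
    # dynamo's old {"prompt": ...}-only dict always fell through to
    # plain tokenization.
    if prompt.get("mm_processor_kwargs"):
        return _run_hf_processor(prompt)
    return _tokenize_prompt(prompt["prompt"])
```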

Verified end-to-end with zai-org/GLM-Image at the default 1024x1024:
dynamo disagg now produces output that is MD5-identical to
`vllm-omni serve zai-org/GLM-Image --omni` for the same prompt/seed.
Non-default sizes (e.g. 512x512) expose a separate DiT / AR scaffold
sizing mismatch that is out of scope for this bugfix and will be
tracked as follow-up work.

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
@github-actions bot added the fix, backend::vllm, and multimodal labels Apr 24, 2026
The previous commit made GLM-Image produce a coherent image at the
default 1024x1024, but non-default sizes (e.g. 512x512) still failed
with a DiT tensor-dim mismatch (AR-scale prior upsampled to 64x64 while
DiT ran at the requested 32x32 latent).

Root cause: build_original_prompt dropped the height/width arguments
instead of placing them on the prompt dict. Stage processors like
GLM-Image's ar2diffusion look up the target size from original_prompt
to slice and upsample AR-generated prior tokens; with the fields
missing it fell through to a 1024x1024 default regardless of what the
request asked for. After the size was locked at 1024, the prior tensor
shape never matched the DiT latent shape at other sizes.

Fix: put target h/w into the prompt dict. Write both

- mm_processor_kwargs={target_h, target_w} — the shape the
  post-#3034 ar2diffusion reads
- top-level height/width — the shape the dynamo runtime's bundled
  vllm-omni 0.19.0rc1 ar2diffusion reads

so the fix works across vllm-omni versions without needing to pin
a specific downstream release.
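
A minimal sketch of the dual-write (the real build_original_prompt in dynamo's utils.py may differ in signature; this shows only the size plumbing):

```python
def build_original_prompt(text: str, height: int, width: int) -> dict:
    # Write the target size in both layouts so either vllm-omni
    # ar2diffusion version can find it on original_prompt.
    return {
        "prompt": text,
        # Read by the post-#3034 ar2diffusion stage processor:
        "mm_processor_kwargs": {"target_h": height, "target_w": width},
        # Read by the ar2diffusion bundled with vllm-omni 0.19.0rc1:
        "height": height,
        "width": width,
    }
```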

Verified on 2xA6000 with zai-org/GLM-Image:
- 1024x1024: still MD5-identical to `vllm-omni serve --omni` on the
  same container (no regression on the default-size path)
- 512x512: now produces a coherent image in ~60s; previously failed
  with RuntimeError at glm_image_transformer.py:883
  (tensor a (1024) vs tensor b (4096) on the hidden_states +
  prior_hidden_states add)

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
@ayushag-nv (Contributor) commented:

Thanks for working on this
