
fix(omni): GLM-Image noise in dynamo disaggregated path #8679

Draft

ptarasiewiczNV wants to merge 3 commits into main from ptarasiewicz/glm-image-dynamo-fix

Conversation

@ptarasiewiczNV (Contributor) commented Apr 24, 2026

Summary

Three commits fix GLM-Image in dynamo's disaggregated omni path:

  • refactor(omni) — colocate glm_image.yaml under examples/backends/vllm/launch/stage_configs/ (matching single_stage_llm.yaml, qwen2_5_omni_pd.yaml) instead of reaching into the installed vllm_omni package. Removes a leaky abstraction that breaks the launch script against vllm-omni builds that don't ship that specific yaml.
  • fix(omni) — attach mm_processor_kwargs={target_h, target_w} to the stage-0 OmniTextPrompt for image/video generation requests, so OmniInputPreprocessor._process_text routes through the HF multimodal processor path. Without this, AR-based image-gen models (e.g. GLM-Image) never emit their image-generation scaffold and the DiT stage denoises a collapsed token stream into textured noise.
  • fix(omni) — place target h/w on the original_prompt dict (both as mm_processor_kwargs for the post-#3034 ar2diffusion and as top-level height/width for the 0.19.0rc1 ar2diffusion shipped in the dynamo runtime). Stage processors read this to upsample AR-generated prior tokens; without it they fall back to the 1024x1024 default, decoupling output size from the requested size.

Evidence

zai-org/GLM-Image, prompt "a red apple on a white table", 2×A6000:

Default size (1024×1024) — MD5-identical output across the native and fixed disaggregated paths:

| Path | Setup | Output MD5 |
| --- | --- | --- |
| dynamo disagg (before fix, 3 runs) | same container, same mount stack | 36421ed1d1cfb07499fd166141f7998c — red striated noise |
| vllm-omni serve --omni (native) | same container, same local vllm-omni install | fa91343423d032e053327eb6047459b4 — coherent apple |
| dynamo disagg (after fix) | byte-identical to the native path | fa91343423d032e053327eb6047459b4 — coherent apple |

Non-default size (512×512):

| Path | Before first fix | After commit 2 (route image-gen through multimodal processor) | After commit 3 (pass target h/w to stage processor) |
| --- | --- | --- | --- |
| dynamo disagg | noise | RuntimeError at glm_image_transformer.py:883 (tensor a (1024) vs tensor b (4096) — DiT at requested 512 but AR prior upsampled to 1024-scale) | ✅ coherent apple, ~60s |

Root cause

The dynamo disagg path is a set of separate worker processes glued together by a custom router. It bypasses vllm-omni's OpenAI chat entrypoint entirely (it goes through dynamo.vllm.omni.stage_router → dynamo.vllm.omni.stage_worker → AsyncOmni directly), so the upstream vllm-omni#3034 fix doesn't reach it. Two separate pieces of size metadata have to make it across the stage boundary in the dynamo path, and neither did:

  1. Stage 0 (AR) — needs mm_processor_kwargs={target_h, target_w} on the engine prompt so OmniInputPreprocessor._process_text takes the multimodal branch and the HF processor emits GLM-Image's scaffold (<|image|>PROMPT<sop>H W<eop><sop>h w<eop><|dit_token_N|>). Without the scaffold, AR produces a handful of repeated VQ codes and DiT denoises them into noise (the prompt shape is sketched after this list).
  2. Stage 1 (DiT via the ar2diffusion custom processor) — needs the target size on original_prompt (as mm_processor_kwargs["target_h"/"target_w"] on post-#3034 vllm-omni, or top-level height/width on 0.19.0rc1) so it slices and upsamples the AR prior token grid to the right latent shape. Without it, it defaults to 1024×1024 and produces a 64×64 prior regardless of the requested size, which then mismatches the DiT hidden-state shape for any non-1024 latent.
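
A minimal sketch of the stage-0 fix, using the names from the description above (the exact construction in dynamo's parse_omni_request may differ; request_text, height, and width are placeholder inputs):

```python
# Placeholder request values for illustration.
request_text = "a red apple on a white table"
height, width = 512, 512

# Stage-0 engine prompt: the non-empty mm_processor_kwargs is what makes
# OmniInputPreprocessor._process_text take the multimodal branch, so the
# HF processor can emit GLM-Image's image-generation scaffold.
stage0_prompt = {
    "prompt": request_text,
    "mm_processor_kwargs": {"target_h": height, "target_w": width},
}
```

The stage-1 counterpart (the original_prompt dual-write) is sketched under the third commit message below.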

Scope

Minimal: two files touched (utils.py for the preprocessing plumbing, disagg_omni_glm_image.sh for the colocation refactor) plus the new colocated yaml. Image/video-generation paths only; the chat / text / audio branches of parse_omni_request are untouched. Models whose HF processor and stage processor ignore target_h/target_w and height/width are unaffected.

Test plan

  • Manual: disagg_omni_glm_image.sh + /v1/images/generations at 1024x1024 (request shape sketched after this list) — output MD5-identical to vllm-omni serve.
  • Manual: same at 512x512 — coherent image (was RuntimeError before).
  • Regression: 1024x1024 re-tested after the 512 fix — still MD5-identical to the native baseline.
  • Regression (reviewer): qwen omni agg / single-stage paths — confirm the chat/text branch of parse_omni_request and the audio handler are unchanged (not touched in the diff).
  • CI: pre-commit hooks pass locally (black / ruff / codespell / yaml / shebangs all clean across all three commits).
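
For reference, a sketch of the manual request used above. The endpoint path and model name come from this PR; the port, payload fields, and size string are assumptions based on the OpenAI-style images API, so adjust them to whatever frontend the launch script actually binds:

```python
import requests

# Assumed frontend address; the disagg launch script may bind elsewhere.
resp = requests.post(
    "http://localhost:8000/v1/images/generations",
    json={
        "model": "zai-org/GLM-Image",
        "prompt": "a red apple on a white table",
        "size": "512x512",  # also exercised at the 1024x1024 default
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json())
```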

🤖 Generated with Claude Code

ptarasiewiczNV and others added 2 commits April 24, 2026 16:17
disagg_omni_glm_image.sh resolved its STAGE_CONFIG from
vllm_omni/model_executor/stage_configs/glm_image.yaml inside the
installed vllm-omni package — a leaky abstraction that made the script
fragile to vllm-omni version drift (local vllm-omni branches without
that particular file break the script).

Move the yaml under examples/backends/vllm/launch/stage_configs/, matching
the pattern already used by agg_omni.sh (single_stage_llm.yaml) and the
PD disagg launch scripts (qwen2_5_omni_pd.yaml). Default STAGE_CONFIG to
the colocated path so the script works with any vllm-omni build.

No behavior change — the yaml content is copied verbatim from vllm-omni
0.19.0rc1; the launch script produces MD5-identical output to the
previous vllm_omni-package-resolved path.

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
GLM-Image served via dynamo's disaggregated path (/v1/images/generations
→ stage_router → AR → DiT) produced noisy / striated images: the AR
stage never entered image-generation mode and emitted a handful of
repeated VQ codes which the DiT denoised into incoherent textures.

Root cause mirrors vllm-omni issue #3034 on the standalone serving-chat
path: OmniInputPreprocessor._process_text only routes through the
multimodal processor when the prompt carries mm_processor_kwargs.
Dynamo's parse_omni_request built the stage-0 OmniTextPrompt with just
{prompt}, so the preprocessor fell back to plain _tokenize_prompt,
skipping the HF processor that would otherwise emit GLM-Image's
image-generation scaffold.

Fix: attach mm_processor_kwargs={target_h, target_w} to the stage-0
OmniTextPrompt for IMAGE_GENERATION / VIDEO_GENERATION requests. The
non-empty dict triggers the multimodal processor path; target_h/target_w
feed the HF processor so it can size the scaffold. Models whose HF
processor ignores these kwargs are unaffected.
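
An illustrative reconstruction of the dispatch described above (not vllm-omni's actual code; the stubs stand in for the real HF processor and tokenizer):

```python
def _run_hf_processor(prompt: dict) -> list[int]:
    """Stub: HF multimodal processor; emits the image-generation
    scaffold, sized by target_h / target_w."""
    ...

def _tokenize_prompt(text: str) -> list[int]:
    """Stub: plain tokenization; no scaffold is emitted."""
    ...

def _process_text(prompt: dict) -> list[int]:
    # Only a truthy mm_processor_kwargs takes the multimodal branch;
    # dynamo's old {"prompt": ...}-only dict always fell through to
    # plain tokenization.
    if prompt.get("mm_processor_kwargs"):
        return _run_hf_processor(prompt)
    return _tokenize_prompt(prompt["prompt"])
```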

Verified end-to-end with zai-org/GLM-Image at the default 1024x1024:
dynamo disagg now produces output that is MD5-identical to
`vllm-omni serve zai-org/GLM-Image --omni` for the same prompt/seed.
Non-default sizes (e.g. 512x512) expose a separate DiT / AR scaffold
sizing mismatch that is out of scope for this bugfix and will be
tracked as follow-up work.

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
@github-actions bot added the fix, backend::vllm, and multimodal labels Apr 24, 2026
The previous commit made GLM-Image produce a coherent image at the
default 1024x1024, but non-default sizes (e.g. 512x512) still failed
with a DiT tensor-dim mismatch (AR-scale prior upsampled to 64x64 while
DiT ran at the requested 32x32 latent).

Root cause: build_original_prompt dropped the height/width arguments
instead of placing them on the prompt dict. Stage processors like
GLM-Image's ar2diffusion look up the target size from original_prompt
to slice and upsample AR-generated prior tokens; with the fields
missing it fell through to a 1024x1024 default regardless of what the
request asked for. After the size was locked at 1024, the prior tensor
shape never matched the DiT latent shape at other sizes.

Fix: put target h/w into the prompt dict. Write both

- mm_processor_kwargs={target_h, target_w} — the shape the
  post-#3034 ar2diffusion reads
- top-level height/width — the shape the dynamo runtime's bundled
  vllm-omni 0.19.0rc1 ar2diffusion reads

so the fix works across vllm-omni versions without needing to pin
a specific downstream release.
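
A minimal sketch of the dual-write (the real build_original_prompt in dynamo's utils.py may differ in signature; this shows only the size plumbing):

```python
def build_original_prompt(text: str, height: int, width: int) -> dict:
    # Write the target size in both layouts so either vllm-omni
    # ar2diffusion version can find it on original_prompt.
    return {
        "prompt": text,
        # Read by the post-#3034 ar2diffusion stage processor:
        "mm_processor_kwargs": {"target_h": height, "target_w": width},
        # Read by the ar2diffusion bundled with vllm-omni 0.19.0rc1:
        "height": height,
        "width": width,
    }
```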

Verified on 2xA6000 with zai-org/GLM-Image:
- 1024x1024: still MD5-identical to `vllm-omni serve --omni` on the
  same container (no regression on the default-size path)
- 512x512: now produces a coherent image in ~60s; previously failed
  with RuntimeError at glm_image_transformer.py:883
  (tensor a (1024) vs tensor b (4096) on the hidden_states +
  prior_hidden_states add)

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
@ayushag-nv (Contributor) commented:

Thanks for working on this
