fix(omni): GLM-Image noise in dynamo disaggregated path #8679
Draft
ptarasiewiczNV wants to merge 3 commits into main
Conversation
disagg_omni_glm_image.sh resolved its STAGE_CONFIG from vllm_omni/model_executor/stage_configs/glm_image.yaml inside the installed vllm-omni package, a leaky abstraction that made the script fragile to vllm-omni version drift: local vllm-omni branches without that particular file break the script.

Move the yaml under examples/backends/vllm/launch/stage_configs/, matching the pattern already used by agg_omni.sh (single_stage_llm.yaml) and the PD disagg launch scripts (qwen2_5_omni_pd.yaml). Default STAGE_CONFIG to the colocated path so the script works with any vllm-omni build.

No behavior change: the yaml content is copied verbatim from vllm-omni 0.19.0rc1, and the launch script produces MD5-identical output to the previous vllm_omni-package-resolved path.

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GLM-Image served via dynamo's disaggregated path (/v1/images/generations → stage_router → AR → DiT) produced noisy, striated images: the AR stage never entered image-generation mode and emitted a handful of repeated VQ codes, which the DiT denoised into incoherent textures.

Root cause mirrors vllm-omni issue #3034 on the standalone serving-chat path: OmniInputPreprocessor._process_text only routes through the multimodal processor when the prompt carries mm_processor_kwargs. Dynamo's parse_omni_request built the stage-0 OmniTextPrompt with just {prompt}, so the preprocessor fell back to plain _tokenize_prompt, skipping the HF processor that would otherwise emit GLM-Image's image-generation scaffold.

Fix: attach mm_processor_kwargs={target_h, target_w} to the stage-0 OmniTextPrompt for IMAGE_GENERATION / VIDEO_GENERATION requests. The non-empty dict triggers the multimodal processor path; target_h/target_w feed the HF processor so it can size the scaffold. Models whose HF processor ignores these kwargs are unaffected.

Verified end-to-end with zai-org/GLM-Image at the default 1024x1024: dynamo disagg now produces output that is MD5-identical to `vllm-omni serve zai-org/GLM-Image --omni` for the same prompt/seed. Non-default sizes (e.g. 512x512) expose a separate DiT / AR scaffold sizing mismatch that is out of scope for this bugfix and will be tracked as follow-up work.

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
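As a rough illustration of the shape of the change, a minimal sketch of the stage-0 prompt construction. OmniTextPrompt below is a stand-in TypedDict, and the helper name and modality constants are hypothetical; the real parse_omni_request plumbing differs:

```python
# Minimal sketch, not the actual dynamo code. OmniTextPrompt here is a
# stand-in TypedDict; the real vllm-omni type and the modality constants
# may differ.
from typing import Any, TypedDict

class OmniTextPrompt(TypedDict, total=False):
    prompt: str
    mm_processor_kwargs: dict[str, Any]

def build_stage0_prompt(prompt: str, modality: str,
                        target_h: int = 1024, target_w: int = 1024) -> OmniTextPrompt:
    stage0: OmniTextPrompt = {"prompt": prompt}
    if modality in ("IMAGE_GENERATION", "VIDEO_GENERATION"):
        # A non-empty mm_processor_kwargs is what flips
        # OmniInputPreprocessor._process_text onto the multimodal processor
        # branch instead of plain _tokenize_prompt, so the HF processor
        # gets a chance to emit the image-generation scaffold.
        stage0["mm_processor_kwargs"] = {"target_h": target_h, "target_w": target_w}
    return stage0
```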
The previous commit made GLM-Image produce a coherent image at the
default 1024x1024, but non-default sizes (e.g. 512x512) still failed
with a DiT tensor-dim mismatch (AR-scale prior upsampled to 64x64 while
DiT ran at the requested 32x32 latent).
Root cause: build_original_prompt dropped the height/width arguments
instead of placing them on the prompt dict. Stage processors like
GLM-Image's ar2diffusion look up the target size from original_prompt
to slice and upsample AR-generated prior tokens; with the fields
missing it fell through to a 1024x1024 default regardless of what the
request asked for. After the size was locked at 1024, the prior tensor
shape never matched the DiT latent shape at other sizes.
Fix: put target h/w into the prompt dict (see the sketch below). Write both
- mm_processor_kwargs={target_h, target_w}, the field the
  post-#3034 ar2diffusion reads
- top-level height/width, the field the dynamo runtime's bundled
  vllm-omni 0.19.0rc1 ar2diffusion reads
so the fix works across vllm-omni versions without needing to pin
a specific downstream release.
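A minimal sketch of what the fixed helper plausibly looks like, assuming build_original_prompt returns a plain prompt dict; the real signature in dynamo's omni utils may differ:

```python
# Sketch only; the real build_original_prompt in dynamo's omni utils may
# take different arguments and build a richer prompt object.
def build_original_prompt(prompt: str, height: int, width: int) -> dict:
    return {
        "prompt": prompt,
        # Read by ar2diffusion on post-#3034 vllm-omni:
        "mm_processor_kwargs": {"target_h": height, "target_w": width},
        # Read by the ar2diffusion bundled with vllm-omni 0.19.0rc1:
        "height": height,
        "width": width,
    }
```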
Verified on 2xA6000 with zai-org/GLM-Image:
- 1024x1024: still MD5-identical to `vllm-omni serve --omni` on the
same container (no regression on the default-size path)
- 512x512: now produces a coherent image in ~60s; previously failed
with RuntimeError at glm_image_transformer.py:883
(tensor a (1024) vs tensor b (4096) on the hidden_states +
prior_hidden_states add)
Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Thanks for working on this
Summary
Three commits fix GLM-Image in dynamo's disaggregated omni path:
- refactor(omni): colocate glm_image.yaml under examples/backends/vllm/launch/stage_configs/ (matching single_stage_llm.yaml, qwen2_5_omni_pd.yaml) instead of reaching into the installed vllm_omni package. Removes a leaky abstraction that breaks the launch script against vllm-omni builds that don't ship that specific yaml.
- fix(omni): attach mm_processor_kwargs={target_h, target_w} to the stage-0 OmniTextPrompt for image/video generation requests, so OmniInputPreprocessor._process_text routes through the HF multimodal processor path. Without this, AR-based image-gen models (e.g. GLM-Image) never emit their image-generation scaffold and the DiT stage denoises a collapsed token stream into textured noise.
- fix(omni): place target h/w on the original_prompt dict (both as mm_processor_kwargs for the post-#3034 ar2diffusion and as top-level height/width for the 0.19.0rc1 ar2diffusion shipped in the dynamo runtime). Stage processors read this to upsample AR-generated prior tokens; without it they fall back to the 1024x1024 default and decouple from the requested size.

Evidence
zai-org/GLM-Image, prompt "a red apple on a white table", 2×A6000.

Default size (1024×1024), MD5-identical output across the native and fixed disaggregated paths:
| Path | MD5 | Result |
| --- | --- | --- |
| dynamo disagg, before fix | 36421ed1d1cfb07499fd166141f7998c | red striated noise |
| `vllm-omni serve --omni` (native) | fa91343423d032e053327eb6047459b4 | coherent apple |
| dynamo disagg, after fix | fa91343423d032e053327eb6047459b4 | coherent apple |
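For context, a hedged sketch of the kind of request behind these hashes. The endpoint, model, prompt, and sizes come from this PR; the payload shape is assumed to follow the OpenAI images API and the frontend address is hypothetical:

```python
# Assumed request shape (OpenAI-style images API); only the endpoint path,
# model, prompt, and sizes are taken from this PR.
import json
import urllib.request

payload = {
    "model": "zai-org/GLM-Image",
    "prompt": "a red apple on a white table",
    "size": "1024x1024",  # or "512x512" for the non-default-size check
}
req = urllib.request.Request(
    "http://localhost:8000/v1/images/generations",  # hypothetical frontend address
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
print(list(body))  # inspect the response envelope
```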
Non-default size (512×512):

| Fix applied | Result |
| --- | --- |
| route image-gen through multimodal processor (commit 2) only | RuntimeError at glm_image_transformer.py:883 (tensor a (1024) vs tensor b (4096): DiT at the requested 512 but AR prior upsampled to 1024-scale) |
| + pass target h/w to stage processor (commit 3) | coherent image in ~60s |

Root cause
The dynamo disagg path is a set of separate worker processes glued together by a custom router. It bypasses vllm-omni's OpenAI chat entrypoint entirely (it goes through dynamo.vllm.omni.stage_router → dynamo.vllm.omni.stage_worker → AsyncOmni directly), so the upstream vllm-omni #3034 fix doesn't reach it. Two separate pieces of size metadata have to make it across the stage boundary in the dynamo path, and neither did:

- mm_processor_kwargs={target_h, target_w} on the engine prompt, so that OmniInputPreprocessor._process_text takes the multimodal branch and the HF processor emits GLM-Image's scaffold (<|image|>PROMPT<sop>H W<eop><sop>h w<eop><|dit_token_N|>). Without the scaffold, AR produces a handful of repeated VQ codes and DiT denoises them into noise.
- The target size on original_prompt, read by GLM-Image's ar2diffusion custom processor (as mm_processor_kwargs["target_h"/"target_w"] on post-#3034 vllm-omni, or as top-level height/width on 0.19.0rc1), so that it slices and upsamples the AR prior token grid to the right latent shape. Without it, it defaults to 1024×1024 and produces a 64×64 prior regardless of the requested size, which then mismatches the DiT hidden-state shape for any non-1024 latent. A version-spanning lookup is sketched below.
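A minimal sketch of that lookup, assuming the stage processor falls back through the fields in this order. The function name and exact precedence are hypothetical, not taken from vllm-omni:

```python
# Hypothetical helper illustrating why commit 3 writes the size in both
# places: either reader finds it, and only when both are absent does the
# 1024x1024 default (and the resulting 64x64 prior) kick in.
def resolve_target_size(original_prompt: dict) -> tuple[int, int]:
    mm_kwargs = original_prompt.get("mm_processor_kwargs") or {}
    # post-#3034 vllm-omni reads mm_processor_kwargs; the 0.19.0rc1
    # ar2diffusion reads top-level height/width.
    target_h = mm_kwargs.get("target_h") or original_prompt.get("height") or 1024
    target_w = mm_kwargs.get("target_w") or original_prompt.get("width") or 1024
    return target_h, target_w
```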
Scope

Minimal: two touched files (utils.py for the preprocessing plumbing, disagg_omni_glm_image.sh plus the new yaml for the colocation refactor). Image/video-generation paths only; the chat / text / audio branches of parse_omni_request are untouched. Models whose HF processor and stage processor ignore target_h/target_w and height/width are unaffected.

Test plan
- disagg_omni_glm_image.sh + /v1/images/generations at 1024x1024: output MD5-identical to vllm-omni serve.
- 512x512: now produces a coherent image (RuntimeError before).
- The chat/text branches of parse_omni_request and the audio handler are unchanged (not touched in the diff).

🤖 Generated with Claude Code