
[WIP] Add bitwise parity test for MoE EP #3172

Open

wwwjn wants to merge 3 commits into gh/wwwjn/17/base from gh/wwwjn/17/head

Conversation

@wwwjn (Contributor) commented Apr 30, 2026

Stack from ghstack (oldest at bottom):

Add rl_grpo_qwen3_moe_debug_ep_batch_invariant config and
TestBitwiseParityMoEEP test class for verifying bitwise identity
between trainer prefill and vLLM generator prefill with MoE EP.

Config: TP=4, EP=4 on 4 GPUs, deterministic mode, compile disabled.
Uses /tmp/debug_moe_ckpt by default (override with MOE_HF_ASSETS_PATH).

Run with:

    python scripts/rl/create_debug_moe_ckpt.py
    torchrun --nproc_per_node=4 -m pytest \
      torchtitan/experiments/rl/tests/test_bitwise_parity.py::TestBitwiseParityMoEEP
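A parity check of this kind compares the trainer's prefill outputs against the generator's bit for bit, not merely within a tolerance. A minimal pure-Python sketch of the two metrics involved (the helper names and the float64 framing are illustrative, not taken from the test itself):

```python
import struct

def float_bits(x: float) -> int:
    # Reinterpret a float64 as its raw 64-bit pattern;
    # bitwise parity means these patterns match exactly.
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def max_delta(a, b):
    # The looser numeric metric reported when parity fails.
    return max(abs(x - y) for x, y in zip(a, b))

def bitwise_equal(a, b):
    return all(float_bits(x) == float_bits(y) for x, y in zip(a, b))

trainer_logits = [0.5, 1.25, -3.0]
generator_logits = [0.5, 1.25, -3.0]
perturbed_logits = [0.5, 1.25, -2.98]

assert bitwise_equal(trainer_logits, generator_logits)
assert not bitwise_equal(trainer_logits, perturbed_logits)
assert abs(max_delta(trainer_logits, perturbed_logits) - 0.02) < 1e-12
```

In the real test the compared values would be tensors from the two prefill paths; the point is that `bitwise_equal` is the bar the PR is aiming for, while `max_delta` is the diagnostic when it is missed.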

Note: bitwise parity not yet achieved — current max_delta ~2e-2.
Likely causes: batch-invariant mode disabled (requires batch_invariant_ops),
attention impl differences (varlen vs paged), MoE token routing order.
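The token-routing-order cause comes down to floating-point addition not being associative: if the two engines reduce expert contributions for the same token in different orders, the low-order bits differ even though the math is nominally identical. A tiny pure-Python illustration (the values are arbitrary):

```python
# Floating-point addition is not associative, so a different
# reduction order (e.g. a different expert/token visit order)
# changes the low-order bits of the result.
terms = [0.1, 0.2, 0.3]

left_to_right = (terms[0] + terms[1]) + terms[2]
right_to_left = terms[0] + (terms[1] + terms[2])

assert left_to_right != right_to_left               # bit patterns differ
assert abs(left_to_right - right_to_left) < 1e-15   # yet numerically tiny
```

This is why bitwise parity requires pinning down not just the operations but their exact execution order on both sides.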

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Apr 30, 2026
@wwwjn changed the title from "Add bitwise parity test for MoE EP" to "[WIP] Add bitwise parity test for MoE EP" on Apr 30, 2026
wwwjn added a commit that referenced this pull request May 1, 2026
ghstack-source-id: 65e6b7a
Pull Request resolved: #3172

Labels

ciflow/8gpu, CLA Signed (managed by the Meta Open Source bot)
