
[WIP] Add bitwise parity test for MoE EP #3172

Open

wwwjn wants to merge 3 commits into gh/wwwjn/17/base from gh/wwwjn/17/head

Conversation

@wwwjn (Contributor) commented Apr 30, 2026

Stack from ghstack (oldest at bottom):

Add rl_grpo_qwen3_moe_debug_ep_batch_invariant config and
TestBitwiseParityMoEEP test class for verifying bitwise identity
between trainer prefill and vLLM generator prefill with MoE EP.

Config: TP=4, EP=4 on 4 GPUs, deterministic mode, compile disabled.
Uses /tmp/debug_moe_ckpt by default (override with MOE_HF_ASSETS_PATH).

Run with:

    python scripts/rl/create_debug_moe_ckpt.py
    torchrun --nproc_per_node=4 -m pytest \
      torchtitan/experiments/rl/tests/test_bitwise_parity.py::TestBitwiseParityMoEEP
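A parity check of this kind compares the trainer's prefill outputs against the generator's bit for bit, not merely within a tolerance. A minimal pure-Python sketch of the two metrics involved (the helper names and the float64 framing are illustrative, not taken from the test itself):

```python
import struct

def float_bits(x: float) -> int:
    # Reinterpret a float64 as its raw 64-bit pattern;
    # bitwise parity means these patterns match exactly.
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def max_delta(a, b):
    # The looser numeric metric reported when parity fails.
    return max(abs(x - y) for x, y in zip(a, b))

def bitwise_equal(a, b):
    return all(float_bits(x) == float_bits(y) for x, y in zip(a, b))

trainer_logits = [0.5, 1.25, -3.0]
generator_logits = [0.5, 1.25, -3.0]
perturbed_logits = [0.5, 1.25, -2.98]

assert bitwise_equal(trainer_logits, generator_logits)
assert not bitwise_equal(trainer_logits, perturbed_logits)
assert abs(max_delta(trainer_logits, perturbed_logits) - 0.02) < 1e-12
```

In the real test the compared values would be tensors from the two prefill paths; the point is that `bitwise_equal` is the bar the PR is aiming for, while `max_delta` is the diagnostic when it is missed.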

Note: bitwise parity not yet achieved — current max_delta ~2e-2.
Likely causes: batch-invariant mode disabled (requires batch_invariant_ops),
attention impl differences (varlen vs paged), MoE token routing order.
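The token-routing-order cause comes down to floating-point addition not being associative: if the two engines reduce expert contributions for the same token in different orders, the low-order bits differ even though the math is nominally identical. A tiny pure-Python illustration (the values are arbitrary):

```python
# Floating-point addition is not associative, so a different
# reduction order (e.g. a different expert/token visit order)
# changes the low-order bits of the result.
terms = [0.1, 0.2, 0.3]

left_to_right = (terms[0] + terms[1]) + terms[2]
right_to_left = terms[0] + (terms[1] + terms[2])

assert left_to_right != right_to_left               # bit patterns differ
assert abs(left_to_right - right_to_left) < 1e-15   # yet numerically tiny
```

This is why bitwise parity requires pinning down not just the operations but their exact execution order on both sides.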

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Apr 30, 2026
@wwwjn changed the title from "Add bitwise parity test for MoE EP" to "[WIP] Add bitwise parity test for MoE EP" on Apr 30, 2026
wwwjn added a commit that referenced this pull request May 1, 2026
ghstack-source-id: 65e6b7a
Pull Request resolved: #3172

Labels

ciflow/8gpu, CLA Signed (managed by the Meta Open Source bot)
