Hi authors,
Thank you for the impressive work on SDPO. I am currently trying to reproduce the results on the Tool Use task using the official codebase and the default configuration provided. My experiments are conducted on Ascend 910C accelerators.
Observations:
Loss vs. Gradient Norm Mismatch: The sdpo_loss stays within a seemingly normal range (typically ~0.0-0.2), which initially suggests stable optimization. However, the gradient norm (grad_norm) consistently drops to ~10^-5, which is orders of magnitude smaller than the values reported in Figure 18 of the paper (where it hovers between 0 and 20, although that figure is for LCBv6) and in the wandb logs.

Validation Performance: Correspondingly, validation metrics (accuracy/pass rate) fluctuate randomly without any upward trend, indicating that model parameters are effectively not updating.
Stability Concern: The collapse in gradient flow happens early and persists, suggesting the model is stuck in a plateau rather than converging.
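For reference, here is a minimal sketch of how I inspect the gradient norms (the `model`/`loss` names are placeholders standing in for the actual policy model and SDPO loss, not the real training code):

```python
import torch
import torch.nn as nn

# Placeholder stand-in for the policy model; any nn.Module shows the pattern.
model = nn.Linear(8, 1)
loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()

# Global L2 gradient norm, the same quantity clip_grad_norm_ would report.
grad_norm = torch.norm(
    torch.stack([p.grad.norm(2) for p in model.parameters() if p.grad is not None])
)
print(f"grad_norm = {grad_norm.item():.3e}")

# Per-parameter breakdown, to check whether the collapse is uniform
# across layers or localized to a subset of parameters.
for name, p in model.named_parameters():
    if p.grad is not None:
        print(f"{name}: {p.grad.norm(2).item():.3e}")
```

In my runs, the per-parameter norms are uniformly tiny (~10^-5 scale), not just a few layers.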
Questions:
Is a grad_norm this close to 0 expected behavior in SDPO, or does it indicate a gradient vanishing/collapse issue? The training curves in the paper show significantly larger norms.
Any insights or suggestions would be greatly appreciated!
Environment:
Hardware: Ascend 910C
Model: Olmo3-7b
Task: Tool Use
Framework: PyTorch + CANN / verl
Config: Default settings from the repo