
Fix LocalBackend fork_checkpoint to overwrite initial LoRA for vLLM#652

Open
arcticfly wants to merge 8 commits into main from fix/fork-checkpoint-overwrite-step0
Conversation

Collaborator

@arcticfly arcticfly commented Apr 13, 2026

Problem

LocalBackend._experimental_fork_checkpoint has three issues that prevent forked LoRA checkpoints from being used correctly:

1. vLLM loads the wrong checkpoint

model.register(backend) creates an empty LoRA at checkpoints/0000. The fork then copies the real weights to checkpoints/{source_step} (e.g. 0686). start_openai_server does call get_last_checkpoint_dir(), which finds the forked checkpoint, but the vLLM subprocess is configured with an @0 alias that points at 0000, so inference serves the empty adapter.
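The mismatch can be sketched in a few lines. `get_last_checkpoint_dir` here is a hypothetical stand-in for ART's helper, reduced to "pick the highest-numbered step directory":

```python
# Illustrative sketch of the checkpoint-resolution mismatch (names hypothetical).
import tempfile
from pathlib import Path

def get_last_checkpoint_dir(checkpoints: Path) -> Path:
    # Picks the highest-numbered step directory, e.g. 0686 over 0000.
    return max(checkpoints.iterdir(), key=lambda p: int(p.name))

root = Path(tempfile.mkdtemp()) / "checkpoints"
(root / "0000").mkdir(parents=True)  # empty LoRA from model.register(backend)
(root / "0686").mkdir()              # forked weights land here

last = get_last_checkpoint_dir(root)
print(last.name)  # -> "0686"
# But the vLLM subprocess's "@0" alias still resolves to .../0000,
# so inference serves the empty adapter instead of the forked one.
```

Overwriting 0000 with the forked weights removes the divergence: both the alias and the last-checkpoint lookup then point at real weights.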

2. Trainer uses the wrong weights

UnslothService._state is a cached_property that may be initialized before the fork runs. Even when it re-initializes after fork, create_unsloth_train_context calls FastLanguageModel.from_pretrained(model_name=checkpoint_dir) which sets up the LoRA architecture but may not load the trained weights correctly across precision boundaries (e.g. checkpoint trained in 4-bit, loaded in 16-bit).

3. Mixed bf16/fp16 dtype crash on H200

On H200 GPUs, base model activations run in bf16 while LoRA adapter weights are fp16. Unsloth's fused matmul_lora and fast_linear_forward call addmm_/addmv_ which crash on mixed dtypes: RuntimeError: self and mat2 must have the same dtype, but got Half and BFloat16.

Fix

Checkpoint loading (backend.py)

  1. Overwrite checkpoints/0000 with the forked weights so vLLM loads the correct adapter on startup.
  2. Invalidate the _state cache so the trainer re-initializes with the forked checkpoint path.
  3. Set _forked_checkpoint_dir on the service and call load_adapter_from_checkpoint on the first training call to explicitly load the adapter weights via set_peft_model_state_dict.
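The three steps above can be sketched roughly as follows. `UnslothServiceSketch` and `fork_checkpoint` are illustrative stand-ins for ART's internals, not the actual implementation:

```python
# Hedged sketch of the fix flow; class and function names are hypothetical.
import shutil
from functools import cached_property
from pathlib import Path

class UnslothServiceSketch:
    def __init__(self) -> None:
        self._forked_checkpoint_dir = None  # consumed on first train call

    @cached_property
    def _state(self):
        return object()  # stands in for the loaded trainer state

def fork_checkpoint(source: Path, dest_model_dir: Path, source_step: int,
                    service: UnslothServiceSketch) -> None:
    # 1. Copy the forked weights to the step directory AND over
    #    checkpoints/0000, so vLLM's "@0" alias resolves to the real adapter.
    for name in (f"{source_step:04d}", "0000"):
        shutil.copytree(source, dest_model_dir / "checkpoints" / name,
                        dirs_exist_ok=True)
    # 2. Invalidate the cached_property so the trainer re-initializes
    #    against the forked checkpoint on next access.
    service.__dict__.pop("_state", None)
    # 3. Remember the path; the first training call loads the adapter
    #    weights explicitly (set_peft_model_state_dict in the real code).
    service._forked_checkpoint_dir = str(
        dest_model_dir / "checkpoints" / f"{source_step:04d}")
```

Popping the attribute out of `__dict__` is the standard way to invalidate a `functools.cached_property`, since the decorator caches its result there.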

Dtype patch (dtype_patch.py)

Patches matmul_lora and fast_linear_forward to cast tensors to a common dtype (preferring bf16) before fused ops. Applied automatically when _state is first accessed.
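A minimal sketch of the promotion rule the patch applies, with dtypes modeled as strings so the logic is visible without torch; the helper name is hypothetical:

```python
# Illustrative dtype-promotion rule (name hypothetical): prefer bf16 when
# inputs are mixed, pass homogeneous inputs through unchanged.
def common_dtype(*dtypes: str, prefer: str = "bfloat16") -> str:
    if len(set(dtypes)) == 1:
        return dtypes[0]          # already homogeneous: no cast needed
    return prefer                 # mixed: cast everything to the preferred dtype

# bf16 activations + fp16 LoRA weights resolve to bf16 before addmm_:
print(common_dtype("bfloat16", "float16"))  # -> "bfloat16"
print(common_dtype("float16", "float16"))   # -> "float16"
```

In the real patch the cast happens on the tensors themselves (`tensor.to(dtype)`) just before the fused `addmm_`/`addmv_` calls, which is what eliminates the "same dtype" RuntimeError.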

Verification

Without the fix: val/qa_failed = 40-80% at step 0 (base-model behavior), plus dtype crashes.
With the fix: val/qa_failed = 7%, matching the W&B Inference baseline (8%), with no crashes.

pr-test-001 (no local patches, ART PR only):
  val/qa_failed:              7.0%
  val/formatting_failed:     33.0%
  val/filler_words_failed:    4.0%
  val/self_correction_failed: 8.0%
  val/voice_commands_failed:  3.0%
  val/faithfulness_failed:    1.0%
  val/lost_meaning_failed:    4.0%
  val/organization_failed:    1.0%
  val/emoji_failed:           2.0%
  val/reward:                 0.681

Closes #651

arcticfly and others added 2 commits April 13, 2026 13:57
When forking a checkpoint, the source checkpoint was copied to
checkpoints/{source_step} in the destination model directory. However,
model.register(backend) already created an empty LoRA at checkpoints/0000.
When vLLM starts, it loads @0 — the empty 0000 checkpoint — not the
forked one. Fix by also copying the forked weights to checkpoints/0000
so vLLM loads the correct weights on startup.

Fixes #651

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The real issue is that UnslothService._state (a cached_property)
may be initialized before the fork copies the checkpoint, caching
the base model instead of the forked weights. Invalidating the
cache after fork ensures the trainer picks up the forked checkpoint
on next access.

The step-0 overwrite was unnecessary — vLLM's start_openai_server
already calls get_last_checkpoint_dir() which finds the forked
checkpoint at its original step number.
After _experimental_fork_checkpoint, store the checkpoint path on
the service. On the first _train_dedicated/_train_shared call, load
the adapter weights via load_lora_adapter before training begins.

This is needed because create_unsloth_train_context may initialize
the LoRA architecture from adapter_config.json without loading the
actual trained weights from adapter_model.safetensors, especially
when the checkpoint was trained at a different precision than the
current load config.
@arcticfly arcticfly force-pushed the fix/fork-checkpoint-overwrite-step0 branch from 5489fde to 0d53531 on April 14, 2026 20:46
@arcticfly arcticfly force-pushed the fix/fork-checkpoint-overwrite-step0 branch from 0d53531 to 62e4fbc on April 14, 2026 20:48
On H200 GPUs, base model activations run in bf16 while LoRA adapter
weights are fp16. Unsloth's fused matmul_lora and fast_linear_forward
call addmm_/addmv_ which crash on mixed dtypes. This patch casts
tensors to a common dtype before those ops.

Applied automatically when UnslothService._state is first accessed.
@arcticfly arcticfly force-pushed the fix/fork-checkpoint-overwrite-step0 branch 2 times, most recently from 9346894 to e0decea on April 14, 2026 22:55
@arcticfly arcticfly force-pushed the fix/fork-checkpoint-overwrite-step0 branch from e0decea to ba4c406 on April 14, 2026 22:56


Development

Successfully merging this pull request may close these issues.

LocalBackend fork_checkpoint doesn't update vLLM's initial LoRA

2 participants