
Fix LocalBackend fork_checkpoint to overwrite initial LoRA for vLLM#652

Open
arcticfly wants to merge 8 commits into main from fix/fork-checkpoint-overwrite-step0
Conversation

Collaborator

@arcticfly arcticfly commented Apr 13, 2026

Problem

LocalBackend._experimental_fork_checkpoint has three issues that prevent forked LoRA checkpoints from being used correctly:

1. vLLM loads the wrong checkpoint

model.register(backend) creates an empty LoRA at checkpoints/0000. The fork then copies the real weights to checkpoints/{source_step} (e.g. 0686). start_openai_server does call get_last_checkpoint_dir(), which finds the forked checkpoint, but the vLLM subprocess is configured with an @0 alias that points at 0000, so inference serves the empty adapter.
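The mismatch can be sketched in a few lines. `get_last_checkpoint_dir` here is a hypothetical stand-in for ART's helper, reduced to "pick the highest-numbered step directory":

```python
# Illustrative sketch of the checkpoint-resolution mismatch (names hypothetical).
import tempfile
from pathlib import Path

def get_last_checkpoint_dir(checkpoints: Path) -> Path:
    # Picks the highest-numbered step directory, e.g. 0686 over 0000.
    return max(checkpoints.iterdir(), key=lambda p: int(p.name))

root = Path(tempfile.mkdtemp()) / "checkpoints"
(root / "0000").mkdir(parents=True)  # empty LoRA from model.register(backend)
(root / "0686").mkdir()              # forked weights land here

last = get_last_checkpoint_dir(root)
print(last.name)  # -> "0686"
# But the vLLM subprocess's "@0" alias still resolves to .../0000,
# so inference serves the empty adapter instead of the forked one.
```

Overwriting 0000 with the forked weights removes the divergence: both the alias and the last-checkpoint lookup then point at real weights.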

2. Trainer uses the wrong weights

UnslothService._state is a cached_property that may be initialized before the fork runs. Even when it re-initializes after fork, create_unsloth_train_context calls FastLanguageModel.from_pretrained(model_name=checkpoint_dir) which sets up the LoRA architecture but may not load the trained weights correctly across precision boundaries (e.g. checkpoint trained in 4-bit, loaded in 16-bit).

3. Mixed bf16/fp16 dtype crash on H200

On H200 GPUs, base model activations run in bf16 while LoRA adapter weights are fp16. Unsloth's fused matmul_lora and fast_linear_forward call addmm_/addmv_ which crash on mixed dtypes: RuntimeError: self and mat2 must have the same dtype, but got Half and BFloat16.

Fix

Checkpoint loading (backend.py)

  1. Overwrite checkpoints/0000 with the forked weights so vLLM loads the correct adapter on startup.
  2. Invalidate the _state cache so the trainer re-initializes with the forked checkpoint path.
  3. Set _forked_checkpoint_dir on the service and call load_adapter_from_checkpoint on the first training call to explicitly load the adapter weights via set_peft_model_state_dict.
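The three steps above can be sketched roughly as follows. `UnslothServiceSketch` and `fork_checkpoint` are illustrative stand-ins for ART's internals, not the actual implementation:

```python
# Hedged sketch of the fix flow; class and function names are hypothetical.
import shutil
from functools import cached_property
from pathlib import Path

class UnslothServiceSketch:
    def __init__(self) -> None:
        self._forked_checkpoint_dir = None  # consumed on first train call

    @cached_property
    def _state(self):
        return object()  # stands in for the loaded trainer state

def fork_checkpoint(source: Path, dest_model_dir: Path, source_step: int,
                    service: UnslothServiceSketch) -> None:
    # 1. Copy the forked weights to the step directory AND over
    #    checkpoints/0000, so vLLM's "@0" alias resolves to the real adapter.
    for name in (f"{source_step:04d}", "0000"):
        shutil.copytree(source, dest_model_dir / "checkpoints" / name,
                        dirs_exist_ok=True)
    # 2. Invalidate the cached_property so the trainer re-initializes
    #    against the forked checkpoint on next access.
    service.__dict__.pop("_state", None)
    # 3. Remember the path; the first training call loads the adapter
    #    weights explicitly (set_peft_model_state_dict in the real code).
    service._forked_checkpoint_dir = str(
        dest_model_dir / "checkpoints" / f"{source_step:04d}")
```

Popping the attribute out of `__dict__` is the standard way to invalidate a `functools.cached_property`, since the decorator caches its result there.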

Dtype patch (dtype_patch.py)

Patches matmul_lora and fast_linear_forward to cast tensors to a common dtype (preferring bf16) before fused ops. Applied automatically when _state is first accessed.
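A minimal sketch of the promotion rule the patch applies, with dtypes modeled as strings so the logic is visible without torch; the helper name is hypothetical:

```python
# Illustrative dtype-promotion rule (name hypothetical): prefer bf16 when
# inputs are mixed, pass homogeneous inputs through unchanged.
def common_dtype(*dtypes: str, prefer: str = "bfloat16") -> str:
    if len(set(dtypes)) == 1:
        return dtypes[0]          # already homogeneous: no cast needed
    return prefer                 # mixed: cast everything to the preferred dtype

# bf16 activations + fp16 LoRA weights resolve to bf16 before addmm_:
print(common_dtype("bfloat16", "float16"))  # -> "bfloat16"
print(common_dtype("float16", "float16"))   # -> "float16"
```

In the real patch the cast happens on the tensors themselves (`tensor.to(dtype)`) just before the fused `addmm_`/`addmv_` calls, which is what eliminates the "same dtype" RuntimeError.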

Verification

Without the fix: val/qa_failed = 40-80% at step 0 (base-model behavior), plus dtype crashes.
With the fix: val/qa_failed = 7%, matching the W&B Inference baseline (8%), with no crashes.

pr-test-001 (no local patches, ART PR only):
  val/qa_failed:              7.0%
  val/formatting_failed:     33.0%
  val/filler_words_failed:    4.0%
  val/self_correction_failed: 8.0%
  val/voice_commands_failed:  3.0%
  val/faithfulness_failed:    1.0%
  val/lost_meaning_failed:    4.0%
  val/organization_failed:    1.0%
  val/emoji_failed:           2.0%
  val/reward:                 0.681

Closes #651

arcticfly and others added 2 commits April 13, 2026 13:57
When forking a checkpoint, the source checkpoint was copied to
checkpoints/{source_step} in the destination model directory. However,
model.register(backend) already created an empty LoRA at checkpoints/0000.
When vLLM starts, it loads @0 — the empty 0000 checkpoint — not the
forked one. Fix by also copying the forked weights to checkpoints/0000
so vLLM loads the correct weights on startup.

Fixes #651

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The real issue is that UnslothService._state (a cached_property)
may be initialized before the fork copies the checkpoint, caching
the base model instead of the forked weights. Invalidating the
cache after fork ensures the trainer picks up the forked checkpoint
on next access.

The step-0 overwrite was unnecessary — vLLM's start_openai_server
already calls get_last_checkpoint_dir() which finds the forked
checkpoint at its original step number.
After _experimental_fork_checkpoint, store the checkpoint path on
the service. On the first _train_dedicated/_train_shared call, load
the adapter weights via load_lora_adapter before training begins.

This is needed because create_unsloth_train_context may initialize
the LoRA architecture from adapter_config.json without loading the
actual trained weights from adapter_model.safetensors, especially
when the checkpoint was trained at a different precision than the
current load config.
@arcticfly arcticfly force-pushed the fix/fork-checkpoint-overwrite-step0 branch from 5489fde to 0d53531 on April 14, 2026 20:46
@arcticfly arcticfly force-pushed the fix/fork-checkpoint-overwrite-step0 branch from 0d53531 to 62e4fbc on April 14, 2026 20:48
On H200 GPUs, base model activations run in bf16 while LoRA adapter
weights are fp16. Unsloth's fused matmul_lora and fast_linear_forward
call addmm_/addmv_ which crash on mixed dtypes. This patch casts
tensors to a common dtype before those ops.

Applied automatically when UnslothService._state is first accessed.
@arcticfly arcticfly force-pushed the fix/fork-checkpoint-overwrite-step0 branch 2 times, most recently from 9346894 to e0decea on April 14, 2026 22:55
@arcticfly arcticfly force-pushed the fix/fork-checkpoint-overwrite-step0 branch from e0decea to ba4c406 on April 14, 2026 22:56


Development

Successfully merging this pull request may close these issues.

LocalBackend fork_checkpoint doesn't update vLLM's initial LoRA

2 participants