Description
I'm experiencing unusually high gradient norms and loss values during the initial steps of training Gemma-4-E2B-it with full fine-tuning. The gradient norms are reaching 6880-8128 despite having max_grad_norm: 1.0 set, and the initial loss is around 185.9, which seems excessively high.
Training Logs
```
Unsloth: Will smartly offload gradients to save VRAM!
{'loss': '185.9', 'grad_norm': '6880', 'learning_rate': '1.263e-06', 'epoch': '0.002296'}
{'loss': '142.5', 'grad_norm': '4512', 'learning_rate': '2.538e-06', 'epoch': '0.004592'}
{'loss': '117.6', 'grad_norm': '8128', 'learning_rate': '3.814e-06', 'epoch': '0.006888'}
{'loss': '92.86', 'grad_norm': '1344', 'learning_rate': '5.089e-06', 'epoch': '0.009185'}
```
Questions:
- Is it normal to have such high gradient norms (6880-8128) during initial training? `max_grad_norm` is set to 1.0 in my config, but the reported values are much higher.
- Is an initial loss of 185.9 expected for Gemma-4-E2B-it? The README mentions Gemma-4-E4B loss should be 13-15, but I'm using E2B.
- Could the gradient offloading message be related? "Unsloth: Will smartly offload gradients to save VRAM!"
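For what it's worth, the `grad_norm` that Hugging Face's `Trainer` logs is typically the total norm returned by `torch.nn.utils.clip_grad_norm_`, which is the norm computed *before* clipping; clipping is still applied to the gradients afterwards. A minimal sketch of that behavior (assuming the trainer logs this return value):

```python
import torch

# clip_grad_norm_ returns the TOTAL norm computed BEFORE clipping, so a
# logged grad_norm can exceed max_grad_norm even though the optimizer step
# itself uses the clipped gradients.
p = torch.nn.Parameter(torch.zeros(10))
p.grad = torch.full((10,), 100.0)  # deliberately huge gradient

pre_clip = torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)
post_clip = p.grad.norm()

print(float(pre_clip))   # ~316.23 (100 * sqrt(10)), far above max_norm
print(float(post_clip))  # ~1.0, the norm actually used for the update
```

So a reported `grad_norm` above 1.0 is not by itself evidence that clipping is broken; the concerning part is the sheer magnitude.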
Environment Information
Note: Please do not remove the questions. Answer beside them.
- Did you update? `pip install --upgrade unsloth unsloth_zoo`.
  - Yes, using latest versions via `uv pip install -e .`
- Colab or Kaggle or local / cloud.
  - Local server with H100 GPUs
- Number GPUs used, use `nvidia-smi`
- Which notebook? Please link!
  - Custom training script: `train_ift.py` (see below)
- Which Unsloth version, TRL version, transformers version, PyTorch version?
  - Unsloth: latest (installed via `uv pip install unsloth`)
  - TRL: latest
  - Transformers: latest
  - PyTorch: latest (with CUDA support)
- Which trainer? SFTTrainer, GRPOTrainer etc
  - SFTTrainer from TRL
Minimal Code to Reproduce
```python
# train_ift.py - Simplified relevant parts
import torch
from unsloth import FastModel, get_chat_template, train_on_responses_only
from transformers import TrainingArguments
from trl import SFTTrainer

# Load model
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-4-E2B-it",
    max_seq_length=8192,
    load_in_4bit=False,
    load_in_8bit=False,
    full_finetuning=True,
    trust_remote_code=True,
)

# Apply chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="gemma-4",
)

# Training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-5,
    num_train_epochs=3,
    max_grad_norm=1.0,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.95,
    weight_decay=0.1,
)

# Dataset: 10 Vietnamese datasets (~650K samples)
# Format: Conversational with reasoning content

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=training_args,
)

# Apply response-only training via the Unsloth helper
train_on_responses_only(
    trainer,
    instruction_part="<|turn>user\n",
    response_part="<|turn>model\n",
)

# Train
trainer.train()
```
Configuration Details
Model: unsloth/gemma-4-E2B-it
- Full fine-tuning (no LoRA)
- Max sequence length: 8192
- Chat template: "gemma-4"
- Thinking enabled: Yes
Training Config:
- Batch size: 1 per device
- Gradient accumulation: 16 (effective batch = 16)
- Learning rate: 5e-5
- Warmup ratio: 3%
- Max grad norm: 1.0 (but seeing 6880-8128!)
- Optimizer: AdamW (adamw_torch)
- Precision: bf16
Datasets:
- 10 Vietnamese datasets (~650K samples total)
- Mix: Education, Math, Legal, History, NLI, Summarization, etc.
- Format: ShareGPT/Alpaca with reasoning content
Expected vs Actual Behavior
Expected:
- Gradient norm should be clipped to max_grad_norm (1.0)
- Loss should start at a reasonable value (not 185.9)
- Training should be stable from the start
Actual:
- Gradient norms: 6880, 4512, 8128, 1344 (much higher than max_grad_norm)
- Initial loss: 185.9 (extremely high)
- Loss is decreasing but starting from a very high point
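As a sanity check on the loss scale: a model predicting a uniform distribution over the entire vocabulary already has a per-token cross-entropy of roughly ln(vocab_size). Assuming a Gemma-style vocabulary of about 262,144 tokens (an assumption; substitute your actual tokenizer size), that ceiling is only about 12.5:

```python
import math

# Cross-entropy of a uniform prediction over the vocabulary.
# vocab_size = 262144 is an assumption (Gemma-style tokenizer);
# use len(tokenizer) for the real model.
vocab_size = 262_144
uniform_ce = math.log(vocab_size)
print(round(uniform_ce, 2))  # ~12.48
```

An initial loss of ~185 is far beyond even a randomly initialized output head, which points at something other than ordinary domain shift, e.g. misaligned labels from a chat-template mismatch.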
Sample Training Data (with Chat Template)
Here are examples of the actual training data with the Gemma-4 chat template applied:
Example 1 (Vietnamese Math MCQ):
<|turn>system
<|think|>
<turn|>
<|turn>user
Câu 1:
Độ dài của chiếc bút xoá là:
Các đáp án:
A. 10 cm
B. 8 cm
C. 9 cm
D. 7 cm<turn|>
<|turn>model
<|channel>thought
Hướng dẫn giải
Đáp án đúng là: C
Quan sát hình vẽ ta thấy độ dài của chiếc bút xoá là 9 cm .
<channel|>
Đáp án: C<turn|>
Example 2 (English Math with Reasoning):
<|turn>system
<|think|>
<turn|>
<|turn>user
[MATH - medium] 252 fifth-grade students and 8 teachers at Yeji's school are going on a field trip. If the cost of renting a 41-seater bus is 300,000 won and the highway toll per bus is 7,500 won, how much does it cost to rent the bus and pay the toll?<turn|>
<|turn>model
<|channel>thought
Let me work through this problem step by step.
First, I need to understand what's being asked: 252 fifth-grade students and 8 teachers at Yeji's school are going on a field trip. If the cost of renting a 41-seater bus is 300,000 won and the highway toll per bus is 7,500 won, how much does it cost to rent the bus and pay the toll?
Key values given: 252, 8, 41, 300,000, 7,500
My approach:
1. Find the total number of people going on the field trip
- Fifth-grade students: 252
- Teachers: 8
- Total people: 252 + 8 = 260 people
2. Calculate how many buses are needed
Each bus has 41 seats.
$$\text{Number of buses} = \frac{260}{41} = 6.34...$$
Since we cannot rent a partial bus, we must round up to the next whole number.
3. Buses needed: 7 buses
Verification:
- 6 buses × 41 seats = 246 seats (not enough for 260 people)
- 7 buses × 41 seats = 287 seats (sufficient for 260 people) ✓
4. Calculate the bus rental cost
$$\text{Bus rental cost} = 7 \text{ buses} \times 300,000 \text{ won} = 2,100,000 \text{ won}$$
5. Calculate the highway toll
$$\text{Highway toll} = 7 \text{ buses} \times 7,500 \text{ won} = 52,500 \text{ won}$$
6. Calculate the total cost
$$\text{Total cost} = 2,100,000 + 52,500 = 2,152,500 \text{ won}$$
Final Answer
| Item | Calculation | Cost |
|------|-------------|------|
| Bus rental | 7 × 300,000 won | 2,100,000 won |
| Highway toll | 7 × 7,500 won | 52,500 won |
This gives us the final answer: 2,152,500 won
<channel|>
# Solution: Calculating Bus Rental and Toll Costs
## Step 1: Find the total number of people going on the field trip
- Fifth-grade students: 252
- Teachers: 8
- **Total people**: 252 + 8 = **260 people**
## Step 2: Calculate how many buses are needed
Each bus has 41 seats.
$$\text{Number of buses} = \frac{260}{41} = 6.34...$$
Since we cannot rent a partial bus, we must round up to the next whole number.
**Buses needed**: 7 buses
**Verification**:
- 6 buses × 41 seats = 246 seats (not enough for 260 people)
- 7 buses × 41 seats = 287 seats (sufficient for 260 people) ✓
## Step 3: Calculate the bus rental cost
$$\text{Bus rental cost} = 7 \text{ buses} \times 300,000 \text{ won} = 2,100,000 \text{ won}$$
## Step 4: Calculate the highway toll
$$\text{Highway toll} = 7 \text{ buses} \times 7,500 \text{ won} = 52,500 \text{ won}$$
## Step 5: Calculate the total cost
$$\text{Total cost} = 2,100,000 + 52,500 = 2,152,500 \text{ won}$$
---
## Final Answer
| Item | Calculation | Cost |
|------|-------------|------|
| Bus rental | 7 × 300,000 won | 2,100,000 won |
| Highway toll | 7 × 7,500 won | 52,500 won |
| **Total** | | **2,152,500 won** |
**The total cost to rent the buses and pay the toll is 2,152,500 won.**<turn|>
Notes:
- Using `train_on_responses_only` with `instruction_part="<|turn>user\n"` and `response_part="<|turn>model\n"`
- The `<|channel>thought` blocks contain reasoning content that is also being trained on
- Mix of Vietnamese and English content
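For context on what `train_on_responses_only` is doing conceptually: tokens outside the response span get label -100, which PyTorch's cross-entropy ignores. A simplified sketch of that masking (not Unsloth's actual implementation):

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch cross-entropy

def mask_non_response(input_ids, response_start, response_end):
    """Copy input_ids into labels, masking everything outside the
    half-open response span [response_start, response_end)."""
    return [
        tok if response_start <= i < response_end else IGNORE_INDEX
        for i, tok in enumerate(input_ids)
    ]

labels = mask_non_response([10, 11, 12, 13, 14], 2, 4)
print(labels)  # [-100, -100, 12, 13, -100]
```

If the `instruction_part`/`response_part` markers do not exactly match the strings the chat template actually emits, the span search can fail and the labels end up fully masked or fully unmasked, which is one thing worth verifying given the unusual loss values.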
Additional Context
- The "Unsloth: Will smartly offload gradients to save VRAM!" message appears at the start
- Using `train_on_responses_only=True` with the Gemma-4 chat template
- Training on Vietnamese instruction datasets with reasoning content
- H100 80GB GPU, bf16 precision
- This is continued fine-tuning of an already instruction-tuned model (Gemma-4-E2B-it)
Is this behavior normal for fine-tuning an already instruction-tuned model on a different language/domain? Or is there something wrong with my configuration?
Thank you for your help!