
[Question] High Gradient Norm and Loss During Initial Training with Gemma-4-E2B #4982

@hungphongtrn

Description


I'm experiencing unusually high gradient norms and loss values during the initial steps of training Gemma-4-E2B-it with full fine-tuning. The gradient norms are reaching 6880-8128 despite having max_grad_norm: 1.0 set, and the initial loss is around 185.9, which seems excessively high.

Training Logs

Unsloth: Will smartly offload gradients to save VRAM!
{'loss': '185.9', 'grad_norm': '6880', 'learning_rate': '1.263e-06', 'epoch': '0.002296'}  
{'loss': '142.5', 'grad_norm': '4512', 'learning_rate': '2.538e-06', 'epoch': '0.004592'}
{'loss': '117.6', 'grad_norm': '8128', 'learning_rate': '3.814e-06', 'epoch': '0.006888'}
{'loss': '92.86', 'grad_norm': '1344', 'learning_rate': '5.089e-06', 'epoch': '0.009185'}

Questions:

  1. Is it normal to have such high gradient norms (6880-8128) during initial training? The max_grad_norm is set to 1.0 in my config, but the reported values are much higher (see the small clipping sketch after this list).
  2. Is an initial loss of 185.9 expected for Gemma-4-E2B-it? The README mentions a Gemma-4-E4B loss of around 13-15, but I'm using E2B. For comparison, even a uniform prediction over a ~262K-token vocabulary (assuming E2B uses the usual Gemma vocabulary size) would give a per-token cross-entropy of only about ln(262144) ≈ 12.5, so 185.9 seems hard to explain as an ordinary per-token loss.
  3. Could the gradient offloading message be related? "Unsloth: Will smartly offload gradients to save VRAM!"
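
For reference on question 1: as far as I can tell, torch.nn.utils.clip_grad_norm_ returns the total gradient norm measured before clipping, and the HF Trainer logs that returned value as grad_norm, so the large numbers above may simply be pre-clipping norms rather than evidence that clipping is being skipped. A minimal standalone sketch of that behaviour (plain PyTorch, not my training code):

import torch

# Toy parameter with a deliberately huge gradient
p = torch.nn.Parameter(torch.ones(4))
p.grad = torch.full((4,), 1000.0)

# clip_grad_norm_ scales the gradient in place but returns the norm measured BEFORE clipping
pre_clip_norm = torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)

print(pre_clip_norm)   # tensor(2000.) -> the large value that ends up in the logs
print(p.grad.norm())   # ~1.0 -> the gradient really was clipped to max_norm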

Environment Information

Note: Please do not remove the questions. Answer beside them.

  1. Did you update? pip install --upgrade unsloth unsloth_zoo.

    • Yes, using latest versions via uv pip install -e .
  2. Colab or Kaggle or local / cloud.

    • Local server with H100 GPUs
  3. Number GPUs used, use nvidia-smi

    • 1x H100 80GB
  4. Which notebook? Please link!

    • Custom training script: train_ift.py (see below)
  5. Which Unsloth version, TRL version, transformers version, PyTorch version?

    • Unsloth: latest (installed via uv pip install unsloth)
    • TRL: latest
    • Transformers: latest
    • PyTorch: latest (with CUDA support)
  6. Which trainer? SFTTrainer, GRPOTrainer etc

    • SFTTrainer from TRL

Minimal Code to Reproduce

# train_ift.py - Simplified relevant parts
from unsloth import FastModel
from unsloth.chat_templates import get_chat_template, train_on_responses_only

import torch
from transformers import TrainingArguments
from trl import SFTTrainer

# Load model
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-4-E2B-it",
    max_seq_length=8192,
    load_in_4bit=False,
    load_in_8bit=False,
    full_finetuning=True,
    trust_remote_code=True,
)

# Apply chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="gemma-4",
)

# Training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-5,
    num_train_epochs=3,
    max_grad_norm=1.0,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.95,
    weight_decay=0.1,
)

# Dataset: 10 Vietnamese datasets (~650K samples)
# Format: Conversational with reasoning content

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=training_args,
)

# Apply response-only training: mask everything except the model turns
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|turn>user\n",
    response_part="<|turn>model\n",
)

# Train
trainer.train()

Configuration Details

Model: unsloth/gemma-4-E2B-it

  • Full fine-tuning (no LoRA)
  • Max sequence length: 8192
  • Chat template: "gemma-4"
  • Thinking enabled: Yes

Training Config:

  • Batch size: 1 per device
  • Gradient accumulation: 16 (effective batch = 16)
  • Learning rate: 5e-5
  • Warmup ratio: 3%
  • Max grad norm: 1.0 (but seeing 6880-8128!)
  • Optimizer: AdamW (adamw_torch)
  • Precision: bf16

Datasets:

  • 10 Vietnamese datasets (~650K samples total)
  • Mix: Education, Math, Legal, History, NLI, Summarization, etc.
  • Format: ShareGPT/Alpaca with reasoning content
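
For clarity, before templating the records are shaped roughly like this (field names and contents are illustrative placeholders from my preprocessing, not the exact data):

# Illustrative only: rough shape of one ShareGPT-style record with reasoning content
sharegpt_example = {
    "conversations": [
        {"from": "human", "value": "Question 1: What is the length of the correction pen? ..."},
        {"from": "gpt", "value": "Reasoning: the picture shows 9 cm ...\nAnswer: C"},
    ]
}

# Illustrative only: rough shape of one Alpaca-style record
alpaca_example = {
    "instruction": "252 students and 8 teachers go on a field trip ...",
    "input": "",
    "output": "The total cost is 2,152,500 won.",
}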

Expected vs Actual Behavior

Expected:

  • Gradient norm should be clipped to max_grad_norm (1.0)
  • Loss should start at a reasonable value (not 185.9)
  • Training should be stable from the start

Actual:

  • Gradient norms: 6880, 4512, 8128, 1344 (much higher than max_grad_norm)
  • Initial loss: 185.9 (extremely high)
  • Loss is decreasing but starting from a very high point

Sample Training Data (with Chat Template)

Here are examples of the actual training data with the Gemma-4 chat template applied:
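
Roughly how these strings are produced (simplified sketch; "question" and "answer" are placeholder field names from my preprocessing, and the system turn and thought channel are added earlier in the pipeline, so only the templating step is shown):

# Rough sketch of the rendering step; field names are placeholders
def render_sample(example, tokenizer):
    messages = [
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    # tokenize=False returns the templated string with the turn markers shown below
    return tokenizer.apply_chat_template(messages, tokenize=False)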

Example 1 (Vietnamese Math MCQ):

<|turn>system
<|think|>
<turn|>
<|turn>user
Câu 1: 
 
 Độ dài của chiếc bút xoá là:

Các đáp án:
A. 10 cm
B. 8 cm
C. 9 cm
D. 7 cm<turn|>
<|turn>model
<|channel>thought
Hướng dẫn giải 
 Đáp án đúng là: C 
 Quan sát hình vẽ ta thấy độ dài của chiếc bút xoá là 9 cm .
<channel|>
Đáp án: C<turn|>

Example 2 (English Math with Reasoning):

<|turn>system
<|think|>
<turn|>
<|turn>user
[MATH - medium] 252 fifth-grade students and 8 teachers at Yeji's school are going on a field trip. If the cost of renting a 41-seater bus is 300,000 won and the highway toll per bus is 7,500 won, how much does it cost to rent the bus and pay the toll?<turn|>
<|turn>model
<|channel>thought
Let me work through this problem step by step.

First, I need to understand what's being asked: 252 fifth-grade students and 8 teachers at Yeji's school are going on a field trip. If the cost of renting a 41-seater bus is 300,000 won and the highway toll per bus is 7,500 won, how much does it cost to rent the bus and pay the toll?

Key values given: 252, 8, 41, 300,000, 7,500, ,

My approach:
1. Find the total number of people going on the field trip

- Fifth-grade students: 252
- Teachers: 8
- Total people: 252 + 8 = 260 people
2. Calculate how many buses are needed

Each bus has 41 seats.

$$\text{Number of buses} = \frac{260}{41} = 6.34...$$

Since we cannot rent a partial bus, we must round up to the
3. whole number.

Buses needed: 7 buses

Verification: 
- 6 buses × 41 seats = 246 seats (not enough for 260 people)
- 7 buses × 41 seats = 287 seats (sufficient for 260 people) ✓
4. Calculate the bus rental cost

$$\text{Bus rental cost} = 7 \text{ buses} \times 300,000 \text{ won} = 2,100,000 \text{ won}$$
5. Calculate the highway toll

$$\text{Highway toll} = 7 \text{ buses} \times 7,500 \text{ won} = 52,500 \text{ won}$$
6. Calculate the total cost

$$\text{Total cost} = 2,100,000 + 52,500 = 2,152,500 \text{ won}$$

Final Answer

| Item | Calculation | Cost |
 Bus rental | 7 × 300,000 won | 2,100,000 won |
| Highway tol

This gives us the final answer: s
<channel|>
# Solution: Calculating Bus Rental and Toll Costs

## Step 1: Find the total number of people going on the field trip

- Fifth-grade students: 252
- Teachers: 8
- **Total people**: 252 + 8 = **260 people**

## Step 2: Calculate how many buses are needed

Each bus has 41 seats.

$$\text{Number of buses} = \frac{260}{41} = 6.34...$$

Since we cannot rent a partial bus, we must round up to the next whole number.

**Buses needed**: 7 buses

**Verification**: 
- 6 buses × 41 seats = 246 seats (not enough for 260 people)
- 7 buses × 41 seats = 287 seats (sufficient for 260 people) ✓

## Step 3: Calculate the bus rental cost

$$\text{Bus rental cost} = 7 \text{ buses} \times 300,000 \text{ won} = 2,100,000 \text{ won}$$

## Step 4: Calculate the highway toll

$$\text{Highway toll} = 7 \text{ buses} \times 7,500 \text{ won} = 52,500 \text{ won}$$

## Step 5: Calculate the total cost

$$\text{Total cost} = 2,100,000 + 52,500 = 2,152,500 \text{ won}$$

---

## Final Answer

| Item | Calculation | Cost |
|------|-------------|------|
| Bus rental | 7 × 300,000 won | 2,100,000 won |
| Highway toll | 7 × 7,500 won | 52,500 won |
| **Total** | | **2,152,500 won** |

**The total cost to rent the buses and pay the toll is 2,152,500 won.**<turn|>

Notes:

  • Using train_on_responses_only with instruction_part="<|turn>user\n" and response_part="<|turn>model\n" (a masking sanity check is sketched after these notes)
  • The <|channel>thought blocks contain reasoning content that is also being trained on
  • Mix of Vietnamese and English content
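
To sanity-check the masking mentioned in the first note, I decode only the tokens whose label is not -100 after train_on_responses_only has been applied; a rough sketch of that check (assuming the processed dataset carries input_ids and labels, as it does in my run):

# Rough sanity check: decode only the unmasked tokens (label != -100)
# to confirm the loss is computed on the model turns only
sample = trainer.train_dataset[0]
unmasked = [tok for tok, lab in zip(sample["input_ids"], sample["labels"]) if lab != -100]
print(tokenizer.decode(unmasked))   # should print only the response/model-turn text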

Additional Context

  • The "Unsloth: Will smartly offload gradients to save VRAM!" message appears at the start
  • Using train_on_responses_only with the Gemma-4 chat template
  • Training on Vietnamese instruction datasets with reasoning content
  • H100 80GB GPU, bf16 precision
  • This is continued fine-tuning of an already instruction-tuned model (Gemma-4-E2B-it)

Is this behavior normal for fine-tuning an already instruction-tuned model on a different language/domain? Or is there something wrong with my configuration?

Thank you for your help!
