
[Question] High Gradient Norm and Loss During Initial Training with Gemma-4-E2B #4982

@hungphongtrn

Description


I'm experiencing unusually high gradient norms and loss values during the initial steps of training Gemma-4-E2B-it with full fine-tuning. The gradient norms are reaching 6880-8128 despite having max_grad_norm: 1.0 set, and the initial loss is around 185.9, which seems excessively high.

Training Logs

Unsloth: Will smartly offload gradients to save VRAM!
{'loss': '185.9', 'grad_norm': '6880', 'learning_rate': '1.263e-06', 'epoch': '0.002296'}  
{'loss': '142.5', 'grad_norm': '4512', 'learning_rate': '2.538e-06', 'epoch': '0.004592'}
{'loss': '117.6', 'grad_norm': '8128', 'learning_rate': '3.814e-06', 'epoch': '0.006888'}
{'loss': '92.86', 'grad_norm': '1344', 'learning_rate': '5.089e-06', 'epoch': '0.009185'}

Questions:

  1. Is it normal to have such high gradient norms (6880-8128) during initial training? The max_grad_norm is set to 1.0 in my config, but the reported values are much higher (see the small clipping sketch after this list).
  2. Is an initial loss of 185.9 expected for Gemma-4-E2B-it? The README mentions a Gemma-4-E4B loss of around 13-15, but I'm using E2B. For comparison, even a uniform prediction over a ~262K-token vocabulary (assuming E2B uses the usual Gemma vocabulary size) would give a per-token cross-entropy of only about ln(262144) ≈ 12.5, so 185.9 seems hard to explain as an ordinary per-token loss.
  3. Could the gradient offloading message be related? "Unsloth: Will smartly offload gradients to save VRAM!"
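
For reference on question 1: as far as I can tell, torch.nn.utils.clip_grad_norm_ returns the total gradient norm measured before clipping, and the HF Trainer logs that returned value as grad_norm, so the large numbers above may simply be pre-clipping norms rather than evidence that clipping is being skipped. A minimal standalone sketch of that behaviour (plain PyTorch, not my training code):

import torch

# Toy parameter with a deliberately huge gradient
p = torch.nn.Parameter(torch.ones(4))
p.grad = torch.full((4,), 1000.0)

# clip_grad_norm_ scales the gradient in place but returns the norm measured BEFORE clipping
pre_clip_norm = torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)

print(pre_clip_norm)   # tensor(2000.) -> the large value that ends up in the logs
print(p.grad.norm())   # ~1.0 -> the gradient really was clipped to max_norm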

Environment Information

Note: Please do not remove the questions. Answer beside them.

  1. Did you update? pip install --upgrade unsloth unsloth_zoo.

    • Yes, using latest versions via uv pip install -e .
  2. Colab or Kaggle or local / cloud.

    • Local server with H100 GPUs
  3. Number GPUs used, use nvidia-smi

    • 1x H100 80GB
  4. Which notebook? Please link!

    • Custom training script: train_ift.py (see below)
  5. Which Unsloth version, TRL version, transformers version, PyTorch version?

    • Unsloth: latest (installed via uv pip install unsloth)
    • TRL: latest
    • Transformers: latest
    • PyTorch: latest (with CUDA support)
  6. Which trainer? SFTTrainer, GRPOTrainer etc

    • SFTTrainer from TRL

Minimal Code to Reproduce

# train_ift.py - Simplified relevant parts
from unsloth import FastModel
from unsloth.chat_templates import get_chat_template, train_on_responses_only

import torch
from transformers import TrainingArguments
from trl import SFTTrainer

# Load model
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-4-E2B-it",
    max_seq_length=8192,
    load_in_4bit=False,
    load_in_8bit=False,
    full_finetuning=True,
    trust_remote_code=True,
)

# Apply chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="gemma-4",
)

# Training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-5,
    num_train_epochs=3,
    max_grad_norm=1.0,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.95,
    weight_decay=0.1,
)

# Dataset: 10 Vietnamese datasets (~650K samples)
# Format: Conversational with reasoning content

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=training_args,
)

# Apply response-only training: mask everything except the model turns
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|turn>user\n",
    response_part="<|turn>model\n",
)

# Train
trainer.train()

Configuration Details

Model: unsloth/gemma-4-E2B-it

  • Full fine-tuning (no LoRA)
  • Max sequence length: 8192
  • Chat template: "gemma-4"
  • Thinking enabled: Yes

Training Config:

  • Batch size: 1 per device
  • Gradient accumulation: 16 (effective batch = 16)
  • Learning rate: 5e-5
  • Warmup ratio: 3%
  • Max grad norm: 1.0 (but seeing 6880-8128!)
  • Optimizer: AdamW (adamw_torch)
  • Precision: bf16

Datasets:

  • 10 Vietnamese datasets (~650K samples total)
  • Mix: Education, Math, Legal, History, NLI, Summarization, etc.
  • Format: ShareGPT/Alpaca with reasoning content
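
For clarity, before templating the records are shaped roughly like this (field names and contents are illustrative placeholders from my preprocessing, not the exact data):

# Illustrative only: rough shape of one ShareGPT-style record with reasoning content
sharegpt_example = {
    "conversations": [
        {"from": "human", "value": "Question 1: What is the length of the correction pen? ..."},
        {"from": "gpt", "value": "Reasoning: the picture shows 9 cm ...\nAnswer: C"},
    ]
}

# Illustrative only: rough shape of one Alpaca-style record
alpaca_example = {
    "instruction": "252 students and 8 teachers go on a field trip ...",
    "input": "",
    "output": "The total cost is 2,152,500 won.",
}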

Expected vs Actual Behavior

Expected:

  • Gradient norm should be clipped to max_grad_norm (1.0)
  • Loss should start at a reasonable value (not 185.9)
  • Training should be stable from the start

Actual:

  • Gradient norms: 6880, 4512, 8128, 1344 (much higher than max_grad_norm)
  • Initial loss: 185.9 (extremely high)
  • Loss is decreasing but starting from a very high point

Sample Training Data (with Chat Template)

Here are examples of the actual training data with the Gemma-4 chat template applied:
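
Roughly how these strings are produced (simplified sketch; "question" and "answer" are placeholder field names from my preprocessing, and the system turn and thought channel are added earlier in the pipeline, so only the templating step is shown):

# Rough sketch of the rendering step; field names are placeholders
def render_sample(example, tokenizer):
    messages = [
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    # tokenize=False returns the templated string with the turn markers shown below
    return tokenizer.apply_chat_template(messages, tokenize=False)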

Example 1 (Vietnamese Math MCQ):

<|turn>system
<|think|>
<turn|>
<|turn>user
Câu 1: 
 
 Độ dài của chiếc bút xoá là:

Các đáp án:
A. 10 cm
B. 8 cm
C. 9 cm
D. 7 cm<turn|>
<|turn>model
<|channel>thought
Hướng dẫn giải 
 Đáp án đúng là: C 
 Quan sát hình vẽ ta thấy độ dài của chiếc bút xoá là 9 cm .
<channel|>
Đáp án: C<turn|>

Example 2 (English Math with Reasoning):

<|turn>system
<|think|>
<turn|>
<|turn>user
[MATH - medium] 252 fifth-grade students and 8 teachers at Yeji's school are going on a field trip. If the cost of renting a 41-seater bus is 300,000 won and the highway toll per bus is 7,500 won, how much does it cost to rent the bus and pay the toll?<turn|>
<|turn>model
<|channel>thought
Let me work through this problem step by step.

First, I need to understand what's being asked: 252 fifth-grade students and 8 teachers at Yeji's school are going on a field trip. If the cost of renting a 41-seater bus is 300,000 won and the highway toll per bus is 7,500 won, how much does it cost to rent the bus and pay the toll?

Key values given: 252, 8, 41, 300,000, 7,500, ,

My approach:
1. Find the total number of people going on the field trip

- Fifth-grade students: 252
- Teachers: 8
- Total people: 252 + 8 = 260 people
2. Calculate how many buses are needed

Each bus has 41 seats.

$$\text{Number of buses} = \frac{260}{41} = 6.34...$$

Since we cannot rent a partial bus, we must round up to the
3. whole number.

Buses needed: 7 buses

Verification: 
- 6 buses × 41 seats = 246 seats (not enough for 260 people)
- 7 buses × 41 seats = 287 seats (sufficient for 260 people) ✓
4. Calculate the bus rental cost

$$\text{Bus rental cost} = 7 \text{ buses} \times 300,000 \text{ won} = 2,100,000 \text{ won}$$
5. Calculate the highway toll

$$\text{Highway toll} = 7 \text{ buses} \times 7,500 \text{ won} = 52,500 \text{ won}$$
6. Calculate the total cost

$$\text{Total cost} = 2,100,000 + 52,500 = 2,152,500 \text{ won}$$

Final Answer

| Item | Calculation | Cost |
 Bus rental | 7 × 300,000 won | 2,100,000 won |
| Highway tol

This gives us the final answer: s
<channel|>
# Solution: Calculating Bus Rental and Toll Costs

## Step 1: Find the total number of people going on the field trip

- Fifth-grade students: 252
- Teachers: 8
- **Total people**: 252 + 8 = **260 people**

## Step 2: Calculate how many buses are needed

Each bus has 41 seats.

$$\text{Number of buses} = \frac{260}{41} = 6.34...$$

Since we cannot rent a partial bus, we must round up to the next whole number.

**Buses needed**: 7 buses

**Verification**: 
- 6 buses × 41 seats = 246 seats (not enough for 260 people)
- 7 buses × 41 seats = 287 seats (sufficient for 260 people) ✓

## Step 3: Calculate the bus rental cost

$$\text{Bus rental cost} = 7 \text{ buses} \times 300,000 \text{ won} = 2,100,000 \text{ won}$$

## Step 4: Calculate the highway toll

$$\text{Highway toll} = 7 \text{ buses} \times 7,500 \text{ won} = 52,500 \text{ won}$$

## Step 5: Calculate the total cost

$$\text{Total cost} = 2,100,000 + 52,500 = 2,152,500 \text{ won}$$

---

## Final Answer

| Item | Calculation | Cost |
|------|-------------|------|
| Bus rental | 7 × 300,000 won | 2,100,000 won |
| Highway toll | 7 × 7,500 won | 52,500 won |
| **Total** | | **2,152,500 won** |

**The total cost to rent the buses and pay the toll is 2,152,500 won.**<turn|>

Notes:

  • Using train_on_responses_only with instruction_part="<|turn>user\n" and response_part="<|turn>model\n" (a masking sanity check is sketched after these notes)
  • The <|channel>thought blocks contain reasoning content that is also being trained on
  • Mix of Vietnamese and English content
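
To sanity-check the masking mentioned in the first note, I decode only the tokens whose label is not -100 after train_on_responses_only has been applied; a rough sketch of that check (assuming the processed dataset carries input_ids and labels, as it does in my run):

# Rough sanity check: decode only the unmasked tokens (label != -100)
# to confirm the loss is computed on the model turns only
sample = trainer.train_dataset[0]
unmasked = [tok for tok, lab in zip(sample["input_ids"], sample["labels"]) if lab != -100]
print(tokenizer.decode(unmasked))   # should print only the response/model-turn text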

Additional Context

  • The "Unsloth: Will smartly offload gradients to save VRAM!" message appears at the start
  • Using train_on_responses_only with the Gemma-4 chat template
  • Training on Vietnamese instruction datasets with reasoning content
  • H100 80GB GPU, bf16 precision
  • This is continued fine-tuning of an already instruction-tuned model (Gemma-4-E2B-it)

Is this behavior normal for fine-tuning an already instruction-tuned model on a different language/domain? Or is there something wrong with my configuration?

Thank you for your help!
