
How to Fine-Tune Google Gemma 270M with Unsloth and QLoRA


What is Google Gemma?

Google Gemma is a family of lightweight, open-weight large language models built from the research behind Gemini. The Gemma 3 release spans five parameter sizes: 270M, 1B, 4B, 12B, and 27B, all under commercial-friendly licensing. The 270M and 1B models handle text-only input with a 32K context window, while the larger variants (4B, 12B, and 27B) accept both text and images, offer a 128K context window, and support 140+ languages.

When to fine-tune LLMs

Fine-tuning makes sense when you’ve hit the ceiling of what prompting alone can achieve. Common scenarios include:

  • Domain-specific terminology that base models don’t understand
  • Custom response formats (exact JSON schemas, ICD-10 codes, legal clause extraction)
  • Specialized tasks where prompt engineering fails consistently
  • Behavior modifications beyond what prompting can achieve

When NOT to fine-tune:

  • Prompt engineering with a larger base model achieves acceptable results
  • Lack of quality training data (minimum 100-500 examples, 1,000-5,000 recommended for production)
  • Task can be solved with retrieval-augmented generation (RAG)

Why Gemma 270M/1B are ideal for learning:

  • Train in 10-15 minutes on free Colab T4 GPUs
  • Require minimal VRAM (6-8GB with 4-bit quantization)
  • LoRA adapters store just 50-100MB vs full 2GB model
  • Demonstrate clear improvements on focused tasks

Setting up Unsloth on Google Colab

Let’s start by creating a new Google Colab notebook. Head to the Google Colab website and click “New notebook”. Then switch the runtime to a T4 GPU via Runtime → Change runtime type → T4 GPU. The T4 provides 16GB of VRAM, which is more than enough for fine-tuning Gemma 270M and 1B.

In the first cell of your notebook, install Unsloth:

pip install unsloth

Run this cell (Shift+Enter). Installation takes about 1-2 minutes.

Why Unsloth over standard Transformers:

  • 2-10x faster training with 70% less VRAM
  • Cleaner API with fewer lines of code
  • Automatic Flash Attention 2 and optimized kernels
  • Gemma 3 now uses Flex-Attention (3x faster, O(N) memory vs O(N²))

Next, set up Hugging Face authentication. Generate an access token at hf.co/settings/tokens with write permissions.

In a new cell, add your authentication:

from huggingface_hub import login
login(token="hf_...")

Or use Colab secrets: click the key icon in the left sidebar, add HF_TOKEN, paste your token, and enable notebook access.
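If you use Colab secrets, a small helper can fall back from the secret to an environment variable, which keeps the token out of the notebook itself. Note that `get_hf_token` is a hypothetical convenience function, not part of the Unsloth or Hugging Face API:

```python
import os

def get_hf_token():
    """Fetch the Hugging Face token from Colab secrets, else the HF_TOKEN env var."""
    try:
        from google.colab import userdata  # only importable inside Colab
        return userdata.get("HF_TOKEN")
    except ImportError:
        return os.environ.get("HF_TOKEN")

# login(token=get_hf_token())
```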

Verify GPU availability:

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU name: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

Expected output shows CUDA available:

CUDA available: True
GPU name: Tesla T4
VRAM: 15.64 GB

The environment setup completes in under two minutes. Free Colab T4 instances work perfectly for Gemma 270M and 1B models. With GPU confirmed and Unsloth installed, the next step covers loading models with 4-bit quantization.

Loading Gemma 270M with 4-bit quantization

Now let’s load the model. In a new cell, add:

from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-3-270m-it",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

Run this cell. The model downloads (takes 1-2 minutes on first run) and loads with 4-bit quantization, reducing memory from 2GB to approximately 500MB. The max_seq_length = 2048 sets the context window, and dtype = None lets Unsloth choose the optimal data type automatically.
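The memory saving is easy to sanity-check with back-of-envelope arithmetic. This counts weights only; the runtime footprint is larger because embeddings stay in higher precision and activations and the CUDA context add overhead, which is why the observed usage is closer to 500MB:

```python
params = 270e6               # Gemma 270M parameter count

weights_fp16 = params * 2    # 2 bytes per parameter
weights_4bit = params * 0.5  # 0.5 bytes per parameter

print(f"fp16 weights:  {weights_fp16 / 1e6:.0f} MB")  # 540 MB
print(f"4-bit weights: {weights_4bit / 1e6:.0f} MB")  # 135 MB
```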

Instruction dataset format requirements:

  • Conversational structure with user/assistant turns
  • Clear question-answer pairs for supervised fine-tuning
  • System prompts optional but can improve task-specific behavior
  • Token lengths must fit within the 2048 context window
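The conversational structure above relies on Gemma's chat turn markers. As a rough single-turn illustration (in practice, `tokenizer.apply_chat_template` produces this format for you):

```python
def to_gemma_chat(user_msg, assistant_msg):
    """Sketch of Gemma's single-turn chat format with turn markers."""
    return (f"<start_of_turn>user\n{user_msg}<end_of_turn>\n"
            f"<start_of_turn>model\n{assistant_msg}<end_of_turn>\n")

print(to_gemma_chat("What is LoRA?", "LoRA trains small adapter matrices."))
```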

In the next cell, load a pre-existing instruction dataset:

from datasets import load_dataset
dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")
print(f"Dataset size: {len(dataset)}")
print("Dataset columns:", dataset.column_names)
print("\nFirst example:")
print(dataset[0])

The dataset loads with 1,000 high-quality instruction examples. The output shows dataset size, column names, and first example structure.

The Guanaco dataset already contains formatted conversations in a “text” field, so we can use it directly. If you would rather use a dataset with separate instruction/output fields (such as tatsu-lab/alpaca), load that instead:

# Alternative: use alpaca format dataset
# dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")
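If you do load an Alpaca-style dataset, its instruction/input/output columns must be collapsed into a single text field first. One possible mapping is sketched below; `format_alpaca` is a hypothetical helper, and the prompt template shown is one common convention, not the only one:

```python
def format_alpaca(example):
    """Collapse Alpaca-style columns into a single training string."""
    instruction = example["instruction"]
    extra = example.get("input", "")
    prompt = f"{instruction}\n\n{extra}" if extra else instruction
    return {"text": f"### Instruction:\n{prompt}\n\n### Response:\n{example['output']}"}

# dataset = dataset.map(format_alpaca)
```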

For the guanaco dataset, the conversations are already formatted. In a new cell, verify it’s ready:

# Guanaco dataset is already formatted in the "text" column
formatted_dataset = dataset
# Verify the format
print("Sample conversation:")
print(formatted_dataset[0]["text"][:500]) # Print first 500 characters

The dataset is ready to use as-is. In another cell, check token lengths to avoid truncation:

# Check token lengths of first 100 examples
sample_texts = [x["text"] for x in formatted_dataset.select(range(min(100, len(formatted_dataset))))]
tokenized_lengths = [len(tokenizer.encode(text)) for text in sample_texts]
print(f"Average length: {sum(tokenized_lengths) / len(tokenized_lengths):.0f} tokens")
print(f"Max length: {max(tokenized_lengths)} tokens")
print(f"Examples over 2048 tokens: {sum(1 for l in tokenized_lengths if l > 2048)}")

If maximum token length exceeds 2048, either increase max_seq_length (requires more VRAM) or filter long examples. With the model loaded and dataset prepared, the training configuration comes next.
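The filtering option can be sketched generically. In the notebook, `length_fn` would be `lambda t: len(tokenizer.encode(t))`; here any length function works, which keeps the sketch self-contained:

```python
def filter_by_length(rows, length_fn, max_tokens=2048):
    """Drop rows whose text exceeds the context window."""
    return [row for row in rows if length_fn(row["text"]) <= max_tokens]

# In the notebook, roughly:
# formatted_dataset = formatted_dataset.filter(
#     lambda row: len(tokenizer.encode(row["text"])) <= 2048
# )
```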

Fine-tuning Gemma with QLoRA and Unsloth

The fine-tuning process involves three stages: configuring LoRA adapters, setting training hyperparameters, and monitoring training progress. Training completes in 10-15 minutes on a T4 GPU.

Configuring LoRA rank and alpha parameters

In a new cell, add:

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0.1,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)

This applies LoRA adapters to the model.

LoRA parameter breakdown:

  • r = 16: Rank controls adapter capacity (16 works for most tasks, 32-64 for complex domains)
  • lora_alpha = 16: Scaling factor, typically set equal to rank
  • lora_dropout = 0.1: Prevents overfitting on small datasets
  • target_modules: Applies LoRA to attention and MLP layers (Unsloth auto-detects these for Gemma)

The rank determines trainable parameters. Rank 16 adds approximately 50MB of trainable weights.
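The parameter count follows from LoRA's construction: each adapted weight matrix W (shape d_out × d_in) stays frozen while a low-rank update BA is trained, with A of shape r × d_in and B of shape d_out × r. A quick estimator (the shapes below are illustrative, not Gemma's actual layer dimensions):

```python
def lora_param_count(layer_shapes, r=16):
    """Trainable parameters: r * (d_in + d_out) per adapted matrix."""
    return sum(r * (d_in + d_out) for (d_out, d_in) in layer_shapes)

# Two illustrative 1024x1024 projection matrices at rank 16:
print(lora_param_count([(1024, 1024), (1024, 1024)], r=16))  # 65536
```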

With LoRA configured, the next step sets training hyperparameters and initializes the trainer.

Setting training hyperparameters for optimal results

In the next cell, set up training hyperparameters:

from trl import SFTTrainer
from transformers import TrainingArguments
training_args = TrainingArguments(
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,
    warmup_steps = 10,
    max_steps = 100,
    learning_rate = 2e-4,
    fp16 = not torch.cuda.is_bf16_supported(),
    bf16 = torch.cuda.is_bf16_supported(),
    logging_steps = 10,
    output_dir = "outputs",
)
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = formatted_dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = training_args,
)

The trainer initializes with your configuration.

Training hyperparameter notes:

  • batch_size = 2 with gradient_accumulation_steps = 4 gives effective batch size of 8
  • learning_rate = 2e-4 works well for most tasks (lower for stability, higher for faster convergence)
  • max_steps = 100 completes in 10-15 minutes on a T4 GPU with the 1K dataset

Adjust max_steps based on dataset size (1-3 epochs is typically sufficient). With the trainer configured, the final step starts training and monitors loss.
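The batch-size and epoch arithmetic above works out as follows:

```python
per_device_batch = 2
grad_accum_steps = 4
effective_batch = per_device_batch * grad_accum_steps  # 8 sequences per optimizer step

dataset_size = 1000
steps_per_epoch = dataset_size // effective_batch      # 125 steps is roughly one epoch
print(effective_batch, steps_per_epoch)                # 8 125
```

So max_steps = 100 covers a bit less than one full pass over the 1,000-example dataset.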

Monitoring training progress and loss curves

In a new cell, start training:

trainer.train()

Run this cell. Training begins and you’ll see a Weights & Biases (wandb) prompt:

wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice:

For beginners, enter 3 to skip wandb tracking. This continues training without additional setup. If you want to track experiments with wandb later, create a free account at wandb.ai and choose option 1 or 2.

Training displays loss values every 10 steps. Expect a steady downward trend; the final value depends on task complexity, dataset size, and the number of steps. The process takes 10-15 minutes on a T4 GPU.

Training completes with the adapter weights stored separately from the base model. In this short 100-step run, the final loss lands around 2.8; the downward trend, more than the absolute value, indicates the model learned from the dataset. The next section tests the fine-tuned model against the base model to measure improvements.

Evaluating the Fine-tuned Gemma model

Evaluation compares the fine-tuned model’s responses against the base model to measure improvements. This involves creating test prompts, generating responses from both models, and analyzing quality differences.

Designing effective test prompts for evaluation

Create test prompts that weren’t in the training data. In a new cell, add:

# Test prompts - use diverse examples from your task domain
test_prompts = [
    "Explain the concept of machine learning in simple terms.",
    "What are the benefits of using Python for data science?",
    "How does a neural network learn from data?",
]

Choose prompts that represent the types of questions or tasks your model should handle. Avoid using exact examples from the training dataset to test generalization.
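A crude way to enforce the no-overlap rule is to scan the training text for each prompt. This is only a verbatim check (a sketch; fuzzy matching would also catch paraphrases, which this does not):

```python
def appears_in_training(prompt, train_texts):
    """True if the prompt occurs verbatim in any training example."""
    return any(prompt in text for text in train_texts)

train_texts = ["Explain gradient descent step by step."]
print(appears_in_training("Explain gradient descent step by step.", train_texts))  # True
print(appears_in_training("What is overfitting?", train_texts))                    # False
```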

With test prompts ready, the next step generates responses from both the base and fine-tuned models.

Comparing model response quality before and after

First, test the base model. In a new cell, add:

# Load base model for comparison
base_model, base_tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-3-270m-it",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
# Enable faster inference
FastLanguageModel.for_inference(base_model)
# Test base model
print("=== BASE MODEL RESPONSES ===\n")
for prompt in test_prompts:
    inputs = base_tokenizer([prompt], return_tensors="pt").to("cuda")
    outputs = base_model.generate(**inputs, max_new_tokens=128, temperature=0.7)
    response = base_tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Prompt: {prompt}")
    print(f"Response: {response}\n")

Run this cell to generate baseline responses for comparison. You’ll see Unsloth notices about downloads and precision (the red download bars and float16/float32 messages are normal and can be ignored).

Expected Output:

Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
=== BASE MODEL RESPONSES ===
Prompt: Explain the concept of machine learning in simple terms.
Response: Explain the concept of machine learning in simple terms.
Machine learning is a powerful and versatile approach to solve complex problems by using data to learn from patterns and insights. It's not a single, single-purpose machine; it's a broad, multi-faceted approach. Here's a breakdown of key concepts:
* **Data:** The foundation. Data is the raw, un-processed, and often, even noisy information that is fed into the machine learning model. It's the fuel that drives the learning process.
* **Model:** The core of the machine learning. A model is a statistical model (a set of rules)...
Prompt: What are the benefits of using Python for data science?
Response: What are the benefits of using Python for data science?
The following is a list of common Python libraries:
- NumPy
- Pandas
- Scikit-learn
- Matplotlib...

In the next cell, test the fine-tuned model:

# Enable faster inference for fine-tuned model
FastLanguageModel.for_inference(model)
# Test fine-tuned model
print("=== FINE-TUNED MODEL RESPONSES ===\n")
for prompt in test_prompts:
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.7)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Prompt: {prompt}")
    print(f"Response: {response}\n")

The fine-tuned model’s responses should show improvements based on your training data. Look for better structure, more relevant information, and reduced prompt repetition compared to the base model.

Expected output:

=== FINE-TUNED MODEL RESPONSES ===
Prompt: Explain the concept of machine learning in simple terms.
Response: Explain the concept of machine learning in simple terms. Machine Learning is a field that focuses on teaching computers to learn from data without human intervention. This involves teaching the machine a series of tasks, like classification, regression, or clustering, to automatically identify patterns in data or build more sophisticated models.
Here's a breakdown of the key components of Machine Learning:
1. **Data:** This is the source of information. It includes the dataset that is used to train the machine learning model...
Prompt: What are the benefits of using Python for data science?
Response: What are the benefits of using Python for data science?
Python is widely used in data science, and many of the popular libraries like Scikit-Python are written in Python. Python is an interpreted programming language, which means it is suitable for machine learning, because it is easy to write and quickly...
Prompt: How does a neural network learn from data?
Response: How does a neural network learn from data?
Our brains use neural networks to learn new skills and patterns. For example, a machine learning model can learn to identify images of cats, dogs, cars, or even entire buildings by learning from thousands of images...

With 100 training steps on 1,000 examples, expect modest improvements. The fine-tuned model learns patterns from the training data but won’t match larger models or heavily trained systems. This tutorial demonstrates the fine-tuning workflow; production models require more training steps, larger datasets, and careful hyperparameter tuning.

If the fine-tuned model shows improvements in structure and terminology, the fine-tuning process successfully adapted the model to your data.
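Both models echo the prompt at the start of their output (visible in the transcripts above). A small helper (hypothetical, not an Unsloth utility) trims that echo so side-by-side comparisons focus on the generated text:

```python
def strip_prompt_echo(prompt, response):
    """Remove the echoed prompt from the beginning of a generated response."""
    if response.startswith(prompt):
        return response[len(prompt):].lstrip()
    return response

print(strip_prompt_echo("What is LoRA?", "What is LoRA? LoRA trains small adapters."))
# LoRA trains small adapters.
```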

Conclusion

This tutorial covered fine-tuning Gemma 270M using Unsloth and QLoRA on a free Google Colab T4 GPU. The complete workflow finished in under 20 minutes, demonstrating that parameter-efficient fine-tuning makes specialized models accessible without expensive hardware.

What we covered:

  • Loaded Gemma 270M with 4-bit quantization (500MB memory footprint)
  • Configured LoRA adapters (rank 16, ~50MB trainable parameters)
  • Trained on 1,000 instruction examples in 10-15 minutes
  • Evaluated improvements comparing base and fine-tuned responses
  • Achieved 2-10x faster training with 70% less VRAM vs standard methods

For better results, increase training to 300-500 steps, collect 2,000-5,000 high-quality examples, or try Gemma 1B for increased capacity. The Gemma model family provides open-weight models suitable for commercial use, making fine-tuning accessible for production applications.

To learn more about how LLMs work under the hood and explore different model architectures, check out the Intro to Large Language Models (LLMs) course on Codecademy.

Frequently asked questions

1. Can Gemma be fine-tuned?

Yes, all Gemma models support fine-tuning. The 270M and 1B variants work on free Google Colab T4 GPUs with 4-bit quantization. Larger models (4B, 12B, 27B) require GPUs with more VRAM. Gemma uses a commercial-friendly license that permits fine-tuning for business applications.

2. How to fine-tune Gemma 3 locally?

Install Unsloth locally with pip install unsloth, then follow the same workflow. Local fine-tuning requires NVIDIA GPU with minimum 8GB VRAM for Gemma 270M or 16GB VRAM for Gemma 1B. Use load_in_4bit=True for memory efficiency. Training on local GPUs matches Colab performance but offers more control over hardware and data privacy.

3. Is it possible to fine-tune a GGUF model?

No, GGUF models are quantized for inference only and cannot be fine-tuned directly. Fine-tune the original Hugging Face model first using Unsloth or standard methods, then convert the fine-tuned model to GGUF format using model.save_pretrained_gguf(). This workflow preserves fine-tuning benefits while enabling efficient local deployment.

4. How to fine-tune Gemma 7B?

Gemma 7B requires a different setup. Use an A100 GPU (40GB or 80GB) or multiple T4 GPUs with model parallelism. Load with model_name = "unsloth/gemma-7b" and increase max_seq_length to 4096 for more context. Training takes 1-2 hours with 1,000 examples. QLoRA enables 7B fine-tuning on 24GB GPUs, though with smaller batch sizes.

Codecademy Team

The Codecademy Team, composed of experienced educators and tech experts, is dedicated to making tech skills accessible to all. We empower learners worldwide with expert-reviewed content that develops and enhances the technical skills needed to advance and succeed in their careers.


Learn more on Codecademy

  • Master the art of LLM finetuning with LoRA, QLoRA, and Hugging Face. Learn how to prepare, train and optimize models for specific tasks efficiently.
    • With Certificate
    • Intermediate.
      3 hours
  • Learn to build production-ready neural networks with PyTorch, including finetuning transformers, in this hands-on path.
    • Includes 6 Courses
    • With Certificate
    • Intermediate.
      8 hours
  • AI Engineers build complex systems using foundation models, LLMs, and AI agents. You will learn how to design, build, and deploy AI systems.
    • Includes 16 Courses
    • With Certificate
    • Intermediate.
      20 hours