Fine-Tuning: Customizing LLMs
Dec 29, 2025 • 22 min read
Prompt engineering is powerful, but it has hard limits: it can't change how a model reasons, can't permanently teach it a new skill, and consumes expensive tokens on every request. Fine-tuning lets you update a model's weights to specialize it for your task — reducing prompt length, improving consistency, and often enabling capabilities that prompting alone can't achieve.
1. When to Fine-Tune vs. When to Prompt
| Scenario | Use Prompt Engineering | Use Fine-Tuning |
|---|---|---|
| Output format (JSON, XML) | ✓ Works well with structured output | ✓ More consistent at scale |
| Domain knowledge (facts) | ✓ Use RAG instead | ✗ Poor at storing facts; hallucination risk |
| Tone & style | ✓ Few-shot examples in prompt | ✓ Permanent, no prompt overhead |
| New task type | ✓ Try zero-shot first | ✓ When 0-shot quality < threshold |
| Latency reduction | ✗ Prompting won't help | ✓ Shorter prompts = lower latency and cost |
| Privacy (no data to API) | ✗ | ✓ Fine-tune open-source model |
2. Full Fine-Tuning vs. LoRA vs. QLoRA
You almost never want full fine-tuning. Here's why:
- Full Fine-Tuning: Updates all 7-70 billion parameters. Gradients and optimizer states mean even a 7B model needs roughly 80-140 GB of VRAM. Takes hours on A100s and costs hundreds of dollars per run, with real risk of catastrophic forgetting.
- LoRA (Low-Rank Adaptation): Freezes original weights, adds small trainable "adapter" matrices. Updates less than 1% of parameters. Requires 16-24 GB VRAM for a 7B model. Same quality as full fine-tuning on most tasks.
- QLoRA: LoRA + 4-bit quantization of the frozen base model. Fine-tune a 7-13B model on a single 24 GB GPU (the original QLoRA work tuned a 65B model on one 48 GB GPU). Minimal quality loss vs LoRA. The gold standard for resource-constrained fine-tuning.
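The "less than 1% of parameters" claim is easy to verify with back-of-the-envelope arithmetic. A quick sketch in plain Python (the 4096 x 4096 projection size is an assumption matching Llama-class 7-8B models):

```python
# LoRA replaces the full update to a d_out x d_in weight matrix with
# two low-rank factors B (d_out x r) and A (r x d_in).
def full_params(d_out: int, d_in: int) -> int:
    return d_out * d_in

def lora_params(d_out: int, d_in: int, r: int) -> int:
    return d_out * r + r * d_in

# A single 4096 x 4096 attention projection at rank 16:
d = 4096
print(f"full:  {full_params(d, d):,}")      # 16,777,216
print(f"lora:  {lora_params(d, d, 16):,}")  # 131,072
print(f"ratio: {lora_params(d, d, 16) / full_params(d, d):.3%}")  # 0.781%
```

Roughly 0.8% of each targeted projection is trainable at rank 16, which is where the sub-1% figure comes from.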
3. Dataset Preparation
Data quality is more important than training duration. The standard format is JSONL (one JSON example per line) in instruction-following format:
# Alpaca format (most common)
{"instruction": "Classify this support ticket by urgency.", "input": "My account was charged twice!", "output": "URGENT"}
{"instruction": "Classify this support ticket by urgency.", "input": "How do I change my profile picture?", "output": "LOW"}
# ShareGPT / ChatML format (for multi-turn fine-tuning)
{"conversations": [
{"from": "system", "value": "You are a customer support agent for Acme Corp."},
{"from": "human", "value": "I can't log in to my account"},
{"from": "gpt", "value": "I'm sorry to hear that. Let me help you reset your password..."}
]}
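Whichever format you choose, it pays to validate the file before burning GPU hours. A minimal sketch for Alpaca-style records (the `validate_jsonl` helper and its checks are illustrative, not from any library):

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_jsonl(path: str) -> list[str]:
    """Collect human-readable problems found in an Alpaca-format JSONL file."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append(f"line {i}: invalid JSON ({e.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {i}: not a JSON object")
                continue
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                problems.append(f"line {i}: missing keys {sorted(missing)}")
            elif not str(record["output"]).strip():
                problems.append(f"line {i}: empty output")
    return problems
```

Running this over your training file and fixing every reported line is cheap insurance against a silently corrupted run.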
A minimum viable dataset is 200-500 examples for task specialization; production quality usually means 2,000-10,000 examples with careful curation.

Generating Synthetic Training Data
If you don't have labeled examples, use GPT-4 to generate them from your documents:
from openai import OpenAI
import json
client = OpenAI()
def generate_training_examples(document: str, n: int = 10) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Generate {n} question-answer pairs from this document.
Return as a JSON object: {{"examples": [{{"instruction": "...", "input": "", "output": "..."}}]}}
Document: {document}""",
        }],
        # json_object mode guarantees valid JSON but always returns an object,
        # so we ask for the array under an "examples" key
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["examples"]

4. Fine-Tuning with Unsloth (Fastest Setup)
Unsloth makes QLoRA fine-tuning 2-5x faster and uses 60% less VRAM than raw HuggingFace. Best for getting started quickly on Google Colab or Runpod:
# Install (Google Colab T4 GPU — free)
!pip install "unsloth[colab-new]" xformers trl
from unsloth import FastLanguageModel
import torch
# Load base model in 4-bit (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.1-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,         # auto-detect
    load_in_4bit=True,  # QLoRA quantization
)
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank (higher = more trainable params)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # reduces VRAM usage
)
# Standard HuggingFace SFT Trainer
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=your_dataset,
    dataset_text_field="text",  # column containing formatted examples
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size = 2 x 4 = 8
        warmup_steps=5,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        output_dir="./outputs",
        save_strategy="epoch",
    ),
)
trainer.train()

5. Fine-Tuning with Axolotl (Production Config)
Axolotl is the best config-driven fine-tuner for production use. Define everything in YAML, run with a single command:
# config.yml (Axolotl QLoRA config for Llama 3.1 8B)
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
load_in_4bit: true
datasets:
  - path: ./data/train.jsonl
    type: alpaca # dataset format
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
sequence_len: 2048
sample_packing: true # Pack multiple short examples into one sequence
pad_to_sequence_len: true
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 2e-4
optimizer: adamw_bnb_8bit # Memory-efficient optimizer
lr_scheduler: cosine
output_dir: ./model-out
save_total_limit: 3
wandb_project: my-fine-tune # Optional experiment tracking
# Run training:
# axolotl train config.yml

6. Saving and Using the Fine-Tuned Model
# Save merged model (base + LoRA weights combined)
model.save_pretrained_merged("my-fine-tuned-model", tokenizer, save_method="merged_16bit")
# Or save as GGUF for llama.cpp / Ollama deployment
model.save_pretrained_gguf("my-model-gguf", tokenizer, quantization_method="q4_k_m")
# Use with Ollama:
# ollama create my-custom-model -f ./Modelfile
# where Modelfile contains: FROM ./my-model-gguf/model-q4_k_m.gguf
# Or upload to HuggingFace Hub (merges LoRA weights before pushing)
model.push_to_hub_merged("your-username/my-fine-tuned-model", tokenizer, save_method="merged_16bit")

7. Evaluating Your Fine-Tuned Model
Never ship a fine-tuned model without evaluation. Compare it against the base model on a held-out test set:
from evaluate import load

# For classification tasks
accuracy = load("accuracy")

def predict(model, x):
    # Placeholder: generate with `model` on example x and decode the
    # response to a label string (e.g. "URGENT"); depends on your stack.
    ...

base_predictions = [predict(base_model, x) for x in test_inputs]
ft_predictions = [predict(fine_tuned_model, x) for x in test_inputs]
ground_truth = [x["output"] for x in test_data]

base_score = accuracy.compute(predictions=base_predictions, references=ground_truth)
ft_score = accuracy.compute(predictions=ft_predictions, references=ground_truth)

print(f"Base model: {base_score['accuracy']:.2%}")
print(f"Fine-tuned: {ft_score['accuracy']:.2%}")
print(f"Improvement: {ft_score['accuracy'] - base_score['accuracy']:+.2%}")

Frequently Asked Questions
How much training data do I need?
For task specialization (teaching a new format or style): 200-500 high-quality examples. For domain adaptation (teaching new knowledge domain): 2,000-10,000 examples. For instruction following (aligning a base model): 10,000-50,000 multi-turn conversations. Quality always beats quantity — 200 carefully curated examples outperform 2,000 low-quality ones.
Will fine-tuning affect capabilities the model already has?
Yes — this is called catastrophic forgetting. If you fine-tune only on customer support data, coding ability may degrade. Mitigation: include diverse general-purpose examples (5-10% of your dataset from open benchmarks like OpenHermes), use LoRA instead of full fine-tuning, and keep learning rate low (2e-4 to 2e-5).
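The 5-10% mixing step can be sketched as a small helper (`mix_datasets` and its `general_fraction` parameter are hypothetical names, not a library API):

```python
import random

def mix_datasets(task_examples: list, general_examples: list,
                 general_fraction: float = 0.08, seed: int = 0) -> list:
    """Blend general-purpose examples into a task dataset so the final
    mix is ~8% general, ~92% task-specific (to reduce forgetting)."""
    rng = random.Random(seed)
    # Solve n / (len(task) + n) = general_fraction for n
    n_general = int(len(task_examples) * general_fraction / (1 - general_fraction))
    mixed = task_examples + rng.sample(general_examples,
                                       min(n_general, len(general_examples)))
    rng.shuffle(mixed)
    return mixed
```

With 1,000 support-ticket examples this pulls in roughly 86 general examples, landing inside the 5-10% range.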
Is fine-tuning worth the cost?
Calculate the ROI: if fine-tuning reduces your system prompt from 2,000 to 200 tokens per request, and you serve 1M requests/month, you save 1.8B tokens/month. At GPT-4o prices, that's $9,000/month in savings vs $100-300 one-time fine-tuning cost. For high-volume production systems, fine-tuning almost always pays off.
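That savings figure is straightforward to reproduce (the $5 per million input tokens rate is an assumption; check current pricing for your model):

```python
# Reproducing the ROI estimate above
requests_per_month = 1_000_000
tokens_saved_per_request = 2_000 - 200  # system prompt shrinks from 2,000 to 200 tokens
price_per_million_input = 5.00          # assumed USD rate per 1M input tokens

tokens_saved = requests_per_month * tokens_saved_per_request  # 1.8B tokens/month
monthly_savings = tokens_saved / 1_000_000 * price_per_million_input
print(f"${monthly_savings:,.0f}/month saved")  # $9,000/month saved
```

Even at half that token price, the one-time fine-tuning cost is recovered within the first month.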
Conclusion
Fine-tuning has gone from a research topic requiring PhDs and GPU clusters to a weekend project thanks to QLoRA, Unsloth, and Axolotl. The combination of a free Google Colab T4 GPU and Unsloth lets any developer fine-tune a 7-8B model on custom data. Start with Unsloth for experiments, graduate to Axolotl for production pipelines, and always evaluate against the base model before deploying.
Vivek
AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.