Fine-Tuning: Customizing LLMs
Dec 29, 2025 • 22 min read
Prompt engineering is powerful, but it has hard limits: it can't change how a model reasons, can't permanently teach it a new skill, and consumes expensive tokens on every request. Fine-tuning lets you update a model's weights to specialize it for your task — reducing prompt length, improving consistency, and often enabling capabilities that prompting alone can't achieve.
1. When to Fine-Tune vs. When to Prompt
| Scenario | Use Prompt Engineering | Use Fine-Tuning |
|---|---|---|
| Output format (JSON, XML) | ✓ Works well with structured output | ✓ More consistent at scale |
| Domain knowledge (facts) | ✓ Use RAG instead | ✗ Poor at storing facts; hallucination risk |
| Tone & style | ✓ Few-shot examples in prompt | ✓ Permanent, no prompt overhead |
| New task type | ✓ Try zero-shot first | ✓ When 0-shot quality < threshold |
| Latency reduction | ✗ Prompting won't help | ✓ Shorter prompts = lower latency and cost |
| Privacy (no data to API) | ✗ | ✓ Fine-tune open-source model |
2. Full Fine-Tuning vs. LoRA vs. QLoRA
You almost never want full fine-tuning. Here's why:
- Full Fine-Tuning: Updates all 7-70 billion parameters. Gradients and optimizer states mean even a 7B model needs roughly 80-140 GB of VRAM. Takes hours on A100s and costs hundreds of dollars per run, with real risk of catastrophic forgetting.
- LoRA (Low-Rank Adaptation): Freezes original weights, adds small trainable "adapter" matrices. Updates less than 1% of parameters. Requires 16-24 GB VRAM for a 7B model. Same quality as full fine-tuning on most tasks.
- QLoRA: LoRA + 4-bit quantization of the frozen base model. Fine-tune a 7-13B model on a single 24 GB GPU (the original QLoRA work tuned a 65B model on one 48 GB GPU). Minimal quality loss vs LoRA. The gold standard for resource-constrained fine-tuning.
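The "less than 1% of parameters" claim is easy to verify with back-of-the-envelope arithmetic. A quick sketch in plain Python (the 4096 x 4096 projection size is an assumption matching Llama-class 7-8B models):

```python
# LoRA replaces the full update to a d_out x d_in weight matrix with
# two low-rank factors B (d_out x r) and A (r x d_in).
def full_params(d_out: int, d_in: int) -> int:
    return d_out * d_in

def lora_params(d_out: int, d_in: int, r: int) -> int:
    return d_out * r + r * d_in

# A single 4096 x 4096 attention projection at rank 16:
d = 4096
print(f"full:  {full_params(d, d):,}")      # 16,777,216
print(f"lora:  {lora_params(d, d, 16):,}")  # 131,072
print(f"ratio: {lora_params(d, d, 16) / full_params(d, d):.3%}")  # 0.781%
```

Roughly 0.8% of each targeted projection is trainable at rank 16, which is where the sub-1% figure comes from.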
3. Dataset Preparation
Data quality is more important than training duration. The standard format is JSONL (one JSON example per line) in instruction-following format:
# Alpaca format (most common)
{"instruction": "Classify this support ticket by urgency.", "input": "My account was charged twice!", "output": "URGENT"}
{"instruction": "Classify this support ticket by urgency.", "input": "How do I change my profile picture?", "output": "LOW"}
# ShareGPT / ChatML format (for multi-turn fine-tuning)
{"conversations": [
{"from": "system", "value": "You are a customer support agent for Acme Corp."},
{"from": "human", "value": "I can't log in to my account"},
{"from": "gpt", "value": "I'm sorry to hear that. Let me help you reset your password..."}
]}
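Whichever format you choose, it pays to validate the file before burning GPU hours. A minimal sketch for Alpaca-style records (the `validate_jsonl` helper and its checks are illustrative, not from any library):

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_jsonl(path: str) -> list[str]:
    """Collect human-readable problems found in an Alpaca-format JSONL file."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append(f"line {i}: invalid JSON ({e.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {i}: not a JSON object")
                continue
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                problems.append(f"line {i}: missing keys {sorted(missing)}")
            elif not str(record["output"]).strip():
                problems.append(f"line {i}: empty output")
    return problems
```

Running this over your training file and fixing every reported line is cheap insurance against a silently corrupted run.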
A minimum viable dataset is 200-500 examples for task specialization; production quality usually means 2,000-10,000 examples with careful curation.

Generating Synthetic Training Data
If you don't have labeled examples, use GPT-4 to generate them from your documents:
from openai import OpenAI
import json
client = OpenAI()
def generate_training_examples(document: str, n: int = 10) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Generate {n} question-answer pairs from this document.
Return as a JSON object: {{"examples": [{{"instruction": "...", "input": "", "output": "..."}}]}}
Document: {document}""",
        }],
        # json_object mode guarantees valid JSON but always returns an object,
        # so we ask for the array under an "examples" key
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["examples"]

4. Fine-Tuning with Unsloth (Fastest Setup)
Unsloth makes QLoRA fine-tuning 2-5x faster and uses 60% less VRAM than raw HuggingFace. Best for getting started quickly on Google Colab or Runpod:
# Install (Google Colab T4 GPU — free)
!pip install "unsloth[colab-new]" xformers trl
from unsloth import FastLanguageModel
import torch
# Load base model in 4-bit (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.1-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,         # auto-detect
    load_in_4bit=True,  # QLoRA quantization
)
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank (higher = more trainable params)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # reduces VRAM usage
)
# Standard HuggingFace SFT Trainer
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=your_dataset,
    dataset_text_field="text",  # column containing formatted examples
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size = 2 x 4 = 8
        warmup_steps=5,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        output_dir="./outputs",
        save_strategy="epoch",
    ),
)
trainer.train()

5. Fine-Tuning with Axolotl (Production Config)
Axolotl is the best config-driven fine-tuner for production use. Define everything in YAML, run with a single command:
# config.yml (Axolotl QLoRA config for Llama 3.1 8B)
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
load_in_4bit: true
datasets:
  - path: ./data/train.jsonl
    type: alpaca # dataset format
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
sequence_len: 2048
sample_packing: true # Pack multiple short examples into one sequence
pad_to_sequence_len: true
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 2e-4
optimizer: adamw_bnb_8bit # Memory-efficient optimizer
lr_scheduler: cosine
output_dir: ./model-out
save_total_limit: 3
wandb_project: my-fine-tune # Optional experiment tracking
# Run training:
# axolotl train config.yml

6. Saving and Using the Fine-Tuned Model
# Save merged model (base + LoRA weights combined)
model.save_pretrained_merged("my-fine-tuned-model", tokenizer, save_method="merged_16bit")
# Or save as GGUF for llama.cpp / Ollama deployment
model.save_pretrained_gguf("my-model-gguf", tokenizer, quantization_method="q4_k_m")
# Use with Ollama:
# ollama create my-custom-model -f ./Modelfile
# where Modelfile contains: FROM ./my-model-gguf/model-q4_k_m.gguf
# Or upload to HuggingFace Hub (merges LoRA weights before pushing)
model.push_to_hub_merged("your-username/my-fine-tuned-model", tokenizer, save_method="merged_16bit")

7. Evaluating Your Fine-Tuned Model
Never ship a fine-tuned model without evaluation. Compare it against the base model on a held-out test set:
from evaluate import load

# For classification tasks
accuracy = load("accuracy")

def predict(model, x):
    # Placeholder: generate with `model` on example x and decode the
    # response to a label string (e.g. "URGENT"); depends on your stack.
    ...

base_predictions = [predict(base_model, x) for x in test_inputs]
ft_predictions = [predict(fine_tuned_model, x) for x in test_inputs]
ground_truth = [x["output"] for x in test_data]

base_score = accuracy.compute(predictions=base_predictions, references=ground_truth)
ft_score = accuracy.compute(predictions=ft_predictions, references=ground_truth)

print(f"Base model: {base_score['accuracy']:.2%}")
print(f"Fine-tuned: {ft_score['accuracy']:.2%}")
print(f"Improvement: {ft_score['accuracy'] - base_score['accuracy']:+.2%}")

Frequently Asked Questions
How much training data do I need?
For task specialization (teaching a new format or style): 200-500 high-quality examples. For domain adaptation (teaching new knowledge domain): 2,000-10,000 examples. For instruction following (aligning a base model): 10,000-50,000 multi-turn conversations. Quality always beats quantity — 200 carefully curated examples outperform 2,000 low-quality ones.
Will fine-tuning affect capabilities the model already has?
Yes — this is called catastrophic forgetting. If you fine-tune only on customer support data, coding ability may degrade. Mitigation: include diverse general-purpose examples (5-10% of your dataset from open benchmarks like OpenHermes), use LoRA instead of full fine-tuning, and keep learning rate low (2e-4 to 2e-5).
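The 5-10% mixing step can be sketched as a small helper (`mix_datasets` and its `general_fraction` parameter are hypothetical names, not a library API):

```python
import random

def mix_datasets(task_examples: list, general_examples: list,
                 general_fraction: float = 0.08, seed: int = 0) -> list:
    """Blend general-purpose examples into a task dataset so the final
    mix is ~8% general, ~92% task-specific (to reduce forgetting)."""
    rng = random.Random(seed)
    # Solve n / (len(task) + n) = general_fraction for n
    n_general = int(len(task_examples) * general_fraction / (1 - general_fraction))
    mixed = task_examples + rng.sample(general_examples,
                                       min(n_general, len(general_examples)))
    rng.shuffle(mixed)
    return mixed
```

With 1,000 support-ticket examples this pulls in roughly 86 general examples, landing inside the 5-10% range.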
Is fine-tuning worth the cost?
Calculate the ROI: if fine-tuning reduces your system prompt from 2,000 to 200 tokens per request, and you serve 1M requests/month, you save 1.8B tokens/month. At GPT-4o prices, that's $9,000/month in savings vs $100-300 one-time fine-tuning cost. For high-volume production systems, fine-tuning almost always pays off.
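That savings figure is straightforward to reproduce (the $5 per million input tokens rate is an assumption; check current pricing for your model):

```python
# Reproducing the ROI estimate above
requests_per_month = 1_000_000
tokens_saved_per_request = 2_000 - 200  # system prompt shrinks from 2,000 to 200 tokens
price_per_million_input = 5.00          # assumed USD rate per 1M input tokens

tokens_saved = requests_per_month * tokens_saved_per_request  # 1.8B tokens/month
monthly_savings = tokens_saved / 1_000_000 * price_per_million_input
print(f"${monthly_savings:,.0f}/month saved")  # $9,000/month saved
```

Even at half that token price, the one-time fine-tuning cost is recovered within the first month.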
Conclusion
Fine-tuning has gone from a research topic requiring PhDs and GPU clusters to a weekend project thanks to QLoRA, Unsloth, and Axolotl. The combination of a free Google Colab T4 GPU and Unsloth lets any developer fine-tune a 7-8B model on custom data. Start with Unsloth for experiments, graduate to Axolotl for production pipelines, and always evaluate against the base model before deploying.
Vivek
AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.