
Training Style LoRAs: Your Personal Art Style

Dec 30, 2025 • 22 min read

Flux.1 is a 12-billion-parameter image generation model. Full fine-tuning would require 40-80GB of VRAM and days of compute; LoRA (Low-Rank Adaptation) makes fine-tuning possible on a consumer RTX 4090 in under 2 hours. Rather than updating all 12B weights, LoRA inserts tiny "adapter" matrices at key points in the network: matrices with a dramatically reduced rank that capture the style concept far more efficiently. A style LoRA file might be 100MB while the base model is 23GB, yet it transforms every generation to match your aesthetic precisely.

1. Why LoRA Works: The Math in Plain English

Every layer in a neural network is a matrix of weights. For a layer with weight matrix W of size 4096×4096, updating it directly during fine-tuning requires storing and computing gradients for 16 million numbers. LoRA's insight: instead of updating W directly, represent the update as the product of two much smaller matrices: ΔW = A × B, where A has shape 4096×r and B has shape r×4096, and r is the "rank" (typically 4-64).

With rank 16, the number of parameters drops from 16M to 2×(4096×16) ≈ 131K, roughly a 128x reduction. Training updates only A and B; the original W stays frozen. At inference time, the layer's contribution is simply W + A×B. The low-rank assumption works because neural network weight updates during fine-tuning tend to be low-rank in practice: the concepts being learned don't require full-rank updates.
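The arithmetic above can be made concrete with a minimal numpy sketch, scaled down to a 512×512 layer so it runs instantly (all names here are illustrative):

```python
import numpy as np

d, r = 512, 8                    # layer width and LoRA rank (scaled down)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d)).astype(np.float32)  # frozen base weight
A = rng.standard_normal((d, r)).astype(np.float32)  # trainable adapter
B = np.zeros((r, d), dtype=np.float32)              # trainable, zero-init

# Parameter comparison: full update vs low-rank update
full_params = d * d              # 262,144
lora_params = 2 * d * r          # 8,192 (a 32x reduction at this scale)
print(f"full: {full_params:,}  lora: {lora_params:,}")

# Inference: the adapter's contribution is simply added to W
x = rng.standard_normal(d).astype(np.float32)
y = x @ (W + A @ B)

# With one factor zero-initialized (as in the LoRA paper), the adapted
# layer starts out identical to the frozen base layer
assert np.allclose(y, x @ W)
```

At full scale (d = 4096, r = 16) the same arithmetic gives the ~128x reduction quoted above.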

2. Dataset Preparation: Quality Over Quantity

Good Training Images
  • 15-30 images for style LoRAs (not 1000!)
  • Consistent artistic style throughout
  • Varied subjects and compositions
  • Clean background or intentional props
  • Minimum 512×512 resolution (1024+ for Flux)
  • No watermarks, logos, unintended text
Common Mistakes
  • Too many images with same composition
  • Mixed styles (different artists in one dataset)
  • Low resolution or blurry images
  • Captions that over-describe the subject (starving the trigger word of signal)
  • Generic trigger words like "style" or "art"
  • Training too long — causes catastrophic forgetting
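Before captioning, it's worth a quick mechanical pass over the folder. A minimal stdlib sketch (the trigger word and directory name are placeholders) that flags missing captions and captions that don't start with the trigger:

```python
from pathlib import Path

TRIGGER = "zxy_style"  # placeholder: use your own unique trigger word

def check_dataset(root: str) -> list[str]:
    """Flag images with missing captions or captions that
    don't start with the trigger word."""
    problems = []
    for img in sorted(Path(root).glob("*")):
        if img.suffix.lower() not in {".png", ".jpg", ".jpeg"}:
            continue
        txt = img.with_suffix(".txt")
        if not txt.exists():
            problems.append(f"missing caption: {img.name}")
        elif not txt.read_text().strip().startswith(TRIGGER):
            problems.append(f"caption missing trigger: {img.name}")
    return problems

if Path("./my_lora_dataset").exists():
    for issue in check_dataset("./my_lora_dataset"):
        print(issue)
```

Run it after every captioning pass; an image silently missing its .txt file is a common way trigger-word training quietly fails.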
# Captioning Strategy: The Most Critical Step
# 
# Your trigger word + sparse caption = strong style learning
# Dense natural captions = model learns subjects, not style

# BAD caption (too descriptive — model learns "woman" not your style):
# "a beautiful young woman with long brown hair wearing a red dress, 
#  standing in a garden, detailed illustration, high quality"

# GOOD caption (sparse — forces model to associate visual style with trigger):
# "zxy_style, woman in garden"
# "zxy_style, landscape, mountains"
# "zxy_style, portrait of a man"

# Auto-captioning with BLIP-2 (as a starting point, then trim by hand):
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image
import torch
import os

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", 
    torch_dtype=torch.float16,
    device_map="auto",
)

TRIGGER_WORD = "zxy_style"  # Use a unique token the model has never seen!
                             # Good: zxy_style, myartst2024, v1st, sks
                             # Bad: "style", "painting", "illustration" (too common)

dataset_dir = "./my_lora_dataset/"

for img_file in os.listdir(dataset_dir):
    if not img_file.lower().endswith(('.png', '.jpg', '.jpeg')):
        continue
    
    image = Image.open(os.path.join(dataset_dir, img_file)).convert('RGB')
    
    inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
    generated_ids = model.generate(**inputs, max_new_tokens=30)  # Short caption!
    auto_caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
    
    # Prepend the trigger word and drop articles for a sparser caption;
    # review the .txt files by hand and trim further toward the GOOD
    # examples above
    # "a woman with long hair in a garden" → "zxy_style, woman with long hair in garden"
    first_clause = auto_caption.split(',')[0].strip()
    words = [w for w in first_clause.split() if w.lower() not in ("a", "an", "the")]
    sparse_caption = TRIGGER_WORD + ", " + " ".join(words)
    
    txt_path = os.path.join(dataset_dir, img_file.rsplit('.', 1)[0] + '.txt')
    with open(txt_path, 'w') as f:
        f.write(sparse_caption)
    
    print(f"{img_file}: {sparse_caption}")
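One layout detail before handing this folder to kohya_ss: with `--train_data_dir`, sd-scripts (depending on version and config) expects images inside a subfolder whose name encodes the per-image repeat count, e.g. `20_zxy_style` for 20 repeats per epoch. A shell sketch (paths are placeholders):

```shell
# Move images and captions into the <repeats>_<concept> subfolder
# that kohya's DreamBooth-style dataset loader looks for
mkdir -p my_lora_dataset/20_zxy_style
mv my_lora_dataset/*.png my_lora_dataset/*.jpg my_lora_dataset/*.txt \
   my_lora_dataset/20_zxy_style/ 2>/dev/null || true
ls my_lora_dataset/20_zxy_style
```

If the subfolder is missing, training typically aborts with a "no data found" error rather than silently misbehaving, so this is quick to catch.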

3. Training with kohya_ss for Flux.1

# Install kohya_ss (sd-scripts):
#   git clone https://github.com/kohya-ss/sd-scripts
#   cd sd-scripts && pip install -r requirements.txt

# Training script for a Flux.1 [dev] LoRA.
# (Inline comments break bash line continuations, so the per-flag
# notes are collected here instead.)
#
# Architecture:
#   network_dim: the rank. 4-8 for subtle style, 16-32 for strong
#     style, 64 for subjects
#   network_alpha: typically equals the rank (controls the effective
#     learning rate)
# Hyperparameters:
#   learning_rate: Flux prefers a higher LR (1e-4) than SDXL (~1e-5)
#   max_train_steps: 15-30 images x ~20 steps per image = 300-600 total
#   save_every_n_steps: keep intermediate checkpoints for comparison
# Memory and precision:
#   bf16 requires an Ampere+ GPU (RTX 3000 series or newer);
#   gradient checkpointing saves VRAM at ~2x slower training;
#   Adafactor is memory-efficient, making 12GB VRAM feasible
# Data:
#   1024x1024 is Flux's native resolution; bucketing supports mixed
#   aspect ratios (portrait, landscape) without upscaling small images
# Flux-specific:
#   flux_train_t_exponent: timestep distribution (higher = focus on
#     noisier timesteps)
# Depending on your sd-scripts version you may also need to pass paths
# for the text encoders and autoencoder (e.g. --clip_l, --t5xxl, --ae).

accelerate launch \
    flux_train_network.py \
    --pretrained_model_name_or_path "flux1-dev.safetensors" \
    --train_data_dir "./my_lora_dataset" \
    --output_dir "./output_lora" \
    --output_name "my_style_v1" \
    --network_module "networks.lora_flux" \
    --network_dim 16 \
    --network_alpha 16 \
    --learning_rate 1e-4 \
    --lr_scheduler "cosine_with_restarts" \
    --lr_warmup_steps 50 \
    --max_train_steps 500 \
    --save_every_n_steps 100 \
    --mixed_precision "bf16" \
    --gradient_checkpointing \
    --optimizer_type "adafactor" \
    --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
    --resolution "1024,1024" \
    --enable_bucket \
    --bucket_no_upscale \
    --max_data_loader_n_workers 4 \
    --flux_train_t_exponent 7.0 \
    --guidance_scale 1.0
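Once training finishes, the checkpoint loads on top of the base model for generation. A sketch using the diffusers FluxPipeline (paths and prompt are placeholders; this needs the full 23GB base model and a CUDA GPU, so treat it as an outline rather than something to paste and run blind):

```python
import torch
from diffusers import FluxPipeline

# Load the base model, then layer the trained LoRA on top
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights("./output_lora/my_style_v1.safetensors")
pipe.to("cuda")

# The trigger word activates the style; everything else is a normal prompt
image = pipe(
    "zxy_style, a lighthouse on a cliff at dusk",
    num_inference_steps=28,
    guidance_scale=3.5,
    joint_attention_kwargs={"scale": 0.8},  # LoRA strength: lower it if the style overpowers
).images[0]
image.save("lighthouse_zxy.png")
```

Run the same prompt against each saved checkpoint (step 100, 200, ...) to pick the one that captures the style without degrading composition.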

4. Diagnosing Common Failures

| Symptom | Cause | Fix |
| --- | --- | --- |
| Output looks like molten lava / deep fried | Learning rate too high | Lower LR: try 5e-5 or 4e-5 |
| Trigger word has no effect | Over-captioned or trigger word too common | Use a unique trigger; strip captions to 2-5 words |
| Style learned but quality terrible | Poor quality training images | Curate: only your 15 best images |
| Style works but catastrophic forgetting | Trained too many steps | Reduce to 300-500 steps; use an earlier checkpoint |
| Only generates one composition | Low variety in training set | More diverse subjects/angles in dataset |
| OOM despite gradient checkpointing | Batch too large or resolution too high | Set batch_size=1; use 768px if needed |

Frequently Asked Questions

Should I use rank 4, 16, 32, or 64?

For art styles: rank 8-16 is sufficient. Styles are relatively low-complexity concepts — the aesthetic differences between impressionism and photorealism don't require high-rank updates. For specific subjects/characters (face likeness, specific character design): rank 32-64 captures more detailed identity information. For concepts and objects: rank 4-8 is often enough. Always train with low rank first — if results are weak, increase rank before adjusting any other parameter.
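Rank also sets the file size on disk: LoRA parameters scale linearly with r, so doubling the rank roughly doubles the file. A back-of-envelope sketch (the layer count and width here are hypothetical round numbers for illustration, not Flux's exact architecture):

```python
def lora_size_mb(rank: int, layers: int = 400, dim: int = 3072) -> float:
    """Rough fp16 LoRA file size: two (dim x rank) matrices per adapted layer."""
    params = layers * 2 * dim * rank
    return params * 2 / 2**20  # fp16 = 2 bytes per parameter

for r in (4, 8, 16, 32, 64):
    print(f"rank {r:2d}: ~{lora_size_mb(r):.0f} MB")
```

Under these assumptions rank 16 lands in the tens of megabytes, consistent with the ~100MB style LoRAs mentioned earlier, while rank 64 quadruples that for no benefit if the style didn't need the capacity.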

Can I train on images I don't own?

Copyright law for AI training is actively evolving and varies by jurisdiction. The safest approach for commercial use: train only on images you own, license under CC0 or appropriate creative commons licenses, or generate from models you have commercial rights to. Many LoRA trainers use their own artwork or commission datasets specifically for this purpose. For research/personal use, the legal landscape is more permissive in most jurisdictions but still uncertain.

Conclusion

Style LoRAs democratize access to custom model fine-tuning: 15-30 carefully captioned images, a single kohya_ss training run, and you have a persistent art style that works across any prompt. The critical success factors are: unique trigger words that won't conflict with the base model's existing vocabulary, sparse captions that force the model to associate visual style with the trigger rather than subject content, and modest training length that captures the style without overwriting too much of the base model's capabilities. Save intermediate checkpoints every 100 steps and evaluate multiple before selecting your production version.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.

GPT-4o · LangChain · Next.js · Vector DBs · RAG · Vercel AI SDK