Fine-tuning adapts a pre-trained language model to your specific task or style. Done well, it can dramatically improve performance on narrow tasks. Done poorly, it wastes time and money. This guide covers when to fine-tune and how to do it correctly.
When Fine-Tuning Is Worth It
Fine-tuning makes sense when:
- You need consistent format/style that prompt engineering can’t reliably achieve
- You have 100+ labeled examples of the desired behavior
- Latency or cost matters (fine-tuned smaller models can match larger ones on specific tasks)
- You have domain-specific terminology, conventions, or response patterns that need to be baked in
Usually not worth fine-tuning when:
- You need broad general capability: a larger general-purpose model will usually do better
- You don’t have good training data
- Your use case changes frequently
- You haven’t exhausted prompt engineering first
Rule of thumb: Exhaust prompt engineering before fine-tuning. As a rough heuristic, a well-crafted system prompt often achieves about 80% of what fine-tuning would, with far less time and cost.
Data Preparation
Training data quality is more important than quantity.
Format for OpenAI Fine-Tuning
{"messages": [
{"role": "system", "content": "You are a customer support agent for Acme Software."},
{"role": "user", "content": "How do I reset my password?"},
{"role": "assistant", "content": "To reset your password: Go to Settings > Account > Security, then click 'Reset Password'. You'll receive an email with a reset link within 5 minutes. If you don't see it, check your spam folder."}
]}
{"messages": [
{"role": "system", "content": "You are a customer support agent for Acme Software."},
{"role": "user", "content": "I'm getting error code 403"},
{"role": "assistant", "content": "Error 403 typically means you don't have permission to access that resource. Common causes: 1) Your subscription may have expired, 2) Your account may not have access to that feature. Check your subscription status in Settings > Billing. If you believe this is an error, contact support."}
]}
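Save the file as .jsonl with one complete JSON object per line. The examples above are wrapped across several lines for readability; in the actual file, each {"messages": [...]} object must sit on a single line.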
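Before uploading, it can help to check that every line parses and has the expected shape. The following is a minimal sketch, not an official validator: the function name `validate_jsonl_lines` and the specific checks (known roles, non-empty string content, a final assistant turn) are assumptions chosen for illustration.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumption: plain chat data, no tool calls


def validate_jsonl_lines(lines: list[str]) -> list[str]:
    """Return human-readable problems found in chat-format JSONL lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append(f"line {number}: blank line")
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {number}: missing or empty 'messages' list")
            continue
        for message in messages:
            if not isinstance(message, dict) or message.get("role") not in ALLOWED_ROLES:
                problems.append(f"line {number}: message with unknown role")
            elif not isinstance(message.get("content"), str) or not message["content"].strip():
                problems.append(f"line {number}: message with empty content")
        last = messages[-1]
        if isinstance(last, dict) and last.get("role") != "assistant":
            problems.append(f"line {number}: last message is not from the assistant")
    return problems
```

To check a file, read it as text, split it into lines, and pass the list to the function. An empty result means none of these particular checks failed; it does not guarantee the upload will be accepted.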
Data Collection Strategies
import json

def create_training_example(system_prompt: str, user_message: str, ideal_response: str) -> dict:
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": ideal_response},
        ]
    }

def save_training_data(examples: list[dict], output_path: str):
    with open(output_path, "w") as f:
        for example in examples:
            f.write(json.dumps(example) + "\n")
    print(f"Saved {len(examples)} training examples to {output_path}")
Generating Synthetic Training Data
import json
import openai

client = openai.OpenAI()

def generate_training_data(task_description: str, n_examples: int = 50) -> list[dict]:
    # JSON mode returns a single JSON object, so the prompt asks for the
    # array under an "examples" key, which is what the code reads below.
    prompt = f"""Generate {n_examples} diverse training examples for this task:
{task_description}
Format each example as a JSON object with keys: user_message, assistant_response
Make examples diverse: vary phrasing, complexity, and edge cases.
Output a JSON object of the form {{"examples": [...]}}."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    data = json.loads(response.choices[0].message.content)
    return data.get("examples", [])
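The generated items use the keys user_message and assistant_response, so they still need to be converted into the messages format shown earlier. Here is a minimal sketch, assuming the model returns items in that shape. The function name `to_training_examples` and the filtering rules (skip malformed items, drop repeated user messages) are illustrative choices, not part of any API.

```python
def to_training_examples(items: list[dict], system_prompt: str) -> list[dict]:
    """Convert generated items into chat-format examples, skipping bad or duplicate ones."""
    examples = []
    seen = set()
    for item in items:
        if not isinstance(item, dict):
            continue
        user = item.get("user_message")
        assistant = item.get("assistant_response")
        if not isinstance(user, str) or not isinstance(assistant, str):
            continue  # model output did not follow the requested schema
        user, assistant = user.strip(), assistant.strip()
        if not user or not assistant:
            continue
        key = user.lower()
        if key in seen:
            continue  # same question already covered
        seen.add(key)
        examples.append({
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user},
                {"role": "assistant", "content": assistant},
            ]
        })
    return examples
```

Filtering like this only catches structural problems. Synthetic examples still deserve a manual read-through before they go into a training file, since data quality matters more than volume.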
OpenAI Fine-Tuning
Upload and Train
import openai

client = openai.OpenAI()

def upload_training_file(file_path: str) -> str:
    with open(file_path, "rb") as f:
        response = client.files.create(file=f, purpose="fine-tune")
    print(f"Uploaded file: {response.id}")
    return response.id

def start_fine_tune(training_file_id: str, model: str = "gpt-4o-mini-2024-07-18") -> str:
    job = client.fine_tuning.jobs.create(
        training_file=training_file_id,
        model=model,
        hyperparameters={
            "n_epochs": 3,  # Start with 3, increase if underfitting
        },
    )
    print(f"Fine-tune job started: {job.id}")
    return job.id

def check_status(job_id: str):
    # Returns the job object (not a dict); read fields such as job.status directly
    job = client.fine_tuning.jobs.retrieve(job_id)
    print(f"Status: {job.status}")
    if job.fine_tuned_model:
        print(f"Model ready: {job.fine_tuned_model}")
    return job

# Full pipeline
file_id = upload_training_file("training_data.jsonl")
job_id = start_fine_tune(file_id)

# Check periodically
job = check_status(job_id)
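"Check periodically" can be turned into a small polling loop. This is a sketch under assumptions: `wait_for_job` and its parameters are hypothetical names, and `get_status` is any zero-argument callable that returns the job's current status string. The terminal states listed are the ones the fine-tuning API reports for finished jobs.

```python
import time
from typing import Callable

TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def wait_for_job(
    get_status: Callable[[], str],
    poll_seconds: float = 60.0,
    timeout_seconds: float = 6 * 60 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal state and return that state."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still '{status}' after {waited:.0f}s")
        sleep(poll_seconds)  # injectable so the loop can be tested without waiting
        waited += poll_seconds
```

With the client from the pipeline above, a call might look like `wait_for_job(lambda: client.fine_tuning.jobs.retrieve(job_id).status)`. Passing the status lookup in as a callable keeps the loop independent of the network, which makes it easy to test.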
Using the Fine-Tuned Model
def use_fine_tuned_model(model_id: str, user_message: str) -> str:
    response = client.chat.completions.create(
        model=model_id,  # e.g., "ft:gpt-4o-mini:org-xyz:name:abc123"
        messages=[
            # Use the same system prompt that appears in your training data
            {"role": "system", "content": "Your system prompt here"},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content
Local Fine-Tuning with LoRA
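LoRA (Low-Rank Adaptation) freezes the base model's weights and trains small low-rank adapter matrices added to selected layers, often well under 1% of the total parameters. Combined with 4-bit quantization of the frozen base model (QLoRA), this makes local training feasible on a single GPU.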
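To make the size of that fraction concrete: for a weight matrix of shape d_out x d_in, LoRA trains two matrices of shape d_out x r and r x d_in, which adds r * (d_in + d_out) trainable parameters. The sketch below applies that formula. The layer shapes are assumptions based on the published Mistral-7B configuration (hidden size 4096, 32 layers, 1024-dimensional value projection), so verify them against the model you actually load.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)


# Assumed shapes for a Mistral-7B-style model (check your model's config).
NUM_LAYERS = 32
TARGETS = {
    "q_proj": (4096, 4096),  # (d_in, d_out)
    "v_proj": (4096, 1024),  # smaller output because of grouped-query attention
}


def total_lora_params(r: int) -> int:
    """Sum adapter parameters over all targeted matrices in all layers."""
    per_layer = sum(lora_param_count(d_in, d_out, r) for d_in, d_out in TARGETS.values())
    return per_layer * NUM_LAYERS


if __name__ == "__main__":
    for rank in (8, 16, 64):
        print(f"r={rank}: {total_lora_params(rank):,} trainable parameters")
```

Under these assumptions, r=16 on the query and value projections comes to about 6.8 million trainable parameters, roughly 0.1% of a 7-billion-parameter model. The count scales linearly with the rank.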
Setup
pip install transformers datasets peft accelerate bitsandbytes
Fine-Tuning with PEFT/LoRA
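from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, TaskType, prepare_model_for_kbit_training
from datasets import Dataset
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# QLoRA: load the frozen base model in 4-bit via a quantization config
# (passing load_in_4bit directly to from_pretrained is deprecated)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)  # Prepares the quantized model for training

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,            # Rank: higher = more parameters, more capacity
    lora_alpha=32,   # Scaling parameter (typically 2x rank)
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Shows % of params being trained

def format_instruction(example):
    # If your tokenizer adds <s> automatically (the default), drop the
    # leading <s> here to avoid a duplicate BOS token
    return {
        "text": f"<s>[INST] {example['instruction']} [/INST] {example['output']} </s>"
    }

# Load your data
data = [
    {"instruction": "Explain machine learning", "output": "Machine learning is..."},
    # ... more examples
]
dataset = Dataset.from_list(data).map(format_instruction)

# Next step (not shown): tokenize `dataset` and train, for example with
# transformers.Trainer or TRL's SFTTrainer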
Evaluation
Evaluate before and after fine-tuning on a held-out test set:
def evaluate_model(model_id: str, test_cases: list[dict]) -> dict:
    if not test_cases:
        raise ValueError("test_cases must not be empty")
    results = []
    for case in test_cases:
        response = client.chat.completions.create(
            model=model_id,
            # Include the training system prompt here if your examples used one
            messages=[{"role": "user", "content": case["input"]}],
        )
        output = response.choices[0].message.content
        # Simple scoring: case-insensitive substring match.
        # Replace with LLM-as-judge for better evaluation.
        passed = case["expected"].lower() in output.lower()
        results.append({
            "input": case["input"],
            "output": output,
            "expected": case["expected"],
            "passed": passed,
        })
    accuracy = sum(r["passed"] for r in results) / len(results)
    return {"accuracy": accuracy, "results": results}

# Held-out cases that were never used for training (illustrative placeholder)
test_cases = [
    {"input": "Where can I update my billing details?", "expected": "settings > billing"},
    # ... more cases
]

# Compare base vs fine-tuned
base_results = evaluate_model("gpt-4o-mini", test_cases)
ft_results = evaluate_model("ft:gpt-4o-mini:org:name:id", test_cases)
print(f"Base model accuracy: {base_results['accuracy']:.1%}")
print(f"Fine-tuned accuracy: {ft_results['accuracy']:.1%}")
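Aggregate accuracy can hide regressions: a fine-tuned model may fix some cases while breaking others. Here is a minimal sketch of a per-case comparison, assuming both runs used the same test cases in the same order and that the result dictionaries have the shape returned by evaluate_model above. `compare_runs` is a hypothetical helper name.

```python
def compare_runs(base: dict, fine_tuned: dict) -> dict:
    """List which test inputs were fixed or broken by fine-tuning."""
    base_results, ft_results = base["results"], fine_tuned["results"]
    if len(base_results) != len(ft_results):
        raise ValueError("both runs must cover the same test cases")
    fixed, regressed = [], []
    for before, after in zip(base_results, ft_results):
        if before["input"] != after["input"]:
            raise ValueError("test cases are not aligned between runs")
        if not before["passed"] and after["passed"]:
            fixed.append(after["input"])      # failed before, passes now
        elif before["passed"] and not after["passed"]:
            regressed.append(after["input"])  # passed before, fails now
    return {"fixed": fixed, "regressed": regressed}
```

Calling `compare_runs(base_results, ft_results)` returns the inputs in each bucket. Reading through the regressed list is a cheap way to spot early signs of the model losing general ability.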
Cost Estimates
OpenAI Fine-Tuning (gpt-4o-mini):
- Training: $0.003 per 1K tokens. 100 examples of 1K tokens each is 100K tokens per epoch, so about $0.30 per epoch, or roughly $0.90 for 3 epochs
- Inference: ~2x base model cost
Local Fine-Tuning:
- Hardware: A100 GPU (80GB) ~$2/hour on Lambda or RunPod
- 100 examples, 3 epochs: ~15-30 minutes
- Total: ~$0.50-$1 for a small dataset (15-30 minutes at ~$2/hour)
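The arithmetic behind these estimates can be sketched in a few lines. The prices are the figures quoted in this section, not live pricing, and the function names are hypothetical. The sketch assumes hosted training is billed on the tokens in the file multiplied by the number of epochs.

```python
def training_cost_usd(n_examples: int, tokens_per_example: int, n_epochs: int,
                      price_per_1k_tokens: float) -> float:
    """Hosted training cost: billed tokens = tokens in the file x epochs."""
    total_tokens = n_examples * tokens_per_example * n_epochs
    return total_tokens / 1000 * price_per_1k_tokens


def gpu_cost_usd(minutes: float, price_per_hour: float) -> float:
    """Rental cost for a GPU billed by the hour, pro-rated by the minute."""
    return minutes / 60 * price_per_hour


if __name__ == "__main__":
    hosted = training_cost_usd(100, 1000, n_epochs=3, price_per_1k_tokens=0.003)
    local_low, local_high = gpu_cost_usd(15, 2.0), gpu_cost_usd(30, 2.0)
    print(f"Hosted: ${hosted:.2f}  Local: ${local_low:.2f}-${local_high:.2f}")
```

Plugging in your own token counts and current prices gives a quick sanity check before starting a job.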
Common Fine-Tuning Mistakes
Too little data: 50 examples is a practical minimum; expect to need 500+ for reliable results. More high-quality, varied examples generally improve generalization.
Catastrophic forgetting: Fine-tuning too aggressively makes the model forget general capabilities. Use a lower learning rate or fewer epochs, and spot-check a few general tasks after training.
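Data leakage: Don't include test examples in training data. Split your seed examples into train and test sets before generating synthetic variations, so paraphrases of a test case cannot end up in the training set.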
Not evaluating the base model: Always benchmark before fine-tuning to know if you’re actually improving.
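Wrong task for fine-tuning: Fine-tuning is good at teaching format, tone, and behavior, but it is an unreliable way to add new facts. If you need factual knowledge, use RAG (retrieval-augmented generation) instead.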
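One way to act on the data-leakage advice is to split seed examples before any synthetic expansion. This is a minimal sketch: `split_examples`, the 20% test fraction, and the fixed seed are illustrative assumptions, and duplicates are detected only by a case-insensitive exact match on the user message.

```python
import random


def split_examples(examples: list[dict], test_fraction: float = 0.2, seed: int = 0):
    """Deduplicate by user message, shuffle deterministically, return (train, test)."""
    unique, seen = [], set()
    for example in examples:
        # Use the first user turn as the identity of the example
        user_text = next(
            (m["content"] for m in example["messages"] if m["role"] == "user"), ""
        )
        key = user_text.strip().lower()
        if key in seen:
            continue
        seen.add(key)
        unique.append(example)
    random.Random(seed).shuffle(unique)
    # Keep at least one test example whenever there is more than one example
    n_test = max(1, round(len(unique) * test_fraction)) if len(unique) > 1 else 0
    return unique[n_test:], unique[:n_test]
```

Exact-match deduplication will not catch paraphrases, so near-duplicate questions can still land on both sides of the split. Generating synthetic variations only from the training half avoids that problem at the source.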