How to Fine-Tune an LLM: OpenAI, Mistral, and Local Models

Fine-tuning adapts a pre-trained language model to your specific task or style. Done well, it can dramatically improve performance on narrow tasks. Done poorly, it wastes time and money. This guide covers when to fine-tune and how to do it correctly.

When Fine-Tuning Is Worth It

Fine-tuning makes sense when:

You need consistent format/style that prompt engineering can’t reliably achieve
You have 100+ labeled examples of the desired behavior
Latency or cost matters (fine-tuned smaller models can match larger ones on specific tasks)
You have domain-specific terminology, tone, or conventions that need to be baked in

Usually not worth fine-tuning when:

You need broad general-purpose capability: off-the-shelf models are better at that
You don’t have good training data
Your use case changes frequently
You haven’t exhausted prompt engineering first

Rule of thumb: Exhaust prompt engineering before fine-tuning. A well-crafted system prompt often achieves 80% of what fine-tuning would.

Data Preparation

Training data quality is more important than quantity.

Format for OpenAI Fine-Tuning

{"messages": [
  {"role": "system", "content": "You are a customer support agent for Acme Software."},
  {"role": "user", "content": "How do I reset my password?"},
  {"role": "assistant", "content": "To reset your password: Go to Settings > Account > Security, then click 'Reset Password'. You'll receive an email with a reset link within 5 minutes. If you don't see it, check your spam folder."}
]}
{"messages": [
  {"role": "system", "content": "You are a customer support agent for Acme Software."},
  {"role": "user", "content": "I'm getting error code 403"},
  {"role": "assistant", "content": "Error 403 typically means you don't have permission to access that resource. Common causes: 1) Your subscription may have expired, 2) Your account may not have access to that feature. Check your subscription status in Settings > Billing. If you believe this is an error, contact our support team."}
]}

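Save as .jsonl. The examples above are spread over several lines for readability; in the actual file, each JSON object must occupy exactly one line.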
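Before uploading, it can help to check the file mechanically. The sketch below is one possible validator, not part of any official tooling: the function names and the specific checks are assumptions, and it is deliberately strict for plain chat examples (it only accepts the three roles shown above and string content).

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        malformed = any(
            not isinstance(m, dict)
            or m.get("role") not in VALID_ROLES
            or not isinstance(m.get("content"), str)
            for m in messages
        )
        if malformed:
            problems.append((number, "message with bad role or non-string content"))
        elif messages[-1]["role"] != "assistant":
            problems.append((number, "last message is not from the assistant"))
    return problems

def validate_jsonl_file(path):
    """Validate a .jsonl file on disk, reading it line by line."""
    with open(path, encoding="utf-8") as f:
        return validate_jsonl_lines(f)
```

Calling validate_jsonl_file("training_data.jsonl") returns an empty list when every line parses and ends with an assistant turn. A pretty-printed file like the display version above would fail on its first line.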

Data Collection Strategies

import json

def create_training_example(system_prompt: str, user_message: str, ideal_response: str) -> dict:
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": ideal_response},
        ]
    }

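def save_training_data(examples: list[dict], output_path: str) -> None:
    # Write one JSON object per line (JSONL); UTF-8 keeps non-ASCII text intact
    with open(output_path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
    print(f"Saved {len(examples)} training examples to {output_path}")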

Generating Synthetic Training Data

import json

import openai

client = openai.OpenAI()

def generate_training_data(task_description: str, n_examples: int = 50) -> list[dict]:
    # JSON mode returns a single JSON object, so ask for the array under an "examples" key
    prompt = f"""Generate {n_examples} diverse training examples for this task:
{task_description}

Return a JSON object with a single key "examples" whose value is an array.
Each array item must be an object with keys: user_message, assistant_response
Make examples diverse: vary phrasing, complexity, and edge cases."""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )

    data = json.loads(response.choices[0].message.content)
    return data.get("examples", [])
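The generated items use the keys user_message and assistant_response, so they still need to be converted into the chat format shown earlier before they can be saved. A minimal sketch of that step follows; the helper name pairs_to_training_examples and its filtering rules (skip malformed items, drop repeated user messages) are assumptions for illustration, not part of the OpenAI SDK.

```python
def pairs_to_training_examples(pairs, system_prompt):
    """Convert generated pairs to chat-format examples, skipping malformed and duplicate items."""
    examples = []
    seen = set()
    for pair in pairs:
        if not isinstance(pair, dict):
            continue
        user = pair.get("user_message")
        assistant = pair.get("assistant_response")
        if not isinstance(user, str) or not isinstance(assistant, str):
            continue
        user, assistant = user.strip(), assistant.strip()
        # Normalize case and whitespace so near-identical questions count as duplicates
        key = " ".join(user.lower().split())
        if not user or not assistant or key in seen:
            continue
        seen.add(key)
        examples.append({
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user},
                {"role": "assistant", "content": assistant},
            ]
        })
    return examples
```

Synthetic data still deserves a manual read-through: the filter only catches structural problems, not wrong or low-quality answers.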

OpenAI Fine-Tuning

Upload and Train

import openai

client = openai.OpenAI()

def upload_training_file(file_path: str) -> str:
    with open(file_path, "rb") as f:
        response = client.files.create(file=f, purpose="fine-tune")
    print(f"Uploaded file: {response.id}")
    return response.id

def start_fine_tune(training_file_id: str, model: str = "gpt-4o-mini-2024-07-18") -> str:
    job = client.fine_tuning.jobs.create(
        training_file=training_file_id,
        model=model,
        hyperparameters={
            "n_epochs": 3,  # Start with 3, increase if underfitting
        },
    )
    print(f"Fine-tune job started: {job.id}")
    return job.id

def check_status(job_id: str):
    # Returns the job object from the API (not a plain dict)
    job = client.fine_tuning.jobs.retrieve(job_id)
    print(f"Status: {job.status}")
    if job.fine_tuned_model:
        print(f"Model ready: {job.fine_tuned_model}")
    return job

# Full pipeline
file_id = upload_training_file("training_data.jsonl")
job_id = start_fine_tune(file_id)

# Check periodically
job = check_status(job_id)
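Checking periodically can be automated with a small polling loop. The sketch below is an assumption-laden helper rather than an SDK feature: wait_for_job and its parameters are hypothetical names, and it treats "succeeded", "failed", and "cancelled" as the terminal job statuses. It takes the retrieval function as an argument so it can be exercised without network access.

```python
import time

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve_job, job_id, poll_seconds=60, max_polls=240, sleep=time.sleep):
    """Poll retrieve_job(job_id) until the job reaches a terminal status, then return the job."""
    for _ in range(max_polls):
        job = retrieve_job(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")
```

With the client from this section, the call would look like wait_for_job(client.fine_tuning.jobs.retrieve, job_id). Check the returned job's status before using fine_tuned_model, since a failed or cancelled job has no model.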

Using the Fine-Tuned Model

def use_fine_tuned_model(model_id: str, user_message: str) -> str:
    response = client.chat.completions.create(
        model=model_id,  # e.g., "ft:gpt-4o-mini:org-xyz:name:abc123"
        messages=[
            # Use the same system prompt the model saw in its training examples
            {"role": "system", "content": "Your system prompt here"},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

Local Fine-Tuning with LoRA

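LoRA (Low-Rank Adaptation) freezes the pre-trained weights and trains small low-rank adapter matrices added to selected layers, typically well under 1% of the model's parameters. Combined with 4-bit quantization of the frozen weights (QLoRA), this makes local training of a 7B model feasible on a single GPU.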

Setup

pip install transformers datasets peft accelerate bitsandbytes

Fine-Tuning with PEFT/LoRA

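from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import get_peft_model, LoraConfig, TaskType, prepare_model_for_kbit_training
from datasets import Dataset
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.3"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    # QLoRA: load the frozen base weights in 4-bit NF4 and compute in float16
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    ),
)
# Prepare the quantized model for training (e.g. enables gradient checkpointing and input gradients)
model = prepare_model_for_kbit_training(model)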

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,           # Rank — higher = more parameters, more capacity
    lora_alpha=32,  # Scaling parameter (typically 2x rank)
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Shows % of params being trained

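def format_instruction(example):
    # The tokenizer prepends the BOS token (<s>) itself, so only EOS is appended here
    return {
        "text": f"[INST] {example['instruction']} [/INST] {example['output']}{tokenizer.eos_token}"
    }

# Load your data
data = [
    {"instruction": "Explain machine learning", "output": "Machine learning is..."},
    # ... more examples
]
dataset = Dataset.from_list(data).map(format_instruction)

# Tokenize and train: a minimal sketch, the hyperparameters are starting points rather than tuned values
from transformers import Trainer, DataCollatorForLanguageModeling

# Prefer a pad token distinct from EOS so EOS is not masked out of the loss
tokenizer.pad_token = tokenizer.unk_token or tokenizer.eos_token
tokenized = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mistral-lora",
        num_train_epochs=3,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("mistral-lora")  # Saves only the LoRA adapter weights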
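The figure reported by print_trainable_parameters() can be sanity-checked by hand. A rank-r adapter on a weight matrix of shape d_out x d_in adds two matrices, A (r x d_in) and B (d_out x r), for r * (d_in + d_out) new parameters. The sketch below applies that to the configuration above; the layer shapes are assumptions taken from the published Mistral-7B architecture (hidden size 4096, key/value projection width 1024, 32 layers), so verify them against the model's config before relying on the result.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# Assumed Mistral-7B shapes (check model.config for the model you actually load)
HIDDEN = 4096   # input width of q_proj and v_proj, and output width of q_proj
KV_DIM = 1024   # output width of v_proj (8 key/value heads x 128 dims)
LAYERS = 32
RANK = 16       # matches r=16 in the LoRA configuration above

per_layer = lora_param_count(HIDDEN, HIDDEN, RANK) + lora_param_count(HIDDEN, KV_DIM, RANK)
total = per_layer * LAYERS
print(f"{total:,} trainable LoRA parameters")  # 6,815,744
```

Under these assumptions that is roughly 0.1% of a model with about 7 billion parameters. Adding more entries to target_modules or raising r increases the count proportionally.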

Evaluation

Evaluate before and after fine-tuning on a held-out test set:

def evaluate_model(model_id: str, test_cases: list[dict]) -> dict:
    results = []
    
    for case in test_cases:
        response = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": case["input"]}],
        )
        
        output = response.choices[0].message.content
        
        # Simple case-insensitive substring check; replace with LLM-as-judge for better evaluation
        passed = case["expected"].lower() in output.lower()
        
        results.append({
            "input": case["input"],
            "output": output,
            "expected": case["expected"],
            "passed": passed,
        })
    
    accuracy = sum(r["passed"] for r in results) / len(results) if results else 0.0
    return {"accuracy": accuracy, "results": results}

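# Held-out cases that were never used for training (illustrative placeholders)
test_cases = [
    {"input": "I forgot my password. What should I do?", "expected": "reset"},
    # ... more cases
]

# Compare base vs fine-tuned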
base_results = evaluate_model("gpt-4o-mini", test_cases)
ft_results = evaluate_model("ft:gpt-4o-mini:org:name:id", test_cases)

print(f"Base model accuracy: {base_results['accuracy']:.1%}")
print(f"Fine-tuned accuracy: {ft_results['accuracy']:.1%}")

Cost Estimates

OpenAI Fine-Tuning (gpt-4o-mini):

Training: $0.003 per 1K tokens, charged for every epoch (~$0.90 for 100 examples of 1K tokens each at 3 epochs)
Inference: ~2x base model cost

Local Fine-Tuning:

Hardware: A100 GPU (80GB) ~$2/hour on Lambda or RunPod
100 examples, 3 epochs: ~15-30 minutes
Total: ~$0.50-$1 for a small dataset
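These estimates are simple arithmetic and easy to redo for your own dataset. The sketch below plugs in the prices quoted in this section, assuming API training is billed per trained token (tokens in the file times the number of epochs) and GPU rental is billed by time. The function names are illustrative, and prices change, so check current rates.

```python
def openai_training_cost(n_examples, tokens_per_example, n_epochs, price_per_1k_tokens):
    """Estimated API training cost: every token is billed once per epoch."""
    trained_tokens = n_examples * tokens_per_example * n_epochs
    return trained_tokens / 1000 * price_per_1k_tokens

def gpu_rental_cost(minutes, price_per_hour):
    """Estimated cost of renting a GPU for the given number of minutes."""
    return minutes / 60 * price_per_hour

api = openai_training_cost(n_examples=100, tokens_per_example=1000, n_epochs=3, price_per_1k_tokens=0.003)
low = gpu_rental_cost(minutes=15, price_per_hour=2.0)
high = gpu_rental_cost(minutes=30, price_per_hour=2.0)

print(f"OpenAI training: ${api:.2f}")              # $0.90
print(f"Local training: ${low:.2f}-${high:.2f}")   # $0.50-$1.00
```

For datasets this small the training cost is negligible either way. Over the life of a model, the higher per-token inference price or the cost of hosting it usually matters more.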

Common Fine-Tuning Mistakes

Too little data: Treat 50 examples as a bare minimum and aim for 500+ for reliable results. More data improves generalization only if its quality stays high.

Catastrophic forgetting: Fine-tuning too aggressively makes the model forget general capabilities. Use a lower learning rate or fewer epochs, and spot-check general prompts after training.

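Data leakage: Don’t include test examples in training data. Split your source examples into train and test sets before generating synthetic variants, so paraphrases of a test case never end up in training.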

Not evaluating the base model: Always benchmark before fine-tuning to know if you’re actually improving.

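Wrong task for fine-tuning: Fine-tuning teaches behavior, format, and style far more reliably than it teaches new facts. If the model needs factual or frequently changing knowledge, use retrieval-augmented generation (RAG) instead.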
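One way to guard against the data leakage mistake is to assign each seed example to a split deterministically, before any synthetic variants are generated from it. The sketch below hashes the normalized seed text so the same question always lands in the same split, even across runs. The function names and the 20% test fraction are assumptions for illustration.

```python
import hashlib

def assign_split(seed_text, test_fraction=0.2):
    """Map a seed example to 'train' or 'test' based on a hash of its normalized text."""
    normalized = " ".join(seed_text.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return "test" if bucket < test_fraction else "train"

def split_seeds(seed_texts, test_fraction=0.2):
    """Group seed examples by split; generate synthetic variants only within each group."""
    splits = {"train": [], "test": []}
    for text in seed_texts:
        splits[assign_split(text, test_fraction)].append(text)
    return splits
```

Because the assignment depends only on the text, re-running data collection later cannot silently move an old test case into the training set.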