How to Run Claude-Like AI Models Locally with Ollama

Claude (the model from Anthropic) cannot be run locally — it’s a cloud service. But several open-source alternatives are now close enough in quality for many development tasks, and they run completely on your hardware with zero API costs.

This guide covers running local AI models with Ollama and connecting them to your development workflow.

Why Run Local AI Models?

Privacy: Your code and conversations never leave your machine. Important for work on proprietary codebases or in regulated industries.

Cost: No per-token charges. High-volume workflows that could cost hundreds of dollars a month through an API cost only hardware and electricity when run locally.

Speed: For smaller models, local inference can be faster than API round-trips, especially with a modern GPU.

Offline capability: Works without an internet connection once the model has been downloaded.

What you lose vs. Claude API: The best open-source models (Llama 3.3 70B, Qwen 2.5 72B) are genuinely capable but still below Claude Sonnet 3.7 on complex tasks. For simple coding tasks, they’re competitive.

Hardware Requirements

Model size	Minimum memory	Recommended memory	Performance
7B models	8GB VRAM	16GB	Fast, good quality
13B models	16GB VRAM	24GB	Medium speed, better quality
34B models	32GB VRAM	48GB	Slow without GPU
70B models	64GB+ (e.g. Apple M2 Max or newer)	64GB+ M3 Max / M4 Max	Best open-source quality

Apple Silicon (M1/M2/M3/M4): Apple Silicon Macs are exceptional for local AI — unified memory means even 70B models run well on a 64GB M4 Max.

NVIDIA GPU: RTX 4090 (24GB VRAM) handles up to 34B models at good speed when they are 4-bit quantized. RTX 3090 (24GB) is similar. Consumer GPUs with 8GB VRAM handle 7B models comfortably.

CPU-only: Works for small 7B models, but slow. Not recommended for development use.
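
As a rough way to sanity-check the figures above, weight memory scales with parameter count times bits per weight. The sketch below assumes 4-bit quantization and a 20% allowance for the KV cache and runtime buffers. Both numbers and the function name are illustrative assumptions, not Ollama measurements.

```python
def estimate_model_memory_gb(params_billion: float,
                             bits_per_weight: float = 4.0,
                             overhead: float = 0.20) -> float:
    """Rough memory estimate in GB for a quantized model.

    weights = parameters * bits / 8; `overhead` is an assumed allowance
    for the KV cache and runtime buffers, not a measured value.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb * (1 + overhead), 1)


if __name__ == "__main__":
    for size in (7, 13, 34, 70):
        print(f"{size}B at 4-bit: ~{estimate_model_memory_gb(size)} GB")
```

By this estimate a 4-bit 70B model needs roughly 42GB, which is why it can fit in 64GB of unified memory but not on a single 24GB GPU without offloading part of the model to system RAM.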

Step 1: Install Ollama

# Linux (install script)
curl -fsSL https://ollama.com/install.sh | sh

# macOS and Windows: download the installer from ollama.com

# Verify
ollama --version
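
The CLI check confirms the binary is installed; a script may also want to confirm that the server is answering. A minimal sketch, assuming the default port 11434 and Ollama's /api/version endpoint. The function name is a hypothetical helper, not part of Ollama.

```python
import json
import urllib.error
import urllib.request


def ollama_version(base_url: str = "http://localhost:11434",
                   timeout: float = 2.0):
    """Return the server's version string, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version",
                                    timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply.
        return None


if __name__ == "__main__":
    version = ollama_version()
    print(f"Ollama {version} is running" if version else "Ollama is not reachable")
```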

Step 2: Download a Model

Best models by hardware:

For M3/M4 Max Mac (64GB+) or 48GB+ of NVIDIA VRAM:

# Largest Qwen 2.5 model (72B, general-purpose; Qwen 2.5 Coder tops out at 32B)
ollama pull qwen2.5:72b

# Best general model (70B)
ollama pull llama3.3:70b

For M2 Pro/Max Mac (32GB) / NVIDIA 24GB:

# Excellent quality coding model (32B)
ollama pull qwen2.5-coder:32b

# Good general model (Llama 3.2; the default tag is the small 3B variant)
ollama pull llama3.2:latest

For 8GB VRAM / 16GB RAM:

# Best 7B coding model
ollama pull qwen2.5-coder:7b

# General 8B (Llama 3.2 has no 8B variant; the 8B model is Llama 3.1)
ollama pull llama3.1:8b

Step 3: Test the Model

# Run interactively
ollama run qwen2.5-coder:32b

# Or via API (Ollama runs on port 11434)
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:32b",
  "prompt": "Write a Python function that reads a CSV file and returns the top 5 rows by a specified column",
  "stream": false
}'
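
The same request can be sent from Python using only the standard library. This sketch also handles the streaming case: with "stream": true, Ollama returns one JSON object per line, each carrying a "response" fragment, and the last one marked "done": true. The helper names are illustrative, and the default port is assumed.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, stream: bool = True) -> urllib.request.Request:
    """Build the same POST the curl example sends."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": stream})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def join_stream(lines) -> str:
    """Concatenate the "response" fragments from streamed JSON lines."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def generate(model: str, prompt: str) -> str:
    """Send the prompt to a running Ollama server and return the full reply."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return join_stream(resp)
```

With the server running, generate("qwen2.5-coder:32b", "Write a haiku about CSV files") should return the complete reply as one string.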

Step 4: Connect to Continue (VS Code)

Continue is an open-source IDE extension that supports local models:

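1. Install Continue from the VS Code marketplace
2. Open ~/.continue/config.json (newer Continue releases may use ~/.continue/config.yaml instead, with a different layout)
3. Add your local model: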

{
  "models": [
    {
      "title": "Qwen 2.5 Coder 32B (Local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen 7B Autocomplete",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}

Now Continue’s chat and autocomplete use your local model — no API calls, no cost, no data leaving your machine.
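
A config entry only helps if the model it names has actually been pulled. The sketch below compares the models referenced in a Continue-style config (the JSON shape shown above) against what a running Ollama server reports through /api/tags. The helper names are hypothetical.

```python
import json
import urllib.request


def configured_models(config: dict) -> set:
    """Collect Ollama model names from a Continue-style config dict."""
    entries = list(config.get("models", []))
    autocomplete = config.get("tabAutocompleteModel")
    if autocomplete:
        entries.append(autocomplete)
    return {e["model"] for e in entries if e.get("provider") == "ollama"}


def installed_models(base_url: str = "http://localhost:11434") -> set:
    """Ask a running Ollama server which models are available locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return {m["name"] for m in json.load(resp)["models"]}


def missing_models(config: dict, installed: set) -> set:
    """Models the config references that have not been pulled yet."""
    return configured_models(config) - installed
```

Load the config file with json.load, pass it to missing_models together with installed_models(), and run ollama pull for anything that comes back.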

Step 5: Connect to Aider (Terminal)

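Aider works with Ollama via the --model flag. Aider's documentation recommends setting OLLAMA_API_BASE first so it knows where the server is:

export OLLAMA_API_BASE=http://127.0.0.1:11434
aider --model ollama/qwen2.5-coder:32b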

Or set it as the default:

export AIDER_MODEL=ollama/qwen2.5-coder:32b

Step 6: Connect to Cline (VS Code)

1. Install the Cline extension
2. Open Cline settings
3. Under API Provider, select “Ollama”
4. Set the base URL to http://localhost:11434
5. Select your model

OpenAI-Compatible API

Ollama provides an OpenAI-compatible API endpoint, which means most tools and libraries built for OpenAI’s chat API can be pointed at local Ollama models by changing the base URL:

from openai import OpenAI

# Point to local Ollama instead of OpenAI
client = OpenAI(
    api_key="not-needed",  # Ollama doesn't require a key
    base_url="http://localhost:11434/v1"
)

response = client.chat.completions.create(
    model="qwen2.5-coder:32b",
    messages=[
        {"role": "user", "content": "Explain async/await in Python"}
    ]
)

print(response.choices[0].message.content)
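
If installing the openai package is not an option, the same endpoint can be reached with plain HTTP. This is a minimal sketch of the wire format, assuming the default port and the standard OpenAI-style response shape. The function names are illustrative.

```python
import json
import urllib.request


def chat_request(model: str, user_message: str,
                 base_url: str = "http://localhost:11434/v1") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request aimed at Ollama."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response body."""
    return response["choices"][0]["message"]["content"]


def chat(model: str, user_message: str) -> str:
    """Send one user message to a running Ollama server and return the reply."""
    with urllib.request.urlopen(chat_request(model, user_message)) as resp:
        return extract_reply(json.load(resp))
```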

Best Models by Use Case

Coding: qwen2.5-coder:32b — Qwen 2.5 Coder 32B is specifically optimized for code generation, code completion, and programming questions. Consistently ranks at the top of coding benchmarks for open-source models.

General use: llama3.3:70b — Meta’s Llama 3.3 70B is the best general-purpose model at this size class.

Fastest / smallest: qwen2.5-coder:7b — Very fast for autocomplete. Good enough for most routine coding tasks on any hardware.

Privacy-sensitive (medical, legal): Any of the above — they all run locally with zero data leaving your machine.

Quality Comparison: Qwen 2.5 Coder 32B vs Claude Sonnet 3.7

On simple to medium coding tasks (write a function, debug this code, explain this error): comparable quality. You’d have to look closely to notice a difference.

On complex tasks (design a distributed system, analyze this 2,000-line codebase, complex algorithmic optimization): Claude Sonnet 3.7 is noticeably better. The reasoning quality at scale favors the frontier models.

For most day-to-day development tasks, local Qwen 2.5 Coder is a realistic substitute, especially if:

- Privacy requires local processing
- Cost matters (high-volume usage)
- You’re willing to occasionally step up to Claude API for complex problems

Managing Multiple Models

# List all downloaded models
ollama list

# Remove a model (to free disk space)
ollama rm qwen2.5:72b

# Pull the latest version of a model
ollama pull llama3.3:latest

Models are stored in ~/.ollama/models by default (Linux service installs typically use /usr/share/ollama/.ollama/models) and range from a few GB to tens of GB each. Plan your storage accordingly (70B models are ~40GB).
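
To see which models are taking the most space, the /api/tags endpoint reports a size in bytes for each one. A small sketch, assuming the default port; the helper names are hypothetical.

```python
import json
import urllib.request


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the model list from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return json.load(resp)


def model_sizes_gb(tags: dict) -> dict:
    """Map each model name to its size in GB (1 GB = 10**9 bytes)."""
    return {m["name"]: round(m["size"] / 1e9, 1) for m in tags.get("models", [])}


def largest_first(sizes: dict) -> list:
    """Sort (name, size_gb) pairs so removal candidates come first."""
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)
```

Calling largest_first(model_sizes_gb(fetch_tags())) lists the biggest models first, which makes it easier to decide what to pass to ollama rm.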

Conclusion

Local AI models are now practical for serious development work. The workflow: use local Qwen 2.5 Coder for daily coding tasks (zero cost, private), escalate to Claude API for complex reasoning or architecture work (small cost, maximum quality).

This hybrid approach gives you the best of both — privacy and zero cost for routine work, frontier model quality when it matters.
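
The hybrid workflow can be expressed as a small routing rule. The sketch below is one possible heuristic, not a recommendation from either vendor: the task categories, the prompt-length threshold, and the "claude" label are all assumptions to adjust to your own experience.

```python
LOCAL_MODEL = "qwen2.5-coder:32b"   # served by Ollama
CLOUD_MODEL = "claude"              # placeholder label, not a real model ID

# Assumed heuristics; tune these to your own workload.
COMPLEX_TASKS = {"architecture", "large-refactor", "algorithm-design"}
MAX_LOCAL_PROMPT_CHARS = 20_000


def choose_backend(task_type: str, prompt: str, private: bool = False) -> str:
    """Pick a model label following the hybrid workflow described above."""
    if private:
        # A privacy requirement overrides everything: stay on local hardware.
        return LOCAL_MODEL
    if task_type in COMPLEX_TASKS or len(prompt) > MAX_LOCAL_PROMPT_CHARS:
        return CLOUD_MODEL
    return LOCAL_MODEL
```

For example, choose_backend("bugfix", "fix this off-by-one error") stays local, while choose_backend("architecture", "design a job queue") escalates unless private=True is set.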