Claude (the model from Anthropic) cannot be run locally — it’s a cloud service. But several open-source alternatives are now close enough in quality for many development tasks, and they run completely on your hardware with zero API costs.
This guide covers running local AI models with Ollama and connecting them to your development workflow.
Why Run Local AI Models?
Privacy: Your code and conversations never leave your machine. Important for work on proprietary codebases or in regulated industries.
Cost: No per-token charges. High-volume workflows that would cost hundreds of dollars a month through an API cost only hardware and electricity when run locally.
Speed: For smaller models, local inference can be faster than API round-trips, especially with a modern GPU.
Offline capability: Works without internet connection.
What you lose vs. Claude API: The best open-source models (Llama 3.3 70B, Qwen 2.5 72B) are genuinely capable but still below Claude Sonnet 3.7 on complex tasks. For simple coding tasks, they’re competitive.
Hardware Requirements
| Model size | Minimum memory (VRAM or unified) | Recommended memory | Performance |
|---|---|---|---|
| 7B models | 8GB VRAM | 16GB | Fast, good quality |
| 13B models | 16GB VRAM | 24GB | Medium speed, better quality |
| 34B models | 32GB VRAM | 48GB | Slow without GPU |
| 70B models | 64GB+ or Apple M2 Max+ | M3 Max/M4 | Best open-source quality |
Apple Silicon (M1/M2/M3/M4): Apple Silicon Macs are exceptional for local AI — unified memory means even 70B models run well on a 64GB M4 Max.
NVIDIA GPU: RTX 4090 (24GB VRAM) handles up to 34B models at good speed. RTX 3090 (24GB) is similar. Consumer GPUs with 8GB VRAM handle 7B models comfortably.
CPU-only: Works for small 7B models, but slow. Not recommended for development use.
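The numbers in the table follow from simple arithmetic: the weights take roughly parameters × bits per weight ÷ 8 bytes, plus some headroom for the context cache. Here is a rough sketch of that estimate. The 4.5 bits per weight (in the range of common 4-bit quantizations) and the 20% overhead are assumptions for illustration, not figures published by Ollama, and `estimate_model_gb` is a hypothetical helper.

```python
def estimate_model_gb(params_billion: float,
                      bits_per_weight: float = 4.5,
                      overhead_fraction: float = 0.2) -> float:
    """Rough memory estimate (GB) for running a quantized model.

    bits_per_weight and overhead_fraction are assumed values; adjust
    them for your quantization level and context length.
    """
    # billions of parameters * bits per weight / 8 bits per byte = GB of weights
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb * (1 + overhead_fraction), 1)


# Print an estimate for each size class in the table
for size in (7, 13, 34, 70):
    print(f"{size}B: ~{estimate_model_gb(size)} GB")
```

With the overhead set to zero, a 70B model comes out just under 40GB, which lines up with the ~40GB download size of 70B models.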
Step 1: Install Ollama
# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh
# Or download from ollama.com
# Verify
ollama --version
Step 2: Download a Model
Best models by hardware:
For a 64GB M3/M4 Max Mac or 48GB+ of NVIDIA VRAM:
# Largest Qwen 2.5 model (72B, general-purpose; Qwen 2.5 Coder tops out at 32B)
ollama pull qwen2.5:72b
# Best general model (70B)
ollama pull llama3.3:70b
For a 32GB Mac / NVIDIA 24GB:
# Excellent quality coding model (32B)
ollama pull qwen2.5-coder:32b
# Small, fast general model (Llama 3.2 defaults to 3B)
ollama pull llama3.2:latest
For 8GB VRAM / 16GB RAM:
# Best 7B coding model
ollama pull qwen2.5-coder:7b
# General 8B (the llama3.2 tag has no 8B variant; the 8B model is Llama 3.1)
ollama pull llama3.1:8b
Step 3: Test the Model
# Run interactively
ollama run qwen2.5-coder:32b
# Or via API (Ollama runs on port 11434)
curl http://localhost:11434/api/generate -d '{
"model": "qwen2.5-coder:32b",
"prompt": "Write a Python function that reads a CSV file and returns the top 5 rows by a specified column",
"stream": false
}'
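The same request can be made from Python with only the standard library. This is a minimal sketch, assuming Ollama is listening on the default port shown above; `build_generate_request` and `generate` are hypothetical helper names, and the 300-second timeout is an arbitrary choice to allow for slow first loads of large models.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        body = json.load(reply)
    # With "stream": false the whole completion arrives in one "response" field
    return body["response"]
```

Calling `print(generate("qwen2.5-coder:32b", "Write a haiku about CSV files"))` with the server running prints the model's answer.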
Step 4: Connect to Continue (VS Code)
Continue is an open-source IDE extension that supports local models:
- Install Continue from the VS Code marketplace
- Open ~/.continue/config.json
- Add your local model:
{
  "models": [
    {
      "title": "Qwen 2.5 Coder 32B (Local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen 7B Autocomplete",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}
Now Continue’s chat and autocomplete use your local model — no API calls, no cost, no data leaving your machine.
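A common stumbling block is referencing a model in the config that was never pulled. The sketch below compares the config against what the local server reports through its `/api/tags` endpoint. It assumes the `config.json` layout shown above; the function names are hypothetical and not part of Continue or Ollama.

```python
import json
import urllib.request
from pathlib import Path


def models_in_config(config: dict) -> set[str]:
    """Collect the Ollama model tags referenced by a Continue config."""
    entries = list(config.get("models", []))
    if "tabAutocompleteModel" in config:
        entries.append(config["tabAutocompleteModel"])
    return {e["model"] for e in entries if e.get("provider") == "ollama"}


def installed_models(base_url: str = "http://localhost:11434") -> set[str]:
    """Ask the local Ollama server which models have been pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as reply:
        return {m["name"] for m in json.load(reply)["models"]}


def missing_models(config: dict, installed: set[str]) -> set[str]:
    """Models named in the config that the server does not have."""
    return models_in_config(config) - installed


def main() -> None:
    # Requires a running Ollama server and an existing config file
    path = Path("~/.continue/config.json").expanduser()
    config = json.loads(path.read_text())
    for name in sorted(missing_models(config, installed_models())):
        print(f"Not pulled yet: ollama pull {name}")
```

Run `main()` after editing the config; no output means every referenced model is available locally.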
Step 5: Connect to Aider (Terminal)
Aider works with Ollama via the --model flag:
aider --model ollama/qwen2.5-coder:32b
Or set it as the default:
export AIDER_MODEL=ollama/qwen2.5-coder:32b
Step 6: Connect to Cline (VS Code)
- Install Cline extension
- Go to Cline settings
- Under API Provider, select “Ollama”
- Set base URL to http://localhost:11434
- Select your model
OpenAI-Compatible API
Ollama provides an OpenAI-compatible API endpoint, which means anything that works with OpenAI’s API also works with local Ollama models:
from openai import OpenAI

# Point the client at local Ollama instead of OpenAI
client = OpenAI(
    api_key="not-needed",  # the client requires a value, but Ollama ignores it
    base_url="http://localhost:11434/v1",
)

response = client.chat.completions.create(
    model="qwen2.5-coder:32b",
    messages=[
        {"role": "user", "content": "Explain async/await in Python"}
    ],
)
print(response.choices[0].message.content)
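Because the endpoint follows the OpenAI wire format, you can also talk to it without the SDK, which makes it easier to see what is happening when you stream tokens. The following is a sketch under the assumption that the server sends OpenAI-style `data: {...}` lines ending with `data: [DONE]`; `extract_delta` and `stream_chat` are hypothetical names.

```python
import json
import urllib.request
from typing import Iterator, Optional


def extract_delta(line: str) -> Optional[str]:
    """Return the text fragment carried by one streamed line, if any."""
    if not line.startswith("data: "):
        return None  # blank keep-alive lines and anything unexpected
    data = line[len("data: "):].strip()
    if data == "[DONE]":
        return None  # end-of-stream marker
    chunk = json.loads(data)
    choices = chunk.get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


def stream_chat(model: str, prompt: str,
                base_url: str = "http://localhost:11434/v1") -> Iterator[str]:
    """Yield text fragments from a streaming chat completion."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        for raw in reply:  # the response object iterates line by line
            text = extract_delta(raw.decode("utf-8"))
            if text:
                yield text
```

A loop such as `for piece in stream_chat("qwen2.5-coder:32b", "Explain async/await in Python"): print(piece, end="", flush=True)` prints the answer as it is generated.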
Best Models by Use Case
Coding: qwen2.5-coder:32b — Qwen 2.5 Coder 32B is specifically optimized for code generation, code completion, and programming questions. Consistently ranks at the top of coding benchmarks for open-source models.
General use: llama3.3:70b — Meta’s Llama 3.3 70B is the best general-purpose model at this size class.
Fastest / smallest: qwen2.5-coder:7b — Very fast for autocomplete. Good enough for most routine coding tasks on any hardware.
Privacy-sensitive (medical, legal): Any of the above — they all run locally with zero data leaving your machine.
Quality Comparison: Qwen 2.5 Coder 32B vs Claude Sonnet 3.7
On simple to medium coding tasks (write a function, debug this code, explain this error): comparable quality. You’d have to look closely to notice a difference.
On complex tasks (design a distributed system, analyze this 2,000-line codebase, complex algorithmic optimization): Claude Sonnet 3.7 is noticeably better. The reasoning quality at scale favors the frontier models.
For most day-to-day development tasks, local Qwen 2.5 Coder is a realistic substitute, especially if:
- Privacy requires local processing
- Cost matters (high-volume usage)
- You’re willing to occasionally step up to Claude API for complex problems
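If you want to make that split explicit in your own tooling, a small routing function is enough. This is only an illustrative sketch: the task labels, the `Task` class, the `claude-api` placeholder, and the 2,000-line threshold (borrowed from the example above) are all assumptions, not part of any tool mentioned in this guide.

```python
from dataclasses import dataclass

LOCAL_MODEL = "qwen2.5-coder:32b"  # served by Ollama
REMOTE_MODEL = "claude-api"        # placeholder label, not a real model ID

# Hypothetical task labels based on the complex-task examples above
COMPLEX_KINDS = {"system-design", "codebase-analysis", "algorithm-optimization"}


@dataclass
class Task:
    kind: str
    context_lines: int = 0
    must_stay_local: bool = False  # e.g. proprietary or regulated code


def pick_model(task: Task, max_local_lines: int = 2000) -> str:
    """Choose local or remote inference for a task."""
    # Privacy wins over quality: never send restricted code off the machine
    if task.must_stay_local:
        return LOCAL_MODEL
    if task.kind in COMPLEX_KINDS or task.context_lines >= max_local_lines:
        return REMOTE_MODEL
    return LOCAL_MODEL
```

Checking the privacy flag first means a regulated codebase stays local even when the task would otherwise be escalated.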
Managing Multiple Models
# List all downloaded models
ollama list
# Remove a model (to free disk space)
ollama rm llama3.3:70b
# Pull the latest version of a model
ollama pull llama3.3:latest
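Models are stored in ~/.ollama/models (or /usr/share/ollama/.ollama/models when Ollama runs as a Linux system service) and range from a few GB to tens of GB each. Plan your storage accordingly: 70B models are about 40GB.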
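To see where the disk space is going, you can read the same information `ollama list` shows from the server's `/api/tags` endpoint. A minimal sketch, assuming the default port; `summarize_models` and `fetch_tags` are hypothetical helper names.

```python
import json
import urllib.request


def summarize_models(tags: dict) -> list[tuple[str, float]]:
    """Return (name, size in GB) pairs, largest first, from an /api/tags reply."""
    rows = [(m["name"], round(m["size"] / 1e9, 1)) for m in tags.get("models", [])]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of pulled models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as reply:
        return json.load(reply)
```

With the server running, `for name, gb in summarize_models(fetch_tags()): print(f"{gb:6.1f} GB  {name}")` lists the biggest models first, which are the best candidates for `ollama rm`.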
Conclusion
Local AI models are now practical for serious development work. The workflow: use local Qwen 2.5 Coder for daily coding tasks (zero cost, private), escalate to Claude API for complex reasoning or architecture work (small cost, maximum quality).
This hybrid approach gives you the best of both — privacy and zero cost for routine work, frontier model quality when it matters.