Claude’s API is powerful but not free. Whether you’re building a product, running experiments, or automating workflows, API costs can grow faster than expected. This guide covers the techniques that actually move the needle.
Understand Your Current Spend First
Before optimizing, measure. Log token usage and an estimated cost for every call:
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Your message"}],
)

# Log token usage. The rates are Sonnet's $3 (input) and $15 (output) per million tokens.
usage = response.usage
estimated_cost = (usage.input_tokens * 3 + usage.output_tokens * 15) / 1_000_000
print(f"Input tokens: {usage.input_tokens}")
print(f"Output tokens: {usage.output_tokens}")
print(f"Estimated cost: ${estimated_cost:.4f}")
After a week of logging, you’ll know which calls are expensive and why.
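To see which calls dominate the bill, the inline estimate can be pulled into a helper that accumulates spend per call site. This is a minimal sketch: `PRICES_PER_MTOK`, `estimate_cost`, and `SpendLog` are hypothetical names, and the rates are the Sonnet prices quoted in this guide, which should be checked against the current pricing page.

```python
from collections import defaultdict

# USD per million tokens as (input, output). Assumed values taken from this
# guide; verify them before relying on the totals.
PRICES_PER_MTOK = {
    "claude-3-7-sonnet-20250219": (3.00, 15.00),
    "claude-3-5-sonnet-20241022": (3.00, 15.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one call."""
    input_price, output_price = PRICES_PER_MTOK[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


class SpendLog:
    """Accumulates estimated spend per call site (a label you choose)."""

    def __init__(self) -> None:
        self.totals = defaultdict(float)

    def record(self, label: str, model: str, input_tokens: int, output_tokens: int) -> None:
        self.totals[label] += estimate_cost(model, input_tokens, output_tokens)

    def most_expensive(self, n: int = 5) -> list:
        """Return the n labels with the highest accumulated spend."""
        return sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

After each request, call `record` with a label such as the feature name plus the values from `response.usage`, then review `most_expensive()` at the end of the logging period.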
Strategy 1: Use the Right Model
The single biggest lever. Claude models have very different price points:
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Use when |
|---|---|---|---|
| Claude Haiku 4.5 | $1 | $5 | Simple tasks, high-volume |
| Claude Sonnet 3.5 | $3 | $15 | Most tasks |
| Claude Sonnet 3.7 | $3 | $15 | Complex coding/reasoning |
| Claude Opus 4.7 | $15 | $75 | Maximum capability needed |
Rule of thumb: Use the least powerful model that produces acceptable quality. For classification, summarization, data extraction, and simple Q&A, Haiku is often indistinguishable from Sonnet in quality at one-third of the per-token price.
Routing pattern: Build a router that sends simple requests to Haiku and complex requests to Sonnet:
def select_model(task_complexity: str) -> str:
    """Map a complexity label to the cheapest model that handles it well."""
    if task_complexity == "simple":
        return "claude-haiku-4-5-20251001"
    elif task_complexity == "medium":
        return "claude-3-5-sonnet-20241022"
    else:
        return "claude-3-7-sonnet-20250219"
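The router needs a complexity label from somewhere. One cheap option is a surface-level heuristic, sketched below. `classify_complexity`, the keyword list, and the 200-word threshold are all assumptions made for illustration, not tested values; tune them on your own traffic or replace them with a small classifier.

```python
# Hypothetical keyword hints; adjust to the kinds of requests you receive.
COMPLEX_HINTS = ("refactor", "prove", "debug", "architecture", "step by step")


def classify_complexity(prompt: str) -> str:
    """Return "simple", "medium", or "complex" from crude surface features."""
    text = prompt.lower()
    if any(hint in text for hint in COMPLEX_HINTS):
        return "complex"
    # Word count is a rough proxy for how much the model has to read.
    if len(text.split()) > 200:
        return "medium"
    return "simple"
```

The returned label can be passed straight to `select_model`, whose final branch handles anything that is neither "simple" nor "medium".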
Strategy 2: Use Prompt Caching
Prompt caching is the most impactful cost-reduction technique for applications with repeated system prompts or large context documents. Cached tokens are billed at 10% of the normal input price when they are read on later requests; writing them to the cache costs 25% more than a normal input token.
If your system prompt is 5,000 tokens and you make 1,000 requests/day:
- Without caching: 5,000 × 1,000 × $3/1M = $15/day just in system prompt tokens
- With caching: the first request pays the cache-write price (1.25× input) and each later cache hit pays 10%: ~$1.50/day, provided requests arrive often enough to keep the cache warm
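The arithmetic above can be sketched as a small function that also accounts for cache misses. It assumes a cache write costs 1.25× the base input price and a cache read costs 0.1×; `daily_prefix_cost` and its parameters are hypothetical names.

```python
from typing import Optional

WRITE_MULTIPLIER = 1.25  # assumed premium for writing the prefix to the cache
READ_MULTIPLIER = 0.10   # assumed price of a cache hit relative to base input


def daily_prefix_cost(prefix_tokens: int, requests: int, price_per_mtok: float,
                      cache_writes: Optional[int] = None) -> float:
    """Daily USD cost of a prompt prefix that is repeated on every request.

    cache_writes=None means caching is off. Otherwise it is the number of
    requests per day that miss the cache and have to write it again.
    """
    base = prefix_tokens * price_per_mtok / 1_000_000  # one uncached pass
    if cache_writes is None:
        return base * requests
    reads = requests - cache_writes
    return base * (cache_writes * WRITE_MULTIPLIER + reads * READ_MULTIPLIER)


print(f"no cache:      ${daily_prefix_cost(5_000, 1_000, 3.0):.2f}")
print(f"1 cache write: ${daily_prefix_cost(5_000, 1_000, 3.0, cache_writes=1):.2f}")
print(f"50 writes:     ${daily_prefix_cost(5_000, 1_000, 3.0, cache_writes=50):.2f}")
```

Raising `cache_writes` models bursty traffic in which the cache expires between requests, which is where the real-world figure drifts above the best case.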
Implementing prompt caching:
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Your long system prompt here...",
            "cache_control": {"type": "ephemeral"},  # Cache this block
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)
The `cache_control` marker tells the API to cache the prompt prefix up to and including this block. The cache entry lives for 5 minutes by default, and each cache hit resets that timer.
When caching helps most:
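- Long system prompts (1,024+ tokens; shorter prefixes fall below the minimum cacheable length on Sonnet models, and some models require more)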
- Applications that inject the same large document into every request
- RAG systems where the same retrieved content is used across requests
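To confirm that caching is working in one of these scenarios, inspect the usage block on each response. The sketch below assumes the usage object exposes `cache_read_input_tokens` and `cache_creation_input_tokens` alongside `input_tokens`, with `input_tokens` counting only the uncached part of the prompt. `cache_hit_ratio` is a hypothetical helper, and the example numbers are made up.

```python
from types import SimpleNamespace


def cache_hit_ratio(usage) -> float:
    """Fraction of one request's input tokens that were served from the cache."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    total = read + written + usage.input_tokens
    return read / total if total else 0.0


# Stand-in for response.usage, with invented numbers for illustration.
example_usage = SimpleNamespace(
    input_tokens=120,
    cache_creation_input_tokens=0,
    cache_read_input_tokens=5000,
)
print(f"{cache_hit_ratio(example_usage):.1%} of input tokens came from the cache")
```

A ratio that stays near zero across requests usually means the cached prefix is changing between calls or the gap between requests exceeds the cache lifetime.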
Strategy 3: Reduce Context Window Bloat
Input tokens are often wasted on content Claude doesn’t need. Common bloat sources:
Over-injecting context: If you’re doing RAG (retrieval-augmented generation), inject only the most relevant retrieved chunks, not all of them.
# Instead of injecting all 20 retrieved chunks:
chunks = retrieve_all(query) # 20 chunks × 500 tokens = 10,000 tokens
# Inject only the most relevant 3-5:
chunks = retrieve_top_k(query, k=4) # 4 chunks × 500 tokens = 2,000 tokens
Including full conversation history unnecessarily: For long conversations, summarize older turns instead of including the full text.
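A minimal sketch of that idea: keep the most recent turns verbatim and hand the older ones to a summarizer. `compact_history` is a hypothetical helper, and `summarize` is any callable you supply (for example, a call to a cheaper model); neither comes from the SDK.

```python
from typing import Callable


def compact_history(messages: list, keep_last: int,
                    summarize: Callable[[list], str]) -> tuple:
    """Split a conversation into (summary_of_older_turns, recent_messages)."""
    if len(messages) <= keep_last:
        return "", list(messages)
    split = len(messages) - keep_last
    older, recent = messages[:split], messages[split:]
    # The first message sent to the API should come from the user, so any
    # leading assistant turns are moved into the part that gets summarized.
    while recent and recent[0]["role"] != "user":
        older.append(recent.pop(0))
    return summarize(older), recent
```

The summary string can be appended to the system prompt and `recent` sent as `messages`, so the request carries a short summary instead of the full early transcript.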
Redundant instructions: Remove boilerplate from system prompts that doesn’t affect output quality. Test which instructions actually matter.
Strategy 4: Optimize Output Length
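Output tokens cost five times as much as input tokens on every model in the table above, so trimming responses pays off quickly. Control response length with explicit instructions:
# Add to your system prompt:
system_prompt = "Respond concisely. Avoid unnecessary preamble, repeated information, or filler phrases."

# Or cap the length with max_tokens. This is a hard limit: a longer response is
# cut off (stop_reason == "max_tokens"), not rewritten to fit.
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=512,
    messages=[{"role": "user", "content": user_message}],
)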
For structured tasks (data extraction, classification), use structured output formats that require fewer words:
# Instead of "Classify the sentiment of this review..."
# which gets a paragraph explanation...
# Use: "Classify sentiment. Reply with only: POSITIVE, NEGATIVE, or NEUTRAL"
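When the reply is constrained to a fixed set of labels, it is worth validating it before use, since a model can still add stray punctuation or ignore the format. A minimal sketch, where `ALLOWED_LABELS` and `parse_label` are hypothetical names:

```python
ALLOWED_LABELS = {"POSITIVE", "NEGATIVE", "NEUTRAL"}


def parse_label(text: str):
    """Return the label if the reply is exactly one allowed label, else None."""
    cleaned = text.strip().strip(".").upper()
    return cleaned if cleaned in ALLOWED_LABELS else None
```

A `None` result is the signal to retry or fall back to a more explicit prompt. Because a valid reply is a single word, this pattern also works with a small `max_tokens` value.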
Strategy 5: Batch Processing
If you have many independent items to process, batch them in a single request rather than calling the API once per item.
Less efficient (N API calls):
results = []
for item in items:  # 100 items = 100 API calls
    result = client.messages.create(...)
    results.append(result)
More efficient (1-10 API calls):
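def chunks(seq, size):
    """Yield consecutive slices of seq with at most size elements."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

# Batch 20 items per call
for batch in chunks(items, 20):
    batch_prompt = f"Process each of these items: {batch}"
    result = client.messages.create(...)
    # Parse the batch result
Batching reduces total tokens because you pay for the system prompt and instructions once per batch instead of once per item. The saving grows with the ratio of shared prompt to per-item content: 30-50% is realistic when the shared prompt is comparable in size to each item, and much less when the items dominate. Anthropic also offers a Batch API with 50% discounts for non-real-time workloads.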
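Batching only pays off if the combined response can be split back into per-item results reliably. One way to do that, sketched here under the assumption that the model is asked for a JSON array, is to number the items and check the count on the way back. `build_batch_prompt` and `parse_batch_response` are hypothetical helpers.

```python
import json


def build_batch_prompt(items: list) -> str:
    """Number the items and ask for one JSON string per item, in order."""
    numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))
    return (
        "Process each numbered item below. Reply with only a JSON array "
        f"containing exactly {len(items)} strings, one result per item, in order.\n\n"
        + numbered
    )


def parse_batch_response(text: str, expected: int) -> list:
    """Parse the JSON array and fail loudly if the item count is off."""
    results = json.loads(text)
    if not isinstance(results, list) or len(results) != expected:
        raise ValueError(f"expected {expected} results, got {results!r}")
    return results
```

If parsing raises `ValueError` (which also covers malformed JSON), retrying that batch with a smaller size is a reasonable fallback.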
Strategy 6: Use the Anthropic Batch API
For non-real-time workloads (data processing, report generation, bulk analysis), the Batch API offers 50% discounts in exchange for asynchronous processing, with results returned within 24 hours:
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"request-{i}",
            "params": {
                "model": "claude-3-7-sonnet-20250219",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": item}],
            },
        }
        for i, item in enumerate(items)
    ]
)
If your task doesn’t need real-time response, always use the Batch API. 50% off is significant at scale.
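Submitting a batch is only half of the workflow; the results have to be collected once processing ends. The sketch below polls the batch and maps each `custom_id` to its response text. It assumes the Python SDK's `batches.retrieve` and `batches.results` methods and the `processing_status` and `result.type` fields behave as used here. `collect_batch_results` and its `sleep` parameter are hypothetical.

```python
import time


def collect_batch_results(client, batch_id: str, poll_seconds: float = 60.0,
                          sleep=time.sleep) -> dict:
    """Wait for a batch to end, then map custom_id -> response text (or None)."""
    while client.messages.batches.retrieve(batch_id).processing_status != "ended":
        sleep(poll_seconds)
    outputs = {}
    for entry in client.messages.batches.results(batch_id):
        if entry.result.type == "succeeded":
            outputs[entry.custom_id] = entry.result.message.content[0].text
        else:
            # Errored, canceled, or expired: keep the id so it can be retried.
            outputs[entry.custom_id] = None
    return outputs
```

With the batch created above, `collect_batch_results(client, batch.id)` returns a dictionary keyed by `request-0`, `request-1`, and so on. Entries with a `None` value are candidates for resubmission.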
Strategy 7: Cache Results at Your Layer
For repeated identical inputs, cache the Claude response in your application:
import hashlib
import json

response_cache = {}

def call_claude_cached(messages: list, model: str) -> str:
    # Key on the model as well as the messages, so a response cached from one
    # model is never returned for a request to another.
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    cache_key = hashlib.sha256(payload.encode()).hexdigest()
    if cache_key in response_cache:
        return response_cache[cache_key]
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=messages,
    )
    result = response.content[0].text
    response_cache[cache_key] = result
    return result
For production, use Redis or another persistent cache. Even 20-30% cache hit rates significantly reduce costs.
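As a stand-in for Redis, the sketch below uses SQLite from the standard library so that cached responses survive restarts. It is an illustration of the persistent-cache idea, not a production design: `SqliteResponseCache` is a hypothetical class, and it has no expiry or size limit.

```python
import hashlib
import json
import sqlite3


class SqliteResponseCache:
    """A tiny persistent cache keyed on the model plus the messages."""

    def __init__(self, path: str = "claude_cache.db") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS responses (key TEXT PRIMARY KEY, body TEXT)"
        )

    @staticmethod
    def make_key(model: str, messages: list) -> str:
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, model: str, messages: list):
        """Return the cached response text, or None on a miss."""
        row = self.db.execute(
            "SELECT body FROM responses WHERE key = ?",
            (self.make_key(model, messages),),
        ).fetchone()
        return row[0] if row else None

    def put(self, model: str, messages: list, body: str) -> None:
        self.db.execute(
            "INSERT OR REPLACE INTO responses VALUES (?, ?)",
            (self.make_key(model, messages), body),
        )
        self.db.commit()
```

Check `get` before calling the API and call `put` after a successful response. Swapping in Redis later only changes the storage calls, not the key scheme.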
Strategy 8: Evaluate Quality at Lower Cost
Before committing expensive calls to production, validate that cheaper models produce acceptable quality for your use case.
Run 100 examples through both Haiku and Sonnet, compare quality scores, and only upgrade to Sonnet when Haiku’s quality drops below your threshold.
def evaluate_model_for_task(task_examples: list):
    # call_claude and evaluate_quality are placeholders for your own API
    # wrapper and scoring function (scores between 0 and 1).
    haiku_scores = []
    sonnet_scores = []
    for example in task_examples:
        haiku_response = call_claude(example, model="claude-haiku-4-5-20251001")
        sonnet_response = call_claude(example, model="claude-3-7-sonnet-20250219")
        # Your quality evaluation logic here
        haiku_scores.append(evaluate_quality(haiku_response, example["expected"]))
        sonnet_scores.append(evaluate_quality(sonnet_response, example["expected"]))
    print(f"Haiku: {sum(haiku_scores)/len(haiku_scores):.2%}")
    print(f"Sonnet: {sum(sonnet_scores)/len(sonnet_scores):.2%}")
For many tasks, Haiku is within 5% of Sonnet’s quality at roughly a third of the cost. That’s worth checking before defaulting to Sonnet for everything.
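Once the mean scores are in hand, the upgrade decision can be made mechanical. The sketch below picks the cheapest model that clears a quality threshold. `pick_cheapest_acceptable` is a hypothetical helper, and the scores and relative costs in the usage note are invented for illustration.

```python
def pick_cheapest_acceptable(mean_scores: dict, relative_cost: dict, threshold: float):
    """Return the cheapest model whose mean score meets the threshold, or None."""
    acceptable = [model for model, score in mean_scores.items() if score >= threshold]
    if not acceptable:
        return None
    return min(acceptable, key=lambda model: relative_cost[model])
```

For example, with scores `{"haiku": 0.91, "sonnet": 0.95}`, costs `{"haiku": 1, "sonnet": 3}`, and a threshold of 0.90, the function returns `"haiku"`. Raising the threshold to 0.93 returns `"sonnet"`.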
Expected Savings by Strategy
| Strategy | Potential savings |
|---|---|
| Model downgrading | 50-80% |
| Prompt caching | Up to 90% (on cached input tokens) |
| Context reduction | 20-40% |
| Output length control | 10-30% |
| Batching | 30-50% |
| Batch API | 50% |
| Application caching | 20-40% (depends on repetition) |
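These stack, but multiplicatively rather than additively: each saving applies to whatever spend the previous ones left, and some target only input or only output tokens. A well-optimized application using all strategies can reduce API costs by 70-90% compared with a naive implementation with the same functionality.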
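A quick way to sanity-check a combined estimate is to treat each saving as a fraction of whatever spend remains. The sketch below makes that simplifying assumption, namely that every saving applies to the same pool of spend. Savings that touch only input or only output tokens will compound less than this suggests. `combined_savings` is a hypothetical helper.

```python
from math import prod


def combined_savings(savings: list) -> float:
    """Overall fractional saving when each saving applies to what is left."""
    return 1 - prod(1 - s for s in savings)


# A 50% saving followed by a 30% saving leaves 0.5 * 0.7 = 35% of the bill.
print(f"{combined_savings([0.5, 0.3]):.0%}")
```

Two savings of 50% and 30% therefore combine to 65%, not 80%. This is why the overall range above tops out below the sum of the individual rows.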