AI agent cost optimization: 10 tips to reduce your LLM bill
Practical strategies to cut AI agent costs by 40-60%: caching, prompt compression, model tiering, batch processing, and monitoring.
TL;DR: My first production agent cost ₹12,000/month in API calls. After applying these 10 strategies (semantic caching, prompt compression, model tiering, batching, context budgeting, tool result caching, rate limiting, cost monitoring, open-source models, and pruning unused tools), the same agent runs on ₹4,500/month (a 62% reduction) with zero quality loss.
OpenAI’s prompt caching guide describes how caching frequent input prefixes can significantly reduce both cost and latency. Anthropic’s prompt caching documentation describes a similar mechanism, in which cached reads of a repeated context prefix are billed at a fraction of the normal input-token price.
My first production agent cost ₹12,000/month in API calls. I almost killed the project right there.
The agent was doing legitimate work: processing support tickets, generating reports, automating workflows. But the costs were eating the margins. At ₹12,000/month, the agent was more expensive than the intern it was supposed to replace.
I spent the next month optimizing it. Here’s what I learned.
After applying these 10 strategies, the same agent runs on ₹4,500/month: a 62% reduction. The work output is the same. The quality is the same. The only difference is we stopped wasting tokens.
Key takeaways:
- Semantic caching alone saves 25-35%: cache LLM responses for similar queries
- Model tiering saves 15-25%: cheap model for simple tasks, expensive model for complex ones
- Prompt compression, batching, and context budgeting each save 10-20%
- Composite savings of 60-70% are achievable without quality loss
How much does semantic caching save on LLM costs?
The biggest waste in most agent deployments: answering the same question repeatedly. A semantic cache stores previous LLM responses and returns them for similar queries:
import time

import numpy as np
from openai import OpenAI

class SemanticCache:
    def __init__(self, similarity_threshold=0.92):
        self.cache = []  # [{embedding, prompt, response, cost, timestamp}]
        self.threshold = similarity_threshold
        self.embedding_client = OpenAI()
        self.hits = 0
        self.misses = 0
        self.total_savings = 0.0  # cost of the LLM calls avoided by cache hits

    def get(self, prompt: str) -> dict | None:
        prompt_embedding = self._embed(prompt)
        # Linear scan: fine for a small cache, swap in a vector index at scale
        for entry in self.cache:
            similarity = self._cosine_similarity(prompt_embedding, entry["embedding"])
            if similarity >= self.threshold:
                self.hits += 1
                self.total_savings += entry["cost"]
                return {
                    "response": entry["response"],
                    "cached": True,
                    "similarity": similarity,
                    "savings": entry["cost"]
                }
        self.misses += 1
        return None

    def set(self, prompt: str, response: str, cost: float):
        self.cache.append({
            "embedding": self._embed(prompt),
            "prompt": prompt,
            "response": response,
            "cost": cost,
            "timestamp": time.time()
        })

    def stats(self):
        total = self.hits + self.misses
        return {
            "hit_rate": self.hits / total if total > 0 else 0,
            "total_savings": self.total_savings
        }

    def _embed(self, text: str) -> list[float]:
        response = self.embedding_client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def _cosine_similarity(self, a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
Expected savings: 25-35% for support agents, 15-20% for coding agents. Support queries have high repetition (same billing question from different users). Coding agents produce more unique outputs, so cache hit rates are lower.
Gotcha: Cache invalidation. If your knowledge base changes, cached responses become stale. Solution: add a TTL (time-to-live) to each entry, typically 24 hours for support and 1 hour for time-sensitive queries.
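The TTL suggested above could be sketched roughly as follows. `TTLStore` and its injectable `clock` parameter are hypothetical names, not part of the original `SemanticCache`; the idea is only that entries carry a timestamp and expired ones are dropped before any similarity lookup.

```python
import time


class TTLStore:
    """Hypothetical helper: keeps timestamped entries and drops expired ones."""

    def __init__(self, ttl_seconds: float = 24 * 3600, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self.entries = []   # dicts carrying a "timestamp" key, like SemanticCache.cache

    def add(self, entry: dict) -> None:
        # Stamp the entry with the current time when it is stored
        self.entries.append({**entry, "timestamp": self.clock()})

    def live_entries(self) -> list[dict]:
        """Discard entries older than the TTL, then return what is left."""
        cutoff = self.clock() - self.ttl
        self.entries = [e for e in self.entries if e["timestamp"] >= cutoff]
        return self.entries
```

In a cache like the one above, `get` would iterate over `live_entries()` instead of the raw list, so a stale answer is never returned after the knowledge base has had time to change.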
How does prompt compression reduce token usage?
Long system prompts eat your input token budget. Every token in the system prompt is billed again on every call:
# Before: 750 tokens in system prompt
SYSTEM_PROMPT_VERBOSE = """You are a helpful AI agent that assists users with their tasks.
Your job is to understand what the user wants and help them accomplish it.
You have access to the following tools: read_file, write_file, run_command, search_web.
When you use a tool, make sure to provide the correct arguments.
If you're not sure about something, ask the user for clarification.
..
""" # ~500 more words
# After: 180 tokens in system prompt
SYSTEM_PROMPT_COMPRESSED = """You are a coding agent with tools: read_file, write_file, run_command, search_web.
Rules:
- Provide correct tool arguments
- Ask for clarification when unsure
- One task at a time
- Report results concisely"""
Expected savings: 10-20% from shorter system prompts. More importantly, shorter prompts reduce latency: fewer tokens means faster first-token generation.
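To see why a fixed system prompt adds up, here is a small back-of-the-envelope calculation. The 750 and 180 token counts come from the example above; the call volume (500 calls/day) and the input price (₹0.25 per 1K tokens) are assumptions chosen for illustration, so substitute your own numbers.

```python
def monthly_prompt_cost(prompt_tokens: int, calls_per_day: int,
                        price_per_1k_input: float, days: int = 30) -> float:
    """Cost of resending the same system prompt on every call for a month."""
    return prompt_tokens / 1000 * price_per_1k_input * calls_per_day * days


# Assumed traffic and pricing, for illustration only
CALLS_PER_DAY = 500
PRICE_PER_1K_INPUT_INR = 0.25

verbose = monthly_prompt_cost(750, CALLS_PER_DAY, PRICE_PER_1K_INPUT_INR)
compressed = monthly_prompt_cost(180, CALLS_PER_DAY, PRICE_PER_1K_INPUT_INR)

print(f"verbose prompt:    ₹{verbose:.2f}/month")
print(f"compressed prompt: ₹{compressed:.2f}/month")
print(f"difference:        ₹{verbose - compressed:.2f}/month")
```

Under these assumptions the verbose prompt costs about ₹2,812 a month against about ₹675 for the compressed one, before a single user token is counted.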
How does model tiering cut API costs?
This strategy targets the biggest line item: don’t use your most expensive model for every task:
MODEL_TIERS = {
    "cheap": {
        "model": "claude-3-haiku-20240307",
        "cost_per_call": 0.002,  # $0.002 per call
        "use_for": ["classify", "extract", "simple_qa", "summarize"]
    },
    "standard": {
        "model": "claude-sonnet-4-20250514",
        "cost_per_call": 0.015,  # $0.015 per call
        "use_for": ["generate", "analyze", "code_review"]
    },
    "expensive": {
        "model": "claude-opus-4-20250514",
        "cost_per_call": 0.075,  # $0.075 per call
        "use_for": ["complex_reasoning", "debug", "plan"]
    }
}

def select_model(task_type: str, complexity: str = "low") -> str:
    """Return the model ID for a task; unknown task types fall back to the cheap tier."""
    if complexity == "high" or task_type in MODEL_TIERS["expensive"]["use_for"]:
        return MODEL_TIERS["expensive"]["model"]
    elif task_type in MODEL_TIERS["standard"]["use_for"]:
        return MODEL_TIERS["standard"]["model"]
    else:
        return MODEL_TIERS["cheap"]["model"]
Expected savings: 15-25% overall. In my experience, about 60% of agent tasks are simple enough for Haiku-level models. Only 10% need Opus-level reasoning. The remaining 30% work fine on Sonnet. And sometimes a cheaper model outperforms a frontier one entirely. I ran three months of tests to confirm this.
Real example from my stack:
- Intent classification → Haiku (₹0.03/call)
- Code generation → Sonnet (₹0.30/call)
- Debug analysis → Opus (₹1.50/call)
- Average cost per run: ₹0.50 instead of ₹1.50 if everything used Opus
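As a rough cross-check, the per-call prices in the list above can be combined with the 60/30/10 task split quoted earlier to get a blended cost per call. Applying that split to these three prices is an assumption made here for illustration; a real run may chain several calls, so per-run figures will differ.

```python
# Per-call prices (INR) from the list above
PRICE_PER_CALL = {"haiku": 0.03, "sonnet": 0.30, "opus": 1.50}
# Share of calls routed to each tier (assumed to match the 60/30/10 split)
TASK_MIX = {"haiku": 0.60, "sonnet": 0.30, "opus": 0.10}


def blended_cost(prices: dict, mix: dict) -> float:
    """Weighted average cost per call for a given routing mix."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("routing shares must sum to 1")
    return sum(prices[tier] * share for tier, share in mix.items())


tiered = blended_cost(PRICE_PER_CALL, TASK_MIX)
print(f"tiered:     ₹{tiered:.3f} per call")
print(f"all Sonnet: ₹{PRICE_PER_CALL['sonnet']:.3f} per call")
print(f"all Opus:   ₹{PRICE_PER_CALL['opus']:.3f} per call")
```

With these inputs the blended figure is about ₹0.26 per call. Most of it comes from the 10% of calls that go to Opus, which is why trimming unnecessary Opus usage tends to matter more than squeezing the cheap tier.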
How does batching LLM calls save money?
Many agent workflows make multiple independent LLM calls. If they don’t depend on each other, batch them:
import asyncio
from anthropic import AsyncAnthropic

async_client = AsyncAnthropic()

async def batch_llm_calls(calls: list[dict]) -> list[str]:
    """Execute independent LLM calls in parallel."""
    async def single_call(call):
        response = await async_client.messages.create(**call)
        return response.content[0].text

    # gather() returns results in the same order as the input calls
    return await asyncio.gather(*[single_call(c) for c in calls])

# Before: 3 sequential calls = 3x latency, 3x overhead
# classify = llm.call(..)
# extract = llm.call(..)
# summarize = llm.call(..)

# After: 3 parallel calls = 1x latency, 1x overhead
calls = [
    {"model": "claude-3-haiku-20240307", "max_tokens": 100, "messages": [..]},  # classify
    {"model": "claude-3-haiku-20240307", "max_tokens": 200, "messages": [..]},  # extract
    {"model": "claude-3-haiku-20240307", "max_tokens": 150, "messages": [..]},  # summarize
]
# asyncio.run() is needed when calling from synchronous code
results = asyncio.run(batch_llm_calls(calls))
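Expected savings: 15-20% reduction in both cost and latency in my workloads. Running independent calls concurrently removes the waiting between them and trims per-request overhead such as connection setup. The per-token price itself does not change, so measure the cost side on your own traffic.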
When NOT to batch: If one call depends on another’s output (e.g., you need to classify before you can retrieve), batching doesn’t apply. Only batch truly independent calls.
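The dependent-versus-independent distinction can be sketched like this. `fake_llm` is a hypothetical stand-in for a real model call, used so the example runs without an API key; the shape of the pipeline is what matters.

```python
import asyncio


async def fake_llm(task: str, text: str) -> str:
    """Hypothetical stand-in for an LLM call; returns a labelled string."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"{task}({text})"


async def handle_ticket(ticket: str) -> dict:
    # Step 1 must finish first: retrieval needs the category it produces
    category = await fake_llm("classify", ticket)
    # Step 2: these two calls do not depend on each other, so run them together
    docs, summary = await asyncio.gather(
        fake_llm("retrieve", category),
        fake_llm("summarize", ticket),
    )
    return {"category": category, "docs": docs, "summary": summary}


result = asyncio.run(handle_ticket("refund not received"))
print(result)
```

Only the second stage is parallelized. Forcing `classify` into the same `gather` would mean retrieving before the category exists.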
How does context budgeting reduce costs?
Most agents stuff everything into the context window without thinking about what’s needed. Budget your context:
MAX_CONTEXT_BUDGET = 32000  # tokens

# count_tokens() and summarize_messages() are helpers you supply
def prepare_context(messages, max_context=MAX_CONTEXT_BUDGET):
    """Trim context to fit within budget, prioritizing recent and important messages."""
    # Count tokens in current messages
    total_tokens = count_tokens(messages)
    if total_tokens <= max_context:
        return messages
    # Strategy: keep system prompt, last 2 turns, summarized middle
    system_prompt = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]
    recent = conversation[-4:]  # Last 2 user + 2 assistant
    middle = conversation[:-4]
    # Summarize middle messages
    if middle:
        summary = summarize_messages(middle)
        # Keep only the summary and recent messages
        budget_messages = system_prompt + [
            {"role": "system", "content": f"Previous context: {summary}"}
        ] + recent
        if count_tokens(budget_messages) <= max_context:
            return budget_messages
    # If still over budget, keep only recent messages
    return system_prompt + recent[-2:]  # Just last turn
Expected savings: 10-15% reduction in input tokens per call. More importantly, shorter contexts produce faster responses.
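`prepare_context` relies on `count_tokens` and `summarize_messages`, which the article does not define. The versions below are placeholder assumptions so the function can be exercised: the token count uses a rough four-characters-per-token heuristic rather than a real tokenizer, and the "summary" is plain truncation where a production version would more likely call a cheap model. Both assume each message's `content` is a string.

```python
CHARS_PER_TOKEN = 4        # rough heuristic for English text, not a tokenizer
PER_MESSAGE_OVERHEAD = 4   # assumed allowance for role and formatting tokens


def count_tokens(messages: list[dict]) -> int:
    """Estimate the token count of a message list from character lengths."""
    return sum(
        len(m["content"]) // CHARS_PER_TOKEN + PER_MESSAGE_OVERHEAD
        for m in messages
    )


def summarize_messages(messages: list[dict], max_chars: int = 200) -> str:
    """Placeholder summary: the first max_chars characters of the joined text."""
    joined = " ".join(m["content"] for m in messages)
    return joined[:max_chars]
```

Because the estimate is approximate, it is safer to set the budget somewhat below the model's real limit.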
How does tool result caching lower expenses?
Tool calls are often repeated. The same file is read multiple times. The same API is called with the same parameters:
import hashlib
import json
import time

class ToolResultCache:
    def __init__(self, ttl_seconds=300):  # 5 minute TTL
        self.cache = {}
        self.ttl = ttl_seconds

    def get_or_execute(self, tool_name: str, args: dict, tool_fn):
        cache_key = self._make_key(tool_name, args)
        if cache_key in self.cache:
            entry = self.cache[cache_key]
            # Serve from cache only while the entry is younger than the TTL
            if time.time() - entry["timestamp"] < self.ttl:
                return entry["result"]
        result = tool_fn(**args)
        self.cache[cache_key] = {
            "result": result,
            "timestamp": time.time(),
            "tool": tool_name,
            "args": args
        }
        return result

    def _make_key(self, tool_name, args):
        # sort_keys makes the key independent of argument order
        serialized = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return hashlib.md5(serialized.encode()).hexdigest()
Expected savings: 5-10%. This is situational: high for file-reading agents, low for agents that make unique API calls.
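One detail worth checking in a tool cache is that the key does not depend on argument order. The standalone function below mirrors the `_make_key` logic above (the name `make_key` and the sample arguments are illustrative) and shows that `sort_keys=True` is what makes two equivalent calls hash to the same entry.

```python
import hashlib
import json


def make_key(tool_name: str, args: dict) -> str:
    """Build a stable cache key from the tool name and its arguments."""
    serialized = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
    return hashlib.md5(serialized.encode()).hexdigest()


# Same arguments in a different order produce the same key
a = make_key("read_file", {"path": "app.py", "encoding": "utf-8"})
b = make_key("read_file", {"encoding": "utf-8", "path": "app.py"})
# A different argument value produces a different key
c = make_key("read_file", {"path": "main.py", "encoding": "utf-8"})

print(a == b, a == c)
```

Note that this only works for JSON-serializable arguments. Tools that take file handles or custom objects need their own key function.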
How does rate limiting control agent costs?
Background agents that run on a schedule don’t need instant responses. Rate-limit them to off-peak hours:
import schedule
import time

class RateLimitedAgent:
    def __init__(self, agent_fn, max_calls_per_hour=60):
        self.agent_fn = agent_fn
        self.max_calls = max_calls_per_hour
        self.call_times = []

    def should_throttle(self) -> None:
        """Sleep if the hourly call limit is reached, then record this call."""
        # Clean old entries
        now = time.time()
        self.call_times = [t for t in self.call_times if now - t < 3600]
        if len(self.call_times) >= self.max_calls:
            sleep_time = 3600 - (now - self.call_times[0])
            if sleep_time > 0:
                print(f"Throttling: sleeping {sleep_time:.0f}s")
                time.sleep(sleep_time)
        self.call_times.append(time.time())

    def run(self, *args, **kwargs):
        self.should_throttle()
        return self.agent_fn(*args, **kwargs)

# Schedule non-critical agents for off-peak hours (10 PM - 6 AM IST)
def run_nightly_reports():
    # generate_report is your own report function
    agent = RateLimitedAgent(generate_report, max_calls_per_hour=30)
    # .. process reports

schedule.every().day.at("22:00").do(run_nightly_reports)

# The schedule library only fires jobs when run_pending() is polled
while True:
    schedule.run_pending()
    time.sleep(60)
Expected savings: 10-15%, from capping hourly call volume and moving non-urgent work to off-peak hours, where some providers charge less or avoid burst pricing.
How does cost monitoring reduce spending?
You can’t improve what you don’t measure. Cost tracking is the foundation:
class CostMonitor:
    def __init__(self, daily_budget=500):  # ₹500/day default
        self.daily_budget = daily_budget
        self.daily_cost = 0.0
        self.alerts = []

    def track(self, run_id, model, input_tokens, output_tokens):
        RATES = {
            "claude-sonnet": {"input": 0.25, "output": 1.25},  # per 1K tokens, in INR
            "claude-haiku": {"input": 0.03, "output": 0.15},
            "claude-opus": {"input": 1.50, "output": 7.50},
            "gpt-4o": {"input": 0.20, "output": 0.80},
            "gpt-4o-mini": {"input": 0.01, "output": 0.04},
        }
        # Unknown models are priced at the Sonnet rate
        rate = RATES.get(model, RATES["claude-sonnet"])
        cost = (input_tokens / 1000 * rate["input"]) + (output_tokens / 1000 * rate["output"])
        self.daily_cost += cost
        # Warn once spending passes 80% of the daily budget
        if self.daily_cost > self.daily_budget * 0.8:
            self.alerts.append({
                "type": "budget_warning",
                "cost": self.daily_cost,
                "budget": self.daily_budget,
                "run_id": run_id
            })
        return cost

    def get_report(self):
        return {
            "daily_cost": f"₹{self.daily_cost:.2f}",
            "budget_remaining": f"₹{max(0, self.daily_budget - self.daily_cost):.2f}",
            "alerts_count": len(self.alerts),
            "alerts": self.alerts[-5:]  # Last 5 alerts
        }
Expected savings: The awareness alone saves 5-10%. When you see which runs cost the most, you naturally find optimizations. I found a buggy agent loop this way that was burning ₹200/day in retries.
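Spotting a runaway loop like that is easier when costs are grouped by run rather than only summed per day. The ledger below is a minimal sketch of that idea; `RunCostLedger` and the run IDs are hypothetical, and in practice the `record` call would sit next to `CostMonitor.track`.

```python
from collections import defaultdict


class RunCostLedger:
    """Hypothetical helper: accumulates cost per run ID and ranks the totals."""

    def __init__(self):
        self.cost_by_run = defaultdict(float)

    def record(self, run_id: str, cost: float) -> None:
        # Several calls within one run add up under the same ID
        self.cost_by_run[run_id] += cost

    def top_runs(self, n: int = 3) -> list[tuple[str, float]]:
        """Return the n most expensive runs, highest cost first."""
        ranked = sorted(self.cost_by_run.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:n]


ledger = RunCostLedger()
ledger.record("report-17", 1.5)
ledger.record("ticket-3", 0.25)
for _ in range(3):  # a run that keeps retrying
    ledger.record("retry-loop-9", 4.0)

print(ledger.top_runs(2))
```

A run that retries shows up at the top of the ranking even when each individual call looks cheap.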
How do open-source models save on internal task costs?
For internal tools and batch processing, running open-source models locally can be cheaper than API calls:
# Run Llama 3 70B locally
ollama run llama3:70b
# Or use a cloud GPU
# T4 GPU: ~₹30/hour, serves ~500 requests/hour
# Cost: ₹0.06 per request vs ₹0.30 for Sonnet
Where it works: Internal code review, batch document processing, data extraction, classification at scale.
Where it doesn’t: Customer-facing agents, complex tool use, tasks requiring high reliability.
My setup: API models for customer-facing, Llama 3 70B (via Ollama) for internal batch jobs. This cut my API bill by 20%.
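The ₹0.06 figure above is the GPU's hourly price divided by its throughput, so it only holds when the GPU is kept busy. A quick sketch of that arithmetic, using the ₹30/hour, 500 requests/hour, and ₹0.30 per Sonnet call figures from the comments above (the function names are illustrative):

```python
def cost_per_request(gpu_cost_per_hour: float, requests_per_hour: float) -> float:
    """Effective cost of one request on a GPU billed by the hour."""
    return gpu_cost_per_hour / requests_per_hour


def break_even_requests_per_hour(gpu_cost_per_hour: float,
                                 api_cost_per_request: float) -> float:
    """Hourly volume at which the GPU costs the same as paying per API call."""
    return gpu_cost_per_hour / api_cost_per_request


full_load = cost_per_request(30, 500)
break_even = break_even_requests_per_hour(30, 0.30)

print(f"at full load: ₹{full_load:.2f} per request")
print(f"break-even:   {break_even:.0f} requests/hour")
```

With these numbers the break-even point is roughly 100 requests per hour. Below that volume the hourly GPU is the more expensive option, which is why this approach suits batch jobs better than sporadic traffic.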
How does pruning unused tools reduce agent costs?
Every tool you give an agent increases the prompt size (tools are serialized into the prompt) and adds decision complexity:
# Before: 15 tools, each with detailed descriptions
ALL_TOOLS = [
read_file, write_file, list_directory, search_files,
run_command, install_package, run_tests, build_project,
search_web, fetch_url, scrape_page, call_api,
query_database, send_email, create_ticket
] # ~2,500 tokens of tool definitions
# After: Only the tools this agent uses
CORE_TOOLS = [
read_file, write_file, run_command, search_files
] # ~600 tokens of tool definitions
Expected savings: 5-10% reduction in input tokens per call. More importantly: fewer tools means fewer wrong tool choices. The agent doesn’t accidentally call send_email when it meant write_file.
Related: AI agent error handling patterns: retry strategies, cost spikes, and graceful degradation for production agents.
If you can only implement two strategies today, make it semantic caching and model tiering. Those two alone will save 40-50% with zero quality impact. Add the rest as your agent scales and you identify specific cost drivers. Most importantly, track your costs before and after: the numbers will tell you which optimization to tackle next.
FAQ
How much can I realistically reduce AI agent costs? 40-60% reduction is achievable for most production agents. Semantic caching alone saves 25-35%. Model tiering saves 15-25%. Prompt compression adds 10-20%. The savings compound: applying all strategies together can reduce costs by 60-70% from baseline.
What’s the single most effective cost optimization? Semantic caching: cache LLM responses for similar queries. For a support agent handling billing questions, the same question comes in from different users dozens of times. A semantic cache catches repeated intents and returns cached responses, saving 25-35% on API costs immediately.
How does model tiering work for cost savings? Route simple tasks (classification, extraction, intent detection) to cheap models like Claude Haiku or GPT-4o-mini (around $0.002 per call in the tiers shown earlier). Route complex tasks (code generation, multi-step reasoning) to stronger models like Claude Sonnet or GPT-4o (around $0.015 per call for Sonnet in the same tiers). This saves 15-25% overall with zero quality loss.
Will using open-source models save me money? For internal tools and batch processing, yes. A rented T4 GPU costs about ₹30/hour, which is cheaper than API calls for high-volume internal tasks. Size the hardware to the model, though: Llama 3 70B needs roughly 40 GB of memory even at 4-bit quantization, more than a single T4’s 16 GB, so it calls for a larger GPU or a smaller model. For customer-facing agents, open-source models often lack tool-calling quality. A hybrid approach works best: open-source for internal, API models for customer-facing.
Related Posts
- How to build your first AI agent in 2026 (tutorial). A step-by-step tutorial from scratch, building the core loop and tools
- AI agent error handling patterns. Retry strategies, cost spike prevention, and graceful degradation for production agents
- AI agent deployment guide. From localhost to production: containerization, monitoring, and cost control
- Best Open-Source LLMs for Coding 2026. DeepSeek, Qwen, Gemma, and Llama compared for coding tasks
This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents. Reach me at hello@agenticup.dev