What should I log for debugging agent errors?

Every LLM API call with token counts and latencies, every tool call with inputs and outputs, every routing decision with the state snapshot, and every retry or fallback with the reason. Log to a structured format (JSON) for queryability.

AI agent error handling patterns

A practical guide to error handling for AI agents: retry strategies, fallback behaviors, cost spike prevention, and graceful degradation patterns.

OpenAI’s best practices for GPT applications cover prompt engineering patterns that reduce errors, including the structured output and retry strategies recommended in this guide.

TL;DR: Your agent will fail: the difference between demo and production is how you handle failures. This post covers 6 battle-tested patterns: retry with exponential backoff, model fallback, circuit breakers, cost caps, graceful degradation, and structured logging. Implement these to cut your incident rate from weekly to monthly.

Your agent will fail. Not sometimes. Regularly.

I learned this the hard way. My first production agent ran beautifully in testing. In production, it failed within an hour: infinite loop on a tool call, burned through ₹800 in API costs before I noticed.

The difference between a demo agent and a production agent isn’t the model or the prompt. It’s the error handling. (Though picking the right model helps: a consistent model causes fewer retries.) A demo agent works when everything goes right. A production agent works because it handles everything going wrong.

Here are the error handling patterns I’ve battle-tested across shipping agents for clients. These patterns cut my incident rate from weekly to monthly.

Key takeaways:

Three categories of agent failures: LLM API errors, tool execution failures, and logic/reasoning errors

Retry with exponential backoff and model fallback for API errors

Circuit breaker pattern prevents cascading failures

Structured logging with context is essential for debugging

What three types of failures do AI agents face?

Every agent failure I’ve seen falls into one of three categories:

1. LLM API errors

The API is down. You hit a rate limit. The model is overloaded. Your request timed out.

These are the easiest to handle because they’re predictable. Every LLM provider has documented error codes and rate limits.

2. Tool execution failures

Your agent tries to read a file that doesn’t exist. The API it calls returns a 500. The database query times out. The shell command fails.

These are harder because the agent has to interpret the error and decide what to do next.

3. Logic/reasoning errors

The agent loops on the same tool call. It misreads the tool output and picks the wrong branch. It hallucinates a tool that doesn’t exist. It goes off on a tangent and never comes back to the task.

These are the hardest to catch because nothing technically fails: the agent just produces wrong or useless output.
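One way to make the looping case visible is to fingerprint every tool call and flag the run when the same call keeps coming back. The following is a minimal sketch, not a complete safeguard: `LoopDetector` and `max_repeats` are hypothetical names, and it assumes tool arguments can be serialized to JSON.

```python
import json
from collections import Counter


class LoopDetector:
    """Flags a run when the same tool is called with identical arguments too often."""

    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats  # hypothetical threshold; tune per agent
        self.seen = Counter()

    def record(self, tool_name, tool_input):
        # sort_keys makes the fingerprint independent of argument order
        key = tool_name + ":" + json.dumps(tool_input, sort_keys=True, default=str)
        self.seen[key] += 1
        # True means this exact call has now been made max_repeats times in this run
        return self.seen[key] >= self.max_repeats
```

Call `record()` before executing each tool and stop the run (or inject a corrective message) when it returns True. It counts repeats across the whole run, not only back-to-back ones, which also catches A-B-A-B cycles.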

How does retry with exponential backoff work?

For LLM API errors, the simplest and most effective pattern is exponential backoff with jitter:

import random
import time

import anthropic
from anthropic import Anthropic

# Rate limits, server errors, and overloaded responses
RETRYABLE_STATUS = {429, 500, 502, 503, 529}

def call_llm_with_retry(client, messages, max_retries=3, base_delay=1.0, max_delay=60.0):
    # Note: the Anthropic SDK also retries some errors on its own
    # (the client's max_retries setting), so the two layers stack.
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=4096,
                messages=messages,
            )
        except Exception as e:
            error_code = getattr(e, "status_code", 0)

            # Retry rate limits, server errors, and timeouts. Anything else
            # (400, 401, 403, 404, ...) is a client error a retry won't fix.
            retryable = error_code in RETRYABLE_STATUS or isinstance(
                e, (TimeoutError, anthropic.APITimeoutError)
            )
            if not retryable or attempt == max_retries - 1:
                raise

            # Exponential backoff, capped, plus up to 0.5s of jitter
            delay = min(base_delay * (2 ** attempt), max_delay) + random.uniform(0, 0.5)
            print(f"API error {error_code or type(e).__name__}, retrying in {delay:.1f}s "
                  f"(attempt {attempt + 1}/{max_retries})")
            time.sleep(delay)

Expected savings: In my experience, this pattern alone absorbs about 90% of transient API errors. Rate limits almost always clear within 3 attempts.
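To see what the backoff formula actually produces, it helps to print the schedule with jitter switched off. This is a small sketch under assumptions: `backoff_delay` is a hypothetical helper, and the 60-second cap is taken from the FAQ at the end of this post.

```python
import random


def backoff_delay(attempt, base_delay=1.0, max_delay=60.0, jitter=0.5, rng=random):
    """Delay in seconds before retry number `attempt` (0-based)."""
    exponential = base_delay * (2 ** attempt)
    capped = min(exponential, max_delay)  # never wait longer than max_delay
    return capped + rng.uniform(0, jitter)


# Jitter disabled to show the deterministic part of the schedule
schedule = [backoff_delay(a, jitter=0.0) for a in range(8)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

With only 3 attempts the cap never kicks in (the waits are roughly 1s and 2s), but it matters as soon as you raise `max_retries` for long-running batch jobs.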

How does model fallback handle API failures?

Some errors are model-specific. The model might be overloaded, or a specific model might consistently fail on a particular task. Fall back to a different model:

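MODEL_PRIORITY = [
    "claude-sonnet-4-20250514",
    "claude-3-haiku-20240307",
    "gpt-4o",
    "gpt-4o-mini",
]

def call_with_fallback(client, messages, tools=None):
    # Assumes `client` is a provider-agnostic gateway that exposes an
    # Anthropic-style messages.create(). The stock Anthropic client
    # cannot serve the gpt-* entries.
    errors = []
    # Only send `tools` when there are some; an explicit None may be rejected
    extra = {"tools": tools} if tools else {}

    for model in MODEL_PRIORITY:
        try:
            return client.messages.create(
                model=model,
                max_tokens=4096,
                messages=messages,
                **extra,
            )
        except Exception as e:
            errors.append(f"{model}: {str(e)[:100]}")
            print(f"Model {model} failed, falling back...")

    raise RuntimeError("All models failed:\n" + "\n".join(errors))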

When to use: This is critical for high-availability agents. If you’re running a customer-facing agent, you can’t have it go down because Claude is having an outage. Fall back to GPT-4o and keep running.

Cost implication: Your fallback model might be more expensive or less capable. Track which model was used and log any fallback events for later review.
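Tracking which model answered can be as simple as returning it alongside the response. The sketch below is one possible shape, not the only one: `providers` is a hypothetical list of `(model_name, call_fn)` pairs, where each `call_fn` wraps one provider's SDK and accepts the message list.

```python
def call_with_fallback_tracked(providers, messages):
    """Try each (model_name, call_fn) in order and report which one answered."""
    fallback_events = []
    for model_name, call_fn in providers:
        try:
            response = call_fn(messages)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            fallback_events.append({"model": model_name, "error": str(exc)[:100]})
            continue
        return {
            "model_used": model_name,
            "response": response,
            "fallback_events": fallback_events,  # empty when the first model worked
        }
    raise RuntimeError(f"All models failed: {fallback_events}")
```

A non-empty `fallback_events` list is the signal to log: it tells you both that a fallback happened and why the preferred model was skipped.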

How does a circuit breaker protect production agents?

The circuit breaker pattern prevents a failing agent from repeatedly hitting the same error. After N consecutive failures, the circuit “opens” and subsequent calls fail fast without hitting the LLM:

import time

class AgentCircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # seconds
        self.failure_count = 0
        self.last_failure_time = None
        self.state = "closed"  # closed, open, half-open

    def call(self, agent_fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                self.state = "half-open"
                print("Circuit breaker: half-open, trying one request")
            else:
                raise RuntimeError("Circuit breaker is open. Agent is unavailable.")

        try:
            result = agent_fn(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()

            # A failed half-open probe reopens the circuit immediately
            if self.state == "half-open" or self.failure_count >= self.failure_threshold:
                self.state = "open"
                print(f"Circuit breaker: OPEN after {self.failure_count} failures")
            raise

        # Any success resets the count, so only consecutive failures trip the breaker
        if self.state == "half-open":
            print("Circuit breaker: closed again")
        self.state = "closed"
        self.failure_count = 0
        return result

Expected savings: A circuit breaker caught a bug in one of my agents where a malformed tool response was causing the agent to retry every 30 seconds. Without the breaker, that agent would have run ₹200/day in failed attempts. With it, it failed fast and I got alerted within minutes.

How do cost caps prevent runaway agent spending?

Cost spikes are the silent killer of agent deployments. An agent that loops without producing useful output can burn through your API budget before you notice:

class CostTracker:
    def __init__(self, max_cost_per_run=100):  # ₹100 max per run
        self.total_cost = 0.0
        self.max_cost_per_run = max_cost_per_run
        self.per_run_cost = {}

    def track(self, run_id, input_tokens, output_tokens, model="claude-sonnet"):
        # Approximate pricing in INR per 1K tokens
        rates = {
            "claude-sonnet": {"input": 0.25, "output": 1.25},
            "claude-haiku": {"input": 0.03, "output": 0.15},
            "gpt-4o": {"input": 0.20, "output": 0.80},
            "gpt-4o-mini": {"input": 0.01, "output": 0.04},
        }

        # Unknown models are billed at the claude-sonnet rate
        rate = rates.get(model, rates["claude-sonnet"])
        cost = (input_tokens / 1000 * rate["input"]) + (output_tokens / 1000 * rate["output"])

        self.total_cost += cost

        # Track per-run cost
        self.per_run_cost[run_id] = self.per_run_cost.get(run_id, 0.0) + cost

        # Abort the run once it exceeds the per-run limit
        if self.per_run_cost[run_id] > self.max_cost_per_run:
            raise RuntimeError(
                f"Run {run_id} exceeded cost limit of ₹{self.max_cost_per_run}. "
                f"Current cost: ₹{self.per_run_cost[run_id]:.2f}"
            )

        return cost

Four cost guards I use on every production agent:

Per-run token budget: max 50,000 tokens per agent run
Circuit breaker: stops after 3 consecutive failures
Concurrent run cap: max 5 concurrent agent executions
Daily cost alert: email/Slack if daily cost exceeds ₹500

These four guards have stopped every cost spike I’ve encountered in the last 6 months.
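The token budget and the concurrency cap from that list can live in one small helper. This is a sketch under assumptions: `RunGuards` and its method names are hypothetical, the limits mirror the numbers above, and it only guards runs inside a single process.

```python
import threading


class RunGuards:
    """Per-run token budget plus a cap on concurrent runs (hypothetical helper)."""

    def __init__(self, max_tokens_per_run=50_000, max_concurrent_runs=5):
        self.max_tokens_per_run = max_tokens_per_run
        self._slots = threading.BoundedSemaphore(max_concurrent_runs)
        self._tokens = {}
        self._lock = threading.Lock()

    def start_run(self, run_id):
        # Non-blocking acquire: reject the run instead of queueing it
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("Too many concurrent agent runs")
        with self._lock:
            self._tokens[run_id] = 0

    def add_tokens(self, run_id, input_tokens, output_tokens):
        with self._lock:
            self._tokens[run_id] += input_tokens + output_tokens
            used = self._tokens[run_id]
        if used > self.max_tokens_per_run:
            raise RuntimeError(f"Run {run_id} exceeded its token budget: {used}")
        return used

    def end_run(self, run_id):
        # Call exactly once per started run, ideally in a finally block
        with self._lock:
            self._tokens.pop(run_id, None)
        self._slots.release()
```

Rejecting extra runs rather than queueing them is a deliberate choice here: a buggy caller that spawns runs in a loop then fails loudly instead of building an expensive backlog.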

How does graceful degradation keep agents running?

When a tool fails, the agent needs to decide what to do next. Should it retry? Try a different tool? Report the failure to the user? The answer depends on the tool and the context:

def safe_tool_call(tool_fn, *args, context=None, **kwargs):
    """Execute a tool call with graceful degradation."""
    try:
        result = tool_fn(*args, **kwargs)
        return {
            "success": True,
            "result": result,
            "error": None
        }
    except FileNotFoundError:
        return {
            "success": False,
            "result": None,
            "error": {
                "type": "file_not_found",
                "message": f"File not found: {args}",
                "suggested_fix": "Check if the path is correct and the file exists",
                "degradation": "skip"  # Skip this tool and continue
            }
        }
    except PermissionError:
        return {
            "success": False,
            "result": None,
            "error": {
                "type": "permission_denied",
                "message": f"No permission to access: {args}",
                "suggested_fix": "Check file permissions or use a different path",
                "degradation": "fallback"  # Try alternative approach
            }
        }
    except ConnectionError:
        return {
            "success": False,
            "result": None,
            "error": {
                "type": "network_error",
                "message": "Could not connect to service",
                "suggested_fix": "Check network connectivity or retry later",
                "degradation": "retry_later"  # Can't proceed without this tool
            }
        }
    except Exception as e:
        return {
            "success": False,
            "result": None,
            "error": {
                "type": "unexpected_error",
                "message": str(e)[:200],
                "suggested_fix": "Check logs for details",
                "degradation": "report"  # Report to user
            }
        }

The key insight: return a structured error object that the LLM can understand and act on. The degradation field tells the LLM how to proceed:

skip: this tool failed, but the agent can continue without it
fallback: try a different approach or tool
retry_later: this step is essential but can be retried
report: critical failure, inform the user
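The four degradation values can also drive the agent loop directly, so the decision does not rest on the LLM alone. A minimal sketch, assuming the result shape returned by `safe_tool_call`; the function name `next_step` and the action strings it returns are hypothetical.

```python
def next_step(tool_result, retries_left):
    """Map a safe_tool_call-style result to what the agent loop should do next."""
    if tool_result["success"]:
        return "continue"

    degradation = tool_result["error"]["degradation"]
    if degradation == "skip":
        return "continue"  # carry on without this tool's output
    if degradation == "fallback":
        return "ask_llm_for_alternative"
    if degradation == "retry_later":
        # Essential step: retry while budget remains, then give up loudly
        return "retry" if retries_left > 0 else "abort_and_report"
    return "abort_and_report"  # "report" and anything unrecognized
```

Treating unknown values as `abort_and_report` keeps the loop fail-safe if a new degradation type is added to the tool wrapper but not to the dispatcher.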

What should I log for debugging production agents?

You can’t debug agent failures without logs. But agent logs are different from regular application logs: they need to capture the decision-making process:

import json
import os
from datetime import datetime

class AgentLogger:
    def __init__(self, agent_name, run_id):
        self.agent_name = agent_name
        self.run_id = run_id
        self.events = []

    def log_llm_call(self, model, messages, response, latency_ms):
        # `messages` is accepted for call-site symmetry but not stored (it can be large)
        self.events.append({
            "type": "llm_call",
            "timestamp": datetime.now().isoformat(),
            "model": model,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "latency_ms": latency_ms,
            "has_tool_calls": any(b.type == "tool_use" for b in response.content),
            "run_id": self.run_id,
        })

    def log_tool_call(self, tool_name, args, result, latency_ms, success):
        self.events.append({
            "type": "tool_call",
            "timestamp": datetime.now().isoformat(),
            "tool": tool_name,
            "args": args,
            "result_truncated": str(result)[:500],
            "latency_ms": latency_ms,
            "success": success,
            "run_id": self.run_id,
        })

    def log_routing_decision(self, from_node, to_node, reason):
        self.events.append({
            "type": "routing",
            "timestamp": datetime.now().isoformat(),
            "from": from_node,
            "to": to_node,
            "reason": reason,
            "run_id": self.run_id,
        })

    def flush(self):
        # Write to file or send to logging service
        os.makedirs("logs", exist_ok=True)
        path = f"logs/agent_{self.agent_name}_{self.run_id}.jsonl"
        with open(path, "a") as f:
            for event in self.events:
                # default=str keeps non-serializable tool args from crashing the logger
                f.write(json.dumps(event, default=str) + "\n")
        self.events.clear()  # so a second flush doesn't write duplicates

What to log in every agent run:

Every LLM API call: model, tokens, latency, whether tools were called
Every tool call: tool name, arguments, result (truncated), success/failure
Every routing decision: which node is next and why
Every retry or fallback: what failed and what the recovery action was
Start and end timestamps: to calculate total cost and duration

With these logs, you can replay any agent run and understand exactly what happened. Without them, debugging is guessing.
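A first pass at replaying a run can be a short summarizer over the JSONL file. This sketch assumes the event fields written by the `AgentLogger` above (`type`, `input_tokens`, `output_tokens`, `tool`, `success`); `summarize_run` is a hypothetical helper name.

```python
import json
from collections import Counter


def summarize_run(jsonl_lines):
    """Summarize one run from an iterable of JSONL strings (one event per line)."""
    summary = {
        "events": Counter(),
        "input_tokens": 0,
        "output_tokens": 0,
        "failed_tools": [],
    }
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        event = json.loads(line)
        summary["events"][event["type"]] += 1
        if event["type"] == "llm_call":
            summary["input_tokens"] += event["input_tokens"]
            summary["output_tokens"] += event["output_tokens"]
        elif event["type"] == "tool_call" and not event["success"]:
            summary["failed_tools"].append(event["tool"])
    return summary
```

Because an open file iterates line by line, usage is `with open(path) as f: summarize_run(f)`. The list of failed tools, in order, is usually the fastest pointer to where a run went wrong.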

How do these error handling patterns compose?

Here’s the skeleton of a production agent with all error handling patterns applied:

import time

from anthropic import Anthropic

class ProductionAgent:
    def __init__(self, agent_name):
        self.agent_name = agent_name
        self.llm_client = Anthropic()
        self.circuit_breaker = AgentCircuitBreaker()
        self.cost_tracker = CostTracker()

    def run(self, task, run_id):
        logger = AgentLogger(self.agent_name, run_id)

        try:
            return self.circuit_breaker.call(self._execute_agent, task, logger)
        except Exception as e:
            logger.log_routing_decision("agent", "failed", str(e)[:200])
            return {
                "status": "failed",
                "error": str(e)[:500],
                "run_id": run_id,
                "logs": f"logs/agent_{self.agent_name}_{run_id}.jsonl"
            }
        finally:
            logger.flush()  # always persist the trace, success or failure

    def _execute_agent(self, task, logger):
        # The core agent loop with error handling.
        # execute_tool(name, input) is your own tool dispatcher (not shown), and
        # the LLM call must be given your tool definitions for tools to be used.
        messages = [{"role": "user", "content": task}]

        for turn in range(10):
            start = time.time()
            response = call_llm_with_retry(self.llm_client, messages)
            latency = (time.time() - start) * 1000

            logger.log_llm_call("claude-sonnet", messages, response, latency)
            # run_id comes from the logger; it is not a parameter of this method
            self.cost_tracker.track(logger.run_id, response.usage.input_tokens, response.usage.output_tokens)

            messages.append({"role": "assistant", "content": response.content})

            tool_uses = [b for b in response.content if b.type == "tool_use"]
            if not tool_uses:
                return response.content[0].text

            tool_results = []
            for tool_use in tool_uses:
                start = time.time()
                result = safe_tool_call(execute_tool, tool_use.name, tool_use.input)
                latency = (time.time() - start) * 1000

                logger.log_tool_call(tool_use.name, tool_use.input, result, latency, result["success"])

                if not result["success"] and result["error"]["degradation"] == "report":
                    return {"status": "partial", "error": result["error"]["message"]}

                # Every tool_use needs a matching tool_result, even a failed or
                # skipped one, or the next API call is rejected
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": tool_use.id,
                    "content": str(result),
                    "is_error": not result["success"],
                })

            messages.append({"role": "user", "content": tool_results})

        return {"status": "max_turns_reached"}

What is the debugging workflow for production agents?

When an agent fails in production, here’s my process:

Check the logs first. Find the run_id, read the JSONL file, and trace the execution path. What did the LLM decide? Which tool was called? What did it return?
Reproduce the failure. Run the same input against your agent in development. Is it deterministic or does the LLM respond differently each time?
Add guardrails. Based on what went wrong, add one of the patterns above: a retry, a cost cap, a structured error handler.
Monitor the fix. Watch the next 100 runs to confirm the pattern works. If the failure doesn’t reoccur within 100 runs, the fix is probably solid.
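For the reproduction step, a quick way to answer "is it deterministic?" is to run the same input several times and count distinct outcomes. A rough sketch, assuming `agent_fn` is any callable that takes the task and returns a result; `check_determinism` is a hypothetical helper, and comparing by `repr` is a deliberately crude equality check.

```python
from collections import Counter


def check_determinism(agent_fn, task, runs=5):
    """Run the same task several times and report how many distinct outcomes appear."""
    outcomes = Counter()
    for _ in range(runs):
        try:
            outcomes[repr(agent_fn(task))] += 1
        except Exception as exc:
            # Group failures by exception type so flaky errors stand out
            outcomes[f"error: {type(exc).__name__}"] += 1
    return {"deterministic": len(outcomes) == 1, "outcomes": dict(outcomes)}
```

If the outcomes split between a result and an error, the failure is intermittent, and a retry or fallback is likely a better guardrail than a prompt change.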

Related: AI agent cost optimization: 10 tips to reduce your LLM bill: strategies for keeping agent costs under control.

Related: How to build an AI customer support agent (that works): error handling patterns for customer support agents including escalation logic and confidence thresholds.

FAQ

What are the most common types of AI agent failures? Three main categories: LLM API errors (rate limits, timeouts, model overloaded), tool execution failures (network errors, invalid inputs, permission denied), and logic/reasoning errors (infinite loops, hallucinated tool calls, wrong branch decisions).

How do I prevent cost spikes from a buggy agent loop? Set a per-run token budget (max 50,000 tokens), add a circuit breaker that stops retries after 3 consecutive failures, cap concurrent agent runs, and set up alerts for any single run exceeding ₹100 ($1.20). These four guards have stopped every cost spike I’ve encountered.

What’s the best retry strategy for LLM API calls? Exponential backoff with jitter: start at 1s, multiply by 2 each retry, cap at 60s, and add up to 500ms of random jitter. Make up to 3 attempts for 429 (rate limit) and 5xx server errors such as 503 (service unavailable). Don’t retry 400 (bad request) or 401 (auth) errors.

How do I handle tool execution failures gracefully? Wrap every tool call in try/except and return a structured error object with type, message, suggested_fix, and degradation fields. Let the LLM decide whether to retry with modified arguments, try an alternative tool, or report the failure to the user.

AI agent logging and monitoring. How to log, trace, and monitor agent runs for debugging production failures
The policy gate every agent needs before production. Fail-closed policy checks that prevent unauthorized tool calls and cost spikes
Build a state machine for your AI agent in a weekend. The 6-state FSM with error handlers and circuit breakers for reliable agents
AI agent multi-step workflows. Error handling across sequential chains, parallel execution, and conditional branching

Start with logging

If you only implement one pattern from this article, make it structured logging. You can't fix what you can't see. Add logging today, add the rest as you encounter each failure mode. Every production agent I've built started with logging and grew the error handling patterns organically as each failure type appeared.

This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents. Reach me at hello@agenticup.dev