AI agent logging and monitoring: seeing inside your agent
How to log, trace, and monitor AI agents in production: what to track, which tools to use, and how to debug agents that behave unexpectedly.
The LangSmith documentation shows how tracing, logging, and monitoring integrate with LangGraph agents, and it covers the observability patterns recommended in this post.
TL;DR: Agent observability isn’t optional: it’s the difference between shipping with confidence and shipping with hope. This guide covers what to log (every LLM call, tool execution, decision, and error), how to structure it as JSON Lines, and how to debug agents by replaying logged runs. You can implement a working setup in one hour.
The first time I deployed an agent to production, it worked perfectly for three days. Then a user asked it a question that triggered a six-minute loop of the same tool call, racking up $12 in API costs before I noticed.
I had no logs. No idea what happened. No way to replay the run.
That experience taught me something fundamental: agent observability is not optional. It’s the difference between shipping with confidence and shipping with hope.
Here’s a complete guide to agent logging and monitoring: what to log, why, and how to build it without spending weeks on tooling.
Key takeaways:
- Log every LLM call (input, output, cost, latency), every tool execution (name, params, result, duration), every decision, and every error
- Structured JSON logging to rotating files covers 90% of debugging needs: you don’t need a fancy platform to start
- Track: cost per session, error rate, latency percentiles, tool success rate, agent loop count
- The most powerful debugging technique: replay a logged agent run step by step
- A working logging setup takes 1 hour to implement: do it before your first production deployment
What to log
Everything. But in a structured way. Here’s the minimum set of events you should log for every agent run:
import json
import logging
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional

@dataclass
class LLMCallEvent:
    event_type: str = "llm_call"
    agent_id: str = ""
    session_id: str = ""
    step: int = 0
    model: str = ""
    input_messages: Optional[list] = None  # The messages sent to the LLM
    output_content: str = ""               # The model's text response
    tool_calls: Optional[list] = None      # [{name, args}] if any
    cost: float = 0.0                      # Cost of this call
    prompt_tokens: int = 0
    completion_tokens: int = 0
    latency_ms: int = 0
    timestamp: str = ""

@dataclass
class ToolExecutionEvent:
    event_type: str = "tool_execution"
    agent_id: str = ""
    session_id: str = ""
    step: int = 0
    tool_name: str = ""
    parameters: Optional[dict] = None
    result: str = ""                       # Truncated to 500 chars
    duration_ms: int = 0
    success: bool = True
    error: str = ""
    timestamp: str = ""

@dataclass
class AgentDecisionEvent:
    event_type: str = "agent_decision"
    agent_id: str = ""
    session_id: str = ""
    step: int = 0
    decision: str = ""                       # What the agent decided
    reasoning: str = ""                      # Why it decided this
    context_snapshot: Optional[dict] = None  # Key context at decision point
    timestamp: str = ""
Every event gets logged as a single JSON line. This is important. JSON Lines format means each line is self-contained, easy to stream, and easy to query with grep or jq.
import json
import logging
import os
from datetime import datetime, timezone

# Structured JSON logging
class StructuredLogger:
    def __init__(self, log_dir: str = "agent_logs"):
        os.makedirs(log_dir, exist_ok=True)  # FileHandler fails if the directory is missing
        self.logger = logging.getLogger("agent")
        if not self.logger.handlers:  # avoid duplicate lines if constructed twice
            handler = logging.FileHandler(f"{log_dir}/agent.log", mode="a")
            handler.setFormatter(logging.Formatter("%(message)s"))  # raw JSON, no prefix
            self.logger.addHandler(handler)
        self.logger.setLevel(logging.INFO)

    def log_event(self, event: dict):
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
        event["timestamp"] = datetime.now(timezone.utc).isoformat()
        self.logger.info(json.dumps(event, default=str))

    def log_llm_call(self, **kwargs):
        self.log_event({"event_type": "llm_call", **kwargs})

    def log_tool_call(self, **kwargs):
        self.log_event({"event_type": "tool_execution", **kwargs})

    def log_decision(self, **kwargs):
        self.log_event({"event_type": "agent_decision", **kwargs})
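The takeaways mention rotating files, while the logger above appends to a single file. A minimal sketch of a size-based rotating variant follows, using the standard library's `logging.handlers.RotatingFileHandler`. The function name `make_rotating_logger`, the 10 MB limit, and the five backups are illustrative assumptions, not values from this post.

```python
import json
import logging
import os
from logging.handlers import RotatingFileHandler


def make_rotating_logger(log_dir: str = "agent_logs",
                         max_bytes: int = 10 * 1024 * 1024,
                         backup_count: int = 5) -> logging.Logger:
    """Return a logger that writes raw lines and rotates agent.log by size.

    The size limit and backup count are assumed defaults; tune them to your volume.
    """
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger("agent.rotating")
    if not logger.handlers:  # attach the handler only once per process
        handler = RotatingFileHandler(
            os.path.join(log_dir, "agent.log"),
            maxBytes=max_bytes,
            backupCount=backup_count,
            encoding="utf-8",
        )
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep JSON lines out of the root logger's output
    return logger
```

Calling `make_rotating_logger().info(json.dumps(event))` then writes one JSON line per event, and older files are kept as `agent.log.1`, `agent.log.2`, and so on.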
How do I add logging to my AI agent?
The cleanest way: wrap your agent loop with a logger that intercepts every significant event.
import json
import time
from uuid import uuid4

class LoggedAgent:
    def __init__(self, agent, logger: StructuredLogger):
        self.agent = agent
        self.logger = logger
        self.session_id = str(uuid4())

    async def run(self, user_input: str) -> str:
        self.logger.log_decision(
            agent_id=self.agent.id,
            session_id=self.session_id,
            step=0,
            decision="start",
            reasoning="Agent received user input",
            context_snapshot={"input": user_input[:200]}
        )
        step = 0
        while True:
            step += 1
            # _logged_llm_call (not shown) wraps the LLM client: it times the
            # call and logs an llm_call event with cost, tokens, and latency
            response = await self._logged_llm_call(step, self.agent.messages)
            if response.tool_calls:
                # Log each tool call, whether it succeeds or fails
                for tc in response.tool_calls:
                    t_start = time.time()
                    try:
                        # _execute_tool (not shown) runs the tool and is expected
                        # to append its result to self.agent.messages
                        result = await self._execute_tool(tc)
                        success = True
                        error = ""
                    except Exception as e:
                        result = {"error": str(e)}
                        success = False
                        error = str(e)
                    self.logger.log_tool_call(
                        agent_id=self.agent.id,
                        session_id=self.session_id,
                        step=step,
                        tool_name=tc.name,
                        parameters=tc.args,
                        result=json.dumps(result, default=str)[:500],
                        duration_ms=int((time.time() - t_start) * 1000),
                        success=success,
                        error=error
                    )
            else:
                self.logger.log_decision(
                    agent_id=self.agent.id,
                    session_id=self.session_id,
                    step=step,
                    decision="return_result",
                    reasoning="Model returned text response, no tool calls",
                    context_snapshot={"response_preview": response.content[:200]}
                )
                return response.content
Don't log full message histories in every LLM call event: they get huge fast. Log message summaries (first 200 chars of user messages, first 500 chars of assistant responses) and provide a replay_id that lets you reconstruct the full conversation from step-by-step logs.
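One way to build those summaries is sketched below. It assumes messages are dicts with `role` and `content` keys, and the `replay_id` format (`session_id:step`) is a hypothetical choice made for illustration. The post does not prescribe either.

```python
import json

USER_PREVIEW_CHARS = 200       # limit for user (and other non-assistant) messages
ASSISTANT_PREVIEW_CHARS = 500  # limit for assistant responses


def summarize_messages(messages: list) -> list:
    """Return truncated previews instead of full message bodies.

    Assumes each message is a dict with "role" and "content" keys.
    """
    summaries = []
    for m in messages:
        role = m.get("role", "unknown")
        content = str(m.get("content", ""))
        limit = ASSISTANT_PREVIEW_CHARS if role == "assistant" else USER_PREVIEW_CHARS
        summaries.append({
            "role": role,
            "preview": content[:limit],
            "full_length": len(content),  # shows how much was cut
        })
    return summaries


def build_llm_call_event(session_id: str, step: int, messages: list) -> dict:
    """Build a compact llm_call event; the replay_id format is an assumption."""
    return {
        "event_type": "llm_call",
        "session_id": session_id,
        "step": step,
        "replay_id": f"{session_id}:{step}",
        "input_messages": summarize_messages(messages),
    }
```

Recording `full_length` next to each preview makes it obvious during debugging when a truncated message hid something relevant.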
What metrics to track
Logging individual events is for debugging. Metrics are for understanding system health over time.
Here are the metrics I track for every agent in production:
| Metric | What it measures | Alert threshold |
|---|---|---|
| Cost per session | API cost for one complete agent run | > $0.50 |
| Cost per user | Total cost divided by active users | Weekly trend up |
| Error rate | Failed tool calls / total tool calls | > 5% |
| Latency p50 | Median time per agent run | > 15s |
| Latency p95 | Slowest 5% of agent runs | > 45s |
| Latency p99 | Worst-case agent runs | > 120s |
| Tool success rate | Successful tool executions / total | < 95% |
| Loop iterations | Number of LLM calls per agent run | > 15 |
| Stuck agents | Agents running > 30 iterations | Any |
Computing metrics from logs
import json
from collections import defaultdict
from statistics import median

class MetricsCalculator:
    # _load_session_events and _load_time_window (not shown) read the JSON Lines
    # file and return a list of event dicts
    def __init__(self, log_file: str):
        self.log_file = log_file

    def compute_session_metrics(self, session_id: str) -> dict:
        events = self._load_session_events(session_id)
        llm_calls = [e for e in events if e["event_type"] == "llm_call"]
        tool_calls = [e for e in events if e["event_type"] == "tool_execution"]
        decisions = [e for e in events if e["event_type"] == "agent_decision"]
        total_cost = sum(e.get("cost", 0) for e in llm_calls)
        total_latency = sum(e.get("latency_ms", 0) for e in llm_calls)
        tool_failures = [e for e in tool_calls if not e.get("success", True)]
        # A session with no tool calls counts as 100% successful
        success_rate = (
            round((len(tool_calls) - len(tool_failures)) / len(tool_calls) * 100, 1)
            if tool_calls else 100.0
        )
        return {
            "session_id": session_id,
            "total_steps": len(llm_calls),
            "total_cost": round(total_cost, 4),
            "total_latency_ms": total_latency,
            "tool_calls": len(tool_calls),
            "tool_failures": len(tool_failures),
            "tool_success_rate": success_rate,
            "decisions_made": len(decisions),
        }

    def compute_aggregate_metrics(self, time_window_hours: int = 24) -> dict:
        events = self._load_time_window(time_window_hours)
        # Group by session
        sessions = defaultdict(list)
        for e in events:
            sessions[e.get("session_id")].append(e)
        session_costs = []
        session_latencies = []
        session_steps = []
        for sid, session_events in sessions.items():
            llm_calls = [e for e in session_events if e["event_type"] == "llm_call"]
            total_cost = sum(e.get("cost", 0) for e in llm_calls)
            total_latency = sum(e.get("latency_ms", 0) for e in llm_calls)
            session_costs.append(total_cost)
            session_latencies.append(total_latency)
            session_steps.append(len(llm_calls))
        latencies_sorted = sorted(session_latencies)
        n = len(latencies_sorted)
        return {
            "total_sessions": len(sessions),
            "avg_cost": round(sum(session_costs) / len(session_costs), 4) if session_costs else 0,
            "p50_latency_ms": median(session_latencies) if session_latencies else 0,
            "p95_latency_ms": latencies_sorted[int(n * 0.95)] if n > 0 else 0,
            "p99_latency_ms": latencies_sorted[int(n * 0.99)] if n > 0 else 0,
            "avg_steps": round(sum(session_steps) / len(session_steps), 1) if session_steps else 0,
        }
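The calculator relies on two loader helpers that are not shown. Below is a minimal sketch of what they might look like as standalone functions, assuming one JSON object per line with `session_id` and an ISO 8601 `timestamp` field. The function names and the choice to skip corrupt lines are assumptions.

```python
import json
from datetime import datetime, timedelta, timezone


def iter_events(log_file: str):
    """Yield one dict per valid JSON line, skipping blank or corrupt lines."""
    with open(log_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                continue  # a half-written line should not break analysis
            if isinstance(obj, dict):
                yield obj


def load_session_events(log_file: str, session_id: str) -> list:
    """Return all events for one session, ordered by timestamp string."""
    events = [e for e in iter_events(log_file) if e.get("session_id") == session_id]
    return sorted(events, key=lambda e: e.get("timestamp", ""))


def load_time_window(log_file: str, hours: int = 24, now=None) -> list:
    """Return events whose timestamp falls within the last `hours` hours."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=hours)
    recent = []
    for e in iter_events(log_file):
        try:
            ts = datetime.fromisoformat(e.get("timestamp", ""))
        except (ValueError, TypeError):
            continue  # missing or malformed timestamp
        if ts.tzinfo is None:
            ts = ts.replace(tzinfo=timezone.utc)  # treat naive timestamps as UTC
        if ts >= cutoff:
            recent.append(e)
    return recent
```

Scanning the whole file on every call is fine for small logs. Once files grow, an indexed store such as SQLite is the more practical option.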
How do I set up dashboards for agent monitoring?
You don’t need a complex observability platform to start. Here’s my progression:
Phase 1. File-based (day 1). JSON Lines to rotating files. Query with jq and grep. This covers 90% of debugging needs.
# Find sessions with at least one LLM call that cost more than $1
jq 'select(.event_type == "llm_call" and .cost > 1.0) | .session_id' agent_logs/agent.log | sort -u
# Get p95 latency for the LLM calls in the last 10,000 log lines
tail -10000 agent_logs/agent.log | jq -s 'map(select(.event_type == "llm_call") | .latency_ms) | sort | .[length * 0.95 | floor]'
Phase 2. SQLite (week 1). Write logs to SQLite instead of flat files. Query with SQL.
-- Per-session step count, total cost, and average latency over the last 24 hours
SELECT session_id, COUNT(*) AS steps, SUM(cost) AS total_cost, AVG(latency_ms) AS avg_latency
FROM llm_calls
-- datetime() normalizes ISO 8601 timestamps (with "T") before comparing
WHERE datetime(timestamp) > datetime('now', '-1 day')
GROUP BY session_id
ORDER BY total_cost DESC;
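Getting existing JSON Lines logs into that table can be sketched as below, using the standard library `sqlite3` module. The `llm_calls` column set is inferred from the query above. The schema and the function name `import_llm_calls` are otherwise assumptions.

```python
import json
import sqlite3

# Assumed schema: only the columns the query above needs, plus step and model
SCHEMA = """
CREATE TABLE IF NOT EXISTS llm_calls (
    session_id TEXT,
    step INTEGER,
    model TEXT,
    cost REAL,
    latency_ms REAL,
    timestamp TEXT
)
"""


def import_llm_calls(conn: sqlite3.Connection, jsonl_path: str) -> int:
    """Copy llm_call events from a JSON Lines file into SQLite; return row count."""
    conn.execute(SCHEMA)
    rows = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            if event.get("event_type") != "llm_call":
                continue  # tool and decision events would go to their own tables
            rows.append((
                event.get("session_id"),
                event.get("step"),
                event.get("model"),
                event.get("cost", 0.0),
                event.get("latency_ms", 0),
                event.get("timestamp"),
            ))
    conn.executemany("INSERT INTO llm_calls VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

With `conn = sqlite3.connect("agent_logs/agent.db")` (a hypothetical path), one call imports a day's file, after which the SQL above runs as written.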
Phase 3. Grafana (month 1). Ship logs to a structured logging service (Grafana Loki, Axiom, or similar) and build dashboards. Only do this when you need to share metrics with a team or track trends over weeks.
What debugging patterns help with production agents?
Here’s how I debug agents using logs:
Pattern 1: Replay the run. Take the logged events from a session and replay them step by step. This is the single most powerful debugging technique.
import json

class AgentReplayer:
    # _load_session (not shown) returns the session's events in logged order
    def __init__(self, log_file: str):
        self.log_file = log_file

    def replay(self, session_id: str):
        events = self._load_session(session_id)
        for event in events:
            event_type = event.get("event_type")
            step = event.get("step", 0)
            if event_type == "llm_call":
                print(f"\n{'='*60}")
                print(f"Step {step}. LLM Call ({event.get('model')})")
                # .4f prints four decimal places (":4f" would only set a field width)
                print(f"Cost: ${event.get('cost', 0):.4f} | Latency: {event.get('latency_ms', 0)}ms")
                print(f"Tokens: {event.get('prompt_tokens', 0)} in / {event.get('completion_tokens', 0)} out")
                print(f"\nInput (truncated): {json.dumps(event.get('input_messages', [])[:2])[:300]}")
                print(f"\nOutput: {event.get('output_content', '')[:500]}")
            elif event_type == "tool_execution":
                status = "✓" if event.get("success") else "✗"
                print(f"  {status} Tool: {event.get('tool_name')}({json.dumps(event.get('parameters', {}))})")
                print(f"  Duration: {event.get('duration_ms', 0)}ms")
                if not event.get("success"):
                    print(f"  Error: {event.get('error', '')}")
            elif event_type == "agent_decision":
                print(f"  → Decision: {event.get('decision')}")
            input("Press Enter for next step...")  # Step through manually
Pattern 2: Compare two runs. When the same input produces different outputs, compare the decision paths side by side.
def compare_runs(session_a: str, session_b: str):
    # load_session (not shown) returns the list of event dicts for a session
    events_a = load_session(session_a)
    events_b = load_session(session_b)
    decisions_a = [e for e in events_a if e["event_type"] == "agent_decision"]
    decisions_b = [e for e in events_b if e["event_type"] == "agent_decision"]
    # zip stops at the shorter run, so trailing decisions are not compared
    for i, (da, db) in enumerate(zip(decisions_a, decisions_b)):
        if da.get("decision") != db.get("decision"):
            print(f"Divergence at step {i}:")
            print(f"  Run A: {da.get('decision')}")
            print(f"  Run B: {db.get('decision')}")
            print(f"  Context A: {da.get('context_snapshot')}")
            print(f"  Context B: {db.get('context_snapshot')}")
Pattern 3: The stuck agent detector. Monitor for agents that are looping (repeating the same tool call with similar parameters).
from collections import defaultdict

def detect_stuck_agents(events: list, max_iterations: int = 15):
    sessions = defaultdict(list)
    for e in events:
        sessions[e.get("session_id")].append(e)
    stuck = []
    for sid, sess_events in sessions.items():
        tool_calls = [e for e in sess_events if e["event_type"] == "tool_execution"]
        # Only sessions with many tool calls are candidates
        if len(tool_calls) > max_iterations:
            # Check if it's repeating the same tool
            tool_names = [t.get("tool_name") for t in tool_calls]
            if len(set(tool_names)) <= 2:  # Using only 1-2 tools repeatedly
                stuck.append({
                    "session_id": sid,
                    "iterations": len(tool_calls),
                    "tools_used": tool_names[:10],
                    "total_cost": sum(t.get("cost", 0) for t in sess_events
                                      if t.get("event_type") == "llm_call")
                })
    return stuck
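The detector above looks only at tool names. To also take parameters into account, a minimal sketch follows that measures the longest run of consecutive identical calls. It uses exact parameter matching as a simple stand-in for "similar", and the function names and the default threshold of 3 are assumptions.

```python
import json


def longest_repeat_run(tool_events: list) -> int:
    """Length of the longest run of consecutive identical (tool, params) calls."""
    longest = 0
    current = 0
    previous = None
    for e in tool_events:
        # Serialize parameters with sorted keys so key order does not matter
        signature = (
            e.get("tool_name"),
            json.dumps(e.get("parameters"), sort_keys=True, default=str),
        )
        current = current + 1 if signature == previous else 1
        longest = max(longest, current)
        previous = signature
    return longest


def is_looping(tool_events: list, max_repeats: int = 3) -> bool:
    """Flag a session once the same call repeats max_repeats times in a row."""
    return longest_repeat_run(tool_events) >= max_repeats
```

Because this needs only a handful of events, it can run inside the agent loop and stop a session early instead of flagging it after the fact.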
How do I set up agent logging in under 1 hour?
If you do nothing else, implement this. It takes one hour and covers 90% of debugging needs:
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

class QuickLogger:
    """Minimal agent logger. 50 lines, one file, zero dependencies."""

    def __init__(self, log_dir: str = "logs"):
        Path(log_dir).mkdir(parents=True, exist_ok=True)
        # One file per UTC day; the name is fixed when the logger is created
        today = datetime.now(timezone.utc).strftime("%Y%m%d")
        self.log_file = Path(log_dir) / f"agent-{today}.jsonl"

    def log(self, event: dict):
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
        event["_timestamp"] = datetime.now(timezone.utc).isoformat()
        with open(self.log_file, "a", encoding="utf-8") as f:
            f.write(json.dumps(event, default=str) + "\n")

    def llm_call(self, session: str, step: int, model: str, prompt_tokens: int,
                 completion_tokens: int, cost: float, latency_ms: int,
                 response: str = ""):
        self.log({
            "type": "llm_call", "session": session, "step": step,
            "model": model, "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens, "cost": round(cost, 6),
            "latency_ms": latency_ms, "response_preview": response[:200]
        })

    def tool_call(self, session: str, step: int, tool: str, params: dict,
                  duration_ms: int, success: bool, error: str = ""):
        self.log({
            "type": "tool_call", "session": session, "step": step,
            "tool": tool, "params": params, "duration_ms": duration_ms,
            "success": success, "error": error
        })

    def decision(self, session: str, step: int, decision: str,
                 context: Optional[dict] = None):
        self.log({
            "type": "decision", "session": session, "step": step,
            "decision": decision, "context": context
        })

    def error(self, session: str, step: int, error: str, traceback: str = ""):
        self.log({
            "type": "error", "session": session, "step": step,
            "error": error, "traceback": traceback[:500]
        })
Usage:
logger = QuickLogger("agent_logs")
logger.llm_call(session_id, step, "gpt-4o", 1500, 400, 0.015, 1200, response_text)
logger.tool_call(session_id, step, "search_web", {"q": "Bengaluru weather"}, 800, True)
logger.decision(session_id, step, "use_weather_tool", {"confidence": 0.85})
That’s it. Rotate the log file daily. Query with jq. You’re now in the top 10% of agent developers who know what their agents are doing.
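For a long-running process, daily rotation means choosing the file name at write time, not once at startup. A minimal sketch of that idea follows, with a helper that prunes old files. The function names and the 14-file retention default are assumptions, and the file naming follows the `agent-YYYYMMDD.jsonl` pattern used above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def daily_log_path(log_dir: str, now: datetime) -> Path:
    """Return the log file path for the day that `now` falls on."""
    return Path(log_dir) / f"agent-{now.strftime('%Y%m%d')}.jsonl"


def append_event(log_dir: str, event: dict, now=None) -> Path:
    """Append one event as a JSON line to the current day's file."""
    now = now or datetime.now(timezone.utc)
    path = daily_log_path(log_dir, now)
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {**event, "_timestamp": now.isoformat()}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, default=str) + "\n")
    return path


def prune_old_logs(log_dir: str, keep: int = 14) -> list:
    """Delete all but the newest `keep` daily files; return the deleted paths."""
    files = sorted(Path(log_dir).glob("agent-*.jsonl"))  # YYYYMMDD sorts by date
    stale = files[:-keep] if keep > 0 else files
    for p in stale:
        p.unlink()
    return stale
```

Resolving the path on every write costs one string format per event, which is negligible next to an LLM call.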
How do I set up alerting for production agents?
Logs are for after something goes wrong. Alerts are for catching it as it happens.
class AlertManager:
    def __init__(self, cost_threshold: float = 0.50,
                 error_rate_threshold: float = 0.1,
                 max_loops: int = 20):
        self.cost_threshold = cost_threshold
        self.error_rate_threshold = error_rate_threshold
        self.max_loops = max_loops

    async def check_session(self, session_id: str, events: list):
        alerts = []
        # Cost spike
        total_cost = sum(e.get("cost", 0) for e in events
                         if e.get("event_type") == "llm_call")
        if total_cost > self.cost_threshold:
            # .2f prints two decimal places (":2f" would only set a field width)
            alerts.append(f"Cost spike: ${total_cost:.2f} for session {session_id}")
        # Error rate
        tool_events = [e for e in events if e.get("event_type") == "tool_execution"]
        if tool_events:
            failures = [e for e in tool_events if not e.get("success")]
            if len(failures) / len(tool_events) > self.error_rate_threshold:
                alerts.append(f"High error rate: {len(failures)}/{len(tool_events)} tool failures")
        # Stuck loop
        if len(tool_events) > self.max_loops:
            alerts.append(f"Agent stuck: {len(tool_events)} tool calls without resolution")
        return alerts
For production, I route alerts to a Telegram bot. The format is simple:
⚠️ Agent Alert
Session: abc-123
Type: Cost spike
Detail: $1.24 in last 2 minutes
Agent: content-writer-v2
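A minimal sketch of producing that message and preparing the request for the Telegram Bot API `sendMessage` method follows. The function names are illustrative, and the bot token and chat ID are placeholders you would supply yourself.

```python
import urllib.parse
import urllib.request


def format_alert(session_id: str, alert_type: str, detail: str, agent: str) -> str:
    """Render an alert in the plain-text layout shown above."""
    return "\n".join([
        "⚠️ Agent Alert",
        f"Session: {session_id}",
        f"Type: {alert_type}",
        f"Detail: {detail}",
        f"Agent: {agent}",
    ])


def build_telegram_request(bot_token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to the Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")
```

Passing the request to `urllib.request.urlopen(req, timeout=10)` sends it. Keeping the build step separate makes the formatting testable without network access.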
What tools are available for agent logging and monitoring?
Here’s what I’ve used and what I’d recommend:
| Tool | Best for | Cost | Setup time |
|---|---|---|---|
| File-based + jq | Solo developers, startups | Free | 1 hour |
| SQLite + Metabase | Small teams | Free | 1 day |
| LangSmith | LangGraph users | Pay per trace | 30 min |
| Grafana Loki + Promtail | Teams with existing Grafana | Free tier | 2 days |
| Axiom | Easy hosted solution | Free tier (50GB) | 1 hour |
| OpenTelemetry | Distributed tracing across services | Free (host it) | 2–3 days |
| Sentry | Error tracking specifically | Free tier | 30 min |
My recommendation: start with file-based logging. When you need more, add SQLite for queryability. When you need team dashboards, add Grafana or Axiom. Don’t over-invest in observability before you have users to observe.
Related: AI agent multi-step workflows: orchestrating complex agent pipelines: how to design workflows that benefit from good observability.
Logging won’t make your agent perfect. But it will make failures visible, debuggable, and, eventually, preventable. And that’s the difference between agent development that feels like guessing and agent development that feels like engineering.
FAQ
What should I log in an AI agent system? Every LLM call (input messages, output, cost, latency, model name), every tool execution (tool name, parameters, result, duration, success status), the agent’s reasoning steps and decisions, and all errors with stack traces and context. Log as structured JSON.
What tools should I use for AI agent monitoring? LangSmith is the best tool for LangGraph traces. For custom agents, a combination of structured JSON logging to files or a database works well. OpenTelemetry is good for distributed tracing across microservices. Start simple: log to files with rotation, add a dashboard later.
What metrics should I track for AI agents in production? Cost per user/session, error rate (failed tool calls, LLM errors), latency p50/p95/p99, tool success rate, agent loop iterations per task, and cost per agent run. Track these over time and set up alerts for spikes.
How do I debug an agent that behaves unexpectedly? Replay the agent run with the same inputs and compare decision paths. Log the full reasoning trace at each step. Set up ‘decision point logging’: before each LLM call, log the context the agent is working with. After the call, log what it decided and why.
Related Posts
- AI agent error handling patterns. Retry strategies, circuit breakers, and graceful degradation for production agents
- AI agent deployment guide: from localhost to production. Docker, cost controls, monitoring, and the full production deployment checklist
- The 15 jobs every agent harness must do. Reference architecture for the observability jobs every production agent needs
This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents. Reach me at hello@agenticup.dev.