BUILD · Jun 1, 2026

AI agent logging and monitoring: seeing inside your agent

How to log, trace, and monitor AI agents in production. what to track, which tools to use, and how to debug agents that behave unexpectedly.

Agent-ready: drop this post into Claude Code or Codex

The LangSmith documentation shows how tracing, logging, and monitoring integrate with LangGraph agents: covering the observability patterns recommended in this post.

TL;DR: Agent observability isn’t optional: it’s the difference between shipping with confidence and shipping with hope. This guide covers what to log (every LLM call, tool execution, decision, and error), how to structure it as JSON Lines, and how to debug agents by replaying logged runs. You can implement a working setup in one hour.

The first time I deployed an agent to production, it worked perfectly for three days. Then a user asked it a question that triggered a six-minute loop of the same tool call, racking up $12 in API costs before I noticed.

I had no logs. No idea what happened. No way to replay the run.

That experience taught me something fundamental: agent observability is not optional. It’s the difference between shipping with confidence and shipping with hope.

Here’s a complete guide to agent logging and monitoring: what to log, why, and how to build it without spending weeks on tooling.

Key takeaways:

  • Log every LLM call (input, output, cost, latency), every tool execution (name, params, result, duration), every decision, and every error
  • Structured JSON logging to rotating files covers 90% of debugging needs: you don’t need a fancy platform to start
  • Track: cost per session, error rate, latency percentiles, tool success rate, agent loop count
  • The most powerful debugging technique: replay a logged agent run step by step
  • A working logging setup takes 1 hour to implement: do it before your first production deployment

What to log

Everything. But in a structured way. Here’s the minimum set of events you should log for every agent run:

import json
import logging
from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass
class LLMCallEvent:
 event_type: str = "llm_call"
 agent_id: str = ""
 session_id: str = ""
 step: int = 0
 model: str = ""
 input_messages: list = None # The messages sent to the LLM
 output_content: str = "" # The model's text response
 tool_calls: list = None # [{name, args}] if any
 cost: float = 0.0 # Cost of this call
 prompt_tokens: int = 0
 completion_tokens: int = 0
 latency_ms: int = 0
 timestamp: str = ""

@dataclass
class ToolExecutionEvent:
 event_type: str = "tool_execution"
 agent_id: str = ""
 session_id: str = ""
 step: int = 0
 tool_name: str = ""
 parameters: dict = None
 result: str = "" # Truncated to 500 chars
 duration_ms: int = 0
 success: bool = True
 error: str = ""
 timestamp: str = ""

@dataclass
class AgentDecisionEvent:
 event_type: str = "agent_decision"
 agent_id: str = ""
 session_id: str = ""
 step: int = 0
 decision: str = "" # What the agent decided
 reasoning: str = "" # Why it decided this
 context_snapshot: dict = None # Key context at decision point
 timestamp: str = ""

Every event gets logged as a single JSON line. This is important. JSON Lines format means each line is self-contained, easy to stream, and easy to query with grep or jq.

import logging

# Structured JSON logging
class StructuredLogger:
 def __init__(self, log_dir: str = "agent_logs"):
 self.logger = logging.getLogger("agent")
 handler = logging.FileHandler(f"{log_dir}/agent.log", mode="a")
 handler.setFormatter(logging.Formatter("%(message)s"))
 self.logger.addHandler(handler)
 self.logger.setLevel(logging.INFO)

 def log_event(self, event: dict):
 event["timestamp"] = datetime.utcnow().isoformat()
 self.logger.info(json.dumps(event, default=str))

 def log_llm_call(self, **kwargs):
 self.log_event({"event_type": "llm_call", **kwargs})

 def log_tool_call(self, **kwargs):
 self.log_event({"event_type": "tool_execution", **kwargs})

 def log_decision(self, **kwargs):
 self.log_event({"event_type": "agent_decision", **kwargs})

How do I add logging to my AI agent?

The cleanest way: wrap your agent loop with a logger that intercepts every significant event.

class LoggedAgent:
 def __init__(self, agent, logger: StructuredLogger):
 self.agent = agent
 self.logger = logger
 self.session_id = str(uuid4())

 async def run(self, user_input: str) -> str:
 self.logger.log_decision(
 agent_id=self.agent.id,
 session_id=self.session_id,
 step=0,
 decision="start",
 reasoning="Agent received user input",
 context_snapshot={"input": user_input[:200]}
 )

 step = 0
 while True:
 step += 1
 start = time.time()

 # Log the LLM call (intercept by wrapping the LLM client)
 response = await self._logged_llm_call(step, self.agent.messages)
 latency = (time.time() - start) * 1000

 if response.tool_calls:
 # Log each tool call
 for tc in response.tool_calls:
 t_start = time.time()
 try:
 result = await self._execute_tool(tc)
 success = True
 error = ""
 except Exception as e:
 result = {"error": str(e)}
 success = False
 error = str(e)

 self.logger.log_tool_call(
 agent_id=self.agent.id,
 session_id=self.session_id,
 step=step,
 tool_name=tc.name,
 parameters=tc.args,
 result=json.dumps(result)[:500],
 duration_ms=(time.time() - t_start) * 1000,
 success=success,
 error=error
 )
 else:
 self.logger.log_decision(
 agent_id=self.agent.id,
 session_id=self.session_id,
 step=step,
 decision="return_result",
 reasoning="Model returned text response, no tool calls",
 context_snapshot={"response_preview": response.content[:200]}
 )
 return response.content
Pro tip

Don't log full message histories in every LLM call event: they get huge fast. Log message summaries (first 200 chars of user messages, first 500 chars of assistant responses) and provide a replay_id that lets you reconstruct the full conversation from step-by-step logs.

What metrics to track

Logging individual events is for debugging. Metrics are for understanding system health over time.

Here are the metrics I track for every agent in production:

MetricWhat it measuresAlert threshold
Cost per sessionAPI cost for one complete agent run> $0.50
Cost per userTotal cost divided by active usersWeekly trend up
Error rateFailed tool calls / total tool calls> 5%
Latency p50Median time per agent run> 15s
Latency p95Slowest 5% of agent runs> 45s
Latency p99Worst-case agent runs> 120s
Tool success rateSuccessful tool executions / total< 95%
Loop iterationsNumber of LLM calls per agent run> 15
Stuck agentsAgents running > 30 iterationsAny

Computing metrics from logs

import json
from collections import defaultdict
from statistics import median

class MetricsCalculator:
 def __init__(self, log_file: str):
 self.log_file = log_file

 def compute_session_metrics(self, session_id: str) -> dict:
 events = self._load_session_events(session_id)

 llm_calls = [e for e in events if e["event_type"] == "llm_call"]
 tool_calls = [e for e in events if e["event_type"] == "tool_execution"]
 decisions = [e for e in events if e["event_type"] == "agent_decision"]

 total_cost = sum(e.get("cost", 0) for e in llm_calls)
 total_latency = sum(e.get("latency_ms", 0) for e in llm_calls)
 tool_failures = [e for e in tool_calls if not e.get("success", True)]

 return {
 "session_id": session_id,
 "total_steps": len(llm_calls),
 "total_cost": round(total_cost, 4),
 "total_latency_ms": total_latency,
 "tool_calls": len(tool_calls),
 "tool_failures": len(tool_failures),
 "tool_success_rate": round((len(tool_calls) - len(tool_failures)) / len(tool_calls) * 100, 1)
 if tool_calls else 100.0,
 "decisions_made": len(decisions),
 }

 def compute_aggregate_metrics(self, time_window_hours: int = 24) -> dict:
 events = self._load_time_window(time_window_hours)

 # Group by session
 sessions = defaultdict(list)
 for e in events:
 sessions[e.get("session_id")].append(e)

 session_costs = []
 session_latencies = []
 session_steps = []

 for sid, session_events in sessions.items():
 llm_calls = [e for e in session_events if e["event_type"] == "llm_call"]
 total_cost = sum(e.get("cost", 0) for e in llm_calls)
 total_latency = sum(e.get("latency_ms", 0) for e in llm_calls)
 session_costs.append(total_cost)
 session_latencies.append(total_latency)
 session_steps.append(len(llm_calls))

 latencies_sorted = sorted(session_latencies)
 n = len(latencies_sorted)

 return {
 "total_sessions": len(sessions),
 "avg_cost": round(sum(session_costs) / len(session_costs), 4) if session_costs else 0,
 "p50_latency_ms": median(session_latencies) if session_latencies else 0,
 "p95_latency_ms": latencies_sorted[int(n * 0.95)] if n > 0 else 0,
 "p99_latency_ms": latencies_sorted[int(n * 0.99)] if n > 0 else 0,
 "avg_steps": round(sum(session_steps) / len(session_steps), 1) if session_steps else 0,
 }

How do I set up dashboards for agent monitoring?

You don’t need a complex observability platform to start. Here’s my progression:

Phase 1. File-based (day 1). JSON Lines to rotating files. Query with jq and grep. This covers 90% of debugging needs.

# Find all sessions where cost exceeded $1
jq 'select(.event_type == "llm_call" and .cost > 1.0) | .session_id' agent_logs/agent.log | sort -u

# Get p95 latency for last 1000 LLM calls
tail -10000 agent_logs/agent.log | jq -s 'map(select(.event_type == "llm_call") | .latency_ms) | sort | .[length * 0.95 | floor]'

Phase 2. SQLite (week 1). Write logs to SQLite instead of flat files. Query with SQL.

-- Average cost per session in last 24 hours
SELECT session_id, COUNT(*) as steps, SUM(cost) as total_cost, AVG(latency_ms) as avg_latency
FROM llm_calls
WHERE timestamp > datetime('now', '-1 day')
GROUP BY session_id
ORDER BY total_cost DESC;

Phase 3. Grafana (month 1). Ship logs to a structured logging service (Grafana Loki, Axiom, or similar) and build dashboards. Only do this when you need to share metrics with a team or track trends over weeks.

What debugging patterns help with production agents?

Here’s how I debug agents using logs:

Pattern 1: Replay the run. Take the logged events from a session and replay them step by step. This is the single most powerful debugging technique.

class AgentReplayer:
 def __init__(self, log_file: str):
 self.log_file = log_file

 def replay(self, session_id: str):
 events = self._load_session(session_id)

 for i, event in enumerate(events):
 event_type = event.get("event_type")
 step = event.get("step", 0)

 if event_type == "llm_call":
 print(f"\n{'='*60}")
 print(f"Step {step}. LLM Call ({event.get('model')})")
 print(f"Cost: ${event.get('cost', 0):4f} | Latency: {event.get('latency_ms', 0)}ms")
 print(f"Tokens: {event.get('prompt_tokens', 0)} in / {event.get('completion_tokens', 0)} out")
 print(f"\nInput (truncated): {json.dumps(event.get('input_messages', [])[:2])[:300]}")
 print(f"\nOutput: {event.get('output_content', '')[:500]}")

 elif event_type == "tool_execution":
 status = "✓" if event.get("success") else "✗"
 print(f" {status} Tool: {event.get('tool_name')}({json.dumps(event.get('parameters', {}))})")
 print(f" Duration: {event.get('duration_ms', 0)}ms")
 if not event.get("success"):
 print(f" Error: {event.get('error', '')}")

 elif event_type == "agent_decision":
 print(f" → Decision: {event.get('decision')}")

 input("Press Enter for next step..") # Step through manually

Pattern 2: Compare two runs. When the same input produces different outputs, compare the decision paths side by side.

def compare_runs(session_a: str, session_b: str):
 events_a = load_session(session_a)
 events_b = load_session(session_b)

 decisions_a = [e for e in events_a if e["event_type"] == "agent_decision"]
 decisions_b = [e for e in events_b if e["event_type"] == "agent_decision"]

 for i, (da, db) in enumerate(zip(decisions_a, decisions_b)):
 if da.get("decision") != db.get("decision"):
 print(f"Divergence at step {i}:")
 print(f" Run A: {da.get('decision')}")
 print(f" Run B: {db.get('decision')}")
 print(f" Context A: {da.get('context_snapshot')}")
 print(f" Context B: {db.get('context_snapshot')}")

Pattern 3: The stuck agent detector. Monitor for agents that are looping (repeating the same tool call with similar parameters).

def detect_stuck_agents(events: list, max_iterations: int = 15):
 sessions = defaultdict(list)
 for e in events:
 sessions[e.get("session_id")].append(e)

 stuck = []
 for sid, sess_events in sessions.items():
 tool_calls = [e for e in sess_events if e["event_type"] == "tool_execution"]

 # Check for repeating tool calls
 if len(tool_calls) > max_iterations:
 # Check if it's repeating the same tool
 tool_names = [t.get("tool_name") for t in tool_calls]
 if len(set(tool_names)) <= 2: # Using only 1-2 tools repeatedly
 stuck.append({
 "session_id": sid,
 "iterations": len(tool_calls),
 "tools_used": tool_names[:10],
 "total_cost": sum(t.get("cost", 0) for t in sess_events
 if t.get("event_type") == "llm_call")
 })

 return stuck

How do I set up agent logging in under 1 hour?

If you do nothing else, implement this. It takes one hour and covers 90% of debugging needs:

import json
import os
from datetime import datetime
from pathlib import Path

class QuickLogger:
 """Minimal agent logger. 50 lines, one file, zero dependencies."""

 def __init__(self, log_dir: str = "logs"):
 Path(log_dir).mkdir(parents=True, exist_ok=True)
 self.log_file = Path(log_dir) / f"agent-{datetime.now().strftime('%Y%m%d')}.jsonl"

 def log(self, event: dict):
 event["_timestamp"] = datetime.utcnow().isoformat()
 with open(self.log_file, "a") as f:
 f.write(json.dumps(event, default=str) + "\n")

 def llm_call(self, session: str, step: int, model: str, prompt_tokens: int,
 completion_tokens: int, cost: float, latency_ms: int,
 response: str = ""):
 self.log({
 "type": "llm_call", "session": session, "step": step,
 "model": model, "prompt_tokens": prompt_tokens,
 "completion_tokens": completion_tokens, "cost": round(cost, 6),
 "latency_ms": latency_ms, "response_preview": response[:200]
 })

 def tool_call(self, session: str, step: int, tool: str, params: dict,
 duration_ms: int, success: bool, error: str = ""):
 self.log({
 "type": "tool_call", "session": session, "step": step,
 "tool": tool, "params": params, "duration_ms": duration_ms,
 "success": success, "error": error
 })

 def decision(self, session: str, step: int, decision: str, context: dict = None):
 self.log({
 "type": "decision", "session": session, "step": step,
 "decision": decision, "context": context
 })

 def error(self, session: str, step: int, error: str, traceback: str = ""):
 self.log({
 "type": "error", "session": session, "step": step,
 "error": error, "traceback": traceback[:500]
 })

Usage:

logger = QuickLogger("agent_logs")
logger.llm_call(session_id, step, "gpt-4o", 1500, 400, 0.015, 1200, response_text)
logger.tool_call(session_id, step, "search_web", {"q": "Bengaluru weather"}, 800, True)
logger.decision(session_id, step, "use_weather_tool", {"confidence": 0.85})

That’s it. Rotate the log file daily. Query with jq. You’re now in the top 10% of agent developers who know what their agents are doing.

How do I set up alerting for production agents?

Logs are for after something goes wrong. Alerts are for catching it as it happens.

class AlertManager:
 def __init__(self, cost_threshold: float = 0.50,
 error_rate_threshold: float = 0.1,
 max_loops: int = 20):
 self.cost_threshold = cost_threshold
 self.error_rate_threshold = error_rate_threshold
 self.max_loops = max_loops

 async def check_session(self, session_id: str, events: list):
 alerts = []

 # Cost spike
 total_cost = sum(e.get("cost", 0) for e in events
 if e.get("event_type") == "llm_call")
 if total_cost > self.cost_threshold:
 alerts.append(f"Cost spike: ${total_cost:2f} for session {session_id}")

 # Error rate
 tool_events = [e for e in events if e.get("event_type") == "tool_execution"]
 if tool_events:
 failures = [e for e in tool_events if not e.get("success")]
 if len(failures) / len(tool_events) > self.error_rate_threshold:
 alerts.append(f"High error rate: {len(failures)}/{len(tool_events)} tool failures")

 # Stuck loop
 if len(tool_events) > self.max_loops:
 alerts.append(f"Agent stuck: {len(tool_events)} tool calls without resolution")

 return alerts

For production, I route alerts to a Telegram bot. The format is simple:

⚠️ Agent Alert
Session: abc-123
Type: Cost spike
Detail: $1.24 in last 2 minutes
Agent: content-writer-v2

What tools are available for agent logging and monitoring?

Here’s what I’ve used and what I’d recommend:

ToolBest forCostSetup time
File-based + jqSolo developers, startupsFree1 hour
SQLite + MetabaseSmall teamsFree1 day
LangSmithLangGraph usersPay per trace30 min
Grafana Loki + PromtailTeams with existing GrafanaFree tier2 days
AxiomEasy hosted solutionFree tier (50GB)1 hour
OpenTelemetryDistributed tracing across servicesFree (host it)2–3 days
SentryError tracking specificallyFree tier30 min

My recommendation: start with file-based logging. When you need more, add SQLite for queryability. When you need team dashboards, add Grafana or Axiom. Don’t over-invest in observability before you have users to observe.


Related: AI agent multi-step workflows: orchestrating complex agent pipelines: how to design workflows that benefit from good observability.

Logging won’t make your agent perfect. But it will make failures visible, debuggable, and, eventually, preventable. And that’s the difference between agent development that feels like guessing and agent development that feels like engineering.

FAQ

What should I log in an AI agent system? Every LLM call (input messages, output, cost, latency, model name), every tool execution (tool name, parameters, result, duration, success status), the agent’s reasoning steps and decisions, and all errors with stack traces and context. Log as structured JSON.

What tools should I use for AI agent monitoring? LangSmith is the best tool for LangGraph traces. For custom agents, a combination of structured JSON logging to files or a database works well. OpenTelemetry is good for distributed tracing across microservices. Start simple : log to files with rotation, add a dashboard later.

What metrics should I track for AI agents in production? Cost per user/session, error rate (failed tool calls, LLM errors), latency p50/p95/p99, tool success rate, agent loop iterations per task, and cost per agent run. Track these over time and set up alerts for spikes.

How do I debug an agent that behaves unexpectedly? Replay the agent run with the same inputs and compare decision paths. Log the full reasoning trace at each step. Set up ‘decision point logging’ : before each LLM call, log the context the agent is working with. After the call, log what it decided and why.


This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents. Reach me at hello@agenticup.dev.

Newsletter

Get the brief on AI agents

Practical posts on shipping agents, automating work, and building in public. No hype, no fluff.

Contact: hello@agenticup.dev