BUILD · Jun 1, 2026

AI agent context window: keeping your agent from forgetting

Strategies for managing LLM context windows in agents: sliding windows, summarization, structured memory, and when to use each approach.

TL;DR: I watched my agent forget the user’s name mid-conversation. Then it forgot instructions from 5 turns ago. Context window management is the difference between an agent that works reliably and one that degrades over time. Here are 4 strategies that keep agents running on long conversations.

I’ve built agents that start strong and deteriorate over a 30-minute conversation. The answers get shorter. The reasoning gets sloppier. The agent “forgets” instructions you gave at the start.

The culprit? Context window mismanagement.

Every LLM has a context window: a limited number of tokens it can process at once. As your agent accumulates conversation history, tool outputs, and intermediate results, it fills that window. Once it’s full, something has to give. Either the model truncates early content, or the cost becomes absurd (you’re paying for thousands of tokens of stale context on every call).

Context window management isn’t a nice-to-have. It’s the difference between an agent that works reliably and one that degrades over time.

Key takeaways:

  • Every token in the context costs money and attention: be intentional about what stays
  • Sliding windows work for short conversations but lose long-term context
  • Summarization preserves key information but introduces compression loss
  • Structured memory (vector DB, key-value store) is the most reliable for persistent facts
  • Hybrid approaches beat any single strategy for production agents

OpenAI’s prompt caching guide describes how caching reduces latency and cost for repeated context prefixes, making sliding-window approaches more efficient.

How context windows work

When you send a message to an LLM, you send the entire conversation history plus the new message. The model processes every token in parallel (thanks to the attention mechanism) and generates a response.

This means:

  • Cost scales with total tokens: every API call costs proportionally to your entire context, not just the new input
  • Attention degrades: models focus less on tokens in the middle of long contexts (the “lost-in-the-middle” problem)
  • Latency increases: more tokens means more computation per generation

Here’s the math for a 10-turn conversation with tool calls:

| Component | Approximate tokens |
| --- | --- |
| System prompt | 500 |
| 10 user messages (avg 100 tokens each) | 1,000 |
| 10 assistant responses (avg 300 tokens each) | 3,000 |
| 20 tool calls with results (avg 200 tokens each) | 4,000 |
| Total | 8,500 |

At Claude Sonnet pricing ($3 per million input tokens), sending those 8,500 tokens costs about $0.026 per call. If 100 calls a day each carry a context of that size, that’s $2.55 in input costs alone, and more if you exceed the window and retry.
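The arithmetic above is easy to reproduce. Here is a minimal sketch: the token counts are the rough averages from the table, the price is the $3 per million input tokens quoted above, and `input_cost_usd` is a name made up for illustration. It assumes every call resends roughly the same 8,500-token context, which understates the cost of a conversation that keeps growing.

```python
# Rough per-call input cost for the 10-turn example.
# Token counts are the table's estimates, not measurements.
COMPONENT_TOKENS = {
    "system_prompt": 500,
    "user_messages": 10 * 100,
    "assistant_responses": 10 * 300,
    "tool_calls_with_results": 20 * 200,
}

PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # USD, the figure quoted in the text


def input_cost_usd(total_tokens: int, price_per_million: float) -> float:
    """Cost of sending `total_tokens` input tokens once."""
    return total_tokens / 1_000_000 * price_per_million


total_tokens = sum(COMPONENT_TOKENS.values())
cost_per_call = input_cost_usd(total_tokens, PRICE_PER_MILLION_INPUT_TOKENS)
cost_per_100_calls = 100 * cost_per_call

print(f"{total_tokens} tokens per call")
print(f"${cost_per_call:.4f} per call")
print(f"${cost_per_100_calls:.2f} per 100 calls")
```

Swap in your own measured token counts and your provider's current price to get a number you can actually budget against.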

How does a sliding window strategy work?

Keep the last N messages, discard everything older.

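def apply_sliding_window(messages: list, max_messages: int = 20) -> list:
    """
    Keep the system prompt (index 0) and the last max_messages - 1 messages.
    """
    if not messages:
        return []
    system_prompt, rest = messages[0], messages[1:]
    keep = max_messages - 1
    # Slice `rest`, not `messages`, so a short history never duplicates the
    # system prompt. Guard keep <= 0 because rest[-0:] would return everything.
    recent = rest[-keep:] if keep > 0 else []
    return [system_prompt] + recent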

When it works: Short conversations where only recent context matters: customer support, simple Q&A agents, short task execution.

When it fails: Long-running research agents that need early findings. If an agent discards the research brief it compiled 50 turns ago, it can’t write the final report.

Problem case: I had a code review agent that used a sliding window of 30 messages. After reviewing 5 files, it had already forgotten its review guidelines from the first exchange. It started contradicting its earlier feedback.
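One way to soften this failure is to pin the messages that must never scroll out of the window, such as review guidelines. The following is a sketch under assumptions, not a definitive implementation: the `pinned` flag on message dicts and the function name `apply_pinned_window` are invented for illustration.

```python
def apply_pinned_window(messages: list[dict], max_messages: int = 20) -> list[dict]:
    """
    Keep the system prompt, every message flagged as pinned, and as many of
    the most recent unpinned messages as still fit within max_messages.
    Original order is preserved. Pinned messages are always kept, even if
    that pushes the result past max_messages.
    """
    if not messages:
        return []

    system_prompt, rest = messages[0], messages[1:]
    pinned_count = sum(1 for m in rest if m.get("pinned"))

    # Slots left for unpinned messages after the system prompt and pinned ones.
    free_slots = max(max_messages - 1 - pinned_count, 0)

    unpinned_indices = [i for i, m in enumerate(rest) if not m.get("pinned")]
    keep_unpinned = set(unpinned_indices[-free_slots:]) if free_slots else set()

    kept = [m for i, m in enumerate(rest) if m.get("pinned") or i in keep_unpinned]
    return [system_prompt] + kept


# Usage: the guidelines survive even after 40 later messages.
history = [{"role": "system", "content": "You are a code reviewer."}]
history.append({"role": "user", "content": "Guidelines: flag missing tests.", "pinned": True})
history += [{"role": "user", "content": f"file {i}"} for i in range(40)]

trimmed = apply_pinned_window(history, max_messages=10)
```

The trade-off is that pinned messages eat into the window permanently, so pin sparingly.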

How does summarization manage context windows?

Periodically compress old conversation history into a summary, then replace the compressed messages with that summary.

import instructor  # `llm` below is assumed to be an instructor-patched async client
from pydantic import BaseModel

class ConversationSummary(BaseModel):
    key_decisions: list[str]
    completed_tasks: list[str]
    pending_items: list[str]
    important_context: str

async def summarize_history(messages: list, llm) -> ConversationSummary:
    """Compress conversation history into a structured summary."""
    # format_messages_for_summary is a helper you supply: it renders the
    # messages as plain text.
    history_text = format_messages_for_summary(messages)

    summary = await llm.chat.completions.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,  # Anthropic-backed clients require an explicit limit
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation history:\n\n{history_text}"
        }],
        response_model=ConversationSummary,
    )
    return summary

async def compress_context(messages: list, llm, summary_frequency: int = 20):
    """
    If message count exceeds threshold, summarize and compress.
    """
    if len(messages) <= summary_frequency + 1:
        return messages

    # Messages to summarize (exclude system prompt and recent N)
    to_summarize = messages[1:-10]
    recent = messages[-10:]

    # summarize_history is a coroutine, so it must be awaited
    summary = await summarize_history(to_summarize, llm)
    summary_message = {
        "role": "system",
        "content": f"[Compressed History]\n{summary.model_dump_json()}"
    }

    return [messages[0], summary_message] + recent

When it works: Long research sessions, multi-turn analysis, agents that need to reference early decisions. The summary preserves key information at ~5-10% of the original token count.

When it fails: When the summary itself becomes too large after multiple compressions. Compressing summaries of summaries leads to information loss. I’ve seen agents lose critical edge cases after 3-4 summarization rounds.

Problem case: A compliance-checking agent that needed to track every rule it had verified. The third compression dropped the note that certificate expiry had not yet been checked. The agent signed off on a non-compliant deployment.
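One way to limit the loss from summaries of summaries is to merge structured summaries deterministically instead of asking the model to re-compress them. A minimal sketch, assuming summaries with the same fields as `ConversationSummary` above. It uses a plain dataclass to stay dependency-free, and `Summary`, `_union`, and `merge_summaries` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class Summary:
    key_decisions: list[str] = field(default_factory=list)
    completed_tasks: list[str] = field(default_factory=list)
    pending_items: list[str] = field(default_factory=list)
    important_context: str = ""


def _union(older: list[str], newer: list[str]) -> list[str]:
    """Concatenate two lists, dropping exact duplicates and keeping order."""
    return list(dict.fromkeys(older + newer))


def merge_summaries(previous: Summary, latest: Summary) -> Summary:
    """
    Fold the latest summary into the previous one without another LLM call.
    List items are copied verbatim, never paraphrased. A pending item is
    dropped only when an identical string appears in completed_tasks.
    """
    completed = _union(previous.completed_tasks, latest.completed_tasks)
    pending = [
        item
        for item in _union(previous.pending_items, latest.pending_items)
        if item not in completed
    ]
    context = "\n".join(
        part for part in (previous.important_context, latest.important_context) if part
    )
    return Summary(
        key_decisions=_union(previous.key_decisions, latest.key_decisions),
        completed_tasks=completed,
        pending_items=pending,
        important_context=context,
    )


# Usage: the unchecked item survives a round that does not mention it.
round_1 = Summary(
    key_decisions=["use TLS 1.3"],
    completed_tasks=["checked open ports"],
    pending_items=["check certificate expiry"],
)
round_2 = Summary(
    key_decisions=["use TLS 1.3", "rotate keys"],
    completed_tasks=["checked headers"],
)
merged = merge_summaries(round_1, round_2)
```

The cost is that the merged summary only grows, so this fits checklists and decisions better than free-form context, which still needs some compression.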

How does structured memory preserve context?

Store important facts in a separate database. Retrieve relevant facts when needed.

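from typing import List, Optional, TypedDict
import chromadb

class MemoryEntry(TypedDict):
    key: str
    content: str
    timestamp: str
    metadata: dict

class StructuredMemory:
    """Key-value memory with semantic search for agent state."""

    def __init__(self, collection_name: str = "agent_memory"):
        # chromadb.Client() keeps data in memory only. To survive restarts,
        # use chromadb.PersistentClient(path="./agent_memory") instead.
        self.client = chromadb.Client()
        # get_or_create avoids an error when the collection already exists
        self.collection = self.client.get_or_create_collection(collection_name)

    async def remember(self, key: str, content: str, metadata: Optional[dict] = None):
        """Store a fact in memory, overwriting any entry with the same key."""
        self.collection.upsert(
            documents=[content],
            # Pass None rather than an empty dict, which some Chroma versions reject
            metadatas=[metadata] if metadata else None,
            ids=[key],
        )

    async def recall(self, query: str, n: int = 5) -> List[MemoryEntry]:
        """Retrieve relevant memories based on semantic similarity."""
        results = self.collection.query(
            query_texts=[query],
            n_results=n,
        )
        entries = []
        for key, content, metadata in zip(
            results["ids"][0], results["documents"][0], results["metadatas"][0]
        ):
            metadata = metadata or {}  # entries stored without metadata come back as None
            entries.append(MemoryEntry(
                key=key,
                content=content,
                timestamp=str(metadata.get("timestamp", "")),
                metadata=metadata,
            ))
        return entries

    async def update(self, key: str, content: str):
        """Update an existing memory."""
        self.collection.update(ids=[key], documents=[content])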

# Agent loop with structured memory.
# build_prompt, llm_call, and extract_facts are helpers you supply.
async def agent_loop_with_memory(task: str, memory: StructuredMemory):
    context = {
        "instruction": "You are a research agent with structured memory.",
        "recent_history": [],
        "recalled_facts": await memory.recall(task, n=3),
    }

    for step in range(10):
        # Build prompt with recalled facts
        prompt = build_prompt(task, context)
        response = await llm_call(prompt)

        # Extract and store important facts
        facts = extract_facts(response)
        for fact in facts:
            await memory.remember(
                key=fact["id"],
                content=fact["content"],
                metadata={"step": step, "source": task}
            )

        context["recent_history"].append(response)
        context["recalled_facts"] = await memory.recall(task, n=5)

When it works: Long-running agents, multi-session conversations, any agent that needs to remember specific facts across restarts. This is the most reliable approach for production.

When it fails: When retrieval returns irrelevant content and pollutes the context. Bad retrieval = bad context = bad agent behavior. Requires tuning the embedding model, chunk size, and retrieval count.
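A cheap first defense against polluted context is to drop hits that are too far from the query before they reach the prompt. This is a sketch, assuming a results dict in the nested-list shape that Chroma's `collection.query` returns when distances are included. The helper name `filter_by_distance` and the threshold value are illustrative, and a sensible threshold depends on your embedding model and distance metric.

```python
def filter_by_distance(results: dict, max_distance: float) -> list[dict]:
    """
    Keep only hits for the first query whose distance is at or below
    max_distance. Smaller distance means a closer match.
    """
    ids = results["ids"][0]
    documents = results["documents"][0]
    distances = results["distances"][0]
    return [
        {"key": key, "content": doc, "distance": dist}
        for key, doc, dist in zip(ids, documents, distances)
        if dist <= max_distance
    ]


# Usage with a hand-written result in the assumed shape.
sample_results = {
    "ids": [["fact-1", "fact-2", "fact-3"]],
    "documents": [["User prefers Python", "Deploy target is AWS", "Lunch was pasta"]],
    "distances": [[0.21, 0.48, 1.35]],
}
relevant = filter_by_distance(sample_results, max_distance=0.6)
```

Returning fewer facts than requested is usually better than padding the context with near-misses.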

How do hybrid approaches combine memory strategies?

The best production systems combine all three:

from datetime import datetime

class ContextManager:
    """
    Hybrid context management:
    - Sliding window for recent interaction
    - Summarization for mid-term history
    - Structured memory for persistent facts
    """

    def __init__(
        self,
        llm,
        recent_window: int = 15,
        summary_threshold: int = 25,
    ):
        self.llm = llm  # client passed through to summarize_history
        self.recent_window = recent_window
        self.summary_threshold = summary_threshold
        self.memory = StructuredMemory()
        self.summary: ConversationSummary | None = None
        self.recent_history: list = []

    async def build_context(self, task: str) -> dict:
        # 1. Get persistent facts from structured memory
        memories = await self.memory.recall(task, n=5)

        # 2. Use summary for mid-term history
        summary_text = (
            self.summary.model_dump_json()
            if self.summary
            else "No prior context"
        )

        # 3. Recent history uses sliding window
        recent = self.recent_history[-self.recent_window:]

        return {
            "persistent_facts": memories,
            "history_summary": summary_text,
            "recent_exchanges": recent,
        }

    async def after_turn(self, turn_messages: list):
        # turn_messages is the full message history so far
        self.recent_history = list(turn_messages)

        # Decide whether to summarize everything older than the recent window
        if len(turn_messages) > self.summary_threshold:
            self.summary = await summarize_history(
                turn_messages[:-self.recent_window],
                self.llm
            )

        # Extract and store persistent facts
        # (extract_key_facts is a helper you supply)
        facts = extract_key_facts(turn_messages[-1])
        for fact in facts:
            await self.memory.remember(
                key=fact["id"],
                content=fact["content"],
                metadata={"timestamp": datetime.now().isoformat()}
            )

This is what I run in production. The context stays manageable, under 6K tokens for most turns, and the agent has access to the information it needs.

How should I budget my context window tokens?

Beyond choosing a strategy, you need to think about how you allocate tokens within the window. I call this context budgeting:

| Component | Allocation | Why |
| --- | --- | --- |
| Instructions | 15% | System prompt, task definition, rules |
| Tools | 10% | Tool definitions, function schemas |
| Conversation history | 40% | Recent exchanges + compressed history |
| Memory retrieval | 20% | Facts from structured memory |
| Output buffer | 15% | Room for the model to generate |

These ratios change based on the agent type:

  • Code review agent: Allocate more to tools (file reading, git operations) and less to conversation history
  • Research agent: Allocate more to memory retrieval (stored findings) and conversation history
  • Customer support agent: Allocate more to recent history, less to tools
  • Data analysis agent: Allocate more to output buffer (for large result sets)
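To make the budget concrete, you can turn the percentages into token counts for your model's window and compare them with what each component actually uses. A minimal sketch: the percentages are the defaults from the table, the 200,000-token window is only an example size, and `allocate_budget` and `over_budget` are hypothetical helper names.

```python
from typing import Optional

# Default allocation in percent, taken from the budgeting table.
DEFAULT_ALLOCATION_PCT = {
    "instructions": 15,
    "tools": 10,
    "conversation_history": 40,
    "memory_retrieval": 20,
    "output_buffer": 15,
}


def allocate_budget(
    window_tokens: int,
    allocation_pct: Optional[dict[str, int]] = None,
) -> dict[str, int]:
    """Turn percentage allocations into per-component token budgets."""
    allocation = allocation_pct or DEFAULT_ALLOCATION_PCT
    if sum(allocation.values()) != 100:
        raise ValueError("allocations must sum to 100 percent")
    # Integer math avoids floating-point surprises in the budgets.
    return {name: window_tokens * pct // 100 for name, pct in allocation.items()}


def over_budget(usage: dict[str, int], budgets: dict[str, int]) -> list[str]:
    """Names of components whose measured usage exceeds their budget."""
    return [name for name, used in usage.items() if used > budgets.get(name, 0)]


budgets = allocate_budget(200_000)
# Measured usage for one turn (made-up numbers for the example).
usage = {"tools": 35_000, "conversation_history": 60_000}
offenders = over_budget(usage, budgets)
```

For a code review agent you would pass a different `allocation_pct`, for example with a larger share for tools and a smaller one for conversation history.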

When each strategy fails

| Strategy | Failure mode | Symptom |
| --- | --- | --- |
| Sliding window | Loses early context | Agent contradicts earlier decisions |
| Summarization | Compression loss | Missing critical details |
| Structured memory | Bad retrieval | Agent acts on irrelevant context |
| Hybrid | Configuration complexity | Overhead outweighs benefit |

Which context window strategy should I use?

For a typical production agent, here’s what I’d suggest:

  1. Start with a sliding window: simplest, and sufficient for agents with <15 turns
  2. Add structured memory when the agent needs to remember facts across sessions or turns
  3. Add summarization when conversations exceed 25 turns and you need mid-term context
  4. Context budget: measure your actual token usage and adjust allocations
  5. Monitor: log context token counts per turn, alert when averages exceed 80% of the window
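The monitoring step in the list above can start very small. Here is a sketch of a rolling-average check, assuming you already get a per-turn context token count from your provider's usage data. `ContextMonitor` and its parameters are illustrative names, and the 80% default mirrors the threshold suggested above.

```python
from collections import deque


class ContextMonitor:
    """Track context size per turn and flag when the rolling average gets too high."""

    def __init__(self, window_tokens: int, alert_ratio: float = 0.8, history: int = 20):
        self.window_tokens = window_tokens
        self.alert_ratio = alert_ratio
        # Only the most recent `history` turns count toward the average.
        self.samples = deque(maxlen=history)

    def record(self, context_tokens: int) -> bool:
        """Record one turn. Returns True if the rolling average exceeds the threshold."""
        self.samples.append(context_tokens)
        average = sum(self.samples) / len(self.samples)
        return average > self.alert_ratio * self.window_tokens


# Usage: a tiny window and a short history make the behavior easy to see.
monitor = ContextMonitor(window_tokens=1000, history=3)
alerts = [monitor.record(tokens) for tokens in (500, 900, 1000, 1000)]
```

In a real agent you would log each sample and route a `True` result to whatever alerting you already use.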

The goal isn’t to maximize context use. It’s to minimize context while keeping the agent effective. The best context is the one you’re not paying for.


Related: AI agent error handling patterns, what to do when your agent breaks. Also see Preventing AI agent hallucinations, 7 techniques for more reliable agents.

Related: What is an AI agent? A complete beginner’s guide for developers: understanding the fundamentals of AI agents before diving into context management strategies. If you’re building persistent agent memory, new research on deployment-time memorization shows why deletion fidelity matters.

FAQ

What happens when an agent exceeds its context window? The LLM either truncates the oldest content (losing early context) or throws an error. Performance degrades: the model loses track of earlier instructions, tool outputs, or conversation state. You also pay for the full context in every API call.

Which context window strategy is best for production agents? A hybrid approach: sliding window for recent conversation, summarization for mid-term memory, and structured memory (vector DB or key-value store) for facts that need to persist across sessions. No single strategy handles all cases.

How many tokens should I allocate to each part of the context? A good starting budget: instructions 15%, tools/definitions 10%, conversation history 40%, structured memory retrieval 20%, output buffer 15%. Adjust based on your agent’s task profile.

Does a larger context window eliminate the need for memory management? No. Even with 200K token windows, you still need management. Larger windows are slower, more expensive, and the model’s attention degrades on tokens in the middle of the context. Memory strategies remain essential.

Anthropic documentation on context windows covers how models manage large contexts, including prompt caching and compaction.

This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents. Reach me at hello@agenticup.dev.
