The 15 jobs every agent harness must do

Q: Where do I start if I want to build my own harness?

Start with the state machine (job 4). It's the conductor: everything else is a station it talks to. Get a turn flowing through start -> provision -> assistant_stream -> function_execute -> steering_check -> teardown with a real model call. Then add job by job: policy gate (job 8), session persistence (job 12), observability (job 15). The 15 jobs are ordered: follow the sequence.

15 jobs, not one. Agent harness architecture reference: turn requests, credentials, model catalog, FSM, skills, streaming, policy, approvals, budgets, hooks, sessions, compaction, events, tracing.

I shipped my first agent without a harness. It worked for about two weeks. Then a tool call looped 47 times, burned through ₹9,600 in API credits, and I spent a weekend rebuilding the entire stack from scratch.

The problem was not the model. It was that I had no way to swap the policy engine without touching the credential resolver, no way to upgrade the session store without rebuilding the FSM. Everything was coupled. Everything broke at once.

An agent harness is not one thing. It’s 15 separate jobs, bundled together by the frameworks because nothing underneath gave you a way to compose them.

Most teams don’t choose the bundle: they inherit it. They pick LangChain or LangGraph because it’s the obvious choice, and they accept the tradeoff: every job is in one box, and when one job needs to change, they have to change the whole box.

This post maps all 15 jobs. What each one does. Why it matters. How to replace one without touching the others. Bookmark this: it’s your checklist before any production deploy.

TL;DR: An agent harness has 15 jobs: from persisting turn requests to compacting context windows. The state machine (job 4) is the conductor; everything else is a station. Frameworks bundle all 15 into one install; the composition model lets you swap any one without touching the others. Start with job 4, add jobs incrementally.

Key takeaways:

15 jobs, not 1: each has a different change frequency and replacement risk

The state machine (job 4) is the conductor: it orchestrates all other jobs

Frameworks bundle all 15; composition lets you replace any one independently

Thin harness = jobs 1-4, 7 (autonomous agents). Thick harness = all 15 (production workflows)

Every job has a kitchen analogy: use this guide to explain the harness to anyone

The 15 jobs at a glance

#	Job	Kitchen equivalent	Deep dive
1	Accept and persist turn request	Hostess writes reservation in booking book	Logging & monitoring
2	Resolve credentials per provider	Vendor accounts, know which login to use	-
3	Look up model capabilities	Equipment reference sheet: what can each station do?	Context window
4	Drive the per-turn state machine	Head chef at the pass: orchestrates the kitchen	State machine
5	Load and serve skill bodies	Recipe cards at each station	Function calling
6	Assemble the system prompt	Pre-service team briefing, mode, identity, available skills	-
7	Stream tokens back to client	Pass window: plates go out as components are ready	Function calling
8	Policy-check every tool call	”Are we allowed to use that ingredient?”	Policy gates
9	Pause for human approval	Sommelier approval for expensive wine	Policy gates
10	Track LLM spend against budgets	Kitchen accountant: food cost per dish	Error handling
11	Run hooks before and after tool calls	Quality control, before and after each plate	-
12	Persist session as branching tree	Ticket log that branches when orders change	Branching sessions
13	Compact history when context fills	Prep board consolidation: clear space, keep working	Context window
14	Emit event stream for UI	Display board: every table’s status in real time	Logging & monitoring
15	OpenTelemetry trace across every step	Kitchen CCTV: one continuous recording of everything	Logging & monitoring

The 15 jobs in detail

Job 1: Accept a turn request and persist it

What it does: Catches the incoming message, gives it a unique ID, writes it to the session store before anything else happens.

Why it matters: The request exists the moment it arrives. If you don’t persist it immediately and the agent crashes before handling it, you have no proof the request was received. Persisting first gives you a paper trail from the start. All subsequent jobs reference this ID.

Kitchen equivalent: The hostess at a restaurant takes a reservation and writes it in the booking book before the kitchen even knows about it. That log is the proof the request existed.

Links: AI agent logging and monitoring: session state is logged at the start of every turn.

Job 2: Resolve credentials for the model provider

What it does: Figures out which provider (OpenAI, Anthropic, Google, a local model) is being called for this turn, finds the right API key, and makes sure it’s available when the actual call happens.

Why it matters: You can’t hard-code one API key. Providers change, keys rotate, each workspace has different keys. The harness looks up the right credential at runtime, every time.

Kitchen equivalent: Every supplier (meat vendor, fish vendor, vegetable vendor) has their own account. When you need to order beef, you have to know which vendor account to charge.

Job 3: Look up what the chosen model can do

What it does: The harness maintains a catalog of every model available to the agent: context window size, vision capabilities, tool support, streaming support. Checks this catalog before routing a request.

Why it matters: Send a 200-page document to a model with a 4,000 token context window and it fails. Ask a text-only model to do a vision task and it fails. The catalog prevents these failures: route to a capable model or reject before it fails.

Kitchen equivalent: Before the kitchen starts cooking, someone checks: do we have a grill that can do steaks? What’s the oven’s max temperature? The equipment reference sheet.

Links: AI agent context window management: context window limits are part of the model catalog.

Job 4: Drive the per-turn state machine

What it does: The conductor. Manages the sequence of states every turn goes through: start → provision → assistant_stream → function_execute → steering_check → teardown. Knows what transitions are valid, handles errors, decides when to loop and when to stop.

Why it matters: This is the heart of the harness. Without it, the agent has no concept of “where it is.” It can run the same tool call 11 times. It doesn’t know when to stop. It doesn’t recover from crashes. The FSM delivers reliable agent behavior.

Kitchen equivalent: The head chef at the pass. Knows exactly what stage every dish is at. Calls out the transitions, “risotto out, steak on the plate.” Decides what to do when something goes wrong, remake, substitute, or tell the table.

Links: Build a state machine for your AI agent in a weekend: how to build the 6-state turn FSM.

Job 5: Load and serve skill bodies

What it does: Maintains the catalog of every tool the agent can call: the function schema, the inputs it needs, the errors it might return, when to use it, when not to. Serves these skill bodies on demand so the model knows how to call each tool correctly.

Why it matters: The model doesn’t inherently know how to use your tools. The skill body makes a tool discoverable and correctly callable. If the skill body for “send email” is wrong, the model calls it with wrong parameters or doesn’t call it at all.

Kitchen equivalent: Recipe cards at each station. The grill station has a card for ribeye, temperature, resting time, finishing butter. The sauce station has a card for béchamel, roux thickness, when to add milk. Every procedure, documented.

Links: OpenAI function calling tutorial: tool schemas and function calling patterns.

Job 6: Assemble the system prompt

What it does: Builds the instruction block sent to the model every turn. Assembles it from pieces: the mode paragraph (plan/ask/agent), the identity preamble (who the agent is, how to use tools), the list of available skills, the working directory context.

Why it matters: The system prompt shapes the model’s behavior. Get it wrong and the model doesn’t know it’s an agent, doesn’t know how to use tools, doesn’t know what mode it’s in. The harness has to assemble the right prompt for every turn.

Kitchen equivalent: Pre-service team briefing. “Tonight we’re doing a tasting menu: that’s the mode. You’re the team at Restaurant XYZ: that’s the identity. We have 12 courses planned: here’s the menu. The sommelier is on call if you need wine pairings: that’s a skill available on demand.”

Job 7: Stream tokens back to the client

What it does: Catches the model’s streaming response and pushes it to the client (browser, CLI) in real time. The user sees the response as it’s being generated, not after it’s fully done.

Why it matters: Seeing text appear gradually feels responsive. Waiting for the whole response feels slow. Streaming is standard for modern AI interfaces. The harness has to handle the connection, manage disconnects, and route the stream to the right client.

Kitchen equivalent: The pass window: plates go out as components are ready. The sauce is done, it goes out. The garnish is placed, it goes out. The customer sees the dish being assembled in front of them.

Links: OpenAI function calling tutorial: streaming with tool call deltas.

Job 8: Check every tool call against a policy before it runs

What it does: Every tool the model wants to call goes through one chokepoint: consultBefore. The policy rules say what’s allowed, what’s denied, what needs human approval. The gate returns allow, deny, or needs_approval before any tool executes.

Why it matters: This is the safety gate. Without it, any tool the model decides to call runs immediately: delete files, send emails, spend money. The policy check stops the agent from doing things it shouldn’t.

Kitchen equivalent: Every time a station wants to use a restricted ingredient, “we want to use the truffle, it’s $200 for the portion”, the policy check is “does the ticket allow premium ingredients? Is there a budget for this?”

Links: The policy gate every agent needs before production: the fail-closed pattern, consultBefore pattern, three outcomes.

Job 9: Pause tool calls that need human decision and route the answer back

What it does: Some tool calls pass the policy check but still need a human to say “yes, do this.” These get parked: the turn pauses, the human is notified, the answer routes back into the right turn and the turn resumes exactly where it left off.

Why it matters: Not everything can be fully automated. Customer-facing actions, destructive actions, expensive actions: these need a human in the loop. The harness has to support this without breaking the turn’s state.

Kitchen equivalent: The chef wants to use the restaurant’s last bottle of a rare wine for a table’s order. The policy says “allowed” but the sommelier has to physically approve it. The kitchen doesn’t proceed until the sommelier says yes: and when they do, the kitchen continues without re-cooking anything.

Links: The policy gate every agent needs before production: the reactive approval trigger (turn::on_approval) pattern.

Job 10: Track LLM spend against per-workspace or per-agent budgets

What it does: Every LLM call costs money. The harness tracks spending against budgets set per workspace, per agent, per customer. When a workspace approaches its limit, the harness throttles requests or alerts someone.

Why it matters: Without this, you have no financial visibility. You don’t know which agent is burning through budget, which customer is accidentally running expensive loops, when you’re going to hit a surprise bill.

Kitchen equivalent: The restaurant accountant tracks every dish’s food cost against the menu price. If a table orders 10 portions of the expensive tasting menu, the accountant knows that bill is going to be high. If the monthly ingredient budget is running low, the chef gets alerted.

Links: AI agent error handling patterns: circuit breakers and cost caps as budget enforcement mechanisms.

Job 11: Run hooks before and after tool calls

What it does: Hooks are side effects that run at specific points: before a tool executes (log it, redact sensitive data) and after it executes (check for errors, update a counter). The harness provides the before/after pattern so you can add behavior without modifying the tool itself.

Why it matters: This is how you add custom behavior, logging, redaction, metrics, custom side effects, without touching the tool code. Hooks are composable: add as many as you want.

Kitchen equivalent: Quality control checks. Before each plate goes out: “Did the chef wash their hands? Is the temperature right?” After: “Was the plate returned clean? Was there a complaint?” These checks happen around every action, not as part of the action.

Job 12: Persist the session as a branching tree

What it does: Stores the full conversation history as a tree, not a line. Each turn is a node with optional children. The user can ask “what if we tried X instead of Y?”, a new branch forks from the last common node. The original branch stays intact.

Why it matters: Linear sessions break the moment you want to explore a branch. With a branching tree, you never lose the main thread. You can fork, explore, and return: or keep both branches and compare.

Kitchen equivalent: The kitchen’s ticket log with a twist. When a customer changes an order mid-way, the kitchen writes a new ticket that branches off the original. The original order, what was started before the change, is still in the log. The kitchen can go back to it.

Links: Why your agent forgets conversations (and how to fix it with a branching tree): the branching tree model, fork and resume pattern.

Job 13: Compact session history when the context window fills up

What it does: When the conversation gets long enough that the context window starts filling, the harness compacts the history, summarization, selective forgetting, compression, so the agent can keep running without hitting the wall.

Why it matters: Without compaction, the agent hits the context window limit and either drops old history (losing context) or refuses new requests (breaking the agent). Compaction lets the agent run indefinitely on long conversations.

Kitchen equivalent: The kitchen’s prep board holds a limited number of orders. When it’s full, the chef reviews the board, consolidates similar tickets, clears space. “we’re still working on the same 5 tables, just more efficiently.” The kitchen keeps running.

Links: AI agent context window management: sliding windows, summarization, structured memory, and when to use each.

What it does: The UI needs to know what’s happening inside the agent in real time: tool calls, results, approval requests, turn endings. The harness emits events on topics, and the UI subscribes to the events it needs.

Why it matters: Without this, the UI is blind. It sends a request and waits for a final response with no visibility into what’s happening in the middle. With an event stream, the UI shows “the agent is calling the email tool” in real time.

Kitchen equivalent: The kitchen display board. “Table 7: steak is being cooked, table 12: soup is plated, table 3: waiting for manager approval on wine choice.” Events appear as they happen, not only at the end.

Links: AI agent logging and monitoring: structured event logging, JSON Lines format, replay patterns.

Job 15: Carry one OpenTelemetry trace across every step

What it does: Every operation in the turn is tagged with the same session/message/function IDs. When something goes wrong, you can see the full chain: which session, which turn, which function call, how long it took, what it returned.

Why it matters: Without tracing, debugging a failing agent is like finding a leak without knowing which floor the pipe is on. You know something went wrong but you can’t see the path. With tracing, you can pinpoint exactly where the failure happened.

Kitchen equivalent: Full CCTV recording of every shift. When a dish goes wrong, you can rewind and see exactly what happened. “the sous chef added the sauce at the wrong time.” You follow one order’s journey from placement to service.

Links: AI agent logging and monitoring: decision point logging, structured JSON, replay debugging.

Why is the state machine the conductor of the harness?

All 15 jobs are connected by one: job 4, the per-turn state machine.

The FSM is the conductor. Everything else is a station the conductor talks to.

Turn comes in
 → Job 1: Persist request
 → Job 2: Resolve credentials
 → Job 3: Look up model capabilities
 → Job 4: push FSM (orchestrates everything below)
 → Job 5: Load skill bodies
 → Job 6: Assemble system prompt
 → Job 7: Stream tokens back to client
 → Job 8: Policy check every tool call
 → Job 9: Pause for human approval if needed
 → Job 10: Track spend against budget
 → Job 11: Run before/after hooks
 → Job 12: Persist session as branching tree
 → Job 13: Compact history when context fills
 → Job 14: Emit events for UI
 → Job 15: OTel trace across everything

The FSM transitions from assistant_stream to function_execute, triggering the policy gate (job 8). When the FSM transitions to steering_check, it evaluates whether to continue the loop. When the FSM transitions to stopped or failed, it calls teardown which triggers job 12 (persist) and job 14 (emit agent_end event).

Every job is a station. The FSM is the train that moves between them.

When should I use a thin vs thick harness?

The 15 jobs aren’t all-or-nothing. You can run a thin harness or a thick one by adding or removing jobs from your config:

Thin harness: Jobs 1, 2, 3, 4, 7. No approvals, no budgets, no hooks, no compaction, no tracing. For autonomous research agents where you trust the model. The agent runs fast and loose.

Thick harness: All 15. For production customer-facing workflows where every tool call needs to be auditable, every dollar tracked, every action logged and traceable. The agent runs with guardrails.

The distance between thin and thick isn’t a rewrite. It’s a config change. Same wire protocol, same trace shape, same observability story. The slider moves by adding and removing workers from your config.

Why do frameworks trap you into one architecture?

The reason the 15 jobs exist as a list is that most teams discover them by hitting them: one by one, in production, when something breaks.

The framework trap is this: you pick a framework (LangChain, LangGraph, CrewAI) and it ships all 15 jobs in one box. It works great for the first few months. Then you need to replace the policy engine (job 8) because your security requirements changed. You find out you can’t just swap it: it’s baked into the framework’s loop. You have two choices: fight the framework, or rewrite the harness from scratch.

The alternative is the composition model: each job is a separate worker on a shared bus. The policy engine is a worker. The credential resolver is a worker. The session store is a worker. Replace any one by writing a new worker that registers the same function IDs. The rest of the stack doesn’t change.

That’s the architectural bet underneath everything in this post. The 15 jobs are not a design choice. They’re a fact about what an agent harness has to do. The design choice is whether you bundle them or compose them.

Start here

If you’re building your own harness, start with job 4: the state machine. Get a turn flowing through the 6 states with a real model call. Everything else builds on top of that foundation.

The state machine post has the complete build guide. The policy gates post has the safety layer. The branching sessions post has the session persistence model.

This post is the map. Those posts are the trailheads.

Agent mode:The 15 jobs are the complete picture of what an agent harness does. Bookmark this reference. Use it as a checklist when evaluating frameworks, when designing your own stack, and before any production deployment. Every job is a place where something can go wrong: and every job is a place where something can be replaced when it does.

FAQ

What’s the difference between an agent framework and an agent harness? A framework (LangChain, LangGraph, CrewAI) bundles all 15 jobs into one install: you get everything or nothing. An agent harness is decomposed into separate, independently replaceable components. When you need to swap the policy engine, you swap just the policy engine: the other 14 jobs stay unchanged. The framework model locks you in; the composition model keeps you flexible.

Why are there 15 jobs and not fewer? Each job has a different concern: credential resolution vs. context compaction vs. event emission vs. policy checking. They don’t all need to change at the same rate: the policy rules might change weekly, the model catalog monthly, the streaming implementation never. Bundling them means every change touches everything. Decomposing them means you can upgrade one without touching the others.

What’s the single most important job in the harness? The per-turn state machine (job 4). It orchestrates all the others: it calls the credential resolver, the model catalog, the policy gate, the session store. If the FSM is wrong, everything downstream is wrong. Everything else is a station the conductor talks to.

Can I run a thin harness with just some of these jobs? Yes. A thin harness might run just jobs 1-4 and 7 (persist, credentials, model catalog, FSM, stream): no approvals, no budgets, no hooks. That’s appropriate for autonomous research agents where you trust the model. A thick harness adds jobs 8-15 for production customer-facing workflows where every tool call needs to be auditable. The slider is a config change, not a rewrite.

Read Build a state machine for your AI agent in a weekend for the full FSM implementation: the 6 states, valid transitions, error handling, teardown.

Read The policy gate every agent needs before production for jobs 8 and 9: fail-closed policy checks, the three outcomes, the reactive approval trigger.

Read Why your agent forgets conversations for job 12: the branching tree model for session persistence that doesn’t lose context.

Read AI agent context window management for job 13: compaction strategies that keep the agent running on long conversations.

Read AI agent logging and monitoring for jobs 14 and 15: event streams, structured logging, and replay debugging.

Read AI agent multi-step workflows for how workflow patterns layer on top of the state machine: sequential, parallel, conditional, human-in-the-loop.

A 2026 survey on AI agent architectures maps the production agent stack across tools, memory, and guardrails. The Reddit AI Agents community discusses real-world agent harness configurations.

This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents. Reach me at hello@agenticup.dev

The 15 jobs every agent harness must do

The 15 jobs at a glance

The 15 jobs in detail

Job 1: Accept a turn request and persist it

Job 2: Resolve credentials for the model provider

Job 3: Look up what the chosen model can do

Job 4: Drive the per-turn state machine

Job 5: Load and serve skill bodies

Job 6: Assemble the system prompt

Job 7: Stream tokens back to the client

Job 8: Check every tool call against a policy before it runs

Job 9: Pause tool calls that need human decision and route the answer back

Job 10: Track LLM spend against per-workspace or per-agent budgets

Job 11: Run hooks before and after tool calls

Job 12: Persist the session as a branching tree

Job 13: Compact session history when the context window fills up

Job 14: Emit an event stream for the UI to subscribe to

Job 15: Carry one OpenTelemetry trace across every step

Why is the state machine the conductor of the harness?

When should I use a thin vs thick harness?

Why do frameworks trap you into one architecture?

Start here

FAQ

Related Posts

Get the brief on AI agents