PICKS

My AI model picks

First-person picks from someone who ships AI agents for a living. These are the models I actually use, ranked by what I'd reach for first when starting a new project. Not a leaderboard mirror. Updated .

I spent three years building agents with whatever model was trending on Hacker News. I switched models every quarter, chasing benchmarks, and every switch cost me a week of prompt rewrites. Then I stopped following leaderboards and started tracking what actually worked for the agents I shipped. This page is that list. It only changes when a model demonstrably beats the current pick in real agent loops over 2+ weeks of shipping, not because of launch hype or benchmark scores.

How to read this page. #1 is the model I'd reach for first. #2 is the fallback when #1 is rate-limited, blocked, or stylistically wrong for the task. #3 is a specialist pick for a specific scenario. The picks are for AI agent development specifically: tool use, long context, code generation, structured output, and retry resilience. If you want a general-purpose chat ranking, the LMArena leaderboard and OpenCode production data are better signals.

Text models

What I use for the agent loop — picking tools, calling functions, recovering from errors, and generating structured output.

| | #1 Opus 4.8 | #2 DeepSeek V4-Pro | #3 GPT-5.5 |
|---|---|---|---|
| Provider | Anthropic | DeepSeek (open-weights) | OpenAI |
| Best for | Everything. 50+ step agent loops, long context, tool use | Cost-sensitive pipelines, batch coding, high volume | Codex CLI, terminal workflows, creative tasks |
| Cost per M tokens (input / output) | $5 / $25 | $0.44 / $0.87 | $2.50 / $10 |
| OpenCode session cost | N/A (no Anthropic) | $0.55/session | $2.50/session (est.) |
| Context window | 200K (strong at 1M) | 1M | 128K |
| Open weights | No | Yes (MIT) | No |

#1 Claude Opus 4.8

Anthropic

My daily driver for every agent build. The most capable model still available after the Fable 5 suspension. If I could only use one model, this would be it.

I have shipped more agents on Opus 4.8 than any other model. It produces 35% fewer output tokens than 4.7 — getting to the point faster without losing quality. In my testing, it handles 50+ step tool-use loops without drifting from the system prompt, which no other model at this price point does reliably. Its long-context GraphWalks score of 68.1% at 1M tokens beats GPT-5.5 by 22 points. When a task needs sustained reasoning across a large codebase, Opus is the one I trust.

What about Fable 5? Claude Fable 5 (Mythos-class, launched June 9) scored 95% on SWE-bench Verified and should be #1. On June 12, the US government issued an export control directive under national security authorities, forcing Anthropic to suspend access for all customers. It will return to #1 if access is restored. The lesson: never couple your agent to one provider. See how to build model-agnostic agents.
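A minimal sketch of what "never couple your agent to one provider" can look like in code, assuming each provider is wrapped in a small adapter with the same signature. The `Provider` class, the `ProviderUnavailable` exception, and the stub adapters are hypothetical names for illustration, not any vendor's SDK. A real adapter would translate rate-limit, block, or suspension errors from its SDK into `ProviderUnavailable`.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class ProviderUnavailable(Exception):
    """Raised by an adapter when its provider is rate-limited, blocked, or suspended."""


@dataclass(frozen=True)
class Provider:
    name: str
    # Hypothetical adapter: takes a prompt, returns the model's text reply.
    complete: Callable[[str], str]


def complete_with_fallback(providers: Sequence[Provider], prompt: str) -> tuple[str, str]:
    """Try providers in rank order; return (provider_name, reply) from the first that works."""
    errors: list[str] = []
    for provider in providers:
        try:
            return provider.name, provider.complete(prompt)
        except ProviderUnavailable as exc:
            # Record the failure and move on to the next pick in the chain.
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stub adapters standing in for real SDK calls (assumptions for the demo).
def suspended(prompt: str) -> str:
    raise ProviderUnavailable("access suspended")


def echo(prompt: str) -> str:
    return f"echo: {prompt}"


chain = [Provider("primary", suspended), Provider("fallback", echo)]
used, reply = complete_with_fallback(chain, "ping")
print(used, reply)
```

The agent loop only ever sees `complete_with_fallback`, so reordering the picks or dropping a suspended provider is a one-line change to `chain` rather than a prompt rewrite.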

#2 DeepSeek V4-Pro

DeepSeek (open-weights, MIT license)

The best open-weights model for agentic coding. Costs a third of Opus and comes close on most coding tasks. My go-to for high-volume pipelines.

On OpenCode, DeepSeek models process billions of tokens daily with a 97% cache hit rate, which means the effective cost is even lower than the listed $0.08 per million tokens. I route most of my batch coding workloads and cost-sensitive agent pipelines through V4-Pro. It falls behind Opus on long-context reliability past 500K tokens, where tool-use consistency starts to fade. For anything under 500K tokens and anything where cost per call matters, V4-Pro is the pick.
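To make the cache arithmetic concrete, here is a rough sketch of how a cache hit rate changes the blended input price. It assumes cached input tokens are billed at a discounted rate. The cached price used below (10% of the list price) is a made-up placeholder, not a published DeepSeek or OpenCode figure, and both function names are hypothetical.

```python
def effective_input_price(list_price: float, cached_price: float, cache_hit_rate: float) -> float:
    """Blended price per million input tokens, given the fraction served from cache."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return cache_hit_rate * cached_price + (1.0 - cache_hit_rate) * list_price


def session_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in dollars for one session; prices are per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# $0.44 is the input list price from the table above.
# The cached price is an ASSUMPTION for illustration (10% of list price).
blended = effective_input_price(list_price=0.44, cached_price=0.044, cache_hit_rate=0.97)
print(f"blended input price per M tokens: ${blended:.4f}")
print(f"example session: ${session_cost(2_000_000, 100_000, blended, 0.87):.4f}")
```

Swap in the provider's actual cached-token rate before relying on the numbers. The point is only that at a high hit rate, the cached rate dominates the blended price.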

#3 GPT-5.5

OpenAI

Best for Codex CLI and terminal-heavy workflows. I keep it around for creative tasks where Claude is too literal.

GPT-5.5 leads Terminal-Bench 2.1 at 82.7%, and its Codex CLI integration gives it a clear advantage for scaffolding, shell scripts, and DevOps automation. On LMArena, it scores competitively on creative writing and brainstorming tasks. But on pure agent-loop reliability past 200K tokens, it trails both Claude models significantly. I use GPT-5.5 when the task is terminal-first or style-sensitive. For structured agent work, I reach for Opus first.

Embeddings

What I use to embed documents for retrieval-augmented agents and semantic search.

| | #1 voyage-3 | #2 text-embedding-3-large |
|---|---|---|
| Provider | Voyage AI | OpenAI |
| Best for | English and code retrieval. RAG pipelines | Safe default. Azure/OpenAI stacks |
| Cost per M tokens | $0.06 | $0.13 |

#1 voyage-3

Voyage AI

I switched to voyage-3 after running my own retrieval benchmarks and have not looked back. It beats OpenAI on code and technical docs at half the price.
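For readers who want to run a similar comparison, the following is a minimal sketch of a recall@k check over precomputed embeddings. It is not my actual benchmark harness. The toy vectors stand in for whatever an embedding model returns, and `recall_at_k` and the data layout are illustrative assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(query_vecs: dict[str, list[float]],
                doc_vecs: dict[str, list[float]],
                relevant: dict[str, str],
                k: int) -> float:
    """Fraction of queries whose relevant doc id lands in the top-k by cosine similarity."""
    hits = 0
    for qid, qvec in query_vecs.items():
        ranked = sorted(doc_vecs, key=lambda d: cosine(qvec, doc_vecs[d]), reverse=True)
        if relevant[qid] in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy 2-d vectors (assumptions); real runs would embed the same corpus with each model.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
queries = {"q1": [0.9, 0.1], "q2": [0.1, 0.9]}
relevant = {"q1": "a", "q2": "c"}

print(recall_at_k(queries, docs, relevant, k=1))
print(recall_at_k(queries, docs, relevant, k=2))
```

Running the same queries and relevance labels against each model's embeddings, then comparing recall@k alongside the per-token price, gives a like-for-like quality-per-dollar comparison.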

#2 text-embedding-3-large

OpenAI

Use this when the client has an Azure OpenAI contract and Voyage is not approved. It works everywhere and is well-documented.

Vision

For agents that take screenshots, read diagrams, or need vision understanding.

#1 Claude Opus 4.8 vision

Anthropic

The only vision model I ship with. It handles complex diagrams, dense UI screenshots, and architectural drawings reliably.

Quick reference

Not sure which to pick? Start here.

| Your priority | Pick | Why |
|---|---|---|
| Maximum capability, no budget constraint | #1 Opus 4.8 | Best tool-use reliability, long-context performance, and agent-loop stability I have shipped with |
| Best value, high volume | #2 DeepSeek V4-Pro | A third of the cost, close on benchmarks, open weights, 97% cache hit rate |
| Codex CLI / shell-heavy workflow | #3 GPT-5.5 | Best Terminal-Bench score and Codex CLI integration |
| RAG / document retrieval | voyage-3 | Best retrieval quality per dollar for English and code |
| Vision / UI parsing | Opus 4.8 vision | Only vision model I ship with |

What I don't pick

Categories I deliberately don't rank. Picking a model I don't ship with would be guessing.

Contact: hello@agenticup.dev