My AI model picks
First-person picks from someone who ships AI agents for a living. These are the models I actually use, ranked by what I'd reach for first when starting a new project. Not a leaderboard mirror. Updated .
I spent three years building agents with whatever model was trending on Hacker News. I switched models every quarter, chasing benchmarks, and every switch cost me a week of prompt rewrites. Then I stopped following leaderboards and started tracking what actually worked for the agents I shipped. This page is that list. It changes only when a model demonstrably beats the current pick in real agent loops over two or more weeks of shipping, not because of launch hype or benchmark scores.
How to read this page. #1 is the model I'd reach for first. #2 is the fallback when #1 is rate-limited, blocked, or stylistically wrong for the task. #3 is a specialist pick for a specific scenario. The picks are for AI agent development specifically: tool use, long context, code generation, structured output, and retry resilience. If you want a general-purpose chat ranking, the LMArena leaderboard and OpenCode production data are better signals.
Text models
What I use for the agent loop — picking tools, calling functions, recovering from errors, and generating structured output.
| | #1 Opus 4.8 | #2 DeepSeek V4-Pro | #3 GPT-5.5 |
|---|---|---|---|
| Provider | Anthropic | DeepSeek (open-weights) | OpenAI |
| Best for | Everything. 50+ step agent loops, long context, tool use | Cost-sensitive pipelines, batch coding, high volume | Codex CLI, terminal workflows, creative tasks |
| Cost per M tokens | $5 / $25 | $0.44 / $0.87 | $2.50 / $10 |
| OpenCode session cost | N/A (no Anthropic) | $0.55/session | $2.50/session (est.) |
| Context window | 200K (strong at 1M) | 1M | 128K |
| Open weights | No | Yes (MIT) | No |
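To make the price gap concrete, here is a rough sketch of the per-call arithmetic. It assumes the two figures in the cost row are input and output prices per million tokens (the table does not say so explicitly), and the workload size is a made-up example, not a measurement.

```python
# Per-million-token prices copied from the table above.
# Assumption: the first figure is the input price, the second the output price.
PRICES_USD_PER_M = {
    "Opus 4.8": (5.00, 25.00),
    "DeepSeek V4-Pro": (0.44, 0.87),
    "GPT-5.5": (2.50, 10.00),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the list-price cost in USD of one call, ignoring caching and discounts."""
    input_price, output_price = PRICES_USD_PER_M[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 1M input tokens and 200K output tokens.
for name in PRICES_USD_PER_M:
    print(f"{name}: ${call_cost(name, 1_000_000, 200_000):.2f}")
```

List prices are only a starting point; caching and session length change what a call actually costs.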
Claude Opus 4.8
Anthropic
My daily driver for every agent build. The most capable model still available after the Fable 5 suspension. If I could only use one model, this would be it.
I have shipped more agents on Opus 4.8 than any other model. It produces 35% fewer output tokens than 4.7 — getting to the point faster without losing quality. In my testing, it handles 50+ step tool-use loops without drifting from the system prompt, which no other model at this price point does reliably. Its long-context GraphWalks score of 68.1% at 1M tokens beats GPT-5.5 by 22 points. When a task needs sustained reasoning across a large codebase, Opus is the one I trust.
What about Fable 5? Claude Fable 5 (Mythos-class, launched June 9) scored 95% on SWE-bench Verified and should be #1. On June 12, the US government issued an export control directive under national security authorities, forcing Anthropic to suspend access for all customers. It will return to #1 if access is restored. The lesson: never couple your agent to one provider. See how to build model-agnostic agents.
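One way to avoid that coupling is to put every provider behind the same small interface and try them in ranked order. The sketch below is a minimal illustration, not the architecture from the linked post: `ProviderUnavailable`, `complete_with_fallback`, and both stand-in adapters are hypothetical names, and real adapters would wrap each provider's SDK.

```python
from typing import Callable, Sequence


class ProviderUnavailable(Exception):
    """Hypothetical error an adapter raises when its provider is rate-limited or blocked."""


# An adapter is any function that takes a prompt and returns the model's text.
ModelAdapter = Callable[[str], str]


def complete_with_fallback(
    prompt: str, adapters: Sequence[tuple[str, ModelAdapter]]
) -> tuple[str, str]:
    """Try each (name, adapter) in ranked order; return (name, reply) from the first that succeeds."""
    failures = []
    for name, adapter in adapters:
        try:
            return name, adapter(prompt)
        except ProviderUnavailable as exc:
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all providers unavailable: " + "; ".join(failures))


# Stand-in adapters for illustration only.
def suspended_model(prompt: str) -> str:
    raise ProviderUnavailable("access suspended")


def backup_model(prompt: str) -> str:
    return f"echo: {prompt}"


used, reply = complete_with_fallback(
    "list the open pull requests",
    [("primary", suspended_model), ("fallback", backup_model)],
)
print(used, reply)
```

The agent loop only ever sees `complete_with_fallback`, so swapping or reordering providers is a change to the list, not to the agent.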
DeepSeek V4-Pro
DeepSeek (open-weights, MIT license)
The best open-weights model for agentic coding. Costs a third of Opus and comes close on most coding tasks. My go-to for high-volume pipelines.
On OpenCode, DeepSeek models process billions of tokens daily with a 97% cache hit rate, meaning the effective cost is even lower than the listed $0.08 per million tokens. I route most of my batch coding workloads and cost-sensitive agent pipelines through V4-Pro. It falls behind Opus on long-context reliability past 500K tokens, where tool-use consistency starts to fade. For anything under 500K and anything where cost per call matters, V4-Pro is the pick.
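The effect of a high cache hit rate on price can be sketched as a weighted average. Only the 97% hit rate comes from the paragraph above; the listed and cached prices below are placeholders chosen for easy arithmetic, not DeepSeek's actual rates, and `effective_input_price` is a hypothetical helper.

```python
def effective_input_price(listed_price: float, cached_price: float, cache_hit_rate: float) -> float:
    """Blend cached and uncached input prices by the share of tokens served from cache."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return cache_hit_rate * cached_price + (1.0 - cache_hit_rate) * listed_price


# Illustrative numbers only: the 0.97 hit rate is from the text above, but the
# listed and cached prices are placeholders, not real provider pricing.
listed = 1.00  # hypothetical USD per million uncached input tokens
cached = 0.10  # hypothetical USD per million cached input tokens
print(f"{effective_input_price(listed, cached, 0.97):.3f}")
```

With a hit rate this high, the blended price sits close to the cached price, which is why the listed rate overstates what a long agent session costs.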
GPT-5.5
OpenAI
Best for Codex CLI and terminal-heavy workflows. I keep it around for creative tasks where Claude is too literal.
GPT-5.5 leads Terminal-Bench 2.1 at 82.7%, and its Codex CLI integration gives it a clear advantage for scaffolding, shell scripts, and DevOps automation. On LMArena, it scores competitively on creative writing and brainstorming tasks. But on pure agent-loop reliability past 200K tokens, it trails both Claude models significantly. I use GPT-5.5 when the task is terminal-first or style-sensitive. For structured agent work, I reach for Opus first.
Embeddings
What I use to embed documents for retrieval-augmented agents and semantic search.
| | #1 voyage-3 | #2 text-embedding-3-large |
|---|---|---|
| Provider | Voyage AI | OpenAI |
| Best for | English and code retrieval. RAG pipelines | Safe default. Azure/OpenAI stacks |
| Cost per M tokens | $0.06 | $0.13 |
voyage-3
Voyage AI
I switched to voyage-3 after running my own retrieval benchmarks and have not looked back. It beats OpenAI on code and technical docs at half the price.
text-embedding-3-large
OpenAI
Use this when the client has an Azure OpenAI contract and Voyage is not approved. It works everywhere and is well-documented.
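Whichever embedding model you pick, the retrieval step it feeds looks roughly the same. The sketch below ranks documents by cosine similarity using toy vectors in place of real embeddings; the document names, vectors, and the `top_k` helper are all made up for illustration, and a production pipeline would normally use a vector index instead of a linear scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional vectors standing in for real embeddings, which have far more dimensions.
docs = {
    "retry-policy.md": [0.9, 0.1, 0.0],
    "billing-faq.md": [0.0, 0.2, 0.9],
    "tool-schema.md": [0.7, 0.6, 0.1],
}
print(top_k([1.0, 0.2, 0.0], docs))
```

Because the ranking logic does not depend on where the vectors came from, swapping embedding providers means re-embedding the corpus, not rewriting the retriever.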
Vision
For agents that take screenshots, read diagrams, or need vision understanding.
Claude Opus 4.8 vision
Anthropic
The only vision model I ship with. It handles complex diagrams, dense UI screenshots, and architectural drawings reliably.
Quick reference
Not sure which to pick? Start here.
| Your priority | Pick | Why |
|---|---|---|
| Maximum capability, no budget constraint | #1 Opus 4.8 | Best tool-use reliability, long-context, and agent-loop stability I have shipped with |
| Best value, high volume | #2 DeepSeek V4-Pro | A third of the cost, close on benchmarks, open weights, 97% cache hit rate |
| Codex CLI / shell-heavy workflow | #3 GPT-5.5 | Best Terminal-Bench score and Codex CLI integration |
| RAG / document retrieval | voyage-3 | Best retrieval quality per dollar for English and code |
| Vision / UI parsing | Opus 4.8 vision | Only vision model I ship with |
What I don't pick
Categories I deliberately don't rank. Picking a model I don't ship with would be guessing.
- Image generation (FLUX, DALL-E, Midjourney). I don't ship image-generation agents. See the LMArena leaderboard.
- Video generation (Sora, Veo, Runway). Not my use case.
- Audio / speech (Whisper, ElevenLabs). Whisper dominates its category. Ranking alternatives would be theatre.
- Other open-weights models (Qwen, Kimi, GLM, Xiaomi). I use DeepSeek V4-Pro daily. The rest are covered in the open-source landscape post.
Where this data comes from
- OpenCode production data — real token usage, session costs, and cache ratios across hundreds of thousands of agent sessions. The cache ratio numbers and session costs are from here.
- LMArena — community preference rankings for chat quality across text, code, and vision. Cross-reference for general capability.
Related reading
- How to build model-agnostic agents — written the day Fable 5 was banned. Architecture for switching models when a provider disappears.
- The open-source AI model landscape: June 2026 — 9 open-weights models ranked for production use.
- Best open-source LLMs for coding 2026 — deeper dive on DeepSeek, Qwen, Kimi, and Gemma.
- The full stack I use to build agents — not just the models, but the tools and infrastructure.