How do you pick these models?

These are the models I actually reach for when building AI agents for clients or my own projects. The ranking reflects production experience: tool-use reliability, latency on agent loops, cost per run, and how often each model produces a usable agent without babysitting. I update this quarterly.

Why is Fable 5 not on the list?

Claude Fable 5 would have held #1. On June 12, the US government issued an export control directive, forcing Anthropic to suspend access for all customers globally. It will return to #1 if restored. Opus 4.8 is now the practical top pick.

How often do you update these picks?

Quarterly. The picks only change when a model demonstrably beats the current top pick on agent-loop tasks over 2+ weeks of real shipping. They are not based on leaderboard rankings or marketing launches.

What about open-source models?

I use DeepSeek V4-Pro daily as my #2 pick. It is the best open-weights model for agentic coding at roughly a third the cost of Opus. See my open-source AI model landscape post for a full ranking of 9 open-weights models.


My AI model picks

First-person picks from someone who ships AI agents for a living. These are the models I actually use, ranked by what I'd reach for first when starting a new project. Not a leaderboard mirror. Updated June 2026.

I spent my first year building agents with whatever model was trending on Hacker News. I switched models every quarter, chasing benchmarks, and every switch cost me a week of prompt rewrites. Then I stopped following leaderboards and started tracking what actually worked for the agents I shipped. This page is that list. It only changes when a model demonstrably beats the current pick in real agent loops over 2+ weeks of shipping. Not based on launch hype or benchmark scores.

How to read this page.

#1 Opus 4.8 is what I use for everything. It handles 50+ step loops, 200K context, and tool use better than anything else. Start here.

#2 DeepSeek V4-Pro is what I switch to when cost matters. It costs a third of Opus and comes close on most coding tasks. Use this for batch pipelines and high-volume agents.

#3 GPT-5.5 is for Codex CLI and terminal-heavy workflows. I keep it around for creative tasks where Claude is too literal.

These picks are for AI agent development: tool use, long context, code gen, structured output, retry resilience. For general-purpose chat, check LMArena or OpenCode.

Text models

What I use for the agent loop — picking tools, calling functions, recovering from errors, and generating structured output.

	#1 Opus 4.8	#2 DeepSeek V4-Pro	#3 GPT-5.5
Provider	Anthropic	DeepSeek (open-weights)	OpenAI
Best for	Everything. 50+ step agent loops, long context, tool use	Cost-sensitive pipelines, batch coding, high volume	Codex CLI, terminal workflows, creative tasks
Cost per M tokens	₹415 / ₹2,075 ($5 / $25)	₹37 / ₹72 ($0.44 / $0.87)	₹208 / ₹830 ($2.50 / $10)
OpenCode session cost	N/A (no Anthropic)	₹46/session ($0.55)	₹208/session ($2.50, est.)
Context window	200K (strong at 1M)	1M	128K
Open weights	No	Yes (MIT)	No
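To turn the per-million-token prices into a cost per agent run, here is a minimal sketch. It assumes the two figures in each table cell are input and output prices, which the table does not state outright. The model keys are made-up labels for illustration, not real API model IDs, and the token counts in the demo are arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per million tokens. The input/output split is an assumption."""
    input_per_m: float
    output_per_m: float


# USD figures copied from the comparison table; keys are hypothetical labels.
PRICES = {
    "opus-4.8": Price(5.00, 25.00),
    "deepseek-v4-pro": Price(0.44, 0.87),
    "gpt-5.5": Price(2.50, 10.00),
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one run at list price (no caching discount)."""
    price = PRICES[model]
    total = input_tokens * price.input_per_m + output_tokens * price.output_per_m
    return total / 1_000_000


if __name__ == "__main__":
    # Arbitrary example run: 400K input tokens, 30K output tokens.
    for name in PRICES:
        print(f"{name}: ${run_cost(name, 400_000, 30_000):.2f}")
```

Agent loops resend context on every step, so input tokens usually dominate the bill. That is why the input price matters more than the output price for long loops.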

Claude Opus 4.8

Anthropic

My daily driver for every agent build. Handles 50+ step tool-use loops, 200K context, and sustained reasoning better than anything else I have shipped with.

I have shipped more agents on Opus 4.8 than any other model. It produces 35% fewer output tokens than 4.7 — getting to the point faster without losing quality. In my testing, it handles 50+ step tool-use loops without drifting from the system prompt, which no other model at this price point does reliably. Its long-context GraphWalks score of 68.1% at 1M tokens beats GPT-5.5 by 22 points. When a task needs sustained reasoning across a large codebase, Opus is the one I trust.

What about Fable 5? Claude Fable 5 should be #1. It scored 95% on SWE-bench Verified. On June 12, the US government forced Anthropic to suspend access for all customers. If access is restored, it goes back to #1. The lesson: never couple your agent to one provider. See how to build model-agnostic agents.
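One way to avoid coupling an agent to a single provider is a fallback chain. The sketch below is a generic pattern, not the approach from the linked post. `Provider`, `complete_with_fallback`, and `AllProvidersFailed` are hypothetical names, and a provider here is just any callable that takes a prompt and returns text, so no real SDK is assumed.

```python
from typing import Callable, List, Sequence, Tuple

# A provider is any callable mapping a prompt to a completion (hypothetical type).
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain raised an exception."""


def complete_with_fallback(
    prompt: str, providers: Sequence[Tuple[str, Provider]]
) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, completion) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


if __name__ == "__main__":
    def suspended(prompt: str) -> str:
        # Stand-in for a provider whose access was withdrawn.
        raise ConnectionError("access suspended")

    def backup(prompt: str) -> str:
        # Stand-in for a second provider that still answers.
        return f"echo: {prompt}"

    print(complete_with_fallback("hello", [("primary", suspended), ("backup", backup)]))
```

In a real agent each callable would wrap one vendor's client and normalize its tool-call format. The chain itself stays the same.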

DeepSeek V4-Pro

DeepSeek (open-weights, MIT license)

The best open-weights model for agentic coding. Costs a third of Opus and comes close on most coding tasks. My go-to for high-volume pipelines.

On OpenCode, DeepSeek models process billions of tokens daily with a 97% cache hit rate, so the effective cost is even lower than the listed $0.08 per million tokens. I route most of my batch coding workloads and cost-sensitive agent pipelines through V4-Pro. It falls behind Opus on long-context reliability past 500K tokens, where tool-use consistency starts to fade. For anything under 500K and anything where cost per call matters, V4-Pro is the pick.
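The effect of a cache hit rate on price is a weighted average of the cached and uncached rates. Here is a minimal sketch of that arithmetic. The prices in the demo are round placeholders chosen for illustration, not DeepSeek's actual rates; only the 97% hit rate comes from the text above.

```python
def effective_input_price(hit_rate: float, cached_price: float, uncached_price: float) -> float:
    """Blend cached and uncached per-million-token prices by cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached_price + (1.0 - hit_rate) * uncached_price


if __name__ == "__main__":
    # Placeholder prices (USD per million tokens), NOT real provider rates.
    blended = effective_input_price(hit_rate=0.97, cached_price=0.10, uncached_price=1.00)
    print(f"effective input price: ${blended:.3f} per million tokens")
```

At a high hit rate the blended price sits close to the cached rate, so the cached rate is the number to compare across providers for long agent loops.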

GPT-5.5

OpenAI

Best for Codex CLI and terminal-heavy workflows. I keep it around for creative tasks where Claude is too literal.

GPT-5.5 leads Terminal-Bench 2.1 at 82.7%, and its Codex CLI integration gives it a clear advantage for scaffolding, shell scripts, and DevOps automation. On LMArena, it scores competitively on creative writing and brainstorming tasks. But on pure agent-loop reliability past 200K tokens, it trails both Claude models significantly. I use GPT-5.5 when the task is terminal-first or style-sensitive. For structured agent work, I reach for Opus first.

Embeddings

What I use to embed documents for retrieval-augmented agents and semantic search.

	#1 voyage-3	#2 text-embedding-3-large
Provider	Voyage AI	OpenAI
Best for	English and code retrieval. RAG pipelines	Safe default. Azure/OpenAI stacks
Cost per M tokens	₹5 ($0.06)	₹11 ($0.13)

voyage-3

Voyage AI

I switched to voyage-3 after running my own retrieval benchmarks and haven't looked back. It beats OpenAI on code and technical docs at half the price.
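A retrieval benchmark of this kind can be as simple as recall@k over a labeled query set. The sketch below is one plausible shape for it, not the benchmark actually used here. It assumes one known-relevant document per query, takes embeddings as plain lists of floats so it works with any provider, and uses toy vectors in the demo.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(
    queries: Dict[str, Vector],
    docs: Dict[str, Vector],
    relevant: Dict[str, str],
    k: int,
) -> float:
    """Fraction of queries whose relevant doc appears in the top-k by cosine similarity."""
    hits = 0
    for qid, qvec in queries.items():
        ranked = sorted(docs, key=lambda doc_id: cosine(qvec, docs[doc_id]), reverse=True)
        if relevant[qid] in ranked[:k]:
            hits += 1
    return hits / len(queries)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real embeddings.
    docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    queries = {"q1": [0.9, 0.1], "q2": [0.1, 0.9]}
    relevant = {"q1": "a", "q2": "c"}
    for k in (1, 2):
        print(f"recall@{k} = {recall_at_k(queries, docs, relevant, k):.2f}")
```

To compare two embedding models, embed the same queries and documents with each one and run the same function on both sets of vectors.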

text-embedding-3-large

OpenAI

Use this when the client has an Azure OpenAI contract and Voyage is not approved. It works everywhere and is well-documented.

Vision

For agents that take screenshots, read diagrams, or need vision understanding.

Claude Opus 4.8 vision

Anthropic

The only vision model I ship with. It handles complex diagrams, dense UI screenshots, and architectural drawings reliably.

Quick reference

Not sure which to pick? Start here.

Your priority	Pick	Why
Maximum capability, no budget constraint	#1 Opus 4.8	Best tool-use reliability, long-context, and agent-loop stability I have shipped with
Best value, high volume	#2 DeepSeek V4-Pro	A third of the cost, close on benchmarks, open weights, 97% cache hit rate
Codex CLI / shell-heavy workflow	#3 GPT-5.5	Best Terminal-Bench score and Codex CLI integration
RAG / document retrieval	voyage-3	Best retrieval quality per dollar for English and code
Vision / UI parsing	Opus 4.8 vision	Only vision model I ship with
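The quick-reference table maps directly onto a lookup that a project config or router could use. This is a minimal sketch; the priority keys and the `pick_model` function are hypothetical names, and the values are the display names from the table, not API model IDs.

```python
# Priority keys are made-up labels; values mirror the quick-reference table.
PICKS = {
    "max_capability": "Opus 4.8",
    "best_value": "DeepSeek V4-Pro",
    "shell_heavy": "GPT-5.5",
    "retrieval": "voyage-3",
    "vision": "Opus 4.8 vision",
}


def pick_model(priority: str) -> str:
    """Return the pick for a priority, or raise ValueError listing the valid keys."""
    try:
        return PICKS[priority]
    except KeyError:
        valid = ", ".join(sorted(PICKS))
        raise ValueError(f"unknown priority {priority!r}; expected one of: {valid}") from None


if __name__ == "__main__":
    print(pick_model("best_value"))
```

Keeping the mapping in one place means a quarterly change to the picks is a one-line edit rather than a search through agent code.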

What I don't pick

Categories I deliberately don't rank. Picking a model I don't ship with would be guessing.

Image generation (FLUX, DALL-E, Midjourney). I don't ship image-generation agents. See the LMArena leaderboard.
Video generation (Sora, Veo, Runway). Not my use case.
Audio / speech (Whisper, ElevenLabs). Whisper dominates its category. Ranking alternatives would be theatre.
Other open-weights models (Qwen, Kimi, GLM, Xiaomi). I use DeepSeek V4-Pro daily. The rest are covered in the open-source landscape post.

Where this data comes from

OpenCode production data — real token usage, session costs, and cache ratios across hundreds of thousands of agent sessions. The cache ratio numbers and session costs are from here.
LMArena — community preference rankings for chat quality across text, code, and vision. Cross-reference for general capability.