Skip to content

Dial model, thinking, and speed per task to control cost

The race is fixed, the copy tweak shipped hours ago, and if you scroll back through the afternoon you can see the two dials moving in opposite directions across a single session — light-and-shallow for the error string, capable-and-deep for the ledger bug. That’s not two settings you stumbled into; it’s the actual skill this chapter teaches: treating model and reasoning as a per-task spend, not a session-wide default you set once and forget. This last lesson is about making that a habit, and about the one case where you deliberately pay more.

It helps to see clearly that you’ve been turning two separate knobs, and they don’t have to move together:

  • Model — which brain answers. Capable for hard reasoning, lighter and faster for mechanical work. Set with /model.
  • Effort / thinking — how hard that brain reasons before it acts. Up for ambiguity, down for the obvious. Set with /effort and the thinking toggle.

The cost map across the four corners is the whole point:

light model capable model
low effort copy tweak, rename simple change on the hard model
(cheapest) (rarely what you want)
high effort thinking hard on a concurrency bug, design fork
weak brain (wasteful) (most expensive — reserve it)

The diagonal is where most people lose money. Running the capable model at high effort on a rename is paying top dollar for a task with one obvious answer. Running deep effort on a light model is spending thinking tokens a weaker brain can’t fully cash. You want the matched corners: cheap-and-shallow for mechanical work, expensive-and-deep for the genuinely hard problem — and you want to move between them as the work changes, which on a lumpy day like this one is often.

There’s a third lever that isn’t about capability at all — it’s about latency. Some sessions are interactive in a way where waiting hurts: live debugging, rapid iteration where you’re firing a prompt every thirty seconds and the agent’s response time is the bottleneck. For those, Claude Code has fast mode, toggled with /fast:

/fast # toggle fast mode on (or off again)

Fast mode runs the capable model markedly quicker at a higher per-token cost — same quality, lower latency, more money. It’s currently a preview feature and the exact tradeoff and availability are documented in fast-mode, so check there before you lean on it. The judgment is narrow and worth stating plainly:

  • Worth it — interactive, latency-sensitive work where your waiting time is the expensive resource. Rapid iteration, live debugging with a deadline.
  • Not worth it — long autonomous runs, CI, batch jobs, anything you walk away from. There, nobody’s waiting on latency, so you’d be paying the speed premium for nothing.

A practical note from the docs worth carrying: switching into fast mode mid-conversation re-prices the whole context at the faster rate, so if you know a session wants speed, turn it on at the start rather than flipping it on partway through. And fast mode stacks with the other dials — a lower effort level on top of fast mode is maximum speed for straightforward interactive work.

Strip away the commands and this chapter is a single context-engineering move. You already learned to spend the context window on purpose in Sessions & context — watch it, fill it with signal, reclaim it before it bloats. Model and thinking are the same instinct aimed at the agent’s effort: spend the capable model and deep reasoning where the context is genuinely hard — the race no one could solve by pattern-matching — and pull both down the moment the work goes mechanical. The gap this whole site is about is the agent’s missing context; the cost lever is making sure you pay for closing that gap on the hard problems, not for over-powering the easy ones.

Do that by reflex and a lumpy day costs what it should: pennies on the copy tweak, real budget on the ledger bug, and nothing wasted on the diagonal between them.

Everything so far has been one agent, with one model and one reasoning budget, working a task in front of you. That’s the base case — and it has a ceiling. When a job is genuinely big, or when one noisy step (reading a giant log, churning through a test suite) would flood the very context window you just learned to guard, the next move isn’t a bigger gear on a single agent. It’s more agents: pushing that noisy work into an isolated context of its own, and fanning a hard problem out across parallel workers that each carry their own window and report back a summary.

That’s the whole idea of Subagents — and it’s where the course goes next.