Reach for the Agents SDK when exec runs out of room

Everything in this chapter has run through codex exec: one prompt in, one result out, one exit code the calling script reads. That shape carries an enormous amount — the monthly report, the transaction check, both reproducible in CI. For most automation it’s all you’ll ever need. This last lesson is about the cases where it isn’t, and about not reaching for heavier machinery before you have to.

The line where a shell script buckles

codex exec is stateless per call and composes through the shell: you chain runs with &&, branch on exit codes, pipe output between them. That works beautifully right up to the point where the logic between the stages gets real. Picture budgetcli’s check growing up: recategorise the new transactions, then if anything crossed a budget, draft an alert; if a test failed, open a fix and re-run only the affected suite; gate each stage on the previous one actually producing the artefact it promised. Express that in shell and you’re soon writing fragile glue — parsing prose to decide what happens next, hoping a stage finished before the next one starts, with no record of why the run took the path it did.
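To make the exit-code chaining concrete, here is a minimal Python sketch of the pattern a shell `&&` chain gives you. It is an illustration, not part of Codex: `run_chain`, `OK`, and `FAIL` are hypothetical names, and the stand-in commands call the Python interpreter so the example runs anywhere. In practice each entry would be something like `["codex", "exec", "<prompt>"]`.

```python
import subprocess
import sys
from typing import Sequence

# Stand-in commands (assumptions for illustration): each just exits with a code.
OK = [sys.executable, "-c", "raise SystemExit(0)"]
FAIL = [sys.executable, "-c", "raise SystemExit(3)"]


def run_chain(commands: Sequence[Sequence[str]]) -> int:
    """Run commands in order, stopping at the first non-zero exit code.

    Mirrors `a && b && c` in a shell: every call is independent, and the
    only signal passed from one stage to the next is the exit code.
    """
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return result.returncode  # later stages never run
    return 0


if __name__ == "__main__":
    # The third command is skipped because the second one fails.
    print(run_chain([OK, FAIL, OK]))
```

The only thing one stage learns about the previous one is an integer. Anything richer, such as which budget was crossed or which suite failed, has to be parsed back out of text, and that is where the glue turns fragile.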

The heuristic to carry: if the job fits in a shell script, keep it in codex exec. Once you’re writing more than a few stages with real conditional logic between them, and you need typed validation at the boundaries, structured handoffs, or a trace you can read after the fact, you’ve outgrown the CLI, and that’s when the SDK earns its setup cost.

What the SDK buys you beyond `exec`

The Agents SDK is a programmatic framework for composing multi-stage agent pipelines — the thing you move to when shell glue stops scaling. Instead of stringing exec calls together by hand, you describe the pipeline in code and let the framework run it. The capabilities that justify the switch:

Stages as first-class agents. Each role — a categoriser, a budget-checker, a test-runner — is its own agent with its own instructions and tools, instead of one mega-prompt trying to be all of them at once. That’s the same isolation instinct you used with subagents, made programmatic.
Typed handoffs and gates between stages. Control passes from one stage to the next as a structured, recorded transfer rather than a fragile pipe — and you can gate a handoff on a concrete signal. The durable rule here: gate on evidence, not on status. “Advance only when the categorised file exists on disk” is far more reliable than “advance when the categoriser says it’s done.”
Quality gates in code, not in the prompt. Where exec trusts the agent’s own reading of “did the tests pass,” the SDK lets you enforce it with a real function the framework calls — a programmatic check that won’t be talked out of its verdict.
Traces. The framework records what each stage did, what it handed off, and how long it took. When a multi-stage run goes wrong, a trace tells you where and why — the question a single exec exit code can’t answer.
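The ideas in that list can be sketched in plain Python without any framework. The code below is not the Agents SDK's API: `Handoff`, `TraceEntry`, `Pipeline`, and the two categoriser functions are hypothetical names invented for illustration. It shows a typed handoff, a gate that checks for the artefact on disk instead of trusting the stage's own report, and one trace entry recorded per stage.

```python
import tempfile
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, List


@dataclass(frozen=True)
class Handoff:
    """Typed record passed between stages: which stage produced which artefact."""
    stage: str
    artefact: Path


@dataclass
class TraceEntry:
    """What one stage did: its name, whether its gate passed, how long it took."""
    stage: str
    passed_gate: bool
    seconds: float


@dataclass
class Pipeline:
    trace: List[TraceEntry] = field(default_factory=list)

    def run_stage(self, name: str, work: Callable[[], Path]) -> Handoff:
        """Run a stage, then gate on evidence: the promised file must exist."""
        start = time.perf_counter()
        artefact = work()
        passed = artefact.is_file()  # evidence on disk, not the stage's say-so
        self.trace.append(TraceEntry(name, passed, time.perf_counter() - start))
        if not passed:
            raise RuntimeError(f"{name}: promised artefact missing: {artefact}")
        return Handoff(name, artefact)


def honest_categoriser(out_dir: Path) -> Path:
    """Hypothetical stage that really writes its output."""
    target = out_dir / "categorised.csv"
    target.write_text("date,amount,category\n")
    return target


def overconfident_categoriser(out_dir: Path) -> Path:
    """Hypothetical stage that reports a path it never wrote."""
    return out_dir / "categorised.csv"


if __name__ == "__main__":
    pipeline = Pipeline()
    with tempfile.TemporaryDirectory() as good, tempfile.TemporaryDirectory() as bad:
        pipeline.run_stage("categorise", lambda: honest_categoriser(Path(good)))
        try:
            pipeline.run_stage("categorise", lambda: overconfident_categoriser(Path(bad)))
        except RuntimeError as err:
            print("gate blocked the handoff:", err)
    for entry in pipeline.trace:
        print(entry.stage, entry.passed_gate)
```

The gate is an ordinary function call, so no amount of confident prose from the stage can change its verdict, and the trace records the failed attempt as well as the successful one.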

Under the hood, Codex itself does the code work inside each stage while the SDK handles the orchestration around it — the planning, gating, and sequencing between stages. Critically, the SDK does not widen your safety surface: each Codex stage inherits the sandbox and approval policy of the process that launched it, exactly as codex exec does. The SDK orchestrates; it doesn’t escalate. Everything you learned in Approvals & sandboxing still applies, one layer down.

When not to reach for it

The SDK is more setup than a shell script: a language runtime, the framework wired in, the pipeline expressed in code. Don’t pay that cost early. Two cheaper tools usually win first:

If the workflow is a linear sequence the agent just follows — do this, then this, then this, no branching — that’s a Skill, a markdown procedure, not a pipeline. You already have that pattern in your toolkit from the CSV-import skill.
If it’s a single self-contained task, however involved, it’s still a codex exec call. One prompt, one result. The report and the transaction check both live here, and most of your automation will too.

Reach for the SDK only when the job genuinely needs conditional routing between stages, structured validation at the boundaries, or a trace you can audit: the things shell and skills can’t give you. Start with one agent and split it only when a real boundary demands it; an extra stage you didn’t need just inflates the prompts, the traces, and the surface you have to reason about, with nothing to show for it.
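As a rough summary, the decision rule in this section can be written down as a small function. This is only a sketch of the heuristic, not a tool that ships with Codex: `Tool`, `Job`, and `choose_tool` are hypothetical names, and real jobs will have more nuance than three booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Tool(Enum):
    SKILL = "skill"
    EXEC = "codex exec"
    SDK = "agents sdk"


@dataclass(frozen=True)
class Job:
    """Hypothetical description of an automation job."""
    stages: int
    conditional_routing: bool = False
    boundary_validation: bool = False
    needs_trace: bool = False


def choose_tool(job: Job) -> Tool:
    """Pick the cheapest tool that covers what the job actually needs."""
    if job.conditional_routing or job.boundary_validation or job.needs_trace:
        return Tool.SDK    # needs things shell and skills can't give you
    if job.stages > 1:
        return Tool.SKILL  # linear sequence, no branching
    return Tool.EXEC       # single self-contained task


if __name__ == "__main__":
    print(choose_tool(Job(stages=1)).value)
    print(choose_tool(Job(stages=4)).value)
    print(choose_tool(Job(stages=4, conditional_routing=True)).value)
```

The order of the checks carries the point: the SDK branch fires only on a concrete need, and everything else falls through to the cheaper options.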

The week, made reflex

That closes the automation chapter, and very nearly the week. You started with an inherited budgetcli you didn’t trust, and you’ve now got it producing its own monthly report and checking its own transactions, unattended and reproducibly, with the sandbox and approval policy doing the job that used to require you watching the screen. The headless command, the CI posture, the determinism flags, and the line where you’d graduate to the SDK: those are the whole automation toolkit.

What’s left is to make all of it second nature: to stop assembling these pieces from scratch each time and turn them into the handful of habits you reach for without thinking, such as the daily profiles, the prompting reflexes, and the editing ergonomics that make Codex feel like part of how you work rather than a thing you operate. That’s the final chapter.