For the last few weeks I've been building a computer-use agent: type a prompt in plain English, the agent opens a fresh Linux desktop in the cloud, points and clicks like a person, and finishes the task. Sign in to a site. Fill a form. Read a doc. The kind of thing you'd hand to a junior contractor.
The obvious way to build this is to throw the screenshot at one big multimodal model and ask it to spit out clicks. It works. It's also slow, expensive, and surprisingly fragile.
What works better — at least so far — is splitting the agent in two:
- A small reasoning LLM that decides what to do next, in words.
- A vision-grounding model that takes those words and turns them into pixel coordinates.
Two cheap calls per step instead of one expensive one. And the steps are more reliable, not less. Here's why.
The two-model loop
Each step is a tight cycle. The agent captures a screenshot, asks the small reasoning LLM what to do next, hands that decision to the grounding model to translate into pixels, dispatches the action through xdotool, and captures a fresh screenshot to feed back into the next iteration. The same loop runs end-to-end without either model needing to know what the other one does.
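In code the cycle is short. Here's a sketch rather than the real implementation; the helper names (captureScreenshot, planNextAction, groundTarget, dispatchAction) are stand-ins for the sandbox and model clients:

```ts
// Sketch of the per-step cycle. All helpers are hypothetical stand-ins:
// captureScreenshot/dispatchAction talk to the sandbox, planNextAction is the
// small reasoning LLM, groundTarget is the vision-grounding model.
type PlannedAction = { action: string; target?: string; reasoning: string };

declare function captureScreenshot(): Promise<string>; // base64 PNG from the sandbox
declare function planNextAction(screenshot: string, history: PlannedAction[]): Promise<PlannedAction>;
declare function groundTarget(screenshot: string, phrase: string): Promise<{ x: number; y: number }>;
declare function dispatchAction(action: string, x?: number, y?: number): Promise<void>;

async function runStep(history: PlannedAction[]): Promise<PlannedAction> {
  const screenshot = await captureScreenshot();
  const plan = await planNextAction(screenshot, history);         // decides in words, never pixels
  if (plan.target) {
    const { x, y } = await groundTarget(screenshot, plan.target); // words -> normalised (x, y)
    await dispatchAction(plan.action, x, y);                      // xdotool inside the sandbox
  } else {
    await dispatchAction(plan.action);                            // type / key / scroll / wait / done
  }
  return plan;                                                    // next iteration starts with a fresh screenshot
}
```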
The reasoning LLM never sees pixel coordinates. It outputs strings:
```json
{
  "action": "click",
  "target": "the blue Sign In button at top-right of the page",
  "reasoning": "Form is loaded, login flow needs to start here"
}
```

The grounding model never sees the user's prompt or the plan tree. It sees a screenshot and a string, and returns a normalised (x, y). The sandbox multiplies by screen width/height and fires xdotool click.
That separation is the entire trick.
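For concreteness, the last hop from the grounder's normalised point to a real click is only a couple of lines. This sketch assumes a 1280×800 screen and Node's child_process; in practice you'd read the real screen size from the sandbox:

```ts
import { execFile } from "node:child_process";

// Map the grounder's normalised (x, y) in [0, 1] onto the sandbox screen and click.
// The 1280×800 default is an assumption, not the agent's fixed resolution.
function clickNormalised(x: number, y: number, width = 1280, height = 800): void {
  const px = Math.round(x * width);
  const py = Math.round(y * height);
  execFile("xdotool", ["mousemove", String(px), String(py), "click", "1"]);
}
```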
Why split it?
Three reasons, in order of how much they surprised me.
1. Each part does one thing well
A multimodal model that has to plan AND look has two jobs at once, in the same forward pass. It hallucinates clicks at coordinates that look right but aren't. It also "drifts" — if the planning is wrong, the click is wrong, and you can't tell which broke.
The grounding model only does one thing: given an image and a noun phrase, produce the pixel where that thing lives. Single-task models are smaller and more accurate at their thing than a generalist twice their size.
The reasoning LLM only has to pick the next action and describe the target in plain English. That's something even sub-billion-parameter models do well. You don't need a flagship to write "click the blue Sign In button".
2. The cost math is upside-down
You'd expect two calls to be more expensive than one. They're not.
A single big multimodal call costs roughly:
```
big_call_cost = (image_tokens + prompt_tokens + output_tokens) × big_model_price
```
Image tokens dominate. A 1280×800 screenshot ends up north of 1000 tokens on most multimodal models. Send 30 of those over a session and you're shipping 30k+ image tokens per task at flagship prices.
Two specialised calls cost:
```
small_call_cost = (image_tokens + prompt_tokens + output_tokens) × small_model_price
grounding_cost  = fixed_per_call_price   // tiny model, often a flat fee
```
The small model still pays for the image tokens, but at a much lower per-token price. The grounding model is small, dedicated, and many providers charge a flat per-call fee that's an order of magnitude cheaper than per-image flagship pricing.
Net result on the agent I'm running: roughly 5-10× cheaper per session than a single-flagship-model setup, for tasks of equivalent difficulty.
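If you want to plug in your own provider's numbers, the comparison is one small function. The prices here are parameters, not quotes:

```ts
// Per-step cost under each architecture. All prices are inputs, not claims.
interface Prices {
  bigPerMTok: number;    // flagship multimodal, USD per million tokens
  smallPerMTok: number;  // small reasoning model, USD per million tokens
  groundPerCall: number; // grounding model, flat USD per call
}

function costPerStep(imageTok: number, promptTok: number, outputTok: number, p: Prices) {
  const tokens = imageTok + promptTok + outputTok;
  return {
    singleBigCall: (tokens / 1e6) * p.bigPerMTok,
    splitCalls: (tokens / 1e6) * p.smallPerMTok + p.groundPerCall,
  };
}
```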
3. You can retry the cheap part
When the grounding model picks the wrong icon — and it sometimes does, especially on dense UIs with two visually similar buttons — the fix is to retry with a more specific noun phrase. "the Submit button" → "the green Submit button at the bottom of the order form".
If your planner and your grounder are the same model, retrying means another full multimodal call, which means another flagship-priced image. If they're separate, the retry is a sub-cent grounding-only call.
This shows up as a real reliability number. With the split architecture I can ask the grounder for candidates: give me up to 5 plausible matches. If two candidates are visually far apart on the screen, the target description was ambiguous, and the planner gets a soft "this was unclear, try again with a more specific phrase" signal on the next step. With one model, you don't get to ask that question without paying again.
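The ambiguity check itself is tiny. A sketch, assuming the grounder returns candidates as normalised points sorted by score; the 0.15 distance threshold is a number I made up to illustrate the idea:

```ts
interface Candidate { x: number; y: number; score: number }

// If the two best candidates land far apart on screen, the target phrase was
// ambiguous and the planner gets the "be more specific" nudge on the next step.
function isAmbiguous(candidates: Candidate[], minDistance = 0.15): boolean {
  if (candidates.length < 2) return false;
  const [a, b] = candidates;                     // assume sorted by score, best first
  const dist = Math.hypot(a.x - b.x, a.y - b.y); // normalised screen distance
  return dist > minDistance;
}
```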
The action set
The reasoning LLM only needs a small fixed vocabulary. The agent I'm running uses:
| Action | What it does |
|---|---|
| click / double_click / right_click | Routed through the grounder; takes a target string |
| type | Literal text typed at current focus |
| key | Single key or combo (Return, ctrl+l, etc.) |
| scroll | Direction + amount |
| navigate | Open a URL in the browser |
| wait | Pause N ms (used while pages settle) |
| done | Final summary string |
Nine actions cover ~all everyday computer tasks. The constraint is liberating — the planner can't invent novel actions, can't drag-and-drop in weird ways, can't fall into the "let me write a Python script that automates this" trap that the bigger models love.
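Written as a type, the whole vocabulary fits in one discriminated union. The field names are illustrative rather than the agent's exact schema:

```ts
// Illustrative shape of the planner's output vocabulary.
type AgentAction =
  | { action: "click" | "double_click" | "right_click"; target: string; reasoning: string }
  | { action: "type"; text: string; reasoning: string }
  | { action: "key"; keys: string; reasoning: string }          // e.g. "Return", "ctrl+l"
  | { action: "scroll"; direction: "up" | "down"; amount: number; reasoning: string }
  | { action: "navigate"; url: string; reasoning: string }
  | { action: "wait"; ms: number; reasoning: string }
  | { action: "done"; summary: string; reasoning: string };
```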
Memory: scratchpad and plan tree
A 30-step task cannot fit in any reasonable context window if every screenshot is kept verbatim, so the agent has to be opinionated about what it remembers and what it discards. The structure that works for me keeps a small core of durable state at the top of every prompt and a sliding window of recent steps at the bottom, with everything older replaced by the planner's own captions.
The scratchpad is a free-form note the planner can append to on any step. Once written, a line becomes part of every future system prompt for the rest of the run. Lines like "user confirmed checkout, total was $42" or "noted that the search returned no results for the original query" stay live across the entire session and act as the model's long-term recollection.

The plan tree is the matching mechanism for outstanding work: a list of pending, in-progress, and done sub-tasks that the planner revises in place each step. It stops the model from prematurely declaring done after the first step of a five-step plan, because the system prompt now contains an explicit "you said you'd do X, and you haven't yet" reminder.

The third channel is cross-session memory, which the user can pre-populate before any session starts ("my GitHub username is paras"). It is loaded once at session start, and the planner can write to it whenever it learns something durable that should outlive the current run.
The combined effect is that the agent remains coherent over long sessions even though only the most recent screenshot is actually in the message list. Older screenshots are replaced by short captions the planner generates, and the loss of fidelity matters far less in practice than I expected.
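A sketch of how that context can be assembled per step, under the assumption that only the newest step keeps its screenshot and everything durable rides at the top; the field names are mine, not the agent's exact schema:

```ts
// Per-step context: durable core first, captions for old steps, and only the
// most recent screenshot attached verbatim. Field names are illustrative.
interface StepRecord { caption: string; screenshotB64?: string }

interface SessionState {
  scratchpad: string[]; // durable notes the planner appended
  planTree: string;     // rendered pending / in-progress / done sub-tasks
  memories: string[];   // cross-session facts loaded at session start
  steps: StepRecord[];
}

function buildContext(s: SessionState): string[] {
  const captions = s.steps.slice(0, -1).map((st) => `- ${st.caption}`);
  const latest = s.steps[s.steps.length - 1];
  return [
    "Cross-session memory:\n" + s.memories.join("\n"),
    "Scratchpad:\n" + s.scratchpad.join("\n"),
    "Plan:\n" + s.planTree,
    "Earlier steps (captions only):\n" + captions.join("\n"),
    latest?.screenshotB64 ?? "(no screenshot yet)",
  ];
}
```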
Verification: closing the loop
Every action the planner emits is a hypothesis about what should happen next, and every step's verifier check tests whether that hypothesis held against the screenshot the action actually produced.
When verification fails, the planner sees the failure in the next step's history and can recover by reframing the target or trying a different element. When verification succeeds, the resulting record carries a much stronger signal of progress than the bare fact that the action didn't throw an exception, because the verifier confirms that the screen now reflects the intended outcome rather than merely that no error bubbled up through xdotool.
A second verification fires on done. Before declaring victory, ask the verifier: did the agent actually complete the original prompt? It's a 500ms call and it catches the silent-regression failure mode where the planner thinks it's done but the page is wrong (auth wall, error toast, half-loaded state). The done-step still terminates as done, but the verify result lands on the step record so the user can see "soft uncertain" successes flagged visually.
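A sketch of both checks. askVerifier stands in for a call to the verifier model, and the prompt wording and the strict yes/no output are illustrative rather than the exact ones the agent uses:

```ts
// Ask the verifier model a strict yes/no question about the fresh screenshot.
// askVerifier is a hypothetical stand-in for that model call.
declare function askVerifier(prompt: string, screenshotB64: string): Promise<string>;

async function verifyStep(expected: string, screenshotB64: string): Promise<boolean> {
  const answer = await askVerifier(
    `Expected outcome: ${expected}\nAnswer strictly "yes" or "no": does the screenshot show it?`,
    screenshotB64,
  );
  return answer.trim().toLowerCase().startsWith("yes");
}

async function verifyDone(originalPrompt: string, screenshotB64: string): Promise<boolean> {
  return verifyStep(`the user's task "${originalPrompt}" is fully complete`, screenshotB64);
}
```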
Update: who verifies the verifier
The initial version of the architecture used the grounding model itself as the verifier, on the reasoning that the same upstream call had already loaded the screenshot and could answer a yes-or-no question about it for marginal additional cost. That setup worked well enough on the simple pages I tested it against during development, where the verifier's question was usually about coarse-grained scene properties like "did a new dialog appear" or "did the URL bar change", and the grounding model's general visual understanding was sufficient to answer them reliably.
The architecture started losing accuracy as soon as I pointed it at denser real-world pages, particularly forms with several visually similar buttons placed in a row, search results pages where every card looks identical down to the typography, and settings panels where related toggles are clustered into nested groups. On these pages the verifier began producing false negatives at a rate that mattered, saying that a goal had not been met on screenshots where the goal had clearly been met to my eye. The asymmetric cost of a false negative was the part that surprised me: a false positive on the verifier wastes one step on a done that should have been a retry, but a false negative triggers a replan, and a replan costs a fresh planner call, a fresh grounder call, a fresh sandbox action, and a fresh verifier round, which adds up to roughly four times the cost of the symmetric error in the other direction. The user-visible part of that cost is also worse, since the agent hesitates and re-tries on a page the user can see is correct.
The fix that worked was to promote the verifier from a side-effect of the grounding model into its own logical role, running on the planner's reasoning model, while leaving the grounder as the dedicated point-to-pixel specialist that does its specific job well.
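Concretely, the decomposition becomes a small role table that the orchestration code reads. The model ids and prompt names below are placeholders:

```ts
declare const PLANNER_PROMPT: string, VERIFIER_PROMPT: string, GROUNDER_PROMPT: string;

// Roles are the unit of decomposition; planner and verifier share one model
// and differ only in prompt and output schema. Model ids are placeholders.
const ROLES = {
  planner:  { model: "small-reasoning-model", systemPrompt: PLANNER_PROMPT },
  verifier: { model: "small-reasoning-model", systemPrompt: VERIFIER_PROMPT },
  grounder: { model: "pointing-model",        systemPrompt: GROUNDER_PROMPT },
} as const;
```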
The reason a reasoning model makes a better verifier than a grounding model is that the verifier's question is fundamentally about scene understanding rather than pixel localisation, and a question like "does this screenshot show the article body with no overlay covering the text" requires a model that has been trained to reason about what is on screen rather than where to point on screen. Point-pose grounding models are excellent at the latter and undertrained on the former, while small generalist multimodal models are competent at both and considerably better than the grounder on the verification axis specifically. The economics of the swap also worked out cleanly, because the verifier is the same small reasoning model the planner already calls, only with a different prompt and a stricter output schema, so the per-call cost stays at sub-flagship pricing rather than introducing a separate flagship-priced verifier into the pipeline.
What this changes in the original story is the granularity at which the architecture decomposes. The blog post above presented a two-way split between planning and grounding, and the update is that the same reasoning model can usefully play more than one role across the loop, with the verifier role differing from the planner role only in the prompt and the output schema rather than in the underlying model. The principle generalises beyond two models: decompose the agent into the roles its work actually requires, then assign each role to whichever model wins on that role's specific axis, and let the orchestration code carry the differences in prompt and evaluation criteria. For computer-use today this means a small reasoning model handling planning and verification, a dedicated point-pose grounding model handling localisation with a refinement pass for ambiguous targets, and no flagship-priced model anywhere in the loop.
Two more grounding upgrades that landed
Once the verifier moved to the reasoning model, two improvements to the grounding pipeline became practical to ship that would have been counter-productive with the noisier verifier in place. The first is iterative-refine grounding, which addresses the case where the grounder returns several plausible candidates because the target description matched multiple visually distinct elements on the screen. The refinement step crops a roughly thirty-percent window of the screenshot centred on the first candidate, re-points inside the crop (where the target now occupies a much larger share of the input pixels), and maps the refined coordinates back to the original frame. Research projects under names like OS-Atlas and OmniParser-V2 converge on a similar crop-and-re-point loop. I do not think anyone has the optimal version of this technique yet, but the floor is materially higher than running the grounder once and accepting whatever the first call returned. An instrumented counter tracks the four outcomes I care about: refinement that shifted the target by more than five percent on at least one axis (the win case), refinement that confirmed the first guess, refinement that was skipped because the first call was unambiguous, and refinement that returned nothing usable.
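A sketch of the crop-and-re-point loop, assuming a groundTarget call that can point within an arbitrary image and a cropImage helper that reports its geometry as fractions of the original frame; the 30% window is the figure from the text, the helper names are mine:

```ts
// Hypothetical helpers: groundTarget points within any image, cropImage returns
// a crop plus its geometry expressed as fractions of the original frame.
declare function groundTarget(imageB64: string, phrase: string): Promise<{ x: number; y: number }>;
declare function cropImage(imageB64: string, cx: number, cy: number, frac: number):
  Promise<{ imageB64: string; left: number; top: number; width: number; height: number }>;

// Crop ~30% of the frame around the first candidate, re-point inside the crop
// (where the target is proportionally much larger), then map back to the frame.
async function refineGrounding(screenshotB64: string, phrase: string) {
  const first = await groundTarget(screenshotB64, phrase);
  const crop = await cropImage(screenshotB64, first.x, first.y, 0.3);
  const local = await groundTarget(crop.imageB64, phrase);
  return {
    x: crop.left + local.x * crop.width,
    y: crop.top + local.y * crop.height,
  };
}
```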
The second upgrade is a verifier-driven replan nudge that fires when the verifier says no on a click or type action. The next planner turn now receives an explicit instruction to retry the action with a different target rather than the previous soft hint that the planner was free to ignore, and the loop is bounded by a small per-session cap (three nudges in current practice) so that a genuinely broken page cannot trap the agent in an infinite agreed-on-wrong loop. The cap is the load-bearing piece of this design: without it, a stuck verifier produces a self-reinforcing failure mode where the planner accepts the verifier's no, picks a marginally different target, fails the same way, and repeats; with it, the planner gets a small number of real second chances and then falls back to a done with a soft-uncertain flag that surfaces in the user-facing UI.
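The cap-bounded nudge is a few lines of orchestration state. Three is the cap from the text; the message wording is illustrative:

```ts
// Verifier-driven replan nudge, bounded per session so a genuinely broken page
// cannot trap the agent in an endless retry loop.
const MAX_NUDGES = 3;

function onVerifyFail(session: { nudgesUsed: number }, failedTarget: string): string | null {
  if (session.nudgesUsed >= MAX_NUDGES) return null; // fall back to done + soft-uncertain flag
  session.nudgesUsed += 1;
  return `The last action on "${failedTarget}" did not produce the expected result. ` +
         `Retry it with a different, more specific target.`;
}
```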
Neither of these upgrades would have been worth shipping with the old grounder-as-verifier setup, because a noisy verifier means the cap-three replan path becomes three guaranteed-wasted retries on every borderline page rather than a real recovery mechanism, which would have made the agent slower without making it more accurate. With the reasoning-model verifier in place, both upgrades pay off immediately, and the verify-fail-replan counter has become one of the more informative signals on the dashboard for spotting upstream UI changes that the planner has not yet adapted to.
Cutting cost without changing models: prefix caching
The cost section above made the case that two cheap calls beat one expensive call, but that was the static analysis. The dynamic improvement came from realising that almost every planner call inside one session sends an identical multi-thousand-token system prompt, and that providers offering prompt caching will discount that repeated portion by ninety percent on every call after the first if the prefix is held byte-stable and the call routes to a cache-warm replica. The two preconditions are independent, and missing either one means the cache silently fails to engage even though everything looks like it should be working.
The byte-stable side is the easier of the two to get wrong. The original architecture interleaved a half-dozen mutable fields into the system prompt itself, including the history trail of older actions, the current plan tree, and the durable scratchpad notes the planner appends to per step. All three of those mutate every step, which means the system prefix changed on every call, which means the cache was never going to engage no matter how predictable the rest of the prompt was. The fix was to split the system prompt into a stable component (BASE_SYSTEM_PROMPT plus the user's cross-session memories, which only change when the user's profile updates) and a mutable component (history, plan, scratchpad) that lives in a transient user message pushed onto the messages array right before the planner call and popped immediately after, so it never enters the prefix consumed on the next step.
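A sketch of that split. BASE_SYSTEM_PROMPT is the name from the text; the function names and the exact layout of the transient message are mine:

```ts
declare const BASE_SYSTEM_PROMPT: string;

// Stable prefix: byte-identical on every call within a session.
function stableSystemPrompt(memories: string[]): string {
  return BASE_SYSTEM_PROMPT + "\n" + memories.join("\n"); // changes only when the user's profile does
}

// Mutable state (history, plan, scratchpad) rides in a user message appended
// after the stable prefix, so it never enters the cached portion of the prompt.
function buildPlannerMessages(memories: string[], history: string, plan: string, scratchpad: string) {
  return [
    { role: "system" as const, content: stableSystemPrompt(memories) },
    { role: "user" as const, content: `History:\n${history}\n\nPlan:\n${plan}\n\nNotes:\n${scratchpad}` },
  ];
}
```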
The replica-routing side is less obvious because it does not show up at all when you read the SDK code. Provider caches are local to the backend instance that processed the original prefix, and without a routing hint a series of calls with identical prefixes will scatter across the load balancer and miss the cache on every replica that has not seen the prefix yet. Both OpenAI and Anthropic accept a customer-supplied cache key (the prompt_cache_key field in the OpenAI API, similar mechanisms elsewhere) that the provider hashes for sticky routing, so passing the user's id keys their session's calls onto the same cache-warm replica for the duration of the session. The planner and verifier need different keys because their prefixes are different, and sharing one key would round-robin-evict the prefix you actually wanted to keep.
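A sketch of the routing hint on a raw Chat Completions request, assuming the provider accepts the prompt_cache_key body field named above. The key scheme (one key per user per role) is the part that matters:

```ts
// Sticky cache routing: planner and verifier use distinct keys so their
// different prefixes do not evict each other. Assumes the provider accepts
// the prompt_cache_key body field described above.
async function callPlanner(apiKey: string, userId: string, messages: unknown[]) {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "gpt-5.4-mini",                 // the small reasoning model from the trace below
      messages,
      prompt_cache_key: `planner:${userId}`, // the verifier would use `verifier:${userId}`
    }),
  });
  return res.json();
}
```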
Once both pieces were in place I instrumented the planner to write each call's inputT, cachedInputT, outputT to a JSONL trace and ran five representative tasks through the harness end-to-end, paying real billed dollars on the OpenAI side. The trace numbers are below, and they are real measurements rather than estimates:
| task | planner calls | input tokens | cached tokens | cache % | cost USD |
|---|---|---|---|---|---|
| open example.com, read heading | 2 | 7,783 | 4,608 | 59% | $0.0040 |
| open Linux Wikipedia, read title | 1 | 3,938 | 2,304 | 59% | $0.0020 |
| open HN, confirm 10 headlines | 1 | 3,966 | 2,304 | 58% | $0.0021 |
| HN → click first story → click comments | 4 | 18,370 | 10,240 | 56% | $0.0105 |
| Wikipedia Computer → scroll to History → quote | 2 | 8,900 | 5,120 | 58% | $0.0049 |
| Total across 5 sessions | 10 | 42,957 | 24,576 | 57.2% | $0.0236 |
The mean per-task cost is just under half a cent at gpt-5.4-mini's published rate of $0.075 per million cached input tokens. The 57 percent cached ratio matches the theoretical ceiling pretty closely, since the stable prefix is roughly 2,300 tokens out of every 3,950-token planner call and the first call in any session is always uncached by definition. Tasks that ran for more steps (the four-call HN navigation) saw the same hit ratio as one-call tasks, which is the most useful confirmation that the prefix is genuinely byte-stable across step boundaries rather than drifting in some subtle way that breaks the cache after step three. Anything below thirty percent in production would mean either the prefix is mutating or the routing hint is not doing its job.
There is a third lever sitting next to those two that I have only partly used so far. Providers offer extended cache retention, with OpenAI's prompt_cache_retention: "24h" flag pushing the in-memory five-to-ten-minute TTL up to about a day. For the schedule-driven user pattern (an agent that runs the same set of tasks every morning) extended retention turns prompt caching from a within-session optimisation into a within-day optimisation, which means even step one of the next session can hit cache against the previous session's prefix. I shipped the 24h flag on the planner alone and left the verifier on default retention because the verifier prefix is below the 1024-token cache eligibility floor anyway and would not benefit. The boot sequence now also estimates the prefix token count (chars divided by four for English prose, plus a 75-token safety margin above 1024) and emits a structured warning if a future trimming refactor drops it below the floor, since that kind of regression is invisible in tests and only shows up as a slow cost increase weeks later.
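The boot-time check is small. The chars-divided-by-four heuristic and the 75-token margin are the ones from the text; the function and log shapes are mine:

```ts
// Rough prefix-size check at boot: warn if a future trimming refactor ever
// drops the stable prefix below the 1024-token cache eligibility floor.
const CACHE_FLOOR_TOKENS = 1024;
const SAFETY_MARGIN_TOKENS = 75;

function checkPrefixSize(stablePrefix: string, log: (msg: string) => void): void {
  const estimatedTokens = Math.ceil(stablePrefix.length / 4); // ~4 chars/token for English prose
  if (estimatedTokens < CACHE_FLOOR_TOKENS + SAFETY_MARGIN_TOKENS) {
    log(`prefix_below_cache_floor estimated=${estimatedTokens} floor=${CACHE_FLOOR_TOKENS}`);
  }
}
```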
Reliability: retry every transient
```
Sandbox action ── one retry on flaky xdotool ──┐
Grounding call ── one retry on 5xx / network ──┤
Planner call   ── one retry on 5xx / 429 ──────┴── otherwise propagate
```
One retry is the right number for almost every transient I have seen, because zero retries gives users miserable ergonomics on flaky residential networks while three retries turns short upstream blips into thirty-second visible stalls during which the agent appears wedged. The budget is bounded per-attempt rather than per-session, and that bound is enforced by the underlying twenty-second timeout, which means there is no need for a separate circuit breaker on the happy path: the call either completes within the budget, fails fast with a typed error the planner can react to, or runs out of time and surfaces the failure as a step error.
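A sketch of the wrapper. The 20-second per-attempt budget is the one from the text; the transient check is illustrative rather than exhaustive:

```ts
// One retry, each attempt bounded by a 20s timeout; anything else propagates
// as a typed error the planner can react to.
async function withOneRetry<T>(call: (signal: AbortSignal) => Promise<T>): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    const ctrl = new AbortController();
    const timer = setTimeout(() => ctrl.abort(), 20_000);
    try {
      return await call(ctrl.signal);
    } catch (err) {
      if (attempt >= 1 || !isTransient(err)) throw err; // only one retry, only for transients
    } finally {
      clearTimeout(timer);
    }
  }
}

// Illustrative transient check: rate limits and upstream 5xx get one retry;
// timeouts surface as step errors instead of being retried.
function isTransient(err: unknown): boolean {
  const status = (err as { status?: number }).status ?? 0;
  return status === 429 || status >= 500;
}
```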
Why not just one bigger model?
The single-multimodal architecture is the comparison every reader asks about, because the two-model split looks like additional complexity until the reasons not to take it become concrete.
- The reason that ended up mattering most for me is cost scaling: image tokens dominate every multimodal call, and sending higher-resolution screenshots to the planner so that it could click better would cause the per-session budget to grow much faster than the marginal accuracy gain justifies.
- Two specialist models give you a soft-signal channel a single model cannot: the grounder's number of plausible candidates for a given target description is itself useful information, since two candidates that are visually far apart on the screen mean the description was ambiguous and the planner should be told to retry with more specificity, whereas a single multimodal model simply returns coordinates or refuses, with no analogous "by the way, this was a coin flip" channel.
- The third reason is operational rather than technical: when a better grounding model ships I can swap it in by changing one HTTP client, but with a monolithic agent every model upgrade is a full migration of prompts, evaluation harnesses, and downstream consumers.
- Finally, small reasoning models are improving at a pace that makes the bet structurally favourable: a planning role that required a flagship two years ago is now well within reach of sub-flagship reasoning models, and the trend continues to favour the side of the architecture that splits the work into specialists.
Where it falls down
The split is not free, and it is worth being honest about the costs.
- Operating two model providers means maintaining two sets of timeouts, retry policies, and cost dashboards, and every architecture that adds a second upstream also adds a second class of outage to monitor.
- Right-tail latency is materially worse than the single-call alternative, because two sequential roundtrips mean the step's wall-clock time is dominated by whichever of the two upstreams is slower at the ninety-ninth percentile; the average call is faster than a flagship multimodal call but the worst-case call is reliably slower.
- The entire architecture's reliability depends on the planner writing specific noun phrases, since "click the button" is hopeless and only descriptions like "the green Submit button at the bottom of the order form" reliably localise, which makes prompt engineering a more load-bearing concern here than it would be in a single-model setup.
Even with those costs accounted for, the trade has been worth taking on every task I have measured.
Closing
The pattern seems likely to generalise outside computer-use to any agent task that decomposes cleanly into a "what should we do" question and a separate "where exactly does that happen" perception task, because perception tasks with clear inputs and outputs are exactly the workloads where a small dedicated specialist beats a generalist running the same model weights on a wider job description. The same shape applies to a planner that talks to a separate retriever, a separate code executor, or a separate graph-walker, and the unifying principle in every case is that single-task models at the leaves are smaller, faster, and more accurate at their specific task than a generalist trying to play every role in one forward pass.
The flagship-everything era of agent design will end the day enough sub-tasks have a good open small model that beats the generalist on the specific axis the sub-task cares about, and we are most of the way there already for the perception side of computer-use, with the planning and verification sides catching up next.