From Anthropic Computer Use to a Focused Browser Test CLI

Name: BrowserBash
Author: The Testing Academy

If you built a web QA harness on top of Anthropic Computer Use, you already know the feeling. You wanted a tool that logs in, fills a form, and confirms an order went through, and instead you wired up a full desktop agent that stares at screenshots and guesses pixel coordinates. This article is a practical guide to an Anthropic Computer Use alternative for browser testing: why a general computer-use agent is more than web QA needs, and how to move those same prompts into BrowserBash, a browser-scoped CLI with deterministic exit codes and recorded traces. I work on BrowserBash, so treat the product sections as the vendor talking. The analysis of where Computer Use is the right tool is honest, though, including the cases where you should keep it.

The short version: Computer Use is a real capability and a fair primitive. It can drive anything with a screen — a native installer, a spreadsheet app, a legacy enterprise client, a browser. That breadth is the whole point for desktop automation. For testing a web app specifically, that same breadth becomes overhead you pay for on every turn, and it leaves you building the verdict logic, the retry harness, and the CI plumbing yourself.

What Computer Use actually does, and why it is too much for web QA

Computer Use is a model capability exposed through the Anthropic API. You run an agent loop — Anthropic ships a reference implementation, usually in a Docker container with a virtual display. The loop captures a screenshot, sends the image plus your instruction to a Claude model, and the model replies with an action: click(x, y), type("..."), key("Return"), or another screenshot. Repeat until the task finishes or you cut it off. It is vision-first and coordinate-based. The model reasons about pixels on a screen, not about the structure of the page beneath them.
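To make the shape of that loop concrete, here is a minimal sketch in shell. It is not Anthropic's reference implementation: `capture_screenshot` and `next_action` are hypothetical stubs standing in for the virtual display and the model call, and the stub model simply "finishes" on turn 3. What it shows is the structure the paragraph describes: one image per turn, and a turn budget that cuts the run off.

```shell
# Stub: a real loop would capture the virtual display here (assumption).
capture_screenshot() {
  echo "screenshot-$1.png"
}

# Stub: stands in for the model call (assumption). It reports "done" on turn 3.
next_action() {
  if [ "$1" -ge 3 ]; then
    echo "done"
  else
    echo "click(100, 200)"
  fi
}

# Screenshot in, action out, repeat until done or the turn budget runs out.
run_loop() {
  max_turns=$1
  turn=1
  while [ "$turn" -le "$max_turns" ]; do
    shot=$(capture_screenshot "$turn")
    action=$(next_action "$turn" "$shot")
    if [ "$action" = "done" ]; then
      echo "finished on turn $turn"
      return 0
    fi
    echo "turn $turn: sent $shot, got $action"
    turn=$((turn + 1))
  done
  echo "cut off after $max_turns turns"
  return 1
}
```

Calling `run_loop 5` finishes on the third turn; `run_loop 2` hits the budget first and returns non-zero. Note that nothing in the loop itself decides whether the task succeeded, only whether it stopped.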

For desktop work, that design is exactly right. There is no DOM behind a native installer dialog, so a vision agent is the only general way to drive it. But web QA is a narrower problem with a richer substrate. Every page you test has a readable DOM, an accessibility tree, and stable text labels. Using a pixel-coordinate agent to click a button whose text is right there in the markup is solving an easy problem with a hard tool.

Here is where the mismatch shows up when you are doing browser testing:

- Every turn ships an image. A real checkout flow is many steps. Screenshot-in, reason-out, screenshot-in again is neither cheap nor fast, and there is no free local tier in the reference loop. Your token bill scales with the length of the flow.
- Pixel coordinates are brittle. After a layout shift, a different viewport, or a moved CTA, the model has to re-find everything visually. A DOM-aware tool sidesteps a whole category of this flakiness.
- It is a primitive, not a test runner. Out of the box there is no pass/fail verdict, no exit code, no committable test file, no session recording. You build the harness yourself.

So the useful question is not "what replaces Computer Use" but "which part of it did your web tests actually use." If you needed full cross-application desktop control, keep a computer-use-class agent. If all you ever did was point it at a browser to exercise a web app, you want a tool scoped to that job. That is the gap a browser-scoped alternative to Anthropic Computer Use fills.

What "browser-scoped" buys you

Scoping the agent to a browser is not a downgrade. It is a different set of trade-offs, and for QA the trade is in your favor.

A browser-scoped tool reads the page. It can see the DOM, the roles and labels in the accessibility tree, and the actual text of links and buttons. That means it acts on "click the Checkout button" by finding the button, not by computing where the pixels for that button probably are. When a designer moves the button 40 pixels down or your CI runner has a different default font size, a coordinate-based agent can drift; a DOM-aware one usually does not notice.

It also means a smaller, better-defined action space. There is no "switch to another application" branch to reason about, because the world is one browser tab. A constrained action space is easier for a model to navigate reliably, which is part of why a browser-scoped agent can stay on task with smaller models than a full desktop agent needs.

And critically, a QA-focused tool gives you the things Computer Use leaves as homework: a clear verdict, a machine-readable event stream, a recorded video, and — with the right engine — a full trace you can step through afterward. Those are not nice-to-haves in a test pipeline. They are the difference between "the agent did something" and "the test passed, here is the proof, here is exactly what it clicked."

Meet BrowserBash: the same agent idea, scoped and packaged

BrowserBash is a free, open-source (Apache-2.0) natural-language browser automation CLI from The Testing Academy. You install it with one command, write a plain-English objective, and an AI agent drives a real Chrome or Chromium browser step by step — no selectors, no page objects — then returns a verdict plus structured results.

npm install -g browserbash-cli
browserbash run "Go to the demo store, log in as standard_user, add the first product to the cart, complete checkout, and verify the page shows 'Thank you for your order!'"

The model story is the part that matters most if you are migrating off a metered API. BrowserBash is Ollama-first: by default it uses free local models, needs no API keys, and nothing leaves your machine. It auto-resolves a local Ollama install first, then ANTHROPIC_API_KEY, then OPENROUTER_API_KEY. So you can run it entirely local for a guaranteed $0 model bill, bring your own Anthropic key if you want Claude driving the browser, or point it at OpenRouter — including genuinely free hosted models such as openai/gpt-oss-120b:free — for harder flows.
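The resolution order is easy to misread, so here is a sketch of it as a shell function. This mirrors only the order stated above and is not BrowserBash's actual code: `resolve_backend` is a hypothetical name, and how the CLI really detects a local Ollama install is not covered here, so the function takes that as an explicit "yes" or "no" argument.

```shell
# Hypothetical helper: mirrors the documented order, not BrowserBash internals.
# $1 is "yes" if a local Ollama install was detected (detection is an assumption).
resolve_backend() {
  if [ "$1" = "yes" ]; then
    echo "ollama"
  elif [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    echo "anthropic"
  elif [ -n "${OPENROUTER_API_KEY:-}" ]; then
    echo "openrouter"
  else
    echo "none"
  fi
}
```

The practical consequence: with this order, an Anthropic key only takes effect when no local Ollama install is picked up first.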

One honest caveat before you set expectations too high: very small local models (roughly 8B parameters and under) can get flaky on long, multi-step objectives. They lose the thread on a ten-step checkout. The sweet spot is a mid-size local model in the Qwen3 or Llama 3.3 70B class, or a capable hosted model when the flow is genuinely hard. If you were already paying Anthropic for Computer Use, keeping a Claude key wired into BrowserBash gives you the same model quality with far less per-turn overhead, because the agent acts on the DOM instead of shipping a screenshot every turn.

No account is needed to run anything. There is an optional, strictly opt-in free cloud dashboard for run history and video replay, and a fully local dashboard if you would rather keep everything on your machine.

How Computer Use compares to a browser-scoped CLI

Here is the at-a-glance version, kept to what is publicly known as of 2026; where Anthropic has not published a detail, I say so rather than invent it.

| Dimension | Anthropic Computer Use | BrowserBash (browser-scoped CLI) |
| --- | --- | --- |
| Scope | Whole desktop, any app with a screen | One browser tab, web apps only |
| How it perceives | Screenshots (pixel/vision-first) | DOM-aware (engine-dependent) + optional screenshots |
| Model | Claude via Anthropic API | Ollama-first local; or Anthropic / OpenRouter |
| Free local path | Not in the reference loop | Yes: local models, $0 bill |
| Built-in verdict | No; you build it | Yes; pass/fail returned |
| CI exit codes | Not provided by the primitive | 0 passed, 1 failed, 2 error, 3 timeout |
| Machine-readable output | You design the schema | `--agent` emits NDJSON, one event per line |
| Committable test files | Not built in | `*_test.md` markdown tests with `@import` + `{{variables}}` |
| Recording | You add it | `--record` captures screenshot + `.webm`; builtin engine adds a Playwright trace |
| Best fit | Cross-app desktop automation | Web QA, smoke tests, CI gates |

Read that as a map of trade-offs, not a takedown. Computer Use wins outright the moment your task leaves the browser. If you need to drive a native app or coordinate across applications, none of this replaces it. The table favors BrowserBash only inside the box marked "web QA" — which happens to be where most teams were using Computer Use anyway.

Migrating your prompts: a step-by-step move

The good news about moving from a natural-language desktop agent to a natural-language browser agent is that your prompts mostly survive. You wrote objectives in English. You keep writing objectives in English. What changes is that you can drop the visual scaffolding you needed to keep a pixel agent on track, and you gain real test ergonomics.

Step 1: Strip the coordinate and screenshot hints

Computer Use prompts often accumulate visual crutches: "the button is in the top-right," "scroll down until you see the form," "click the blue button near the logo." A DOM-aware agent does not need spatial directions. Rewrite to intent and name the target.

Before (tuned for a pixel agent):

Take a screenshot. Find the blue "Sign in" button in the upper right. Click it. Take a screenshot to confirm the login form appeared.

After (browser-scoped objective):

Open the app, click "Sign in," and confirm the login form is visible.

Shorter, more robust, and it survives a redesign that moves the button.
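If you have a pile of old prompts, a rough lint can flag the ones that still carry visual scaffolding. The sketch below is a heuristic of my own, not a BrowserBash feature: `has_visual_hints` is a hypothetical helper and the phrase list is an assumption you should tune to your own prompts.

```shell
# Hypothetical lint: returns 0 (true) if the objective contains pixel-era phrasing.
# The phrase list is an assumption, not an exhaustive or official set.
has_visual_hints() {
  printf '%s\n' "$1" | grep -qiE \
    'screenshot|(top|upper)[- ](right|left)|scroll (down|up) until|(blue|red|green) button|near the'
}
```

Run against the two prompts above, the "before" version is flagged (it mentions a screenshot and "upper right") and the "after" version is not.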

Step 2: Make the success condition explicit

Computer Use leaves the verdict to you, so your prompts probably ended vaguely ("then verify the order"). BrowserBash returns a real pass/fail, so spell out the exact thing the agent should check. Concrete success criteria are what let the tool give you a trustworthy exit code.

browserbash run "Log in, add the first item to the cart, complete checkout, and verify the confirmation page displays the text 'Thank you for your order!'"

That trailing assertion is the contract: the agent passes only if that text actually renders.

Step 3: Turn one-off prompts into committable tests

This is the upgrade Computer Use never gave you. BrowserBash supports committable *_test.md files where each list item is a step. You get @import composition for shared setup, {{variables}} templating, and secret-marked variables that are masked as ***** in every log line. After each run it writes a human-readable Result.md.

A checkout_test.md might look like this:

# Checkout smoke test

- Go to {{baseUrl}}
- Log in with username {{user}} and password {{password!secret}}
- Add the first product to the cart
- Proceed to checkout and complete the order
- Verify the page shows "Thank you for your order!"

Run it:

browserbash testmd run ./checkout_test.md

The {{password!secret}} marker means the password never shows up in logs. That is a meaningful difference from piping credentials through an ad-hoc desktop agent script where you have to remember to redact things yourself. If you want the full reference on test structure and variables, the BrowserBash learn docs walk through composition and templating in depth.
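For a sense of what you would otherwise maintain by hand, here is a minimal sketch of log redaction in shell. It illustrates the idea of masking only; it is not how BrowserBash implements it. `mask_secret` is a hypothetical helper, and `hunter2` in the usage line is a made-up sample value.

```shell
# Hypothetical helper: replace every literal occurrence of a secret with *****.
# $1 is the secret value; log lines are read from stdin.
mask_secret() {
  # Pass the secret through the environment so awk treats it as a literal string.
  SECRET_VALUE="$1" awk '
    BEGIN { s = ENVIRON["SECRET_VALUE"] }
    s == "" { print; next }   # nothing to mask
    {
      line = $0
      out = ""
      # Walk the line, copying text before each match and appending the mask.
      while ((i = index(line, s)) > 0) {
        out = out substr(line, 1, i - 1) "*****"
        line = substr(line, i + length(s))
      }
      print out line
    }'
}
```

For example, `printf 'typed password=hunter2\n' | mask_secret hunter2` prints `typed password=*****`. The catch with a hand-rolled version is that every log path has to remember to go through it.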

Step 4: Wire it into CI with exit codes instead of prose

Computer Use returns model text. Your CI does not read prose; it reads exit codes. This is where the migration pays off most. Run BrowserBash in agent mode and it emits NDJSON — one JSON event per line on stdout — and exits with a code your pipeline already understands: 0 passed, 1 failed, 2 error, 3 timeout.

browserbash run "Verify the signup page loads and the email field rejects 'not-an-email'" \
  --agent --headless

No prose parsing, no regex against a model's freeform answer. Your job step either goes green or it does not, and the NDJSON stream gives a coding agent or a log aggregator a clean event-by-event record of what happened. If you are building this into a real pipeline, the features overview lays out the agent-mode contract and the provider flags side by side.
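A CI step built on those codes can be very small. The sketch below maps the four documented exit codes to labels; `verdict_for_exit_code` is a hypothetical helper name, and the commented-out step assumes the CLI is installed on the runner and that `events.ndjson` is simply a file name chosen here.

```shell
# Hypothetical helper: translate the documented exit codes into a label.
verdict_for_exit_code() {
  case "$1" in
    0) echo "passed" ;;
    1) echo "failed" ;;
    2) echo "error" ;;
    3) echo "timeout" ;;
    *) echo "unknown" ;;   # anything outside the documented set
  esac
}

# Example CI step (commented out so this sketch runs without the CLI installed):
# browserbash run "Verify the signup page loads and the email field rejects 'not-an-email'" \
#   --agent --headless > events.ndjson
# code=$?
# echo "browserbash verdict: $(verdict_for_exit_code "$code")"
# exit "$code"
```

Passing the original code through with `exit "$code"` keeps the distinction between a failed test and an errored or timed-out run visible to the pipeline, which matters if you treat those cases differently.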

Recorded traces: the part QA teams miss most

When a Computer Use run fails, you are often left squinting at a sequence of screenshots trying to reconstruct what the agent was thinking. BrowserBash treats evidence as a first-class output. Add --record to any run, on any engine, and it captures a screenshot plus a full .webm session video via ffmpeg. That alone changes triage: a failed nightly run leaves you a video of exactly what the agent saw and did, not a vague "step 7 failed" message.

browserbash run "Log in and verify the dashboard greets the user by name" --record

The builtin engine goes one further. BrowserBash ships two engines: stagehand (the default, MIT-licensed, from Browserbase) and builtin, an in-repo Anthropic tool-use loop. With recording on, the builtin engine additionally captures a Playwright trace you can open in the trace viewer: the timeline, DOM snapshots at each step, network activity, and the action log — the same forensic view Playwright users rely on, produced by an agent you drove in plain English.

This is the closest thing to "Computer Use, but for QA": the builtin engine is itself an Anthropic tool-use loop, so the migration is less a rewrite than a re-scoping. You keep the Claude-driven agent behavior, lose the desktop generality you were not using, and gain a deterministic verdict and a steppable trace. For teams that adopted Computer Use specifically because they trusted Claude to reason through a flow, the builtin engine is the natural landing spot.

Choosing where the browser runs: providers

Computer Use runs the screen wherever you stand up its container. BrowserBash separates where the browser runs from which model drives it, and you switch the location with a single --provider flag: local (the default, your own Chrome), cdp (any DevTools endpoint), and the cloud grids browserbase, lambdatest, and browserstack.

That separation is handy during migration. You can develop a test locally against your own Chrome for a $0 model bill, then run the exact same objective against a cloud grid for cross-browser coverage without rewriting a thing.

browserbash testmd run ./checkout_test.md --provider lambdatest --record

Same test file, same English steps, different machine running the browser. You are not maintaining two suites — one for "works on my machine" and one for the grid — the way you might if you had a bespoke Computer Use container plus a separate Selenium grid setup.
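One way to fan a single test file out across providers is a small loop. The sketch below only prints the commands (a dry run), using the `testmd run`, `--provider`, and `--record` forms shown above; `provider_commands` is a hypothetical helper, and any credentials the cloud grids need are assumed to be configured already, which this article does not cover.

```shell
# Hypothetical helper: print one browserbash command per provider (dry run).
# $1 is the test file; remaining arguments are provider names.
provider_commands() {
  test_file=$1
  shift
  for provider in "$@"; do
    echo "browserbash testmd run $test_file --provider $provider --record"
  done
}
```

`provider_commands ./checkout_test.md local lambdatest browserstack` prints three commands. Printing first makes it easy to review the matrix before piping the output to `sh` to actually run it.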

When to keep Computer Use (and when to switch)

An honest comparison has to mark the cases where the competitor wins. Computer Use is genuinely the better choice for a real set of jobs.

Keep Computer Use when:

- Your automation leaves the browser. Native apps, desktop installers, OS dialogs, file pickers that escape the page, anything cross-application: that is computer-use territory, and a browser-scoped CLI cannot help.
- You are doing exploratory desktop agent work, not testing. If the goal is "do this fuzzy task on my computer," the generality is the feature, not the cost.
- You need the agent to operate software that has no web interface at all.

Switch to a browser-scoped CLI when:

- Your target is a web app. If the thing under test renders in a browser, the DOM is right there and a pixel agent is overkill.
- You want a deterministic CI gate. Exit codes 0/1/2/3 and NDJSON beat parsing model prose every time.
- You care about cost. Local-first models on the right hardware mean a $0 model bill for routine smoke tests, with a hosted model held in reserve for the hard flows.
- You want committable tests and recorded evidence. Markdown test files, masked secrets, .webm video, and a Playwright trace are things you would otherwise build by hand.

The decision is rarely "rip out Computer Use entirely." More often it is "move the web tests, which are most of the suite, to a browser-scoped tool, and keep Computer Use for the genuine desktop edge cases." That split gives you the right tool for each job instead of one expensive hammer for both.

A note on model quality during the move

If you are nervous about reliability after relying on a frontier model, calibrate your model choice to the flow. Short, well-bounded objectives run fine on small local models. Long, branchy flows with conditional logic deserve a mid-size local model or a hosted one. The honest failure mode is a tiny local model on a long objective; avoid that pairing and the experience holds up. Because BrowserBash auto-resolves to your Anthropic key when one is present and no local Ollama install takes precedence, you can keep the same Claude model you used with Computer Use and just enjoy the lower per-step overhead.

A realistic migration checklist

Pulling it together, here is the order I would actually do it in if I were moving a web QA suite off Computer Use:

1. Inventory your prompts. Separate the browser-only objectives (the majority) from the rare cross-app ones. Only the browser ones migrate.
2. De-pixelate. Rewrite each browser objective to intent and named targets. Delete the "scroll until you see" and "blue button" hints.
3. Add explicit assertions. Every test gets a concrete success condition the agent can verify, so the verdict means something.
4. Convert to *_test.md files. Move recurring flows into committable markdown tests with {{variables}} and !secret masking, using @import for shared login steps.
5. Pick a model per flow. Local for simple, mid-size local or hosted for hard. Wire your existing Anthropic key for Claude continuity.
6. Run in --agent mode in CI. Replace prose parsing with exit codes. Green or red, no interpretation.
7. Turn on --record. Use the builtin engine for the flows where you want a full Playwright trace to debug failures.
8. Add a grid provider for coverage. Point the same tests at lambdatest, browserstack, or browserbase for cross-browser runs.

You can compare the cost side of this against your current Anthropic spend on the pricing page, and there is a worked case study if you want to see the pattern end to end before committing.

FAQ

Is BrowserBash a drop-in replacement for Anthropic Computer Use?

Not a drop-in for everything, and it does not claim to be. Computer Use drives the whole desktop; BrowserBash is scoped to a browser. For web QA — the most common reason people reach for Computer Use — BrowserBash is a focused, cheaper alternative with built-in verdicts and recordings. For native-app or cross-application automation, keep Computer Use, because a browser-scoped CLI genuinely cannot do that job.

Can I keep using Claude to drive the browser after migrating?

Yes. BrowserBash auto-resolves a local Ollama install first, then your ANTHROPIC_API_KEY, then OpenRouter. When your Anthropic key is the backend that resolves (per that order, when no local Ollama install is found), Claude drives the browser through the builtin Anthropic tool-use engine. You keep the model quality you had with Computer Use while shedding the per-turn screenshot overhead, since the agent acts on the DOM instead of pixels.

Do I need a paid API key or an account to run BrowserBash?

No on both counts. BrowserBash is Ollama-first and defaults to free local models, so you can run it with no API keys and nothing leaving your machine for a $0 model bill. No account is required to run anything; the cloud dashboard with run history and video replay is strictly opt-in via browserbash connect and --upload, and there is a fully local dashboard if you prefer to keep everything offline.

How does BrowserBash fit into a CI pipeline?

Run it with --agent and it emits NDJSON — one JSON event per line — and exits with a deterministic code: 0 passed, 1 failed, 2 error, 3 timeout. Your pipeline reads the exit code instead of parsing a model's prose answer, which is what makes it a reliable gate. Add --record and you also get a .webm video, plus a Playwright trace on the builtin engine, attached as evidence for any failed run.
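Because the stream is one JSON event per line, even a pipeline without a JSON parser can sanity-check it. The sketch below is schema-agnostic on purpose: the event field names are not specified in this article, so it only counts lines that look like JSON objects. `count_ndjson_events` is a hypothetical helper, and the brace check is a shape check, not real JSON validation.

```shell
# Hypothetical helper: count NDJSON events on stdin without assuming a schema.
# Exits non-zero if any non-blank line does not look like a JSON object.
count_ndjson_events() {
  events=0
  bad=0
  # The "|| [ -n ... ]" clause picks up a final line with no trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      "") ;;                                  # skip blank lines
      "{"*"}") events=$((events + 1)) ;;      # starts with { and ends with }
      *) bad=$((bad + 1)) ;;
    esac
  done
  echo "events=$events malformed=$bad"
  [ "$bad" -eq 0 ]
}
```

In a pipeline you would feed it the captured stdout of an `--agent` run. For anything beyond counting, a real JSON tool is the better choice once you know the event schema.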

Migrating off a desktop agent for web testing is mostly a re-scoping exercise, not a rewrite. Your English objectives carry over, and you gain exit codes, masked secrets, and recorded traces in the bargain. Install it with npm install -g browserbash-cli and point it at one of your existing flows. An account is optional, but if you want run history and video replay you can sign up free when you are ready.