An AI browser testing CLI lets you describe what a test should do in plain English, then hands the typing and clicking to an AI agent that drives a real Chrome browser and reports back a pass/fail verdict with structured results. No selectors, no page objects, no waiting code. You write the intent; the agent figures out the steps. This guide walks the whole pipeline end to end — how the objective becomes browser actions, which engine and model interpret it, where the browser actually runs, how results come back as machine-readable NDJSON with exit codes, and how the same command slots into CI and into AI coding agents. Every command here is real and runnable, and the tool doing the work — BrowserBash — is free and open source under Apache-2.0.
If you have ever written a Selenium or Playwright test, you know the unglamorous truth: most of the effort is not describing the behavior, it is wiring the locators, tuning the waits, and patching everything when the frontend shifts a class name. An AI browser testing CLI moves that work to the machine. The point of this article is not to sell you on magic — it is to show you exactly what happens at each stage so you can decide where the approach fits and run it yourself in five minutes.
What "AI browser testing from the command line" actually means
Strip away the buzzwords and the model is simple. There are four moving parts, and understanding them is the whole game:
- An objective — a sentence (or a list of sentences) describing what to do and what to verify, written the way you would brief a colleague.
- An engine — the loop that turns that objective into concrete browser actions: read the page, decide the next action, do it, repeat until the goal is met or a step fails.
- An LLM backend — the model that does the reasoning inside the engine. This can run entirely on your own machine.
- A provider — where the browser physically runs: your local Chrome, a cloud grid, or any remote DevTools endpoint.
The CLI's job is to bolt these together behind one command and then return a result that a human can read and a machine can act on. That dual audience — human-legible and machine-parseable — is the through-line of everything below.
Here is the smallest possible end-to-end example. Install the CLI and run a single objective against a live demo site:
npm install -g browserbash-cli
browserbash run "Open https://www.saucedemo.com, log in as standard_user with password secret_sauce, add the 'Sauce Labs Backpack' to the cart, open the cart, and verify the backpack is listed" \
--headless
That command runs as printed — the demo credentials are published on the login page itself. The verify clause is the assertion. If the backpack is not in the cart when the agent looks, the run fails with a non-zero exit code. There is no locator to write, no data-testid to chase, and no explicit wait to tune. The agent re-reads the page at each step and acts the way a person scanning the screen would.
Stage one: the objective becomes a plan
When you pass an objective, the engine does not blindly execute a fixed script. It enters a loop. On each turn it captures the current state of the page — an accessibility-oriented snapshot of what is visible and interactive — feeds that to the model alongside the goal, and asks for the next action. The model replies with something concrete: navigate to a URL, click a referenced element, type into a field, wait for a condition, extract a value, or declare the objective done.
This is why the approach tolerates UI churn. A traditional script says "click the element matching #add-to-cart-sauce-labs-backpack." If that id changes, the script breaks even though the product works perfectly. The agent instead reasons "find the add-to-cart control for the Sauce Labs Backpack and click it," resolving the target against whatever the page looks like right now. A renamed class or a restructured DOM is usually a non-event.
The honest tradeoff lives here too. Because the model plans at run time, two runs can take slightly different paths to the same destination. The approach is goal-deterministic, not path-deterministic. You constrain it with explicit verify steps that act as hard assertions, a step cap, and a timeout — but if you need bit-identical execution traces for a compliance audit, a code-first framework is the better fit. More on that in the comparison section.
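To make the loop and its bounds concrete, here is a minimal Python sketch of an observe, decide, act cycle with a step cap and a timeout. It is an illustration only, not BrowserBash's engine: the `Action` type, the `drive` function, and the three callables are hypothetical names, and treating an exhausted step cap as a failure is an assumption of this sketch.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Action:
    kind: str          # hypothetical kinds: "click", "type", "done", "fail"
    target: str = ""   # element reference taken from the snapshot


Snapshot = Dict[str, str]


def drive(objective: str,
          observe: Callable[[], Snapshot],
          decide: Callable[[str, Snapshot], Action],
          act: Callable[[Action], None],
          max_steps: int = 20,
          timeout_s: float = 180.0) -> str:
    """Observe, decide, act until the goal is met, a step fails, or a bound is hit."""
    deadline = time.monotonic() + timeout_s
    for _ in range(max_steps):
        if time.monotonic() > deadline:
            return "timeout"
        snapshot = observe()                  # re-read the page on every turn
        action = decide(objective, snapshot)  # the model picks the next action
        if action.kind == "done":
            return "passed"
        if action.kind == "fail":
            return "failed"
        act(action)                           # perform it, then loop again
    return "failed"                           # assumption: step cap exhausted counts as failed
```

Because `decide` sees a fresh snapshot each turn, nothing in the loop refers to a selector. The path can differ between runs while the bounds and the final status stay fixed.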
Stage two: the engine and the model behind it
A capable AI browser testing CLI separates "the loop that drives the browser" from "the model that thinks," so you can swap either independently. BrowserBash ships two engines:
- stagehand (the default) is the MIT-licensed, open-source engine from Browserbase. It is built around resilient, self-healing automation with act/extract/observe primitives, and it adapts well to pages that change between runs.
- builtin is an in-repo Anthropic tool-use loop driving Playwright underneath. It is used automatically for cloud grids the default engine cannot attach to, and it has a bonus: when you record, it captures a full Playwright trace in addition to the screenshot and video.
You usually do not think about engines at all — the default handles local and DevTools-endpoint runs, and the CLI auto-switches to the builtin engine when a grid requires it. You can force one explicitly with --engine builtin if you want the trace artifact.
The model is the more interesting choice, because it determines both capability and cost. BrowserBash is Ollama-first: it auto-detects a local Ollama install before anything else, which means free, local inference with no API keys and nothing leaving your machine. The resolution order is Ollama, then Anthropic, then OpenRouter, so the default experience costs nothing. The fully free stack is two lines:
ollama pull qwen3 # any tool-capable local model works
browserbash run "Open https://example.com and store the page heading as 'h1'"
If you want more horsepower, you have options without changing how you write tests. OpenRouter exposes hundreds of models behind one key — including genuinely free ones such as openai/gpt-oss-120b:free — and you can bring your own Anthropic key when you want a frontier model. One practical note that will save you debugging time: very small local models (roughly 8B parameters and under) get unreliable on long multi-step objectives. A model in the Qwen3 or Llama 3.3 70B class behaves far better for real flows. This is a genuine operational consideration, not a footnote.
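The resolution order described above (Ollama, then Anthropic, then OpenRouter) can be sketched as a first-match lookup. This is a guess at the shape of the logic, not the CLI's code: the function name is hypothetical, `ANTHROPIC_API_KEY` is an assumed variable name (only `OPENROUTER_API_KEY` appears in this guide), and checking for an `ollama` binary on PATH stands in for whatever detection the tool really performs.

```python
import os
import shutil
from typing import Mapping, Optional


def resolve_backend(env: Mapping[str, str], ollama_available: bool) -> Optional[str]:
    """Return the first usable backend in the order Ollama, Anthropic, OpenRouter."""
    if ollama_available:
        return "ollama"        # free, local, no API key
    if env.get("ANTHROPIC_API_KEY"):   # assumed variable name
        return "anthropic"
    if env.get("OPENROUTER_API_KEY"):
        return "openrouter"
    return None                # nothing configured


if __name__ == "__main__":
    # Assumption: a binary on PATH is a reasonable proxy for "Ollama is installed".
    print(resolve_backend(os.environ, shutil.which("ollama") is not None))
```

Putting the local option first is what makes the default experience free: a paid key is only consulted when no local model is available.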
Stage three: where the browser runs
The same objective can drive a browser in several places, and switching between them is a single flag — the test prose never changes. This is one of the quiet superpowers of a CLI-shaped tool: the interface stays identical whether the browser is on your laptop or in a data center.
| Provider | Where the browser runs | Switch with |
|---|---|---|
| local (default) | Chrome/Chromium on your machine | nothing (it is the default) |
| cdp | Any Chrome DevTools Protocol endpoint (your grid, a Docker container, a Playwright MCP-managed browser) | --cdp-endpoint ws://... |
| browserbase | Browserbase cloud browsers | --provider browserbase |
| lambdatest | LambdaTest cloud grid | --provider lambdatest |
| browserstack | BrowserStack Automate grid | --provider browserstack |
So a test you authored against your local Chrome runs unchanged on a cloud grid:
browserbash run "Open https://www.saucedemo.com, log in as standard_user with password secret_sauce, and verify the inventory page is shown" \
--provider lambdatest --headless
Cloud providers need their own credentials (set once via environment variables or browserbash login), and the cloud-grid providers use the builtin engine, which speaks the Anthropic API — so pair them with an Anthropic key or an Anthropic-compatible gateway. Locally, none of that applies: it is your Chrome and your local model, free.
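If you launch runs from a script, the "one flag changes, the prose does not" idea can be captured in a small helper that builds the argument list. The sketch below uses only the flags shown in this guide; the helper name and its parameters are hypothetical, and it does not handle credentials.

```python
from typing import List, Optional

CLOUD_PROVIDERS = {"browserbase", "lambdatest", "browserstack"}


def build_command(objective: str,
                  provider: str = "local",
                  cdp_endpoint: Optional[str] = None,
                  headless: bool = True) -> List[str]:
    """Build the argv for `browserbash run`; only the provider flags vary."""
    argv = ["browserbash", "run", objective]
    if provider == "cdp":
        if not cdp_endpoint:
            raise ValueError("the cdp provider needs a DevTools endpoint URL")
        argv += ["--cdp-endpoint", cdp_endpoint]
    elif provider in CLOUD_PROVIDERS:
        argv += ["--provider", provider]
    elif provider != "local":          # local is the default and needs no flag
        raise ValueError(f"unknown provider: {provider}")
    if headless:
        argv.append("--headless")
    return argv
```

Passing the objective as a single list element also sidesteps shell quoting problems when the sentence contains quotes of its own.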
Stage four: results a machine can act on
This is where a CLI built for testing pulls ahead of a chat-style demo. A run produces two outputs at once.
For a human, there is a readable verdict: what the agent did, what it found, and whether the objective held. For a machine, there is agent mode. Add --agent and stdout becomes NDJSON — one JSON object per line, with a stable schema — while all human-readable text moves to stderr. Step events stream as the run executes:
{"type":"step","step":3,"status":"passed","action":"click","remark":"Clicked ref:12"}
The final line is always a single terminal event carrying the verdict and any structured data the objective asked it to capture:
{"type":"run_end","status":"passed","summary":"Login flow verified","final_state":{"order_id":"12345"},"duration_ms":48211,"steps_executed":9,"provider":"local"}
Anything you phrase as "store ... as 'name'" lands in final_state, so you can pull it downstream without scraping prose. And because the terminal event is always the last line, the verdict is one tail -n 1 | jq away:
out=$(browserbash run "Open https://example.com and store the page title as 'title'" --agent --headless)
# printf avoids shells whose echo rewrites backslash escapes inside the JSON
title=$(printf '%s\n' "$out" | tail -n 1 | jq -r '.final_state.title')
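When the consumer is a program rather than a shell pipeline, the same stream can be read line by line. The following Python sketch parses the two sample events shown above; the function name is hypothetical, and it relies only on the fields that appear in those samples.

```python
import json
from typing import Any, Dict, Iterable, List, Tuple


def parse_agent_output(lines: Iterable[str]) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
    """Split NDJSON output into step events and the terminal run_end event."""
    events = [json.loads(line) for line in lines if line.strip()]
    if not events or events[-1].get("type") != "run_end":
        raise ValueError("expected the last line to be a run_end event")
    steps = [e for e in events[:-1] if e.get("type") == "step"]
    return steps, events[-1]


# The two sample lines from this guide (the second is one string split for width).
sample = [
    '{"type":"step","step":3,"status":"passed","action":"click","remark":"Clicked ref:12"}',
    '{"type":"run_end","status":"passed","summary":"Login flow verified",'
    '"final_state":{"order_id":"12345"},"duration_ms":48211,"steps_executed":9,"provider":"local"}',
]
steps, run_end = parse_agent_output(sample)
print(run_end["status"], run_end["final_state"]["order_id"])  # prints: passed 12345
```

Raising when the last line is not a run_end event is a deliberate choice in this sketch: a truncated stream is treated as an error rather than silently read as a pass.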
Underpinning all of it are exit codes as the contract: 0 passed, 1 failed, 2 error, 3 timeout. This is the part that makes the tool trustworthy in automation. A machine never has to infer success from a sentence. 1 is a real product or assertion failure that a human should investigate. 2 and 3 are environment signals — a grid hiccup, a dead endpoint, a run that outlived its budget — where one automatic retry is reasonable before failing the build. Pipelines that collapse those two categories train teams to rerun real failures until they accidentally pass.
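The two categories can be encoded directly so that a pipeline never retries a real failure. This is a minimal sketch of that policy in Python; the function names are hypothetical, and `run` is any callable that launches the test and returns its exit code.

```python
from typing import Callable

PASSED, FAILED, ERROR, TIMEOUT = 0, 1, 2, 3


def is_environment_exit(code: int) -> bool:
    """Exit codes 2 and 3 point at the environment, not the product."""
    return code in (ERROR, TIMEOUT)


def run_with_retry(run: Callable[[], int], max_retries: int = 1) -> int:
    """Re-run only on environment exits; a real failure (1) is never retried."""
    code = run()
    retries = 0
    while is_environment_exit(code) and retries < max_retries:
        retries += 1
        code = run()
    return code
```

Keeping the retry count at one by default matches the advice above: a second environment failure in a row is worth a human look.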
Committable tests in markdown
One-liners are perfect for quick checks, but real suites need to live in the repository, get reviewed in pull requests, and be readable by people who do not write code. For that, BrowserBash uses markdown test files where each list item is a step:
# Checkout smoke test
- Open https://www.saucedemo.com
- Log in as {{user}} with password {{password}}
- Add the "Sauce Labs Backpack" to the cart
- Open the cart and proceed to checkout
- Fill first name "Ada", last name "Lovelace", zip "94016"
- Continue and finish the order
- Verify the page shows "Thank you for your order!"
- Store the order confirmation text as 'confirmation'
Run it, and a Result.md report lands next to the file:
browserbash testmd run ./checkout_test.md --headless \
--variables '{"user":"standard_user","password":{"value":"secret_sauce","secret":true}}'
Three features make this production-grade. The @import ./helpers/login.md directive composes shared steps, so a reusable login block lives in one place. The {{variables}} keep environments and credentials out of the committed file. And anything marked {"value": "...", "secret": true} is masked as ***** everywhere it would otherwise print — including the NDJSON — which matters because agent transcripts get logged verbatim. A non-engineer can read the diff of a *_test.md file in review and actually understand what changed, which is something a page-object-heavy test suite cannot claim.
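To see why the secret flag matters, it helps to picture each step being rendered twice: once with real values for the browser, and once with masked values for anything that gets logged. The Python sketch below illustrates that idea with the variables from the command above. It is not the CLI's implementation, and the function name and placeholder regex are assumptions.

```python
import re
from typing import Any, Dict, Tuple

MASK = "*****"
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_step(step: str, variables: Dict[str, Any]) -> Tuple[str, str]:
    """Return (real_text, loggable_text) for one step of a markdown test."""
    def value_of(name: str) -> Tuple[str, bool]:
        raw = variables[name]
        if isinstance(raw, dict):      # the {"value": ..., "secret": true} form
            return str(raw["value"]), bool(raw.get("secret", False))
        return str(raw), False         # a plain string is never secret

    real = PLACEHOLDER.sub(lambda m: value_of(m.group(1))[0], step)
    loggable = PLACEHOLDER.sub(
        lambda m: MASK if value_of(m.group(1))[1] else value_of(m.group(1))[0],
        step,
    )
    return real, loggable


variables = {"user": "standard_user",
             "password": {"value": "secret_sauce", "secret": True}}
real, loggable = render_step("Log in as {{user}} with password {{password}}", variables)
print(loggable)  # prints: Log in as standard_user with password *****
```

A real tool also has to mask the value wherever it reappears later, for example if the page or the model echoes it back, which this sketch does not attempt.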
Recordings: a replay for the 2 a.m. failure
When a smoke test fails on a server you cannot see, you want a replay. Add --record to any run and BrowserBash captures a screenshot and a stitched .webm session video on either engine (video uses ffmpeg, which is bundled). The builtin engine additionally captures a Playwright trace, giving you the same time-travel debugging artifact Playwright users already know.
browserbash run "Open https://www.saucedemo.com and verify the login form is visible" \
--record --engine builtin
Every run is also kept in a private on-disk store with secrets masked. You have two ways to browse that history. The default is entirely local and private:
browserbash dashboard # free, no account, serves a local dashboard you own
That opens a local view listing your runs, each with its verdict, extracted values, and recording. Nothing leaves your machine. If you want history across machines and shareable per-run pages, there is an optional cloud dashboard — opt-in per run:
browserbash connect --key bb_... # one-time, after creating a free account
browserbash run "..." --record --upload # push THIS run to the cloud
Without --upload, nothing is sent anywhere. This is the privacy default worth underlining: an AI browser testing CLI that runs the model and the browser locally means your pages, your credentials, and your run data stay on your hardware unless you explicitly choose otherwise. On the free tier, cloud runs are kept for 15 days.
Wiring it into CI
Because the interface is exit codes plus NDJSON — not prose you have to grep — the CLI drops into CI cleanly. Here is a complete GitHub Actions job:
name: smoke
on: [push]
jobs:
  browser-smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install -g browserbash-cli
      - run: browserbash testmd run ./tests/checkout_test.md --agent --headless --timeout 180 > smoke.ndjson
        env:
          OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: smoke-ndjson
          path: smoke.ndjson
There is no "parse results" step. The run fails exactly when the test fails, because the exit code is the verdict. Note the redirect: with --agent, NDJSON goes to stdout and human-readable logs go to stderr, so smoke.ndjson stays clean while the Actions log stays readable. A common refinement is to auto-retry only the environment-flavored exits:
# GitHub Actions runs bash with -e, so capture the code with || rather than reading $? on the next line
code=0
browserbash testmd run ./tests/checkout_test.md --agent --headless --timeout 180 > smoke.ndjson || code=$?
if [ "$code" -eq 2 ] || [ "$code" -eq 3 ]; then
  echo "infra-flavored exit ($code) - retrying once" >&2
  code=0
  browserbash testmd run ./tests/checkout_test.md --agent --headless --timeout 180 > smoke.ndjson || code=$?
fi
exit "$code"
The same NDJSON-and-exit-code contract is exactly what an AI coding agent needs to verify its own work. An agent that just wrote a frontend fix can call browserbash run "..." --agent, read the verdict from the exit code, and attach the run_end line to its pull request — no prose parsing, no brittle log scraping. That is the difference between a tool built for chat and one built for automation.
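From an agent's point of view, that contract can be wrapped as an ordinary function call. Here is a hedged Python sketch: it shells out with the flags used throughout this guide, takes the verdict from the exit code, and returns the last NDJSON line. The function name and the injectable `runner` parameter are hypothetical and exist to keep the example testable without a browser.

```python
import json
import subprocess
from typing import Any, Callable, Dict, Tuple


def verify_in_browser(objective: str,
                      runner: Callable[..., Any] = subprocess.run,
                      ) -> Tuple[int, Dict[str, Any]]:
    """Run one objective in agent mode; return (exit_code, run_end event)."""
    proc = runner(
        ["browserbash", "run", objective, "--agent", "--headless"],
        capture_output=True,   # stdout carries NDJSON, stderr the human-readable log
        text=True,
    )
    lines = [ln for ln in proc.stdout.splitlines() if ln.strip()]
    run_end = json.loads(lines[-1]) if lines else {}   # empty if nothing was emitted
    return proc.returncode, run_end
```

The caller branches on the integer and treats the event as supporting evidence to attach to a pull request, so a change in wording never changes the verdict.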
How the CLI approach compares to code-first frameworks
An AI browser testing CLI is not a drop-in replacement for everything. The honest framing is that plain-English AI tests and code-first frameworks like Playwright, Cypress, or Selenium occupy different bands of work, and most teams benefit from running both. Here is a fair, high-level comparison using only well-known facts about the established tools.
| Dimension | AI browser testing CLI (BrowserBash) | Code-first framework (Playwright / Cypress / Selenium) |
|---|---|---|
| Test authoring | Plain-English objective or markdown steps | Code with explicit locators and assertions |
| Element targeting | Agent resolves elements at run time, no selectors | You write and maintain selectors / page objects |
| Execution model | LLM plans steps per run; goal-deterministic | Same instructions every run; path-deterministic |
| Speed per action | Seconds (includes model inference) | Milliseconds (direct protocol commands) |
| Maintenance on UI change | Often none — the agent adapts | Update locators when the DOM shifts |
| Flakiness model | Self-healing; re-reads the page each run | Auto-waiting (framework-dependent) reduces timing flakes |
| Debugging artifacts | Screenshot + .webm video; trace on builtin engine | Mature trace/video tooling (framework-dependent) |
| CI contract | Exit codes 0/1/2/3 + NDJSON events | Test-runner exit status + reporters |
| LLM / model cost | Free with local Ollama; paid models optional | None |
| Readable by non-engineers | Yes | No |
| Best fit | New coverage, churny UIs, smoke and journey tests | Large stable regression suites, sub-second budgets |
A couple of cells deserve a footnote. "Goal-deterministic" is the key honest caveat on the AI side: the agent works toward the same goal and is held to the same verify assertions, but it may take a slightly different path each run. And the speed gap is decisive at scale: a twelve-test smoke suite will not notice per-step inference, but an 800-test regression wall absolutely will. Keep the big deterministic suite where it is; that is what it is for.
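A rough back-of-the-envelope calculation shows why suite size changes the picture. The Python sketch below borrows the 48,211 ms duration from the sample run_end event earlier in this guide as a stand-in per-test time. That figure is an illustration, not a benchmark, and real durations depend on the model, the flow, and the hardware.

```python
def suite_minutes(tests: int, seconds_per_test: float, parallel: int = 1) -> float:
    """Wall-clock minutes for a suite, assuming tests divide evenly across workers."""
    return tests * seconds_per_test / parallel / 60


SAMPLE_SECONDS = 48.211  # duration_ms from the sample run_end event, not a benchmark

print(round(suite_minutes(12, SAMPLE_SECONDS), 1))        # 9.6 (minutes)
print(round(suite_minutes(800, SAMPLE_SECONDS) / 60, 1))  # 10.7 (hours)
```

Under that assumption a twelve-test smoke suite run sequentially fits inside a coffee break, while 800 tests would need heavy parallelism to fit inside a working day.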
When to choose which
Reach for a code-first framework when you have a large, stable regression suite, when per-test budgets are sub-second, when you need pixel-precise interactions or low-level network interception, when a fully deterministic, network-free execution trace is mandatory for compliance, and when your test authors are engineers who live in the codebase.
Reach for an AI browser testing CLI when you need new coverage today and cannot afford to write locators first, when the UI churns weekly and selector maintenance is eating your time, for smoke tests and happy-path journeys and post-deploy sanity checks, when you want a test a product manager can read and approve in review, and when you want to run everything locally and for free before any model bill exists.
The realistic pattern is coexistence. Keep the deterministic regression wall in your existing framework, and move the handful of flows that break most often for selector reasons (not product reasons) into plain-English tests. Both run in the same pipeline; both gate merges through the same exit-code contract, so your CI configuration does not care which tool produced the verdict. There is a deeper, tool-by-tool breakdown of this on the BrowserBash blog, and the docs cover engines, providers, and the markdown test format in full.
A five-minute starter path
If you want to try the whole pipeline yourself, here is the shortest route from zero to a passing test, entirely free and local:
1. Install Ollama and pull a capable model: ollama pull qwen3 (or a 70B-class model if your hardware allows).
2. Install the CLI: npm install -g browserbash-cli.
3. Run a one-liner against the demo site (the browserbash run command from the top of this guide) and watch the agent log its steps.
4. Add --record and then browserbash dashboard to replay the run in a local, private dashboard.
5. Move the flow into a checkout_test.md file, parameterize secrets with {{variables}}, and run it with browserbash testmd run.
6. Drop the markdown test into a CI job with --agent --headless and let the exit code gate your merges.
By step six you have a committable, reviewable, plain-English test that runs locally for free, records its own replay, and reports a machine-readable verdict to CI — without a single selector or wait in sight.
FAQ
Do I need API keys or a paid model to use an AI browser testing CLI?
No. BrowserBash is Ollama-first, so the default path runs a model locally with no API keys and no per-run cost. The tool itself is free and open source under Apache-2.0. You can optionally add OpenRouter (which includes free models) or your own Anthropic key when you want more capability, but nothing about the core workflow requires payment, and nothing is uploaded anywhere unless you explicitly pass --upload.
How does the agent click elements without selectors?
It re-reads the page at each step. The engine captures an accessibility-oriented snapshot of what is currently visible and interactive, gives that to the model along with your objective, and the model decides the next concrete action — click this control, type into that field, verify this text. Because the target is resolved against the live page every run, a renamed class or restructured DOM that would break a hardcoded selector is usually a non-event.
Is the result reliable enough for CI gates?
Yes, when you use it for the right band of work. The verdict arrives as a process exit code — 0 passed, 1 failed, 2 error, 3 timeout — so CI never infers success from prose. You bound runs with explicit verify steps, a step cap, and a --timeout, and you pair them with a capable model (Qwen3 or Llama 3.3 70B class) for long flows. For smoke tests, journeys, and fast-moving coverage this is solid; for an 800-test deterministic regression wall, keep a code-first framework.
Can my AI coding agent call it directly?
That is a primary use case. Run with --agent and stdout becomes NDJSON with a stable schema while the exit code carries the verdict, so an AI coding agent can invoke browserbash run "..." like a function, read pass/fail from the exit code, and attach the terminal run_end line to a pull request. There is no prose to parse and no log format that a tooling upgrade might silently change.
Ready to try AI browser testing from your terminal? It is free and open source — install with npm install -g browserbash-cli, run your first test locally against Ollama in minutes, and create a free account when you want cloud run history and shareable per-run replays.