agents · 12 min read · by Pramod Dutta

How Natural-Language Assertions Work, and When They Fail

Natural language assertions turn plain-English checks into a pass or fail. How AI test assertions are judged, where they shine, and where they quietly fail.

Short answer: a natural-language assertion like "verify the cart shows one item" becomes a pass or fail in three steps. The agent observes the current page state (the accessibility tree, the DOM, and optionally a screenshot), the model judges whether the asserted condition actually holds against what it just observed, and that verdict drives the run result and the process exit code. Pass exits 0, fail exits 1, so the same plain-English check that reads naturally to a human also gates your CI pipeline.

That is the whole mechanism, and the same step that makes these assertions flexible (a model judging a fuzzy condition) is also where they fail in ways a scripted expect(count).toBe(1) never could. This article covers how the judgment happens, the checks where it beats hand-written selectors, and the honest failure modes, including the one that should worry you most: the hallucinated pass. Examples use BrowserBash, a free open-source Apache-2.0 CLI from The Testing Academy that drives a real Chrome browser from plain-English objectives.

How a plain-English assertion becomes a verdict

Three things happen between you typing an assertion and the run turning green or red. None of them is magic, and seeing the seams is what lets you write assertions that hold up.

Step one: the agent observes the page

Before it can judge anything, the agent needs a representation of what is on screen right now. It does not guess from your prompt; it reads the live page. The primary signal is the accessibility tree, the same structured view a screen reader consumes, which exposes roles, names, and states (a button named "Checkout", a heading reading "Thank you", a list with three items). Alongside it the agent has the DOM, and depending on the engine and configuration it can also take a screenshot so the model has the visual layout, not just the semantic structure.

The two engines that ship with BrowserBash gather this differently. The default Stagehand engine observes the live DOM on every step, re-reading the page each time it acts so its picture of state is current rather than cached. The engine named builtin re-derives selectors each step and emits Playwright traces, a deterministic, replayable record of exactly what it interacted with. Either way, the assertion is judged against a fresh observation of the actual page, not your expectation of it.

Step two: the model judges the condition

Now the model has two inputs: your asserted condition ("the cart shows one item") and its observation of the page. It decides whether the condition holds. This is a judgment, not a string comparison. When you assert "the error message explains what went wrong," no selector encodes "explains what went wrong"; the model reads the visible error text and decides whether it is genuinely explanatory or just a generic "Something went wrong." That semantic judgment is the entire value proposition, and the entire risk.

The quality of this step scales with the model. A larger model reads ambiguous pages more reliably and is less prone to talking itself into a verdict the page does not support; a small local model handles a crisp assertion ("the heading says Thank you") fine and gets shakier on a fuzzy one. Model choice is a primary lever on reliability, and we return to it below.

Step three: the verdict drives the run result and the exit code

The model's verdict is not advisory. It becomes the run result, and the run result becomes the process exit code, which is what makes any of this usable in CI. A passing assertion exits 0, a failing one exits 1, and higher codes cover other conditions (a usage error, an infrastructure failure), so your pipeline can tell "the assertion was false" apart from "the run never got off the ground." When you wrap a browserbash run "<objective>" in a CI step, that exit code is the gate, exactly as pytest's exit code gates a Python suite.
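As a rough sketch of that gate, the wrapper below maps an exit code onto the three outcomes just described. The function names classify_exit and gate are hypothetical. The exact values of the higher codes are not spelled out here, so anything other than 0 or 1 is lumped together as a run error.

```shell
# Label an exit code using the convention described above:
# 0 = assertion passed, 1 = assertion failed, anything else = the run
# itself had a problem (usage error, infrastructure failure, and so on).
classify_exit() {
  case "$1" in
    0) echo "pass" ;;
    1) echo "assertion-failed" ;;
    *) echo "run-error" ;;
  esac
}

# Run any command, print the label, and hand the original exit code
# back to the caller so the CI step still fails when it should.
gate() {
  local code=0
  "$@" || code=$?
  classify_exit "$code"
  return "$code"
}

# Example (requires browserbash to be installed):
#   gate browserbash run "add one product and verify the cart count shows 1"
```

Because gate returns the original code unchanged, the pipeline's own pass/fail behaviour is untouched; the label only makes the log easier to read.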

For machine consumption, --agent emits NDJSON: one JSON object per line describing what the agent did and concluded, which you parse instead of scraping logs. The same run produces a Result.md, a human summary you read when something looks off. Both turn a verdict you have to trust into one you can audit.
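One way to keep that stream around for auditing is sketched below. It assumes the NDJSON arrives on standard output, which is an assumption rather than something stated here, and it deliberately assumes nothing about field names: it only saves the lines, counts them, and preserves the exit code. The function name capture_events is hypothetical.

```shell
# Save a command's stdout (assumed to be NDJSON, one object per line)
# to a file, report how many non-empty lines were written, and return
# the command's own exit code.
#   $1 = output file, remaining arguments = the command to run
capture_events() {
  local out=$1 code=0 n
  shift
  "$@" > "$out" || code=$?
  # grep -c prints 0 and exits non-zero on an empty file, hence "|| true".
  n=$(grep -c . "$out" || true)
  echo "events=$n exit=$code"
  return "$code"
}

# Example (requires browserbash; flag placement is an assumption):
#   capture_events run.ndjson browserbash run "verify the heading says Thank you" --agent
```

Keeping the raw file next to the exit code means a later parser can be written against real output instead of a guessed schema.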

Where natural-language assertions genuinely win

These assertions earn their place on checks that scripted assertions handle badly or cannot express at all.

Semantic checks a selector cannot encode

Some conditions are about meaning, not structure, and are awkward or impossible to write as a deterministic selector assertion:

# Did the error actually help the user, or just fail?
browserbash run "submit the form with an invalid email and verify the error message explains what went wrong"

# Does the end state read as success to a human?
browserbash run "complete checkout and verify the page looks like a successful order confirmation"

"Explains what went wrong" and "looks like a successful confirmation" are human judgments. A selector test can assert that an element with class .error exists and contains some text, but it cannot tell you whether that text is useful. The model can, because judging meaning is what it does. Here, natural-language assertions are a capability you did not previously have in an automated test.

Tolerance of wording and layout change

A scripted assertion pinned to getByText("Order confirmed") breaks the day a copywriter changes it to "Your order is confirmed." A natural-language assertion that the page "confirms the order succeeded" survives that edit, because it is judged on meaning rather than a literal string. The same tolerance covers layout: move the banner, wrap it in a new container, restyle it, and the semantic condition still holds. You assert the user-visible truth, not the markup that currently expresses it. That is the intent-over-implementation discipline of testing user intent, not clicks.

No selector to maintain

Every CSS or XPath selector in a traditional suite is a liability that snaps on class renames, wrapper divs, and reordered DOM. A plain-English assertion has no selector to rot; the agent re-derives what it needs each run, so the assertion's text rarely changes even as the application underneath it churns. The selectors you never wrote are the maintenance you never pay.

When natural-language assertions fail

These assertions fail in specific, predictable ways, and a senior SDET deserves them stated plainly. None is a reason to avoid natural-language assertions; all are reasons to write and audit them deliberately.

Vague assertions produce unreliable verdicts

The model judges exactly what you ask. Ask something uncheckable and you get an unreliable verdict, not an error. "Verify it works" has no observable referent; the model has to invent what "works" means, and two runs can invent differently. The same goes for "make sure the page is fine." A vague assertion is the most common cause of a flaky natural-language check, and the fix is in your hands: name a concrete, observable fact.

# Vague: no observable referent, verdict is a guess
browserbash run "add a product and verify the cart works"

# Specific: one fact the model can read off the page
browserbash run "add one product and verify the cart count shows 1"

The specific version is checkable because "the cart count shows 1" is a thing the model can locate and confirm. The vague version asks the model to decide what success even means, which is your job, not the model's.

The hallucinated pass

This is the one that matters most, so it gets no euphemism: the model can report success when the condition was not actually met. The assertion goes green, the exit code is 0, your pipeline proceeds, and the thing you asserted was never true. This is the central trust risk of natural-language assertions, and pretending it does not exist is how teams get burned.

It happens because the same generative judgment that lets the model read meaning also lets it confabulate. Given a leading assertion and an ambiguous page, a model can talk itself into a verdict the page does not support, especially a smaller model on a longer or fuzzier flow. A false fail is annoying but loud; you investigate it. A false pass is silent, and silent is dangerous, because it is indistinguishable from a real pass until a customer finds the bug you thought you tested. You manage this risk; you do not eliminate it. Treat any green from a natural-language assertion on a critical path as a claim to be verified, not a fact to be trusted, until you have audited it.

Exact values and numeric precision

Natural-language judgment is the wrong tool for exactness. "Verify the total looks right" invites the model to approximate, and approximation on a number is a bug waiting to pass. If the correct total is $49.99, assert that exact value. Money, counts, quantities, decimal precision, anything where being off by a cent or an item is a defect, should be stated as the concrete value you expect.

# Fuzzy on a number: invites a wrong-but-passing verdict
browserbash run "check out and verify the total looks correct"

# Exact: the verdict is true only if the number is exactly right.
# Single quotes stop the shell from expanding $4 inside $49.99.
browserbash run 'check out and verify the order total equals $49.99'

The exact version is still a natural-language assertion, but it pins the judgment to a specific value the model must match, removing the room to approximate. For truly critical numeric paths, even this is best backed by a deterministic check, covered below.

Mitigations that make these assertions trustworthy

The failure modes are manageable with a handful of habits that move natural-language assertions from "neat but risky" to "trustworthy on the paths where you have done the work."

Write specific, observable assertions

The cheapest and highest-leverage habit. Assert concrete facts the model can read directly off the page: "the heading says Thank you," "the cart count is 1," "the success banner contains the order number." Each has exactly one truth condition, which leaves the model far less room to hallucinate and makes the verdict reproducible. Specificity is your first defense against both vagueness and the false pass.
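A crude pre-flight check can catch the worst offenders before a run is ever started. The sketch below is a heuristic only: the function name is_vague and the phrase list are hypothetical, the list is far from exhaustive, and passing it does not make an assertion specific.

```shell
# Succeed (exit 0) when an objective leans on a phrase with no
# observable referent. The phrase list is illustrative, not complete.
is_vague() {
  printf '%s\n' "$1" |
    grep -Eiq '(^|[^[:alnum:]])(works|is fine|looks (right|correct|good|ok))([^[:alnum:]]|$)'
}

# Example:
#   if is_vague "$objective"; then
#     echo "Rewrite this objective around a concrete, observable fact" >&2
#   fi
```

The word-boundary guards keep "networks" from matching "works", and "looks like a successful order confirmation" is left alone on purpose, since that kind of semantic check is one the article treats as legitimate.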

Audit with --record and the Result.md

Trust comes from being able to check the work, and the run gives you two ways. Pass --record and it captures a webm video plus screenshots of what actually happened on screen, so you can watch the run and see for yourself whether the asserted condition held. The Result.md is the human-readable summary of the run and its verdict. When an assertion passes on a path you care about, do not just believe the green; open the recording and the summary and confirm the page really showed what the model said it showed. This is the practical defense against the hallucinated pass, walked through in the recording video and traces tutorial. The builtin engine's Playwright traces add a third, replayable view.
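To make the audit habit harder to skip, a CI step can refuse to accept a green run unless the artifacts are actually there to review. The sketch below assumes the summary and the recording end up under one directory that you pass in; that layout, and the function name audit_ready, are assumptions to adapt to wherever your runs write their output.

```shell
# Check that the audit artifacts for a run exist before trusting it.
#   $1 = directory where the run's artifacts are assumed to live
audit_ready() {
  local dir=$1 missing=0
  # A non-empty Result.md is the human-readable summary.
  if [ ! -s "$dir/Result.md" ]; then
    echo "missing: Result.md"
    missing=1
  fi
  # At least one .webm recording somewhere under the directory.
  if ! find "$dir" -name '*.webm' | grep -q .; then
    echo "missing: recording (.webm)"
    missing=1
  fi
  [ "$missing" -eq 0 ] && echo "ready to audit"
  return "$missing"
}

# Example: audit_ready ./artifacts || exit 1
```

This only proves the evidence exists. Someone still has to open the recording and confirm the page showed what the model claimed.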

Use a more capable model on high-stakes checks

Assertion reliability scales with model capability, so spend it where the cost of a wrong verdict is highest. A mid-size local model is a fine default for crisp assertions on short flows. For a high-stakes check (a payment confirmation, a permissions boundary, anything where a false pass ships a real bug), escalate to a more capable model, which holds ambiguous judgments together more reliably and is less prone to confabulating a pass.

# Escalate the model on a high-stakes assertion
browserbash run "complete the paid upgrade and verify the account shows the Pro plan is active" --model claude-opus-4-8

How much this actually buys you is something you should measure on your own application rather than take on faith. The method for computing your own run-to-run reliability is covered separately under agentic test determinism benchmarks, and this post deliberately cites no invented benchmark numbers for it.
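A naive starting point for that measurement, and not necessarily the method referenced above, is to repeat the same run and count how often it exits 0. The function name pass_count is hypothetical.

```shell
# Run the same command N times and report how many runs exited 0.
#   $1 = number of runs, remaining arguments = the command to repeat
pass_count() {
  local runs=$1 passes=0 i=0
  shift
  while [ "$i" -lt "$runs" ]; do
    if "$@" >/dev/null 2>&1; then
      passes=$((passes + 1))
    fi
    i=$((i + 1))
  done
  echo "$passes/$runs"
}

# Example (requires browserbash to be installed):
#   pass_count 10 browserbash run "add one product and verify the cart count shows 1"
```

A count like this tells you how consistent the verdict is, not whether it is correct: ten identical passes can still be ten hallucinated passes, which is why the recording and a deterministic cross-check still matter.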

Cross-check critical assertions with a deterministic API check

The strongest mitigation is to not rely on the natural-language verdict alone for what truly matters. Pair the plain-English assertion with a deterministic check against an API or database: after the agent reports the order confirmed, query the orders endpoint and assert the row exists with the exact total. The natural-language assertion verifies the user-visible experience; the deterministic check verifies the ground truth. When both agree you have real confidence; when they disagree you have caught either a UI lie or a hallucinated pass. This hybrid pattern, fuzzy where fuzzy is the point and exact where exact is non-negotiable, is the same approach used to tame flakiness when reducing flaky end-to-end tests.
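The decision logic of that pairing can be sketched in a few lines. The function name cross_check, its labels, and the fetch_order_total helper in the usage comment are all hypothetical; how you actually query your orders API or database is up to you.

```shell
# Combine the agent's verdict with a ground-truth value.
#   $1 = exit code of the natural-language run
#   $2 = total you expect, as an exact string (for example 49.99)
#   $3 = total reported by your API or database
cross_check() {
  local ui_exit=$1 expected=$2 actual=$3
  if [ "$ui_exit" -ne 0 ]; then
    echo "ui-failed"      # the assertion itself did not pass
    return 1
  fi
  if [ "$expected" = "$actual" ]; then
    echo "confirmed"      # UI verdict and ground truth agree
    return 0
  fi
  echo "suspect-pass"     # green run, but the backend disagrees
  return 2
}

# Example (fetch_order_total is a hypothetical helper you would write):
#   ui=0
#   browserbash run 'check out and verify the order total equals $49.99' || ui=$?
#   cross_check "$ui" 49.99 "$(fetch_order_total)"
```

The comparison is a plain string match on purpose, so no floating-point rounding can blur a one-cent difference; the cost is that 49.99 and 49.990 count as different, so normalise the format on the API side first.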

Honest limits

The hallucinated pass is real, not theoretical. A model judging a fuzzy condition can return a passing verdict when the condition was false, and because a false pass is silent it is the most consequential way these assertions fail. Every mitigation above exists to manage that risk, and none reduces it to zero.

Smaller models are more prone to hallucinated passes. The gap between a crisp assertion and a confident-but-wrong verdict widens as the model shrinks and the flow lengthens, so match model capability to the stakes of the assertion rather than running one model everywhere.

And natural-language assertions complement exact programmatic checks; they do not replace them on critical numeric paths. For money, counts, and precise quantities, a deterministic assertion against a known value is more trustworthy than any judgment. The right architecture uses both. Used that way, with specific assertions, audited artifacts, the right model, and a deterministic backstop on what matters, natural-language assertions are a real addition to a senior SDET's toolkit rather than a leap of faith. Start at /features or the /learn guides, and install with npm install -g browserbash-cli.

FAQ

How does a natural-language assertion decide pass versus fail?

The agent observes the current page (accessibility tree, DOM, optionally a screenshot), the model judges whether your asserted condition holds against that observation, and the verdict becomes the run result and the exit code: 0 for pass, 1 for fail, higher codes for usage and infrastructure errors. The judgment is made against a fresh reading of the live page, not your prompt's expectation of it.

What is a hallucinated pass and how do I catch it?

A hallucinated pass is when the model reports success even though the asserted condition was not met, so the run goes green on something false. Catch it by auditing rather than trusting the green: run with --record to capture a webm and screenshots, read the Result.md summary, and for anything critical cross-check the verdict against a deterministic API or database query. A more capable model also tends to make hallucinated passes less frequent, though how much is something to measure on your own application.

Should I use natural-language assertions for exact totals or counts?

Use them, but pin the value: assert "the order total equals $49.99," not "the total looks correct," so the model has no room to approximate. For genuinely critical numbers, back the plain-English assertion with a deterministic API check so a wrong-but-passing verdict cannot slip through.

Do natural-language assertions replace my existing programmatic checks?

No, they complement them. Natural-language assertions handle the semantic, wording-tolerant, layout-tolerant, selector-free checks that scripted assertions struggle to express; exact programmatic checks remain better for numeric precision and critical ground truth. The strongest suites use both.
