AI Accessibility Testing: Automating WCAG Checks With Agents

Name: BrowserBash
Author: The Testing Academy

AI accessibility testing sits in an awkward gap that most teams discover the hard way. You run axe-core in CI, the report comes back green, you ship, and three weeks later a keyboard-only user files a ticket because they cannot reach the "Apply coupon" button without a mouse. The scanner did its job perfectly and still missed the bug. That gap — between what a rule-based scanner can prove and what a human actually experiences — is exactly where AI agents are starting to earn their keep. This guide walks through how AI accessibility testing differs from traditional axe scanning, where each approach wins, and how you can drive a real Chrome browser to verify keyboard flows and screen-reader-relevant behavior described in plain English.

I have spent enough time staring at accessibility reports to be both excited and skeptical about the AI angle. So this is not a pitch. It is an honest map of what rule engines can and cannot do, what an LLM-driven agent adds on top, and where you still need a real assistive-technology user in the loop. By the end you will know which checks to keep in your existing axe pipeline, which to hand to an agent, and how to wire the agent half with BrowserBash without rewriting your whole test suite.

Why rule-based scanners only see part of the page

Tools like axe-core, Pa11y, WAVE, and the accessibility audits baked into Lighthouse are static analyzers at heart. They parse the rendered DOM and the accessibility tree, then apply a fixed set of deterministic rules. Is there an alt attribute on that image? Does this form control have an associated label? Is the contrast ratio between this normal-size text and its background at least 4.5:1? Does this button have an accessible name? These map to real WCAG success criteria, and a scanner checks them quickly, consistently, and at near-zero cost.
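To make "deterministic rule" concrete, here is a minimal Python sketch of the contrast check, using the relative-luminance formula from the WCAG definition. It is an illustration of the arithmetic, not axe-core's implementation; the function names are my own, and it ignores things a real scanner has to handle, such as transparency and background images.

```python
def _linearize(channel_8bit: int) -> float:
    """Convert one 8-bit sRGB channel to linear light."""
    c = channel_8bit / 255
    # WCAG's text cites 0.03928 as the cutoff; 0.04045 is the sRGB value.
    # For 8-bit channels both cutoffs select the same branch.
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """Relative luminance of an (R, G, B) color, 0.0 for black to 1.0 for white."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    """Contrast ratio between two colors, from 1.0 up to 21.0."""
    lum_a, lum_b = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa_normal_text(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> bool:
    """True when the pair meets the 4.5:1 minimum for normal-size text."""
    return contrast_ratio(fg, bg) >= 4.5


WHITE = (255, 255, 255)
print(passes_aa_normal_text((118, 118, 118), WHITE))  # #767676 on white
print(passes_aa_normal_text((119, 119, 119), WHITE))  # #777777 on white
```

The two grays differ by one step per channel and land on opposite sides of the threshold. That is the kind of exact, repeatable answer a rule engine is good at.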

That determinism is the whole point. A rule engine never gets bored, never has an off day, and gives you the same answer on every run. If you are not already running axe-core in CI, stop reading and go add it. It is the single highest-leverage accessibility check you can automate, and nothing in this article replaces it.

But here is the uncomfortable arithmetic that the accessibility community has repeated for years: automated rule scanners catch somewhere in the neighborhood of a third to a half of WCAG issues. The exact fraction is debated and depends heavily on the page, so treat any precise percentage with suspicion. The direction, though, is not in dispute. A large share of accessibility failures are simply not expressible as a DOM rule.

Consider a few examples a scanner structurally cannot judge:

Whether alt text is meaningful. axe can confirm an image has an alt attribute. It cannot tell you that alt="image123.jpg" is useless, or that a decorative divider was given a verbose description that clutters the screen-reader experience.
Whether the reading order makes sense. The DOM order might be logical while the visual order, rearranged by CSS grid or flexbox, tells a different story to a sighted user than to someone using a screen reader.
Whether a custom widget actually behaves like the role it claims. A div with role="button" passes the "has a role" check. Whether it responds to Enter and Space, and whether focus lands somewhere sensible after activation, is behavior, not markup.
Whether an error message is announced. A form can have a perfectly labeled error region that never gets aria-live wired up, so the message appears visually but is silent to assistive tech.

None of these are scanner failures. They are category limits. To catch them you need something that can reason about intent and observe behavior, not just inspect attributes.
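The alt-text case is easy to demonstrate. Below is a toy presence check written with Python's standard html.parser module. It is a sketch of the idea of an attribute-presence rule, not how axe-core works internally, and the sample markup is invented for illustration.

```python
from html.parser import HTMLParser


class AltPresenceChecker(HTMLParser):
    """Sorts <img> tags by whether an alt attribute is present at all."""

    def __init__(self) -> None:
        super().__init__()
        self.missing_alt: list[str] = []   # src of images with no alt attribute
        self.present_alt: list[str] = []   # alt values exactly as authored

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        if "alt" in attr_map:
            self.present_alt.append(attr_map["alt"] or "")
        else:
            self.missing_alt.append(attr_map.get("src") or "")


def check_alt_presence(html: str) -> AltPresenceChecker:
    checker = AltPresenceChecker()
    checker.feed(html)
    checker.close()
    return checker


# Invented sample markup: one image without alt, one with a useless alt,
# one decorative image with an intentionally empty alt.
SAMPLE = """
<img src="hero.png">
<img src="chart.png" alt="image123.jpg">
<img src="divider.png" alt="">
"""

result = check_alt_presence(SAMPLE)
print(result.missing_alt)   # ['hero.png']
print(result.present_alt)   # ['image123.jpg', '']
```

The rule flags hero.png and is satisfied by alt="image123.jpg". Nothing in the markup tells it that the second value is meaningless to a listener, which is the category limit in one screen of code.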

How AI agents reason about accessibility differently

An AI accessibility testing agent attacks the problem from the other end. Instead of applying a fixed rule to static markup, it takes a plain-English objective, perceives the page the way a tool can — through the accessibility tree, the DOM, and often a screenshot — and then reasons step by step about whether the experience matches what you described.

The shift is from rule matching to goal reasoning. You do not tell the agent "assert that element #coupon-btn has tabindex >= 0." You tell it something a human would say:

"Using only the keyboard, navigate to the coupon field, enter SAVE10, and apply it. Confirm you can reach and activate every control in the checkout form by tabbing, and that focus never gets trapped or lost."

The agent then drives the browser, presses Tab, reads where focus actually landed, and decides whether the objective was met. That is a fundamentally different capability. It can judge the meaning of an alt string, notice that focus jumped to the top of the page after a modal closed, or flag that a "Skip to content" link exists in the DOM but never becomes visible on focus.

What "reasoning" buys you in practice

The practical payoff shows up in three places:

Semantic judgment. An LLM can read alt="Submit" on a decorative arrow and tell you it is wrong in context — something no contrast ratio or attribute-presence rule will ever surface.
Behavioral verification. Keyboard traps, focus order, focus restoration after a dialog closes, and visible focus indicators are all behaviors over time. An agent that actually presses keys and observes the result can verify them. A static scan of one DOM snapshot cannot.
Plain-English coverage. You can describe an accessibility expectation in a sentence and get a verdict, which means non-specialists on the team can author checks. That lowers the barrier that has historically kept accessibility work siloed with one or two experts.
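Behavioral verification is easier to picture with a toy model. The Python sketch below takes the sequence of elements that received focus after each Tab press and decides whether the target was reached or whether focus started looping first. Everything here is hypothetical: the element names, the FocusReport shape, and the loop heuristic are my own simplification, not BrowserBash's internals, and a real agent reasons over much richer observations.

```python
from dataclasses import dataclass, field


@dataclass
class FocusReport:
    reached_target: bool
    unreached: list[str] = field(default_factory=list)   # expected controls never focused
    trapped_in: list[str] = field(default_factory=list)  # repeating cycle, if one was seen


def analyze_tab_sequence(observed: list[str], expected: list[str], target: str) -> FocusReport:
    """Walk the focus sequence until the target is hit or focus revisits an element."""
    seen: list[str] = []
    trapped_in: list[str] = []
    for element in observed:
        if element == target:
            seen.append(element)
            break
        if element in seen:
            # Focus came back to something already visited before reaching
            # the target, so treat the elements since then as a trap.
            trapped_in = seen[seen.index(element):]
            break
        seen.append(element)
    unreached = [e for e in expected if e not in seen]
    return FocusReport(target in seen, unreached, trapped_in)


# Hypothetical run: focus enters a modal and cycles between its two buttons.
observed = ["email", "coupon-input", "modal-close", "modal-ok", "modal-close"]
expected = ["email", "coupon-input", "apply-coupon", "place-order"]
report = analyze_tab_sequence(observed, expected, target="place-order")
print(report.reached_target)  # False
print(report.unreached)       # ['apply-coupon', 'place-order']
print(report.trapped_in)      # ['modal-close', 'modal-ok']
```

The point is that the input is a sequence over time. A single DOM snapshot contains none of it, which is why this class of check needs something that actually presses keys.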

The honest limits of the AI half

I promised honesty, so here it is. An AI agent is not a screen-reader user, and it is not a substitute for one. It does not hear NVDA or VoiceOver speak. It reasons about what a screen reader would likely announce based on the accessibility tree and ARIA attributes, which is useful and catches a lot, but it is an inference, not a recording of lived experience. Anyone who tells you an agent fully replaces assistive-technology user testing is overselling.

Agents are also probabilistic. Run the same objective twice and a model may take slightly different paths, and a weak model may misjudge an edge case. That non-determinism is the opposite of what makes axe trustworthy. The right mental model is not "agent replaces scanner" but "scanner gives you a deterministic floor, agent extends coverage into the behavioral and semantic space, human AT users confirm the lived experience." Three layers, not one.

AI accessibility testing versus axe scanning, side by side

Here is the comparison I wish someone had handed me when I started mixing these approaches. Neither column is "better" — they cover different ground.

Dimension	Rule-based scanner (axe, Pa11y, Lighthouse)	AI agent (LLM driving a browser)
Core method	Deterministic rules over DOM and a11y tree	Goal reasoning over a11y tree, DOM, screenshot
Determinism	Same result every run	Probabilistic; can vary by run and model
Speed	Milliseconds to seconds	Seconds to minutes per flow
Cost	Effectively free	Free on local models; paid if you use a hosted model
Catches missing `alt` / labels	Yes, reliably	Yes
Judges whether `alt` is meaningful	No	Often
Contrast ratios	Yes, precisely	Unreliable; leave this to the scanner
Keyboard reachability and focus order	No	Yes, by actually tabbing
Focus restoration after a modal	No	Yes
Screen-reader announcement (inferred)	Partial (ARIA presence)	Better (reasons about likely output)
Real screen-reader experience	No	No — needs a human AT user
CI friendliness	Excellent	Good with structured output
Flakiness risk	Near zero	Real; depends on model size

Read that table as a division of labor. Contrast checking and attribute presence belong to the scanner forever, because it is faster and exact. Keyboard journeys, focus behavior, and semantic judgment are where the agent pulls its weight. And the "Real screen-reader experience" row, with a "No" in both columns, is the reminder that neither tool closes the loop on the actual human experience.

Where BrowserBash fits the agent half

BrowserBash is a free, open-source, natural-language browser automation CLI from The Testing Academy. You write a plain-English objective, an AI agent drives a real Chrome or Chromium browser step by step — no selectors, no page objects — and you get back a verdict plus structured results. For accessibility work, the "real browser" part matters: focus events, :focus-visible styles, and keyboard handlers only behave correctly in an actual rendering engine, not in a parsed-DOM abstraction.

Two design choices make it a sensible fit for the agent layer specifically.

First, it is Ollama-first. By default it uses free local models, so no API keys are needed and nothing leaves your machine — which is exactly the privacy posture you want when you are pointing an agent at internal staging environments or pre-release pages. It auto-resolves a local Ollama install first, then falls back to ANTHROPIC_API_KEY, then OPENROUTER_API_KEY. So you can run a genuinely $0 accessibility pass on local models, and reach for a more capable hosted model only when a flow is hard.
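The resolution order described above can be sketched in a few lines of Python. This is only an illustration of the fallback sequence, not BrowserBash's actual code; the function names and return labels are hypothetical, and treating "Ollama is installed" as "the ollama binary is on PATH" is my assumption.

```python
import shutil
from typing import Mapping, Optional


def ollama_on_path() -> bool:
    """Assumption: a local Ollama install is detectable as a binary on PATH."""
    return shutil.which("ollama") is not None


def resolve_model_source(env: Mapping[str, str], ollama_available: bool) -> Optional[str]:
    """Return the first usable model source, in the order the article describes."""
    if ollama_available:
        return "ollama"        # free, local, nothing leaves the machine
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"     # hosted fallback
    if env.get("OPENROUTER_API_KEY"):
        return "openrouter"    # second hosted fallback
    return None                # nothing configured


# Both keys set, but a local install still wins.
print(resolve_model_source(
    {"ANTHROPIC_API_KEY": "placeholder", "OPENROUTER_API_KEY": "placeholder"},
    ollama_available=True,
))  # ollama
```

On a real machine you would call resolve_model_source(os.environ, ollama_on_path()). The useful property is that hosted keys sitting in your environment do not silently override the local default.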

Second, it speaks plain English natively, which is the whole interface an accessibility objective wants. You are not translating "focus must be restored to the trigger button after the dialog closes" into a brittle selector assertion. You write that sentence and the agent verifies it.

A blunt caveat, because it matters here: very small local models — roughly 8B parameters and under — can be flaky on long multi-step objectives, and accessibility flows are often long (tab through twenty controls, open a modal, close it, check focus). The sweet spot is a mid-size local model in the Qwen3 or Llama 3.3 70B class, or a capable hosted model for the hardest flows. If your keyboard journey keeps drifting off course, that is usually the model size talking, not a tooling bug.

A first keyboard-navigation check

Here is a minimal run. The objective is written the way you would describe the requirement to a teammate.

npm install -g browserbash-cli

browserbash run "Go to https://shop.example.com/checkout. Using only the Tab key, \
move through every interactive control in the checkout form in order. Confirm focus \
is always visible, never gets trapped, and reaches the 'Place order' button. Report \
any control you cannot reach by keyboard."

The agent tabs through the page, observes where focus actually lands, and returns a verdict. If the coupon button can only be reached with a mouse, that shows up in the result — the exact failure axe stayed silent on.

Building repeatable, committable accessibility checks

One-off runs are great for exploration, but accessibility coverage only sticks if it lives in your repo and runs on every change. BrowserBash supports committable Markdown tests: *_test.md files where each list item is a step. They support @import composition so you can share a login flow across suites, and {{variables}} templating with secret-marked values masked as ***** in every log line. After each run it writes a human-readable Result.md.

Here is how you would run a keyboard-and-focus suite that lives in a committed file:

browserbash testmd run ./a11y_checkout_test.md

And the a11y_checkout_test.md itself reads like a spec a non-engineer could review:

- Open {{baseUrl}}/checkout
- Log in as {{user}} with password {{password}}
- Using only the keyboard, tab through the shipping form and confirm every field is reachable and has a visible focus ring
- Open the "Edit payment" modal and confirm focus moves into the modal
- Press Escape and confirm the modal closes and focus returns to the "Edit payment" button
- Confirm the order summary region announces the updated total (check for an aria-live region)
- Verify the final "Place order" button has an accessible name that matches its visible text

Because password is marked as a secret, it never appears in the logs or in Result.md. The file is reviewable in a pull request, diffs cleanly when requirements change, and reads like acceptance criteria rather than code. That readability is the quiet superpower of plain-English accessibility tests — your product manager can sanity-check the coverage, which almost never happens with a wall of Playwright assertions.
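If you want a mental model for the masking, here is a rough Python sketch of {{variable}} substitution with secret-aware logging. It is not BrowserBash's implementation. The function names, the way secrets are declared (a plain set of names), and the sample values are all hypothetical; only the {{name}} syntax and the ***** mask come from the tool's described behavior.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")
MASK = "*****"


def render(step: str, variables: dict[str, str]) -> str:
    """Fill in every {{name}} with its real value (what the browser would receive)."""
    return PLACEHOLDER.sub(lambda m: variables[m.group(1)], step)


def render_for_log(step: str, variables: dict[str, str], secrets: set[str]) -> str:
    """Fill in placeholders for logging, masking any variable marked as secret."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        return MASK if name in secrets else variables[name]

    return PLACEHOLDER.sub(replace, step)


# Hypothetical values for illustration only.
step = "Log in as {{user}} with password {{password}}"
variables = {"user": "qa@example.com", "password": "hunter2"}

print(render(step, variables))
# Log in as qa@example.com with password hunter2
print(render_for_log(step, variables, secrets={"password"}))
# Log in as qa@example.com with password *****
```

Keeping the two render paths separate is the design choice worth copying: the real value exists only on the path that drives the browser, so nothing on the logging path can leak it.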

Composing flows with imports

If your login is already captured in a login_test.md, you reference it instead of duplicating it:

@import ./login_test.md
Continue with the checkout accessibility steps above

This keeps the accessibility-specific steps front and center while the boilerplate stays in one place. When the login form changes, you fix it once.
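Conceptually, import composition is just inlining. The Python sketch below expands @import lines recursively into one flat list of steps. It is a simplified illustration and not how BrowserBash parses test files: I am assuming one step per non-empty line, paths resolved relative to the importing file, and a circular-import guard that I added myself.

```python
from pathlib import Path


def expand_steps(test_file: Path, _stack: tuple[Path, ...] = ()) -> list[str]:
    """Return the steps of a test file with every @import replaced by its steps."""
    test_file = test_file.resolve()
    if test_file in _stack:
        raise ValueError(f"circular @import: {test_file}")

    steps: list[str] = []
    for raw in test_file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("@import "):
            # Assumption: import paths are relative to the importing file.
            target = test_file.parent / line[len("@import "):].strip()
            steps.extend(expand_steps(target, _stack + (test_file,)))
        else:
            steps.append(line)
    return steps
```

Calling expand_steps(Path("a11y_checkout_test.md")) on a file that starts with an @import line would return the login steps first, followed by the accessibility steps, in the order a reader would expect.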

Wiring AI accessibility checks into CI

A check that only runs on your laptop protects nobody. The point of automation is the gate. BrowserBash has an agent mode built for exactly this: --agent emits NDJSON — one JSON event per line on stdout — so a CI job or an AI coding agent can parse the run without scraping prose. The exit codes are unambiguous: 0 passed, 1 failed, 2 error, 3 timeout.

browserbash run "Verify keyboard-only users can complete the full checkout on \
https://staging.example.com, with visible focus at every step and focus restored \
after each modal closes." --agent --headless --record

A few things worth calling out for an accessibility gate:

--agent gives you machine-readable events, so your pipeline can branch on a clean exit code instead of grepping logs. That is the difference between a real gate and a flaky one.
--headless runs without a visible window, which is what you want on a CI runner. One honest note: a small number of accessibility behaviors, like certain :focus-visible heuristics, can differ between headed and headless Chrome, so if a focus check behaves oddly in CI, reproduce it headed locally before assuming the agent is wrong.
--record captures a screenshot and a full .webm session video via ffmpeg, on any engine. For accessibility failures this is gold — when the agent says "focus was lost after closing the modal," you have video proof to hand a developer instead of a vague assertion.
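On the consuming side, a gate needs very little code. The Python sketch below parses NDJSON output and maps the documented exit codes to a status. The exit-code meanings come straight from the list above; the event fields in the sample ("type", "text", "passed") are invented for illustration, because the actual event schema is not covered here, so check the real output before relying on any field name.

```python
import json

# Exit codes as documented: 0 passed, 1 failed, 2 error, 3 timeout.
EXIT_STATUS = {0: "passed", 1: "failed", 2: "error", 3: "timeout"}


def parse_ndjson(stdout: str) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    events = []
    for line in stdout.splitlines():
        line = line.strip()
        if line:
            events.append(json.loads(line))
    return events


def gate(exit_code: int, stdout: str) -> tuple[str, bool, list[dict]]:
    """Return (status, should_block, events) for a finished run."""
    status = EXIT_STATUS.get(exit_code, "unknown")
    should_block = status != "passed"   # anything other than a pass fails the gate
    return status, should_block, parse_ndjson(stdout)


# Hypothetical output; these field names are NOT the real schema.
sample_stdout = (
    '{"type": "step", "text": "Pressed Tab"}\n'
    '{"type": "verdict", "passed": false}\n'
)
status, should_block, events = gate(1, sample_stdout)
print(status, should_block, len(events))  # failed True 2
```

In a real job you would get exit_code and stdout from subprocess.run([...], capture_output=True, text=True) around the browserbash command. Note that the decision to block rests on the exit code alone, so the gate keeps working even if the event payloads change shape.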

The builtin engine additionally captures a Playwright trace you can open in the trace viewer, which lets you step through exactly what the agent saw and did. For an accessibility regression that only reproduces intermittently, that trace is often the fastest path to a root cause.

Choosing where the browser runs

BrowserBash lets you switch the execution location with one --provider flag. The default is local, your own Chrome, which is the right starting point. For accessibility specifically, running against real cross-browser engines matters because focus and ARIA handling have genuine browser-to-browser quirks. You can point at any DevTools endpoint with cdp, or run on a cloud grid:

browserbash run "Tab through the registration form and confirm every field is \
keyboard reachable with a visible focus indicator." --provider lambdatest --record

If a focus bug only shows up on one engine, this is how you catch it. The default local provider is fine for the bulk of your runs; reach for a grid when you need breadth. You can read more on the providers and features page.

A realistic AI accessibility testing workflow

Let me tie the layers together into something you could actually adopt next sprint, rather than a pile of disconnected commands.

Layer 1 — Deterministic scan on every commit. Keep axe-core (or Pa11y, or Lighthouse CI) in your pipeline exactly as you have it. It catches missing labels, missing alt, contrast failures, and structural ARIA problems fast and reliably. This is your floor, and it never moves.

Layer 2 — Agent-driven behavioral checks on key flows. For your three or four most important journeys — login, checkout, search, the main dashboard — write plain-English BrowserBash *_test.md files that verify keyboard reachability, focus order, focus restoration after modals, and whether alt text and accessible names are actually meaningful. Run them in CI with --agent and gate on the exit code. This is the coverage axe structurally cannot give you.

Layer 3 — Human assistive-technology testing on releases. Before a major release, have someone use the actual flows with a real screen reader. No agent replaces this. What the agent buys you is that by the time a human sits down with NVDA, the obvious keyboard and focus bugs are already gone, so the expensive human time goes toward the subtle experiential issues only a person can feel.

A cadence that works

In practice, teams I have seen succeed run something like this: axe on every push, the agent suite nightly and on release branches, and human AT testing once per release cycle. Agent runs take seconds to minutes, not milliseconds, so you do not want them blocking every push if your pipeline is latency-sensitive. The cost shape lines up nicely too. The deterministic and agent layers can both be effectively free if you run the agent on local models, so the only scarce resource you are budgeting is human screen-reader time, which is where it should be spent.

Reviewing results without watching every run

Two free options help here, and both are worth knowing about. There is a fully local dashboard you launch with browserbash dashboard, no account required, which gives you run history on your own machine. There is also an optional, strictly opt-in cloud dashboard with run history, video recordings, and per-run replay — you enable it with browserbash connect and the --upload flag, and free uploaded runs are kept for 15 days. For accessibility, the per-run replay is especially handy when you want to show a stakeholder why a flow fails for keyboard users rather than just telling them. You can compare the options on the pricing page; the short version is that the core CLI and local dashboard cost nothing.

When to reach for an agent, and when not to

Balance matters, so here is the blunt decision guide.

Use an AI accessibility agent when:

You need to verify behavior — keyboard reachability, focus order, focus restoration, focus traps, visible focus.
You need semantic judgment — is this alt text meaningful, does this accessible name match the visible label, would a screen reader likely announce something sensible here.
You want non-specialists to author accessibility checks in plain English so coverage is not bottlenecked on one expert.
You want repeatable, committed checks that diff in pull requests and run in CI.

Stick with a rule scanner (and skip the agent) when:

You are checking contrast ratios — a scanner is exact and instant; an agent is unreliable at this and you would be using the wrong tool.
You need a deterministic, zero-flake gate on attribute-level rules like label presence or heading structure.
The check is trivially expressible as a DOM rule and runs in milliseconds. Do not pay agent latency for something axe nails for free.

And always remember the third layer: neither automated approach replaces a real assistive-technology user. If you ship something that genuinely matters to disabled users, get a human who uses a screen reader daily to try it. The honest framing is that AI accessibility testing widens automated coverage into behavioral and semantic territory — it does not eliminate the need for lived experience.

For more worked examples and patterns, the BrowserBash blog and the case study walk through real flows end to end.

FAQ

Can AI accessibility testing replace axe-core?

No, and you should not try. axe-core is a deterministic rule engine that catches attribute-level WCAG issues like missing labels, missing alt text, and contrast failures instantly and identically on every run. An AI agent extends coverage into behavior and semantics that rules cannot express, such as keyboard focus order and whether alt text is meaningful. Use both: the scanner as your reliable floor, the agent to reach the issues it structurally misses.

Can an AI agent test screen-reader behavior?

Partially. An agent reasons about what a screen reader would likely announce based on the accessibility tree and ARIA attributes, which catches many issues like missing accessible names or unannounced errors. But it does not actually hear NVDA or VoiceOver speak, so it is an informed inference, not a recording of lived experience. For anything that matters to real users, have a person who uses a screen reader daily test the actual flow.

How do I check keyboard navigation with BrowserBash?

Write a plain-English objective that tells the agent to use only the keyboard, then describe what should happen — for example, tab through every control in a form, confirm focus is always visible, and confirm focus returns to the trigger button after a modal closes. Run it with browserbash run for exploration, or save it as a committable *_test.md file and run it with browserbash testmd run in CI. The agent drives real Chrome, presses the keys, and reports any control it cannot reach.

Is AI accessibility testing free?

The BrowserBash CLI is free and open source under Apache-2.0, and it defaults to free local models via Ollama, so you can run a genuine $0 accessibility pass with no API keys and nothing leaving your machine. If a hard flow needs a more capable hosted model, you bring your own Anthropic or OpenRouter key, and OpenRouter even offers some genuinely free hosted models. The local dashboard is free too; the optional cloud dashboard is opt-in and keeps free uploaded runs for 15 days.

Accessibility is one of those areas where the deterministic scanner and the reasoning agent genuinely complement each other, and the cheapest way to find out where your app falls short for keyboard and assistive-tech users is to point an agent at a real browser and describe the experience you expect in plain English. Install it with npm install -g browserbash-cli, write your first keyboard-navigation objective, and run it locally — no account needed. When you want run history, video replay, and a place to share failures with your team, the optional free dashboard sign-up is there when you need it.