Measured, published in full
Reflex finishes the same flows in a fraction of the round trips.
TL;DR: 6 tool calls and ~13K tokens vs 14 calls and ~56K for Playwright MCP across the same three measured flows.
Across the 3 flows verified head-to-head against Playwright MCP (the free default for Claude Desktop and other pure MCP clients), Reflex completed the same tasks in 2 tool calls per flow, 6 in total, where Playwright MCP needed 14. Agent round trips, the model turns where the agent stops to read output, dominate real wall-clock time, and a batched Reflex flow spends far fewer of them.
Tool calls match turns exactly: 6 for Reflex vs 14 for Playwright MCP. One task is 2-3 Reflex calls against 6-14 MCP round trips.
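The headline ratios follow directly from the figures quoted above. The snippet below is a minimal arithmetic check, not part of the benchmark harness; the dictionary keys are illustrative names, and the token counts are the rounded ~13K and ~56K values, so the results are approximate.

```python
# Headline figures quoted above for the 3 head-to-head flows.
# Token counts are the rounded values (in thousands), so ratios are approximate.
reflex = {"tool_calls": 6, "tokens_k": 13}
playwright_mcp = {"tool_calls": 14, "tokens_k": 56}


def ratio(metric):
    """How many times more Playwright MCP spends than Reflex on a metric."""
    return playwright_mcp[metric] / reflex[metric]


call_ratio = ratio("tool_calls")
token_ratio = ratio("tokens_k")

print(f"tool calls: {call_ratio:.1f}x, tokens: {token_ratio:.1f}x")
# prints "tool calls: 2.3x, tokens: 4.3x"
```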
A fraction of the tokens into context
Every token a tool puts into the agent's context is a token the agent pays for and a token it can no longer spend on the actual job. Over the 3 verified flows, Reflex sent ~13K tokens into context where Playwright MCP sent ~56K: roughly 4x fewer. Reflex distills each page once, then sends deltas instead of re-dumping.
The amber bar is the context Reflex hands back to you to spend on the task.
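To illustrate the general idea of sending a page once and then only deltas, here is a minimal sketch. It is not Reflex's implementation: the page model (a flat dictionary of element id to text) and the function names `first_look` and `next_look` are assumptions made purely for illustration.

```python
# Hypothetical page model: element id -> visible text.
# This only illustrates "full snapshot once, deltas afterwards".


def first_look(page):
    """First visit: send the whole (already distilled) page."""
    return {"full": dict(page)}


def next_look(previous, current):
    """Later visits: send only what was added, changed, or removed."""
    return {
        "added": {k: v for k, v in current.items() if k not in previous},
        "changed": {
            k: v for k, v in current.items()
            if k in previous and previous[k] != v
        },
        "removed": sorted(k for k in previous if k not in current),
    }


before = {"h1": "Cart", "btn-1": "Checkout", "total": "$10"}
after = {"h1": "Cart", "btn-1": "Checkout", "total": "$25", "promo": "Apply code"}

delta = next_look(before, after)
print(delta)
# {'added': {'promo': 'Apply code'}, 'changed': {'total': '$25'}, 'removed': []}
```

In this toy case the second look carries two entries instead of four; the unchanged heading and button are not sent again.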
The context window is the bigger half of the story. One browser_snapshot look at the W3C CSS Grid spec is ~169K tokens, most of a 200K window gone in a single tool call before the agent has done anything. The same look through Reflex is ~7K tokens. Context the agent keeps is context it spends on the work you asked for.
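The window arithmetic can be checked in a few lines. This sketch uses only the rounded figures quoted above (a 200K window, ~169K and ~7K per look); the variable names are illustrative.

```python
# Rounded figures from the paragraph above, in thousands of tokens.
WINDOW_K = 200
looks_k = {"Playwright MCP browser_snapshot": 169, "Reflex": 7}

# Share of the window one look consumes, and what is left afterwards.
share = {tool: cost / WINDOW_K for tool, cost in looks_k.items()}
remaining_k = {tool: WINDOW_K - cost for tool, cost in looks_k.items()}

for tool in looks_k:
    print(f"{tool}: {share[tool]:.1%} of the window, ~{remaining_k[tool]}K left")
```

On these numbers a single browser_snapshot look leaves about 31K tokens of a 200K window, while the Reflex look leaves about 193K.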
The cheapest look on every page
The cost an agent pays to read a page it does not already know is the exploration price, and Reflex is the cheapest on every page tested. Playwright MCP's browser_snapshot inlines the full accessibility tree; Reflex distills it. Bars are scaled within each chart.
~26x cheaper. This is the architectural win: distillation with folding, not incremental trimming.
The cheapest look across the wider field
If your agent has a shell, the field is wider than Playwright MCP: Microsoft's token-efficient Playwright CLI and Vercel's agent-browser are strong free alternatives. Reflex is still the cheapest look on every page (level with agent-browser on the Wikipedia article at the rounding shown), and on pathological pages the gap is architectural.
| Page | PW MCP snapshot | PW CLI yml file | agent-browser -i | Reflex |
|---|---|---|---|---|
| Wikipedia article | ~31K tok | ~31K tok | ~4K tok | ~4K tok |
| GitHub repo | ~21K tok | ~21K tok | ~6K tok | ~3K tok |
| Hacker News | ~15K tok | ~14K tok | ~3K tok | ~2K tok |
| W3C CSS Grid spec | ~169K tok | ~171K tok | ~37K tok | ~7K tok |
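The per-page gaps can be read off the table with a short calculation. The sketch below hard-codes the rounded table values (in thousands of tokens), so the ratios are approximate and may differ from ratios taken on the raw measurements; the dictionary keys are illustrative shorthand for the column headers.

```python
# Rounded values from the table above, in thousands of tokens.
snapshot_k = {
    "Wikipedia article": {"pw_mcp": 31, "pw_cli": 31, "agent_browser": 4, "reflex": 4},
    "GitHub repo": {"pw_mcp": 21, "pw_cli": 21, "agent_browser": 6, "reflex": 3},
    "Hacker News": {"pw_mcp": 15, "pw_cli": 14, "agent_browser": 3, "reflex": 2},
    "W3C CSS Grid spec": {"pw_mcp": 169, "pw_cli": 171, "agent_browser": 37, "reflex": 7},
}


def times_cheaper(page, rival):
    """Rival's token cost divided by Reflex's cost for the same page."""
    row = snapshot_k[page]
    return row[rival] / row["reflex"]


for page in snapshot_k:
    ratios = ", ".join(
        f"{rival} {times_cheaper(page, rival):.1f}x"
        for rival in ("pw_mcp", "pw_cli", "agent_browser")
    )
    print(f"{page}: {ratios}")
```

On the rounded values, the gap to agent-browser ranges from 1.0x on the Wikipedia article to about 5.3x on the W3C CSS Grid spec.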
More reasons Reflex wins
- Verification is built into the batch. Each flow takes 2 turns, with expectations checked inside the single guarded call, so a flow either lands or fails loudly without an extra round trip to confirm.
- Forensic failures. On the one shared miss, Reflex returned near-miss candidates, the console feed, and a screenshot, so a repair costs one agent turn instead of several. The other tools returned a single error line.
- First-try success of 5/6, tied for the highest in the field.
- 0-token deterministic replay with self-healing. A saved flow re-runs with no model in the loop and no tokens spent; no other tool here has an equivalent.
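The "verification inside the batch" idea from the first bullet can be sketched generically. This is not Reflex's API: `run_guarded_batch`, the step tuples, and the in-memory `page` dictionary are all hypothetical, chosen only to show how one call can run several steps, check an expectation after each, and report the first failure without another round trip.

```python
# Hypothetical sketch of a guarded batch: every step carries its own expectation.
# The "page" is a plain dict standing in for browser state.


def run_guarded_batch(state, steps):
    """Run steps in order; stop and report at the first failed expectation."""
    for index, (name, action, expectation) in enumerate(steps):
        action(state)
        if not expectation(state):
            return {"ok": False, "failed_step": name, "completed": index,
                    "state": dict(state)}
    return {"ok": True, "failed_step": None, "completed": len(steps),
            "state": dict(state)}


page = {"query": "", "results": 0}
steps = [
    ("type query", lambda s: s.update(query="css grid"),
     lambda s: s["query"] == "css grid"),
    ("submit", lambda s: s.update(results=3),
     lambda s: s["results"] > 0),
]

report = run_guarded_batch(page, steps)
print(report["ok"], report["completed"])  # prints "True 2"
```

Because the report includes the failing step's name and the state at that point, a caller has something concrete to act on after a single call.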
Methodology
Measured 2026-06-12 on macOS, same machine, live public sites, with no LLM on either side. Versions pinned: @playwright/mcp 0.0.76, @playwright/cli 0.1.14, agent-browser 0.27.2 (Vercel). Same 6 real-world flows, same end-condition verification, each tool driven per its own agent documentation. Reflex's tool-side wall clock is a second or two slower per flow than the fastest tool, but it needs far fewer of the agent round trips that dominate real time. Reproduce it yourself: node bench/headtohead.js (competitor tools install under bench/tools/).
See the install docs or the pricing to get started.
Ready when you are
Same browser tasks. ~13K tokens, not ~56K.
Installed in 2 minutes. Your pages stay yours.