Measured, published in full

Reflex finishes the same flows in a fraction of the round trips.

TL;DR: 6 tool calls and ~13K tokens vs 14 calls and ~56K for Playwright MCP across the same three measured flows.

Fewer round trips, faster end to end

The round-trip and token counts below are why; this is the payoff. In a second test, a real agent (Claude Fable 5) drove both tools through 6 fresh live-site tasks, start to finish, with a fresh agent per flow so neither side reused context. Reflex completed them in 2.8x fewer round trips, on 2.6x less context in aggregate (and 4-29x smaller on the page reads themselves), and finished ~1.7x faster end to end, with all 6 tasks succeeding on both tools.

End-to-end time, 6 agent-driven flows (less is better)

Playwright MCP

529s

Reflex

309s

1.4-2.2x faster per flow. Reflex's own tool calls are a touch slower; what wins is making far fewer round trips and handing the model far less to read each turn. The wall-clock ratio is conservative here: fixed per-run overhead is symmetric on both sides, so it dilutes the gap toward 1x.

That is the whole mechanism: end-to-end time is governed by how many times the model stops to read and how much it must read each time, not by raw tool latency. It is a conservative floor; the heavier the page, the wider the gap. Single live-site runs, measured 2026-07-22; treat wall clock as indicative.

Across the 3 flows verified head-to-head against Playwright MCP, the free default for Claude Desktop and other pure MCP clients, Reflex completed the same tasks in 2 tool calls per flow, 6 in total, where Playwright MCP needed 14. Agent round trips, the model turns where it stops to read output, are what dominate real wall clock, and a batched Reflex flow spends far fewer of them.

Agent round trips, 3 flows (fewer is better)

Playwright MCP

14 turns

Reflex

6 turns

Tool calls match turns exactly: 14 vs 6. One task is 2-3 Reflex calls against 6-14 MCP round trips.

Reflex: 5/6 tasks completedTied best first-try success0-token deterministic replay

A fraction of the tokens into context

Every token a tool puts into the agent's context is a token the agent pays for and a token it can no longer spend on the actual job. Over the 3 verified flows, Reflex sent ~13K tokens into context where Playwright MCP sent ~56K: roughly 4x fewer. Reflex distills each page once, then sends deltas instead of re-dumping.

Tokens into agent context, 3 flows (fewer is better)

Playwright MCP

~56K tok

Reflex

~13K tok

The amber bar is the context Reflex hands back to you to spend on the task.

The context window is the bigger half of the story. One browser_snapshot look at the W3C CSS Grid spec is ~169K tokens, most of a 200K window gone in a single tool call before the agent has done anything. The same look through Reflex is ~7K tokens. Context the agent keeps is context it spends on the work you asked for.

The cheapest look on every page

The cost an agent pays to read a page it does not already know is the exploration price, and Reflexis the cheapest on every page tested. Playwright MCP's browser_snapshot inlines the full accessibility tree; Reflex distills it. Bars are scaled within each chart.

Wikipedia article

Playwright MCP

~31K tok

Reflex

~4K tok

GitHub repo

Playwright MCP

~21K tok

Reflex

~3K tok

Hacker News

Playwright MCP

~15K tok

Reflex

~2K tok

W3C CSS Grid spec

Playwright MCP

~169K tok

Reflex

~7K tok

~26x cheaper. This is the architectural win: distillation with folding, not incremental trimming.

The cheapest look across the wider field

If your agent has a shell, the field is wider than Playwright MCP: Microsoft's token-efficient Playwright CLI and Vercel's agent-browser are strong free alternatives. Reflex is still the cheapest look on every page, and on pathological pages the gap is architectural.

Page	PW MCP snapshot	PW CLI yml file	agent-browser -i	Reflex
Wikipedia article	~31K tok	~31K tok	~4K tok	~4K tok
GitHub repo	~21K tok	~21K tok	~6K tok	~3K tok
Hacker News	~15K tok	~14K tok	~3K tok	~2K tok
W3C CSS Grid spec	~169K tok	~171K tok	~37K tok	~7K tok

More reasons Reflex wins

Verification is built into the batch. 2 turns per flow with expectations checked inside the single guarded call, so a flow either lands or fails loudly without an extra round trip to confirm.
Forensic failures. On the one shared miss, Reflex returned near-miss candidates, the console feed, and a screenshot. The rest returned a single error line, so a repair costs one agent turn instead of several.
Tied-best first-try success (5/6), the highest in the field.
0-token deterministic replay with self-healing. A saved flow re-runs with no model in the loop and no tokens spent; no other tool here has an equivalent.

Methodology

Measured 2026-06-12 on macOS, same machine, live public sites, with no LLM on either side. Versions pinned: @playwright/mcp 0.0.76, @playwright/cli 0.1.14, agent-browser 0.27.2 (Vercel). Same 6 real-world flows, same end-condition verification, each tool driven per its own agent documentation. Tool-side wall clock is a second or two slower per flow than the fastest tool, while the agent round trips that dominate real time are far fewer. Reproduce it yourself: node bench/headtohead.js (competitor tools install under bench/tools/).

See the install docs, Reflex vs Playwright MCP, or the pricing to get started.

Ready when you are

Same browser tasks. ~13K tokens, not ~56K.

Get 30 free credits

Installed in 2 minutes. Your pages stay yours. Or install first and skip the account.