The CLI gives you a browser you can type at. The MCP server gives an AI agent a browser it can think with. Same 85 tools, same capabilities — but now the caller isn't you, it's Claude, Cursor, or whatever agent you've wired in. This issue covers every tool, category by category, and the patterns that make agent-driven browser automation actually work.
Navigating: reload · get_url · get_title · wait_for_url
Finding and reading elements: get_attribute · get_value · get_text · get_html · a11y_tree · is_visible · is_enabled · is_checked · scroll_into_view
Interacting with elements: press · keys · check · uncheck · select · drag · hover · focus · scroll · mouse_click · mouse_down · mouse_move · mouse_up · upload · dialog_accept
Pages and frames: list_pages · switch_page · frame · frames
Capturing output: diff_map · record_start · record_stop · record_start_chunk · record_stop_chunk · record_start_group · record_stop_group
Managing state: get_cookies · delete_cookies · set_viewport · get_viewport · set_window · get_window · set_geolocation · restore_storage · storage_state · download_set_dir
Browser: wait_for_load · wait_for_text · wait_for_fn · emulate_media · evaluate · dialog_dismiss
Clock: set_system_time · set_timezone · pause_at · resume · fast_forward · run_for
Wiring it in
Run vibium mcp from your terminal. That's it — the server starts on stdio and
registers all 85 tools immediately. From the agent's side, it looks like any other MCP server:
a list of tool definitions it can call with structured JSON arguments.
To connect it to Claude Desktop, add a server entry to your MCP config file
(~/Library/Application Support/Claude/claude_desktop_config.json on macOS):
{
  "mcpServers": {
    "vibium": {
      "command": "vibium",
      "args": ["mcp"]
    }
  }
}
For Cursor, open Settings → MCP and add the same entry. Restart the client and the tools appear in the agent's context automatically. The agent doesn't need to know anything about Vibium's internals — it just sees 85 named tools with typed parameters and descriptions.
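If the config file already lists other servers, the entry has to be merged in rather than pasted over the top. Here is a minimal sketch of that merge, assuming the file is plain JSON with a top-level mcpServers object as in the entry above. The helper names add_vibium_server and update_config_file are hypothetical, not part of Vibium.

```python
import json
from pathlib import Path

# macOS location quoted in the text above.
CONFIG_PATH = Path.home() / "Library/Application Support/Claude/claude_desktop_config.json"


def add_vibium_server(config: dict) -> dict:
    """Return a copy of the config with the vibium entry merged in."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["vibium"] = {"command": "vibium", "args": ["mcp"]}
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path = CONFIG_PATH) -> None:
    """Read the existing config (if any), merge the entry, and write it back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(add_vibium_server(existing), indent=2) + "\n")
```

Keeping the merge as a pure function makes it easy to check that existing server entries survive before anything touches the real file.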
The MCP server is the bridge between natural-language intent and precise browser control. The agent decides what to do; Vibium decides how to do it.
One thing worth knowing before you start: browser_start and browser_stop
are explicit tools in the MCP surface. Most agentic workflows begin with a browser_start
call and end with browser_stop. If you want a persistent browser session across
multiple agent turns, start it once and don't stop it between calls — the session stays open
as long as the MCP server process runs.
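For workflows that should always clean up after themselves, the start/stop pairing can be wrapped so the browser is stopped even when a step fails. This is a sketch only: call_tool stands in for whatever function your MCP client uses to invoke a tool, and passing empty arguments to browser_start and browser_stop is an assumption.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator

# Hypothetical client hook: call_tool(name, arguments) -> result.
ToolCaller = Callable[[str, dict], Any]


@contextmanager
def browser_session(call_tool: ToolCaller) -> Iterator[ToolCaller]:
    """Start the browser on entry and always stop it on exit."""
    call_tool("browser_start", {})
    try:
        yield call_tool
    finally:
        # Runs even if a tool call inside the block raises.
        call_tool("browser_stop", {})
```

For a persistent session across agent turns you would skip the wrapper and call browser_start once, as described above.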
Navigating
Seven tools. Navigation is where every agentic session begins, and browser_navigate
is the one the agent reaches for first. It accepts a url parameter and an optional
waitUntil hint — load, domcontentloaded,
networkidle, or commit.
{
"url": "https://github.com/login",
"waitUntil": "networkidle"
}
The default wait behaviour is sufficient for most pages. Use networkidle explicitly
for single-page apps with delayed data fetching — it ensures the page has settled before
the agent attempts to read or interact with anything.
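On the wire, a tool call like the one above travels as a JSON-RPC 2.0 request with the method tools/call, which is how MCP clients invoke any tool. The sketch below builds that envelope. The function name tool_call_request is hypothetical, and a real client also has to complete the MCP initialize handshake before sending it.

```python
import json
from itertools import count

# Each JSON-RPC request needs a unique id so responses can be matched up.
_ids = count(1)


def tool_call_request(name: str, arguments: dict) -> str:
    """Serialise one MCP tools/call request as a single line of JSON."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    # The stdio transport is newline-delimited, so the payload must stay on one line.
    return json.dumps(message)


request = tool_call_request(
    "browser_navigate",
    {"url": "https://github.com/login", "waitUntil": "networkidle"},
)
```

In practice the agent's client library builds this for you. Seeing the envelope is mostly useful when debugging a server from a raw terminal.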
browser_get_url and browser_get_title are read-only orientation tools.
Agents use them after navigation to confirm they landed on the expected page — a reliable
sanity check before committing to a sequence of interactions. browser_wait_for_url
blocks until the current URL matches a string, glob, or regex pattern, which is useful
after form submissions or redirects where the agent needs to wait for the destination
before proceeding.
Finding and reading elements
Thirteen tools. The most important of these — by a significant margin — is
browser_map. It takes no parameters and returns every interactive element
on the page: roles, labels, bounding boxes, and a @ref handle for each.
It's how an agent orients itself on an unfamiliar page without needing to read the DOM.
[1] button "Sign in" @e8
[2] textbox "Username or email address" @e12
[3] textbox "Password" @e15
[4] link "Forgot password?" @e19
[5] link "Create an account" @e23
Those @ref handles are what make agentic workflows efficient. Once the agent
has called browser_map, it can pass @e8 directly to
browser_click or browser_fill — no CSS selectors, no XPath,
no guessing about the DOM structure.
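If you are scripting against the map yourself rather than letting the agent read it, the listing is easy to turn into records. This sketch assumes the output is line-oriented text in exactly the shape shown above. The real response may be structured differently, and parse_map and ref_for are hypothetical helpers.

```python
import re
from typing import List, NamedTuple, Optional


class MapEntry(NamedTuple):
    index: int
    role: str
    label: str
    ref: str


# Matches lines such as: [1] button "Sign in" @e8
_LINE = re.compile(r'^\[(\d+)\]\s+(\S+)\s+"(.*)"\s+(@\w+)$')


def parse_map(text: str) -> List[MapEntry]:
    """Parse a map listing into entries, skipping lines that do not match."""
    entries = []
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            index, role, label, ref = match.groups()
            entries.append(MapEntry(int(index), role, label, ref))
    return entries


def ref_for(entries: List[MapEntry], role: str, label: str) -> Optional[str]:
    """Return the @ref handle for the first entry with this role and label."""
    for entry in entries:
        if entry.role == role and entry.label == label:
            return entry.ref
    return None


SAMPLE = '''[1] button "Sign in" @e8
[2] textbox "Username or email address" @e12
[3] textbox "Password" @e15'''
```

Looking up by role and label and then acting on the returned handle mirrors what the agent does implicitly when it reads the map.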
browser_find is the targeted variant — it resolves a single element by
role + text, CSS selector, or a combination. Use it when the
agent knows exactly what it's looking for. browser_find_all returns every
element matching a CSS selector — useful for iterating over lists, tables, or repeated
components.
The three boolean tools — browser_is_visible, browser_is_enabled,
browser_is_checked — are guard tools. An agent that checks whether a button is
enabled before clicking it avoids a class of errors that would otherwise surface as cryptic
failures mid-sequence. browser_scroll_into_view brings an off-screen element
into the viewport before the agent tries to interact with it — necessary on long pages
where lazy-rendering hides elements until they're scrolled to.
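The guard pattern reads naturally as a small helper. The sketch below assumes the guard tools accept a selector argument and return plain booleans, and that browser_click accepts a selector too. None of that is confirmed here, and call_tool is a stand-in for your MCP client.

```python
from typing import Any, Callable

# Hypothetical client hook: call_tool(name, arguments) -> result.
ToolCaller = Callable[[str, dict], Any]


def click_if_ready(call_tool: ToolCaller, selector: str) -> bool:
    """Scroll to the element, check the guards, and click only if both pass."""
    target = {"selector": selector}
    call_tool("browser_scroll_into_view", target)
    if not call_tool("browser_is_visible", target):
        return False
    if not call_tool("browser_is_enabled", target):
        return False
    call_tool("browser_click", target)
    return True
```

Returning False instead of raising lets the caller decide whether to wait, retry, or report that the button never became usable.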
browser_a11y_tree dumps the full accessibility tree. It's slower than
browser_map but more complete — use it when an agent needs ARIA states,
live region content, or element relationships that the interactive map doesn't surface.
Interacting with elements
Nineteen tools — the largest category. The three you'll see in most agentic sessions are
browser_fill, browser_click, and browser_press.
Together they cover the vast majority of form interactions.
browser_fill {"selector": "#login_field", "value": "user@example.com"}
browser_fill {"selector": "#password", "value": "••••••••"}
browser_click {"role": "button", "text": "Sign in"}
browser_fill clears the field before typing — it's always the right choice
for inputs where the agent is setting a value from scratch. browser_type
appends without clearing, which matters when the agent needs to exercise autocomplete
or typeahead that fires on each keystroke.
The four low-level mouse tools — browser_mouse_click,
browser_mouse_down, browser_mouse_move,
browser_mouse_up — operate on raw coordinates. They exist for interactions
that higher-level tools can't express: canvas elements, custom drag handles, pixel-precise
gestures. Most agents won't need them, but when a target doesn't have a semantic role
or accessible label, coordinates are the fallback.
For destructive actions that trigger a confirmation dialog — deleting a record, navigating
away from unsaved changes — the safest pattern is to call browser_dialog_accept
before the triggering action, not after. The same applies to browser_dialog_dismiss
in the Browser category. Pre-registering the handler eliminates any race between the dialog
appearing and the agent responding to it.
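The ordering is the whole point, so it is worth encoding once. A minimal sketch, assuming both dialog tools can be called with empty arguments and that browser_click accepts a selector. call_tool is a hypothetical stand-in for the MCP client.

```python
from typing import Any, Callable

# Hypothetical client hook: call_tool(name, arguments) -> result.
ToolCaller = Callable[[str, dict], Any]


def click_with_dialog(call_tool: ToolCaller, selector: str, accept: bool = True) -> None:
    """Register the dialog handler first, then trigger the action that opens it."""
    handler = "browser_dialog_accept" if accept else "browser_dialog_dismiss"
    call_tool(handler, {})
    call_tool("browser_click", {"selector": selector})
```

Calling click_with_dialog(call_tool, "#delete-record") registers the accept handler before the click, so there is no window in which the dialog is open and unhandled.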
browser_upload takes an array of file paths and populates an
input[type=file] element without triggering the OS file picker. Pass
absolute paths: relative paths resolve against the MCP server's working
directory, which is whatever directory vibium mcp was started from.
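One way to sidestep the working-directory question is to resolve every path on the caller's side before the tool call is made. In this sketch the argument names selector and files are assumptions about the tool's schema, and upload_arguments is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def upload_arguments(selector: str, files: List[str]) -> dict:
    """Build browser_upload arguments with every path made absolute."""
    # resolve() anchors relative paths at the caller's current directory,
    # so the result no longer depends on where the server was started.
    absolute = [str(Path(f).expanduser().resolve()) for f in files]
    return {"selector": selector, "files": absolute}
```

The resulting dictionary can be passed straight through as the tool's arguments.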
Pages and frames
Six tools. Multi-page management is one of the areas where agent-driven automation most
commonly breaks without explicit handling. When a click opens a new tab, the agent's
active context is still on the original page — it needs to call browser_list_pages
to see what opened, then browser_switch_page to move into it.
browser_list_pages {}
→ [0] https://example.com (active)
[1] https://example.com/docs
browser_switch_page {"pageIndex": 1}
→ Switched to page 1
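To detect a newly opened tab mechanically, compare the page list from before and after the click. This sketch assumes the listing is text in the shape shown above, that page indexes are stable, and that new tabs get fresh indexes. parse_pages and newly_opened are hypothetical helpers.

```python
import re
from typing import List, NamedTuple


class PageInfo(NamedTuple):
    index: int
    url: str
    active: bool


# Matches lines such as: [0] https://example.com (active)
_PAGE = re.compile(r'^\[(\d+)\]\s+(\S+)(\s+\(active\))?$')


def parse_pages(text: str) -> List[PageInfo]:
    """Parse a page listing, tolerating the leading arrow and indentation."""
    pages = []
    for line in text.splitlines():
        match = _PAGE.match(line.strip().lstrip("→").strip())
        if match:
            pages.append(PageInfo(int(match.group(1)), match.group(2), bool(match.group(3))))
    return pages


def newly_opened(before: List[PageInfo], after: List[PageInfo]) -> List[PageInfo]:
    """Return pages whose index was not present in the earlier listing."""
    known = {page.index for page in before}
    return [page for page in after if page.index not in known]
```

The index of the first entry returned by newly_opened is the value to pass as pageIndex to browser_switch_page.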
browser_new_page opens a fresh tab, optionally navigating immediately. Use
it when the agent needs a clean context alongside an existing session — for example,
opening a reference page while keeping the primary workflow page intact.
browser_close_page closes the current page or a specific one by index.
browser_frame switches the execution context into an iframe matched by name
or URL fragment. Until the agent calls browser_frame to enter an iframe,
every find and interact tool operates on the top-level page — elements inside the frame
are invisible. browser_frames lists all iframes on the current page,
giving the agent the information it needs to decide which frame to enter.
Capturing output
Ten tools. browser_screenshot is the most-used of the group — agents call
it both to document what they see and as a verification step after an action. Pass
fullPage: true to capture the full scrollable document, or a selector
to crop to a specific element.
// Default capture, without fullPage or a selector
browser_screenshot {"path": "./after-login.png"}
// Full page — captures everything below the fold
browser_screenshot {"fullPage": true, "path": "./full-page.png"}
// Element crop — useful for annotated bug reports
browser_screenshot {"selector": ".error-banner", "path": "./error.png"}
browser_highlight draws a visible outline around a matched element before
screenshotting. The combination — highlight then screenshot — produces annotated images
that make it unambiguous which element the agent was targeting. Useful when the agent
is producing a bug report or a step-by-step trace.
browser_diff_map compares the current interactive element map against a
previously saved snapshot and returns what changed. It's a lightweight structural diff —
not pixel-level visual regression, but a semantic one: which buttons appeared, which
disappeared, which labels changed.
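To make "structural diff" concrete, here is the idea in miniature. It is an illustration of the concept and not Vibium's algorithm: it assumes each snapshot is a mapping from @ref handle to a (role, label) pair and that handles stay stable between snapshots.

```python
from typing import Dict, List, Tuple

Element = Tuple[str, str]  # (role, label)


def diff_maps(old: Dict[str, Element], new: Dict[str, Element]) -> Dict[str, List[str]]:
    """Report which refs were added, removed, or changed between two snapshots."""
    added = sorted(ref for ref in new if ref not in old)
    removed = sorted(ref for ref in old if ref not in new)
    changed = sorted(ref for ref in new if ref in old and new[ref] != old[ref])
    return {"added": added, "removed": removed, "changed": changed}


before = {"@e8": ("button", "Sign in"), "@e19": ("link", "Forgot password?")}
after = {"@e8": ("button", "Signing in..."), "@e30": ("button", "Cancel")}
result = diff_maps(before, after)
```

Here result reports @e30 as added, @e19 as removed, and @e8 as changed, which is the kind of summary an agent can act on without looking at pixels.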
The recording tools are designed around a hierarchy: a session contains groups, groups
contain chunks. Start with browser_record_start, open groups with
browser_record_start_group, label individual steps with
browser_record_start_chunk. The resulting recording carries a structured
log of what happened at each level — useful for producing replay-ready test artifacts
from agent runs.
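The session, group, chunk hierarchy maps neatly onto nested context managers, which guarantee that every start gets its matching stop in the right order. This is a sketch: the name argument on groups and chunks is an assumption about the tool schema, and call_tool is a hypothetical stand-in for the MCP client.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator

# Hypothetical client hook: call_tool(name, arguments) -> result.
ToolCaller = Callable[[str, dict], Any]


@contextmanager
def _scope(call_tool: ToolCaller, start: str, stop: str, args: dict) -> Iterator[None]:
    """Call the start tool on entry and the matching stop tool on exit."""
    call_tool(start, args)
    try:
        yield
    finally:
        call_tool(stop, {})


def recording(call_tool: ToolCaller):
    return _scope(call_tool, "browser_record_start", "browser_record_stop", {})


def group(call_tool: ToolCaller, name: str):
    return _scope(call_tool, "browser_record_start_group", "browser_record_stop_group", {"name": name})


def chunk(call_tool: ToolCaller, name: str):
    return _scope(call_tool, "browser_record_start_chunk", "browser_record_stop_chunk", {"name": name})
```

Nesting "with recording(...)", "with group(...)", and "with chunk(...)" around the interaction calls produces the start and stop calls in strict inside-out order.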
Managing state
Twelve tools. The cookie and storage tools are the fastest path to an authenticated
session — call browser_set_cookie with a valid session token before
navigating, and the browser arrives at the target URL already logged in. No need for
the agent to work through the login flow on every run.
{
  "name": "session_token",
  "value": "eyJhbGci...",
  "domain": ".example.com",
  "path": "/"
}
browser_storage_state exports the full session — cookies plus
localStorage — to a JSON object. browser_restore_storage
loads it back. This pair is the right way to persist and reuse an authenticated session
across agent runs: export once after logging in, restore at the top of every subsequent
run.
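The export-once, restore-every-run pattern might look like this on the calling side. It assumes browser_storage_state returns a JSON-serialisable object and that browser_restore_storage accepts it under a state argument. Both are assumptions, and call_tool is a hypothetical stand-in for the MCP client.

```python
import json
from pathlib import Path
from typing import Any, Callable

# Hypothetical client hook: call_tool(name, arguments) -> result.
ToolCaller = Callable[[str, dict], Any]


def save_session(call_tool: ToolCaller, path: Path) -> None:
    """Export cookies and localStorage and write them to disk."""
    state = call_tool("browser_storage_state", {})
    path.write_text(json.dumps(state, indent=2))


def restore_session(call_tool: ToolCaller, path: Path) -> bool:
    """Load a saved session if one exists; return False when there is none."""
    if not path.exists():
        return False
    call_tool("browser_restore_storage", {"state": json.loads(path.read_text())})
    return True
```

A False return from restore_session is the cue to run the login flow once and then call save_session. Treat the saved file like a credential, since it contains live session tokens.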
browser_set_viewport sets the viewport dimensions. Call it before navigating so the
page lays out at the target size, which is the right way to simulate a specific screen
size for responsive layout testing.
browser_set_geolocation overrides the browser's reported GPS coordinates,
which controls everything that calls the Geolocation API: location-aware content,
region-restricted features, distance-based sorting.
browser_set_content replaces the entire page with an HTML string. It's the
fastest way for an agent to work with an isolated component — no dev server, no routing,
no external dependencies. Pass the component's markup and styles directly and the browser
renders it immediately.
browser_download_set_dir redirects all downloads to a named directory.
Set it before triggering any download — without it, files land in the browser's default
download location, which may be inconvenient to find programmatically.
Browser control and the virtual clock
Eighteen tools across two categories. The Browser category covers lifecycle, timing,
JavaScript execution, and media emulation. The Clock category is its own surface entirely —
a virtual clock that replaces the browser's real-time API, giving the agent complete
control over Date, setTimeout, and setInterval.
browser_wait is the right timing tool for most situations. It polls a
CSS selector until the element reaches a target state: visible,
hidden, attached, or detached. An agent that
uses browser_wait instead of browser_sleep will finish
faster and fail less — it unblocks the moment the condition is met, not after a
fixed delay.
// Wait for the loading spinner to disappear
browser_wait {"selector": ".loading-spinner", "state": "hidden"}
// Wait for a success message to appear
browser_wait_for_text {"text": "Your changes have been saved"}
// Wait on an arbitrary JS condition
browser_wait_for_fn {"fn": "() => window.__appReady === true"}
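The reason condition-based waiting beats a fixed sleep is visible in a few lines of plain Python. This is a generic polling loop that illustrates the idea, not Vibium's implementation. The name wait_until and its parameters are hypothetical.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    timeout: float = 5.0,
    interval: float = 0.05,
) -> bool:
    """Poll the condition until it holds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            # Return as soon as the condition is met instead of waiting out a fixed delay.
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A fixed sleep always costs its full duration and still fails if the page is slower than expected. The polling loop costs only as long as the condition takes, up to the timeout.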
browser_evaluate runs a JavaScript expression in the page context and
returns the result. Use it when no other tool can reach the value the agent needs:
a computed property, a value in the application's internal state, a DOM measurement
that isn't exposed as an attribute.
browser_emulate_media overrides CSS media queries. Set
colorScheme: "dark" to test dark mode styling, or
media: "print" to see how the page renders in a print context — no
printer required.
The Clock category is the most specialised surface in the MCP server. Install the
virtual clock with page_clock_install early in the session, before
any time-sensitive code runs. Once installed, real time stops advancing for the page —
Date.now(), setTimeout, and setInterval all
respond to the virtual clock instead.
// Install the virtual clock at 2025-06-01T00:00:00Z, then set the timezone
page_clock_install {"time": 1748736000000}
page_clock_set_timezone {"timezone": "America/New_York"}
// Fast-forward 24 hours to test a time-based expiry
page_clock_fast_forward {"ms": 86400000}
// Run timers for 5 seconds then pause
page_clock_run_for {"ms": 5000}
This makes date-dependent behaviour — session expiry, scheduled content, countdown timers, cron-driven UI updates — fully testable without waiting for real time to pass or mocking at the test-framework level. The clock runs inside the browser, so it affects the page exactly as a real time change would.
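If the idea of a virtual clock is new, a toy model helps. The class below is a conceptual illustration in plain Python and not how Vibium implements its clock: time only moves when told to, and timers fire when virtual time passes their due point.

```python
import heapq
from typing import Callable, List, Tuple


class VirtualClock:
    """A toy clock where time advances only on request."""

    def __init__(self, start_ms: int) -> None:
        self.now_ms = start_ms
        self._timers: List[Tuple[int, int, Callable[[], None]]] = []
        self._seq = 0  # tie-breaker so callbacks are never compared

    def set_timeout(self, callback: Callable[[], None], delay_ms: int) -> None:
        """Schedule a callback to fire delay_ms after the current virtual time."""
        heapq.heappush(self._timers, (self.now_ms + delay_ms, self._seq, callback))
        self._seq += 1

    def run_for(self, ms: int) -> None:
        """Advance virtual time by ms, firing each due timer in order."""
        end = self.now_ms + ms
        while self._timers and self._timers[0][0] <= end:
            due, _, callback = heapq.heappop(self._timers)
            self.now_ms = due
            callback()
        self.now_ms = end


clock = VirtualClock(1748736000000)  # same start time as the example above
fired: List[int] = []
clock.set_timeout(lambda: fired.append(clock.now_ms), 60 * 60 * 1000)  # one-hour expiry
clock.run_for(24 * 60 * 60 * 1000)  # 86400000 ms, i.e. 24 hours
```

After the run, the one-hour timer has fired exactly once at its due time and the clock sits 24 hours ahead, with no real time spent waiting.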
85 tools. 8 categories. The same browser capabilities available from the CLI, now addressable by any agent that speaks MCP.
The pattern that matters most is browser_map first, interact with
@ref handles second. It's more reliable than CSS selectors, more readable
in agent traces, and it keeps the agent grounded in what the page actually contains
rather than what the agent assumes it contains.
The next issue covers the TypeScript/JavaScript API — 68 methods for driving the browser from Node.js, structured around the same surface with the ergonomics of a native async client library.