Issue No. 02: The Vibium MCP Server

The CLI gives you a browser you can type at. The MCP server gives an AI agent a browser it can think with. Same 85 tools, same capabilities — but now the caller isn't you, it's Claude, Cursor, or whatever agent you've wired in. This issue covers every tool, category by category, and the patterns that make agent-driven browser automation actually work.

Navigate

navigate · back · forward · reload · get_url · get_title · wait_for_url

Find & Read

find · find_all · map · count · get_attribute · get_value · get_text · get_html · a11y_tree · is_visible · is_enabled · is_checked · scroll_into_view

Interact

click · dblclick · fill · type · press · keys · check · uncheck · select · drag · hover · focus · scroll · mouse_click · mouse_down · mouse_move · mouse_up · upload · dialog_accept

Pages & Frames

new_page · close_page · list_pages · switch_page · frame · frames

Capture

screenshot · pdf · highlight · diff_map · record_start · record_stop · record_start_chunk · record_stop_chunk · record_start_group · record_stop_group

State

set_content · set_cookie · get_cookies · delete_cookies · set_viewport · get_viewport · set_window · get_window · set_geolocation · restore_storage · storage_state · download_set_dir

Browser

start · stop · sleep · wait · wait_for_load · wait_for_text · wait_for_fn · emulate_media · evaluate · dialog_dismiss

Clock

install · set_fixed_time · set_system_time · set_timezone · pause_at · resume · fast_forward · run_for

Wiring it in

Run vibium mcp from your terminal. That's it — the server starts on stdio and registers all 85 tools immediately. From the agent's side, it looks like any other MCP server: a list of tool definitions it can call with structured JSON arguments.

To connect it to Claude Desktop, add a server entry to your MCP config file (~/Library/Application Support/Claude/claude_desktop_config.json on macOS):

{
  "mcpServers": {
    "vibium": {
      "command": "vibium",
      "args": ["mcp"]
    }
  }
}

For Cursor, open Settings → MCP and add the same entry. Restart the client and the tools appear in the agent's context automatically. The agent doesn't need to know anything about Vibium's internals — it just sees 85 named tools with typed parameters and descriptions.
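
If your config file already lists other MCP servers, the entry has to be merged in rather than pasted over the top. Here is a minimal Python sketch of that merge. The "other-server" entry is a made-up placeholder, and reading or writing the actual file on disk is left out.

```python
import json


def add_vibium_server(config: dict) -> dict:
    """Return a copy of an MCP client config with the vibium entry added."""
    merged = dict(config)
    # Copy the existing server map so the caller's dict is not mutated.
    servers = dict(merged.get("mcpServers", {}))
    servers["vibium"] = {"command": "vibium", "args": ["mcp"]}
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing config with one unrelated server already registered.
existing = {"mcpServers": {"other": {"command": "other-server", "args": []}}}
updated = add_vibium_server(existing)
print(json.dumps(updated, indent=2))
```

The helper returns a new dict, so you can inspect the result before writing anything back to the config file.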

The MCP server is the bridge between natural-language intent and precise browser control. The agent decides what to do; Vibium decides how to do it.

One thing worth knowing before you start: browser_start and browser_stop are explicit tools in the MCP surface. Most agentic workflows begin with a browser_start call and end with browser_stop. If you want a persistent browser session across multiple agent turns, start it once and don't stop it between calls — the session stays open as long as the MCP server process runs.
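
On the wire, an MCP client invokes a tool by sending a JSON-RPC 2.0 request with the method tools/call. The sketch below builds the messages for a start, navigate, stop session. It assumes browser_start and browser_stop accept an empty arguments object, which this issue does not confirm, and it only constructs the messages rather than sending them.

```python
import json
from typing import Optional


def tool_call(request_id: int, name: str, arguments: Optional[dict] = None) -> dict:
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


# Assumption: start and stop take no arguments.
steps = [
    ("browser_start", None),
    ("browser_navigate", {"url": "https://github.com/login"}),
    ("browser_stop", None),
]
session = [tool_call(i, name, args) for i, (name, args) in enumerate(steps, start=1)]

for message in session:
    print(json.dumps(message))
```

Your MCP client normally builds these envelopes for you. Seeing the shape is mostly useful when you are reading agent traces or server logs.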

Navigating

browser_navigate · browser_back · browser_forward · browser_reload · browser_get_url · browser_get_title · browser_wait_for_url

Seven tools. Navigation is where every agentic session begins, and browser_navigate is the one the agent reaches for first. It accepts a url parameter and an optional waitUntil hint — load, domcontentloaded, networkidle, or commit.

// Agent calls browser_navigate
{
  "url": "https://github.com/login",
  "waitUntil": "networkidle"
}

The default wait behaviour is sufficient for most pages. Use networkidle explicitly for single-page apps with delayed data fetching — it ensures the page has settled before the agent attempts to read or interact with anything.

browser_get_url and browser_get_title are read-only orientation tools. Agents use them after navigation to confirm they landed on the expected page — a reliable sanity check before committing to a sequence of interactions. browser_wait_for_url blocks until the current URL matches a string, glob, or regex pattern, which is useful after form submissions or redirects where the agent needs to wait for the destination before proceeding.
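
If you generate tool arguments from your own code, it is cheap to reject a mistyped waitUntil value before it reaches the server. This is a small sketch of such a check. The helper name navigate_args is made up for illustration. The four accepted values are the ones listed above.

```python
from typing import Optional

# The four waitUntil hints named in this issue.
WAIT_UNTIL_VALUES = ("load", "domcontentloaded", "networkidle", "commit")


def navigate_args(url: str, wait_until: Optional[str] = None) -> dict:
    """Build the arguments object for a browser_navigate call."""
    args = {"url": url}
    if wait_until is not None:
        if wait_until not in WAIT_UNTIL_VALUES:
            raise ValueError(
                f"waitUntil must be one of {WAIT_UNTIL_VALUES}, got {wait_until!r}"
            )
        args["waitUntil"] = wait_until
    return args


print(navigate_args("https://github.com/login", "networkidle"))
print(navigate_args("https://example.com"))  # default wait behaviour
```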

Finding and reading elements

browser_find · browser_find_all · browser_map · browser_count · browser_get_attribute · browser_get_value · browser_get_text · browser_get_html · browser_a11y_tree · browser_is_visible · browser_is_enabled · browser_is_checked · browser_scroll_into_view

Thirteen tools. The most important of these — by a significant margin — is browser_map. It takes no parameters and returns every interactive element on the page: roles, labels, bounding boxes, and a @ref handle for each. It's how an agent orients itself on an unfamiliar page without needing to read the DOM.

// browser_map returns a structured list the agent can reason over
[1] button "Sign in" @e8
[2] textbox "Username or email address" @e12
[3] textbox "Password" @e15
[4] link "Forgot password?" @e19
[5] link "Create an account" @e23

Those @ref handles are what make agentic workflows efficient. Once the agent has called browser_map, it can pass @e8 directly to browser_click, or @e12 to browser_fill, with no CSS selectors, no XPath, and no guessing about the DOM structure.
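
If you post-process agent traces, the map listing is easy to turn into structured data. The sketch below parses lines in the format shown in the sample above and looks up a handle by role and label. It assumes that text layout exactly. The real tool result may be structured differently, so treat the regular expression as illustrative.

```python
import re
from typing import List, NamedTuple, Optional


class MapEntry(NamedTuple):
    index: int
    role: str
    label: str
    ref: str


# Matches lines such as: [3] textbox "Password" @e15
LINE = re.compile(r'^\[(\d+)\]\s+(\S+)\s+"(.*)"\s+(@\w+)$')


def parse_map(text: str) -> List[MapEntry]:
    """Parse a browser_map style listing, skipping lines that do not match."""
    entries = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            index, role, label, ref = match.groups()
            entries.append(MapEntry(int(index), role, label, ref))
    return entries


def ref_for(entries: List[MapEntry], role: str, label: str) -> Optional[str]:
    """Return the @ref handle for the first entry with this role and label."""
    for entry in entries:
        if entry.role == role and entry.label == label:
            return entry.ref
    return None


SAMPLE = """[1] button "Sign in" @e8
[2] textbox "Username or email address" @e12
[3] textbox "Password" @e15
[4] link "Forgot password?" @e19
[5] link "Create an account" @e23"""

entries = parse_map(SAMPLE)
print(ref_for(entries, "textbox", "Password"))
```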

browser_find is the targeted variant — it resolves a single element by role + text, CSS selector, or a combination. Use it when the agent knows exactly what it's looking for. browser_find_all returns every element matching a CSS selector — useful for iterating over lists, tables, or repeated components.

The three boolean tools — browser_is_visible, browser_is_enabled, browser_is_checked — are guard tools. An agent that checks whether a button is enabled before clicking it avoids a class of errors that would otherwise surface as cryptic failures mid-sequence. browser_scroll_into_view brings an off-screen element into the viewport before the agent tries to interact with it — necessary on long pages where lazy-rendering hides elements until they're scrolled to.
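
The guard pattern can be sketched as a small wrapper around whatever function your client uses to invoke tools. Here call_tool is a stand-in supplied by the caller, not a Vibium API. The sketch also assumes the guard tools return plain booleans and that every tool accepts the same selector argument. Check both against the real tool schemas.

```python
from typing import Any, Callable, Dict, List

ToolCaller = Callable[[str, dict], Any]


def guarded_click(call_tool: ToolCaller, target: dict) -> bool:
    """Scroll to the target, check the guards, and click only if both pass."""
    call_tool("browser_scroll_into_view", target)
    for guard in ("browser_is_visible", "browser_is_enabled"):
        if not call_tool(guard, target):
            return False  # Skip the click; the caller decides what happens next.
    call_tool("browser_click", target)
    return True


def make_stub(answers: Dict[str, Any], log: List[str]) -> ToolCaller:
    """Fake tool caller for demonstration: records names, returns canned answers."""
    def call_tool(name: str, arguments: dict) -> Any:
        log.append(name)
        return answers.get(name, True)
    return call_tool


log: List[str] = []
stub = make_stub({"browser_is_enabled": False}, log)
clicked = guarded_click(stub, {"selector": "#submit"})
print(clicked, log)
```

With the stub reporting a disabled button, the click is never issued and the function returns False, so the agent gets a clear signal instead of a failed interaction.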

browser_a11y_tree dumps the full accessibility tree. It's slower than browser_map but more complete — use it when an agent needs ARIA states, live region content, or element relationships that the interactive map doesn't surface.

Interacting with elements

browser_click · browser_dblclick · browser_fill · browser_type · browser_press · browser_keys · browser_check · browser_uncheck · browser_select · browser_drag · browser_hover · browser_focus · browser_scroll · browser_mouse_click · browser_mouse_down · browser_mouse_move · browser_mouse_up · browser_upload · browser_dialog_accept

Nineteen tools — the largest category. The three you'll see in most agentic sessions are browser_fill, browser_click, and browser_press. Together they cover the vast majority of form interactions.

// A typical login sequence across three tool calls
browser_fill  {"selector": "#login_field", "value": "user@example.com"}
browser_fill  {"selector": "#password", "value": "••••••••"}
browser_click {"role": "button", "text": "Sign in"}

browser_fill clears the field before typing — it's always the right choice for inputs where the agent is setting a value from scratch. browser_type appends without clearing, which matters when the agent needs to exercise autocomplete or typeahead that fires on each keystroke.

The four low-level mouse tools — browser_mouse_click, browser_mouse_down, browser_mouse_move, browser_mouse_up — operate on raw coordinates. They exist for interactions that higher-level tools can't express: canvas elements, custom drag handles, pixel-precise gestures. Most agents won't need them, but when a target doesn't have a semantic role or accessible label, coordinates are the fallback.

For destructive actions that trigger a confirmation dialog — deleting a record, navigating away from unsaved changes — the safest pattern is to call browser_dialog_accept before the triggering action, not after. The same applies to browser_dialog_dismiss in the Browser category. Pre-registering the handler eliminates any race between the dialog appearing and the agent responding to it.

browser_upload takes an array of file paths and populates an input[type=file] element without triggering the OS file picker. Pass absolute paths: the MCP server runs in whatever directory vibium mcp was started from, and relative paths resolve against that working directory.
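
One way to avoid working-directory surprises is to resolve every path before it goes into the tool call. The sketch below does that with pathlib. The "files" argument name is a guess used for illustration, since this issue does not name the parameter, and the fixture path is hypothetical.

```python
from pathlib import Path
from typing import Iterable, Union

PathLike = Union[str, Path]


def upload_args(selector: str, paths: Iterable[PathLike], base_dir: PathLike) -> dict:
    """Build upload arguments with every path made absolute against base_dir."""
    base = Path(base_dir).resolve()
    absolute = []
    for raw in paths:
        path = Path(raw)
        # Relative paths are anchored at base_dir, not at the server's cwd.
        absolute.append(str(path if path.is_absolute() else (base / path).resolve()))
    # Assumption: the parameter is called "files".
    return {"selector": selector, "files": absolute}


args = upload_args("input[type=file]", ["fixtures/report.pdf"], Path.cwd())
print(args)
```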

Pages and frames

browser_new_page · browser_close_page · browser_list_pages · browser_switch_page · browser_frame · browser_frames

Six tools. Multi-page management is one of the areas where agent-driven automation most commonly breaks without explicit handling. When a click opens a new tab, the agent's active context is still on the original page — it needs to call browser_list_pages to see what opened, then browser_switch_page to move into it.

// After clicking a link that opens a new tab
browser_list_pages  {}
→ [0] https://example.com (active)
   [1] https://example.com/docs

browser_switch_page {"pageIndex": 1}
→ Switched to page 1
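
A harness can work out which tab a click opened by comparing the page list before and after the action. The sketch below assumes the listing format from the sample above, with one page per line and an "(active)" marker. It also assumes existing pages keep their indices when a new one opens. Neither is guaranteed by this issue.

```python
import re
from typing import Dict, List

# Matches lines such as: [0] https://example.com (active)
PAGE_LINE = re.compile(r"^\[(\d+)\]\s+(\S+)(\s+\(active\))?$")


def parse_pages(text: str) -> List[Dict]:
    """Parse a browser_list_pages style listing into dicts."""
    pages = []
    for line in text.splitlines():
        match = PAGE_LINE.match(line.strip())
        if match:
            pages.append({
                "index": int(match.group(1)),
                "url": match.group(2),
                "active": match.group(3) is not None,
            })
    return pages


def newly_opened(before: List[Dict], after: List[Dict]) -> List[Dict]:
    """Return pages whose index was not present before the action."""
    known = {page["index"] for page in before}
    return [page for page in after if page["index"] not in known]


before = parse_pages("[0] https://example.com (active)")
after = parse_pages("[0] https://example.com (active)\n[1] https://example.com/docs")

opened = newly_opened(before, after)
switch_args = {"pageIndex": opened[0]["index"]}
print(switch_args)
```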

browser_new_page opens a fresh tab, optionally navigating immediately. Use it when the agent needs a clean context alongside an existing session — for example, opening a reference page while keeping the primary workflow page intact. browser_close_page closes the current page or a specific one by index.

browser_frame switches the execution context into an iframe matched by name or URL fragment. Until the agent calls browser_frame to enter an iframe, every find and interact tool operates on the top-level page — elements inside the frame are invisible. browser_frames lists all iframes on the current page, giving the agent the information it needs to decide which frame to enter.

Capturing output

browser_screenshot · browser_pdf · browser_highlight · browser_diff_map · browser_record_start · browser_record_stop · browser_record_start_chunk · browser_record_stop_chunk · browser_record_start_group · browser_record_stop_group

Ten tools. browser_screenshot is the most-used of the group — agents call it both to document what they see and as a verification step after an action. Pass fullPage: true to capture the full scrollable document, or a selector to crop to a specific element.

// Viewport screenshot (default)
browser_screenshot {"path": "./after-login.png"}

// Full page: captures everything below the fold
browser_screenshot {"fullPage": true, "path": "./full-page.png"}

// Element crop: useful for annotated bug reports
browser_screenshot {"selector": ".error-banner", "path": "./error.png"}

browser_highlight draws a visible outline around a matched element before screenshotting. The combination — highlight then screenshot — produces annotated images that make it unambiguous which element the agent was targeting. Useful when the agent is producing a bug report or a step-by-step trace.

browser_diff_map compares the current interactive element map against a previously saved snapshot and returns what changed. It's a lightweight structural diff — not pixel-level visual regression, but a semantic one: which buttons appeared, which disappeared, which labels changed.
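
To make the idea of a semantic diff concrete, here is a toy version in Python. It is not Vibium's algorithm. It models each snapshot as a dict from @ref handle to a (role, label) pair and assumes handles stay stable between snapshots, which may not hold in practice.

```python
from typing import Dict, Tuple

Snapshot = Dict[str, Tuple[str, str]]  # ref -> (role, label)


def diff_maps(before: Snapshot, after: Snapshot) -> dict:
    """Report elements that appeared, disappeared, or changed label."""
    appeared = {ref: after[ref] for ref in after.keys() - before.keys()}
    disappeared = {ref: before[ref] for ref in before.keys() - after.keys()}
    relabeled = {
        ref: (before[ref][1], after[ref][1])
        for ref in before.keys() & after.keys()
        if before[ref][1] != after[ref][1]
    }
    return {"appeared": appeared, "disappeared": disappeared, "relabeled": relabeled}


# Hypothetical snapshots taken before and after a login.
before = {"@e8": ("button", "Sign in"), "@e19": ("link", "Forgot password?")}
after = {"@e8": ("button", "Sign out"), "@e30": ("link", "Settings")}

changes = diff_maps(before, after)
print(changes)
```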

The recording tools are designed around a hierarchy: a session contains groups, groups contain chunks. Start with browser_record_start, open groups with browser_record_start_group, label individual steps with browser_record_start_chunk. The resulting recording carries a structured log of what happened at each level — useful for producing replay-ready test artifacts from agent runs.
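
Because every start call needs a matching stop at the same level, a harness can lean on context managers to keep the pairs balanced even when a step raises. In the sketch below, call_tool is a stand-in for your client's tool invoker, and the "name" argument used to label groups and chunks is a guess, since this issue does not list the recording parameters.

```python
from contextlib import contextmanager
from typing import Callable, List, Optional

# Tool name suffix for each level of the recording hierarchy.
_SUFFIX = {"session": "", "group": "_group", "chunk": "_chunk"}


@contextmanager
def recorded(call_tool: Callable[[str, dict], object], level: str,
             name: Optional[str] = None):
    """Start recording at one level and always stop it on exit."""
    suffix = _SUFFIX[level]  # Raises KeyError for an unknown level.
    # Assumption: labels are passed as a "name" argument.
    call_tool(f"browser_record_start{suffix}", {"name": name} if name else {})
    try:
        yield
    finally:
        call_tool(f"browser_record_stop{suffix}", {})


calls: List[str] = []


def call_tool(name: str, arguments: dict) -> None:
    """Fake tool caller that only records the order of calls."""
    calls.append(name)


with recorded(call_tool, "session"):
    with recorded(call_tool, "group", "login"):
        with recorded(call_tool, "chunk", "fill credentials"):
            call_tool("browser_fill", {"selector": "#login_field",
                                       "value": "user@example.com"})

print(calls)
```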

Managing state

browser_set_content · browser_set_cookie · browser_get_cookies · browser_delete_cookies · browser_set_viewport · browser_get_viewport · browser_set_window · browser_get_window · browser_set_geolocation · browser_restore_storage · browser_storage_state · browser_download_set_dir

Twelve tools. The cookie and storage tools are the fastest path to an authenticated session — call browser_set_cookie with a valid session token before navigating, and the browser arrives at the target URL already logged in. No need for the agent to work through the login flow on every run.

browser_set_cookie {
  "name": "session_token",
  "value": "eyJhbGci...",
  "domain": ".example.com",
  "path": "/"
}

browser_storage_state exports the full session — cookies plus localStorage — to a JSON object. browser_restore_storage loads it back. This pair is the right way to persist and reuse an authenticated session across agent runs: export once after logging in, restore at the top of every subsequent run.
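
On the harness side, the export-once, restore-every-run pattern comes down to writing the exported object to disk and reading it back. Here is a minimal sketch using the standard library. The shape of the state object is a placeholder, not the format browser_storage_state actually returns.

```python
import json
import tempfile
from pathlib import Path


def save_state(state: dict, path: Path) -> None:
    """Write an exported storage state to disk as JSON."""
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")


def load_state(path: Path) -> dict:
    """Read a previously saved storage state back into a dict."""
    return json.loads(path.read_text(encoding="utf-8"))


# Placeholder for whatever browser_storage_state returns; the keys are assumptions.
exported = {
    "cookies": [{"name": "session_token", "value": "example", "domain": ".example.com"}],
    "localStorage": {"theme": "dark"},
}

with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "auth-state.json"
    save_state(exported, state_file)
    restored = load_state(state_file)

print(restored == exported)
```

The saved file contains live session credentials, so store it with the same care as any other secret.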

browser_set_viewport sets the viewport dimensions before navigation — the right way to simulate a specific screen size for responsive layout testing. browser_set_geolocation overrides the browser's reported GPS coordinates, which controls everything that calls the Geolocation API: location-aware content, region-restricted features, distance-based sorting.

browser_set_content replaces the entire page with an HTML string. It's the fastest way for an agent to work with an isolated component — no dev server, no routing, no external dependencies. Pass the component's markup and styles directly and the browser renders it immediately.

browser_download_set_dir redirects all downloads to a named directory. Set it before triggering any download — without it, files land in the browser's default download location, which may be inconvenient to find programmatically.

Browser control and the virtual clock

browser_start · browser_stop · browser_sleep · browser_wait · browser_wait_for_load · browser_wait_for_text · browser_wait_for_fn · browser_emulate_media · browser_evaluate · browser_dialog_dismiss · page_clock_install · page_clock_set_fixed_time · page_clock_set_system_time · page_clock_set_timezone · page_clock_pause_at · page_clock_resume · page_clock_fast_forward · page_clock_run_for

Eighteen tools across two categories. The Browser category covers lifecycle, timing, JavaScript execution, and media emulation. The Clock category is its own surface entirely — a virtual clock that replaces the browser's real-time API, giving the agent complete control over Date, setTimeout, and setInterval.

browser_wait is the right timing tool for most situations. It polls a CSS selector until the element reaches a target state: visible, hidden, attached, or detached. An agent that uses browser_wait instead of browser_sleep will finish faster and fail less — it unblocks the moment the condition is met, not after a fixed delay.

// Wait for a loading spinner to disappear
browser_wait {"selector": ".loading-spinner", "state": "hidden"}

// Wait for a success message to appear
browser_wait_for_text {"text": "Your changes have been saved"}

// Wait on an arbitrary JS condition
browser_wait_for_fn {"fn": "() => window.__appReady === true"}
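
The reason condition-based waits beat fixed sleeps is easiest to see in a generic polling loop. The Python sketch below is not how Vibium implements browser_wait. It is a plain illustration of polling with a deadline, and the spinner check is simulated.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0,
               interval: float = 0.05) -> bool:
    """Poll condition until it is true or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True  # Unblock as soon as the condition holds.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


polls = {"count": 0}


def spinner_hidden() -> bool:
    """Simulated check that succeeds on the third poll."""
    polls["count"] += 1
    return polls["count"] >= 3


start = time.monotonic()
ok = wait_until(spinner_hidden, timeout=2.0, interval=0.01)
elapsed = time.monotonic() - start
print(ok, polls["count"], f"{elapsed:.3f}s")
```

The loop returns after three polls, a few hundredths of a second in, where a fixed sleep would have cost its full duration whether the page was ready or not.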

browser_evaluate runs a JavaScript expression in the page context and returns the result. Use it when no other tool can reach the value the agent needs: a computed property, a value in the application's internal state, a DOM measurement that isn't exposed as an attribute.

browser_emulate_media overrides CSS media queries. Set colorScheme: "dark" to test dark mode styling, or media: "print" to see how the page renders in a print context — no printer required.

The Clock category is the most specialised surface in the MCP server. Install the virtual clock with page_clock_install early in the session, before any time-sensitive code runs. Once installed, real time stops advancing for the page — Date.now(), setTimeout, and setInterval all respond to the virtual clock instead.

// Install and jump to a specific moment
page_clock_install       {"time": 1748736000000}
page_clock_set_timezone  {"timezone": "America/New_York"}

// Fast-forward 24 hours to test a time-based expiry
page_clock_fast_forward  {"ms": 86400000}

// Run timers for 5 seconds then pause
page_clock_run_for       {"ms": 5000}

This makes date-dependent behaviour — session expiry, scheduled content, countdown timers, cron-driven UI updates — fully testable without waiting for real time to pass or mocking at the test-framework level. The clock runs inside the browser, so it affects the page exactly as a real time change would.
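
The clock tools take millisecond values, which are awkward to read and easy to mistype. A small helper can derive them from dates and durations instead. The sketch below reproduces the numbers used in the example above. The helper names are illustrative.

```python
from datetime import datetime, timedelta, timezone


def epoch_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    if moment.tzinfo is None:
        raise ValueError("use a timezone-aware datetime to avoid local-time surprises")
    return int(moment.timestamp() * 1000)


def duration_ms(delta: timedelta) -> int:
    """Convert a timedelta to whole milliseconds."""
    return int(delta.total_seconds() * 1000)


install_args = {"time": epoch_ms(datetime(2025, 6, 1, tzinfo=timezone.utc))}
fast_forward_args = {"ms": duration_ms(timedelta(hours=24))}
run_for_args = {"ms": duration_ms(timedelta(seconds=5))}

print(install_args)       # {'time': 1748736000000}
print(fast_forward_args)  # {'ms': 86400000}
print(run_for_args)       # {'ms': 5000}
```

The install timestamp in the example corresponds to midnight UTC on 1 June 2025.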

85 tools. 8 categories. The same browser capabilities available from the CLI, now addressable by any agent that speaks MCP.

The pattern that matters most is browser_map first, interact with @ref handles second. It's more reliable than CSS selectors, more readable in agent traces, and it keeps the agent grounded in what the page actually contains rather than what the agent assumes it contains.

The next issue covers the TypeScript/JavaScript API — 68 methods for driving the browser from Node.js, structured around the same surface with the ergonomics of a native async client library.
