Browser Use Agent

Overview

The Browser Use Agent enables AI agents to interact with websites automatically. It can navigate pages, fill forms, extract data, and handle complex scenarios like CAPTCHAs and logins with human assistance when needed.

Execution Flow & Architecture

Process Flow Diagram

Workflow Explanation

  1. Task Submission: You provide a web automation task (e.g., "Search for jobs on LinkedIn")

  2. Browser Session Creation: The agent creates an isolated browser session and provides you with a live browser URL

  3. Step-by-Step Execution: The agent plans and executes actions one by one:

    1. Analyzes the current page

    2. Decides what to do next (click button, fill form, navigate)

    3. Executes the action

    4. Reports progress

  4. Human Assistance: When the agent encounters challenges like CAPTCHAs or login prompts, you can help by opening the live browser URL and completing them manually

  5. Result Delivery: The agent returns the final results (extracted data, completion status, etc.)


Human-in-the-Loop Mechanisms

The Browser Use Agent handles challenging web scenarios by attempting them automatically first, with optional human assistance as a fallback.

How It Handles CAPTCHAs and Logins

Automatic Attempts: The agent tries to handle challenges automatically using its built-in instructions:

  1. CAPTCHAs: Attempts to solve them when possible; uses alternative strategies if blocked

  2. Logins: Only attempts login if credentials are provided or explicitly required

  3. Stuck Situations: Re-evaluates the task and tries different approaches

When You Need to Help: Sometimes the agent needs human assistance for complex challenges:

  1. Agent encounters a challenge (CAPTCHA, login prompt, etc.) during execution

  2. Live browser URL is available - You receive a URL that shows the current browser session in real-time

  3. You open the URL and manually complete the challenge (solve CAPTCHA, enter login credentials, etc.)

  4. Agent continues - The agent proceeds with its next step independently (it doesn't wait for you)

  5. Task continues - When the agent executes its next step, it may detect that you've completed the challenge and continue with the updated page state (this detection is automatic but not guaranteed)

Important Points:

  1. The agent doesn't pause or wait for you - it continues executing steps independently

  2. You can help at any time by opening the live browser URL

  3. The agent may detect your changes when it executes its next step (this is automatic, not guaranteed)

  4. There's no explicit pause/resume - you're helping in parallel with the agent's execution

Task: "Search for software engineer jobs on a job board and extract the first 5 listings"

What Happens:

  1. Agent navigates to the job board website

  2. Agent encounters a CAPTCHA during search

  3. You receive a live browser URL

  4. You open the URL and solve the CAPTCHA manually

  5. Agent continues searching and extracts job listings

  6. Agent returns the results

If Login is Required:

  1. If no credentials are provided, the agent skips login (per its instructions)

  2. If login is necessary, you can complete it via the live browser URL

  3. The agent then continues with the task

Recovery Strategies

When the agent gets stuck or encounters errors:

  1. Automatic Retry: The agent tries alternative approaches automatically

  2. Session Recovery: If the browser connection is lost, the agent recreates the session and continues

  3. State Preservation: Your manual changes (like completing a CAPTCHA) are typically preserved in the browser session, and the agent may detect them when it executes its next step


How It Works

Browser Automation Process

Step-by-Step Execution:

  1. The agent analyzes the current webpage

  2. It plans the next action using AI reasoning

  3. It executes the action (click, type, navigate, extract data)

  4. It checks the result and plans the next step

  5. This continues until the task is complete
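
The loop above can be sketched in a few lines. This is a toy model, not the agent's actual implementation: `Page`, `plan_next_action`, `run_agent`, and the action names are all hypothetical. It illustrates one point from the text: because the page state is re-read at the start of every step, a change made by a human between steps (for example, solving a CAPTCHA through the live browser URL) is picked up on the next step without any pause or resume signal.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Page:
    """Hypothetical stand-in for the live browser state."""
    captcha_solved: bool = False
    results: List[str] = field(default_factory=list)


def plan_next_action(page: Page) -> Optional[str]:
    """Toy planner: picks one action from the current page state only."""
    if not page.captcha_solved:
        return "retry_search"      # still blocked, try again
    if not page.results:
        return "extract_results"
    return None                    # nothing left to do


def run_agent(page: Page, max_steps: int = 10,
              between_steps: Optional[Callable[[int, Page], None]] = None) -> List[str]:
    """Analyze -> plan -> act -> report, until done or max_steps is reached."""
    log: List[str] = []
    for step in range(max_steps):
        action = plan_next_action(page)   # analyze and plan from fresh state
        if action is None:
            break
        if action == "extract_results":   # execute the action
            page.results = ["listing 1", "listing 2"]
        log.append(action)                # report progress
        if between_steps:
            between_steps(step, page)     # models a human acting in parallel
    return log


def human(step: int, page: Page) -> None:
    """Pretend the user solves the CAPTCHA after the agent's second step."""
    if step == 1:
        page.captcha_solved = True


page = Page()
print(run_agent(page, between_steps=human))
# → ['retry_search', 'retry_search', 'extract_results']
```

Without the `human` callback the toy agent keeps retrying until `max_steps` is reached, which is why a step limit matters.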

Real-Time Updates:

  1. You receive progress updates showing what the agent is doing

  2. You can see the agent's "thinking" process

  3. A live browser URL lets you watch or intervene if needed

Session Recording:

  1. Optionally records a video of the entire browser session

  2. Useful for debugging or reviewing what happened

  3. Available after task completion

Error Handling

Automatic Recovery:

  1. If the browser disconnects, the agent automatically recreates the session

  2. If an action fails, the agent tries alternative approaches

  3. Configurable retry limits prevent infinite loops
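
A minimal sketch of the recovery pattern described above, assuming a hypothetical `SessionLost` error and caller-supplied `create_session` and `run_task` functions (none of these names come from the SDK). A fresh session is created for every attempt, and the retry limit bounds the loop.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class SessionLost(Exception):
    """Hypothetical error raised when the browser connection drops."""


def run_with_recovery(create_session: Callable[[], object],
                      run_task: Callable[[object], T],
                      max_retries: int = 3) -> T:
    """Recreate the session and retry, allowing up to max_retries extra attempts."""
    last_error: Optional[Exception] = None
    for _ in range(max_retries + 1):
        session = create_session()        # fresh session on every attempt
        try:
            return run_task(session)
        except SessionLost as exc:
            last_error = exc              # remember the failure and retry
    raise RuntimeError(f"gave up after {max_retries} retries") from last_error


attempts: List[object] = []


def flaky_task(session: object) -> str:
    """Fails twice with a dropped connection, then succeeds."""
    attempts.append(session)
    if len(attempts) < 3:
        raise SessionLost("connection dropped")
    return "done"


result = run_with_recovery(object, flaky_task, max_retries=3)
print(result, len(attempts))
# → done 3
```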

Common Issues:

  1. CAPTCHA/Login Blocks: Use the live browser URL to complete manually

  2. Element Not Found: Agent waits, refreshes, or tries alternative selectors

  3. Session Disconnects: Automatic retry with session recreation


Sample Usage

Basic Web Automation

Via the SDK:

Example Output:

Task Requiring Human Assistance

What to Expect:

  1. Agent starts executing the task

  2. If a CAPTCHA appears, you receive a live browser URL as a status update

  3. Open the URL, solve the CAPTCHA

  4. Agent continues with its next step and might detect your changes when it executes the next action

  5. Final results are returned with the job listings

Configuring Timeouts

You can configure timeout settings to match your task requirements:
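
As an illustration, the sketch below uses a hypothetical stand-in class, `TimeoutSettings`, that mirrors the three timeout fields of BrowserUseToolConfig. It is not the SDK class itself, and the assumption that the Steel timeout is expressed in milliseconds is inferred from the field name. The helper `worst_case_steps` is also an addition for illustration: it shows how few maximum-length steps fit inside one session.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutSettings:
    """Hypothetical stand-in mirroring the timeout fields of BrowserUseToolConfig."""
    steel_timeout_in_ms: int = 600_000        # documented default: 600 seconds
    browser_use_llm_timeout_in_s: int = 60    # LLM response time per planning step
    browser_use_step_timeout_in_s: int = 180  # time allowed for each agent step

    def __post_init__(self) -> None:
        # Reject zero or negative values early instead of failing mid-task.
        for name, value in vars(self).items():
            if value <= 0:
                raise ValueError(f"{name} must be positive, got {value}")

    def worst_case_steps(self) -> int:
        """How many maximum-length steps fit in one Steel session."""
        return (self.steel_timeout_in_ms // 1000) // self.browser_use_step_timeout_in_s


defaults = TimeoutSettings()
longer = TimeoutSettings(steel_timeout_in_ms=900_000,
                         browser_use_step_timeout_in_s=120)
print(defaults.worst_case_steps(), longer.worst_case_steps())
# → 3 7
```

With the documented defaults, only three steps that each hit the step timeout fit in a session, so slow pages are better served by a longer session timeout or a shorter step timeout.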


Capabilities & Limitations

Known Capabilities

The Browser Use Agent excels at a wide range of web automation tasks:

  1. Web Navigation & Form Filling

    1. Use Case: Automatically fill out contact forms, registration pages, or search forms

    2. Example Task: "Go to https://duckduckgo.com, search for 'Python web automation', and extract the titles of the first 5 search results"

  2. Data Extraction & Collection

    1. Use Case: Gather information from multiple pages or websites

    2. Example Task: "Navigate to https://en.wikipedia.org/wiki/Python_(programming_language) and extract the first paragraph"

  3. Multi-Step Task Automation

    1. Use Case: Complete complex workflows that require multiple sequential actions

    2. Example Task: "Go to https://www.python.org, navigate to the documentation section, find the 'Tutorial' page, and extract the main topics covered"

  4. Scrolling & Pagination

    1. Use Case: Navigate through long pages or multiple pages of results

    2. Example Task: "Go to https://en.wikipedia.org/wiki/Python_(programming_language) and scroll down past the introduction section"

  5. Multi-Tab Operations

    1. Use Case: Open multiple tabs for research or parallel information gathering

    2. Example Task: "Open https://www.python.org, open the documentation section in a new tab, then extract the main heading from each page"

Known Limitations

While powerful, the Browser Use Agent has some limitations:

  1. CAPTCHAs in Iframes

    1. Limitation: CAPTCHAs embedded in iframes are difficult to solve automatically

    2. Example Scenario: A login page with a CAPTCHA widget loaded in an iframe may require manual intervention

    3. Workaround: Use the live browser URL to complete CAPTCHAs manually when needed

  2. Login Without Credentials

    1. Limitation: The agent skips login attempts if no credentials are provided (by design for security)

    2. Example Scenario: Task requires accessing a protected area but no login credentials are available

    3. Workaround: Provide credentials in the task description or complete login manually via the live browser URL

  3. Timeouts & Limits

    1. Limitation: Several timeout and limit constraints may affect task execution:

      1. Task Length: Tasks requiring more than 100 steps may face memory constraints. This is a practical limitation based on observed memory usage patterns, not a hard limit enforced by the tool. The browser-use framework roadmap includes plans to improve agent memory handling for longer tasks.

      2. Timeout Settings: Three timeout settings, all configurable via BrowserUseToolConfig, limit task duration:

        1. Steel Session API Timeout: Default 600 seconds (10 minutes) - controls how long the Steel session can remain active (steel_timeout_in_ms)

        2. Browser Use Agent LLM Timeout: Default 60 seconds - controls how long the LLM has to respond for each planning step (browser_use_llm_timeout_in_s)

        3. Browser Use Agent Step Timeout: Default 180 seconds (3 minutes) - controls how long each agent step can take (browser_use_step_timeout_in_s)

      3. Network Latency: Because the Browser Use deployment (Southeast Asia) is geographically far from the Steel servers (United States), network latency can cause timeouts during rapid interactions

      4. Steel Hobby Plan Limits: Browser Use currently uses Steel's free Hobby plan with the following limits (note: these limits are subject to change if we upgrade to a paid Steel plan):

        1. Max Session Time: 15 minutes per browser session

        2. Daily Requests: 500 requests per day

        3. Requests per Second: 1 request per second rate limit

        4. Concurrent Sessions: Maximum 5 concurrent browser sessions

        5. Data Retention: Session data retained for 24 hours

    2. Example Scenario: Tasks that run longer than 15 minutes are terminated, rapid interactions may time out because of network latency, and reaching the daily request limit prevents new sessions from starting

    3. Workaround: Break large tasks into smaller subtasks under 15 minutes, use the file system to track progress across multiple runs, configure timeout values if needed, or upgrade to a paid Steel plan for higher limits (see Steel Pricing)

  4. Cross-Origin Iframe Interactions

    1. Limitation: Interacting with elements inside cross-origin iframes can be unreliable

    2. Example Scenario: A payment form embedded in an iframe from a different domain

    3. Workaround: Manual intervention via live browser URL for critical iframe interactions

  5. Sequential Execution

    1. Limitation: Tasks execute sequentially, not in parallel

    2. Example Scenario: Applying to 50 different job postings must be done one at a time

    3. Workaround: For parallel tasks, run multiple agent instances or break into batches

  6. UI Element Detection

    1. Limitation: Some dynamically loaded or custom UI elements may not be immediately detected

    2. Example Scenario: A custom dropdown menu that loads content via JavaScript after a delay

    3. Workaround: The agent will wait and retry, or you can use the live browser URL to verify element visibility

  7. Real-Time Interactive Elements

    1. Limitation: Elements that require real-time human interaction (like drag-and-drop) may be challenging

    2. Example Scenario: A complex image editor with drag-and-drop functionality

    3. Workaround: Use manual intervention via live browser URL for complex interactions

  8. Elements with Mouse Events

    1. Limitation: Elements that rely on mouse event handlers (such as mousedown, mouseup, mouseover, etc.) instead of standard click events may not respond correctly to agent interactions

    2. Example Scenario: A custom button or interactive element that only triggers actions on mouse events (common in some JavaScript frameworks or custom UI libraries)

    3. Workaround: Use manual intervention via live browser URL to interact with such elements, or contact support if this is a critical requirement

  9. Token Consumption

    1. Limitation: Very large pages with extensive DOM content can consume significant tokens

    2. Example Scenario: A single-page application with thousands of interactive elements

    3. Workaround: Configure vision detail levels (auto/low/high) to optimize token usage
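
The workaround for the 15-minute session limit (split the work and track progress on the file system) can be sketched as follows. This is a hedged sketch: `run_batch`, the JSON checkpoint format, and the per-item `handle` callback are assumptions made for illustration, not part of the tool. Each run skips items recorded in the checkpoint file and stops starting new items once its time budget is used up.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Callable, List


def run_batch(items: List[str], handle: Callable[[str], None],
              checkpoint: Path, budget_s: float = 14 * 60) -> List[str]:
    """Process items not yet done, stopping before the time budget runs out."""
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    start = time.monotonic()
    for item in items:
        if item in done:
            continue                      # finished in an earlier run
        if time.monotonic() - start >= budget_s:
            break                         # leave the rest for the next run
        handle(item)
        done.append(item)
        checkpoint.write_text(json.dumps(done))  # persist after every item
    return done


with tempfile.TemporaryDirectory() as tmp:
    progress = Path(tmp) / "progress.json"
    handled: List[str] = []
    run_batch(["job-1", "job-2"], handled.append, progress)
    # A later run sees the checkpoint and only handles the new item.
    final = run_batch(["job-1", "job-2", "job-3"], handled.append, progress)

print(final, handled)
# → ['job-1', 'job-2', 'job-3'] ['job-1', 'job-2', 'job-3']
```

The default budget of 14 minutes is an arbitrary choice that leaves a margin below the 15-minute limit.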


Technical Details

Browser Sessions

The agent uses isolated browser sessions that:

  1. Run in secure, isolated environments

  2. Automatically clean up after task completion

  3. Support real-time monitoring via live browser URLs

AI Models

The agent uses two AI models:

  1. Primary Model: Plans actions and makes decisions

  2. Secondary Model: Extracts structured data from web pages

Both models work together to understand pages and execute tasks effectively.

Streaming Events

The agent provides real-time updates through streaming events:

  1. Status Updates: Progress notifications, session initialization

  2. Step Results: Action execution results with thinking process

  3. Live Browser URL: You'll receive the live browser URL as a status update early in execution, allowing you to monitor or intervene if needed
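
A consumer of these events might sort them as in the sketch below. The event shape is an assumption: the dictionaries, the `type`, `message`, `result`, and `live_url` keys, and the example URL are hypothetical and do not reflect the actual event schema.

```python
from typing import Dict, Iterable, List, Optional


def consume_events(events: Iterable[Dict[str, str]]) -> Dict[str, object]:
    """Sort a stream of events into status lines, step results and the live URL."""
    live_url: Optional[str] = None
    statuses: List[str] = []
    steps: List[str] = []
    for event in events:
        kind = event.get("type")
        if kind == "status":
            statuses.append(event["message"])
            # Assumption: the live browser URL arrives inside a status update.
            if live_url is None and "live_url" in event:
                live_url = event["live_url"]
        elif kind == "step_result":
            steps.append(event["result"])
    return {"live_url": live_url, "statuses": statuses, "steps": steps}


sample = [
    {"type": "status", "message": "session initialized",
     "live_url": "https://example.com/live/abc"},
    {"type": "step_result", "result": "navigated to job board"},
    {"type": "status", "message": "extracting listings"},
    {"type": "step_result", "result": "extracted 5 listings"},
]
summary = consume_events(sample)
print(summary["live_url"], len(summary["statuses"]), len(summary["steps"]))
# → https://example.com/live/abc 2 2
```

Surfacing the live URL as soon as it arrives lets a user open the session before the agent reaches a CAPTCHA or login prompt.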

Security

Isolation:

  1. Each browser session is isolated from others

  2. No data persists between tasks

  3. API keys are loaded from environment variables (never hardcoded)
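
A minimal sketch of loading a key from the environment rather than from source code. The helper `load_api_key` is hypothetical, and the exact variable names the tool reads are not stated here, so the name is passed in by the caller.

```python
import os


def load_api_key(name: str) -> str:
    """Read a key from the environment and fail loudly if it is missing."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value
```

A call such as `load_api_key("STEEL_API_KEY")` (the variable name is an assumption) raises a `RuntimeError` at startup when the key is absent, instead of failing later in the middle of a task.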

Safety:

  1. Actions are validated before execution (enforced by the browser-use framework)

  2. Error messages are sanitized

  3. Logging available for monitoring


Performance & Troubleshooting

Efficiency:

  1. Configurable vision detail levels (auto/low/high) for faster processing

  2. Background video recording doesn't slow down execution

  3. Automatic resource cleanup

Common Issues:

| Issue | Solution |
| --- | --- |
| CAPTCHA/Login blocks | Use the live browser URL to complete manually |
| Session disconnects | Automatic retry - agent recreates session |
| Element not found | Agent waits and retries with alternative approaches |
| Task stuck | Agent re-evaluates and tries different strategies |

Debug Resources:

  1. Live browser URLs for real-time monitoring

  2. Video recordings (if enabled) for reviewing sessions

  3. Action logs showing what the agent did

  4. AI reasoning traces showing decision-making process
