GPT-Researcher

Audience: Developers

GPT-Researcher Provider

Overview

GPT-Researcher is an automated research agent that conducts comprehensive research on any given topic. It uses web scraping, information gathering, and LLM-based synthesis to generate detailed research reports.

What is GPT-Researcher?

GPT-Researcher is an open-source library that automates the research process by:

  1. Generating Research Plans: Creates structured research plans based on queries

  2. Web Scraping: Extracts information from multiple web sources

  3. Information Aggregation: Combines information from various sources

  4. Report Generation: Synthesizes findings into comprehensive reports

  5. Citation Support: Includes source citations in generated reports
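In code, this workflow is driven by the library's small async API. A minimal usage sketch (class and method names per the upstream documentation; requires `pip install gpt-researcher` plus the usual API keys):

```python
import asyncio

async def run_research(query: str) -> str:
    # Import deferred so this sketch stays loadable without the package;
    # requires: pip install gpt-researcher
    from gpt_researcher import GPTResearcher

    researcher = GPTResearcher(query=query, report_type="research_report")
    await researcher.conduct_research()     # plan, search, scrape, analyze
    return await researcher.write_report()  # synthesize the final report

# Usage (network access and LLM API keys required):
# report = asyncio.run(run_research("Impact of retrieval on LLM accuracy"))
```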

Monitored Sources

GL Open DeepResearch uses the open-source GPT Researcher implementation. The following are the main references for the upstream project:

| Resource | Description |
| --- | --- |
| GPT Researcher homepage | Official GPT Researcher homepage and product overview |
| GPT Researcher documentation | Official documentation (getting started, concepts, and guides) |

How It Works

Research Process

GPT-Researcher follows a structured research workflow, outlined by the components and flow below.

Key Components

  1. GPTResearcher: Main research orchestrator

  2. Retriever: Handles web search and result retrieval

  3. Scraper: Extracts content from web pages

  4. LLM Provider: Generates research plans and reports

  5. Report Generator: Formats final research output

Research Flow

  1. Initialization: GPTResearcher initialized with query and configuration

  2. Query Generation: System generates search queries from research question

  3. Web Search: Retriever searches web for relevant sources

  4. Content Extraction: Scraper fetches and extracts content from URLs

  5. Analysis: LLM analyzes gathered content and extracts key information

  6. Synthesis: Information synthesized into structured context

  7. Report Generation: Final report generated from synthesized context

Integration

Integration Approach

We install the official gpt-researcher package via pip and keep our custom modifications in the gl_deep_research/packages/gpt_researcher/ package. These customizations are applied through runtime patching (monkey patching) to extend the library's functionality without modifying the installed package.

The integration follows the Adapter pattern:

  • GPTResearcherAdapter implements the OrchestratorAdapter protocol

  • Adapter bridges GPT-Researcher engine to the orchestrator

  • Profile-based configuration determines provider selection

  • Streaming support via adapter-specific postprocessors

This approach allows us to:

  • Use the official gpt-researcher package from PyPI

  • Keep our custom changes as patches in packages/gpt_researcher/

  • Extend functionality without forking or modifying the original package

  • Apply patches at runtime during adapter initialization

  • Maintain compatibility with upstream package updates

  • Integrate seamlessly with the orchestrator system
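A minimal sketch of the adapter shape described above. OrchestratorAdapter and GPTResearcherAdapter are the names used in this codebase; the method signature and stubbed body are illustrative assumptions:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class OrchestratorAdapter(Protocol):
    # Protocol named in this page; the method signature is an assumption
    async def research(self, query: str) -> dict: ...

class GPTResearcherAdapter:
    """Bridges the GPT-Researcher engine to the orchestrator (sketch)."""

    def __init__(self, profile: dict):
        self.profile = profile  # profile-based config drives provider selection

    async def research(self, query: str) -> dict:
        # Real adapter: apply patches, write a temp config, then call
        # conduct_research()/write_report(); stubbed out here.
        return {"query": query, "report": "", "sources": []}
```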

Adapter Layer

The GPT-Researcher provider is integrated through GPTResearcherAdapter (research/adapter/gpt_researcher_adapter.py):

Initialization Process

  1. Patch Initialization: Applies all necessary patches to GPT-Researcher library

  2. Adapter Creation: Creates GPTResearcherAdapter instance

  3. Orchestrator Registration: Adapter registered with OrchestratorFactory

  4. Adapter Ready: Adapter ready to accept research requests via orchestrator

Request Flow

  1. Request received via task or taskgroup API (POST /v1/tasks or POST /v1/taskgroup)

  2. Request authenticated via API key (account or master key)

  3. Profile loaded from database based on profile parameter

  4. Orchestrator creates adapter instance via OrchestratorFactory

  5. Temporary configuration file created with custom retriever/scraper

  6. GPTResearcher initialized with configuration

  7. Research conducted via conduct_research()

  8. Report generated via write_report()

  9. Result formatted as DeepResearchResult

  10. Response returned through orchestrator to router

Customizations and Changes

Package Location

The GPT-Researcher customization code is located in gl_deep_research/packages/gpt_researcher/, which contains patches and extensions to the original GPT-Researcher library. The base gpt-researcher package is installed via pip, and our patches are applied at runtime to extend its functionality.

Key Changes Made

1. Patch Initializer (patch_initializer.py)

Purpose: Centralized patch management for all GPT-Researcher modifications.

Functionality:

  • Applies all patches in correct order

  • Ensures patches are applied before using GPT-Researcher

  • Called during adapter initialization

Patches Applied:

  1. Extended Deep Research patch

  2. Smart Search Retriever patch

  3. Smart Search Scraper patch

  4. Sea Lion LLM patch
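A sketch of how a centralized initializer can apply these patches once, in order. The function names and the idempotency guard are assumptions; the stand-in patch functions only record that they ran, where the real ones monkey-patch gpt_researcher attributes:

```python
applied: list[str] = []

# Stand-ins for the real patch modules (each real patch rebinds a
# gpt_researcher attribute; here they just record that they ran).
def patch_extended_deep_research(): applied.append("extended_deep_research")
def patch_smart_search_retriever(): applied.append("smart_search_retriever")
def patch_smart_search_scraper(): applied.append("smart_search_scraper")
def patch_sea_lion_llm(): applied.append("sea_lion_llm")

_initialized = False

def apply_all_patches() -> None:
    """Apply every patch exactly once, in dependency order (idempotent)."""
    global _initialized
    if _initialized:
        return
    for patch in (patch_extended_deep_research, patch_smart_search_retriever,
                  patch_smart_search_scraper, patch_sea_lion_llm):
        patch()
    _initialized = True
```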

2. Smart Search Retriever Patch (patch/smart_search_retriever.py)

Purpose: Replaces default retriever with Smart Search SDK integration.

Changes:

  • Custom Retriever Class: SmartSearchRetriever replaces CustomRetriever

  • Smart Search Integration: Uses WebSearchClient from Smart Search SDK

  • Result Formatting: Formats results to match GPT-Researcher's expected structure

  • Async Support: Handles async operations with thread pool execution

Key Features:
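A self-contained sketch of the retriever's result formatting. The SmartSearchRetriever name comes from this page; the SDK call signature and the {"href", "body"} result shape are assumptions to verify against the installed versions:

```python
class SmartSearchRetriever:
    """Sketch: formats Smart Search results into GPT-Researcher's shape."""

    def __init__(self, query: str, client=None):
        self.query = query
        self.client = client  # WebSearchClient from the Smart Search SDK

    def search(self, max_results: int = 7) -> list:
        raw = self.client.search(self.query, limit=max_results)  # assumed call
        # GPT-Researcher expects a list of {"href", "body"} dicts
        return [{"href": r["url"], "body": r["snippet"]} for r in raw]
```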

Integration:

  • Patches gpt_researcher.retrievers.CustomRetriever

  • Activated via RETRIEVER: "custom" in configuration

3. Smart Search Scraper Patch (patch/smart_search_scraper.py)

Purpose: Adds Smart Search SDK scraper support to GPT-Researcher.

Changes:

  • Custom Scraper Class: SmartSearchScraper for web page fetching

  • Smart Search Integration: Uses WebSearchClient.fetch_web_page()

  • Content Extraction: Extracts markdown content and metadata

  • Error Handling: Graceful fallback on scraping failures

Key Features:
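A self-contained sketch of the scraper with its graceful fallback. The fetch_web_page() result fields and the returned tuple shape are assumptions:

```python
class SmartSearchScraper:
    """Sketch: fetches a page via the Smart Search SDK, with a safe fallback."""

    def __init__(self, url: str, client=None):
        self.url = url
        self.client = client  # WebSearchClient from the Smart Search SDK

    def scrape(self):
        # Returns (content, image_urls, title); shape is an assumption.
        try:
            page = self.client.fetch_web_page(self.url)
            return page.get("markdown", ""), [], page.get("title", "")
        except Exception:
            return "", [], ""  # graceful fallback on scraping failures
```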

Integration:

  • Patches gpt_researcher.scraper.scraper.Scraper.get_scraper()

  • Activated via SCRAPER: "smart_search" in configuration

4. Extended Deep Research Patch (patch/extended_deep_research.py)

Purpose: Extends deep research functionality with additional parameters.

Changes:

  • Parameter Support: Adds support for query_domains, source_urls, document_urls, complement_source_urls

  • Enhanced Researcher Creation: Passes all parameters to sub-researchers

  • Recursive Research: Maintains parameter context through recursive calls

Key Enhancements:
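The parameter forwarding can be sketched as a small helper that copies the extended parameters into each sub-researcher's keyword arguments (the helper name is hypothetical):

```python
# The four parameters named above, kept in one place so recursive
# deep-research calls preserve the same source constraints.
EXTENDED_PARAMS = ("query_domains", "source_urls",
                   "document_urls", "complement_source_urls")

def sub_researcher_kwargs(parent_params: dict) -> dict:
    """Forward the extended parameters to a recursive sub-researcher."""
    return {k: parent_params[k]
            for k in EXTENDED_PARAMS if parent_params.get(k) is not None}
```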

Integration:

  • Patches gpt_researcher.skills.deep_research.DeepResearchSkill.deep_research

  • Automatically applied during patch initialization

5. Sea Lion LLM Patch (patch/extended_llm_configs.py)

Purpose: Adds Sea Lion LLM provider support to GPT-Researcher.

Changes:

  • Provider Registration: Adds "sea_lion" to supported providers

  • OpenAI Compatibility: Uses ChatOpenAI with custom base URL

  • Configuration: Reads from SEA_LION_BASE_URL and SEA_LION_API_KEY

Key Features:
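A sketch of the provider factory branch for "sea_lion", using ChatOpenAI with a custom base URL as described above (the function name is hypothetical):

```python
import os

def sea_lion_llm(**model_kwargs):
    # Deferred import: langchain_openai is only needed when this
    # provider is actually selected.
    from langchain_openai import ChatOpenAI

    return ChatOpenAI(
        base_url=os.environ["SEA_LION_BASE_URL"],  # OpenAI-compatible endpoint
        api_key=os.environ["SEA_LION_API_KEY"],
        **model_kwargs,
    )
```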

Integration:

  • Patches gpt_researcher.llm_provider.generic.base.GenericLLMProvider.from_provider

  • Adds "sea_lion" to _SUPPORTED_PROVIDERS set

Adapter Configuration

The adapter creates a temporary configuration file for each research request:
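A sketch of that per-request config creation, assuming a JSON config file; the keys shown mirror the RETRIEVER/SCRAPER settings named elsewhere on this page:

```python
import json
import tempfile

def write_temp_config() -> str:
    """Write a per-request config file and return its path (sketch)."""
    cfg = {
        "RETRIEVER": "custom",      # activates the SmartSearchRetriever patch
        "SCRAPER": "smart_search",  # activates the SmartSearchScraper patch
    }
    f = tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False)
    json.dump(cfg, f)
    f.close()
    return f.name  # handed to GPT-Researcher, cleaned up after the request
```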

Patch Application Order

Patches are applied in a specific order to ensure dependencies are met:

  1. Extended Deep Research: Base functionality extension

  2. Smart Search Retriever: Replaces default retriever

  3. Smart Search Scraper: Adds scraper support

  4. Sea Lion LLM: Adds LLM provider support

Configuration

Required Environment Variables
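The variables this page names are the Sea Lion credentials; a small startup check might look like this (other profiles may require additional provider keys):

```python
import os

# The two variables named on this page for the Sea Lion provider.
REQUIRED_ENV = ("SEA_LION_BASE_URL", "SEA_LION_API_KEY")

def missing_env() -> list:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV if not os.environ.get(name)]
```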

Report Configuration
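Report-related settings live in the GPT-Researcher configuration; a sketch with key names that follow upstream conventions (verify against the installed version; values are illustrative):

```python
# Illustrative report settings; key names follow upstream gpt-researcher
# config conventions and should be checked against the installed version.
report_config = {
    "REPORT_FORMAT": "APA",  # citation style for the generated report
    "TOTAL_WORDS": 1200,     # target report length
    "MAX_ITERATIONS": 3,     # breadth of the research loop
}
```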

Usage

Basic Usage

Use the task or taskgroup API with a profile. The profile determines the provider and configuration:

Use the returned taskgroup_id and tasks to stream (GET /v1/taskgroup/{id}/stream) or poll for status and result (GET /v1/tasks/{id}). See Quick Start Guide.

Request Format

For a taskgroup, send query and profile (as form data or JSON, per the API contract). For a single task, send the same fields via POST /v1/tasks.

Note: Profile-specific options (like report_type, focus, max_sources, etc.) are configured in the profile itself, not in the request. See Research Profiles for more information.
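A hedged sketch of submitting a single task over HTTP; the base URL, bearer-token header shape, and response handling are assumptions drawn from the flow above:

```python
import json
import urllib.request

def submit_task(base_url: str, api_key: str, query: str, profile: str) -> dict:
    """POST a single research task (sketch; header shape is an assumption)."""
    req = urllib.request.Request(
        f"{base_url}/v1/tasks",
        data=json.dumps({"query": query, "profile": profile}).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```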

Response Format
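The exact payload is not reproduced on this page; a plausible sketch of the result the adapter returns (field names are assumptions modeled on the DeepResearchResult mentioned in the request flow):

```python
# Illustrative result shape only; field names are assumptions.
example_result = {
    "status": "completed",
    "report": "# Research Report\n...",
    "sources": [{"url": "https://example.com", "title": "Example source"}],
}
```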

Technical Details

Patch Mechanism

All patches use Python's monkey patching to modify GPT-Researcher at runtime:
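A minimal, self-contained illustration of the mechanism using stand-in classes; the real patches rebind attributes on gpt_researcher modules in the same way:

```python
# Stand-in for a gpt_researcher module and its default retriever class.
class retrievers:
    class CustomRetriever:
        def search(self):
            return "default results"

class SmartSearchRetriever(retrievers.CustomRetriever):
    def search(self):
        return "smart-search results"

def apply_retriever_patch() -> None:
    # Monkey patch: rebind the module attribute at runtime, so existing
    # code that looks up retrievers.CustomRetriever gets the new class.
    retrievers.CustomRetriever = SmartSearchRetriever
```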

Thread Safety

  • Retriever: Uses thread pool executor for async operations

  • Scraper: Uses asyncio.run() for async page fetching

  • Adapter: Stateless, supports concurrent requests
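The sync-to-async bridging noted above can be sketched as a helper that uses asyncio.run() when no event loop is active and a worker thread otherwise:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def run_coro_sync(coro):
    """Run a coroutine from synchronous code, with or without a live loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(coro)  # no loop active: run directly
    # Already inside an event loop: run on a worker thread's own loop
    # to avoid blocking or re-entering the current one.
    return _pool.submit(asyncio.run, coro).result()
```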

Error Handling

  • Patch Failures: Logged but don't prevent adapter initialization

  • Research Failures: Caught and returned as failed results

  • Scraping Errors: Gracefully handled with empty content

  • Retriever Errors: Returns None, handled by GPT-Researcher

Performance Considerations

  • Temporary Config Files: Created per request, cleaned up automatically

  • Async Operations: Smart Search SDK calls are async

  • Concurrent Research: GPT-Researcher handles concurrency internally

  • Report Caching: Not implemented, each request generates fresh report
