GPT-Researcher

Audience: Developers

GPT-Researcher Provider

Overview

GPT-Researcher is an automated research agent that conducts comprehensive research on any given topic. It uses web scraping, information gathering, and LLM-based synthesis to generate detailed research reports.

What is GPT-Researcher?

GPT-Researcher is an open-source library that automates the research process by:

  1. Generating Research Plans: Creates structured research plans based on queries

  2. Web Scraping: Extracts information from multiple web sources

  3. Information Aggregation: Combines information from various sources

  4. Report Generation: Synthesizes findings into comprehensive reports

  5. Citation Support: Includes source citations in generated reports
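In code, this workflow is driven by the library's small async API. A minimal usage sketch (class and method names per the upstream documentation; requires `pip install gpt-researcher` plus the usual API keys):

```python
import asyncio

async def run_research(query: str) -> str:
    # Import deferred so this sketch stays loadable without the package;
    # requires: pip install gpt-researcher
    from gpt_researcher import GPTResearcher

    researcher = GPTResearcher(query=query, report_type="research_report")
    await researcher.conduct_research()     # plan, search, scrape, analyze
    return await researcher.write_report()  # synthesize the final report

# Usage (network access and LLM API keys required):
# report = asyncio.run(run_research("Impact of retrieval on LLM accuracy"))
```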

Monitored Sources

GL Open DeepResearch uses the open-source GPT Researcher implementation. The following are the main references for the upstream project:

| Resource | Description |
| --- | --- |
| GPT Researcher homepage | Official GPT Researcher homepage and product overview |
| GPT Researcher documentation | Official documentation (getting started, concepts, and guides) |

How It Works

Research Process

GPT-Researcher follows a structured research workflow, outlined by the components and flow below.

Key Components

  1. GPTResearcher: Main research orchestrator

  2. Retriever: Handles web search and result retrieval

  3. Scraper: Extracts content from web pages

  4. LLM Provider: Generates research plans and reports

  5. Report Generator: Formats final research output

Research Flow

  1. Initialization: GPTResearcher initialized with query and configuration

  2. Query Generation: System generates search queries from research question

  3. Web Search: Retriever searches web for relevant sources

  4. Content Extraction: Scraper fetches and extracts content from URLs

  5. Analysis: LLM analyzes gathered content and extracts key information

  6. Synthesis: Information synthesized into structured context

  7. Report Generation: Final report generated from synthesized context

Integration

Integration Approach

We install the official gpt-researcher package via pip and keep our custom modifications in the gl_deep_research/packages/gpt_researcher/ package. These customizations are applied through runtime patching (monkey patching) to extend the library's functionality without modifying the installed package.

The integration follows the Adapter pattern:

  • GPTResearcherAdapter implements the OrchestratorAdapter protocol

  • Adapter bridges GPT-Researcher engine to the orchestrator

  • Profile-based configuration determines provider selection

  • Streaming support via adapter-specific postprocessors

This approach allows us to:

  • Use the official gpt-researcher package from PyPI

  • Keep our custom changes as patches in packages/gpt_researcher/

  • Extend functionality without forking or modifying the original package

  • Apply patches at runtime during adapter initialization

  • Maintain compatibility with upstream package updates

  • Integrate seamlessly with the orchestrator system
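A minimal sketch of the adapter shape described above. OrchestratorAdapter and GPTResearcherAdapter are the names used in this codebase; the method signature and stubbed body are illustrative assumptions:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class OrchestratorAdapter(Protocol):
    # Protocol named in this page; the method signature is an assumption
    async def research(self, query: str) -> dict: ...

class GPTResearcherAdapter:
    """Bridges the GPT-Researcher engine to the orchestrator (sketch)."""

    def __init__(self, profile: dict):
        self.profile = profile  # profile-based config drives provider selection

    async def research(self, query: str) -> dict:
        # Real adapter: apply patches, write a temp config, then call
        # conduct_research()/write_report(); stubbed out here.
        return {"query": query, "report": "", "sources": []}
```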

Adapter Layer

The GPT-Researcher provider is integrated through GPTResearcherAdapter (research/adapter/gpt_researcher_adapter.py):

Initialization Process

  1. Patch Initialization: Applies all necessary patches to GPT-Researcher library

  2. Adapter Creation: Creates GPTResearcherAdapter instance

  3. Orchestrator Registration: Adapter registered with OrchestratorFactory

  4. Adapter Ready: Adapter ready to accept research requests via orchestrator

Request Flow

  1. Request received via task or taskgroup API (POST /v1/tasks or POST /v1/taskgroup)

  2. Request authenticated via API key (account or master key)

  3. Profile loaded from database based on profile parameter

  4. Orchestrator creates adapter instance via OrchestratorFactory

  5. Temporary configuration file created with custom retriever/scraper

  6. GPTResearcher initialized with configuration

  7. Research conducted via conduct_research()

  8. Report generated via write_report()

  9. Result formatted as DeepResearchResult

  10. Response returned through orchestrator to router

Customizations and Changes

Package Location

The GPT-Researcher customization code is located in gl_deep_research/packages/gpt_researcher/, which contains patches and extensions to the original GPT-Researcher library. The base gpt-researcher package is installed via pip, and our patches are applied at runtime to extend its functionality.

Key Changes Made

1. Patch Initializer (patch_initializer.py)

Purpose: Centralized patch management for all GPT-Researcher modifications.

Functionality:

  • Applies all patches in correct order

  • Ensures patches are applied before using GPT-Researcher

  • Called during adapter initialization

Patches Applied:

  1. Extended Deep Research patch

  2. Smart Search Retriever patch

  3. Smart Search Scraper patch

  4. Sea Lion LLM patch
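A sketch of how a centralized initializer can apply these patches once, in order. The function names and the idempotency guard are assumptions; the stand-in patch functions only record that they ran, where the real ones monkey-patch gpt_researcher attributes:

```python
applied: list[str] = []

# Stand-ins for the real patch modules (each real patch rebinds a
# gpt_researcher attribute; here they just record that they ran).
def patch_extended_deep_research(): applied.append("extended_deep_research")
def patch_smart_search_retriever(): applied.append("smart_search_retriever")
def patch_smart_search_scraper(): applied.append("smart_search_scraper")
def patch_sea_lion_llm(): applied.append("sea_lion_llm")

_initialized = False

def apply_all_patches() -> None:
    """Apply every patch exactly once, in dependency order (idempotent)."""
    global _initialized
    if _initialized:
        return
    for patch in (patch_extended_deep_research, patch_smart_search_retriever,
                  patch_smart_search_scraper, patch_sea_lion_llm):
        patch()
    _initialized = True
```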

2. Smart Search Retriever Patch (patch/smart_search_retriever.py)

Purpose: Replaces default retriever with Smart Search SDK integration.

Changes:

  • Custom Retriever Class: SmartSearchRetriever replaces CustomRetriever

  • Smart Search Integration: Uses WebSearchClient from Smart Search SDK

  • Result Formatting: Formats results to match GPT-Researcher's expected structure

  • Async Support: Handles async operations with thread pool execution

Key Features:
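A self-contained sketch of the retriever's result formatting. The SmartSearchRetriever name comes from this page; the SDK call signature and the {"href", "body"} result shape are assumptions to verify against the installed versions:

```python
class SmartSearchRetriever:
    """Sketch: formats Smart Search results into GPT-Researcher's shape."""

    def __init__(self, query: str, client=None):
        self.query = query
        self.client = client  # WebSearchClient from the Smart Search SDK

    def search(self, max_results: int = 7) -> list:
        raw = self.client.search(self.query, limit=max_results)  # assumed call
        # GPT-Researcher expects a list of {"href", "body"} dicts
        return [{"href": r["url"], "body": r["snippet"]} for r in raw]
```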

Integration:

  • Patches gpt_researcher.retrievers.CustomRetriever

  • Activated via RETRIEVER: "custom" in configuration

3. Smart Search Scraper Patch (patch/smart_search_scraper.py)

Purpose: Adds Smart Search SDK scraper support to GPT-Researcher.

Changes:

  • Custom Scraper Class: SmartSearchScraper for web page fetching

  • Smart Search Integration: Uses WebSearchClient.fetch_web_page()

  • Content Extraction: Extracts markdown content and metadata

  • Error Handling: Graceful fallback on scraping failures

Key Features:
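A self-contained sketch of the scraper with its graceful fallback. The fetch_web_page() result fields and the returned tuple shape are assumptions:

```python
class SmartSearchScraper:
    """Sketch: fetches a page via the Smart Search SDK, with a safe fallback."""

    def __init__(self, url: str, client=None):
        self.url = url
        self.client = client  # WebSearchClient from the Smart Search SDK

    def scrape(self):
        # Returns (content, image_urls, title); shape is an assumption.
        try:
            page = self.client.fetch_web_page(self.url)
            return page.get("markdown", ""), [], page.get("title", "")
        except Exception:
            return "", [], ""  # graceful fallback on scraping failures
```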

Integration:

  • Patches gpt_researcher.scraper.scraper.Scraper.get_scraper()

  • Activated via SCRAPER: "smart_search" in configuration

4. Extended Deep Research Patch (patch/extended_deep_research.py)

Purpose: Extends deep research functionality with additional parameters.

Changes:

  • Parameter Support: Adds support for query_domains, source_urls, document_urls, complement_source_urls

  • Enhanced Researcher Creation: Passes all parameters to sub-researchers

  • Recursive Research: Maintains parameter context through recursive calls

Key Enhancements:
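The parameter forwarding can be sketched as a small helper that copies the extended parameters into each sub-researcher's keyword arguments (the helper name is hypothetical):

```python
# The four parameters named above, kept in one place so recursive
# deep-research calls preserve the same source constraints.
EXTENDED_PARAMS = ("query_domains", "source_urls",
                   "document_urls", "complement_source_urls")

def sub_researcher_kwargs(parent_params: dict) -> dict:
    """Forward the extended parameters to a recursive sub-researcher."""
    return {k: parent_params[k]
            for k in EXTENDED_PARAMS if parent_params.get(k) is not None}
```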

Integration:

  • Patches gpt_researcher.skills.deep_research.DeepResearchSkill.deep_research

  • Automatically applied during patch initialization

5. Sea Lion LLM Patch (patch/extended_llm_configs.py)

Purpose: Adds Sea Lion LLM provider support to GPT-Researcher.

Changes:

  • Provider Registration: Adds "sea_lion" to supported providers

  • OpenAI Compatibility: Uses ChatOpenAI with custom base URL

  • Configuration: Reads from SEA_LION_BASE_URL and SEA_LION_API_KEY

Key Features:
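A sketch of the provider factory branch for "sea_lion", using ChatOpenAI with a custom base URL as described above (the function name is hypothetical):

```python
import os

def sea_lion_llm(**model_kwargs):
    # Deferred import: langchain_openai is only needed when this
    # provider is actually selected.
    from langchain_openai import ChatOpenAI

    return ChatOpenAI(
        base_url=os.environ["SEA_LION_BASE_URL"],  # OpenAI-compatible endpoint
        api_key=os.environ["SEA_LION_API_KEY"],
        **model_kwargs,
    )
```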

Integration:

  • Patches gpt_researcher.llm_provider.generic.base.GenericLLMProvider.from_provider

  • Adds "sea_lion" to _SUPPORTED_PROVIDERS set

Adapter Configuration

The adapter creates a temporary configuration file for each research request:
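A sketch of that per-request config creation, assuming a JSON config file; the keys shown mirror the RETRIEVER/SCRAPER settings named elsewhere on this page:

```python
import json
import tempfile

def write_temp_config() -> str:
    """Write a per-request config file and return its path (sketch)."""
    cfg = {
        "RETRIEVER": "custom",      # activates the SmartSearchRetriever patch
        "SCRAPER": "smart_search",  # activates the SmartSearchScraper patch
    }
    f = tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False)
    json.dump(cfg, f)
    f.close()
    return f.name  # handed to GPT-Researcher, cleaned up after the request
```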

Patch Application Order

Patches are applied in a specific order to ensure dependencies are met:

  1. Extended Deep Research: Base functionality extension

  2. Smart Search Retriever: Replaces default retriever

  3. Smart Search Scraper: Adds scraper support

  4. Sea Lion LLM: Adds LLM provider support

Configuration

Required Environment Variables
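The variables this page names are the Sea Lion credentials; a small startup check might look like this (other profiles may require additional provider keys):

```python
import os

# The two variables named on this page for the Sea Lion provider.
REQUIRED_ENV = ("SEA_LION_BASE_URL", "SEA_LION_API_KEY")

def missing_env() -> list:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV if not os.environ.get(name)]
```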

Report Configuration
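Report-related settings live in the GPT-Researcher configuration; a sketch with key names that follow upstream conventions (verify against the installed version; values are illustrative):

```python
# Illustrative report settings; key names follow upstream gpt-researcher
# config conventions and should be checked against the installed version.
report_config = {
    "REPORT_FORMAT": "APA",  # citation style for the generated report
    "TOTAL_WORDS": 1200,     # target report length
    "MAX_ITERATIONS": 3,     # breadth of the research loop
}
```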

Usage

Basic Usage

Use the task or taskgroup API with a profile. The profile determines the provider and configuration:

Use the returned taskgroup_id and tasks to stream (GET /v1/taskgroup/{id}/stream) or poll for status and result (GET /v1/tasks/{id}). See Quick Start Guide.

Request Format

For a taskgroup, send query and profile (as form data or JSON, per the API contract). For a single task, send the same fields via POST /v1/tasks.

Note: Profile-specific options (like report_type, focus, max_sources, etc.) are configured in the profile itself, not in the request. See Research Profiles for more information.
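A hedged sketch of submitting a single task over HTTP; the base URL, bearer-token header shape, and response handling are assumptions drawn from the flow above:

```python
import json
import urllib.request

def submit_task(base_url: str, api_key: str, query: str, profile: str) -> dict:
    """POST a single research task (sketch; header shape is an assumption)."""
    req = urllib.request.Request(
        f"{base_url}/v1/tasks",
        data=json.dumps({"query": query, "profile": profile}).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```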

Response Format
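The exact payload is not reproduced on this page; a plausible sketch of the result the adapter returns (field names are assumptions modeled on the DeepResearchResult mentioned in the request flow):

```python
# Illustrative result shape only; field names are assumptions.
example_result = {
    "status": "completed",
    "report": "# Research Report\n...",
    "sources": [{"url": "https://example.com", "title": "Example source"}],
}
```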

Technical Details

Patch Mechanism

All patches use Python's monkey patching to modify GPT-Researcher at runtime:
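A minimal, self-contained illustration of the mechanism using stand-in classes; the real patches rebind attributes on gpt_researcher modules in the same way:

```python
# Stand-in for a gpt_researcher module and its default retriever class.
class retrievers:
    class CustomRetriever:
        def search(self):
            return "default results"

class SmartSearchRetriever(retrievers.CustomRetriever):
    def search(self):
        return "smart-search results"

def apply_retriever_patch() -> None:
    # Monkey patch: rebind the module attribute at runtime, so existing
    # code that looks up retrievers.CustomRetriever gets the new class.
    retrievers.CustomRetriever = SmartSearchRetriever
```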

Thread Safety

  • Retriever: Uses thread pool executor for async operations

  • Scraper: Uses asyncio.run() for async page fetching

  • Adapter: Stateless, supports concurrent requests
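The sync-to-async bridging noted above can be sketched as a helper that uses asyncio.run() when no event loop is active and a worker thread otherwise:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def run_coro_sync(coro):
    """Run a coroutine from synchronous code, with or without a live loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(coro)  # no loop active: run directly
    # Already inside an event loop: run on a worker thread's own loop
    # to avoid blocking or re-entering the current one.
    return _pool.submit(asyncio.run, coro).result()
```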

Error Handling

  • Patch Failures: Logged but don't prevent adapter initialization

  • Research Failures: Caught and returned as failed results

  • Scraping Errors: Gracefully handled with empty content

  • Retriever Errors: Returns None, handled by GPT-Researcher

Performance Considerations

  • Temporary Config Files: Created per request, cleaned up automatically

  • Async Operations: Smart Search SDK calls are async

  • Concurrent Research: GPT-Researcher handles concurrency internally

  • Report Caching: Not implemented, each request generates fresh report
