Audience: Developers
GPT-Researcher Provider
Overview
GPT-Researcher is an automated research agent that conducts comprehensive research on any given topic. It uses web scraping, information gathering, and LLM-based synthesis to generate detailed research reports.
What is GPT-Researcher?
GPT-Researcher is an open-source library that automates the research process by:
Generating Research Plans: Creates structured research plans based on queries
Web Scraping: Extracts information from multiple web sources
Information Aggregation: Combines information from various sources
Report Generation: Synthesizes findings into comprehensive reports
Citation Support: Includes source citations in generated reports
Monitored Sources
GL Open DeepResearch uses the open-source GPT Researcher implementation. The following are the main references for the upstream project:
Official GPT Researcher homepage and product overview
Official documentation (getting started, concepts, and guides)
How It Works
Research Process
GPT-Researcher follows a structured research workflow, outlined below.
Key Components
GPTResearcher: Main research orchestrator
Retriever: Handles web search and result retrieval
Scraper: Extracts content from web pages
LLM Provider: Generates research plans and reports
Report Generator: Formats final research output
Research Flow
Initialization: GPTResearcher initialized with query and configuration
Query Generation: System generates search queries from research question
Web Search: Retriever searches web for relevant sources
Content Extraction: Scraper fetches and extracts content from URLs
Analysis: LLM analyzes gathered content and extracts key information
Synthesis: Information synthesized into structured context
Report Generation: Final report generated from synthesized context
Integration
Integration Approach
We install the official gpt-researcher package from PyPI via pip and keep our custom modifications in the gl_deep_research/packages/gpt_researcher/ package folder. The customizations are applied through runtime patching (monkey patching), which extends the library's functionality without modifying the installed package directly.
The integration follows the Adapter pattern:
GPTResearcherAdapter implements the OrchestratorAdapter protocol
The adapter bridges the GPT-Researcher engine to the orchestrator
Profile-based configuration determines provider selection
Streaming support via adapter-specific postprocessors
This approach allows us to:
Use the official gpt-researcher package from PyPI
Contribute our custom changes through patches in packages/gpt_researcher/
Extend functionality without forking or modifying the original package
Apply patches at runtime during adapter initialization
Maintain compatibility with upstream package updates
Integrate seamlessly with the orchestrator system
Adapter Layer
The GPT-Researcher provider is integrated through GPTResearcherAdapter (research/adapter/gpt_researcher_adapter.py):
Initialization Process
Patch Initialization: Applies all necessary patches to the GPT-Researcher library
Adapter Creation: Creates a GPTResearcherAdapter instance
Orchestrator Registration: Adapter registered with OrchestratorFactory
Adapter Ready: Adapter ready to accept research requests via the orchestrator
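The initialization steps above can be sketched as a centralized patch initializer. This is an illustrative sketch, not the real module: the patch function names, the APPLIED list, and initialize_patches() are placeholders standing in for the actual patch_initializer.py contents.

```python
# Hypothetical sketch of a centralized patch initializer: each patch function
# mutates the gpt-researcher library in place, order matters, and the
# initializer is idempotent so the adapter can safely call it on every startup.
# All names here are illustrative placeholders, not the real module API.
APPLIED: list[str] = []

def apply_extended_deep_research() -> None:
    APPLIED.append("extended_deep_research")

def apply_smart_search_retriever() -> None:
    APPLIED.append("smart_search_retriever")

def apply_smart_search_scraper() -> None:
    APPLIED.append("smart_search_scraper")

def apply_sea_lion_llm() -> None:
    APPLIED.append("sea_lion_llm")

_initialized = False

def initialize_patches() -> None:
    """Apply all patches exactly once, in dependency order."""
    global _initialized
    if _initialized:          # idempotent: repeated calls are no-ops
        return
    for patch in (
        apply_extended_deep_research,   # base functionality extension first
        apply_smart_search_retriever,
        apply_smart_search_scraper,
        apply_sea_lion_llm,
    ):
        patch()
    _initialized = True
```

Making the initializer idempotent means each adapter instance can call it unconditionally without double-applying patches.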
Request Flow
Request received via the task or taskgroup API (POST /v1/tasks or POST /v1/taskgroup)
Request authenticated via API key (account or master key)
Profile loaded from database based on the profile parameter
Orchestrator creates adapter instance via OrchestratorFactory
Temporary configuration file created with custom retriever/scraper
GPTResearcher initialized with configuration
Research conducted via conduct_research()
Report generated via write_report()
Result formatted as DeepResearchResult
Response returned through orchestrator to router
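The core of this flow can be sketched offline. The conduct_research()/write_report() call sequence matches the flow above; the StubEngine, the simplified DeepResearchResult dataclass, and run_research() are illustrative assumptions, not the real adapter code.

```python
# Illustrative sketch of the adapter's engine-driving step: conduct research,
# write the report, and wrap the output in a result object. A stub engine
# replaces GPTResearcher so the sketch runs without network access or keys.
import asyncio
from dataclasses import dataclass

@dataclass
class DeepResearchResult:          # simplified stand-in for the real result type
    query: str
    report: str
    success: bool

class StubEngine:
    """Offline stand-in for GPTResearcher."""
    def __init__(self, query: str):
        self.query = query
        self.context: list[str] = []

    async def conduct_research(self) -> None:
        self.context = [f"finding about {self.query}"]

    async def write_report(self) -> str:
        return f"Report on {self.query}: " + "; ".join(self.context)

async def run_research(query: str, engine_cls=StubEngine) -> DeepResearchResult:
    engine = engine_cls(query)
    try:
        await engine.conduct_research()       # gather sources and context
        report = await engine.write_report()  # synthesize the final report
        return DeepResearchResult(query=query, report=report, success=True)
    except Exception:
        # research failures are caught and returned as failed results
        return DeepResearchResult(query=query, report="", success=False)

result = asyncio.run(run_research("solar energy"))
```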
Customizations and Changes
Package Location
The GPT-Researcher customization code is located in gl_deep_research/packages/gpt_researcher/, which contains patches and extensions to the original GPT-Researcher library. The base gpt-researcher package is installed via pip, and our patches are applied at runtime to extend its functionality.
Key Changes Made
1. Patch Initializer (patch_initializer.py)
Purpose: Centralized patch management for all GPT-Researcher modifications.
Functionality:
Applies all patches in correct order
Ensures patches are applied before using GPT-Researcher
Called during adapter initialization
Patches Applied:
Extended Deep Research patch
Smart Search Retriever patch
Smart Search Scraper patch
Sea Lion LLM patch
2. Smart Search Retriever Patch (patch/smart_search_retriever.py)
Purpose: Replaces default retriever with Smart Search SDK integration.
Changes:
Custom Retriever Class: SmartSearchRetriever replaces CustomRetriever
Smart Search Integration: Uses WebSearchClient from the Smart Search SDK
Result Formatting: Formats results to match GPT-Researcher's expected structure
Async Support: Handles async operations with thread pool execution
Key Features:
Integration:
Patches gpt_researcher.retrievers.CustomRetriever
Activated via RETRIEVER: "custom" in configuration
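A retriever of this shape might look like the sketch below. The WebSearchClient method signature and the exact result schema are assumptions (a fake client stands in for the SDK); the href/body output keys reflect the dict structure gpt-researcher retrievers commonly return, hedged rather than guaranteed.

```python
# Hedged sketch of a Smart Search-backed retriever: the async SDK call is run
# to completion on a worker thread (thread pool execution, as described above),
# and raw results are reformatted into the structure gpt-researcher expects.
import asyncio
from concurrent.futures import ThreadPoolExecutor

class FakeWebSearchClient:
    """Offline stand-in for the Smart Search SDK client (assumed API)."""
    async def search(self, query: str, limit: int):
        return [{"url": f"https://example.com/{i}", "snippet": f"{query} result {i}"}
                for i in range(limit)]

class SmartSearchRetriever:
    def __init__(self, query: str, client=None):
        self.query = query
        self.client = client or FakeWebSearchClient()

    def search(self, max_results: int = 5):
        # gpt-researcher calls retrievers synchronously, so run the async SDK
        # call on a worker thread with its own event loop
        with ThreadPoolExecutor(max_workers=1) as pool:
            raw = pool.submit(
                asyncio.run, self.client.search(self.query, max_results)
            ).result()
        # reformat into gpt-researcher's expected result structure
        return [{"href": r["url"], "body": r["snippet"]} for r in raw]
```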
3. Smart Search Scraper Patch (patch/smart_search_scraper.py)
Purpose: Adds Smart Search SDK scraper support to GPT-Researcher.
Changes:
Custom Scraper Class: SmartSearchScraper for web page fetching
Smart Search Integration: Uses WebSearchClient.fetch_web_page()
Content Extraction: Extracts markdown content and metadata
Error Handling: Graceful fallback on scraping failures
Key Features:
Integration:
Patches gpt_researcher.scraper.scraper.Scraper.get_scraper()
Activated via SCRAPER: "smart_search" in configuration
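The get_scraper() patch can be sketched as a wrapper that intercepts one configuration key and falls through otherwise. The classes below are simplified stand-ins mirroring (but not reproducing) the real gpt_researcher ones.

```python
# Hedged sketch of the scraper patch mechanism: wrap the library's
# get_scraper() so the extra "smart_search" key maps to our scraper class,
# while every other key keeps the original behavior (graceful fallback).
class BeautifulSoupScraper:      # placeholder for a stock scraper
    pass

class SmartSearchScraper:        # our Smart Search-backed scraper (stand-in)
    pass

class Scraper:                   # simplified stand-in for gpt_researcher's Scraper
    def __init__(self, scraper: str = "bs"):
        self.scraper = scraper

    def get_scraper(self, link: str):
        return BeautifulSoupScraper

def patch_scraper() -> None:
    original = Scraper.get_scraper   # keep a reference to the original

    def patched(self, link: str):
        if self.scraper == "smart_search":
            return SmartSearchScraper
        return original(self, link)  # fall through to stock scrapers

    Scraper.get_scraper = patched

patch_scraper()
```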
4. Extended Deep Research Patch (patch/extended_deep_research.py)
Purpose: Extends deep research functionality with additional parameters.
Changes:
Parameter Support: Adds support for query_domains, source_urls, document_urls, and complement_source_urls
Enhanced Researcher Creation: Passes all parameters to sub-researchers
Recursive Research: Maintains parameter context through recursive calls
Key Enhancements:
Integration:
Patches gpt_researcher.skills.deep_research.DeepResearchSkill.deep_research
Automatically applied during patch initialization
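The key idea, threading extra parameters through recursive sub-researcher creation, can be shown with simplified stand-ins. The class shapes and recursion below are illustrative assumptions, not the real DeepResearchSkill implementation.

```python
# Illustrative sketch of the extended deep research behavior: the skill passes
# its extra parameters (here query_domains and source_urls) to every
# sub-researcher it spawns, so the parameter context survives recursion.
class SubResearcher:
    def __init__(self, query, query_domains=None, source_urls=None):
        self.query = query
        self.query_domains = query_domains or []
        self.source_urls = source_urls or []

class DeepResearchSkill:
    def __init__(self, query_domains=None, source_urls=None):
        self.query_domains = query_domains
        self.source_urls = source_urls

    def deep_research(self, query, depth=2):
        if depth == 0:
            return []
        sub = SubResearcher(
            query,
            query_domains=self.query_domains,  # parameter context is
            source_urls=self.source_urls,      # re-passed at every level
        )
        return [sub] + self.deep_research(query + " (follow-up)", depth - 1)

skill = DeepResearchSkill(query_domains=["example.org"],
                          source_urls=["https://example.org/a"])
subs = skill.deep_research("ocean currents")
```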
5. Sea Lion LLM Patch (patch/extended_llm_configs.py)
Purpose: Adds Sea Lion LLM provider support to GPT-Researcher.
Changes:
Provider Registration: Adds "sea_lion" to supported providers
OpenAI Compatibility: Uses ChatOpenAI with a custom base URL
Configuration: Reads from SEA_LION_BASE_URL and SEA_LION_API_KEY
Key Features:
Integration:
Patches gpt_researcher.llm_provider.generic.base.GenericLLMProvider.from_provider
Adds "sea_lion" to the _SUPPORTED_PROVIDERS set
Configuration
The adapter creates a temporary configuration file for each research request:
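A per-request temporary config might be managed like this. The RETRIEVER/SCRAPER keys match the activation keys described above; the context-manager shape, file format, and REPORT_FORMAT key are illustrative assumptions.

```python
# Hedged sketch of per-request temporary configuration: write a JSON config
# selecting the custom retriever and scraper, hand its path to the engine,
# and clean the file up automatically when the request finishes.
import json
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def temp_research_config(**overrides):
    config = {"RETRIEVER": "custom", "SCRAPER": "smart_search", **overrides}
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(config, fh)
        yield path                 # the engine would load config from this path
    finally:
        os.unlink(path)            # cleaned up per request

with temp_research_config(REPORT_FORMAT="APA") as cfg_path:
    with open(cfg_path) as fh:
        loaded = json.load(fh)
```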
Patch Application Order
Patches are applied in a specific order to ensure dependencies are met:
Extended Deep Research: Base functionality extension
Smart Search Retriever: Replaces default retriever
Smart Search Scraper: Adds scraper support
Sea Lion LLM: Adds LLM provider support
Configuration
Required Environment Variables
Report Configuration
Usage
Basic Usage
Use the task or taskgroup API with a profile. The profile determines the provider and configuration:
Use the returned taskgroup_id and task IDs to stream results (GET /v1/taskgroup/{id}/stream) or to poll for status and the final result (GET /v1/tasks/{id}). See the Quick Start Guide.
Request Format
For a taskgroup, send query and profile (as form data or JSON, per the API contract). For a single task, send the same fields via POST /v1/tasks.
Note: Profile-specific options (like report_type, focus, max_sources, etc.) are configured in the profile itself, not in the request. See Research Profiles for more information.
Response Format
Technical Details
Patch Mechanism
All patches use Python's monkey patching to modify GPT-Researcher at runtime:
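A minimal illustration of the mechanism, using a stand-in class rather than the real gpt_researcher modules: keep a reference to the original attribute, replace it with a wrapper, and fall back to the original for unchanged behavior.

```python
# Minimal monkey-patching illustration: the target class here is a stand-in,
# not a real gpt_researcher class. The patch is applied once at startup,
# before any instances are used.
class Retriever:                     # stand-in for a library class
    def search(self, query: str) -> str:
        return f"default results for {query}"

_original_search = Retriever.search  # keep the original for fallback

def patched_search(self, query: str) -> str:
    if query.startswith("smart:"):   # extended behavior added by the patch
        return f"smart results for {query[6:]}"
    return _original_search(self, query)

Retriever.search = patched_search    # runtime replacement on the class
```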
Thread Safety
Retriever: Uses a thread pool executor for async operations
Scraper: Uses asyncio.run() for async page fetching
Adapter: Stateless; supports concurrent requests
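The two async-bridging strategies above can be compared directly, with a dummy coroutine standing in for real SDK calls (the function names here are illustrative).

```python
# Sketch of the two async bridging strategies: asyncio.run() for one-off sync
# calls (scraper style), and a worker thread with its own event loop when the
# calling thread may already be running a loop (retriever style).
import asyncio
from concurrent.futures import ThreadPoolExecutor

async def fetch_page(url: str) -> str:
    await asyncio.sleep(0)               # stands in for a network round trip
    return f"<markdown for {url}>"

def fetch_sync(url: str) -> str:
    # fine when no event loop is running in this thread
    return asyncio.run(fetch_page(url))

def fetch_from_async_context(url: str) -> str:
    # runs the coroutine on a worker thread with its own event loop, so it
    # works even if the caller's thread already hosts a running loop
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, fetch_page(url)).result()
```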
Error Handling
Patch Failures: Logged but don't prevent adapter initialization
Research Failures: Caught and returned as failed results
Scraping Errors: Gracefully handled with empty content
Retriever Errors: Returns None, handled by GPT-Researcher
Performance Considerations
Temporary Config Files: Created per request, cleaned up automatically
Async Operations: Smart Search SDK calls are async
Concurrent Research: GPT-Researcher handles concurrency internally
Report Caching: Not implemented; each request generates a fresh report