[experiment] generate a single final report instead of separate sections #114


Draft: wants to merge 11 commits into base `vb/evals-and-improvements`
94 changes: 94 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,94 @@
# Open Deep Research

## About Open Deep Research

Open Deep Research is an experimental, fully open-source research assistant that automates deep research and produces comprehensive reports on any topic. It's designed to help researchers, analysts, and curious individuals generate detailed, well-sourced reports without the overhead of manual research.

### Key Features
- **Automated Research**: Searches multiple sources (web, academic papers, specialized databases)
- **Comprehensive Reports**: Generates structured markdown reports with proper citations
- **Multiple Search APIs**: Supports Tavily, Perplexity, Exa, ArXiv, PubMed, DuckDuckGo, and more
- **Flexible Models**: Compatible with any LLM that supports the `init_chat_model()` API (a minimal sketch follows this list)
- **Quality Evaluation**: Built-in evaluation systems to assess report quality
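
Because model choices are plain provider-prefixed strings, swapping models is a one-line change. A minimal sketch (the specific model names here are illustrative, not required defaults):

```python
from langchain.chat_models import init_chat_model

# Any "provider:model" string supported by init_chat_model can be used,
# e.g. for planner/writer or supervisor/researcher roles.
planner = init_chat_model("anthropic:claude-3-7-sonnet-latest")
writer = init_chat_model("openai:gpt-4.1")

# The returned objects are standard LangChain chat models.
response = writer.invoke("Summarize the goal of Open Deep Research in one sentence.")
print(response.content)
```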

## Two Research Implementations

Open Deep Research offers two distinct approaches to automated research, each with unique advantages:

### 1. Graph-based Workflow Implementation

The **graph-based implementation** (`src/open_deep_research/graph.py`) follows a structured plan-and-execute workflow:

**Characteristics:**
- **Interactive Planning**: Uses a planner model to generate a structured report outline
- **Human-in-the-Loop**: Allows review and feedback on the report plan before execution
- **Sequential Process**: Creates sections one by one with reflection between iterations
- **Quality Focus**: Emphasizes report accuracy and structure through iterative refinement

**Best for:**
- High-stakes research where accuracy is critical
- Reports requiring specific structure or customization
- Situations where you want control over the research process
- Academic or professional research contexts

### 2. Multi-Agent Implementation

The **multi-agent implementation** (`src/open_deep_research/multi_agent.py`) uses a supervisor-researcher architecture:

**Characteristics:**
- **Supervisor Agent**: Manages overall research process and assembles final report
- **Parallel Research**: Multiple researcher agents work simultaneously on different sections
- **Speed Optimized**: Significantly faster due to parallel processing
- **Tool Specialization**: Each agent has specific tools for their role

**Best for:**
- Quick research and rapid report generation
- Exploratory research where speed matters
- Situations with less need for human oversight
- Business intelligence and market research

## Quality Evaluation

This guide explains how to quickly test and evaluate the quality of reports generated by Open Deep Research using the pytest evaluation system, which provides an easy way to:
- Test both research agent implementations (multi-agent and graph-based)
- Get immediate visual feedback with rich console output
- Verify report quality against 9 comprehensive criteria
- Compare different model configurations
- Track results in LangSmith for analysis

### Test Specific Agent
```bash
# Test only the multi-agent implementation
python tests/run_test.py --agent multi_agent

# Test only the graph-based implementation
python tests/run_test.py --agent graph
```

## Understanding the Output

### Console Output
The evaluation provides rich visual feedback including:

1. **Test Configuration Panel**: Shows which agent and search API are being tested
2. **Model Configuration Table**: Displays all model settings in a formatted table
3. **Report Generation Status**: Real-time feedback during report creation
4. **Generated Report Display**: Full report rendered in markdown format
5. **Evaluation Results**:
- **PASSED/FAILED** status in color-coded panel
- **Report Structure Analysis**: Table showing section headers
- **Evaluation Justification**: Detailed explanation from the evaluator

### What Gets Evaluated

The system checks reports against 9 quality criteria (a grading sketch follows the list):

1. **Topic Relevance (Overall)**: Does the report address the input topic thoroughly?
2. **Section Relevance (Critical)**: Are all sections directly relevant to the main topic?
3. **Structure and Flow**: Do sections flow logically and create a cohesive narrative?
4. **Introduction Quality**: Does the introduction provide context and scope?
5. **Conclusion Quality**: Does the conclusion summarize key findings?
6. **Structural Elements**: Proper use of tables, lists, etc.
7. **Section Headers**: Correct Markdown formatting (# for title, ## for sections)
8. **Citations**: Proper source citation in each main body section
9. **Overall Quality**: Well-researched, accurate, and professionally written
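
A minimal sketch of how a binary pass/fail grade over these criteria could be produced with a structured-output judge. The `ReportGrade` schema, model choice, and prompt wording are illustrative; the repo's actual evaluator lives under `tests/`:

```python
from pydantic import BaseModel, Field
from langchain.chat_models import init_chat_model

class ReportGrade(BaseModel):
    """Binary verdict plus justification, mirroring the PASSED/FAILED output above."""
    passes: bool = Field(description="True only if the report meets all criteria")
    justification: str = Field(description="Explanation referencing specific criteria")

# Hypothetical judge model; the real evaluator prompt/model are configured in tests/.
grader = init_chat_model("anthropic:claude-3-7-sonnet-latest").with_structured_output(ReportGrade)

def grade_report(topic: str, report: str) -> ReportGrade:
    # In practice the 9 criteria above are spelled out in detail in the evaluation prompt.
    prompt = (
        f"Evaluate this report on '{topic}' against these criteria: topic relevance, "
        "section relevance, structure and flow, introduction quality, conclusion quality, "
        "structural elements, section headers, citations, overall quality.\n\n"
        f"{report}"
    )
    return grader.invoke(prompt)
```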
117 changes: 111 additions & 6 deletions README.md
@@ -153,6 +153,17 @@ The multi-agent implementation uses a supervisor-researcher architecture:

This implementation focuses on efficiency and parallelization, making it ideal for faster report generation with less direct user involvement.

You can customize the multi-agent implementation through several parameters (a minimal invocation sketch follows this list):

- `supervisor_model`: Model for the supervisor agent (default: "anthropic:claude-3-5-sonnet-latest")
- `researcher_model`: Model for researcher agents (default: "anthropic:claude-3-5-sonnet-latest")
- `number_of_queries`: Number of search queries to generate per section (default: 2)
- `search_api`: API to use for web searches (default: "tavily", options include "duckduckgo", "none")
- `ask_for_clarification`: Whether the supervisor should ask clarifying questions before research (default: false) - **Important**: Set to `true` to enable the Question tool for the supervisor agent
- `mcp_server_config`: Configuration for MCP servers (optional)
- `mcp_prompt`: Additional instructions for using MCP tools (optional)
- `mcp_tools_to_include`: Specific MCP tools to include (optional)
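
These parameters are passed at invocation time via the `configurable` key of a LangGraph `RunnableConfig`. A minimal sketch, assuming the compiled multi-agent graph is exported as `graph` and the finished report lands in a `final_report` state key:

```python
import asyncio

# Assumption: the compiled multi-agent graph is exported as `graph`
# from src/open_deep_research/multi_agent.py.
from open_deep_research.multi_agent import graph

config = {
    "configurable": {
        "supervisor_model": "anthropic:claude-3-7-sonnet-latest",
        "researcher_model": "anthropic:claude-3-7-sonnet-latest",
        "number_of_queries": 2,
        "search_api": "tavily",
        "ask_for_clarification": True,  # enables the Question tool for the supervisor
    }
}

async def main():
    result = await graph.ainvoke(
        {"messages": [{"role": "user", "content": "Give me a report on MCP servers"}]},
        config=config,
    )
    # Assumption: the finished report is stored under a "final_report" key.
    print(result.get("final_report"))

asyncio.run(main())
```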

## MCP (Model Context Protocol) Support

The multi-agent implementation (`src/open_deep_research/multi_agent.py`) supports MCP servers to extend research capabilities beyond web search. MCP tools are available to research agents alongside or instead of traditional search tools, enabling access to local files, databases, APIs, and other data sources.
@@ -211,7 +222,13 @@ MCP server config:

MCP prompt:
```
Step 1: Use the `list_allowed_directories` tool to get the list of allowed directories. Step 2: Use the `read_file` tool to read files in the allowed directory.
CRITICAL: You MUST follow this EXACT sequence when using filesystem tools:

1. FIRST: Call `list_allowed_directories` tool to discover allowed directories
2. SECOND: Call `list_directory` tool on a specific directory from step 1 to see available files
3. THIRD: Call `read_file` tool to read specific files found in step 2

DO NOT call `list_directory` or `read_file` until you have first called `list_allowed_directories`. You must discover the allowed directories before attempting to browse or read files.
```

MCP tools:
@@ -221,7 +238,7 @@ list_directory
read_file
```

Example test case that you can provide:
Example test topic and follow-up feedback you can provide that reference the included file:

Topic:
```
@@ -297,14 +314,41 @@ groq.APIError: Failed to call a function. Please adjust your prompt. See 'failed

(7) For working with local models via Ollama, see [here](https://github.com/langchain-ai/open_deep_research/issues/65#issuecomment-2743586318).

## Testing Report Quality
## Evaluation Systems

Open Deep Research includes two comprehensive evaluation systems to assess report quality and performance:

To compare the quality of reports generated by both implementations:
### 1. Pytest-based Evaluation System

A developer-friendly testing framework that provides immediate feedback during development and testing cycles.

#### **Features:**
- **Rich Console Output**: Formatted tables, progress indicators, and color-coded results
- **Binary Pass/Fail Testing**: Clear success/failure criteria for CI/CD integration
- **LangSmith Integration**: Automatic experiment tracking and logging
- **Flexible Configuration**: Extensive CLI options for different testing scenarios
- **Real-time Feedback**: Live output during test execution

#### **Evaluation Criteria:**
The system evaluates reports against 9 comprehensive quality dimensions:
- Topic relevance (overall and section-level)
- Structure and logical flow
- Introduction and conclusion quality
- Proper use of structural elements (headers, citations)
- Markdown formatting compliance
- Citation quality and source attribution
- Overall research depth and accuracy

#### **Usage:**
```bash
# Test with default Anthropic models
# Run all agents with default settings
python tests/run_test.py --all

# Test specific agent with custom models
python tests/run_test.py --agent multi_agent \
--supervisor-model "anthropic:claude-3-7-sonnet-latest" \
--search-api tavily

# Test with OpenAI o3 models
python tests/run_test.py --all \
--supervisor-model "openai:o3" \
@@ -317,7 +361,68 @@ python tests/run_test.py --all \
--search-api "tavily"
```

The test results will be logged to LangSmith, allowing you to compare the quality of reports generated by each implementation with different model configurations.
#### **Key Files:**
- `tests/run_test.py`: Main test runner with rich CLI interface
- `tests/test_report_quality.py`: Core test implementation
- `tests/conftest.py`: Pytest configuration and CLI options

### 2. LangSmith Evaluate API System

A comprehensive batch evaluation system designed for detailed analysis and comparative studies.

#### **Features:**
- **Multi-dimensional Scoring**: Four specialized evaluators with 1-5 scale ratings
- **Weighted Criteria**: Detailed scoring with customizable weights for different quality aspects
- **Dataset-driven Evaluation**: Batch processing across multiple test cases
- **Performance Optimization**: Caching with extended TTL for evaluator prompts
- **Professional Reporting**: Structured analysis with improvement recommendations

#### **Evaluation Dimensions:**

1. **Overall Quality** (7 weighted criteria; a scoring sketch follows this list):
- Research depth and source quality (20%)
- Analytical rigor and critical thinking (15%)
- Structure and organization (20%)
- Practical value and actionability (10%)
- Balance and objectivity (15%)
- Writing quality and clarity (10%)
- Professional presentation (10%)

2. **Relevance**: Section-by-section topic relevance analysis with strict criteria

3. **Structure**: Assessment of logical flow, formatting, and citation practices

4. **Groundedness**: Evaluation of alignment with retrieved context and sources
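
To make the weighting concrete, here is a small sketch of how the seven per-criterion ratings (each on the 1–5 scale) could be combined into a single overall-quality score. The weights come from the list above; the key names and function are illustrative:

```python
# Weights for the seven overall-quality criteria (must sum to 1.0).
OVERALL_QUALITY_WEIGHTS = {
    "research_depth": 0.20,
    "analytical_rigor": 0.15,
    "structure_organization": 0.20,
    "practical_value": 0.10,
    "balance_objectivity": 0.15,
    "writing_quality": 0.10,
    "professional_presentation": 0.10,
}

def overall_quality_score(criterion_scores: dict[str, float]) -> float:
    """Combine per-criterion 1-5 ratings into a weighted 1-5 composite."""
    assert abs(sum(OVERALL_QUALITY_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(
        OVERALL_QUALITY_WEIGHTS[name] * criterion_scores[name]
        for name in OVERALL_QUALITY_WEIGHTS
    )

# Example: strong research and structure, weaker practical value.
print(overall_quality_score({
    "research_depth": 5, "analytical_rigor": 4, "structure_organization": 5,
    "practical_value": 3, "balance_objectivity": 4, "writing_quality": 4,
    "professional_presentation": 5,
}))  # -> 4.4
```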

#### **Usage:**
```bash
# Run comprehensive evaluation on LangSmith datasets
python tests/evals/run_evaluate.py
```
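
Conceptually, `run_evaluate.py` wires a report-generating target function and the four evaluators into LangSmith's `evaluate` API. A rough sketch with placeholder target and evaluator bodies (the dataset name and return values are illustrative):

```python
from langsmith import Client

client = Client()

def target(inputs: dict) -> dict:
    # Placeholder: in the repo this calls one of the two research agents
    # (see tests/evals/target.py) to produce a report for the dataset topic.
    return {"report": f"# Report on {inputs['topic']}\n\n..."}

def overall_quality(inputs: dict, outputs: dict) -> dict:
    # Placeholder: the real evaluator prompts an LLM judge with the weighted
    # criteria above and returns a 1-5 score (see tests/evals/evaluators.py).
    return {"key": "overall_quality", "score": 4.0}

client.evaluate(
    target,
    data="deep-research-topics",       # illustrative LangSmith dataset name
    evaluators=[overall_quality],      # the repo uses four evaluators
    experiment_prefix="open-deep-research",
)
```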

#### **Key Files:**
- `tests/evals/run_evaluate.py`: Main evaluation script
- `tests/evals/evaluators.py`: Four specialized evaluator functions
- `tests/evals/prompts.py`: Detailed evaluation prompts for each dimension
- `tests/evals/target.py`: Report generation workflows

### When to Use Each System

**Use Pytest System for:**
- Development and debugging cycles
- CI/CD pipeline integration
- Quick model comparison experiments
- Interactive testing with immediate feedback
- Gate-keeping before production deployments

**Use LangSmith System for:**
- Comprehensive model evaluation across datasets
- Research and analysis of system performance
- Detailed performance profiling and benchmarking
- Comparative studies between different configurations
- Production monitoring and quality assurance

Both evaluation systems complement each other and provide comprehensive coverage for different use cases and development stages.

## UX

11 changes: 8 additions & 3 deletions pyproject.toml
@@ -12,7 +12,7 @@ dependencies = [
"langgraph>=0.2.55",
"langchain-community>=0.3.9",
"langchain-openai>=0.3.7",
"langchain-anthropic>=0.3.9",
"langchain-anthropic>=0.3.15",
"langchain-mcp-adapters>=0.1.6",
"langchain-deepseek>=0.1.2",
"langchain-tavily",
@@ -33,7 +33,11 @@ dependencies = [
"markdownify>=0.11.6",
"azure-identity>=1.21.0",
"azure-search>=1.0.0b2",
"azure-search-documents>=11.5.2"
"azure-search-documents>=11.5.2",
"rich>=13.0.0",
"langgraph-cli[inmem]>=0.3.1",
"langsmith>=0.3.37",
"langchain-core>=0.3.64",
]

[project.optional-dependencies]
@@ -44,10 +48,11 @@ requires = ["setuptools>=73.0.0", "wheel"]
build-backend = "setuptools.build_meta"

[tool.setuptools]
packages = ["open_deep_research"]
packages = ["open_deep_research", "tests"]

[tool.setuptools.package-dir]
"open_deep_research" = "src/open_deep_research"
"tests" = "tests"

[tool.setuptools.package-data]
"*" = ["py.typed"]
70 changes: 46 additions & 24 deletions src/open_deep_research/configuration.py
@@ -29,49 +29,68 @@ class SearchAPI(Enum):
NONE = "none"

@dataclass(kw_only=True)
class Configuration:
"""The configurable fields for the chatbot."""
class WorkflowConfiguration:
"""Configuration for the workflow/graph-based implementation (graph.py)."""
# Common configuration
report_structure: str = DEFAULT_REPORT_STRUCTURE # Defaults to the default report structure
search_api: SearchAPI = SearchAPI.TAVILY # Default to TAVILY
report_structure: str = DEFAULT_REPORT_STRUCTURE
search_api: SearchAPI = SearchAPI.TAVILY
search_api_config: Optional[Dict[str, Any]] = None
process_search_results: Literal["summarize", "split_and_rerank"] | None = None
# Summarization model for summarizing search results
# will be used if summarize_search_results is True
summarization_model_provider: str = "anthropic"
summarization_model: str = "claude-3-5-haiku-latest"
# Whether to include search results string in the agent output state
# This is used for evaluation purposes only
include_source_str: bool = False

# Graph-specific configuration
# Workflow-specific configuration
number_of_queries: int = 2 # Number of search queries to generate per iteration
max_search_depth: int = 2 # Maximum number of reflection + search iterations
planner_provider: str = "anthropic" # Defaults to Anthropic as provider
planner_model: str = "claude-3-7-sonnet-latest" # Defaults to claude-3-7-sonnet-latest
planner_model_kwargs: Optional[Dict[str, Any]] = None # kwargs for planner_model
writer_provider: str = "anthropic" # Defaults to Anthropic as provider
writer_model: str = "claude-3-5-sonnet-latest" # Defaults to claude-3-5-sonnet-latest
writer_model_kwargs: Optional[Dict[str, Any]] = None # kwargs for writer_model
planner_provider: str = "anthropic"
planner_model: str = "claude-3-7-sonnet-latest"
planner_model_kwargs: Optional[Dict[str, Any]] = None
writer_provider: str = "anthropic"
writer_model: str = "claude-3-7-sonnet-latest"
writer_model_kwargs: Optional[Dict[str, Any]] = None

@classmethod
def from_runnable_config(
cls, config: Optional[RunnableConfig] = None
) -> "WorkflowConfiguration":
"""Create a WorkflowConfiguration instance from a RunnableConfig."""
configurable = (
config["configurable"] if config and "configurable" in config else {}
)
values: dict[str, Any] = {
f.name: os.environ.get(f.name.upper(), configurable.get(f.name))
for f in fields(cls)
if f.init
}
return cls(**{k: v for k, v in values.items() if v})

@dataclass(kw_only=True)
class MultiAgentConfiguration:
"""Configuration for the multi-agent implementation (multi_agent.py)."""
# Common configuration
search_api: SearchAPI = SearchAPI.TAVILY
search_api_config: Optional[Dict[str, Any]] = None
process_search_results: Literal["summarize", "split_and_rerank"] | None = None
summarization_model_provider: str = "anthropic"
summarization_model: str = "claude-3-5-haiku-latest"

# Multi-agent specific configuration
supervisor_model: str = "openai:gpt-4.1" # Model for supervisor agent in multi-agent setup
researcher_model: str = "openai:gpt-4.1" # Model for research agents in multi-agent setup
number_of_queries: int = 2 # Number of search queries to generate per section
supervisor_model: str = "anthropic:claude-3-7-sonnet-latest"
researcher_model: str = "anthropic:claude-3-7-sonnet-latest"
final_report_model: str = "openai:gpt-4.1"
ask_for_clarification: bool = False # Whether to ask for clarification from the user
# MCP server configuration for multi-agent setup
# see examples here: https://github.com/langchain-ai/langchain-mcp-adapters#client-1
# MCP server configuration
mcp_server_config: Optional[Dict[str, Any]] = None
# optional prompt to append to the researcher agent prompt
mcp_prompt: Optional[str] = None
# optional list of MCP tool names to include in the researcher agent
# if not set, all MCP tools across all servers in the config will be included
mcp_tools_to_include: Optional[list[str]] = None

@classmethod
def from_runnable_config(
cls, config: Optional[RunnableConfig] = None
) -> "Configuration":
"""Create a Configuration instance from a RunnableConfig."""
) -> "MultiAgentConfiguration":
"""Create a MultiAgentConfiguration instance from a RunnableConfig."""
configurable = (
config["configurable"] if config and "configurable" in config else {}
)
@@ -81,3 +100,6 @@ def from_runnable_config(
if f.init
}
return cls(**{k: v for k, v in values.items() if v})

# Keep the old Configuration class for backward compatibility
Configuration = WorkflowConfiguration