[experiment] generate a single final report instead of separate sections #114


Draft: wants to merge 11 commits into base `vb/evals-and-improvements`
94 changes: 94 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,94 @@
# Open Deep Research

## About Open Deep Research

Open Deep Research is an experimental, fully open-source research assistant that automates deep research and produces comprehensive reports on any topic. It's designed to help researchers, analysts, and curious individuals generate detailed, well-sourced reports without the overhead of manual research.

### Key Features
- **Automated Research**: Searches multiple sources (web, academic papers, specialized databases)
- **Comprehensive Reports**: Generates structured markdown reports with proper citations
- **Multiple Search APIs**: Supports Tavily, Perplexity, Exa, ArXiv, PubMed, DuckDuckGo, and more
- **Flexible Models**: Compatible with any LLM that supports the `init_chat_model()` API (a minimal sketch follows this list)
- **Quality Evaluation**: Built-in evaluation systems to assess report quality
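
Because model choices are plain provider-prefixed strings, swapping models is a one-line change. A minimal sketch (the specific model names here are illustrative, not required defaults):

```python
from langchain.chat_models import init_chat_model

# Any "provider:model" string supported by init_chat_model can be used,
# e.g. for planner/writer or supervisor/researcher roles.
planner = init_chat_model("anthropic:claude-3-7-sonnet-latest")
writer = init_chat_model("openai:gpt-4.1")

# The returned objects are standard LangChain chat models.
response = writer.invoke("Summarize the goal of Open Deep Research in one sentence.")
print(response.content)
```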

## Two Research Implementations

Open Deep Research offers two distinct approaches to automated research, each with unique advantages:

### 1. Graph-based Workflow Implementation

The **graph-based implementation** (`src/open_deep_research/graph.py`) follows a structured plan-and-execute workflow:

**Characteristics:**
- **Interactive Planning**: Uses a planner model to generate a structured report outline
- **Human-in-the-Loop**: Allows review and feedback on the report plan before execution
- **Sequential Process**: Creates sections one by one with reflection between iterations
- **Quality Focus**: Emphasizes report accuracy and structure through iterative refinement

**Best for:**
- High-stakes research where accuracy is critical
- Reports requiring specific structure or customization
- Situations where you want control over the research process
- Academic or professional research contexts

### 2. Multi-Agent Implementation

The **multi-agent implementation** (`src/open_deep_research/multi_agent.py`) uses a supervisor-researcher architecture:

**Characteristics:**
- **Supervisor Agent**: Manages overall research process and assembles final report
- **Parallel Research**: Multiple researcher agents work simultaneously on different sections
- **Speed Optimized**: Significantly faster due to parallel processing
- **Tool Specialization**: Each agent has specific tools for their role

**Best for:**
- Quick research and rapid report generation
- Exploratory research where speed matters
- Situations with less need for human oversight
- Business intelligence and market research

## Quality Evaluation

This guide explains how to quickly test and evaluate the quality of reports generated by Open Deep Research using the pytest evaluation system, which provides an easy way to:
- Test both research agent implementations (multi-agent and graph-based)
- Get immediate visual feedback with rich console output
- Verify report quality against 9 comprehensive criteria
- Compare different model configurations
- Track results in LangSmith for analysis

### Test Specific Agent
```bash
# Test only the multi-agent implementation
python tests/run_test.py --agent multi_agent

# Test only the graph-based implementation
python tests/run_test.py --agent graph
```

## Understanding the Output

### Console Output
The evaluation provides rich visual feedback including:

1. **Test Configuration Panel**: Shows which agent and search API are being tested
2. **Model Configuration Table**: Displays all model settings in a formatted table
3. **Report Generation Status**: Real-time feedback during report creation
4. **Generated Report Display**: Full report rendered in markdown format
5. **Evaluation Results**:
- **PASSED/FAILED** status in color-coded panel
- **Report Structure Analysis**: Table showing section headers
- **Evaluation Justification**: Detailed explanation from the evaluator

### What Gets Evaluated

The system checks reports against 9 quality criteria (a grading sketch follows the list):

1. **Topic Relevance (Overall)**: Does the report address the input topic thoroughly?
2. **Section Relevance (Critical)**: Are all sections directly relevant to the main topic?
3. **Structure and Flow**: Do sections flow logically and create a cohesive narrative?
4. **Introduction Quality**: Does the introduction provide context and scope?
5. **Conclusion Quality**: Does the conclusion summarize key findings?
6. **Structural Elements**: Proper use of tables, lists, etc.
7. **Section Headers**: Correct Markdown formatting (# for title, ## for sections)
8. **Citations**: Proper source citation in each main body section
9. **Overall Quality**: Well-researched, accurate, and professionally written
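
A minimal sketch of how a binary pass/fail grade over these criteria could be produced with a structured-output judge. The `ReportGrade` schema, model choice, and prompt wording are illustrative; the repo's actual evaluator lives under `tests/`:

```python
from pydantic import BaseModel, Field
from langchain.chat_models import init_chat_model

class ReportGrade(BaseModel):
    """Binary verdict plus justification, mirroring the PASSED/FAILED output above."""
    passes: bool = Field(description="True only if the report meets all criteria")
    justification: str = Field(description="Explanation referencing specific criteria")

# Hypothetical judge model; the real evaluator prompt/model are configured in tests/.
grader = init_chat_model("anthropic:claude-3-7-sonnet-latest").with_structured_output(ReportGrade)

def grade_report(topic: str, report: str) -> ReportGrade:
    # In practice the 9 criteria above are spelled out in detail in the evaluation prompt.
    prompt = (
        f"Evaluate this report on '{topic}' against these criteria: topic relevance, "
        "section relevance, structure and flow, introduction quality, conclusion quality, "
        "structural elements, section headers, citations, overall quality.\n\n"
        f"{report}"
    )
    return grader.invoke(prompt)
```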
117 changes: 111 additions & 6 deletions README.md
@@ -153,6 +153,17 @@ The multi-agent implementation uses a supervisor-researcher architecture:

This implementation focuses on efficiency and parallelization, making it ideal for faster report generation with less direct user involvement.

You can customize the multi-agent implementation through several parameters (a minimal invocation sketch follows this list):

- `supervisor_model`: Model for the supervisor agent (default: "anthropic:claude-3-5-sonnet-latest")
- `researcher_model`: Model for researcher agents (default: "anthropic:claude-3-5-sonnet-latest")
- `number_of_queries`: Number of search queries to generate per section (default: 2)
- `search_api`: API to use for web searches (default: "tavily", options include "duckduckgo", "none")
- `ask_for_clarification`: Whether the supervisor should ask clarifying questions before research (default: false) - **Important**: Set to `true` to enable the Question tool for the supervisor agent
- `mcp_server_config`: Configuration for MCP servers (optional)
- `mcp_prompt`: Additional instructions for using MCP tools (optional)
- `mcp_tools_to_include`: Specific MCP tools to include (optional)
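
These parameters are passed at invocation time via the `configurable` key of a LangGraph `RunnableConfig`. A minimal sketch, assuming the compiled multi-agent graph is exported as `graph` and the finished report lands in a `final_report` state key:

```python
import asyncio

# Assumption: the compiled multi-agent graph is exported as `graph`
# from src/open_deep_research/multi_agent.py.
from open_deep_research.multi_agent import graph

config = {
    "configurable": {
        "supervisor_model": "anthropic:claude-3-7-sonnet-latest",
        "researcher_model": "anthropic:claude-3-7-sonnet-latest",
        "number_of_queries": 2,
        "search_api": "tavily",
        "ask_for_clarification": True,  # enables the Question tool for the supervisor
    }
}

async def main():
    result = await graph.ainvoke(
        {"messages": [{"role": "user", "content": "Give me a report on MCP servers"}]},
        config=config,
    )
    # Assumption: the finished report is stored under a "final_report" key.
    print(result.get("final_report"))

asyncio.run(main())
```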

## MCP (Model Context Protocol) Support

The multi-agent implementation (`src/open_deep_research/multi_agent.py`) supports MCP servers to extend research capabilities beyond web search. MCP tools are available to research agents alongside or instead of traditional search tools, enabling access to local files, databases, APIs, and other data sources.
@@ -211,7 +222,13 @@ MCP server config:

MCP prompt:
```
Step 1: Use the `list_allowed_directories` tool to get the list of allowed directories. Step 2: Use the `read_file` tool to read files in the allowed directory.
CRITICAL: You MUST follow this EXACT sequence when using filesystem tools:

1. FIRST: Call `list_allowed_directories` tool to discover allowed directories
2. SECOND: Call `list_directory` tool on a specific directory from step 1 to see available files
3. THIRD: Call `read_file` tool to read specific files found in step 2

DO NOT call `list_directory` or `read_file` until you have first called `list_allowed_directories`. You must discover the allowed directories before attempting to browse or read files.
```

MCP tools:
@@ -221,7 +238,7 @@ list_directory
read_file
```

Example test case that you can provide:
Example test topic and follow-up feedback you can provide that reference the included file:

Topic:
```
@@ -297,14 +314,41 @@ groq.APIError: Failed to call a function. Please adjust your prompt. See 'failed

(7) For working with local models via Ollama, see [here](https://github.com/langchain-ai/open_deep_research/issues/65#issuecomment-2743586318).

## Testing Report Quality
## Evaluation Systems

Open Deep Research includes two comprehensive evaluation systems to assess report quality and performance:

To compare the quality of reports generated by both implementations:
### 1. Pytest-based Evaluation System

A developer-friendly testing framework that provides immediate feedback during development and testing cycles.

#### **Features:**
- **Rich Console Output**: Formatted tables, progress indicators, and color-coded results
- **Binary Pass/Fail Testing**: Clear success/failure criteria for CI/CD integration
- **LangSmith Integration**: Automatic experiment tracking and logging
- **Flexible Configuration**: Extensive CLI options for different testing scenarios
- **Real-time Feedback**: Live output during test execution

#### **Evaluation Criteria:**
The system evaluates reports against 9 comprehensive quality dimensions:
- Topic relevance (overall and section-level)
- Structure and logical flow
- Introduction and conclusion quality
- Proper use of structural elements (headers, citations)
- Markdown formatting compliance
- Citation quality and source attribution
- Overall research depth and accuracy

#### **Usage:**
```bash
# Test with default Anthropic models
# Run all agents with default settings
python tests/run_test.py --all

# Test specific agent with custom models
python tests/run_test.py --agent multi_agent \
--supervisor-model "anthropic:claude-3-7-sonnet-latest" \
--search-api tavily

# Test with OpenAI o3 models
python tests/run_test.py --all \
--supervisor-model "openai:o3" \
@@ -317,7 +361,68 @@ python tests/run_test.py --all \
--search-api "tavily"
```

The test results will be logged to LangSmith, allowing you to compare the quality of reports generated by each implementation with different model configurations.
#### **Key Files:**
- `tests/run_test.py`: Main test runner with rich CLI interface
- `tests/test_report_quality.py`: Core test implementation
- `tests/conftest.py`: Pytest configuration and CLI options

### 2. LangSmith Evaluate API System

A comprehensive batch evaluation system designed for detailed analysis and comparative studies.

#### **Features:**
- **Multi-dimensional Scoring**: Four specialized evaluators with 1-5 scale ratings
- **Weighted Criteria**: Detailed scoring with customizable weights for different quality aspects
- **Dataset-driven Evaluation**: Batch processing across multiple test cases
- **Performance Optimization**: Caching with extended TTL for evaluator prompts
- **Professional Reporting**: Structured analysis with improvement recommendations

#### **Evaluation Dimensions:**

1. **Overall Quality** (7 weighted criteria; a scoring sketch follows this list):
- Research depth and source quality (20%)
- Analytical rigor and critical thinking (15%)
- Structure and organization (20%)
- Practical value and actionability (10%)
- Balance and objectivity (15%)
- Writing quality and clarity (10%)
- Professional presentation (10%)

2. **Relevance**: Section-by-section topic relevance analysis with strict criteria

3. **Structure**: Assessment of logical flow, formatting, and citation practices

4. **Groundedness**: Evaluation of alignment with retrieved context and sources
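
To make the weighting concrete, here is a small sketch of how the seven per-criterion ratings (each on the 1–5 scale) could be combined into a single overall-quality score. The weights come from the list above; the key names and function are illustrative:

```python
# Weights for the seven overall-quality criteria (must sum to 1.0).
OVERALL_QUALITY_WEIGHTS = {
    "research_depth": 0.20,
    "analytical_rigor": 0.15,
    "structure_organization": 0.20,
    "practical_value": 0.10,
    "balance_objectivity": 0.15,
    "writing_quality": 0.10,
    "professional_presentation": 0.10,
}

def overall_quality_score(criterion_scores: dict[str, float]) -> float:
    """Combine per-criterion 1-5 ratings into a weighted 1-5 composite."""
    assert abs(sum(OVERALL_QUALITY_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(
        OVERALL_QUALITY_WEIGHTS[name] * criterion_scores[name]
        for name in OVERALL_QUALITY_WEIGHTS
    )

# Example: strong research and structure, weaker practical value.
print(overall_quality_score({
    "research_depth": 5, "analytical_rigor": 4, "structure_organization": 5,
    "practical_value": 3, "balance_objectivity": 4, "writing_quality": 4,
    "professional_presentation": 5,
}))  # -> 4.4
```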

#### **Usage:**
```bash
# Run comprehensive evaluation on LangSmith datasets
python tests/evals/run_evaluate.py
```
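
Conceptually, `run_evaluate.py` wires a report-generating target function and the four evaluators into LangSmith's `evaluate` API. A rough sketch with placeholder target and evaluator bodies (the dataset name and return values are illustrative):

```python
from langsmith import Client

client = Client()

def target(inputs: dict) -> dict:
    # Placeholder: in the repo this calls one of the two research agents
    # (see tests/evals/target.py) to produce a report for the dataset topic.
    return {"report": f"# Report on {inputs['topic']}\n\n..."}

def overall_quality(inputs: dict, outputs: dict) -> dict:
    # Placeholder: the real evaluator prompts an LLM judge with the weighted
    # criteria above and returns a 1-5 score (see tests/evals/evaluators.py).
    return {"key": "overall_quality", "score": 4.0}

client.evaluate(
    target,
    data="deep-research-topics",       # illustrative LangSmith dataset name
    evaluators=[overall_quality],      # the repo uses four evaluators
    experiment_prefix="open-deep-research",
)
```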

#### **Key Files:**
- `tests/evals/run_evaluate.py`: Main evaluation script
- `tests/evals/evaluators.py`: Four specialized evaluator functions
- `tests/evals/prompts.py`: Detailed evaluation prompts for each dimension
- `tests/evals/target.py`: Report generation workflows

### When to Use Each System

**Use Pytest System for:**
- Development and debugging cycles
- CI/CD pipeline integration
- Quick model comparison experiments
- Interactive testing with immediate feedback
- Gate-keeping before production deployments

**Use LangSmith System for:**
- Comprehensive model evaluation across datasets
- Research and analysis of system performance
- Detailed performance profiling and benchmarking
- Comparative studies between different configurations
- Production monitoring and quality assurance

Both evaluation systems complement each other and provide comprehensive coverage for different use cases and development stages.

## UX

11 changes: 8 additions & 3 deletions pyproject.toml
@@ -12,7 +12,7 @@ dependencies = [
"langgraph>=0.2.55",
"langchain-community>=0.3.9",
"langchain-openai>=0.3.7",
"langchain-anthropic>=0.3.9",
"langchain-anthropic>=0.3.15",
"langchain-mcp-adapters>=0.1.6",
"langchain-deepseek>=0.1.2",
"langchain-tavily",
@@ -33,7 +33,11 @@ dependencies = [
"markdownify>=0.11.6",
"azure-identity>=1.21.0",
"azure-search>=1.0.0b2",
"azure-search-documents>=11.5.2"
"azure-search-documents>=11.5.2",
"rich>=13.0.0",
"langgraph-cli[inmem]>=0.3.1",
"langsmith>=0.3.37",
"langchain-core>=0.3.64",
]

[project.optional-dependencies]
@@ -44,10 +48,11 @@ requires = ["setuptools>=73.0.0", "wheel"]
build-backend = "setuptools.build_meta"

[tool.setuptools]
packages = ["open_deep_research"]
packages = ["open_deep_research", "tests"]

[tool.setuptools.package-dir]
"open_deep_research" = "src/open_deep_research"
"tests" = "tests"

[tool.setuptools.package-data]
"*" = ["py.typed"]
70 changes: 46 additions & 24 deletions src/open_deep_research/configuration.py
@@ -29,49 +29,68 @@ class SearchAPI(Enum):
NONE = "none"

@dataclass(kw_only=True)
class Configuration:
"""The configurable fields for the chatbot."""
class WorkflowConfiguration:
"""Configuration for the workflow/graph-based implementation (graph.py)."""
# Common configuration
report_structure: str = DEFAULT_REPORT_STRUCTURE # Defaults to the default report structure
search_api: SearchAPI = SearchAPI.TAVILY # Default to TAVILY
report_structure: str = DEFAULT_REPORT_STRUCTURE
search_api: SearchAPI = SearchAPI.TAVILY
search_api_config: Optional[Dict[str, Any]] = None
process_search_results: Literal["summarize", "split_and_rerank"] | None = None
# Summarization model for summarizing search results
# will be used if summarize_search_results is True
summarization_model_provider: str = "anthropic"
summarization_model: str = "claude-3-5-haiku-latest"
# Whether to include search results string in the agent output state
# This is used for evaluation purposes only
include_source_str: bool = False

# Graph-specific configuration
# Workflow-specific configuration
number_of_queries: int = 2 # Number of search queries to generate per iteration
max_search_depth: int = 2 # Maximum number of reflection + search iterations
planner_provider: str = "anthropic" # Defaults to Anthropic as provider
planner_model: str = "claude-3-7-sonnet-latest" # Defaults to claude-3-7-sonnet-latest
planner_model_kwargs: Optional[Dict[str, Any]] = None # kwargs for planner_model
writer_provider: str = "anthropic" # Defaults to Anthropic as provider
writer_model: str = "claude-3-5-sonnet-latest" # Defaults to claude-3-5-sonnet-latest
writer_model_kwargs: Optional[Dict[str, Any]] = None # kwargs for writer_model
planner_provider: str = "anthropic"
planner_model: str = "claude-3-7-sonnet-latest"
planner_model_kwargs: Optional[Dict[str, Any]] = None
writer_provider: str = "anthropic"
writer_model: str = "claude-3-7-sonnet-latest"
writer_model_kwargs: Optional[Dict[str, Any]] = None

@classmethod
def from_runnable_config(
cls, config: Optional[RunnableConfig] = None
) -> "WorkflowConfiguration":
"""Create a WorkflowConfiguration instance from a RunnableConfig."""
configurable = (
config["configurable"] if config and "configurable" in config else {}
)
values: dict[str, Any] = {
f.name: os.environ.get(f.name.upper(), configurable.get(f.name))
for f in fields(cls)
if f.init
}
return cls(**{k: v for k, v in values.items() if v})

@dataclass(kw_only=True)
class MultiAgentConfiguration:
"""Configuration for the multi-agent implementation (multi_agent.py)."""
# Common configuration
search_api: SearchAPI = SearchAPI.TAVILY
search_api_config: Optional[Dict[str, Any]] = None
process_search_results: Literal["summarize", "split_and_rerank"] | None = None
summarization_model_provider: str = "anthropic"
summarization_model: str = "claude-3-5-haiku-latest"

# Multi-agent specific configuration
supervisor_model: str = "openai:gpt-4.1" # Model for supervisor agent in multi-agent setup
researcher_model: str = "openai:gpt-4.1" # Model for research agents in multi-agent setup
number_of_queries: int = 2 # Number of search queries to generate per section
supervisor_model: str = "anthropic:claude-3-7-sonnet-latest"
researcher_model: str = "anthropic:claude-3-7-sonnet-latest"
final_report_model: str = "openai:gpt-4.1"
ask_for_clarification: bool = False # Whether to ask for clarification from the user
# MCP server configuration for multi-agent setup
# see examples here: https://github.com/langchain-ai/langchain-mcp-adapters#client-1
# MCP server configuration
mcp_server_config: Optional[Dict[str, Any]] = None
# optional prompt to append to the researcher agent prompt
mcp_prompt: Optional[str] = None
# optional list of MCP tool names to include in the researcher agent
# if not set, all MCP tools across all servers in the config will be included
mcp_tools_to_include: Optional[list[str]] = None

@classmethod
def from_runnable_config(
cls, config: Optional[RunnableConfig] = None
) -> "Configuration":
"""Create a Configuration instance from a RunnableConfig."""
) -> "MultiAgentConfiguration":
"""Create a MultiAgentConfiguration instance from a RunnableConfig."""
configurable = (
config["configurable"] if config and "configurable" in config else {}
)
@@ -81,3 +100,6 @@ def from_runnable_config(
if f.init
}
return cls(**{k: v for k, v in values.items() if v})

# Keep the old Configuration class for backward compatibility
Configuration = WorkflowConfiguration