TextEvolve is a system that uses LLM-driven reasoning and memory to iteratively improve its approach to solving problems from datasets. TextEvolve automates the labor-intensive task of iterating over system design, prompt engineering, and program flow by learning, remembering, and adapting its approach.
Input: Any dataset with "question" and "answer" fields
Output: A series of executable Python scripts containing advanced workflows and agentic behavior optimized for the dataset.
The system employs dynamic exploration/exploitation/refinement strategies and adapts its approach (creating new functions, writing code that generates and executes new code, writing prompts, etc.) based on performance feedback, keeping rich logs of its past performance.
Read the Paper: TextEvolve: Automated Program Discovery with Large Language Models
Watch the Demo: YouTube
Examples in Action:
System-generated script for MATH dataset
System-generated script for GPQA Diamond dataset
System-generated script for HotpotQA dataset
System-generated script for DROP dataset
Memory / experiment log for HotpotQA
Repository Status: Under Construction
Current State:
- TextEvolve is stable and runs successfully
- Codebase is actively being refactored and cleaned up
- Documentation and examples are being written
- Repository structure may change frequently
Please expect ongoing changes to code organization and documentation as we work toward a more polished release.
Preliminary Results:
Here's some preliminary benchmark data from the paper comparing standard I/O with Gemini 2.0 Flash vs. the best-performing TextEvolve program, also run with Gemini 2.0 Flash. (NB: run over 100 randomly sampled examples from the test set; more extensive benchmarking is underway.) TextEvolve boosts performance by automating manual experimentation over workflows, prompts, and code.
- Set your Gemini API key:

  ```bash
  export GEMINI_API_KEY=your_api_key_here
  ```
- Run the system:

  ```bash
  # Basic usage with 5 iterations
  python run_script.py --dataset your_dataset.jsonl --loader jsonl --iterations 5

  # Example with MATH benchmark
  python run_script.py --dataset hendrycks_math/math_test.jsonl --loader math --iterations 5
  ```
Program search typically completes (20-30 iterations) in around 20-30 minutes over ~100 data examples.
You can run more iterations if you're unsatisfied with the result; this will pick up right where the system left off. After the system runs, you'll see a final report.
- Validate results:

  ```bash
  # Test the best script on examples 100-199
  python validate_script.py --script scripts/script_iteration_4.py --dataset hendrycks_math/math_test.jsonl --loader math --start 100 --end 199
  ```
- Reset system:

  ```bash
  # Wipe memory and start the system from scratch
  python reset_system.py
  ```
The system supports multiple dataset formats through modular loaders:
| Loader | Dataset Type | Example Usage |
|---|---|---|
| `arc` | ARC (Abstraction and Reasoning Corpus) | `--loader arc` |
| `jsonl` | JSONL files (one JSON object per line) | `--loader jsonl` |
| `json` | JSON files with configurable fields | `--loader json` |
| `simpleqa` | SimpleQA dataset | `--loader simpleqa` |
| `math` | MATH dataset | `--loader math` |
| `natural_plan` | Natural Plan dataset | `--loader natural_plan` |
| `gpqa` | GPQA dataset | `--loader gpqa` |
| `hotpotqa` | HotpotQA dataset | `--loader hotpotqa` |
| `custom` | Your own custom format | `--loader custom` |
```bash
# ARC dataset (directory of JSON files)
python run_script.py --dataset ARC_2024_Training/ --loader arc --iterations 10

# JSONL dataset (like MATH benchmark)
python run_script.py --dataset math_test.jsonl --loader math --iterations 5

# Custom JSON with specific fields
python run_script.py --dataset custom.json --loader json --input-field question --output-field answer --iterations 5

# JSONL with custom fields (like DROP dataset)
python run_script.py --dataset drop_dataset.jsonl --loader jsonl --input-field question --output-field answers_spans --iterations 5

# Disable shuffling for consistent ordering
python run_script.py --dataset dataset.jsonl --loader jsonl --no-shuffle --iterations 5
```
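If you're assembling a dataset from scratch, here's a minimal sketch (the file name, questions, and answers are made up for illustration) of generating a JSONL file the `jsonl` loader can consume:

```python
import json

# Toy examples for illustration only; use your own questions and answers.
examples = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
]

# JSONL format: one JSON object per line.
with open("toy_dataset.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```

You could then run the system on it with `--dataset toy_dataset.jsonl --loader jsonl --input-field question --output-field answer`.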
| Option | Description | Default |
|---|---|---|
| `--iterations` | Number of iterations to run | 5 |
| `--dataset` | Path to dataset file/directory | required |
| `--loader` | Type of dataset loader | required |
| `--input-field` | Input field name (JSON/JSONL) | `"input"` |
| `--output-field` | Output field name (JSON/JSONL) | `"output"` |
| `--passage-field` | Passage field (JSONL) | `"passage"` |
| `--no-shuffle` | Disable dataset shuffling | False |
| `--seed` | Random seed | 42 |
| Option | Description | Default |
|---|---|---|
| `--script` | Path to script to validate | required |
| `--dataset` | Path to dataset | required |
| `--loader` | Dataset loader type | required |
| `--start` | Start index for validation | 0 |
| `--end` | End index for validation | 99 |
| `--detailed` | Show detailed results | False |
For basic custom formats, extend the `DatasetLoader` class:
```python
from dataset_loader import DatasetLoader
import json


class MyDatasetLoader(DatasetLoader):
    def _load_examples(self):
        """Load examples from your custom format"""
        with open(self.dataset_path, 'r') as f:
            data = json.load(f)

        for key, example in data.items():
            # Convert to standard format
            self.examples.append({
                "id": key,
                "question": example["my_input_field"],  # Standard field: "question"
                "answer": example["my_output_field"],   # Standard field: "answer"
                "meta": {
                    "source": "my_dataset",
                    "original_data": example
                }
            })

        print(f"Loaded {len(self.examples)} examples from custom dataset")


# Register and use your loader
from dataset_loader import create_dataset_loader

def create_my_loader(**kwargs):
    return MyDatasetLoader(**kwargs)

# Add to the create_dataset_loader function or use directly
loader = MyDatasetLoader(dataset_path="my_data.json", shuffle=True)
```
For more complex formats, use the built-in custom loader:
```python
from dataset_loader import create_dataset_loader

def load_my_examples(dataset_path):
    """Load examples from your dataset"""
    # Your custom loading logic
    with open(dataset_path, 'r') as f:
        raw_data = f.read()

    # Process and return list of examples
    examples = []
    # ... your processing logic ...
    return examples

def get_my_input(example):
    """Extract input from example"""
    return example["my_question_field"]

def get_my_output(example):
    """Extract output from example"""
    return example["my_answer_field"]

# Create the custom loader
loader = create_dataset_loader(
    "custom",
    dataset_path="my_dataset.xyz",
    load_examples_fn=load_my_examples,
    get_input_fn=get_my_input,
    get_output_fn=get_my_output,
    shuffle=True
)

# Use with agent system
from agent_system import AgentSystem
agent = AgentSystem(dataset_loader=loader)
```
The system uses three main strategies:
- Explore (60% initially): Try completely new approaches
- Exploit (20% initially): Combine successful techniques
- Refine (20% initially): Make targeted improvements to the best script
The balance between these strategies adapts based on performance.
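As a rough illustration of the idea (a hypothetical sketch, not the repository's actual logic, which uses LLM reasoning over its full history), an adaptive strategy mix could be maintained like this:

```python
import random

# Initial mix from the list above; the update rule below is an assumption.
weights = {"explore": 0.6, "exploit": 0.2, "refine": 0.2}

def pick_strategy():
    """Sample a strategy in proportion to the current weights."""
    strategies, probs = zip(*weights.items())
    return random.choices(strategies, weights=probs)[0]

def update_weights(strategy, improved, step=0.05):
    """Shift weight toward strategies that recently improved accuracy."""
    weights[strategy] = max(0.05, weights[strategy] + (step if improved else -step))
    total = sum(weights.values())
    for name in weights:  # renormalize so the mix stays a probability distribution
        weights[name] /= total
```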
- Starts with small batches (3-5 examples)
- For promising scripts (>60% accuracy), runs progressive testing (backtesting) on a set of previously seen examples
- Adjusts batch size based on performance, balancing throughput against accurate measurement of iteration performance
- Uses LLM reasoning for strategy decisions, error analysis, and script generation
- Employs and creates novel advanced agentic patterns like ReAct, chain-of-thought, and verification loops
- Automatically repairs and debugs generated scripts
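A toy sketch of how the batch-sizing and backtesting decisions above might be scheduled (thresholds other than the stated 60% trigger are assumptions; the real system makes these calls with LLM reasoning):

```python
def should_backtest(batch_accuracy, threshold=0.60):
    """Promising scripts (above ~60% on the current batch) get progressive
    testing on previously seen examples."""
    return batch_accuracy > threshold

def next_batch_size(current_size, batch_accuracy):
    """Grow the batch when results look reliable, shrink it when they are poor,
    balancing throughput against measurement accuracy (bounds are illustrative)."""
    if batch_accuracy >= 0.80:
        return min(current_size + 2, 10)
    if batch_accuracy < 0.40:
        return max(current_size - 1, 3)
    return current_size
```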
```
Iteration 0: Baseline script (simple LLM call) → 45% accuracy
Iteration 1: Explore new approach → 62% accuracy → Progressive testing → 58% overall
Iteration 2: Exploit successful techniques → 71% accuracy → Progressive testing → 65% overall
Iteration 3: Refine best approach → 73% accuracy → Progressive testing → 68% overall
...
```
macOS/Linux:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

```bash
# Clone the repository
git clone <www.github.com/nickcdryan/textevolve>

# Install dependencies (creates virtual environment automatically)
uv sync

# Set up your Gemini API key
export GEMINI_API_KEY=your_api_key_here

# Run the verification script
uv run python verify_setup.py

# Quick test run (optional)
uv run python run_script.py --dataset hendrycks_math/math_test.jsonl --loader math --iterations 1
```
The system creates several directories:
```
├── archive/                    # Iteration data and summaries
│   ├── iteration_0.json        # Detailed data for each iteration
│   ├── iteration_1.json
│   └── summaries.json          # Performance summaries
├── scripts/                    # Generated scripts
│   ├── script_iteration_0.py
│   ├── script_iteration_1.py
│   └── ...
├── learnings.txt               # Accumulated insights and patterns
└── README.md
```
The system tracks multiple metrics:
- Batch Accuracy: Performance on current test batch
- Progressive Accuracy: Performance on small set of previously seen examples
- Combined Accuracy: Weighted average across all tested examples
- Capability Assessment: Strengths, weaknesses, and improvement areas
Example output:
```
Iteration  Strategy  Batch Acc.  Prog. Acc.    Combined  Batch Size  Prog. Size
8          exploit   75.00%      68.33% (60)   69.23%    4           60
```
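Combined accuracy is a weighted average across all tested examples; one plausible way to compute such a figure (an assumption, not necessarily the repository's exact weighting) is to pool the raw counts:

```python
def combined_accuracy(batch_correct, batch_total, prog_correct, prog_total):
    """Pool batch and progressive results, weighting each by its number of examples."""
    total = batch_total + prog_total
    return (batch_correct + prog_correct) / total if total else 0.0

# e.g. 8/10 correct on the current batch and 30/50 on progressive testing
print(f"{combined_accuracy(8, 10, 30, 50):.2%}")  # 63.33%
```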
system_improver.py can review and edit the core functionality of the repository. Specifically, system_improver.py:
- reviews the program
- reviews iteration history and performance
- reviews past changes made by system_improver.py in /diffs
- proposes and integrates changes to the system, e.g., adding utility functions, rewriting meta-agent prompts, etc.
- creates system backups in /backup (eventually auto-rollback will be integrated if the new changes are system-breaking)
This system does not work reliably yet; please stay tuned.
```python
from dataset_loader import create_dataset_loader
from agent_system import AgentSystem

# Create dataset loader
loader = create_dataset_loader(
    "jsonl",
    dataset_path="your_dataset.jsonl",
    shuffle=True,
    random_seed=42
)

# Initialize agent system
agent = AgentSystem(dataset_loader=loader)

# Run iterations
for i in range(10):
    result = agent.run_iteration()
    print(f"Iteration {i}: {result.get('performance', {}).get('accuracy', 0):.2f} accuracy")

# Get best script info
best_script = agent.get_best_script_info()
print(f"Best script: {best_script.get('path')} with {best_script.get('combined_accuracy', 0):.2f} accuracy")
```
For datasets with non-standard field names:
```bash
# JSON dataset with custom fields
python run_script.py --dataset custom.json --loader json --input-field "problem_statement" --output-field "solution"

# JSONL dataset with passage and question
python run_script.py --dataset reading_comprehension.jsonl --loader jsonl --input-field "question" --passage-field "context" --output-field "answer"
```
- Parallelized batch testing
- Support for multi-turn flows
- system_improver.py iteration
- Better code execution
- Custom evaluation and multi-objective functions
- Integrate RAG capability
- More dynamic memory selection (access and reasoning over memory filesystem)
- Further metaheuristic testing
- API, tool, modality (vision) integration
To add support for a new dataset format:
- Create a new loader class inheriting from `DatasetLoader`
- Implement the `_load_examples()` method
- Ensure examples use standard field names: `"question"`, `"answer"`, `"id"`
- Add your loader to the `create_dataset_loader()` function
- Test with both `run_script.py` and `validate_script.py`
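For the registration step, the change to `create_dataset_loader()` might look roughly like this (the factory's real internals aren't shown in this README, so treat this as an illustrative sketch):

```python
# dataset_loader.py (sketch) -- add a branch for your loader type
def create_dataset_loader(loader_type, **kwargs):
    if loader_type == "my_format":          # hypothetical new loader name
        return MyDatasetLoader(**kwargs)
    # ... existing loader types (arc, jsonl, json, math, ...) handled here ...
    raise ValueError(f"Unknown loader type: {loader_type}")
```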
MIT License - see LICENSE file for details.