A professional, extensible framework for evaluating and comparing Large Language Models (LLMs) across different industry domains using automated scoring, radar charts, and comprehensive reporting.
*Example radar chart comparing AFM-4-5-B-Preview and Llama-3-2-8-B-Instruct-Turbo across multiple evaluation metrics*
Radar Evaluator is a sophisticated tool designed to systematically assess and compare the performance of different LLMs across various industry-specific domains. It uses a multi-stage evaluation process that combines automated question generation, model response evaluation, and comprehensive visualization to provide insights into model capabilities.
- Multi-Model Comparison: Evaluate and compare multiple LLMs side-by-side
- Industry-Specific Evaluation: Domain-specific questions across 10+ industries
- Automated Scoring: AI-powered evaluation using DeepSeek-R1 as the judge
- Radar Chart Visualization: Intuitive radar charts for easy comparison
- Comprehensive Reporting: Detailed Markdown reports with scores and analysis
- Extensible Architecture: Easy to add new models, industries, and metrics
- CLI Interface: Simple command-line interface for automation
- Parallel Processing: Efficient evaluation using multi-threading
The tool selects industry-specific questions from a curated database covering:
- Information Technology: AI/ML, cybersecurity, cloud architecture
- Healthcare: Medical research, drug discovery, patient care
- Financial Services: Risk management, trading, regulatory compliance
- Consumer Discretionary: Retail, e-commerce, customer experience
- Communication Services: Media, telecommunications, content creation
- Industrials: Manufacturing, supply chain, automation
- Consumer Staples: Food & beverage, agriculture, sustainability
- Energy: Renewable energy, grid management, sustainability
- Utilities: Infrastructure, smart cities, resource management
- Materials: Advanced materials, manufacturing processes
For each selected question, the tool:
- Generates a system prompt tailored to each model
- Sends the question to both models being compared
- Collects streaming responses for real-time feedback (see the sketch after this list)
- Handles errors gracefully with fallback mechanisms
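A minimal sketch of how a single question can be sent and its streamed response collected, assuming the Together Python SDK; the prompts, parameters, and error handling below are illustrative rather than the tool's exact code:

```python
import os
from together import Together

# Illustrative sketch, not the tool's exact implementation.
client = Together(api_key=os.environ["TOGETHER_API_KEY"])

def ask_model(model_endpoint: str, system_prompt: str, question: str) -> str:
    """Stream a single answer and return it as one string."""
    chunks = []
    try:
        stream = client.chat.completions.create(
            model=model_endpoint,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": question},
            ],
            max_tokens=1024,
            temperature=0.7,
            stream=True,
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content or ""
            print(delta, end="", flush=True)  # real-time feedback while streaming
            chunks.append(delta)
    except Exception as exc:
        return f"[ERROR] {exc}"  # graceful fallback instead of aborting the run
    return "".join(chunks)
```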
Each model response is evaluated by DeepSeek-R1 (the judge) across five key metrics (a sketch of the judge prompt follows the table):
| Metric | Description | Scale |
|---|---|---|
| Accuracy | Factual correctness and reliability of information | 1-5 |
| Completeness | How thoroughly the response addresses all aspects | 1-5 |
| Clarity | Organization and understandability of the explanation | 1-5 |
| Depth | Level of detail and insight provided | 1-5 |
| Overall Quality | Comprehensive assessment of response quality | 1-5 |
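The exact judge prompt lives in the code; the following is a hedged sketch of how such a prompt might be assembled for DeepSeek-R1, with illustrative wording and output format:

```python
METRICS = ["Accuracy", "Completeness", "Clarity", "Depth", "Overall"]

def build_judge_prompt(question: str, response: str) -> str:
    """Assemble an evaluation prompt asking for one 1-5 score per metric."""
    score_lines = "\n".join(f"{m}: <score 1-5>" for m in METRICS)
    return (
        "You are an impartial evaluator. Rate the response to the question below, "
        "scoring each metric with an integer from 1 (poor) to 5 (excellent).\n\n"
        f"Question:\n{question}\n\n"
        f"Response:\n{response}\n\n"
        "Answer with exactly one line per metric, in this format:\n"
        f"{score_lines}"
    )
```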
- Extracts numerical scores using regex patterns (see the sketch after this list)
- Validates scores against defined ranges (1-5 scale)
- Applies fallback logic for missing or invalid scores
- Ensures data quality and consistency
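A minimal sketch of that extraction-and-validation step; the regex pattern and the fallback value are assumptions, not the exact logic used by the tool:

```python
import re
from typing import Dict

METRICS = ["accuracy", "completeness", "clarity", "depth", "overall"]
FALLBACK_SCORE = 3  # assumed neutral fallback for missing or invalid scores

def extract_scores(evaluation_text: str) -> Dict[str, int]:
    """Pull 'Metric: N' style scores out of the judge's free-text evaluation."""
    scores = {}
    for metric in METRICS:
        match = re.search(rf"{metric}\s*[:=]\s*(\d+)", evaluation_text, re.IGNORECASE)
        score = int(match.group(1)) if match else FALLBACK_SCORE
        # Validate against the 1-5 scale; fall back if out of range
        scores[metric] = score if 1 <= score <= 5 else FALLBACK_SCORE
    return scores
```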
- Radar Charts: Multi-dimensional comparison visualization (a Matplotlib sketch follows this list)
- Summary Reports: Markdown reports with detailed analysis
- Statistical Analysis: Average scores, standard deviations
- Export Options: PNG, Markdown, JSON, and CSV formats
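A sketch of how a radar chart like the tool's can be drawn with Matplotlib; the metric names match the table above, while the scores are made-up illustration data:

```python
import numpy as np
import matplotlib.pyplot as plt

metrics = ["Accuracy", "Completeness", "Clarity", "Depth", "Overall"]
scores = {  # illustrative averages, not real evaluation results
    "Model A": [4.2, 3.8, 4.5, 3.6, 4.0],
    "Model B": [3.9, 4.1, 4.0, 4.2, 4.1],
}

angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
angles += angles[:1]  # repeat the first angle to close the polygon

fig, ax = plt.subplots(subplot_kw={"polar": True})
for model, values in scores.items():
    values = values + values[:1]
    ax.plot(angles, values, label=model)
    ax.fill(angles, values, alpha=0.15)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(metrics)
ax.set_ylim(0, 5)  # scores are on a 1-5 scale
ax.legend(loc="upper right")
fig.savefig("radar_chart.png", dpi=200, bbox_inches="tight")
```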
- Python 3.8+ installed on your system
- Together AI API Key for model access
- Git for cloning the repository
# Clone the repository
git clone https://github.com/juliensimon/radar-evaluator.git
cd radar-evaluator
# Create and activate virtual environment (recommended)
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install the package
pip install .
- Set your API key:
export TOGETHER_API_KEY="your_api_key_here"
- Customize models (optional):
Edit `config.json` to add or modify models:
{
"models": {
"your_model": {
"name": "Your Model Name",
"model_endpoint": "your/model/endpoint",
"model_key": "your_model"
}
}
}
# Compare two models across all industries
python radar_evaluator.py --model1 afm --model2 llama3_8b
# Compare with specific industries
python radar_evaluator.py --model1 afm --model2 gemma --industries "Information Technology,Healthcare"
# Limit the number of questions per industry
python radar_evaluator.py --model1 afm --model2 qwen --num-questions 5
# Full help and options
python radar_evaluator.py --help
# Available models
python radar_evaluator.py --list-models
# Custom configuration file
python radar_evaluator.py --config custom_config.json --model1 afm --model2 gemma
radar_results/
└── 20241216_143022/          # Timestamp-based results directory
    ├── radar_chart.png       # Radar chart visualization
    ├── summary_report.md     # Detailed Markdown report
    ├── results.json          # Raw evaluation data
    └── scores.csv            # Processed scores for analysis
- Spider/Radar Chart: Each axis represents an evaluation metric
- Model Comparison: Different colors/lines for each model
- Score Scale: 1-5 scale where 5 is the best performance
- Area Coverage: Larger area indicates better overall performance
The Markdown report includes:
- Executive Summary: High-level comparison and key findings
- Detailed Scores: Per-question breakdown with scores
- Statistical Analysis: Averages, standard deviations, confidence intervals (these can also be recomputed from scores.csv, as sketched below)
- Industry Breakdown: Performance by industry domain
- Recommendations: Insights and suggestions based on results
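A short pandas sketch for recomputing summary statistics from the exported scores.csv; the path and the column names (model, metric, score) are assumptions about the export format:

```python
import pandas as pd

# Path and column names are assumptions; adjust to the actual export.
df = pd.read_csv("radar_results/20241216_143022/scores.csv")

summary = (
    df.groupby(["model", "metric"])["score"]
      .agg(["mean", "std", "count"])
      .round(2)
)
print(summary)
```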
{
"models": {
"model_key": {
"name": "Display Name",
"model_endpoint": "provider/model-name",
"model_key": "internal_key"
}
},
"evaluation": {
"evaluator_model": "deepseek-ai/DeepSeek-R1", # The judge model
"max_tokens": 1024,
"temperature": 0.7,
"evaluation_max_tokens": 2048,
"evaluation_temperature": 0.3
},
"metrics": {
"accuracy": {"min": 1, "max": 5},
"completeness": {"min": 1, "max": 5},
"clarity": {"min": 1, "max": 5},
"depth": {"min": 1, "max": 5},
"overall": {"min": 1, "max": 5}
}
}
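A minimal sketch of loading this configuration and clipping scores to the metric ranges it defines; the evaluator_model field selects the DeepSeek-R1 judge, and the helper function below is illustrative:

```python
import json

with open("config.json", encoding="utf-8") as f:
    config = json.load(f)

judge_model = config["evaluation"]["evaluator_model"]  # "deepseek-ai/DeepSeek-R1"

def clamp_score(metric: str, value: int) -> int:
    """Clip a raw score into the [min, max] range defined for the metric."""
    bounds = config["metrics"][metric]
    return max(bounds["min"], min(bounds["max"], value))

print(judge_model, clamp_score("accuracy", 7))  # an out-of-range 7 is clipped to 5
```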
Add custom questions for specific industries:
{
"Your Industry": [
"Your domain-specific question here?",
"Another industry-relevant question?"
]
}
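A sketch of how such a question file can be loaded and sampled per industry; the file name comes from this README, while the sampling logic is an assumption:

```python
import json
import random

with open("industry_questions.json", encoding="utf-8") as f:
    questions_by_industry = json.load(f)

def pick_questions(industry: str, num_questions: int = 5) -> list:
    """Randomly select up to num_questions questions for one industry."""
    pool = questions_by_industry.get(industry, [])
    return random.sample(pool, k=min(num_questions, len(pool)))

print(pick_questions("Information Technology", 3))
```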
- Add the model configuration to `config.json`
- Ensure the model endpoint is accessible via Together AI
- Test with a small evaluation first
- Add industry questions to `industry_questions.json`
- Follow the existing format with domain-specific questions
- Ensure questions are complex enough for meaningful evaluation
- Modify the metrics section in `config.json`
- Update the evaluation prompt in the code
- Adjust score extraction patterns if needed
# Run all tests
python -m pytest
# Run with coverage
python -m pytest --cov=radar_evaluator
# View coverage report
open htmlcov/index.html
# Format code
black .
# Lint code
flake8 .
# Type checking
mypy .
# Security scanning
bandit -r .
- Parallel Processing: Uses ThreadPoolExecutor for concurrent evaluations (see the sketch after this list)
- Streaming Responses: Real-time response collection for better UX
- Error Handling: Robust error handling with fallback mechanisms
- Memory Management: Efficient data structures and cleanup
- API Rate Limiting: Built-in handling for API limits
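A sketch of the concurrency pattern listed above, using ThreadPoolExecutor with a simple exponential backoff for rate limits; evaluate_question is a placeholder for the real per-question work:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def evaluate_question(question: str) -> dict:
    """Placeholder for the real evaluation (model calls plus judging)."""
    return {"question": question, "overall": 4}

def evaluate_with_retry(question: str, retries: int = 3):
    """Retry with exponential backoff, e.g. when the API rate limit is hit."""
    for attempt in range(retries):
        try:
            return evaluate_question(question)
        except Exception:
            time.sleep(2 ** attempt)
    return None  # give up on this question without failing the whole run

questions = ["Q1", "Q2", "Q3"]  # placeholder questions
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(evaluate_with_retry, q): q for q in questions}
    results = {futures[f]: f.result() for f in as_completed(futures)}
print(results)
```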
We welcome contributions! Please see our contributing guidelines:
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
# Install development dependencies
pip install -e ".[dev]"
# Set up pre-commit hooks
pre-commit install
This project is licensed under the Creative Commons Attribution 4.0 International License.
- Together AI for providing the model inference platform
- DeepSeek AI for providing the judge model (DeepSeek-R1)
- Matplotlib for radar chart visualization
- Pandas for data processing and analysis
- Pytest for comprehensive testing framework
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: This README and inline code comments
Built with ❤️ for the AI community