A flexible, extensible, and reproducible framework for evaluating LLM workflows, applications, retrieval-augmented generation pipelines, and standalone models across custom and standard datasets.
- 📚 Document-Based Q&A Generation: Transform your technical documentation, guides, and knowledge bases into comprehensive question-answer test catalogs
- 📊 Multi-Dimensional Evaluation Metrics:
  - ✅ Answer Relevancy: Measures how well responses address the actual question
  - 🧠 G-Eval: Sophisticated evaluation using other LLMs as judges
  - 🔍 Faithfulness: Assesses adherence to source material facts
  - 🚫 Hallucination Detection: Identifies fabricated information not present in source documents
- 📈 Long-Term Quality Tracking:
  - 📆 Temporal Performance Analysis: Monitor model degradation or improvement over time
  - 🔄 Regression Testing: Automatically detect when model updates negatively impact performance
  - 📊 Trend Visualization: Track quality metrics across model versions with interactive charts
- 🔄 Universal Compatibility: Seamlessly works with all OpenAI-compatible endpoints including local solutions like Ollama
- 🏷️ Version Control for Q&A Catalogs: Easily track changes in your evaluation sets over time
- 📊 Comparative Analysis: Visualize performance differences between models on identical question sets
- 🚀 Batch Processing: Evaluate multiple models simultaneously for efficient workflows
- 🔌 Extensible Plugin System: Add new providers, metrics, and dataset generation techniques
- OpenAI: Integrate and evaluate models from OpenAI's API, including support for custom base URLs, temperature, and language control
- Azure OpenAI: Use Azure-hosted OpenAI models with deployment, API version, and custom language output support
- C4: Connect to C4 endpoints for LLM evaluation with custom configuration and API key support
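Because LLM-Eval targets OpenAI-compatible endpoints (see the Universal Compatibility feature above), any server that exposes that interface, including a local Ollama instance, can be evaluated. The sketch below is illustrative only and is not LLM-Eval's own configuration API: it shows what an OpenAI-compatible endpoint call looks like using the official `openai` Python client against a local Ollama server, where the base URL and model name are assumptions about a typical local setup.

```python
# Illustrative only: NOT LLM-Eval's configuration API, just a minimal example of
# an OpenAI-compatible endpoint. Assumes a local Ollama server (default port 11434)
# with a model already pulled (e.g. `ollama pull llama3`); the model name is an assumption.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # Ollama ignores the key, but the client requires one
)

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "What does retrieval-augmented generation mean?"}],
)
print(response.choices[0].message.content)
```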
- 🚀 Key Features
- 📖 Table of Contents
- 📝 Introduction
- Getting Started
- 🤝 Contributing & Code of Conduct
- 📜 License
LLM-Eval is an open-source toolkit designed to evaluate large language model workflows, applications, retrieval-augmented generation pipelines, and standalone models. Whether you're developing a conversational agent, a summarization service, or a RAG-based search tool, LLM-Eval provides a clear, reproducible framework to test and compare performance across providers, metrics, and datasets.
Key benefits include: end-to-end evaluation of real-world applications, reproducible reports, and an extensible platform for custom metrics and datasets.
To run LLM-Eval locally (for evaluation and usage, not development), use our pre-configured Docker Compose setup.
- Docker
- Docker Compose
- Clone the repository:

  ```bash
  git clone <LLM-Eval github url>
  cd llm-eval
  ```
- Copy and configure the environment:

  ```bash
  cp .env.example .env
  # Edit .env to add your API keys and secrets as needed
  ```

  Required: generate the encryption keys currently set to `CHANGEME`, using the respective commands commented next to them in `.env`.
- Enable host networking in Docker Desktop (macOS users only):

  Go to `Settings -> Resources -> Network` and check `Enable host networking`. Without this step, the frontend will not be reachable on localhost on macOS.
- Start the stack:

  ```bash
  docker compose -f docker-compose.yaml -f docker-compose.local.yaml up -d
  ```
- Access the application:

  - Web UI: http://localhost:3000 (Default login: `username`:`password`)
  - API: http://localhost:8070/docs
- Log in using the default user:

  Default LLM-Eval credentials: username `username`, password `password`.
To stop the app:

```bash
docker compose -f docker-compose.yaml -f docker-compose.local.yaml down
```
If you want to contribute to LLM-Eval or run it in a development environment, follow these steps:
- Python 3.12
- Poetry
- Docker (for required services)
- Node.js & npm (for frontend)
```bash
git clone <LLM-Eval github url>
cd llm-eval
poetry install --only=main,dev,test
poetry self add poetry-plugin-shell
```
- Install the Git pre-commit hook:

  ```bash
  pre-commit install
  ```
- Start a Poetry shell:

  ```bash
  poetry shell
  ```
- Copy and configure the environment:

  ```bash
  cp .env.example .env
  # Add your API keys and secrets to .env
  # Fill CHANGEME values with appropriate keys
  ```
- Comment out the following in `.env`, from

  ```
  # container variables
  KEYCLOAK_HOST=keycloak
  CELERY_BROKER_HOST=rabbit-mq
  PG_HOST=eval-db
  ```

  to

  ```
  # container variables
  # KEYCLOAK_HOST=keycloak
  # CELERY_BROKER_HOST=rabbit-mq
  # PG_HOST=eval-db
  ```
- Start databases and other services:

  ```bash
  docker compose up -d
  ```
- Start the backend:

  ```bash
  cd backend
  uvicorn llm_eval.main:app --host 0.0.0.0 --port 8070 --reload
  ```
- Start the Celery worker:

  ```bash
  cd backend
  celery -A llm_eval.tasks worker --loglevel=INFO --concurrency=4
  ```
- Start the frontend:

  ```bash
  cd frontend
  npm install
  npm run dev
  ```
- Log in using the default user:

  Default LLM-Eval credentials: username `username`, password `password`.
User access is managed through Keycloak, available at localhost:8080 (Default admin credentials: `admin`:`admin`). Select the `llm-eval` realm to manage users.
- If you want to adjust Keycloak manually, see docs/keycloak-setup-guide.md for a step-by-step guide.
- Otherwise, the default configuration found in keycloak-config is used when Docker Compose launches.
Once Keycloak is up and running, tokens can be requested as follows.

Without a session, via the service client `dev-ide` (direct backend API calls):
```bash
$ curl -X POST \
  'http://localhost:8080/realms/llm-eval/protocol/openid-connect/token' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'client_id=dev-ide' \
  -d 'client_secret=dev-ide' \
  -d 'grant_type=client_credentials' | jq
```
Or with a session, using the client `llm-eval-ui` (frontend calls):
```bash
$ curl -X POST \
  'http://localhost:8080/realms/llm-eval/protocol/openid-connect/token' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'client_id=llm-eval-ui' \
  -d 'client_secret=llm-eval-ui' \
  -d 'username=username' \
  -d 'password=password' \
  -d 'grant_type=password' | jq
```
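For scripted access, the same token request can be made from Python. The following is a minimal sketch using the `requests` library and the `dev-ide` client credentials from the curl example above; it only obtains the token and builds the `Authorization` header, since the specific backend endpoints to call are listed at http://localhost:8070/docs and are not assumed here.

```python
# Minimal sketch: obtain a Keycloak token with the client_credentials grant
# (mirrors the dev-ide curl example above) and build the Authorization header
# for direct backend API calls. Assumes the local stack is running.
import requests

TOKEN_URL = "http://localhost:8080/realms/llm-eval/protocol/openid-connect/token"

resp = requests.post(
    TOKEN_URL,
    data={
        "client_id": "dev-ide",
        "client_secret": "dev-ide",
        "grant_type": "client_credentials",
    },
    timeout=10,
)
resp.raise_for_status()
token = resp.json()["access_token"]

# Attach this header to any backend endpoint listed at http://localhost:8070/docs.
headers = {"Authorization": f"Bearer {token}"}
print(headers["Authorization"][:40] + "...")
```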
The repository is not yet fully prepared for external contributions, so we are not accepting them at the moment.
This project is licensed under the Apache 2.0 License.