A flexible, extensible, and reproducible framework for evaluating LLM workflows, applications, retrieval-augmented generation pipelines, and standalone models across custom and standard datasets.
- 📚 Document-Based Q&A Generation: Transform your technical documentation, guides, and knowledge bases into comprehensive question-answer test catalogs
- 📊 Multi-Dimensional Evaluation Metrics:
  - ✅ Answer Relevancy: Measures how well responses address the actual question
  - 🧠 G-Eval: Sophisticated evaluation using other LLMs as judges
  - 🔍 Faithfulness: Assesses adherence to source material facts
  - 🚫 Hallucination Detection: Identifies fabricated information not present in source documents
- 📈 Long-Term Quality Tracking:
  - 📆 Temporal Performance Analysis: Monitor model degradation or improvement over time
  - 🔄 Regression Testing: Automatically detect when model updates negatively impact performance
  - 📊 Trend Visualization: Track quality metrics across model versions with interactive charts
- 🔄 Universal Compatibility: Seamlessly works with all OpenAI-compatible endpoints including local solutions like Ollama
- 🏷️ Version Control for Q&A Catalogs: Easily track changes in your evaluation sets over time
- 📊 Comparative Analysis: Visualize performance differences between models on identical question sets
- 🚀 Batch Processing: Evaluate multiple models simultaneously for efficient workflows
- 🔌 Extensible Plugin System: Add new providers, metrics, and dataset generation techniques
- OpenAI: Integrate and evaluate models from OpenAI's API, including support for custom base URLs, temperature, and language control
- Azure OpenAI: Use Azure-hosted OpenAI models with deployment, API version, and custom language output support
- C4: Connect to C4 endpoints for LLM evaluation with custom configuration and API key support
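Because LLM-Eval targets OpenAI-compatible endpoints (see the Universal Compatibility feature above), any server that exposes that interface, including a local Ollama instance, can be evaluated. The sketch below is illustrative only and is not LLM-Eval's own configuration API: it shows what an OpenAI-compatible endpoint call looks like using the official `openai` Python client against a local Ollama server, where the base URL and model name are assumptions about a typical local setup.

```python
# Illustrative only: NOT LLM-Eval's configuration API, just a minimal example of
# an OpenAI-compatible endpoint. Assumes a local Ollama server (default port 11434)
# with a model already pulled (e.g. `ollama pull llama3`); the model name is an assumption.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # Ollama ignores the key, but the client requires one
)

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "What does retrieval-augmented generation mean?"}],
)
print(response.choices[0].message.content)
```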
- 🚀 Key Features
- 📖 Table of Contents
- 📝 Introduction
- Getting Started
- 🤝 Contributing & Code of Conduct
- 📜 License
LLM-Eval is an open-source toolkit designed to evaluate large language model workflows, applications, retrieval-augmented generation pipelines, and standalone models. Whether you're developing a conversational agent, a summarization service, or a RAG-based search tool, LLM-Eval provides a clear, reproducible framework to test and compare performance across providers, metrics, and datasets.
Key benefits include: end-to-end evaluation of real-world applications, reproducible reports, and an extensible platform for custom metrics and datasets.
To run LLM-Eval locally (for evaluation and usage, not development), use our pre-configured Docker Compose setup.
- Docker
- Docker Compose
- Clone the repository:

  ```bash
  git clone <LLM-Eval github url>
  cd llm-eval
  ```
- Copy and configure the environment:

  ```bash
  cp .env.example .env
  # Edit .env to add your API keys and secrets as needed
  ```

  Required: generate the encryption keys currently set to `CHANGEME`, using the respective commands commented next to them in `.env`.
- Enable host networking in Docker Desktop (macOS users only):

  Go to `Settings -> Resources -> Network` and check `Enable host networking`. Without this step, the frontend will not be reachable on localhost on macOS.
- Start the stack:

  ```bash
  docker compose -f docker-compose.yaml -f docker-compose.local.yaml up -d
  ```
- Access the application:

  - Web UI: http://localhost:3000 (Default login: `username`:`password`)
  - API: http://localhost:8070/docs
- Log in using the default user:

  Default LLM-Eval credentials: username `username`, password `password`.
To stop the app:

```bash
docker compose -f docker-compose.yaml -f docker-compose.local.yaml down
```
If you want to contribute to LLM-Eval or run it in a development environment, follow these steps:
- Python 3.12
- Poetry
- Docker (for required services)
- Node.js & npm (for frontend)
```bash
git clone <LLM-Eval github url>
cd llm-eval
poetry install --only=main,dev,test
poetry self add poetry-plugin-shell
```
- Install the Git pre-commit hook:

  ```bash
  pre-commit install
  ```
- Start a Poetry shell:

  ```bash
  poetry shell
  ```
- Copy and configure the environment:

  ```bash
  cp .env.example .env
  # Add your API keys and secrets to .env
  # Fill CHANGEME values with appropriate keys
  ```
- Comment out the following in `.env`, from

  ```
  # container variables
  KEYCLOAK_HOST=keycloak
  CELERY_BROKER_HOST=rabbit-mq
  PG_HOST=eval-db
  ```

  to

  ```
  # container variables
  # KEYCLOAK_HOST=keycloak
  # CELERY_BROKER_HOST=rabbit-mq
  # PG_HOST=eval-db
  ```
- Start databases and other services:

  ```bash
  docker compose up -d
  ```
- Start the backend:

  ```bash
  cd backend
  uvicorn llm_eval.main:app --host 0.0.0.0 --port 8070 --reload
  ```
- Start the Celery worker:

  ```bash
  cd backend
  celery -A llm_eval.tasks worker --loglevel=INFO --concurrency=4
  ```
- Start the frontend:

  ```bash
  cd frontend
  npm install
  npm run dev
  ```
- Log in using the default user:

  Default LLM-Eval credentials: username `username`, password `password`.
User access is managed through Keycloak, available at localhost:8080 (Default admin credentials: `admin`:`admin`). Select the `llm-eval` realm to manage users.
- If you want to adjust Keycloak manually, see docs/keycloak-setup-guide.md for a step-by-step guide.
- Otherwise, the default configuration found in keycloak-config is used when Docker Compose launches.
Once Keycloak is up and running, tokens can be requested as follows.

Without a session, via the service client `dev-ide` (direct backend API calls):
```bash
$ curl -X POST \
  'http://localhost:8080/realms/llm-eval/protocol/openid-connect/token' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'client_id=dev-ide' \
  -d 'client_secret=dev-ide' \
  -d 'grant_type=client_credentials' | jq
```
Or with a session, using the client `llm-eval-ui` (frontend calls):
```bash
$ curl -X POST \
  'http://localhost:8080/realms/llm-eval/protocol/openid-connect/token' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'client_id=llm-eval-ui' \
  -d 'client_secret=llm-eval-ui' \
  -d 'username=username' \
  -d 'password=password' \
  -d 'grant_type=password' | jq
```
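For scripted access, the same token request can be made from Python. The following is a minimal sketch using the `requests` library and the `dev-ide` client credentials from the curl example above; it only obtains the token and builds the `Authorization` header, since the specific backend endpoints to call are listed at http://localhost:8070/docs and are not assumed here.

```python
# Minimal sketch: obtain a Keycloak token with the client_credentials grant
# (mirrors the dev-ide curl example above) and build the Authorization header
# for direct backend API calls. Assumes the local stack is running.
import requests

TOKEN_URL = "http://localhost:8080/realms/llm-eval/protocol/openid-connect/token"

resp = requests.post(
    TOKEN_URL,
    data={
        "client_id": "dev-ide",
        "client_secret": "dev-ide",
        "grant_type": "client_credentials",
    },
    timeout=10,
)
resp.raise_for_status()
token = resp.json()["access_token"]

# Attach this header to any backend endpoint listed at http://localhost:8070/docs.
headers = {"Authorization": f"Bearer {token}"}
print(headers["Authorization"][:40] + "...")
```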
The repository is not yet fully prepared for external contributions, so we are not accepting them at the moment.
This project is licensed under the Apache 2.0 License.