LLM-Eval

A flexible, extensible, and reproducible framework for evaluating LLM workflows, applications, retrieval-augmented generation pipelines, and standalone models across custom and standard datasets.

🚀 Key Features

  • 📚 Document-Based Q&A Generation: Transform your technical documentation, guides, and knowledge bases into comprehensive question-answer test catalogs
  • 📊 Multi-Dimensional Evaluation Metrics:
    • Answer Relevancy: Measures how well responses address the actual question
    • 🧠 G-Eval: Sophisticated evaluation using other LLMs as judges
    • 🔍 Faithfulness: Assesses adherence to source material facts
    • 🚫 Hallucination Detection: Identifies fabricated information not present in source documents
  • 📈 Long-Term Quality Tracking:
    • 📆 Temporal Performance Analysis: Monitor model degradation or improvement over time
    • 🔄 Regression Testing: Automatically detect when model updates negatively impact performance
    • 📊 Trend Visualization: Track quality metrics across model versions with interactive charts
  • 🔄 Universal Compatibility: Seamlessly works with any OpenAI-compatible endpoint, including local solutions such as Ollama (see the example under Available Providers)
  • 🏷️ Version Control for Q&A Catalogs: Easily track changes in your evaluation sets over time
  • 📊 Comparative Analysis: Visualize performance differences between models on identical question sets
  • 🚀 Batch Processing: Evaluate multiple models simultaneously for efficient workflows
  • 🔌 Extensible Plugin System: Add new providers, metrics, and dataset generation techniques

Available Providers

  • OpenAI: Integrate and evaluate models from OpenAI's API, including support for custom base URLs, temperature, and language control
  • Azure OpenAI: Use Azure-hosted OpenAI models with deployment, API version, and custom language output support
  • C4: Connect to C4 endpoints for LLM evaluation with custom configuration and API key support
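
Because the framework targets any OpenAI-compatible endpoint, a quick sanity check before registering a provider is to call the endpoint's chat-completions route directly. The sketch below assumes a locally running Ollama server on its default port with a llama3 model pulled; the base URL and model name are assumptions about your local setup.

    # Sanity-check an OpenAI-compatible endpoint before using it as a provider.
    # Base URL and model name are assumptions about a local Ollama install.
    curl -X POST 'http://localhost:11434/v1/chat/completions' \
      -H 'Content-Type: application/json' \
      -d '{
            "model": "llama3",
            "messages": [{"role": "user", "content": "Reply with OK if you can read this."}]
          }'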

📖 Table of Contents

  1. 🚀 Key Features
  2. 📖 Table of Contents
  3. 📝 Introduction
  4. Getting Started
    1. Running LLM-Eval Locally
    2. Development Setup
  5. 🤝 Contributing & Code of Conduct
  6. 📜 License

📝 Introduction

LLM-Eval is an open-source toolkit designed to evaluate large language model workflows, applications, retrieval-augmented generation pipelines, and standalone models. Whether you're developing a conversational agent, a summarization service, or a RAG-based search tool, LLM-Eval provides a clear, reproducible framework to test and compare performance across providers, metrics, and datasets.

Key benefits include: end-to-end evaluation of real-world applications, reproducible reports, and an extensible platform for custom metrics and datasets.

Getting Started

Running LLM-Eval Locally

To run LLM-Eval locally (for evaluation and usage, not development), use our pre-configured Docker Compose setup.

Prerequisites

  • Docker
  • Docker Compose
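
The commands in this guide use the Docker Compose v2 plugin syntax (docker compose, not the legacy docker-compose binary). You can confirm both prerequisites are in place with:

    docker --version
    docker compose version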

Quick Start - for local usage

  1. Clone the repository:

    git clone <LLM-Eval github url>
    cd llm-eval
  2. Copy and configure environment:

    cp .env.example .env
    # Edit .env to add your API keys and secrets as needed

    Required: generate values for the encryption keys that are set to CHANGEME, using the commands commented next to them in .env (a generic key-generation sketch follows this list).

  3. Enable host networking in Docker Desktop (macOS users only):

    Go to Settings -> Resources -> Network and check Enable host networking. Without this step, the frontend will not be reachable on localhost on macOS.

  4. Start the stack:

    docker compose -f docker-compose.yaml -f docker-compose.local.yaml up -d
  5. Access the application:

  6. Log in using the default user:

    The default LLM-Eval credentials are username: username, password: password.
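
The exact key-generation commands are documented as comments next to each CHANGEME placeholder in .env, and those should be preferred. As a generic sketch, either of the following produces a random secret of a reasonable shape:

    # Generic random-secret generation; the comments in .env next to each
    # CHANGEME entry document the exact commands and formats the app expects.
    openssl rand -base64 32
    python3 -c "import secrets; print(secrets.token_urlsafe(32))"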

To stop the app:

docker compose -f docker-compose.yaml -f docker-compose.local.yaml down
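
The down command above stops and removes the containers but keeps any volumes the stack may have created, so data survives a restart. To also delete those volumes for a completely clean slate, Docker Compose accepts the -v flag:

    # Stop the stack and remove its volumes as well (deletes stored data)
    docker compose -f docker-compose.yaml -f docker-compose.local.yaml down -v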

Development Setup

If you want to contribute to LLM-Eval or run it in a development environment, follow these steps:

Development prerequisites

  • Python 3.12
  • Poetry
  • Docker (for required services)
  • Node.js & npm (for frontend)

Installation & Local Development

git clone <LLM-Eval github url>
cd llm-eval
poetry install --only=main,dev,test
poetry self add poetry-plugin-shell
  • Install Git pre-commit hook:

    pre-commit install
  1. Start Poetry shell:

    poetry shell
  2. Copy and configure environment:

    cp .env.example .env
    # Add your API keys and secrets to .env
    # Fill CHANGEME with appropriate keys
  3. Comment out the following lines in .env (or use the sed sketch after this list):

    from

    # container variables
    KEYCLOAK_HOST=keycloak
    CELERY_BROKER_HOST=rabbit-mq
    PG_HOST=eval-db

    to

    # container variables
    # KEYCLOAK_HOST=keycloak
    # CELERY_BROKER_HOST=rabbit-mq
    # PG_HOST=eval-db
  4. Start databases and other services:

    docker compose up -d
  5. Start backend:

    cd backend
    uvicorn llm_eval.main:app --host 0.0.0.0 --port 8070 --reload
  6. Start Celery worker:

    cd backend
    celery -A llm_eval.tasks worker --loglevel=INFO --concurrency=4
  7. Start frontend:

    cd frontend
    npm install
    npm run dev
  8. Log in using the default user:

    The default LLM-Eval credentials are username: username, password: password.
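
Step 3 above (commenting out the container-only host variables) can also be done from the command line. The sketch below assumes GNU sed; on macOS, use sed -i '' or install gsed instead:

    # Comment out the container-only host variables in .env (GNU sed syntax)
    sed -i 's/^KEYCLOAK_HOST=/# KEYCLOAK_HOST=/' .env
    sed -i 's/^CELERY_BROKER_HOST=/# CELERY_BROKER_HOST=/' .env
    sed -i 's/^PG_HOST=/# PG_HOST=/' .env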

Keycloak Setup (optional, only needed to override the defaults)

User access is managed through Keycloak, available at localhost:8080 (Default admin credentials: admin:admin). Select the llm-eval realm to manage users.

Acquiring tokens from Keycloak

Once Keycloak is up and running, tokens can be requested as follows.

Without a session, using the service client dev-ide (for direct backend API calls):

$ curl -X POST \
  'http://localhost:8080/realms/llm-eval/protocol/openid-connect/token' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'client_id=dev-ide' \
  -d 'client_secret=dev-ide' \
  -d 'grant_type=client_credentials' | jq

Or with a session, using the client llm-eval-ui (as the frontend does):

$ curl -X POST \
  'http://localhost:8080/realms/llm-eval/protocol/openid-connect/token' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'client_id=llm-eval-ui' \
  -d 'client_secret=llm-eval-ui' \
  -d 'username=username' \
  -d 'password=password' \
  -d 'grant_type=password' | jq
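
To call the backend started in the development setup with one of these tokens, extract the access_token field and send it as a Bearer header. The endpoint in the last line is only a placeholder; substitute a real backend route from the API you are testing:

    # Request a token with the dev-ide client and keep only the access_token
    TOKEN=$(curl -s -X POST \
      'http://localhost:8080/realms/llm-eval/protocol/openid-connect/token' \
      -H 'Content-Type: application/x-www-form-urlencoded' \
      -d 'client_id=dev-ide' \
      -d 'client_secret=dev-ide' \
      -d 'grant_type=client_credentials' | jq -r '.access_token')

    # Call the backend with the token; replace the path with a real route
    curl -H "Authorization: Bearer $TOKEN" 'http://localhost:8070/<backend-route>'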

🤝 Contributing & Code of Conduct

The repository is not yet set up for external contributions, so we are not accepting any at the moment.

📜 License

This project is licensed under the Apache 2.0 License.
