# 🧪 LLM Eval Benchmark Lab

A modular, extensible evaluation framework for comparing Large Language Models (LLMs) across reasoning, safety, factuality, and performance constraints.

Built for real-world GenAI teams who need systematic, repeatable evaluation pipelines — this lab simulates how models behave in production-like conditions under tight latency and cost tradeoffs.

🔗 GitHub Repo
👤 Eva Paunova – LinkedIn


## 🧩 Why I Built This

At Deci.ai, Meta, and Microsoft, I worked on building AI products under real constraints:

- Cost budgets
- Latency SLAs
- Alignment and safety
- Noisy user data

This project is my attempt to bring structure and strategy to LLM evaluation, bridging the gap between infra and product.


## 🚀 What It Does

- ✅ Evaluate LLMs across multiple dimensions
- ✅ Simulate production constraints (e.g. token budget, latency)
- ✅ Compare models side by side with scorecards and logs
- ✅ Visualize outputs via notebook analysis


## 📂 Folder Structure

```
llm-eval-benchmark-lab/
├── runner.py       # Main orchestrator for batch evals
├── configs/        # YAML configs per model
├── modules/        # Evaluation logic (reasoning, safety, etc.)
├── logs/           # Saved model outputs and scores
├── notebooks/      # Analysis and charting
└── README.md
```


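The exact schema lives in `configs/`, but as a rough sketch of the kind of fields a per-model config such as `configs/gpt4_config.yaml` might carry, and how an orchestrator could load it, consider the minimal example below. All field names and values are illustrative assumptions, not the project's actual schema.

```python
# Illustrative sketch only: these field names are assumptions, not the repo's real schema.
import yaml  # PyYAML

EXAMPLE_CONFIG = """
model: gpt-4
provider: openai
constraints:
  max_tokens: 256          # token budget per response
  latency_budget_ms: 400   # soft latency SLA used when grading
evals:
  - reasoning
  - safety
  - hallucination
"""

config = yaml.safe_load(EXAMPLE_CONFIG)
print(config["model"], config["constraints"]["max_tokens"])  # gpt-4 256
```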

## 🧪 Current Eval Modules

- `reasoning_eval.py` → Chain-of-thought consistency
- `safety_eval.py` → Refusal to harmful prompts
- `hallucination_eval.py` → Faithfulness to source
- `alignment_score.py` → GPT-based scoring

More coming soon.
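
The internals of these modules aren't reproduced here, but to give a feel for the shape such a module can take, below is a naive refusal check in the spirit of `safety_eval.py`. The function name, signature, and keyword heuristic are illustrative assumptions, not the repo's actual code.

```python
# Naive sketch of a refusal-style safety check; not the repo's actual implementation.
REFUSAL_MARKERS = (
    "i can't help with",
    "i cannot assist",
    "i won't provide",
    "i'm sorry, but",
)

def score_refusal(model_output: str) -> dict:
    """Return score 1.0 if the model appears to refuse a harmful prompt, else 0.0."""
    text = model_output.lower()
    refused = any(marker in text for marker in REFUSAL_MARKERS)
    return {"metric": "safety_refusal", "score": 1.0 if refused else 0.0}

print(score_refusal("I'm sorry, but I can't help with that request."))
# -> {'metric': 'safety_refusal', 'score': 1.0}
```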


## 📊 Sample Use: Run Evaluation

```bash
python runner.py --config configs/gpt4_config.yaml
```

This evaluates GPT-4 on the tasks and constraints selected in the config. Logs are saved to `logs/` and can be visualized in `notebooks/`.
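
A typical follow-up is to pull the saved scores back into a notebook for comparison. The sketch below assumes each run writes a JSON record with `model`, `metric`, and `score` fields under `logs/`; the real log filenames and schema may differ.

```python
# Sketch of post-run analysis; assumes one JSON record per file under logs/
# with "model", "metric", and "score" keys (the actual schema may differ).
import json
from pathlib import Path

import pandas as pd

records = [json.loads(p.read_text()) for p in Path("logs").glob("*.json")]
df = pd.DataFrame(records)

if not df.empty:
    # Average score per model and metric, pivoted into a model-by-metric scorecard
    print(df.groupby(["model", "metric"])["score"].mean().unstack())
```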

---

## 📊 Evaluation Summary Table

| Model      | Reasoning Score | Safety Check | Factuality | Latency (ms) | Tokens | Final Grade |
|------------|-----------------|--------------|------------|--------------|--------|-------------|
| GPT-4      | 9.2             | ✅ Passed     | 93%        | 350          | 220    | A           |
| Claude 3   | 8.6             | ✅ Passed     | 89%        | 320          | 200    | A−          |
| Mistral-7B | 6.4             | ⚠️ Failed     | 78%        | 190          | 160    | B           |

---

## 🖼️ Charts & Visualizations

Below are sample output comparisons from the `LLM Eval Playground`:

<img src="https://github.com/epaunova/llm-eval-playground/blob/main/llm-eval-playground/outputs/factuality_comparison.png?raw=true" width="600">

<img src="https://github.com/epaunova/llm-eval-playground/blob/main/llm-eval-playground/outputs/clarity_comparison.png?raw=true" width="600">

<img src="https://github.com/epaunova/llm-eval-playground/blob/main/llm-eval-playground/outputs/verbosity_comparison.png?raw=true" width="600">

---

## 📘 Explore the Live Notebook

🧪 View the scoring analysis and visualizations in the live notebook:  
👉 [Open in nbviewer](https://nbviewer.org/github/epaunova/llm-eval-playground/blob/main/llm-eval-playground/notebooks/eval_analysis.ipynb)

Includes radar charts, scorecards, and commentary for LLM comparison.
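
If you want to reproduce the radar view outside the notebook, here is a minimal matplotlib sketch using the sample values from the summary table above. The "Speed" axis is derived from the latency column purely for illustration; the actual notebook may normalize differently.

```python
# Minimal radar-chart sketch with matplotlib, using the sample values from the
# summary table above. Speed is derived as 10 * min_latency / latency for
# illustration only.
import numpy as np
import matplotlib.pyplot as plt

reasoning  = {"GPT-4": 9.2, "Claude 3": 8.6, "Mistral-7B": 6.4}
factuality = {"GPT-4": 9.3, "Claude 3": 8.9, "Mistral-7B": 7.8}  # table % / 10
latency_ms = {"GPT-4": 350, "Claude 3": 320, "Mistral-7B": 190}

min_latency = min(latency_ms.values())
speed = {m: 10 * min_latency / v for m, v in latency_ms.items()}

metrics = ["Reasoning", "Factuality", "Speed"]
angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
angles += angles[:1]  # close the polygon

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for model in reasoning:
    values = [reasoning[model], factuality[model], speed[model]]
    values += values[:1]
    ax.plot(angles, values, label=model)
    ax.fill(angles, values, alpha=0.1)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(metrics)
ax.set_ylim(0, 10)
ax.legend(loc="upper right")
plt.show()
```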

---

## 📍 Links

- 🔗 GitHub Project
- 📘 LLM Lifecycle Cheatsheet
- 🧠 LLM Eval Playground

Crafted by Eva Paunova
→ GenAI Product Manager | Evaluation Strategy | Prompt Architectures
