Pinned repositories
**LLM-Drift-Observatory** · Public · Jupyter Notebook
A hands-on framework for detecting and visualizing **behavioral drift** in Large Language Models (LLMs) across versions and providers.
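The drift idea can be made concrete with a tiny, illustrative sketch. This is not code from the repository: it assumes drift is approximated as plain string dissimilarity between two model versions' answers to the same prompt, using only the standard library.

```python
from difflib import SequenceMatcher

def drift_score(old_response: str, new_response: str) -> float:
    """1 - string similarity: 0.0 means identical outputs, values near 1.0 mean heavy drift."""
    return 1.0 - SequenceMatcher(None, old_response, new_response).ratio()

# The same prompt answered by two hypothetical model versions (stub strings).
v1 = "The Eiffel Tower is 330 metres tall and stands in Paris."
v2 = "Standing at roughly 330 m, the Eiffel Tower dominates the Paris skyline."
print(f"drift score: {drift_score(v1, v2):.2f}")
```

A fuller drift analysis would typically aggregate such a signal (or embedding similarity, or eval scores) over many prompts, but the shape of the computation stays the same: one prompt, two model snapshots, one scalar difference.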
**LLM-Scoring-Dashboard-Streamlit-OpenAI-eval-UI-** · Public · Python · ★ 1
Streamlit-based interactive dashboard to evaluate LLM outputs on key qualitative metrics: Factuality, Clarity, and Style.
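As a rough illustration of what a Streamlit scoring UI for those three metrics can look like (a sketch, not the dashboard's actual code; the widget layout and the 1 to 5 scale are assumptions):

```python
import streamlit as st  # run with: streamlit run scoring_app.py

st.title("LLM Output Scoring")
output = st.text_area("Paste an LLM output to score")

# One slider per qualitative metric named in the repo description.
factuality = st.slider("Factuality", 1, 5, 3)
clarity = st.slider("Clarity", 1, 5, 3)
style = st.slider("Style", 1, 5, 3)

if st.button("Record scores") and output:
    st.json({"factuality": factuality, "clarity": clarity, "style": style})
```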
**LLM-eval-benchmark-lab** · Public · Python · ★ 1
A modular, configurable benchmarking harness for evaluating LLM behavior across tasks, constraints, and model classes.
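A "modular, configurable harness" usually boils down to a declarative run description plus a loop over tasks and models. The sketch below is only an assumed shape (the dataclass names, stub model, and constraint handling are invented for illustration, not the repository's API):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class BenchmarkTask:
    name: str
    prompt: str
    constraint: str  # e.g. "answer in at most two sentences"

@dataclass
class BenchmarkRun:
    models: Dict[str, Callable[[str], str]]  # model name -> completion function
    tasks: List[BenchmarkTask]

    def run(self) -> List[dict]:
        results = []
        for task in self.tasks:
            for model_name, complete in self.models.items():
                output = complete(f"{task.prompt}\n\nConstraint: {task.constraint}")
                results.append({"task": task.name, "model": model_name, "output": output})
        return results

def stub_model(prompt: str) -> str:
    """Stand-in completion function so the sketch runs without any API keys."""
    return f"[stub completion for: {prompt[:40]}...]"

run = BenchmarkRun(
    models={"stub-small": stub_model},
    tasks=[BenchmarkTask("summarize", "Summarize the plot of Hamlet.", "max two sentences")],
)
for row in run.run():
    print(row)
```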
**Prompt-Efficiency-Sandbox** · Public · Jupyter Notebook · ★ 1
- Compare 3 versions of a prompt against:
  - GPT-3.5
  - Mistral
  - QLoRA-based small model
- Metrics: token usage, latency, eval score (see the sketch below)
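The first two metrics are easy to prototype. A minimal sketch, assuming a naive whitespace token count, wall-clock latency, and a stub completion function in place of real GPT-3.5 / Mistral / QLoRA calls (none of this is the notebook's actual code):

```python
import time

def measure(prompt: str, complete) -> dict:
    """Time one completion and record rough token counts (whitespace split as a proxy)."""
    start = time.perf_counter()
    output = complete(prompt)
    latency = time.perf_counter() - start
    return {
        "prompt_tokens": len(prompt.split()),
        "output_tokens": len(output.split()),
        "latency_s": round(latency, 4),
    }

prompt_variants = {
    "v1-verbose": "Please carefully summarize the following article in full detail: ...",
    "v2-terse": "Summarize: ...",
    "v3-structured": "Summarize as 3 bullet points: ...",
}

def stub_model(prompt: str) -> str:
    return "A short stub summary."  # stand-in for a real model call

for name, prompt in prompt_variants.items():
    print(name, measure(prompt, stub_model))
```

The eval-score metric would come from a separate grader, for example an LLM-as-judge like the one sketched under LLM-Evaluation-Toolkit below.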
**LLM-Evaluation-Toolkit** · Public · Python · ★ 1
A practical toolkit for evaluating LLM outputs using GPT-based auto-grading. Designed for product teams to benchmark factuality, coherence, and tone in real-world use cases.
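GPT-based auto-grading generally means prompting a judge model with a rubric and parsing structured scores back. A minimal sketch, assuming the openai>=1.0 Python client; the judge model name, rubric wording, and JSON schema are placeholders, not the toolkit's own:

```python
import json
from openai import OpenAI  # assumes the openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Grade the RESPONSE on factuality, coherence, and tone, each from 1 to 5. "
    'Reply with JSON only, e.g. {"factuality": 4, "coherence": 5, "tone": 3}.'
)

def auto_grade(prompt: str, response: str, judge_model: str = "gpt-4o-mini") -> dict:
    """Ask a judge model to score one response against the rubric and parse its JSON reply."""
    grading_prompt = f"{RUBRIC}\n\nPROMPT:\n{prompt}\n\nRESPONSE:\n{response}"
    reply = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": grading_prompt}],
        temperature=0,
    )
    # Will raise if the judge strays from pure JSON; a production grader would validate/retry.
    return json.loads(reply.choices[0].message.content)

# Example (requires an API key):
# print(auto_grade("What is the capital of France?", "Paris is the capital of France."))
```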