Pinned repositories
**LLM-Drift-Observatory** · Public · Jupyter Notebook
A hands-on framework for detecting and visualizing **behavioral drift** in Large Language Models (LLMs) across versions and providers.
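The drift idea can be made concrete with a tiny, illustrative sketch. This is not code from the repository: it assumes drift is approximated as plain string dissimilarity between two model versions' answers to the same prompt, using only the standard library.

```python
from difflib import SequenceMatcher

def drift_score(old_response: str, new_response: str) -> float:
    """1 - string similarity: 0.0 means identical outputs, values near 1.0 mean heavy drift."""
    return 1.0 - SequenceMatcher(None, old_response, new_response).ratio()

# The same prompt answered by two hypothetical model versions (stub strings).
v1 = "The Eiffel Tower is 330 metres tall and stands in Paris."
v2 = "Standing at roughly 330 m, the Eiffel Tower dominates the Paris skyline."
print(f"drift score: {drift_score(v1, v2):.2f}")
```

A fuller drift analysis would typically aggregate such a signal (or embedding similarity, or eval scores) over many prompts, but the shape of the computation stays the same: one prompt, two model snapshots, one scalar difference.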
**LLM-Scoring-Dashboard-Streamlit-OpenAI-eval-UI-** · Public · Python · ★ 1
Streamlit-based interactive dashboard to evaluate LLM outputs on key qualitative metrics: Factuality, Clarity, and Style.
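As a rough illustration of what a Streamlit scoring UI for those three metrics can look like (a sketch, not the dashboard's actual code; the widget layout and the 1 to 5 scale are assumptions):

```python
import streamlit as st  # run with: streamlit run scoring_app.py

st.title("LLM Output Scoring")
output = st.text_area("Paste an LLM output to score")

# One slider per qualitative metric named in the repo description.
factuality = st.slider("Factuality", 1, 5, 3)
clarity = st.slider("Clarity", 1, 5, 3)
style = st.slider("Style", 1, 5, 3)

if st.button("Record scores") and output:
    st.json({"factuality": factuality, "clarity": clarity, "style": style})
```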
**LLM-eval-benchmark-lab** · Public · Python · ★ 1
A modular, configurable benchmarking harness for evaluating LLM behavior across tasks, constraints, and model classes.
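A "modular, configurable harness" usually boils down to a declarative run description plus a loop over tasks and models. The sketch below is only an assumed shape (the dataclass names, stub model, and constraint handling are invented for illustration, not the repository's API):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class BenchmarkTask:
    name: str
    prompt: str
    constraint: str  # e.g. "answer in at most two sentences"

@dataclass
class BenchmarkRun:
    models: Dict[str, Callable[[str], str]]  # model name -> completion function
    tasks: List[BenchmarkTask]

    def run(self) -> List[dict]:
        results = []
        for task in self.tasks:
            for model_name, complete in self.models.items():
                output = complete(f"{task.prompt}\n\nConstraint: {task.constraint}")
                results.append({"task": task.name, "model": model_name, "output": output})
        return results

def stub_model(prompt: str) -> str:
    """Stand-in completion function so the sketch runs without any API keys."""
    return f"[stub completion for: {prompt[:40]}...]"

run = BenchmarkRun(
    models={"stub-small": stub_model},
    tasks=[BenchmarkTask("summarize", "Summarize the plot of Hamlet.", "max two sentences")],
)
for row in run.run():
    print(row)
```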
**Prompt-Efficiency-Sandbox** · Public · Jupyter Notebook · ★ 1
- Compare 3 versions of a prompt against:
  - GPT-3.5
  - Mistral
  - QLoRA-based small model
- Metrics: token usage, latency, eval score (see the sketch below)
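The first two metrics are easy to prototype. A minimal sketch, assuming a naive whitespace token count, wall-clock latency, and a stub completion function in place of real GPT-3.5 / Mistral / QLoRA calls (none of this is the notebook's actual code):

```python
import time

def measure(prompt: str, complete) -> dict:
    """Time one completion and record rough token counts (whitespace split as a proxy)."""
    start = time.perf_counter()
    output = complete(prompt)
    latency = time.perf_counter() - start
    return {
        "prompt_tokens": len(prompt.split()),
        "output_tokens": len(output.split()),
        "latency_s": round(latency, 4),
    }

prompt_variants = {
    "v1-verbose": "Please carefully summarize the following article in full detail: ...",
    "v2-terse": "Summarize: ...",
    "v3-structured": "Summarize as 3 bullet points: ...",
}

def stub_model(prompt: str) -> str:
    return "A short stub summary."  # stand-in for a real model call

for name, prompt in prompt_variants.items():
    print(name, measure(prompt, stub_model))
```

The eval-score metric would come from a separate grader, for example an LLM-as-judge like the one sketched under LLM-Evaluation-Toolkit below.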
**LLM-Evaluation-Toolkit** · Public · Python · ★ 1
A practical toolkit for evaluating LLM outputs using GPT-based auto-grading. Designed for product teams to benchmark factuality, coherence, and tone in real-world use cases.
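GPT-based auto-grading generally means prompting a judge model with a rubric and parsing structured scores back. A minimal sketch, assuming the openai>=1.0 Python client; the judge model name, rubric wording, and JSON schema are placeholders, not the toolkit's own:

```python
import json
from openai import OpenAI  # assumes the openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Grade the RESPONSE on factuality, coherence, and tone, each from 1 to 5. "
    'Reply with JSON only, e.g. {"factuality": 4, "coherence": 5, "tone": 3}.'
)

def auto_grade(prompt: str, response: str, judge_model: str = "gpt-4o-mini") -> dict:
    """Ask a judge model to score one response against the rubric and parse its JSON reply."""
    grading_prompt = f"{RUBRIC}\n\nPROMPT:\n{prompt}\n\nRESPONSE:\n{response}"
    reply = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": grading_prompt}],
        temperature=0,
    )
    # Will raise if the judge strays from pure JSON; a production grader would validate/retry.
    return json.loads(reply.choices[0].message.content)

# Example (requires an API key):
# print(auto_grade("What is the capital of France?", "Paris is the capital of France."))
```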