LiteLLM Infra Model Evaluation Project

Overview

This repository is an experiment in using state-of-the-art Large Language Models (LLMs) to set up a simple, production-ready infrastructure for an AI API gateway based on LiteLLM. The goal was to test how well different LLMs can interpret a requirements plan and implement a real-world infrastructure project, and then judge each other's work.

Project Workflow

  1. Initial Plan: The project began with a single file, litellm_infra_plan.md, outlining the requirements for a modular, secure, and extensible LiteLLM-based API gateway using Docker Compose and Caddy.

  2. Branching & Model Setup: For each LLM, a new branch was created. Each model was asked to set up the project from scratch, following the plan. The models tested were:

    • o3
    • gemini-2.5-pro
    • claude-4-sonnet
    • claude-4-opus
  3. Cross-Model Judging: After all branches were created, each model was asked (from the main branch) to review all four branches and deliver a verdict on which setup was best, with reasoning. The verdicts are saved in the *_verdict.md files on the main branch.
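The stack the plan describes (LiteLLM behind a Caddy reverse proxy, orchestrated with Docker Compose) can be sketched roughly as follows. This is an illustrative sketch only: the service names, image tags, ports, and file paths are assumptions, not taken from any of the branches.

```yaml
# Hypothetical docker-compose.yml sketch of the planned stack.
# Image tags and mounted paths are illustrative, not from the repo.
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    env_file: .env                            # provider API keys, master key
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    command: ["--config", "/app/config.yaml"]

  caddy:
    image: caddy:2-alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile      # TLS and routing to litellm
    depends_on:
      - litellm
```

Keeping Caddy as the only service with published ports means TLS termination and access control sit in one place, which is in line with the plan's emphasis on a secure, modular gateway.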

Verdicts & Comparative Analysis

Summary Table

| Model           | Winner Chosen   | Runner-Up       | Notable Comments                   |
|-----------------|-----------------|-----------------|------------------------------------|
| o3              | claude-4-opus   | claude-4-sonnet | Most complete, best onboarding     |
| gemini-2.5-pro  | claude-4-sonnet | claude-4-opus   | Modular, robust, extensible        |
| claude-4-sonnet | claude-4-opus   | claude-4-sonnet | Makefile, scripts, prod/dev split  |
| claude-4-opus   | claude-4-sonnet | claude-4-opus   | Documentation, modularity, security|

Key Insights from the Verdicts

  • claude-4-opus and claude-4-sonnet were consistently rated as the top two setups by all models.
  • claude-4-opus was praised for its comprehensive documentation, Makefile automation, onboarding experience, and production/development separation.
  • claude-4-sonnet was highlighted for its modularity, extensibility (with docker-compose.extensions.yml), and future-proof architecture (easy addition of Postgres, Redis, Prometheus, Grafana).
  • gemini-2.5-pro was recognized for its clean file organization and version pinning, but lacked documentation and automation.
  • o3 was the simplest and easiest to understand, but too minimal for production use (no docs, no advanced config, no helper scripts).

Interesting Patterns

  • Documentation and onboarding were universally valued. Branches with a detailed README and setup scripts were always rated higher.
  • Automation (Makefile/scripts) and dev/prod separation were seen as major strengths for real-world use.
  • Modularity and extensibility (especially via separate extension files) were considered best practice for scalable infrastructure.
  • Security and health checks in the Caddy config and Docker Compose were important for production-readiness.
  • Version pinning (as in gemini-2.5-pro) was noted as a good practice, but not enough to outweigh missing docs or automation.
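The extension pattern praised in the claude-4-sonnet branch can be illustrated with a hypothetical docker-compose.extensions.yml. The service definitions below are a sketch combining several of the points above (version pinning, health checks, optional Postgres/Redis/Prometheus); they are assumptions for illustration, not the branch's actual file.

```yaml
# Hypothetical docker-compose.extensions.yml: optional services kept out
# of the core compose file and enabled on demand with an extra -f flag.
services:
  postgres:
    image: postgres:16-alpine          # pinned tag, as the verdicts recommend
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - pg_data:/var/lib/postgresql/data
    healthcheck:                       # health checks were flagged as a prod must-have
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      retries: 5

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      retries: 5

  prometheus:
    image: prom/prometheus:v2.53.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

volumes:
  pg_data:
```

Docker Compose merges multiple files passed with -f, so the extras can be switched on without touching the core file, e.g. `docker compose -f docker-compose.yml -f docker-compose.extensions.yml up -d`.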

Model Disagreements

  • The top two branches (claude-4-opus and claude-4-sonnet) swapped places as winner/runner-up depending on the model, but all agreed these were the best.
  • All models agreed that o3 was only suitable for demos or as a learning scaffold.

How to Use This Repo

  • To see the original plan:
    • Read litellm_infra_plan.md on the main branch
  • To see each model's implementation:
    • Check out the corresponding branch: o3, gemini-2.5-pro, claude-4-sonnet, claude-4-opus
  • To see the verdicts:
    • Read the *_verdict.md files in the main branch

Conclusion

This project demonstrates that modern LLMs can not only follow infrastructure plans, but also critically evaluate and compare each other's work. The best results come from combining strong documentation, automation, modularity, and production best practices. If you want a robust starting point for a LiteLLM-based API gateway, use the claude-4-opus or claude-4-sonnet branches as your foundation.


This repository is a living experiment. Feel free to contribute, test new models, or suggest improvements to the evaluation process!


Note:

  • All tests and model evaluations were performed in the Cursor editor environment.
  • This README was generated by the GPT-4.1 model.
