ITBench

📢 Announcements

Latest Updates

[June 13, 2025] Identified 25+ additional scenarios to be developed over the summer.
[May 2, 2025] 🚀 ITBench now provides fully managed scenario environments for everyone! Our platform handles the complete workflow, from scenario deployment to agent evaluation and leaderboard updates. Visit our GitHub repository for guidelines and get started today.
[February 28, 2025] 🏆 Limited Access Beta: Invite-only access to the ITBench hosted scenario environments. ITBench handles scenario deployment, agent evaluation, and leaderboard updates. To request access, e-mail us here.
[February 7, 2025] 🎉 Initial release! Includes research paper, self-hosted environment setup tooling, sample scenarios, and baseline agents.

Overview

ITBench measures the performance of AI agents across a wide variety of complex, real-world-inspired IT automation tasks targeting three key use cases:

Use Case	Focus Area
SRE (Site Reliability Engineering)	Availability and resiliency
CISO (Compliance & Security Operations)	Compliance and security enforcement
FinOps (Financial Operations)	Cost efficiencies and ROI optimization

Key Features

Real-world representation of IT environments and incident scenarios
Open, extensible framework with comprehensive IT coverage
Push-button workflows and interpretable metrics
Kubernetes-based scenario environments

What's Included

ITBench enables researchers and developers to replicate real-world incidents in Kubernetes environments and develop AI agents to address them.

We provide:

Push-button deployment tooling for environment setup (open-source)
Framework for recreating realistic IT scenarios using the deployment tooling:
- 6 SRE scenarios and 21 mechanisms (open-source)
- 4 categories of CISO scenarios (open-source)
- 1 FinOps scenario (open-source)
Two reference AI agents:
- SRE (Site Reliability Engineering) Agent (open-source)
- CISO (Compliance & Security Operations) Agent (open-source)
Fully-managed leaderboard for agent evaluation and comparison

Roadmap

Timeline	Key Deliverables
July 2025	• Refactor leading to a scenario specification generator and runner, allowing most (if not all) mechanisms to be reused across diverse applications and microservices • Implementation of 10 of the additional scenarios identified
August 2025	• SRE-Agent-Lite: Lightweight agent to assist non-systems personnel with environment debugging • Snapshot & Replay: Data capture and replay capabilities • Implementation of 15 of the additional scenarios identified
Fall 2025	BYOA (Bring Your Own Application): Support for custom application integration

Leaderboard

The ITBench Leaderboard tracks agent performance across SRE, FinOps, and CISO scenarios. We provide fully managed scenario environments while researchers/developers run their agents on their own systems and submit their outputs for evaluation.

Domain	Leaderboard
SRE	View SRE Leaderboard
CISO	View CISO Leaderboard

Get Started: Visit docs/leaderboard.md for access and evaluation guidelines.

Scenarios

ITBench incorporates a collection of problems that we call scenarios. Each scenario is deployed in an operational environment where specific problems occur.

Examples of Scenarios

SRE: Resolve "High error rate on service checkout" in a Kubernetes environment
CISO: Assess compliance posture for "new control rule detected for RHEL 9"
FinOps: Identify and resolve cost overruns and anomalies

Find all scenarios: Scenarios repository
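
To make the idea of a scenario concrete, the three examples above can be modelled as simple records grouped by domain. This is only an illustrative sketch: the `Scenario` class, its fields, and `group_by_domain` are hypothetical names chosen here, not the schema used by the ITBench scenarios repository.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical record for one scenario (not the project's actual schema)."""
    domain: str  # "SRE", "CISO", or "FinOps"
    task: str    # what the agent is asked to do


# The three example scenarios listed above.
EXAMPLES = [
    Scenario("SRE", 'Resolve "High error rate on service checkout" in a Kubernetes environment'),
    Scenario("CISO", 'Assess compliance posture for "new control rule detected for RHEL 9"'),
    Scenario("FinOps", "Identify and resolve cost overruns and anomalies"),
]


def group_by_domain(scenarios):
    """Return a dict mapping each domain to the list of its scenarios."""
    grouped = defaultdict(list)
    for scenario in scenarios:
        grouped[scenario.domain].append(scenario)
    return dict(grouped)


if __name__ == "__main__":
    for domain, items in group_by_domain(EXAMPLES).items():
        print(f"{domain}: {len(items)} scenario(s)")
```

A real scenario also involves a deployed environment and evaluation criteria, which this sketch deliberately leaves out.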

Agents

Two baseline agents, both built using the CrewAI framework, are being open-sourced with ITBench.

Agent Features

Configurable LLMs: Support for watsonx, Azure, or vLLM
Natural language tools: Interactions with the environment for information gathering
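
As a rough illustration of what a configurable LLM backend might look like, the sketch below selects one of the three providers named above from environment-style settings. It assumes nothing about the agents' real configuration: the variable names `AGENT_LLM_PROVIDER` and `AGENT_LLM_MODEL`, the `LLMConfig` class, and `load_llm_config` are all hypothetical, so consult the agent repositories for the actual settings.

```python
import os
from dataclasses import dataclass

# Providers named in the feature list above.
SUPPORTED_PROVIDERS = ("watsonx", "azure", "vllm")


@dataclass(frozen=True)
class LLMConfig:
    """Hypothetical container for the selected backend and model."""
    provider: str
    model: str


def load_llm_config(env=None):
    """Build an LLMConfig from a mapping of settings (defaults to os.environ).

    AGENT_LLM_PROVIDER and AGENT_LLM_MODEL are made-up names for illustration.
    """
    env = os.environ if env is None else env
    provider = env.get("AGENT_LLM_PROVIDER", "").strip().lower()
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(
            f"unsupported provider {provider!r}; expected one of {SUPPORTED_PROVIDERS}"
        )
    model = env.get("AGENT_LLM_MODEL", "").strip()
    if not model:
        raise ValueError("AGENT_LLM_MODEL must be set")
    return LLMConfig(provider=provider, model=model)
```

Passing a plain dict instead of relying on `os.environ` keeps the function easy to test, for example `load_llm_config({"AGENT_LLM_PROVIDER": "vllm", "AGENT_LLM_MODEL": "my-model"})`.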

Available Agents

Agent	Repository
SRE Agent	itbench-sre-agent
CISO Agent	itbench-ciso-caa-agent

How to Cite

@misc{jha2025itbench,
      title={ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks},
      author={Jha, Saurabh and Arora, Rohan and Watanabe, Yuji and others},
      year={2025},
      url={https://github.com/IBM/itbench-sample-scenarios/blob/main/it_bench_arxiv.pdf}
}

Join the Discussion

Have questions or need help getting started with ITBench?

Create a GitHub issue for bug reports or feature requests
Join our Discord community for real-time discussions
For formal inquiries, please see the contacts section

Contacts

General inquiries: [email protected]
Saurabh Jha: [email protected]
Yuji Watanabe: [email protected]
