Title
AgentEvaluator only reports the first failing metric, subsequent metrics not evaluated
Describe the bug
When running tests that call AgentEvaluator.evaluate_eval_set, an AssertionError is raised as soon as the first metric in the criteria dict falls below its threshold. This halts test execution, so any remaining metrics are never evaluated. As a result, the test report does not give a complete picture of all metric failures for the agent under evaluation.
To Reproduce
Steps to reproduce the behavior:
- Create a test case with multiple metrics in the criteria dict
- Ensure that the first metric fails and that the second metric would also fail if it were evaluated.
- Run the test with pytest.
- Observe that only the first metric failure is reported; the second is never evaluated or reported (see the reproduction sketch below).
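
A minimal sketch of such a test, assuming pytest with pytest-asyncio, an agent module named `my_agent`, and an eval set built elsewhere. The module name, metric names, and the exact `evaluate_eval_set` signature are assumptions and may differ between ADK versions:

```python
# Sketch only: module name, metric names, and the evaluate_eval_set
# signature are assumptions, not confirmed ADK API details.
import pytest

from google.adk.evaluation.agent_evaluator import AgentEvaluator


@pytest.mark.asyncio
async def test_agent_eval_reports_all_metric_failures():
    # Both thresholds are deliberately strict so that both metrics fail.
    criteria = {
        "tool_trajectory_avg_score": 1.0,
        "response_match_score": 1.0,
    }
    # Expected: the AssertionError should mention both metrics.
    # Observed: only the first failing metric appears; the second is never evaluated.
    await AgentEvaluator.evaluate_eval_set(
        agent_module="my_agent",
        eval_set=...,  # placeholder: an EvalSet constructed/loaded elsewhere
        criteria=criteria,
        num_runs=1,
    )
```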
Expected behavior
All metrics in the criteria should be evaluated, and the test should report failures for all that do not meet the threshold, not just the first one.
Actual behavior
Test execution stops at the first failing metric due to an assertion, and subsequent metrics are not checked.
Potential Fix
Accumulate failures for each metric in a list and, after all metrics have been evaluated, raise a single AssertionError that includes details of every failed metric (see the sketch below).
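
A minimal, self-contained sketch of that idea; the helper name and metric names are illustrative and this is not the actual ADK internals:

```python
from typing import Dict


def assert_all_metrics_pass(scores: Dict[str, float], criteria: Dict[str, float]) -> None:
    """Check every metric against its threshold and report all failures together.

    Hypothetical helper illustrating the proposed fix; not part of the ADK API.
    """
    failures = []
    for metric_name, threshold in criteria.items():
        score = scores.get(metric_name)
        if score is None or score < threshold:
            failures.append(
                f"{metric_name}: score {score} did not meet threshold {threshold}"
            )
    # Raise a single AssertionError only after every metric has been checked,
    # so the message lists all metrics that missed their thresholds.
    if failures:
        raise AssertionError("Metric failures:\n" + "\n".join(failures))


# Example: both metrics miss their thresholds and both appear in the error message.
assert_all_metrics_pass(
    scores={"tool_trajectory_avg_score": 0.4, "response_match_score": 0.2},
    criteria={"tool_trajectory_avg_score": 0.9, "response_match_score": 0.8},
)
```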
Version observed
v1.0.0