test(W-19105940): confidence tests #60

Open
wants to merge 62 commits into
base: main
62 commits
cc4eba1
feat: add telemetry
mdonnalley Jun 10, 2025
1e8e9a2
Merge branch 'main' into mdonnalley/telemetry
mdonnalley Jun 10, 2025
e15d1e8
fix: ensure all events have same props
mdonnalley Jun 10, 2025
5ba0e32
chore: clean up
mdonnalley Jun 10, 2025
4df91ce
feat: add runtimeMs
mdonnalley Jun 11, 2025
e5c062a
fix: handle failed connection to appinsights
mdonnalley Jun 11, 2025
8e393fa
feat: logging
mdonnalley Jun 12, 2025
fe2d52b
chore: extract method signatures
cristiand391 Jun 12, 2025
5e85f75
chore: code review
mdonnalley Jun 12, 2025
35aad85
Merge branch 'mdonnalley/telemetry' into mdonnalley/logging
mdonnalley Jun 12, 2025
c6d7f79
feat: add client info to telemetry events
mdonnalley Jun 12, 2025
7eda5d1
fix: init telemetry in catch
mdonnalley Jun 12, 2025
05d188b
fix: consolidate TOOL_CALLED and TOOL_ERROR
mdonnalley Jun 12, 2025
6a2c895
Merge branch 'mdonnalley/telemetry' into mdonnalley/logging
mdonnalley Jun 12, 2025
024fa77
feat: count tokens of each tool
mdonnalley Jun 12, 2025
a909d69
chore: clean up token count logging
mdonnalley Jun 13, 2025
14ce216
Merge branch 'main' into mdonnalley/logging
mdonnalley Jun 13, 2025
b7dc943
Merge remote-tracking branch 'origin/main' into mdonnalley/logging
cristiand391 Jun 16, 2025
c5bb910
chore: bump core
cristiand391 Jun 16, 2025
e93b915
fix: remove faulty token counting
mdonnalley Jun 16, 2025
fd6d746
Merge branch 'mdonnalley/logging' of github.com:salesforcecli/mcp int…
mdonnalley Jun 16, 2025
91523e2
fix: set SF_LOG_LEVEL when using --debug
mdonnalley Jun 16, 2025
03b7a3d
test: add testing against LLM Gateway
mdonnalley Jun 16, 2025
394cb50
Merge branch 'main' into mdonnalley/llmg
mdonnalley Jun 16, 2025
5cefe3d
chore: clean up
mdonnalley Jun 16, 2025
15aa440
test: list token counts
mdonnalley Jun 16, 2025
d18028f
test: make more developer friendly
mdonnalley Jun 16, 2025
bb3754a
chore: clean up
mdonnalley Jun 17, 2025
0b534d1
chore: clean up
mdonnalley Jun 17, 2025
4856d5b
chore: clean up
mdonnalley Jun 17, 2025
3ce1fc8
chore: make tables prettier
mdonnalley Jun 17, 2025
f66b793
test: allow longer chats
mdonnalley Jun 18, 2025
a6b3705
test: clean up implementation
mdonnalley Jun 18, 2025
6f86505
test: add --entry-point flag
mdonnalley Jun 18, 2025
1446d75
Merge branch 'main' into mdonnalley/llmg
mdonnalley Jun 26, 2025
9db0618
Merge branch 'main' into mdonnalley/llmg
mdonnalley Jul 18, 2025
5a2cb85
fix: use correct inspector package
mdonnalley Jul 18, 2025
77714c7
test: convert to confidence test
mdonnalley Jul 18, 2025
1fedd54
refactor: use model const
mdonnalley Jul 18, 2025
343cc6c
refactor: make into confidence test
mdonnalley Jul 21, 2025
bf0a38e
chore: clean up
mdonnalley Jul 22, 2025
540f045
chore: clean up
mdonnalley Jul 22, 2025
889a267
refactor: make it easier to add more confidence commands
mdonnalley Jul 22, 2025
d0eeaf7
Merge branch 'main' into mdonnalley/llmg
mdonnalley Jul 22, 2025
de62399
chore: clean up and more test cases
mdonnalley Jul 23, 2025
01ad59a
chore: add comments about client feature
mdonnalley Jul 24, 2025
856532c
fix: implement retries for 429
mdonnalley Jul 24, 2025
6a7ba92
feat: rate limit api requests to avoid 429s
mdonnalley Jul 24, 2025
f41a056
feat: better result reporting
mdonnalley Jul 24, 2025
02d7f4e
fix: allow bursts in RateLimiter
mdonnalley Jul 24, 2025
865008f
ci: setup up GHA for confidence tests
mdonnalley Jul 24, 2025
6e54cba
docs: add docs about confidence tests
mdonnalley Jul 24, 2025
1a09b00
fix: replace --verbose with --concise
mdonnalley Jul 24, 2025
54011bb
fix: throw better error when SF_LLMG_API_KEY is not set
mdonnalley Jul 24, 2025
09e5406
ci: give confidence-test job access to env var
mdonnalley Jul 24, 2025
5ec6591
ci: fix test command
mdonnalley Jul 24, 2025
f801679
fix: smarter retry and limiting
mdonnalley Jul 24, 2025
f76a554
ci: debug failures
mdonnalley Jul 24, 2025
ee9e17f
ci: debug failures
mdonnalley Jul 24, 2025
25af4c6
refactor: user ECA consumer secret and key to auth
mdonnalley Aug 1, 2025
024a244
Merge branch 'main' into mdonnalley/llmg
mdonnalley Aug 1, 2025
52a8440
chore: bump oclif/table
mdonnalley Aug 1, 2025
42 changes: 42 additions & 0 deletions .github/workflows/test.yml
@@ -7,6 +7,23 @@ on:
jobs:
  yarn-lockfile-check:
    uses: salesforcecli/github-workflows/.github/workflows/lockFileCheck.yml@main

  # Detect which files have changed to determine what tests to run
  changes:
    runs-on: ubuntu-latest
    outputs:
      confidence-changed: ${{ steps.changes.outputs.confidence }}
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v2
        id: changes
        with:
          filters: |
            confidence:
              - 'confidence/**'
              - 'test/confidence/**'
              - 'src/tools/**'

  # Since the Windows unit tests take much longer, we run the linux unit tests first and then run the windows unit tests in parallel with NUTs
  linux-unit-tests:
    needs: yarn-lockfile-check
@@ -15,6 +32,31 @@ jobs:
    needs: linux-unit-tests
    uses: salesforcecli/github-workflows/.github/workflows/unitTestsWindows.yml@main

  # Run the confidence tests after the unit tests
  confidence-tests:
    needs: [linux-unit-tests, changes]
    runs-on: ubuntu-latest
    if: ${{ needs.changes.outputs.confidence-changed == 'true'}}
    env:
      SF_MCP_CONFIDENCE_CONSUMER_KEY: ${{ secrets.SF_MCP_CONFIDENCE_CONSUMER_KEY }}
      SF_MCP_CONFIDENCE_CONSUMER_SECRET: ${{ secrets.SF_MCP_CONFIDENCE_CONSUMER_SECRET }}
      SF_MCP_CONFIDENCE_INSTANCE_URL: ${{ secrets.SF_MCP_CONFIDENCE_INSTANCE_URL }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: lts/*
          cache: yarn
      - run: yarn install --frozen-lockfile
      # Note: we cannot parallelize confidence tests since we don't have the rate limits to support it
      # the test runner has rate limiting built-in to prevent hitting the API limits within that test run
      - name: Run confidence tests
        run: |
          for file in test/confidence/*.yml; do
            echo "Running confidence test for $file"
            yarn test:confidence --file "$file"
          done

  # Uncomment to enable NUT testing in Github Actions
  # nuts:
  #   needs: linux-unit-tests
54 changes: 52 additions & 2 deletions DEVELOPING.md
@@ -124,9 +124,59 @@ mcp-inspector --cli node bin/run.js --orgs DEFAULT_TARGET_ORG --method tools/lis

Unit tests are run with `yarn test` and use the Mocha test framework. Tests are located in the `test` directory and match the pattern `test/**/*.test.ts`.

### Confidence Tests

Confidence tests validate that various LLMs, called through the Salesforce LLM Gateway API, invoke the MCP server tools accurately. These tests check that natural language prompts trigger the expected tools with the expected parameters, which protects the quality of the AI-powered tool selection.

#### Running Confidence Tests Locally

1. **Set up API access**: Follow this [documentation](https://developer.salesforce.com/docs/einstein/genai/guide/access-models-api-with-rest.html) to set up an External Client App that gives you access to the Models API. Once you have the consumer key and secret from the External Client App, set them, along with your instance URL, as environment variables:

```shell
export SF_MCP_CONFIDENCE_CONSUMER_KEY=your_client_id_here
export SF_MCP_CONFIDENCE_CONSUMER_SECRET=your_client_secret_here
export SF_MCP_CONFIDENCE_INSTANCE_URL=https://your_instance.salesforce.com
```

These environment variables are used to generate a JWT that authenticates requests to the Models API.

2. **Run a confidence test**:
```shell
yarn test:confidence --file test/confidence/sf-deploy-metadata.yml
```
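
Because a run cannot authenticate without all three variables from step 1, it can help to check them up front and fail with one clear message. The following is a minimal sketch of that check, not the test runner's actual code: `ConfidenceEnv`, `REQUIRED_VARS`, and `readConfidenceEnv` are hypothetical names, and only the three `SF_MCP_CONFIDENCE_*` variable names come from this guide.

```typescript
// Hypothetical helper for illustration; the real runner may validate differently.
type ConfidenceEnv = {
  consumerKey: string;
  consumerSecret: string;
  instanceUrl: string;
};

// The three variable names documented in step 1.
const REQUIRED_VARS = [
  'SF_MCP_CONFIDENCE_CONSUMER_KEY',
  'SF_MCP_CONFIDENCE_CONSUMER_SECRET',
  'SF_MCP_CONFIDENCE_INSTANCE_URL',
];

function readConfidenceEnv(env: Record<string, string | undefined>): ConfidenceEnv {
  // Collect every missing or empty variable so the error lists all of them at once.
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }
  return {
    consumerKey: env['SF_MCP_CONFIDENCE_CONSUMER_KEY'] as string,
    consumerSecret: env['SF_MCP_CONFIDENCE_CONSUMER_SECRET'] as string,
    instanceUrl: env['SF_MCP_CONFIDENCE_INSTANCE_URL'] as string,
  };
}
```

In a Node.js script this would typically be called as `readConfidenceEnv(process.env)` before any request is made.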

#### Test Structure

Confidence tests are defined in YAML files located in `test/confidence/`. Each test file specifies:

- **Models**: Which LLMs to test against. See the Agentforce Developer Guide for [available models](https://developer.salesforce.com/docs/einstein/genai/guide/supported-models.html).
- **Initial Context**: Background information provided to the model
- **Test Cases**: Natural language utterances with expected tool invocations and confidence thresholds

The tests run multiple iterations (default: 5) to calculate confidence levels and ensure consistent tool selection across different model runs. This can be adjusted by passing the `--runs` flag when running the tests, like this:

```shell
yarn test:confidence --file test/confidence/sf-deploy-metadata.yml --runs 2
```

#### Understanding Test Results

Tests measure two types of confidence:

- **Tool Confidence**: Whether the correct tool was invoked
- **Parameter Confidence**: Whether the tool was called with the expected parameters

Failed tests indicate that either:

1. The model selected the wrong tool for a given prompt
2. The model selected the correct tool but with incorrect parameters
3. The confidence level fell below the specified threshold

These failures help identify areas where tool descriptions or agent instructions need improvement.
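
To make the relationship between runs, confidence, and thresholds concrete, here is a minimal sketch of how the two confidence values could be derived from repeated runs. It is an illustration under stated assumptions, not the test runner's implementation: the `RunResult` and `Expectation` shapes, the `computeConfidence` function, the example parameter name `sourceDir`, and the choice to apply one threshold to both values are all hypothetical.

```typescript
// Hypothetical shapes; the real runner's types are not shown in this guide.
type RunResult = {
  invokedTool: string | null; // tool the model chose on this run, or null if none
  parameters: Record<string, string>; // parameters the model passed to that tool
};

type Expectation = {
  expectedTool: string;
  expectedParameters: Record<string, string>;
  threshold: number; // minimum acceptable confidence, between 0 and 1 (assumed scale)
};

type ConfidenceReport = {
  toolConfidence: number;
  parameterConfidence: number;
  passed: boolean;
};

function computeConfidence(runs: RunResult[], expected: Expectation): ConfidenceReport {
  if (runs.length === 0) {
    throw new Error('At least one run is required');
  }
  // Runs where the model picked the expected tool.
  const toolHits = runs.filter((run) => run.invokedTool === expected.expectedTool);
  // Of those, runs where every expected parameter was passed with the expected value.
  const parameterHits = toolHits.filter((run) =>
    Object.keys(expected.expectedParameters).every(
      (key) => run.parameters[key] === expected.expectedParameters[key]
    )
  );
  // Assumption: both values are fractions of the total number of runs.
  const toolConfidence = toolHits.length / runs.length;
  const parameterConfidence = parameterHits.length / runs.length;
  return {
    toolConfidence,
    parameterConfidence,
    passed: toolConfidence >= expected.threshold && parameterConfidence >= expected.threshold,
  };
}
```

Under these assumptions, a test with the default 5 runs in which the expected tool is chosen 4 times has a tool confidence of 4/5, so it would fail against a threshold of 0.9 even though most runs behaved correctly.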

## Debugging

> [!NOTE]
> This section assumes you're using Visual Studio Code (VS Code).

You can use the VS Code debugger with the MCP Inspector CLI to step through the code of your MCP tools:
@@ -150,7 +200,7 @@ MCP_SERVER_REQUEST_TIMEOUT=120000 mcp-inspector --cli node --inspect-brk bin/run
We suggest you set `MCP_SERVER_REQUEST_TIMEOUT` to 120000ms (2 minutes) to allow longer debugging sessions without the MCP Inspector client timing out.
For other configuration values see: https://github.com/modelcontextprotocol/inspector?tab=readme-ov-file#configuration

> [!IMPORTANT]
> You must compile the local MCP server using `yarn compile` after every change in a TypeScript file, otherwise breakpoints in the TypeScript files might not match the running JavaScript code.

## Useful yarn Commands
29 changes: 29 additions & 0 deletions confidence/.eslintrc.cjs
@@ -0,0 +1,29 @@
/*
 * Copyright 2025, Salesforce, Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

module.exports = {
  extends: '../.eslintrc.cjs',
  parserOptions: {
    project: [
      './tsconfig.json',
      './test/tsconfig.json',
      './confidence/tsconfig.json', // Add this line
    ],
  },
  rules: {
    'import/no-extraneous-dependencies': ['error', { devDependencies: true }],
  },
};
23 changes: 23 additions & 0 deletions confidence/bin/dev.js
@@ -0,0 +1,23 @@
#!/usr/bin/env -S node --loader ts-node/esm --disable-warning=ExperimentalWarning

import { dirname } from 'node:path';
import { execute } from '@oclif/core';

process.env.NODE_TLS_REJECT_UNAUTHORIZED = '0'; // Disable TLS verification for local testing
await execute({
  development: true,
  dir: import.meta.url,
  loadOptions: {
    root: dirname(import.meta.dirname),
    pjson: {
      name: 'mcp-test',
      version: '1.0.0',
      oclif: {
        bin: 'mcp-test',
        dirname: 'mcp-test',
        commands: './lib/commands',
        topicSeparator: ' ',
      },
    },
  },
});
22 changes: 22 additions & 0 deletions confidence/bin/run.js
@@ -0,0 +1,22 @@
#!/usr/bin/env node

import { dirname } from 'node:path';
import { execute } from '@oclif/core';

process.env.NODE_TLS_REJECT_UNAUTHORIZED = '0'; // Disable TLS verification for local testing
await execute({
  dir: import.meta.url,
  loadOptions: {
    root: dirname(import.meta.dirname),
    pjson: {
      name: 'mcp-test',
      version: '1.0.0',
      oclif: {
        bin: 'mcp-test',
        dirname: 'mcp-test',
        commands: './lib/commands',
        topicSeparator: ' ',
      },
    },
  },
});