Add neuron-device-plugin load test to the test bed #499
base: main
Conversation
Resolved review threads:
- tests/tekton-resources/pipelines/eks/awscli-cl2-load-with-addons-slos.yaml
- tests/tekton-resources/tasks/generators/clusterloader/load-neuron-device-plugin.yaml
Marking as draft because I've synced the PR with the latest changes from the main branch, which move the pipelines to SMNG. I'll mark it as ready once I do another test run of the pipeline with the latest changes.
Confirmed that this works on self-managed node groups as well.
daemonsets are ready. This ensures that the load tests don't start prematurely and inflate the pod-startup latency numbers. Removed neuron-scheduler since it is not being used.
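The readiness gate described above could be expressed as a clusterloader2 measurement along these lines. This is a hedged sketch, not the PR's actual config: the measurement identifier, label selector, and timeout are assumptions.

```yaml
# Hypothetical sketch: block the load test until the device-plugin
# daemonset's pods are running, so that plugin startup time does not
# inflate the measured pod-startup latency. A matching step with
# action: gather would later wait for completion.
- name: Wait for neuron-device-plugin pods
  measurements:
  - Identifier: WaitForNeuronDevicePlugin
    Method: WaitForControlledPodsRunning
    Params:
      action: start
      apiVersion: apps/v1
      kind: DaemonSet
      labelSelector: name = neuron-device-plugin
      operationTimeout: 15m
```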
Outdated review thread:
- tests/tekton-resources/tasks/generators/clusterloader/load-neuron-device-plugin.yaml
tests/assets/neuron/config.yaml (outdated):
```yaml
Params:
  action: start
  labelSelector: group = neuron-worker
  threshold: 60s
```
The upstream SLO is <= 5s. Wondering why we are using 60s here?
Reduced it to 25s. From several test runs, I've observed that the time taken by the scheduler to schedule a pod exceeds 10-15 seconds when running on 5k nodes, so a safe upper bound is 25 seconds.
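For context, the threshold under discussion is consumed by clusterloader2's PodStartupLatency measurement, roughly as sketched below. This is an illustrative fragment; only the `Params` block appears in the diff, and the surrounding identifier/method lines are assumptions.

```yaml
# Hypothetical sketch of how the threshold is used: the "start" action
# begins watching pod creation events, and a later step with
# action: gather compares observed startup latencies against the
# threshold, failing the SLO check if it is exceeded.
- Identifier: PodStartupLatency
  Method: PodStartupLatency
  Params:
    action: start
    labelSelector: group = neuron-worker
    threshold: 25s
```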
25 seconds is still not under the upstream SLO. We should aim for the upstream SLO unless we have a strong reason or doc explaining why that's not possible.
```yaml
Params:
  action: start
  labelSelector: group = neuron-worker
  threshold: 25s
```
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.