Add neuron-device-plugin load test to the test bed #499
base: main
Conversation
Resolved review threads:
- tests/tekton-resources/pipelines/eks/awscli-cl2-load-with-addons-slos.yaml
- tests/tekton-resources/tasks/generators/clusterloader/load-neuron-device-plugin.yaml
Marking as draft because I've synced the PR with the latest changes from the main branch, which move the pipelines to SMNG. I'll mark it as ready once I do another test run of the pipeline with the latest changes.
Confirmed that this works on self-managed node groups as well.
daemonsets are ready. This ensures that the load tests don't start prematurely and inflate the pod-startup latency numbers. Removed neuron-scheduler since it is not being used.
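The readiness gate described above could be expressed as a clusterloader2 measurement along these lines. This is a hedged sketch, not the PR's actual config: the measurement identifier, label selector, and timeout are assumptions.

```yaml
# Hypothetical sketch: block the load test until the device-plugin
# daemonset's pods are running, so that plugin startup time does not
# inflate the measured pod-startup latency. A matching step with
# action: gather would later wait for completion.
- name: Wait for neuron-device-plugin pods
  measurements:
  - Identifier: WaitForNeuronDevicePlugin
    Method: WaitForControlledPodsRunning
    Params:
      action: start
      apiVersion: apps/v1
      kind: DaemonSet
      labelSelector: name = neuron-device-plugin
      operationTimeout: 15m
```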
Outdated review thread:
- tests/tekton-resources/tasks/generators/clusterloader/load-neuron-device-plugin.yaml
tests/assets/neuron/config.yaml (outdated):
```yaml
Params:
  action: start
  labelSelector: group = neuron-worker
  threshold: 60s
```
The upstream SLO is <= 5s. Wondering why we are using 60s here?
Reduced it to 25s. From several test runs, I've observed that the time taken by the scheduler to schedule a pod exceeds 10-15 seconds when running on 5k nodes, so a safe upper bound is 25 seconds.
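For context, the threshold under discussion is consumed by clusterloader2's PodStartupLatency measurement, roughly as sketched below. This is an illustrative fragment; only the `Params` block appears in the diff, and the surrounding identifier/method lines are assumptions.

```yaml
# Hypothetical sketch of how the threshold is used: the "start" action
# begins watching pod creation events, and a later step with
# action: gather compares observed startup latencies against the
# threshold, failing the SLO check if it is exceeded.
- Identifier: PodStartupLatency
  Method: PodStartupLatency
  Params:
    action: start
    labelSelector: group = neuron-worker
    threshold: 25s
```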
25 seconds is still not under the upstream SLO. We should aim for the upstream SLO unless we have a strong reason or doc explaining why that's not possible.
```yaml
Params:
  action: start
  labelSelector: group = neuron-worker
  threshold: 25s
```
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.