@shvbsle shvbsle commented Mar 24, 2025

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@shvbsle shvbsle marked this pull request as ready for review March 26, 2025 04:53
@shvbsle shvbsle changed the title from "WIP: add neuron load test to the test bed" to "Add neuron-device-plugin load test to the test bed" Mar 26, 2025
@shvbsle shvbsle marked this pull request as draft April 11, 2025 17:35
@shvbsle shvbsle marked this pull request as ready for review April 17, 2025 00:40
@shvbsle shvbsle marked this pull request as draft April 18, 2025 17:20

shvbsle commented Apr 18, 2025

Marking as draft because I've synced the PR with the latest changes from the main branch, which move the pipelines to SMNG. Will mark as ready once I do another test run of the pipeline with the latest changes.


shvbsle commented Apr 19, 2025

Confirmed that this works on self-managed node groups as well.

@shvbsle shvbsle marked this pull request as ready for review April 19, 2025 02:12
daemonsets are ready. This ensures that the load-tests don't start
prematurely and inflate the pod startup latency numbers. Removed
neuron-scheduler since it is not being used
Params:
  action: start
  labelSelector: group = neuron-worker
  threshold: 60s
Contributor

The upstream SLO is <= 5s.
Why are we using 60s here?

Contributor Author

Reduced it to 25s. Across several test runs, I've observed that the scheduler takes more than 10-15 seconds to schedule a pod when running on 5k nodes. A safe upper bound should be 25 seconds.
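
To make the threshold trade-off concrete, here is a minimal sketch of how a latency threshold check could behave. It assumes the threshold is compared against a high percentile (99th here) of observed pod startup latencies; the helper names and the sample values are hypothetical and are not taken from the test runs or from the load-test tooling.

```python
import math


def percentile(values, pct):
    """Return the nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Nearest-rank method: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def within_threshold(latencies_s, threshold_s, pct=99):
    """Check whether the chosen percentile of latencies is at or below the threshold."""
    return percentile(latencies_s, pct) <= threshold_s


# Hypothetical latencies in seconds, for illustration only (not measured data).
sample_latencies_s = [3.2, 4.8, 11.0, 12.5, 14.9]

passes_25s = within_threshold(sample_latencies_s, 25.0)
passes_5s = within_threshold(sample_latencies_s, 5.0)
print(passes_25s, passes_5s)
```

With latencies in the 10-15 second range, a check like this would pass at 25s but fail at the 5s upstream SLO, which is the gap under discussion.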

Contributor

25 seconds is still not under the upstream SLO. We should aim for the upstream SLO unless we have a strong reason or a doc explaining why that's not possible.

Params:
  action: start
  labelSelector: group = neuron-worker
  threshold: 25s
