Enhance model engine with max concurrency support for sync endpoints #701
This pull request introduces improvements to how the maximum concurrency per worker is configured, validated, and propagated across the model engine deployment and autoscaling logic. The main changes ensure that the concurrency setting is passed from configuration templates through to runtime, validated for autoscaling safety, and surfaced in resource reporting.
Configuration and Runtime Propagation:
- Added a `--max-concurrency` argument to the forwarder container command in `service_template_config_map.yaml`, allowing per-worker concurrency to be set via the `CONCURRENT_REQUESTS_PER_WORKER` environment variable. [1] [2] [3] [4]
- The forwarder now accepts `--max-concurrency` as a command-line argument, storing it in the environment for use in concurrency limiter setup.
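The flag-to-environment propagation described above might look like the following sketch. The flag name `--max-concurrency` and the `CONCURRENT_REQUESTS_PER_WORKER` variable come from this PR; the function name and parser setup are assumptions for illustration, not the actual forwarder code.

```python
import argparse
import os


def parse_forwarder_args(argv=None):
    # Hypothetical sketch of the forwarder's CLI parsing. Only the flag name
    # and environment variable name are taken from the PR; the rest is assumed.
    parser = argparse.ArgumentParser()
    parser.add_argument("--max-concurrency", type=int, default=None)
    args = parser.parse_args(argv)
    if args.max_concurrency is not None:
        # Store the value in the environment so the concurrency limiter
        # can pick it up later without threading the value through call sites.
        os.environ["CONCURRENT_REQUESTS_PER_WORKER"] = str(args.max_concurrency)
    return args
```

In the deployment template, the same value would be injected into the container command, e.g. `--max-concurrency ${CONCURRENT_REQUESTS_PER_WORKER}`, so configuration and runtime stay in sync.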
Validation Enhancements:
- Added a `validate_concurrent_requests_per_worker` function to check that, for sync/streaming endpoints, `per_worker` is less than half of `concurrent_requests_per_worker`, to prevent autoscaling issues. This validation is now called with the additional `per_worker` parameter during endpoint creation and update. [1] [2] [3] [4]
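A minimal sketch of that validation rule, assuming the function signature and endpoint-type values (the function name and the less-than-half rule are from the PR summary; the parameter order and error type are guesses):

```python
def validate_concurrent_requests_per_worker(
    per_worker: int,
    concurrent_requests_per_worker: int,
    endpoint_type: str,
) -> None:
    """Sketch of the autoscaling-safety check described in this PR.

    For sync/streaming endpoints, the autoscaling target (per_worker) must be
    less than half the per-worker concurrency limit; otherwise a worker can
    saturate its request slots before autoscaling has a chance to react.
    """
    if endpoint_type in ("sync", "streaming"):
        if per_worker >= concurrent_requests_per_worker / 2:
            raise ValueError(
                f"per_worker ({per_worker}) must be less than half of "
                f"concurrent_requests_per_worker "
                f"({concurrent_requests_per_worker}) for {endpoint_type} "
                f"endpoints to avoid autoscaling issues"
            )
```

The headroom factor of two keeps the steady-state load at the autoscaling target well below the hard concurrency cap, so scale-out triggers before requests start queueing at the limiter.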
Resource Reporting and Autoscaling:
- The `concurrent_requests_per_worker` value is now parsed from the forwarder container's command in deployment configs, making it available for autoscaling parameter calculations and resource reporting. [1] [2] [3]

These changes collectively improve the robustness and transparency of concurrency configuration and autoscaling for model endpoints.
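Recovering the value from a container command list could be done with a small helper like the one below. This is a sketch under assumptions: the helper name and return-`None` default are invented; only the `--max-concurrency` flag itself is from the PR.

```python
from typing import List, Optional


def get_concurrent_requests_per_worker(command: List[str]) -> Optional[int]:
    """Extract the value following --max-concurrency from a container
    command list, handling both the space-separated and '=' forms.

    Returns None when the flag is absent (hypothetical helper; names assumed).
    """
    for i, token in enumerate(command):
        if token == "--max-concurrency" and i + 1 < len(command):
            return int(command[i + 1])
        if token.startswith("--max-concurrency="):
            return int(token.split("=", 1)[1])
    return None
```

Parsing the deployed command rather than re-reading configuration means resource reporting reflects what is actually running, even if the config template changes between deployments.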
Pull Request Summary
Addressing MLI-3412
Test Plan and Usage Guide
How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.