
saeidbarati-scale (Collaborator)

This pull request introduces improvements to how the maximum concurrency per worker is configured, validated, and propagated across the model engine deployment and autoscaling logic. The main changes ensure that the concurrency setting is passed from configuration templates through to runtime, validated for autoscaling safety, and surfaced in resource reporting.

Configuration and Runtime Propagation:

  • Added the --max-concurrency argument to the forwarder container command in service_template_config_map.yaml, allowing per-worker concurrency to be set via environment variable (CONCURRENT_REQUESTS_PER_WORKER).
  • Updated the HTTP forwarder entrypoint to accept --max-concurrency as a command-line argument, storing it in the environment for use in concurrency limiter setup (see the sketch after this list).
  • Modified the concurrency limiter initialization to prefer the environment variable over config file settings, ensuring runtime concurrency matches deployment configuration.
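
A minimal sketch of the entrypoint change described above, assuming an argparse-based forwarder entrypoint. The flag and environment variable names (`--max-concurrency`, `CONCURRENT_REQUESTS_PER_WORKER`) come from this PR; the surrounding function is hypothetical.

```python
import argparse
import os


def entrypoint() -> None:
    """Hypothetical forwarder entrypoint: accepts --max-concurrency and
    exposes it to the rest of the process via an environment variable."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--max-concurrency",
        type=int,
        default=None,
        help="Maximum concurrent requests handled by this worker.",
    )
    args, _unknown = parser.parse_known_args()

    if args.max_concurrency is not None:
        # Store the value so downstream code (e.g. the concurrency limiter
        # setup) can read it without re-parsing the command line.
        os.environ["CONCURRENT_REQUESTS_PER_WORKER"] = str(args.max_concurrency)

    # ... start the HTTP forwarder here ...


if __name__ == "__main__":
    entrypoint()
```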

Validation Enhancements:

  • Improved the validate_concurrent_requests_per_worker function to check that, for sync/streaming endpoints, per_worker is less than half of concurrent_requests_per_worker, preventing autoscaling issues. This validation is now called with the additional per_worker parameter during endpoint creation and update (a sketch follows this list).
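
A hedged sketch of that validation rule. The function name validate_concurrent_requests_per_worker and the per_worker parameter come from the PR; the exception type, parameter order, and endpoint-type labels are assumptions.

```python
from typing import Optional

SYNC_ENDPOINT_TYPES = {"sync", "streaming"}  # assumed endpoint-type labels


def validate_concurrent_requests_per_worker(
    concurrent_requests_per_worker: Optional[int],
    endpoint_type: str,
    per_worker: Optional[int],
) -> None:
    """Reject configurations where autoscaling could misbehave: for sync and
    streaming endpoints, per_worker must stay below half of the per-worker
    concurrency limit."""
    if concurrent_requests_per_worker is None or per_worker is None:
        return
    if endpoint_type in SYNC_ENDPOINT_TYPES:
        if per_worker >= concurrent_requests_per_worker / 2:
            raise ValueError(
                "per_worker must be less than half of "
                "concurrent_requests_per_worker for sync/streaming endpoints"
            )
```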

Resource Reporting and Autoscaling:

  • Added logic to extract the concurrent_requests_per_worker value from the forwarder container’s command in deployment configs, making it available for autoscaling parameter calculations and resource reporting.
  • Updated autoscaling parameter constructors to use the extracted concurrency value instead of a hardcoded default, ensuring accurate scaling and reporting for sync/streaming endpoints (see the sketch after this list).
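
A sketch of the extraction step, assuming the forwarder container command is available as a list of strings (as in a Kubernetes container spec). The flag name --max-concurrency is from the PR; the helper names, return shape, and fallback default are illustrative only.

```python
from typing import List, Optional

DEFAULT_CONCURRENT_REQUESTS_PER_WORKER = 4  # illustrative fallback, not the real default


def get_concurrent_requests_per_worker(command: List[str]) -> Optional[int]:
    """Pull the value following --max-concurrency out of a container command,
    e.g. ["python", "forwarder.py", "--max-concurrency", "8", ...]."""
    for i, token in enumerate(command):
        if token == "--max-concurrency" and i + 1 < len(command):
            try:
                return int(command[i + 1])
            except ValueError:
                return None
    return None


def build_autoscaling_params(per_worker: int, forwarder_command: List[str]) -> dict:
    """Use the extracted concurrency instead of a hardcoded default when
    computing autoscaling parameters for sync/streaming endpoints."""
    concurrency = (
        get_concurrent_requests_per_worker(forwarder_command)
        or DEFAULT_CONCURRENT_REQUESTS_PER_WORKER
    )
    return {
        "per_worker": per_worker,
        "concurrent_requests_per_worker": concurrency,
    }
```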

These changes collectively improve the robustness and transparency of concurrency configuration and autoscaling for model endpoints.

Pull Request Summary

Addressing MLI-3412

Test Plan and Usage Guide

How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.

- Added `--max-concurrency` argument to service templates for configuring maximum concurrent requests per worker.
- Updated validation logic in `validate_concurrent_requests_per_worker` to handle sync and streaming endpoints.
- Modified `get_concurrency_limiter` to check for concurrency settings from environment variables before falling back to the config file (a sketch follows this list).
- Implemented extraction of `concurrent_requests_per_worker` from deployment configurations for autoscaling.
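
A minimal sketch of the environment-variable preference in the limiter setup. The name `get_concurrency_limiter` and the `CONCURRENT_REQUESTS_PER_WORKER` variable appear in this PR; the config attribute name is an assumption, and the integer return value stands in for whatever limiter object the real implementation constructs.

```python
import os
from typing import Any, Optional


def get_concurrency_limiter(config: Any) -> Optional[int]:
    """Return the concurrency limit for this worker, preferring the value
    injected at deploy time over whatever is in the config file."""
    env_value = os.getenv("CONCURRENT_REQUESTS_PER_WORKER")
    if env_value is not None:
        return int(env_value)
    # Fall back to the config file setting; the attribute name is assumed.
    return getattr(config, "concurrent_requests_per_worker", None)
```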