[Ready For Review][AQUA] Add Supporting Fine-Tuned Models in Multi-Model Deployment #1186
Conversation
…Base Model for Fine-Tuned Models (#1185)
…ation to Support FT Models (#1188) Co-authored-by: Liz Johnson <[email protected]>
lgtm 👍
ads/aqua/app.py
Outdated
@@ -284,8 +284,11 @@ def if_artifact_exist(self, model_id: str, **kwargs) -> bool:
            logger.info(f"Artifact not found in model {model_id}.")
            return False

    @cached(cache=TTLCache(maxsize=1, ttl=timedelta(minutes=1), timer=datetime.now))
Isn't a 1 min TTL too short here? It makes sense if we get multiple requests within a 1-minute interval, but we could probably extend it to 5 mins or so, so that the cached config is still returned while a user tries out different model combinations for a bit.
Done
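For reference, a minimal sketch of what the suggested change could look like, assuming the TTL is bumped from 1 to 5 minutes (the function name below is a placeholder, not the actual method from this PR):

```python
from datetime import datetime, timedelta

from cachetools import TTLCache, cached


# Cache the config for 5 minutes so repeated lookups within a short session
# (e.g. a user trying out different model combinations) hit the cache.
@cached(cache=TTLCache(maxsize=1, ttl=timedelta(minutes=5), timer=datetime.now))
def get_deployment_config(model_id: str) -> dict:
    # Placeholder body; in the PR this pattern decorates a method on the Aqua app class.
    ...
```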
ads/aqua/app.py
Outdated
raise AquaRuntimeError(f"Target model {oci_model.id} is not an Aqua model.")
logger.debug(f"Target model {oci_model.id} is not an Aqua model.")
return ModelConfigResult(config=config, model_details=oci_model)
# raise AquaRuntimeError(f"Target model {oci_model.id} is not an Aqua model.")
nit: remove commented code
Done
        total_gpus_available (int, optional): The total number of GPUs available for this shape.
    """

    models: Optional[List[GPUModelAllocation]] = Field(
Do we need to add protected_namespaces = () in Config, since we have an attribute here starting with model*? That would avoid warnings when running CLI commands.
Done
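A minimal sketch of the suggested fix, assuming Pydantic v2 (the class name below is a placeholder; the PR applies this to its own allocation and deployment detail models):

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, Field


class FineTunedModelSpecExample(BaseModel):
    # Field names starting with "model_" collide with Pydantic's protected
    # "model_" namespace and emit a UserWarning; clearing protected_namespaces
    # silences it.
    model_config = ConfigDict(protected_namespaces=())

    model_id: Optional[str] = Field(None, description="The fine-tuned model OCID to deploy.")
```

The class-based form (class Config: protected_namespaces = ()) has the same effect if that is the style already used in the codebase.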
for model_id, config in deployment_configs.items():
    # For multi model deployment, we cannot rely on .shape because some models, like Falcon-7B, can only be deployed on a single GPU card (A10.1).
    # However, Falcon can also be deployed on a single card in other A10 shapes, such as A10.2.
    # Our current configuration does not support this flexibility.
Ideally we should have relied only on the configuration for shape info, but this makes sense.
added minor comment, rest looks good.
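To illustrate the constraint being discussed, here is a rough sketch (not code from this PR) of picking a per-model GPU count on a multi-GPU shape, assuming each model's deployment config lists the GPU counts it supports:

```python
from typing import List


def pick_gpu_count(supported_gpu_counts: List[int], total_gpus_available: int) -> int:
    """Return the largest supported GPU count that fits on the target shape."""
    fitting = [g for g in supported_gpu_counts if g <= total_gpus_available]
    if not fitting:
        raise ValueError("Model cannot be allocated on this shape.")
    return max(fitting)


# Falcon-7B only lists a single-card shape (A10.1), but it can still occupy one
# card of an A10.2 shape alongside another model in a multi-model deployment.
assert pick_gpu_count([1], total_gpus_available=2) == 1
```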
        The model-by-reference path to the LoRA Module within the model artifact
    """

    model_id: Optional[str] = Field(None, description="The fine tuned model OCID to deploy.")
needs protected_namespaces = () in config to avoid warning messages showing up when running via CLI.
Description
The current implementation of Multi-Model Deployment in AQUA supports base models only. Fine-tuned models, however, are a critical part of many customer workflows, allowing customers to adapt base models to domain-specific use cases.
This PR introduces support for deploying fine-tuned LLMs as part of a multi-model deployment group on the VLLM container.
Implementation
In the first iteration, we will treat each selected model, whether it is a base model or a fine-tuned variant, as an independent entity. Even if multiple fine-tuned models share the same base model, each one will be deployed in its own isolated VLLM instance.
On the SMC side, we will leverage VLLM's capability to dynamically apply LoRA adapter weights at runtime. This means each VLLM instance will load the base model and its corresponding fine-tuned weights independently.
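As a rough illustration of the runtime LoRA mechanism this relies on (the model name and adapter path below are placeholders, not values from this PR), vLLM can serve a fine-tuned variant by attaching its LoRA adapter on top of the base weights:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One vLLM instance loads the base model with LoRA support enabled.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

# The fine-tuned variant is served by attaching its adapter weights at request
# time; the base weights themselves are left untouched.
outputs = llm.generate(
    ["Summarize the quarterly report."],
    SamplingParams(max_tokens=64),
    lora_request=LoRARequest("finance-adapter", 1, "/opt/lora/finance"),
)
```

The OpenAI-compatible vLLM server exposes the same capability through --enable-lora and --lora-modules, which is closer to how a deployment container would wire this up.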
To avoid routing conflicts caused by multiple instances using the same base model name, we will route the base model name to one instance only, but we will not advertise the base model as an endpoint to users (this matches the current behavior of Single Model Deployment).
This configuration structure will prepare us for future enhancements, such as stacked fine-tuned deployments, where multiple fine-tuned variants are hosted under a single base model within one VLLM instance. However, this future enhancement will apply to single-model deployments initially.
In a second iteration, we will explore expanding this capability to multi-model deployments, enabling grouped deployment of fine-tuned variants with shared GPU allocation. That enhancement will require additional work across the ADS SDK, AQUA UI, and validation logic.
Related PRs