
How to resolve errors when loading qwen1.5-7B (using two GPUs) and llama3-8B (using two GPUs) simultaneously with tritonserver? #510

Open
@ChengShuting

Description


System Info

env: NVIDIA-SMI 550.54.15, Driver Version 550.54.15, CUDA Version 12.4, 8x V100 16GB
docker image: nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

command: mpirun -n 2 --allow-run-as-root tritonserver --model-control-mode=explicit --model-repository=/data/multi_model_repo/ --load-model=Qwen1.5-7B-Chat --load-model=Llama3-8B-Chinese-Chat
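Not part of the original report, but a quick way to confirm that both models actually loaded is to query Triton's KServe HTTP endpoints. This is a minimal sketch assuming the server's default HTTP port 8000 on localhost:

```sh
# Check overall server readiness (default HTTP port 8000 is an assumption).
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready

# Confirm each model loaded; these return 200 only when the model is ready.
curl -s -o /dev/null -w "%{http_code}\n" \
  http://localhost:8000/v2/models/Qwen1.5-7B-Chat/ready
curl -s -o /dev/null -w "%{http_code}\n" \
  http://localhost:8000/v2/models/Llama3-8B-Chinese-Chat/ready
```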

Expected behavior

[screenshot of expected behavior; image not recoverable]

Actual behavior

Both models load successfully, but when I call the Qwen1.5-7B-Chat model through the OpenAI-compatible interface, an error occurs.
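For reference, a minimal sketch of the failing call. The original report does not include the client code, so the base URL, port, and endpoint path here are assumptions about the OpenAI-compatible gateway in front of Triton:

```sh
# Hypothetical reproduction of the failing request. The base URL and port
# are assumptions; adjust to wherever the OpenAI-compatible server listens.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen1.5-7B-Chat",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```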

Additional notes

No.


Labels

bug (Something isn't working) · triaged (Issue has been triaged by maintainers)
