
How to resolve errors when loading qwen1.5-7B (using two GPUs) and llama3-8B (using two GPUs) simultaneously with tritonserver? #510

Open
@ChengShuting

Description


System Info

env: NVIDIA-SMI 550.54.15, Driver Version 550.54.15, CUDA Version 12.4, 8x V100 16GB
docker image: nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

command: mpirun -n 2 --allow-run-as-root tritonserver --model-control-mode=explicit --model-repository=/data/multi_model_repo/ --load-model=Qwen1.5-7B-Chat --load-model=Llama3-8B-Chinese-Chat
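Not part of the original report, but a quick way to confirm that both models actually loaded is to query Triton's KServe HTTP endpoints. This is a minimal sketch assuming the server's default HTTP port 8000 on localhost:

```sh
# Check overall server readiness (default HTTP port 8000 is an assumption).
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready

# Confirm each model loaded; these return 200 only when the model is ready.
curl -s -o /dev/null -w "%{http_code}\n" \
  http://localhost:8000/v2/models/Qwen1.5-7B-Chat/ready
curl -s -o /dev/null -w "%{http_code}\n" \
  http://localhost:8000/v2/models/Llama3-8B-Chinese-Chat/ready
```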

Expected behavior

[screenshot of expected behavior; image not recoverable]

Actual behavior

Both models load successfully, but when I call the Qwen1.5-7B-Chat model through the OpenAI-compatible interface, an error occurs.
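For reference, a minimal sketch of the failing call. The original report does not include the client code, so the base URL, port, and endpoint path here are assumptions about the OpenAI-compatible gateway in front of Triton:

```sh
# Hypothetical reproduction of the failing request. The base URL and port
# are assumptions; adjust to wherever the OpenAI-compatible server listens.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen1.5-7B-Chat",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```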

Additional notes

No.


Labels

bug (Something isn't working) · triaged (Issue has been triaged by maintainers)
