Description
System Info
CPU Architecture: AMD EPYC 7V13 64-Core Processor
CPU/Host memory size: 440
GPU properties: A800 80GB
GPU name: NVIDIA A800 80GB x2
GPU mem size: 80 GB x 2
clock frequencies
Libraries
TensorRT-LLM branch or tag: main
TensorRT-LLM commit: ae52bce
Versions of TensorRT, CUDA: (10.0.1, 12.4)
container used: Built container from tensorrtllm_backend main branch using dockerfile/Dockerfile.trt_llm_backend
nvidia driver version: 535.161.07
OS: Ubuntu 22.04.4 LTS
docker image version: custom built from main branch
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Build the TensorRT-LLM container by running
DOCKER_BUILDKIT=1 TORCH_CUDA_ARCH_LIST= docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend
- Launch the container with this command
sudo docker run -it --net host --shm-size=20g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/keith:/home triton_trt_llm:latest /bin/bash
- Build the TensorRT engines following the guide https://github.com/NVIDIA/TensorRT-LLM/tree/bf0a5afc92f4b2b3191e9e55073953c1f600cf2d/examples/llama
- Launch the Triton server with decoding_mode set to top_k_top_p
python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/path/to/llama2-70b/repo --log
- Query the server twice with the same request
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "背一首诗", "max_tokens": 20, "bad_words": "", "stop_words": "</s>", "top_k": 100, "top_p": 1}'
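The query step above can also be scripted, which makes it easier to repeat the request and compare the outputs. This is only a minimal sketch: the URL and payload are copied from the curl command, while the helper names (`generate`, `all_identical`, `run_check`) are hypothetical and not part of tensorrtllm_backend.

```python
import json
import urllib.request

# Endpoint and payload copied from the curl command in the reproduction steps.
URL = "http://localhost:8000/v2/models/ensemble/generate"
PAYLOAD = {
    "text_input": "背一首诗",
    "max_tokens": 20,
    "bad_words": "",
    "stop_words": "</s>",
    "top_k": 100,
    "top_p": 1,
}


def generate(url=URL, payload=PAYLOAD, timeout=60):
    """POST one generate request and return its text_output field."""
    body = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["text_output"]


def all_identical(outputs):
    """Return True when every output string in the list is the same."""
    return len(set(outputs)) <= 1


def run_check(n=2, send=generate):
    """Send the same request n times; return (outputs, identical_flag)."""
    outputs = [send() for _ in range(n)]
    return outputs, all_identical(outputs)


# Usage (needs the Triton server from the steps above to be running):
#   outputs, identical = run_check()
#   print("identical:", identical)
```

With sampling active, `identical` would be expected to be False for at least some runs; in this report it is always True.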
Expected behavior
- response1:
{
"model_name": "ensemble",
"model_version": "1",
"sequence_end": false,
"sequence_id": 0,
"sequence_start": false,
"text_output": "背诵一首诗\n\n《登高》 杜甫\n\n风急天高猿啸哀,渚清沙白鸟飞回。无边落木萧萧下,不尽长江滚滚来。万里悲秋常作客,百年多病独登台。艰难苦恨繁霜鬓,潦倒新停浊酒杯。"
}
- response2:
{
"model_name": "ensemble",
"model_version": "1",
"sequence_end": false,
"sequence_id": 0,
"sequence_start": false,
"text_output": "背诵一首诗\n\n《赋得古原草送别》白居易离离原上草,一岁一枯荣。野火烧不尽,春风吹又生。远芳侵古道,晴翠接荒城。又送王孙去,萋萋满别情。。"
}
Actual behavior
The two responses are identical:
{
"model_name": "ensemble",
"model_version": "1",
"sequence_end": false,
"sequence_id": 0,
"sequence_start": false,
"text_output": "背诵一首诗\n\n《登高》 杜甫\n\n风急天高猿啸哀,渚清沙白鸟飞回。无边落木萧萧下,不尽长江滚滚来。万里悲秋常作客,百年多病独登台。艰难苦恨繁霜鬓,潦倒新停浊酒杯。"
}
Additional notes
Setting decoding_mode to top_p or top_k makes no difference either: the two responses are still identical.
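For context on why different outputs are expected with top_k=100 and top_p=1, here is a rough sketch of top-k followed by top-p sampling over a toy distribution. It is a generic illustration in plain Python, not TensorRT-LLM's implementation; the function name `top_k_top_p_sample` and the toy probabilities are made up for this example.

```python
import random


def top_k_top_p_sample(probs, top_k, top_p, rng):
    """Sample one token index from probs after top-k, then top-p filtering."""
    # Rank token indices by probability, most likely first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Top-k: keep only the k most likely tokens.
    ranked = ranked[:top_k]
    # Top-p: keep the shortest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Draw one survivor, weighted by its original probability.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# Toy next-token distribution over four tokens (illustrative values only).
toy_probs = [0.4, 0.3, 0.2, 0.1]

# top_k=1 always returns the most likely token (greedy behaviour).
greedy = {top_k_top_p_sample(toy_probs, 1, 1.0, random.Random(s)) for s in range(20)}

# top_k=100, top_p=1 keeps every token, so repeated draws can differ.
rng = random.Random(0)
sampled = {top_k_top_p_sample(toy_probs, 100, 1.0, rng) for _ in range(200)}
```

In this sketch `greedy` only ever contains the most likely token, whereas `sampled` contains several different tokens. The behaviour reported above looks like the first case even though the request asks for the second.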