
decoding_mode top_k_top_p does not take effect for llama2 (behavior differs from huggingface) #461

Open
@yjjiang11

Description


System Info

CPU Architecture: AMD EPYC 7V13 64-Core Processor
CPU/Host memory size: 440 GB
GPU properties: A800 80GB
GPU name: NVIDIA A800 80GB x2
GPU mem size: 80 GB x 2
Libraries
TensorRT-LLM branch or tag: main
TensorRT-LLM commit: ae52bce
Versions of TensorRT, CUDA: (10.0.1, 12.4)
container used: Built container from tensorrtllm_backend main branch using dockerfile/Dockerfile.trt_llm_backend
nvidia driver version: 535.161.07
OS: Ubuntu 22.04.4 LTS
docker image version: custom built from main branch

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Build the trt llm container by running
    DOCKER_BUILDKIT=1 TORCH_CUDA_ARCH_LIST= docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .

  2. Launch the container with this command
    sudo docker run -it --net host --shm-size=20g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/keith:/home triton_trt_llm:latest /bin/bash

  3. Build trt engines following the guide https://github.com/NVIDIA/TensorRT-LLM/tree/bf0a5afc92f4b2b3191e9e55073953c1f600cf2d/examples/llama

  4. Launch triton server

With decoding_mode set to top_k_top_p in the model configuration:

python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/path/to/llama2-70b/repo --log
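For reference, this is a sketch of how decoding_mode is set as a parameter in the tensorrt_llm model's config.pbtxt inside the model repository (assuming the standard tensorrtllm_backend config template; adjust paths and surrounding fields to your repo):

```
parameters: {
  key: "decoding_mode"
  value: {
    string_value: "top_k_top_p"
  }
}
```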

  5. Query the server twice

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "背一首诗", "max_tokens": 20, "bad_words": "", "stop_words": "</s>", "top_k": 100, "top_p": 1}'

Expected behavior

  • response1:

{
"model_name": "ensemble",
"model_version": "1",
"sequence_end": false,
"sequence_id": 0,
"sequence_start": false,
"text_output": "背诵一首诗\n\n《登高》 杜甫\n\n风急天高猿啸哀,渚清沙白鸟飞回。无边落木萧萧下,不尽长江滚滚来。万里悲秋常作客,百年多病独登台。艰难苦恨繁霜鬓,潦倒新停浊酒杯。"
}

  • response2:
    {
    "model_name": "ensemble",
    "model_version": "1",
    "sequence_end": false,
    "sequence_id": 0,
    "sequence_start": false,
    "text_output": "背诵一首诗\n\n《赋得古原草送别》白居易离离原上草,一岁一枯荣。野火烧不尽,春风吹又生。远芳侵古道,晴翠接荒城。又送王孙去,萋萋满别情。。"
    }

Actual behavior

The two responses are identical:
{
"model_name": "ensemble",
"model_version": "1",
"sequence_end": false,
"sequence_id": 0,
"sequence_start": false,
"text_output": "背诵一首诗\n\n《登高》 杜甫\n\n风急天高猿啸哀,渚清沙白鸟飞回。无边落木萧萧下,不尽长江滚滚来。万里悲秋常作客,百年多病独登台。艰难苦恨繁霜鬓,潦倒新停浊酒杯。"
}

Additional notes

Setting decoding_mode to top_p or to top_k instead also has no effect: repeated requests still return identical outputs.
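To make the expected behavior concrete, below is a minimal sketch of top-k followed by top-p (nucleus) sampling, which is what decoding_mode: top_k_top_p should perform. This is an illustrative reference, not TensorRT-LLM's implementation; the function name and signature are invented for this example. With top_k=100 and top_p=1 over a non-degenerate distribution, repeated calls should generally yield different tokens, which is why identical responses suggest sampling is not taking effect.

```python
import math
import random

def top_k_top_p_sample(logits, top_k=100, top_p=1.0, rng=None):
    """Sample one token id after top-k, then top-p (nucleus) filtering."""
    rng = rng or random.Random()
    # Top-k: keep only the k highest logits (ties at the k-th value survive).
    if 0 < top_k < len(logits):
        kth = sorted(logits, reverse=True)[top_k - 1]
        filtered = [(i, l) for i, l in enumerate(logits) if l >= kth]
    else:
        filtered = list(enumerate(logits))
    # Softmax over the surviving logits.
    m = max(l for _, l in filtered)
    weights = [(i, math.exp(l - m)) for i, l in filtered]
    z = sum(w for _, w in weights)
    probs = sorted(((i, w / z) for i, w in weights), key=lambda t: -t[1])
    # Top-p: keep the smallest prefix whose cumulative probability
    # reaches top_p, then sample from the renormalized prefix.
    keep, cum = [], 0.0
    for i, p in probs:
        keep.append((i, p))
        cum += p
        if cum >= top_p:
            break
    ids, ps = zip(*keep)
    total = sum(ps)
    return rng.choices(ids, weights=[p / total for p in ps], k=1)[0]
```

With top_k=1 this degenerates to greedy decoding (always the argmax token); with top_k > 1 and top_p close to 1, two identical requests should only rarely produce identical long generations.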

Labels: triaged (Issue has been triaged by maintainers)