
decoding_mode top_k_top_p does not take effect for llama2 (behavior differs from huggingface) #461

Open
@yjjiang11

Description


System Info

CPU Architecture: AMD EPYC 7V13 64-Core Processor
CPU/Host memory size: 440 GB
GPU properties: A800 80GB
GPU name: NVIDIA A800 80GB x2
GPU mem size: 80 GB x 2
Libraries
TensorRT-LLM branch or tag: main
TensorRT-LLM commit: ae52bce
Versions of TensorRT, CUDA: (10.0.1, 12.4)
container used: Built container from tensorrtllm_backend main branch using dockerfile/Dockerfile.trt_llm_backend
nvidia driver version: 535.161.07
OS: Ubuntu 22.04.4 LTS
docker image version: custom built from main branch

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Build the trt llm container by running
    DOCKER_BUILDKIT=1 TORCH_CUDA_ARCH_LIST= docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .

  2. Launch the container with this command
    sudo docker run -it --net host --shm-size=20g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/keith:/home triton_trt_llm:latest /bin/bash

  3. Build trt engines following the guide https://github.com/NVIDIA/TensorRT-LLM/tree/bf0a5afc92f4b2b3191e9e55073953c1f600cf2d/examples/llama

  4. Launch triton server

With decoding_mode set to top_k_top_p in the model configuration:

python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/path/to/llama2-70b/repo --log
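For reference, this is a sketch of how decoding_mode is set as a parameter in the tensorrt_llm model's config.pbtxt inside the model repository (assuming the standard tensorrtllm_backend config template; adjust paths and surrounding fields to your repo):

```
parameters: {
  key: "decoding_mode"
  value: {
    string_value: "top_k_top_p"
  }
}
```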

  5. Query the server twice

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "背一首诗", "max_tokens": 20, "bad_words": "", "stop_words": "</s>", "top_k": 100, "top_p": 1}'

Expected behavior

  • response1:

{
"model_name": "ensemble",
"model_version": "1",
"sequence_end": false,
"sequence_id": 0,
"sequence_start": false,
"text_output": "背诵一首诗\n\n《登高》 杜甫\n\n风急天高猿啸哀,渚清沙白鸟飞回。无边落木萧萧下,不尽长江滚滚来。万里悲秋常作客,百年多病独登台。艰难苦恨繁霜鬓,潦倒新停浊酒杯。"
}

  • response2:
    {
    "model_name": "ensemble",
    "model_version": "1",
    "sequence_end": false,
    "sequence_id": 0,
    "sequence_start": false,
    "text_output": "背诵一首诗\n\n《赋得古原草送别》白居易离离原上草,一岁一枯荣。野火烧不尽,春风吹又生。远芳侵古道,晴翠接荒城。又送王孙去,萋萋满别情。。"
    }

Actual behavior

The two responses are identical:
{
"model_name": "ensemble",
"model_version": "1",
"sequence_end": false,
"sequence_id": 0,
"sequence_start": false,
"text_output": "背诵一首诗\n\n《登高》 杜甫\n\n风急天高猿啸哀,渚清沙白鸟飞回。无边落木萧萧下,不尽长江滚滚来。万里悲秋常作客,百年多病独登台。艰难苦恨繁霜鬓,潦倒新停浊酒杯。"
}

Additional notes

Setting decoding_mode to top_p or to top_k instead also has no effect: repeated requests still return identical outputs.
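To make the expected behavior concrete, below is a minimal sketch of top-k followed by top-p (nucleus) sampling, which is what decoding_mode: top_k_top_p should perform. This is an illustrative reference, not TensorRT-LLM's implementation; the function name and signature are invented for this example. With top_k=100 and top_p=1 over a non-degenerate distribution, repeated calls should generally yield different tokens, which is why identical responses suggest sampling is not taking effect.

```python
import math
import random

def top_k_top_p_sample(logits, top_k=100, top_p=1.0, rng=None):
    """Sample one token id after top-k, then top-p (nucleus) filtering."""
    rng = rng or random.Random()
    # Top-k: keep only the k highest logits (ties at the k-th value survive).
    if 0 < top_k < len(logits):
        kth = sorted(logits, reverse=True)[top_k - 1]
        filtered = [(i, l) for i, l in enumerate(logits) if l >= kth]
    else:
        filtered = list(enumerate(logits))
    # Softmax over the surviving logits.
    m = max(l for _, l in filtered)
    weights = [(i, math.exp(l - m)) for i, l in filtered]
    z = sum(w for _, w in weights)
    probs = sorted(((i, w / z) for i, w in weights), key=lambda t: -t[1])
    # Top-p: keep the smallest prefix whose cumulative probability
    # reaches top_p, then sample from the renormalized prefix.
    keep, cum = [], 0.0
    for i, p in probs:
        keep.append((i, p))
        cum += p
        if cum >= top_p:
            break
    ids, ps = zip(*keep)
    total = sum(ps)
    return rng.choices(ids, weights=[p / total for p in ps], k=1)[0]
```

With top_k=1 this degenerates to greedy decoding (always the argmax token); with top_k > 1 and top_p close to 1, two identical requests should only rarely produce identical long generations.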

Labels: triaged (Issue has been triaged by maintainers)