
How is the performance of the model with pytorch as the backend #4745


Open · oppolll opened this issue May 29, 2025 · 2 comments
Labels: Investigating, Performance, triaged

Comments


oppolll commented May 29, 2025

Which backend gives better performance, PyTorch or TensorRT-LLM? In actual tests with Qwen3, I found that performance with the PyTorch backend was poor, and that a single GPU and multiple GPUs performed the same. Is this normal? Did I miss some detail of the inference configuration?

QiJune (Collaborator) commented May 29, 2025

@oppolll Could you please share your scripts? cc @byshiue

oppolll (Author) commented May 29, 2025

@oppolll Could you please share your scripts? cc @byshiue

PyTorch backend inference code reference: https://github.com/NVIDIA/TensorRT-LLM/blob/v0.20.0rc3/examples/pytorch/quickstart.py
LoRA inference code reference: https://github.com/NVIDIA/TensorRT-LLM/blob/v0.20.0rc3/tests/unittest/llmapi/test_llm_pytorch.py
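
My runs roughly follow that quickstart. A minimal sketch of the usage (the model name, sampling values, and tensor_parallel_size here are illustrative rather than my exact script, and the import path of the PyTorch-backend LLM class can differ between releases):

```python
# Minimal sketch following examples/pytorch/quickstart.py; not the exact script.
# Note: depending on the release, the PyTorch-backend LLM class may live at
# tensorrt_llm._torch.LLM instead of tensorrt_llm.LLM, so check the quickstart
# that matches the installed version.
from tensorrt_llm import LLM, SamplingParams


def main():
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    # Illustrative sampling values, not a tuned configuration.
    sampling_params = SamplingParams(max_tokens=256, temperature=0.8, top_p=0.95)

    # tensor_parallel_size=2 is what should split the model across two GPUs;
    # with the default of 1 only a single GPU is used.
    llm = LLM(model="Qwen/Qwen3-32B", tensor_parallel_size=2)

    for output in llm.generate(prompts, sampling_params):
        print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")


if __name__ == "__main__":
    main()
```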

With an input length of 8192 and bf16 precision, I ran the following groups of experiments on H20 GPUs:

  1. With the Qwen2.5-32B model, comparing the TensorRT-LLM and PyTorch backends under the same experimental conditions, inference performance with the TensorRT-LLM backend is better.

  2. With the Qwen3-32B model and the PyTorch backend, performance is not much different from vLLM, whereas TensorRT-LLM inference is usually faster than vLLM.

  3. With the Qwen3-32B model and the PyTorch backend, inference performance was surprisingly the same on one GPU and on two GPUs.

  4. With the Qwen3-32B-FP8 model from https://huggingface.co/Qwen/Qwen3-32B-FP8 on a single GPU, inference performance was about the same as bf16.

  5. With the Qwen3-32B model and the PyTorch backend, loading a LoRA adapter for inference did not work. Is it because Qwen3 does not support LoRA inference? (A sketch of what I tried is just below this list.)
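
For item 5, what I tried follows the LoRA pattern in the referenced test_llm_pytorch.py, roughly as below. The adapter path is a placeholder, and the LoraConfig / LoRARequest import locations have moved around between releases, so treat this as a sketch rather than the exact code:

```python
# Sketch of the LoRA usage pattern from tests/unittest/llmapi/test_llm_pytorch.py.
# The adapter directory is a placeholder, and the import locations below are an
# assumption based on the 0.20-era layout (LoraConfig from tensorrt_llm.lora_manager,
# LoRARequest from tensorrt_llm.executor.request); they may differ in other versions.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.executor.request import LoRARequest
from tensorrt_llm.lora_manager import LoraConfig

lora_dir = "/path/to/qwen3-lora-adapter"  # placeholder path

# Register the adapter when constructing the LLM ...
llm = LLM(
    model="Qwen/Qwen3-32B",
    lora_config=LoraConfig(lora_dir=[lora_dir], max_lora_rank=64),
)

# ... and attach it to individual requests at generate time.
outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=64),
    lora_request=LoRARequest("qwen3-adapter", 1, lora_dir),
)
print(outputs[0].outputs[0].text)
```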

Are the above results normal? Could I have overlooked some parameter configuration? Is there reference performance data for both backends that I can compare against?
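
To be clear about what I mean by "performance" above: a crude end-to-end tokens-per-second measurement over a batched generate() call, roughly as below, rather than the official trtllm-bench flow. The batch size and sampling values are illustrative, and the token_ids field on the output is an assumption (re-tokenizing the generated text would work as well):

```python
# Crude tokens-per-second check used to compare backend/GPU configurations;
# not an official benchmark (trtllm-bench would be the proper tool for that).
import time

from tensorrt_llm import LLM, SamplingParams

prompts = ["Summarize the following text: ..."] * 32  # illustrative batch
sampling_params = SamplingParams(max_tokens=256)

llm = LLM(model="Qwen/Qwen3-32B", tensor_parallel_size=2)

start = time.time()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.time() - start

# token_ids on the completion output is assumed here; if it is not available,
# re-tokenizing output.outputs[0].text gives a close enough count.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} generated tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```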

@hchings added the Performance label May 30, 2025
@github-actions bot added the triaged and Investigating labels May 30, 2025