Description
System Info
CPU architecture: x86_64
GPU type: NVIDIA A100-SXM4-40GB
CUDA Version: 12.7
Driver Version: 565.57.01
Who can help?
Hey, @byshiue!
I saw you responding to other encoder-model-related issues, so I hope you might be the right person for this question.
Thank you!
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Hello!
I am trying to use `TensorRT-LLM/examples/bert` to convert a RoBERTa model (`FacebookAI/roberta-base`) to a TRT-LLM engine. I am on the `v0.16.0` tag and am following the instructions from `TensorRT-LLM/examples/bert/README.md`.
- First, as a sanity check, I verify that converting a BERT model (`google-bert/bert-base-uncased`) passes the `run.py` Hugging Face comparison test (with intermediate checks), using the following commands:

  ```shell
  CUDA_VISIBLE_DEVICES=0 python /code/tensorrt_llm/examples/bert/convert_checkpoint.py --model=BertModel --model_dir=google-bert/bert-base-uncased --output_dir=trt_checkpoints/bert-base
  CUDA_VISIBLE_DEVICES=0 trtllm-build --checkpoint_dir trt_checkpoints/bert-base/ --output_dir engines/bert-base --remove_input_padding=disable --max_batch_size=128 --max_seq_len=512 --bert_attention_plugin=disable --context_fmha=disable --enable_debug_output
  CUDA_VISIBLE_DEVICES=0 python /code/tensorrt_llm/examples/bert/run.py --engine_dir engines/bert-base/ --hf_model_dir=google-bert/bert-base-uncased --run_hf_test --debug
  ```

  This results in both the final hidden outputs and the intermediate layer outputs passing the `torch.allclose` checks against the Hugging Face model outputs (as implemented in `run.py`).
- Next, I try to perform the same comparison with a `RobertaModel`:

  ```shell
  CUDA_VISIBLE_DEVICES=0 python /code/tensorrt_llm/examples/bert/convert_checkpoint.py --model=RobertaModel --model_dir=FacebookAI/roberta-base --output_dir=trt_checkpoints/roberta-base
  CUDA_VISIBLE_DEVICES=0 trtllm-build --checkpoint_dir trt_checkpoints/roberta-base/ --output_dir engines/roberta-base --remove_input_padding=disable --max_batch_size=128 --max_seq_len=512 --bert_attention_plugin=disable --context_fmha=disable --enable_debug_output
  CUDA_VISIBLE_DEVICES=0 python /code/tensorrt_llm/examples/bert/run.py --engine_dir engines/roberta-base/ --hf_model_dir=FacebookAI/roberta-base --run_hf_test --debug
  ```
Even though the final check passes (`RobertaModel result is all close to HF reference!`), I observe that the intermediate layer outputs do not match, starting from the 4th layer with the default tolerance of 1e-2 (or from the 0th encoder layer with a tighter tolerance of 1e-3). Here is the output of the default script:
```
Embedding are all close
BertEncoderLayer_0_output is close: True
BertEncoderLayer_1_output is close: True
BertEncoderLayer_2_output is close: True
BertEncoderLayer_3_output is close: True
BertEncoderLayer_4_output is close: False
BertEncoderLayer_5_output is close: True
BertEncoderLayer_6_output is close: False
BertEncoderLayer_7_output is close: False
BertEncoderLayer_8_output is close: False
BertEncoderLayer_9_output is close: False
BertEncoderLayer_10_output is close: False
```
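For context on why the same outputs pass at 1e-2 but fail at 1e-3: `torch.allclose` checks `|a - b| <= atol + rtol * |b|` elementwise. A minimal pure-NumPy sketch of that criterion (with synthetic values, not the actual layer outputs) shows how the tolerance alone flips pass/fail:

```python
import numpy as np

def is_close(a: np.ndarray, b: np.ndarray, rtol: float, atol: float = 1e-8) -> bool:
    """Elementwise |a - b| <= atol + rtol * |b|, reduced with all() --
    the same criterion torch.allclose / np.allclose use."""
    return bool(np.all(np.abs(a - b) <= atol + rtol * np.abs(b)))

# Synthetic "reference" vs "engine" activations with a 5e-3 absolute error.
ref = np.ones(4)
out = ref + 5e-3

print(is_close(out, ref, rtol=1e-2))  # passes at the 1e-2 tolerance
print(is_close(out, ref, rtol=1e-3))  # fails once the tolerance is tightened
```

So a layer flipping from True to False under a tighter tolerance only tells us the error magnitude sits between the two thresholds, not where it comes from.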
- When I try to use a fine-tuned `roberta-base` checkpoint, both the intermediate and the final checks fail.
Expected behavior
Getting the result of

```shell
CUDA_VISIBLE_DEVICES=0 python /code/tensorrt_llm/examples/bert/run.py --engine_dir engines/roberta-base/ --hf_model_dir=FacebookAI/roberta-base --run_hf_test --debug
```

as:
```
Embedding are all close
BertEncoderLayer_0_output is close: True
BertEncoderLayer_1_output is close: True
BertEncoderLayer_2_output is close: True
BertEncoderLayer_3_output is close: True
BertEncoderLayer_4_output is close: True
BertEncoderLayer_5_output is close: True
BertEncoderLayer_6_output is close: True
BertEncoderLayer_7_output is close: True
BertEncoderLayer_8_output is close: True
BertEncoderLayer_9_output is close: True
BertEncoderLayer_10_output is close: True
```

as well as the final check:

```
RobertaModel result is all close to HF reference!
```
Actual behavior
```
Embedding are all close
BertEncoderLayer_0_output is close: True
BertEncoderLayer_1_output is close: True
BertEncoderLayer_2_output is close: True
BertEncoderLayer_3_output is close: True
BertEncoderLayer_4_output is close: False
BertEncoderLayer_5_output is close: True
BertEncoderLayer_6_output is close: False
BertEncoderLayer_7_output is close: False
BertEncoderLayer_8_output is close: False
BertEncoderLayer_9_output is close: False
BertEncoderLayer_10_output is close: False
```
Additional notes
Since the embedding outputs match in all cases, I would expect the difference to come from the encoder layers. However, based on the Hugging Face implementations, the encoder layers should be exactly the same for BERT and RoBERTa (`modeling_bert.py`, `modeling_roberta.py`).

I would appreciate any help or pointers on how to debug this and make `RobertaModel` work with TensorRT-LLM.
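One way I could imagine localizing the divergence, assuming the per-layer outputs from both sides can be dumped to arrays (the helper name and the synthetic data below are hypothetical, not part of `run.py`): compute the maximum absolute error per layer and look at its shape. Gradual growth across layers suggests compounding numeric differences, while a sharp jump at one layer points at that layer's weights or attention path. A sketch:

```python
import numpy as np

def max_abs_err_per_layer(trt_outputs, hf_outputs):
    """Max absolute elementwise error for each encoder layer's output.

    trt_outputs / hf_outputs: lists of np.ndarray, one per layer --
    e.g. dumped from the engine's debug outputs and from the HF model
    run with output_hidden_states=True.
    """
    return [float(np.max(np.abs(t - h))) for t, h in zip(trt_outputs, hf_outputs)]

# Synthetic example: small uniform drift everywhere, plus an injected
# sharp jump at layer 4, mimicking a weight/layout mismatch there.
rng = np.random.default_rng(0)
hf = [rng.standard_normal((2, 8)) for _ in range(6)]
trt = [h + 1e-4 for h in hf]
trt[4] = hf[4] + 5e-2

errs = max_abs_err_per_layer(trt, hf)
worst = int(np.argmax(errs))
print(worst)  # 4
```

Plotting `errs` for the real RoBERTa run (vs. the BERT run that passes) should make it clear whether layer 4 is genuinely special or just the first layer to cross the tolerance threshold.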