Maybe @kaiyux
@lfr-0531 Can you help take a look?
System Info
NVIDIA B200
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
root@32ab8238e348:/app/tensorrt_llm# cat extra-llm-api-config.yml
enable_attention_dp: false
max_batch_size: 196
max_num_tokens: 50000
max_seq_len: 40000
tp_size: 8
ep_size: 8
speculative_config:
  decoding_type: "MTP"
  num_nextn_predict_layers: 1
kv_cache_config:
  free_gpu_memory_fraction: 0.6
pytorch_backend_config:
  print_iter_log: true
  disable_overlap_scheduler: false
  use_cuda_graph: true
  cuda_graph_batch_sizes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256]
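The cuda_graph_batch_sizes value above is simply every integer from 1 through 256. When writing the YAML, the line can be generated programmatically instead of typed out by hand (a minimal sketch using only the Python standard library; the file name and key come from the config above):

```python
# Build the cuda_graph_batch_sizes entry (1..256 inclusive) so the
# 256 values don't have to be written out manually.
batch_sizes = list(range(1, 257))

# Render it as the inline-list line used in extra-llm-api-config.yml.
yaml_line = "  cuda_graph_batch_sizes: [" + ", ".join(map(str, batch_sizes)) + "]"
print(yaml_line[:60] + " ...")
```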
2. trtllm-serve /data/weights/nvidia--DeepSeek-R1-FP4 --host 0.0.0.0 --backend pytorch --max_batch_size 64 --max_num_tokens 50000 --max_seq_len 40960 --tp_size 8 --pp_size 1 --ep_size 8 --extra_llm_api_options extra-llm-api-config.yml
Expected behavior
The server starts successfully.
Actual behavior
[05/27/2025-21:12:35] [TRT-LLM] [I] [Autotuner]: Autotuning process starts ...
[05/27/2025-21:12:35] [TRT-LLM] [I] Run autotuning warmup for batch size=1
[05/27/2025-21:12:36] [TRT-LLM] [I] Autotuner Cache size after warmup 112
[05/27/2025-21:12:36] [TRT-LLM] [I] [Autotuner]: Autotuning process ends
[05/27/2025-21:12:36] [TRT-LLM] [I] Creating CUDA graph instances for 196 batch sizes.
[05/27/2025-21:12:36] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=196
[05/27/2025-21:12:36] [TRT-LLM] [E] Failed to initialize executor on rank 3: [TensorRT-LLM][ERROR] CUDA runtime error in cudaMemsetAsync(fc1_act_sf_flat, 0x0, getOffsetFlatSFArray(num_experts_per_node, num_rows, cols), stream): an illegal memory access was encountered (/builds/ftp/internalcutlasskernels/cpp/tensorrt_llm/kernels/internal_cutlass_kernels/src/moe_gemm/moe_kernels.cu:1282)
1 0x7a2d00e89f0a /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x1b60f0a) [0x7a2d00e89f0a]
2 0x7a2d025b85d5 void tensorrt_llm::kernels::expandInputRowsKernelLauncher<__nv_fp4_e2m1, __nv_fp4_e2m1>(__nv_fp4_e2m1 const*, __nv_fp4_e2m1*, float const*, float*, int const*, int*, long, long const*, long, int, int, float const*, long*, unsigned char*, unsigned char const*, CUstream_st*) + 405
3 0x7a2d02613afd tensorrt_llm::kernels::CutlassMoeFCRunner<__nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, __nv_fp4_e2m1, __nv_bfloat16, void>::runMoe(void const*, void const*, int const*, float const*, void const*, void const*, tensorrt_llm::ActivationType, void const*, void const*, tensorrt_llm::kernels::QuantParams, long, long, long, int, int, char*, void*, int*, tensorrt_llm::kernels::MOEParallelismConfig, bool, tensorrt_llm::kernels::LoraParams&, bool, bool, tensorrt_llm::kernels::MoeMinLatencyParams&, CUstream_st*) + 3229
4 0x7a2b8fa149c3 torch_ext::FusedMoeRunner::runMoe(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor, at::Tensor const&, at::Tensor const&, std::optional<c10::ArrayRefat::Tensor >, std::optionalat::Tensor, long, long, long, long, long, long, bool, std::optional<c10::ArrayRef >) + 2611
5 0x7a2b8fa1901e /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libth_common.so(+0x1e101e) [0x7a2b8fa1901e]
6 0x7a2b8fa19722 std::Function_handler<void (std::vector<c10::IValue, std::allocatorc10::IValue >&), torch::class<torch_ext::FusedMoeRunner>::defineMethod<torch::detail::WrapMethod<at::Tensor (torch_ext::FusedMoeRunner::)(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor, at::Tensor const&, at::Tensor const&, std::optional<c10::ArrayRefat::Tensor >, std::optionalat::Tensor, long, long, long, long, long, long, bool, std::optional<c10::ArrayRef >)> >(std::__cxx11::basic_string<char, std::char_traits, std::allocator >, torch::detail::WrapMethod<at::Tensor (torch_ext::FusedMoeRunner::)(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor, at::Tensor const&, at::Tensor const&, std::optional<c10::ArrayRefat::Tensor >, std::optionalat::Tensor, long, long, long, long, long, long, bool, std::optional<c10::ArrayRef >)>, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::initializer_listtorch::arg)::{lambda(std::vector<c10::IValue, std::allocatorc10::IValue >&)#1}>::_M_invoke(std::_Any_data const&, std::vector<c10::IValue, std::allocatorc10::IValue >&) + 50
7 0x7a2e28f50804 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x945804) [0x7a2e28f50804]
8 0x7a2e28f51085 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x946085) [0x7a2e28f51085]
9 0x7a2e28fe4e26 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x9d9e26) [0x7a2e28fe4e26]
10 0x7a2e28fe504d /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x9da04d) [0x7a2e28fe504d]
11 0x7a2e2898919d /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37e19d) [0x7a2e2898919d]
12 0x58208f /usr/bin/python() [0x58208f]
13 0x549185 _PyObject_MakeTpCall + 117
14 0x54ce49 /usr/bin/python() [0x54ce49]
15 0x5a374a /usr/bin/python() [0x5a374a]
16 0x549185 _PyObject_MakeTpCall + 117
17 0x5d73c9 _PyEval_EvalFrameDefault + 2697
18 0x7a2e28e7b0de /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x8700de) [0x7a2e28e7b0de]
19 0x7a2e2917fe1f /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so([05/27/2025-21:12:36] [TRT-LLM] [E] Failed to initialize executor on rank 1: [TensorRT-LLM][ERROR] CUDA runtime error in cudaMemsetAsync(fc1_act_sf_flat, 0x0, getOffsetFlatSFArray(num_experts_per_node, num_rows, cols), stream): an illegal memory access was encountered (/builds/ftp/internalcutlasskernels/cpp/tensorrt_llm/kernels/internal_cutlass_kernels/src/moe_gemm/moe_kernels.cu:1282)
1 0x736fb6f89f0a /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x1b60f0a) [0x736fb6f89f0a]
2 0x736fb86b85d5 void tensorrt_llm::kernels::expandInputRowsKernelLauncher<__nv_fp4_e2m1, __nv_fp4_e2m1>(__nv_fp4_e2m1 const*, __nv_fp4_e2m1*, float const*, float*, int const*, int*, long, long const*, long, int, int, float const*, long*, unsigned char*, unsigned char const*, CUstream_st*) + 405
3 0x736fb8713afd tensorrt_llm::kernels::CutlassMoeFCRunner<__nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, __nv_fp4_e2m1, __nv_bfloat16, void>::runMoe(void const*, void const*, int const*, float const*, void const*, void const*, tensorrt_llm::ActivationType, void const*, void const*, tensorrt_llm::kernels::QuantParams, long, long, long, int, int, char*, void*, int*, tensorrt_llm::kernels::MOEParallelismConfig, bool, tensorrt_llm::kernels::LoraParams&, bool, bool, tensorrt_llm::kernels::MoeMinLatencyParams&, CUstream_st*) + 3229
4 0x736e45bb89c3 torch_ext::FusedMoeRunner::runMoe(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor, at::Tensor const&, at::Tensor const&, std::optional<c10::ArrayRefat::Tensor >, std::optionalat::Tensor, long, long, long, long, long, long, bool, std::optional<c10::ArrayRef >) + 2611
5 0x736e45bbd01e /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libth_common.so(+0x1e101e) [0x736e45bbd01e]
6 0x736e45bbd722 std::Function_handler<void (std::vector<c10::IValue, std::allocatorc10::IValue >&), torch::class<torch_ext::FusedMoeRunner>::defineMethod<torch::detail::WrapMethod<at::Tensor (torch_ext::FusedMoeRunner::)(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor, at::Tensor const&, at::Tensor const&, std::optional<c10::ArrayRefat::Tensor >, std::optionalat::Tensor, long, long, long, long, long, long, bool, std::optional<c10::ArrayRef >)> >(std::__cxx11::basic_string<char, std::char_traits, std::allocator >, torch::detail::WrapMethod<at::Tensor (torch_ext::FusedMoeRunner::)(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor, at::Tensor const&, at::Tensor const&, std::optional<c10::ArrayRefat::Tensor >, std::optionalat::Tensor, long, long, long, long, long, long, bool, std::optional<c10::ArrayRef >)>, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::initializer_listtorch::arg)::{lambda(std::vector<c10::IValue, std::allocatorc10::IValue >&)#1}>::_M_invoke(std::_Any_data const&, std::vector<c10::IValue, std::allocatorc10::IValue >&) + 50
7 0x7370def50804 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x945804) [0x7370def50804]
8 0x7370def51085 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x946085) [0x7370def51085]
9 0x7370defe4e26 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x9d9e26) [0x7370defe4e26]
10 0x7370defe504d /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x9da04d) [0x7370defe504d]
11 0x7370de98919d /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37e19d) [0x7370de98919d]
12 0x58208f /usr/bin/python() [0x58208f]
13 0x549185 _PyObject_MakeTpCall + 117
14 0x54ce49 /usr/bin/python() [0x54ce49]
15 0x5a374a /usr/bin/python() [0x5a374a]
16 0x549185 _PyObject_MakeTpCall + 117
17 0x5d73c9 _PyEval_EvalFrameDefault + 2697
18 0x7370dee7b0de /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x8700de) [0x7370dee7b0de]
19 0x7370df17fe1f /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xb74e1f) [0x7a2e2917fe1f]
20 0x7a2e20ce23e1 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(+0x57be3e1) [0x7a2e20ce23e1]
21 0x7a2e28f30c7f torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptrtorch::jit::Operator, std::allocator<std::shared_ptrtorch::jit::Operator > > const&, pybind11::args const&, pybind11::kwargs const&, std::optionalc10::DispatchKey) + 239
22 0x7a2e28f30f29 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptrtorch::jit::Operator, std::allocator<std::shared_ptrtorch::jit::Operator > > const&, c10::Symbol, pybind11::args const&, pybind11::kwargs const&, bool, std::optionalc10::DispatchKey) + 553
23 0x7a2e28e422ee /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x8372ee) [0x7a2e28e422ee]
24 0x7a2e2898919d /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37e19d) [0x7a2e2898919d]
25 0x58208f /usr/bin/python() [0x58208f]
26 0x54b30c PyObject_Call + 108
27 0x5db55b _PyEval_EvalFrameDefault + 19483
28 0x54aa9a _PyObject_Call_Prepend + 394
29 0x5a3628 /usr/bin/python() [0x5a3628]
30 0x54924e _PyObject_MakeTpCall + 318
31 0x5d73c9 _PyEval_EvalFrameDefault + 2697
32 0x54cd94 /usr/bin/python() [0x54cd94]
33 0x54b3b5 PyObject_Call + 277
34 0x5db55b _PyEval_EvalFrameDefault + 19483
35 0x54cd94 /usr/bin/python() [0x54cd94]
36 0x54b3b5 PyObject_Call + 277
37 0x5db55b _PyEval_EvalFrameDefault + 19483
38 0x54aa9a _PyObject_Call_Prepend + 394
39 0x5a3628 /usr/bin/python() [0x5a3628]
40 0x54924e _PyObject_MakeTpCall + 318
41 0x5d73c9 _PyEval_EvalFrameDefault + 2697
42 0x54cd94 /usr/bin/python() [0x54cd94]
43 0x54b3b5 PyObject_Call + 277
44 0x5db55b _PyEval_EvalFrameDefault + 19483
45 0x54cd94 /usr/bin/python() [0x54cd94]
46 0x54b3b5 PyObject_Call + 277
47 0x5db55b _PyEval_EvalFrameDefault + 19483
48 0x54aa9a _PyObject_Call_Prepend + 394
49 0x5a3628 /usr/bin/python() [0x5a3628]
50 0x54924e _PyObject_MakeTpCall + 318
51 0x5d73c9 _PyEval_EvalFrameDefault + 2697
52 0x54cd94 /usr/bin/python() [0x54cd94]
53 0x54b3b5 PyObject_Call + 277
54 0x5db55b _PyEval_EvalFrameDefault + 19483
55 0x54cd94 /usr/bin/python() [0x54cd94]
56 0x54b3b5 PyObject_Call + 277
57 0x5db55b _PyEval_EvalFrameDefault + 19483
58 0x54aa9a _PyObject_Call_Prepend + 394
59 0x5a3628 /usr/bin/python() [0x5a3628]
60 0x54924e _PyObject_MakeTpCall + 318
61 0x5d73c9 _PyEval_EvalFrameDefault + 2697
62 0x54cd94 /usr/bin/python() [0x54cd94]
63 0x54b3b5 PyObject_Call + 277
64 0x5db55b _PyEval_EvalFrameDefault + 19483
65 0x54cd94 /usr/bin/python() [0x54cd94]
66 0x54b3b5 PyObject_Call + 277
67 0x5db55b _PyEval_EvalFrameDefault + 19483
68 0x54aa9a _PyObject_Call_Prepend + 394
69 0x5a3628 /usr/bin/python() [0x5a3628]
70 0x54924e _PyObject_MakeTpCall + 318
71 0x5d73c9 _PyEval_EvalFrameDefault + 2697
72 0x54cd94 /usr/bin/python() [0x54cd94]
73 0x54b3b5 PyObject_Call + 277
74 0x5db55b _PyEval_EvalFrameDefault + 19483
75 0x54cd94 /usr/bin/python() [0x54cd94]
76 0x54b3b5 PyObject_Call + 277
77 0x5db55b _PyEval_EvalFrameDefault + 19483
78 0x54aa9a _PyObject_Call_Prepend + 394
79 0x59e09f /usr/bin/python() [0x59e09f]
80 0x599b63 /usr/bin/python() [0x599b63]
81 0x54924e _PyObject_MakeTpCall + 318
82 0x5d73c9 _PyEval_EvalFrameDefault + 2697
83 0x54aa9a _PyObject_Call_Prepend + 394
84 0x59e09f /usr/bin/python() [0x59e09f]
85 0x599b63 /usr/bin/python() [0x599b63]
86 0x54924e _PyOb+0xb74e1f) [0x7370df17fe1f]
20 0x7370d6ce23e1 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(+0x57be3e1) [0x7370d6ce23e1]
21 0x7370def30c7f torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptrtorch::jit::Operator, std::allocator<std::shared_ptrtorch::jit::Operator > > const&, pybind11::args const&, pybind11::kwargs const&, std::optionalc10::DispatchKey) + 239
22 0x7370def30f29 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptrtorch::jit::Operator, std::allocator<std::shared_ptrtorch::jit::Operator > > const&, c10::Symbol, pybind11::args const&, pybind11::kwargs const&, bool, std::optionalc10::DispatchKey) + 553
23 0x7370dee422ee /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x8372ee) [0x7370dee422ee]
24 0x7370de98919d /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37e19d) [0x7370de98919d]
25 0x58208f /usr/bin/python() [0x58208f]
26 0x54b30c PyObject_Call + 108
Additional notes
This only happens when MTP is enabled and disable_overlap_scheduler is set to false.
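One plausible reading of the crash (a hypothesis, not a confirmed root cause): with MTP enabled, each generation request carries one verified token plus num_nextn_predict_layers draft tokens, so the CUDA graph warmup at batch size 196 would push roughly twice as many rows through the MoE kernels as a non-MTP warmup, which could overrun a scale-factor buffer sized without the draft tokens. A back-of-the-envelope check of that token accounting (the per-request formula is an assumption about the scheduler, not taken from the TensorRT-LLM source):

```python
# Hypothetical token accounting for the failing warmup step.
max_batch_size = 196          # from extra-llm-api-config.yml
num_nextn_predict_layers = 1  # MTP draft depth from the config

tokens_without_mtp = max_batch_size                                # rows the MoE sees per step without MTP
tokens_with_mtp = max_batch_size * (1 + num_nextn_predict_layers)  # rows with 1 draft token per request

print(tokens_without_mtp, tokens_with_mtp)
```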