
DeepSeek-R1-FP4 crashes when MTP is enabled #4708

Open
1 of 4 tasks
Shang-Pin opened this issue May 27, 2025 · 2 comments
Assignees
lfr-0531
Labels
bug Something isn't working

Comments

Shang-Pin commented May 27, 2025

System Info

NVIDIA B200

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

1.
root@32ab8238e348:/app/tensorrt_llm# cat extra-llm-api-config.yml
enable_attention_dp: false
max_batch_size: 196
max_num_tokens: 50000
max_seq_len: 40000
tp_size: 8
ep_size: 8
speculative_config:
  decoding_type: "MTP"
  num_nextn_predict_layers: 1
kv_cache_config:
  free_gpu_memory_fraction: 0.6
pytorch_backend_config:
  print_iter_log: true
  disable_overlap_scheduler: false
  use_cuda_graph: true
  cuda_graph_batch_sizes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256]
2.
trtllm-serve /data/weights/nvidia--DeepSeek-R1-FP4 --host 0.0.0.0 --backend pytorch --max_batch_size 64 --max_num_tokens 50000 --max_seq_len 40960 --tp_size 8 --pp_size 1 --ep_size 8 --extra_llm_api_options extra-llm-api-config.yml

Expected behavior

The server starts and serves requests.
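
For reference, a quick way to confirm the server actually came up once it reports ready is an OpenAI-compatible request. This is only a smoke-test sketch: the port assumes the trtllm-serve default (8000), and the model name is a placeholder that may need to match the served checkpoint.

# Hypothetical smoke test; host, port, and model name are assumptions, not taken from the report.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia--DeepSeek-R1-FP4",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 8
      }'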

Actual behavior

[05/27/2025-21:12:35] [TRT-LLM] [I] [Autotuner]: Autotuning process starts ...
[05/27/2025-21:12:35] [TRT-LLM] [I] Run autotuning warmup for batch size=1
[05/27/2025-21:12:36] [TRT-LLM] [I] Autotuner Cache size after warmup 112
[05/27/2025-21:12:36] [TRT-LLM] [I] [Autotuner]: Autotuning process ends
[05/27/2025-21:12:36] [TRT-LLM] [I] Creating CUDA graph instances for 196 batch sizes.
[05/27/2025-21:12:36] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=196
[05/27/2025-21:12:36] [TRT-LLM] [E] Failed to initialize executor on rank 3: [TensorRT-LLM][ERROR] CUDA runtime error in cudaMemsetAsync(fc1_act_sf_flat, 0x0, getOffsetFlatSFArray(num_experts_per_node, num_rows, cols), stream): an illegal memory access was encountered (/builds/ftp/internalcutlasskernels/cpp/tensorrt_llm/kernels/internal_cutlass_kernels/src/moe_gemm/moe_kernels.cu:1282)
1 0x7a2d00e89f0a /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x1b60f0a) [0x7a2d00e89f0a]
2 0x7a2d025b85d5 void tensorrt_llm::kernels::expandInputRowsKernelLauncher<__nv_fp4_e2m1, __nv_fp4_e2m1>(__nv_fp4_e2m1 const*, __nv_fp4_e2m1*, float const*, float*, int const*, int*, long, long const*, long, int, int, float const*, long*, unsigned char*, unsigned char const*, CUstream_st*) + 405
3 0x7a2d02613afd tensorrt_llm::kernels::CutlassMoeFCRunner<__nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, __nv_fp4_e2m1, __nv_bfloat16, void>::runMoe(void const*, void const*, int const*, float const*, void const*, void const*, tensorrt_llm::ActivationType, void const*, void const*, tensorrt_llm::kernels::QuantParams, long, long, long, int, int, char*, void*, int*, tensorrt_llm::kernels::MOEParallelismConfig, bool, tensorrt_llm::kernels::LoraParams&, bool, bool, tensorrt_llm::kernels::MoeMinLatencyParams&, CUstream_st*) + 3229
4 0x7a2b8fa149c3 torch_ext::FusedMoeRunner::runMoe(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor, at::Tensor const&, at::Tensor const&, std::optional<c10::ArrayRefat::Tensor >, std::optionalat::Tensor, long, long, long, long, long, long, bool, std::optional<c10::ArrayRef >) + 2611
5 0x7a2b8fa1901e /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libth_common.so(+0x1e101e) [0x7a2b8fa1901e]
6 0x7a2b8fa19722 std::Function_handler<void (std::vector<c10::IValue, std::allocatorc10::IValue >&), torch::class<torch_ext::FusedMoeRunner>::defineMethod<torch::detail::WrapMethod<at::Tensor (torch_ext::FusedMoeRunner::)(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor, at::Tensor const&, at::Tensor const&, std::optional<c10::ArrayRefat::Tensor >, std::optionalat::Tensor, long, long, long, long, long, long, bool, std::optional<c10::ArrayRef >)> >(std::__cxx11::basic_string<char, std::char_traits, std::allocator >, torch::detail::WrapMethod<at::Tensor (torch_ext::FusedMoeRunner::)(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor, at::Tensor const&, at::Tensor const&, std::optional<c10::ArrayRefat::Tensor >, std::optionalat::Tensor, long, long, long, long, long, long, bool, std::optional<c10::ArrayRef >)>, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::initializer_listtorch::arg)::{lambda(std::vector<c10::IValue, std::allocatorc10::IValue >&)#1}>::_M_invoke(std::_Any_data const&, std::vector<c10::IValue, std::allocatorc10::IValue >&) + 50
7 0x7a2e28f50804 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x945804) [0x7a2e28f50804]
8 0x7a2e28f51085 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x946085) [0x7a2e28f51085]
9 0x7a2e28fe4e26 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x9d9e26) [0x7a2e28fe4e26]
10 0x7a2e28fe504d /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x9da04d) [0x7a2e28fe504d]
11 0x7a2e2898919d /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37e19d) [0x7a2e2898919d]
12 0x58208f /usr/bin/python() [0x58208f]
13 0x549185 _PyObject_MakeTpCall + 117
14 0x54ce49 /usr/bin/python() [0x54ce49]
15 0x5a374a /usr/bin/python() [0x5a374a]
16 0x549185 _PyObject_MakeTpCall + 117
17 0x5d73c9 _PyEval_EvalFrameDefault + 2697
18 0x7a2e28e7b0de /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x8700de) [0x7a2e28e7b0de]
19 0x7a2e2917fe1f /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(
[05/27/2025-21:12:36] [TRT-LLM] [E] Failed to initialize executor on rank 1: [TensorRT-LLM][ERROR] CUDA runtime error in cudaMemsetAsync(fc1_act_sf_flat, 0x0, getOffsetFlatSFArray(num_experts_per_node, num_rows, cols), stream): an illegal memory access was encountered (/builds/ftp/internalcutlasskernels/cpp/tensorrt_llm/kernels/internal_cutlass_kernels/src/moe_gemm/moe_kernels.cu:1282)
1 0x736fb6f89f0a /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x1b60f0a) [0x736fb6f89f0a]
2 0x736fb86b85d5 void tensorrt_llm::kernels::expandInputRowsKernelLauncher<__nv_fp4_e2m1, __nv_fp4_e2m1>(__nv_fp4_e2m1 const*, __nv_fp4_e2m1*, float const*, float*, int const*, int*, long, long const*, long, int, int, float const*, long*, unsigned char*, unsigned char const*, CUstream_st*) + 405
3 0x736fb8713afd tensorrt_llm::kernels::CutlassMoeFCRunner<__nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, __nv_fp4_e2m1, __nv_bfloat16, void>::runMoe(void const*, void const*, int const*, float const*, void const*, void const*, tensorrt_llm::ActivationType, void const*, void const*, tensorrt_llm::kernels::QuantParams, long, long, long, int, int, char*, void*, int*, tensorrt_llm::kernels::MOEParallelismConfig, bool, tensorrt_llm::kernels::LoraParams&, bool, bool, tensorrt_llm::kernels::MoeMinLatencyParams&, CUstream_st*) + 3229
4 0x736e45bb89c3 torch_ext::FusedMoeRunner::runMoe(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor, at::Tensor const&, at::Tensor const&, std::optional<c10::ArrayRefat::Tensor >, std::optionalat::Tensor, long, long, long, long, long, long, bool, std::optional<c10::ArrayRef >) + 2611
5 0x736e45bbd01e /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libth_common.so(+0x1e101e) [0x736e45bbd01e]
6 0x736e45bbd722 std::Function_handler<void (std::vector<c10::IValue, std::allocatorc10::IValue >&), torch::class<torch_ext::FusedMoeRunner>::defineMethod<torch::detail::WrapMethod<at::Tensor (torch_ext::FusedMoeRunner::)(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor, at::Tensor const&, at::Tensor const&, std::optional<c10::ArrayRefat::Tensor >, std::optionalat::Tensor, long, long, long, long, long, long, bool, std::optional<c10::ArrayRef >)> >(std::__cxx11::basic_string<char, std::char_traits, std::allocator >, torch::detail::WrapMethod<at::Tensor (torch_ext::FusedMoeRunner::)(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor, at::Tensor const&, at::Tensor const&, std::optional<c10::ArrayRefat::Tensor >, std::optionalat::Tensor, long, long, long, long, long, long, bool, std::optional<c10::ArrayRef >)>, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::initializer_listtorch::arg)::{lambda(std::vector<c10::IValue, std::allocatorc10::IValue >&)#1}>::_M_invoke(std::_Any_data const&, std::vector<c10::IValue, std::allocatorc10::IValue >&) + 50
7 0x7370def50804 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x945804) [0x7370def50804]
8 0x7370def51085 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x946085) [0x7370def51085]
9 0x7370defe4e26 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x9d9e26) [0x7370defe4e26]
10 0x7370defe504d /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x9da04d) [0x7370defe504d]
11 0x7370de98919d /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37e19d) [0x7370de98919d]
12 0x58208f /usr/bin/python() [0x58208f]
13 0x549185 _PyObject_MakeTpCall + 117
14 0x54ce49 /usr/bin/python() [0x54ce49]
15 0x5a374a /usr/bin/python() [0x5a374a]
16 0x549185 _PyObject_MakeTpCall + 117
17 0x5d73c9 _PyEval_EvalFrameDefault + 2697
18 0x7370dee7b0de /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x8700de) [0x7370dee7b0de]
19 0x7370df17fe1f /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xb74e1f) [0x7a2e2917fe1f]
20 0x7a2e20ce23e1 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(+0x57be3e1) [0x7a2e20ce23e1]
21 0x7a2e28f30c7f torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptrtorch::jit::Operator, std::allocator<std::shared_ptrtorch::jit::Operator > > const&, pybind11::args const&, pybind11::kwargs const&, std::optionalc10::DispatchKey) + 239
22 0x7a2e28f30f29 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptrtorch::jit::Operator, std::allocator<std::shared_ptrtorch::jit::Operator > > const&, c10::Symbol, pybind11::args const&, pybind11::kwargs const&, bool, std::optionalc10::DispatchKey) + 553
23 0x7a2e28e422ee /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x8372ee) [0x7a2e28e422ee]
24 0x7a2e2898919d /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37e19d) [0x7a2e2898919d]
25 0x58208f /usr/bin/python() [0x58208f]
26 0x54b30c PyObject_Call + 108
27 0x5db55b _PyEval_EvalFrameDefault + 19483
28 0x54aa9a _PyObject_Call_Prepend + 394
29 0x5a3628 /usr/bin/python() [0x5a3628]
30 0x54924e _PyObject_MakeTpCall + 318
31 0x5d73c9 _PyEval_EvalFrameDefault + 2697
32 0x54cd94 /usr/bin/python() [0x54cd94]
33 0x54b3b5 PyObject_Call + 277
34 0x5db55b _PyEval_EvalFrameDefault + 19483
35 0x54cd94 /usr/bin/python() [0x54cd94]
36 0x54b3b5 PyObject_Call + 277
37 0x5db55b _PyEval_EvalFrameDefault + 19483
38 0x54aa9a _PyObject_Call_Prepend + 394
39 0x5a3628 /usr/bin/python() [0x5a3628]
40 0x54924e _PyObject_MakeTpCall + 318
41 0x5d73c9 _PyEval_EvalFrameDefault + 2697
42 0x54cd94 /usr/bin/python() [0x54cd94]
43 0x54b3b5 PyObject_Call + 277
44 0x5db55b _PyEval_EvalFrameDefault + 19483
45 0x54cd94 /usr/bin/python() [0x54cd94]
46 0x54b3b5 PyObject_Call + 277
47 0x5db55b _PyEval_EvalFrameDefault + 19483
48 0x54aa9a _PyObject_Call_Prepend + 394
49 0x5a3628 /usr/bin/python() [0x5a3628]
50 0x54924e _PyObject_MakeTpCall + 318
51 0x5d73c9 _PyEval_EvalFrameDefault + 2697
52 0x54cd94 /usr/bin/python() [0x54cd94]
53 0x54b3b5 PyObject_Call + 277
54 0x5db55b _PyEval_EvalFrameDefault + 19483
55 0x54cd94 /usr/bin/python() [0x54cd94]
56 0x54b3b5 PyObject_Call + 277
57 0x5db55b _PyEval_EvalFrameDefault + 19483
58 0x54aa9a _PyObject_Call_Prepend + 394
59 0x5a3628 /usr/bin/python() [0x5a3628]
60 0x54924e _PyObject_MakeTpCall + 318
61 0x5d73c9 _PyEval_EvalFrameDefault + 2697
62 0x54cd94 /usr/bin/python() [0x54cd94]
63 0x54b3b5 PyObject_Call + 277
64 0x5db55b _PyEval_EvalFrameDefault + 19483
65 0x54cd94 /usr/bin/python() [0x54cd94]
66 0x54b3b5 PyObject_Call + 277
67 0x5db55b _PyEval_EvalFrameDefault + 19483
68 0x54aa9a _PyObject_Call_Prepend + 394
69 0x5a3628 /usr/bin/python() [0x5a3628]
70 0x54924e _PyObject_MakeTpCall + 318
71 0x5d73c9 _PyEval_EvalFrameDefault + 2697
72 0x54cd94 /usr/bin/python() [0x54cd94]
73 0x54b3b5 PyObject_Call + 277
74 0x5db55b _PyEval_EvalFrameDefault + 19483
75 0x54cd94 /usr/bin/python() [0x54cd94]
76 0x54b3b5 PyObject_Call + 277
77 0x5db55b _PyEval_EvalFrameDefault + 19483
78 0x54aa9a _PyObject_Call_Prepend + 394
79 0x59e09f /usr/bin/python() [0x59e09f]
80 0x599b63 /usr/bin/python() [0x599b63]
81 0x54924e _PyObject_MakeTpCall + 318
82 0x5d73c9 _PyEval_EvalFrameDefault + 2697
83 0x54aa9a _PyObject_Call_Prepend + 394
84 0x59e09f /usr/bin/python() [0x59e09f]
85 0x599b63 /usr/bin/python() [0x599b63]
86 0x54924e _PyOb+0xb74e1f) [0x7370df17fe1f]
20 0x7370d6ce23e1 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(+0x57be3e1) [0x7370d6ce23e1]
21 0x7370def30c7f torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptrtorch::jit::Operator, std::allocator<std::shared_ptrtorch::jit::Operator > > const&, pybind11::args const&, pybind11::kwargs const&, std::optionalc10::DispatchKey) + 239
22 0x7370def30f29 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptrtorch::jit::Operator, std::allocator<std::shared_ptrtorch::jit::Operator > > const&, c10::Symbol, pybind11::args const&, pybind11::kwargs const&, bool, std::optionalc10::DispatchKey) + 553
23 0x7370dee422ee /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x8372ee) [0x7370dee422ee]
24 0x7370de98919d /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37e19d) [0x7370de98919d]
25 0x58208f /usr/bin/python() [0x58208f]
26 0x54b30c PyObject_Call + 108

Additional notes

This only happens when MTP is enabled and disable_overlap_scheduler is set to false (i.e., with the overlap scheduler left on).
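
Based on that observation, a config variant like the one below should avoid the crash while keeping MTP enabled; only pytorch_backend_config.disable_overlap_scheduler is flipped. This is a workaround sketch rather than a fix, and every key not shown stays exactly as in extra-llm-api-config.yml above.

# Workaround sketch: same extra-llm-api-config.yml, with only the overlap scheduler turned off.
speculative_config:
  decoding_type: "MTP"
  num_nextn_predict_layers: 1
pytorch_backend_config:
  print_iter_log: true
  disable_overlap_scheduler: true   # was false; per the note above, the crash only occurs when this is false
  use_cuda_graph: true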

@Shang-Pin Shang-Pin added the bug Something isn't working label May 27, 2025
brb-nv (Collaborator) commented May 29, 2025

Maybe @kaiyux

@brb-nv brb-nv assigned kaiyux and unassigned Kefeng-Duan May 29, 2025
kaiyux (Member) commented May 29, 2025

@lfr-0531 Can you help take a look?

@kaiyux kaiyux assigned lfr-0531 and unassigned kaiyux May 29, 2025