
DeepSeek-R1-FP4 crashes when MTP is enabled #4708

Open
1 of 4 tasks
Shang-Pin opened this issue May 27, 2025 · 2 comments
Assignees
lfr-0531
Labels
bug Something isn't working

Comments

Shang-Pin commented May 27, 2025

System Info

NVIDIA B200

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

1.
root@32ab8238e348:/app/tensorrt_llm# cat extra-llm-api-config.yml
enable_attention_dp: false
max_batch_size: 196
max_num_tokens: 50000
max_seq_len: 40000
tp_size: 8
ep_size: 8
speculative_config:
  decoding_type: "MTP"
  num_nextn_predict_layers: 1
kv_cache_config:
  free_gpu_memory_fraction: 0.6
pytorch_backend_config:
  print_iter_log: true
  disable_overlap_scheduler: false
  use_cuda_graph: true
  cuda_graph_batch_sizes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256]
2.
trtllm-serve /data/weights/nvidia--DeepSeek-R1-FP4 --host 0.0.0.0 --backend pytorch --max_batch_size 64 --max_num_tokens 50000 --max_seq_len 40960 --tp_size 8 --pp_size 1 --ep_size 8 --extra_llm_api_options extra-llm-api-config.yml

Expected behavior

The server starts and serves requests.
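
For reference, a quick way to confirm the server actually came up once it reports ready is an OpenAI-compatible request. This is only a smoke-test sketch: the port assumes the trtllm-serve default (8000), and the model name is a placeholder that may need to match the served checkpoint.

# Hypothetical smoke test; host, port, and model name are assumptions, not taken from the report.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia--DeepSeek-R1-FP4",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 8
      }'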

Actual behavior

[05/27/2025-21:12:35] [TRT-LLM] [I] [Autotuner]: Autotuning process starts ...
[05/27/2025-21:12:35] [TRT-LLM] [I] Run autotuning warmup for batch size=1
[05/27/2025-21:12:36] [TRT-LLM] [I] Autotuner Cache size after warmup 112
[05/27/2025-21:12:36] [TRT-LLM] [I] [Autotuner]: Autotuning process ends
[05/27/2025-21:12:36] [TRT-LLM] [I] Creating CUDA graph instances for 196 batch sizes.
[05/27/2025-21:12:36] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=196
[05/27/2025-21:12:36] [TRT-LLM] [E] Failed to initialize executor on rank 3: [TensorRT-LLM][ERROR] CUDA runtime error in cudaMemsetAsync(fc1_act_sf_flat, 0x0, getOffsetFlatSFArray(num_experts_per_node, num_rows, cols), stream): an illegal memory access was encountered (/builds/ftp/internalcutlasskernels/cpp/tensorrt_llm/kernels/internal_cutlass_kernels/src/moe_gemm/moe_kernels.cu:1282)
1 0x7a2d00e89f0a /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x1b60f0a) [0x7a2d00e89f0a]
2 0x7a2d025b85d5 void tensorrt_llm::kernels::expandInputRowsKernelLauncher<__nv_fp4_e2m1, __nv_fp4_e2m1>(__nv_fp4_e2m1 const*, __nv_fp4_e2m1*, float const*, float*, int const*, int*, long, long const*, long, int, int, float const*, long*, unsigned char*, unsigned char const*, CUstream_st*) + 405
3 0x7a2d02613afd tensorrt_llm::kernels::CutlassMoeFCRunner<__nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, __nv_fp4_e2m1, __nv_bfloat16, void>::runMoe(void const*, void const*, int const*, float const*, void const*, void const*, tensorrt_llm::ActivationType, void const*, void const*, tensorrt_llm::kernels::QuantParams, long, long, long, int, int, char*, void*, int*, tensorrt_llm::kernels::MOEParallelismConfig, bool, tensorrt_llm::kernels::LoraParams&, bool, bool, tensorrt_llm::kernels::MoeMinLatencyParams&, CUstream_st*) + 3229
4 0x7a2b8fa149c3 torch_ext::FusedMoeRunner::runMoe(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor, at::Tensor const&, at::Tensor const&, std::optional<c10::ArrayRefat::Tensor >, std::optionalat::Tensor, long, long, long, long, long, long, bool, std::optional<c10::ArrayRef >) + 2611
5 0x7a2b8fa1901e /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libth_common.so(+0x1e101e) [0x7a2b8fa1901e]
6 0x7a2b8fa19722 std::Function_handler<void (std::vector<c10::IValue, std::allocatorc10::IValue >&), torch::class<torch_ext::FusedMoeRunner>::defineMethod<torch::detail::WrapMethod<at::Tensor (torch_ext::FusedMoeRunner::)(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor, at::Tensor const&, at::Tensor const&, std::optional<c10::ArrayRefat::Tensor >, std::optionalat::Tensor, long, long, long, long, long, long, bool, std::optional<c10::ArrayRef >)> >(std::__cxx11::basic_string<char, std::char_traits, std::allocator >, torch::detail::WrapMethod<at::Tensor (torch_ext::FusedMoeRunner::)(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor, at::Tensor const&, at::Tensor const&, std::optional<c10::ArrayRefat::Tensor >, std::optionalat::Tensor, long, long, long, long, long, long, bool, std::optional<c10::ArrayRef >)>, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::initializer_listtorch::arg)::{lambda(std::vector<c10::IValue, std::allocatorc10::IValue >&)#1}>::_M_invoke(std::_Any_data const&, std::vector<c10::IValue, std::allocatorc10::IValue >&) + 50
7 0x7a2e28f50804 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x945804) [0x7a2e28f50804]
8 0x7a2e28f51085 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x946085) [0x7a2e28f51085]
9 0x7a2e28fe4e26 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x9d9e26) [0x7a2e28fe4e26]
10 0x7a2e28fe504d /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x9da04d) [0x7a2e28fe504d]
11 0x7a2e2898919d /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37e19d) [0x7a2e2898919d]
12 0x58208f /usr/bin/python() [0x58208f]
13 0x549185 _PyObject_MakeTpCall + 117
14 0x54ce49 /usr/bin/python() [0x54ce49]
15 0x5a374a /usr/bin/python() [0x5a374a]
16 0x549185 _PyObject_MakeTpCall + 117
17 0x5d73c9 _PyEval_EvalFrameDefault + 2697
18 0x7a2e28e7b0de /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x8700de) [0x7a2e28e7b0de]
19 0x7a2e2917fe1f /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(
[05/27/2025-21:12:36] [TRT-LLM] [E] Failed to initialize executor on rank 1: [TensorRT-LLM][ERROR] CUDA runtime error in cudaMemsetAsync(fc1_act_sf_flat, 0x0, getOffsetFlatSFArray(num_experts_per_node, num_rows, cols), stream): an illegal memory access was encountered (/builds/ftp/internalcutlasskernels/cpp/tensorrt_llm/kernels/internal_cutlass_kernels/src/moe_gemm/moe_kernels.cu:1282)
1 0x736fb6f89f0a /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x1b60f0a) [0x736fb6f89f0a]
2 0x736fb86b85d5 void tensorrt_llm::kernels::expandInputRowsKernelLauncher<__nv_fp4_e2m1, __nv_fp4_e2m1>(__nv_fp4_e2m1 const*, __nv_fp4_e2m1*, float const*, float*, int const*, int*, long, long const*, long, int, int, float const*, long*, unsigned char*, unsigned char const*, CUstream_st*) + 405
3 0x736fb8713afd tensorrt_llm::kernels::CutlassMoeFCRunner<__nv_fp4_e2m1, __nv_fp4_e2m1, __nv_bfloat16, __nv_fp4_e2m1, __nv_bfloat16, void>::runMoe(void const*, void const*, int const*, float const*, void const*, void const*, tensorrt_llm::ActivationType, void const*, void const*, tensorrt_llm::kernels::QuantParams, long, long, long, int, int, char*, void*, int*, tensorrt_llm::kernels::MOEParallelismConfig, bool, tensorrt_llm::kernels::LoraParams&, bool, bool, tensorrt_llm::kernels::MoeMinLatencyParams&, CUstream_st*) + 3229
4 0x736e45bb89c3 torch_ext::FusedMoeRunner::runMoe(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor, at::Tensor const&, at::Tensor const&, std::optional<c10::ArrayRefat::Tensor >, std::optionalat::Tensor, long, long, long, long, long, long, bool, std::optional<c10::ArrayRef >) + 2611
5 0x736e45bbd01e /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libth_common.so(+0x1e101e) [0x736e45bbd01e]
6 0x736e45bbd722 std::Function_handler<void (std::vector<c10::IValue, std::allocatorc10::IValue >&), torch::class<torch_ext::FusedMoeRunner>::defineMethod<torch::detail::WrapMethod<at::Tensor (torch_ext::FusedMoeRunner::)(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor, at::Tensor const&, at::Tensor const&, std::optional<c10::ArrayRefat::Tensor >, std::optionalat::Tensor, long, long, long, long, long, long, bool, std::optional<c10::ArrayRef >)> >(std::__cxx11::basic_string<char, std::char_traits, std::allocator >, torch::detail::WrapMethod<at::Tensor (torch_ext::FusedMoeRunner::)(at::Tensor const&, at::Tensor const&, std::optionalat::Tensor, at::Tensor const&, at::Tensor const&, std::optional<c10::ArrayRefat::Tensor >, std::optionalat::Tensor, long, long, long, long, long, long, bool, std::optional<c10::ArrayRef >)>, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::initializer_listtorch::arg)::{lambda(std::vector<c10::IValue, std::allocatorc10::IValue >&)#1}>::_M_invoke(std::_Any_data const&, std::vector<c10::IValue, std::allocatorc10::IValue >&) + 50
7 0x7370def50804 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x945804) [0x7370def50804]
8 0x7370def51085 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x946085) [0x7370def51085]
9 0x7370defe4e26 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x9d9e26) [0x7370defe4e26]
10 0x7370defe504d /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x9da04d) [0x7370defe504d]
11 0x7370de98919d /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37e19d) [0x7370de98919d]
12 0x58208f /usr/bin/python() [0x58208f]
13 0x549185 _PyObject_MakeTpCall + 117
14 0x54ce49 /usr/bin/python() [0x54ce49]
15 0x5a374a /usr/bin/python() [0x5a374a]
16 0x549185 _PyObject_MakeTpCall + 117
17 0x5d73c9 _PyEval_EvalFrameDefault + 2697
18 0x7370dee7b0de /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x8700de) [0x7370dee7b0de]
19 0x7370df17fe1f /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0xb74e1f) [0x7a2e2917fe1f]
20 0x7a2e20ce23e1 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(+0x57be3e1) [0x7a2e20ce23e1]
21 0x7a2e28f30c7f torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptrtorch::jit::Operator, std::allocator<std::shared_ptrtorch::jit::Operator > > const&, pybind11::args const&, pybind11::kwargs const&, std::optionalc10::DispatchKey) + 239
22 0x7a2e28f30f29 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptrtorch::jit::Operator, std::allocator<std::shared_ptrtorch::jit::Operator > > const&, c10::Symbol, pybind11::args const&, pybind11::kwargs const&, bool, std::optionalc10::DispatchKey) + 553
23 0x7a2e28e422ee /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x8372ee) [0x7a2e28e422ee]
24 0x7a2e2898919d /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37e19d) [0x7a2e2898919d]
25 0x58208f /usr/bin/python() [0x58208f]
26 0x54b30c PyObject_Call + 108
27 0x5db55b _PyEval_EvalFrameDefault + 19483
28 0x54aa9a _PyObject_Call_Prepend + 394
29 0x5a3628 /usr/bin/python() [0x5a3628]
30 0x54924e _PyObject_MakeTpCall + 318
31 0x5d73c9 _PyEval_EvalFrameDefault + 2697
32 0x54cd94 /usr/bin/python() [0x54cd94]
33 0x54b3b5 PyObject_Call + 277
34 0x5db55b _PyEval_EvalFrameDefault + 19483
35 0x54cd94 /usr/bin/python() [0x54cd94]
36 0x54b3b5 PyObject_Call + 277
37 0x5db55b _PyEval_EvalFrameDefault + 19483
38 0x54aa9a _PyObject_Call_Prepend + 394
39 0x5a3628 /usr/bin/python() [0x5a3628]
40 0x54924e _PyObject_MakeTpCall + 318
41 0x5d73c9 _PyEval_EvalFrameDefault + 2697
42 0x54cd94 /usr/bin/python() [0x54cd94]
43 0x54b3b5 PyObject_Call + 277
44 0x5db55b _PyEval_EvalFrameDefault + 19483
45 0x54cd94 /usr/bin/python() [0x54cd94]
46 0x54b3b5 PyObject_Call + 277
47 0x5db55b _PyEval_EvalFrameDefault + 19483
48 0x54aa9a _PyObject_Call_Prepend + 394
49 0x5a3628 /usr/bin/python() [0x5a3628]
50 0x54924e _PyObject_MakeTpCall + 318
51 0x5d73c9 _PyEval_EvalFrameDefault + 2697
52 0x54cd94 /usr/bin/python() [0x54cd94]
53 0x54b3b5 PyObject_Call + 277
54 0x5db55b _PyEval_EvalFrameDefault + 19483
55 0x54cd94 /usr/bin/python() [0x54cd94]
56 0x54b3b5 PyObject_Call + 277
57 0x5db55b _PyEval_EvalFrameDefault + 19483
58 0x54aa9a _PyObject_Call_Prepend + 394
59 0x5a3628 /usr/bin/python() [0x5a3628]
60 0x54924e _PyObject_MakeTpCall + 318
61 0x5d73c9 _PyEval_EvalFrameDefault + 2697
62 0x54cd94 /usr/bin/python() [0x54cd94]
63 0x54b3b5 PyObject_Call + 277
64 0x5db55b _PyEval_EvalFrameDefault + 19483
65 0x54cd94 /usr/bin/python() [0x54cd94]
66 0x54b3b5 PyObject_Call + 277
67 0x5db55b _PyEval_EvalFrameDefault + 19483
68 0x54aa9a _PyObject_Call_Prepend + 394
69 0x5a3628 /usr/bin/python() [0x5a3628]
70 0x54924e _PyObject_MakeTpCall + 318
71 0x5d73c9 _PyEval_EvalFrameDefault + 2697
72 0x54cd94 /usr/bin/python() [0x54cd94]
73 0x54b3b5 PyObject_Call + 277
74 0x5db55b _PyEval_EvalFrameDefault + 19483
75 0x54cd94 /usr/bin/python() [0x54cd94]
76 0x54b3b5 PyObject_Call + 277
77 0x5db55b _PyEval_EvalFrameDefault + 19483
78 0x54aa9a _PyObject_Call_Prepend + 394
79 0x59e09f /usr/bin/python() [0x59e09f]
80 0x599b63 /usr/bin/python() [0x599b63]
81 0x54924e _PyObject_MakeTpCall + 318
82 0x5d73c9 _PyEval_EvalFrameDefault + 2697
83 0x54aa9a _PyObject_Call_Prepend + 394
84 0x59e09f /usr/bin/python() [0x59e09f]
85 0x599b63 /usr/bin/python() [0x599b63]
86 0x54924e _PyOb+0xb74e1f) [0x7370df17fe1f]
20 0x7370d6ce23e1 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(+0x57be3e1) [0x7370d6ce23e1]
21 0x7370def30c7f torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptrtorch::jit::Operator, std::allocator<std::shared_ptrtorch::jit::Operator > > const&, pybind11::args const&, pybind11::kwargs const&, std::optionalc10::DispatchKey) + 239
22 0x7370def30f29 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptrtorch::jit::Operator, std::allocator<std::shared_ptrtorch::jit::Operator > > const&, c10::Symbol, pybind11::args const&, pybind11::kwargs const&, bool, std::optionalc10::DispatchKey) + 553
23 0x7370dee422ee /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x8372ee) [0x7370dee422ee]
24 0x7370de98919d /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x37e19d) [0x7370de98919d]
25 0x58208f /usr/bin/python() [0x58208f]
26 0x54b30c PyObject_Call + 108

Additional notes

This only happens when MTP is enabled and disable_overlap_scheduler is set to false (i.e., with the overlap scheduler left on).
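
Based on that observation, a config variant like the one below should avoid the crash while keeping MTP enabled; only pytorch_backend_config.disable_overlap_scheduler is flipped. This is a workaround sketch rather than a fix, and every key not shown stays exactly as in extra-llm-api-config.yml above.

# Workaround sketch: same extra-llm-api-config.yml, with only the overlap scheduler turned off.
speculative_config:
  decoding_type: "MTP"
  num_nextn_predict_layers: 1
pytorch_backend_config:
  print_iter_log: true
  disable_overlap_scheduler: true   # was false; per the note above, the crash only occurs when this is false
  use_cuda_graph: true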

@Shang-Pin Shang-Pin added the bug Something isn't working label May 27, 2025
brb-nv (Collaborator) commented May 29, 2025

Maybe @kaiyux

@brb-nv brb-nv assigned kaiyux and unassigned Kefeng-Duan May 29, 2025
kaiyux (Member) commented May 29, 2025

@lfr-0531 Can you help take a look?

@kaiyux kaiyux assigned lfr-0531 and unassigned kaiyux May 29, 2025