Replies: 5 comments 39 replies
-
Thanks for sharing. I will test an EPYC 9755 with 768 GB of DDR5-6800 ECC RAM and post feedback later.
-
I am testing the exact setup you are suggesting (AMD EPYC 9654 with 1.5 TB of RAM, all memory channels populated), but in a dual-socket configuration with the 671B Q8 model, and have been getting 4-5 tokens/s. I am going to test the single-socket setup to see if there is any speed increase.
-
2x AMD EPYC 7K62 (96 cores), 16x 64 GB RAM, DeepSeek-R1-Q5_K_S.gguf (671B, 461.81 GB), 96 threads -> 2.9 t/s
For the system quoted above, try disabling NUMA in the system BIOS and let us know your CPU-only inference results. I have a dual-CPU system as well, and disabling NUMA in the BIOS increased my token output.
-
With a very large 671B-parameter model, my token output increased from 2 t/s to 3 t/s with NUMA (Non-Uniform Memory Access) disabled in the system BIOS.
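For anyone who wants to reproduce this comparison, here is a rough sketch (exact binary names and flags depend on your llama.cpp build, and the model path is just a placeholder) of how to confirm the BIOS change is visible to the OS, and how to try llama.cpp's own NUMA handling instead of, or alongside, the BIOS setting:

```
# Check how many NUMA nodes the OS sees; with NUMA disabled/interleaved in
# the BIOS, a dual-socket board typically reports a single node.
lscpu | grep -i numa
numactl --hardware

# With NUMA left enabled, llama.cpp also has a --numa option
# (distribute / isolate / numactl) that may help on dual-socket systems.
./llama-cli -m /models/DeepSeek-R1-Q5_K_S.gguf -t 96 --numa distribute -p "Hello" -n 64
```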
-
@jasonsi1993 I'm still investigating this problem. My current hypothesis is that multiplication of small matrices (expert tensor matrices are only 2048 x 7168) scales very badly on dual-CPU systems. To verify this, could someone run the steps below on a dual-CPU system? The model to try is llama-3.2 1B, as it has FFN matrices of similar size (2048 x 8192) to the DeepSeek R1 experts. If I'm right, it will scale just as badly as DeepSeek R1.
Post the output in replies please (and thanks!). |
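The original steps are not reproduced in this excerpt; a plausible version of the test (assumed invocation using llama.cpp's llama-bench tool, with a placeholder filename for a quantized llama-3.2 1B model) is to sweep thread counts and compare a free-running run against one pinned to a single socket:

```
# Sweep thread counts across both sockets and watch how generation speed scales.
./llama-bench -m Llama-3.2-1B-Q8_0.gguf -t 8,16,32,64,96 -p 0 -n 128

# Same model pinned to one socket / NUMA node for comparison.
numactl --cpunodebind=0 --membind=0 ./llama-bench -m Llama-3.2-1B-Q8_0.gguf -t 48 -p 0 -n 128
```

If the hypothesis holds, tokens/s should plateau or drop once the threads span both sockets, mirroring the DeepSeek R1 behaviour.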
-
Could someone help figure out the best hardware configuration for LLM inference (CPU only)?
I have done 3 tests:
I tested the same large model on these different configurations and got the results above. That suggests llama.cpp is not well optimized for dual-socket motherboards, and I cannot use the full power of such configurations to speed up LLM inference. It turned out that running a single instance of llama.cpp on one node (CPU) of a dual-socket setup is far better than running it across both.
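For reference, a minimal way to run that "single instance on one node" case (assuming numactl is installed; the binary, model path, and thread count are placeholders to match one socket's cores) is to bind both execution and memory allocation to one socket:

```
# Bind threads and memory to NUMA node 0 so every weight read stays on the
# local memory controller instead of crossing the inter-socket link.
numactl --cpunodebind=0 --membind=0 \
  ./llama-cli -m /models/DeepSeek-R1-Q5_K_S.gguf -t 48 -p "Hello" -n 128
```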
Many different optimizations did not give any significant inference boost. So, based on the above, for the best t/s when running a large LLM such as DeepSeek-R1-Q5_K_S.gguf (671B, 461.81 GB), I suggest the following hardware configuration:
With this setup I am optimistically expecting something around 10 t/s inference speed for the same large model, DeepSeek-R1-Q5_K_S.gguf (671B, 461.81 GB). Could someone correct me if I'm wrong, or maybe suggest your own ideas and thoughts?
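As a sanity check on that 10 t/s figure, here is a hedged back-of-envelope estimate, assuming the suggested configuration is a single-socket, 12-channel DDR5-4800 platform (like the EPYC 9654 mentioned earlier) and that DeepSeek R1 activates roughly 37B of its 671B parameters per token:

```
bytes per parameter at Q5_K_S  ≈ 461.81 GB / 671B params ≈ 0.69 B/param
active weights read per token  ≈ 37B params x 0.69 B/param ≈ 25 GB
peak memory bandwidth          ≈ 12 channels x 38.4 GB/s ≈ 460 GB/s
bandwidth-bound upper limit    ≈ 460 / 25 ≈ 18 t/s
```

Real-world results usually land well below the theoretical ceiling, so ~10 t/s looks optimistic but not unreasonable for a single socket, provided the cross-socket scaling issue discussed above is avoided.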