Enable TF32 precision acceleration for H100 #4767
base: main
Conversation
This pull request was exported from Phabricator. Differential Revision: D80908603
Summary:
X-link: facebookresearch/FBGEMM#1789

As indicated by a previous MTS [benchmark](https://www.internalfb.com/intern/everpaste/?handle=GEHWGSBdCZC6Q5QFALLQJHKASQUnbsIXAAAB&phabricator_paste_number=1916364403):

```
TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
```

This diff enables fp32 high-precision (TF32) matmul. E2E, it pushes CMF500x QPS from [25319.47](https://www.internalfb.com/intern/everpaste/?handle=GEHWGSBdCZC6Q5QFALLQJHKASQUnbsIXAAAB&phabricator_paste_number=1916364403) to [25737.61](https://www.internalfb.com/intern/everpaste/?handle=GAsZ7R9ASEeoe9sIABHwORv6SG89bsIXAAAB&phabricator_paste_number=1916396963), a gain of roughly 1.7%.

Differential Revision: D80908603
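For context, a minimal sketch of what the change amounts to in user code, using PyTorch's public `torch.set_float32_matmul_precision` API; the tensor shapes below are illustrative only, borrowed from the CMF 500x GEMM shapes in the related diff (pytorch#4766):

```python
import torch

# Opt in to TF32 tensor cores for float32 matmuls, as the warning suggests.
# "high" allows TF32 (or an equivalent fast path) for fp32 matrix multiply.
torch.set_float32_matmul_precision("high")

if torch.cuda.is_available():
    # Illustrative fp32 matmul using one of the CMF 500x shapes
    # (M=512, K=19712, N=1024); on H100 this now runs on TF32 tensor cores.
    a = torch.randn(512, 19712, device="cuda")
    b = torch.randn(19712, 1024, device="cuda")
    c = a @ b
    print(c.shape)  # torch.Size([512, 1024])
```

The "high" setting trades a few mantissa bits (TF32 keeps 10) for tensor-core throughput, which is why the fp32 path speeds up without any model-code changes.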
Force-pushed 03bd8bb to e713580
Force-pushed e713580 to 430d6e4
Summary:
Pull Request resolved: pytorch#4766
X-link: facebookresearch/FBGEMM#1788

For H100, add a new option to the persistent Triton FP8 GEMM to boost performance for the CMF 500x GEMM shapes (a quick arithmetic check of these figures follows below):
- M=512, N=1024, K=19712: flops boosted from 504 to 559, by ~11%.
- M=512, N=1024, K=171712: flops boosted from 437 to 481, by ~10%.

E2E, CMF 500x QPS was [25077.38](https://www.internalfb.com/intern/paste/P1915458454/); it is now [25319.47](https://www.internalfb.com/intern/paste/P1916364403/).

Differential Revision: D80881599
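As a quick sanity check of the quoted percentages (flops and QPS numbers copied from the summary, in whatever units were reported there; the helper below is illustrative, not part of FBGEMM):

```python
# Verify the reported per-shape speedups and the end-to-end QPS gain.
def speedup_pct(before: float, after: float) -> float:
    return (after / before - 1.0) * 100.0

print(f"{speedup_pct(504, 559):.1f}%")  # ~10.9% for M=512, N=1024, K=19712
print(f"{speedup_pct(437, 481):.1f}%")  # ~10.1% for M=512, N=1024, K=171712
print(f"{speedup_pct(25077.38, 25319.47):.2f}%")  # ~0.97% E2E QPS gain
```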
Force-pushed 430d6e4 to 5870830
Force-pushed 5870830 to f72b17e
Force-pushed f72b17e to 3d31bd5
Summary:
Pull Request resolved: pytorch#4767
X-link: facebookresearch/FBGEMM#1789

As indicated by a previous MTS [benchmark](https://www.internalfb.com/intern/everpaste/?handle=GEHWGSBdCZC6Q5QFALLQJHKASQUnbsIXAAAB&phabricator_paste_number=1916364403):

```
TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
```

This diff enables fp32 high-precision (TF32) matmul. E2E, it pushes CMF500x QPS from [25319.47](https://www.internalfb.com/intern/everpaste/?handle=GEHWGSBdCZC6Q5QFALLQJHKASQUnbsIXAAAB&phabricator_paste_number=1916364403) to [25737.61](https://www.internalfb.com/intern/everpaste/?handle=GAsZ7R9ASEeoe9sIABHwORv6SG89bsIXAAAB&phabricator_paste_number=1916396963).

Reviewed By: beginner137

Differential Revision: D80908603
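For completeness, a small sketch (public PyTorch APIs only; this check is not part of the diff) of confirming at runtime that the high-precision setting took effect:

```python
import torch

# get_float32_matmul_precision returns "highest", "high", or "medium".
assert torch.get_float32_matmul_precision() == "high"

# The equivalent backend flag is flipped when TF32 matmul is allowed.
print(torch.backends.cuda.matmul.allow_tf32)  # True
```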
Force-pushed 3d31bd5 to 68134fa