
Conversation

RandySheriff

Summary:
As indicated by a previous MTS [benchmark](https://www.internalfb.com/intern/everpaste/?handle=GEHWGSBdCZC6Q5QFALLQJHKASQUnbsIXAAAB&phabricator_paste_number=1916364403):

```
TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled.
Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
```

This diff enables fp32 high-precision matmul. End-to-end, it pushes CMF500x QPS from [25319.47](https://www.internalfb.com/intern/everpaste/?handle=GEHWGSBdCZC6Q5QFALLQJHKASQUnbsIXAAAB&phabricator_paste_number=1916364403) to [25737.61](https://www.internalfb.com/intern/everpaste/?handle=GAsZ7R9ASEeoe9sIABHwORv6SG89bsIXAAAB&phabricator_paste_number=1916396963).

Differential Revision: D80908603
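
For context, a minimal sketch of the setting this diff enables. `torch.set_float32_matmul_precision` is the standard PyTorch API named in the warning above; the GEMM shape below is illustrative, borrowed from the CMF 500x shapes in the stacked diff.

```python
import torch

# Opt in to lower-precision float32 matmuls, as the MTS benchmark suggests.
# On NVIDIA GPUs, 'high' typically means TF32: fp32 range is kept, but matmul
# inputs are rounded to ~10 mantissa bits, unlocking tensor-core throughput.
torch.set_float32_matmul_precision("high")
print(torch.get_float32_matmul_precision())  # -> 'high'

# Illustrative fp32 GEMM (M=512, K=19712, N=1024); with 'high' precision this
# dispatches to TF32 tensor cores on Ampere/Hopper instead of fp32 CUDA cores.
if torch.cuda.is_available():
    a = torch.randn(512, 19712, device="cuda")
    b = torch.randn(19712, 1024, device="cuda")
    c = a @ b
```

The setting is process-wide; calling `torch.set_float32_matmul_precision("highest")` restores full-fp32 matmuls if numerics ever regress.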


netlify bot commented Aug 24, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
|------|------|
| 🔨 Latest commit | 68134fa |
| 🔍 Latest deploy log | https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/68ac9885c7837000080b2b0e |
| 😎 Deploy Preview | https://deploy-preview-4767--pytorch-fbgemm-docs.netlify.app |

@meta-cla meta-cla bot added the cla signed label Aug 24, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D80908603

RandySheriff pushed a commit to RandySheriff/FBGEMM-1 that referenced this pull request Aug 25, 2025

Summary:
Pull Request resolved: pytorch#4766

X-link: facebookresearch/FBGEMM#1788

For H100, add a new option to the persistent Triton fp8 GEMM that boosts performance for the CMF 500x GEMM shapes (gains rechecked in the snippet below):
- M=512, N=1024, K=19712: flops go from 504 to 559, an ~11% gain.
- M=512, N=1024, K=171712: flops go from 437 to 481, an ~10% gain.

E2E, CMF 500x QPS was [25077.38](https://www.internalfb.com/intern/paste/P1915458454/); it is now [25319.47](https://www.internalfb.com/intern/paste/P1916364403/).

Differential Revision: D80881599
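
A quick sanity check of the per-shape gains quoted above (before/after values copied from the bullets; "flops" units as reported in the summary):

```python
# Recompute the ~11% and ~10% gains from the before/after flops numbers above.
gains = {
    "M=512, N=1024, K=19712": (504, 559),
    "M=512, N=1024, K=171712": (437, 481),
}
for shape, (before, after) in gains.items():
    pct = 100.0 * (after - before) / before
    print(f"{shape}: {before} -> {after} flops, +{pct:.1f}%")
# M=512, N=1024, K=19712: 504 -> 559 flops, +10.9%
# M=512, N=1024, K=171712: 437 -> 481 flops, +10.1%
```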
Summary:
Pull Request resolved: pytorch#4767

X-link: facebookresearch/FBGEMM#1789

As indicated by a previous MTS [benchmark](https://www.internalfb.com/intern/everpaste/?handle=GEHWGSBdCZC6Q5QFALLQJHKASQUnbsIXAAAB&phabricator_paste_number=1916364403):

```
TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled.
Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
```

This diff enables fp32 high-precision matmul. End-to-end, it pushes CMF500x QPS from [25319.47](https://www.internalfb.com/intern/everpaste/?handle=GEHWGSBdCZC6Q5QFALLQJHKASQUnbsIXAAAB&phabricator_paste_number=1916364403) to [25737.61](https://www.internalfb.com/intern/everpaste/?handle=GAsZ7R9ASEeoe9sIABHwORv6SG89bsIXAAAB&phabricator_paste_number=1916396963).

Reviewed By: beginner137

Differential Revision: D80908603
