
Conversation

RandySheriff

Summary:
As indicated by a previous MTS [benchmark](https://www.internalfb.com/intern/everpaste/?handle=GEHWGSBdCZC6Q5QFALLQJHKASQUnbsIXAAAB&phabricator_paste_number=1916364403):

```
TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled.
Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
```

This diff enables fp32 high-precision matmul. End-to-end, it pushes CMF500x QPS from [25319.47](https://www.internalfb.com/intern/everpaste/?handle=GEHWGSBdCZC6Q5QFALLQJHKASQUnbsIXAAAB&phabricator_paste_number=1916364403) to [25737.61](https://www.internalfb.com/intern/everpaste/?handle=GAsZ7R9ASEeoe9sIABHwORv6SG89bsIXAAAB&phabricator_paste_number=1916396963).

Differential Revision: D80908603
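
For context, a minimal sketch of the setting this diff enables. `torch.set_float32_matmul_precision` is the standard PyTorch API named in the warning above; the GEMM shape below is illustrative, borrowed from the CMF 500x shapes in the stacked diff.

```python
import torch

# Opt in to lower-precision float32 matmuls, as the MTS benchmark suggests.
# On NVIDIA GPUs, 'high' typically means TF32: fp32 range is kept, but matmul
# inputs are rounded to ~10 mantissa bits, unlocking tensor-core throughput.
torch.set_float32_matmul_precision("high")
print(torch.get_float32_matmul_precision())  # -> 'high'

# Illustrative fp32 GEMM (M=512, K=19712, N=1024); with 'high' precision this
# dispatches to TF32 tensor cores on Ampere/Hopper instead of fp32 CUDA cores.
if torch.cuda.is_available():
    a = torch.randn(512, 19712, device="cuda")
    b = torch.randn(19712, 1024, device="cuda")
    c = a @ b
```

The setting is process-wide; calling `torch.set_float32_matmul_precision("highest")` restores full-fp32 matmuls if numerics ever regress.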


netlify bot commented Aug 24, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
|------|------|
| 🔨 Latest commit | 68134fa |
| 🔍 Latest deploy log | https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/68ac9885c7837000080b2b0e |
| 😎 Deploy Preview | https://deploy-preview-4767--pytorch-fbgemm-docs.netlify.app |

@meta-cla meta-cla bot added the cla signed label Aug 24, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D80908603

RandySheriff pushed a commit to RandySheriff/FBGEMM-1 that referenced this pull request Aug 25, 2025

Summary:
Pull Request resolved: pytorch#4766

X-link: facebookresearch/FBGEMM#1788

For H100, add a new option to the persistent Triton fp8 GEMM that boosts performance for the CMF 500x GEMM shapes (gains rechecked in the snippet below):
- M=512, N=1024, K=19712: flops go from 504 to 559, an ~11% gain.
- M=512, N=1024, K=171712: flops go from 437 to 481, an ~10% gain.

E2E, CMF 500x QPS was [25077.38](https://www.internalfb.com/intern/paste/P1915458454/); it is now [25319.47](https://www.internalfb.com/intern/paste/P1916364403/).

Differential Revision: D80881599
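
A quick sanity check of the per-shape gains quoted above (before/after values copied from the bullets; "flops" units as reported in the summary):

```python
# Recompute the ~11% and ~10% gains from the before/after flops numbers above.
gains = {
    "M=512, N=1024, K=19712": (504, 559),
    "M=512, N=1024, K=171712": (437, 481),
}
for shape, (before, after) in gains.items():
    pct = 100.0 * (after - before) / before
    print(f"{shape}: {before} -> {after} flops, +{pct:.1f}%")
# M=512, N=1024, K=19712: 504 -> 559 flops, +10.9%
# M=512, N=1024, K=171712: 437 -> 481 flops, +10.1%
```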
Summary:
Pull Request resolved: pytorch#4767

X-link: facebookresearch/FBGEMM#1789

As indicated by a previous MTS [benchmark](https://www.internalfb.com/intern/everpaste/?handle=GEHWGSBdCZC6Q5QFALLQJHKASQUnbsIXAAAB&phabricator_paste_number=1916364403):

```
TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled.
Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
```

This diff enables fp32 high-precision matmul. End-to-end, it pushes CMF500x QPS from [25319.47](https://www.internalfb.com/intern/everpaste/?handle=GEHWGSBdCZC6Q5QFALLQJHKASQUnbsIXAAAB&phabricator_paste_number=1916364403) to [25737.61](https://www.internalfb.com/intern/everpaste/?handle=GAsZ7R9ASEeoe9sIABHwORv6SG89bsIXAAAB&phabricator_paste_number=1916396963).

Reviewed By: beginner137

Differential Revision: D80908603
