
[SWDEV-539215] - Autotune support for persistent reduction and no_x_dim removal #2417


Merged on Jul 30, 2025 (2 commits)


@jataylo jataylo commented Jul 25, 2025

We noticed that persistent reduction kernels can perform extremely poorly; see https://ontrack-internal.amd.com/browse/SWDEV-539215

The root cause is that, for certain kernels and size restrictions, "no_x_dim" mode is enabled, which hard-codes a static XBLOCK=1 into the kernel. XBLOCK therefore cannot be tuned, and the resulting configuration is not optimal. With this mode removed and autotune enabled we achieve 2x performance, which shows that new heuristics are needed.
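To illustrate why a pinned XBLOCK=1 limits tuning, here is a minimal, self-contained sketch. It is not Inductor code: `pick_best_xblock`, `toy_cost`, the candidate lists, and the cost constants are all hypothetical, and the cost model is invented for illustration rather than measured. A real autotuner benchmarks compiled kernels on the device instead.

```python
from typing import Callable, Iterable


def pick_best_xblock(candidates: Iterable[int], measure: Callable[[int], float]) -> int:
    """Return the candidate XBLOCK with the lowest measured cost."""
    return min(candidates, key=measure)


def toy_cost(xblock: int, xnumel: int = 4096,
             per_program: float = 1.0, per_row: float = 0.25) -> float:
    """Hypothetical cost model (NOT measured data).

    Each launched program pays a fixed overhead, and each program's work
    grows with the number of rows (XBLOCK) it handles.
    """
    programs = -(-xnumel // xblock)  # ceiling division: programs needed to cover xnumel rows
    return programs * per_program + xblock * per_row


# Assumed search spaces, for illustration only.
NO_X_DIM_CANDIDATES = [1]  # "no_x_dim": XBLOCK is baked in as 1
AUTOTUNE_CANDIDATES = [1, 2, 4, 8, 16, 32, 64, 128, 256]  # XBLOCK left tunable

pinned = pick_best_xblock(NO_X_DIM_CANDIDATES, toy_cost)
tuned = pick_best_xblock(AUTOTUNE_CANDIDATES, toy_cost)

print("pinned XBLOCK:", pinned, "cost:", toy_cost(pinned))
print("tuned XBLOCK:", tuned, "cost:", toy_cost(tuned))
```

With a single candidate the search can only ever return XBLOCK=1, whatever the cost function looks like. Once XBLOCK is left as a tunable parameter, the same search is free to pick a larger block if the measurements favor it.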

We will bring this into 2.7 for the perf uplift. Discussion with upstream on removing no_x_dim is ongoing; they agree to the removal provided there is no perf regression. The draft PR pytorch#159048 shows no perf loss on ROCm for any inductor benchmark.

The affected tests are removed because they are no longer relevant.

Cherry-picked to release/2.8 branch via #2454


rocm-repo-management-api bot commented Jul 25, 2025

Jenkins build for 34ad38c7edb1f03037c4c124daa0f5ee3c8b4ddf commit finished as FAILURE
Links: Blue Ocean view / Build artifacts


rocm-repo-management-api bot commented Jul 28, 2025

Jenkins build for 21a8eded2cbea28dce93e962ae0ed253432b3e22 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts


jataylo commented Jul 30, 2025

UTs seem good

@jithunnair-amd jithunnair-amd merged commit 6c845c6 into ROCm:release/2.7 Jul 30, 2025
1 of 6 checks passed
@pragupta

@jataylo can you please cherry-pick this into release/2.8 as well? This is not cherry-picking cleanly.


jataylo commented Aug 4, 2025

! cherry-pick --onto release/2.8


okakarpa commented Aug 4, 2025

Created branch autogenerated/release/2.8_cherry-pick_pr-2417 and #2454. It contains a merge conflict. Please resolve it

jataylo added a commit to jataylo/pytorch that referenced this pull request Aug 11, 2025
…im removal (ROCm#2417)


(cherry picked from commit 6c845c6)
jataylo added a commit that referenced this pull request Aug 11, 2025
…ersistent reduction and no_x_dim removal (#2454)

Cherry-pick of #2417 
Need to resolve conflicts

---------

Co-authored-by: Jack Taylor <[email protected]>