
[rocm6.4_internal_testing][SWDEV-535305] Fixed test_extra_cuda_context in test_c10d_nccl.py and refactored is_navi3_arch function #2341

Merged

@akashveramd akashveramd commented Jul 10, 2025

In this PR, I have added a sleep statement before the collectives. NAVI_ARCH needs this extra sleep because rccl_init, which is called inside init_process_group, runs in a separate process and takes longer to finish on NAVI_ARCH. Sleeping here ensures that the init has completed successfully and that mem_get_info returns stable numbers.
Note that the test already had a sleep statement after the collectives.
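To illustrate the ordering described above, here is a minimal sketch. It is not the code from this PR: `measure_around_collectives`, its parameters, and the 5-second default are hypothetical names and values chosen for illustration. In the real test the readings would come from mem_get_info and the work would be the collectives on the process group; here both are injected as callables so the sketch runs without a GPU.

```python
import time
from typing import Callable, Tuple


def measure_around_collectives(
    run_collectives: Callable[[], None],
    read_free_memory: Callable[[], int],
    needs_settle_delay: bool,
    settle_seconds: float = 5.0,  # hypothetical value, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, int]:
    """Return free memory read before and after the collectives."""
    if needs_settle_delay:
        # Extra sleep before the collectives: give the backend init
        # time to finish so the baseline reading is stable.
        sleep(settle_seconds)
    free_before = read_free_memory()

    run_collectives()

    # Sleep after the collectives, which the PR says already existed.
    sleep(settle_seconds)
    free_after = read_free_memory()
    return free_before, free_after
```

Passing `needs_settle_delay=False` skips only the first sleep, which is how an architecture check could gate the extra wait without changing behavior elsewhere.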
I also refactored the is_navi3_arch function into is_arch, which takes a list of architectures as an argument and compares it against the architecture of the device in use.
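A minimal sketch of what a list-based architecture check can look like, under stated assumptions. The signature is hypothetical: this version takes the current architecture name as a second argument so it can run without a GPU, whereas the function in the PR is described as comparing against the existing arch and presumably reads it from the device. The architecture names in the usage lines are illustrative only.

```python
from typing import Iterable


def is_arch(arch_list: Iterable[str], current_arch: str) -> bool:
    """Return True if current_arch matches any entry in arch_list."""
    # Architecture strings can carry feature suffixes, for example
    # "gfx90a:sramecc+:xnack-", so compare only the base name.
    base_name = current_arch.split(":", 1)[0]
    return base_name in set(arch_list)


# Illustrative usage with example architecture names.
print(is_arch(["gfx1100", "gfx1101"], "gfx1100"))
print(is_arch(["gfx1100", "gfx1101"], "gfx90a:sramecc+:xnack-"))
```

Taking the list as an argument means one helper can serve several architecture families instead of needing a separate function per family.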

Tested with Docker image:
compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-rel-6.4:108_ubuntu22.04_py3.10_pytorch_rocm6.4_internal_testing_aeb5d79

Cherry-picked to release/2.5 branch via #2380

Cherry-picked to release/2.6 branch via #2381

Cherry-picked to release/2.7 branch via #2382

Cherry-picked to rocm7.0_internal_testing branch via #2383

…da_context test in distributed.test_c10d_nccl to pass. Refactored is_navi3_arch function to is_arch that takes arch list as an argument and compares with existing arch.
@akashveramd akashveramd self-assigned this Jul 10, 2025
@pragupta pragupta changed the title from "[ROCm6.4_Internal_Testing][Jira 535305] Fixed test_extra_cuda_context test in distributed.test_c10d_nccl and refactored is_navi3_arch function" to "[rocm6.4_internal_testing][SWDEV-535305] Fixed test_extra_cuda_context in test_c10d_nccl.py and refactored is_navi3_arch function" Jul 10, 2025
@rocm-repo-management-api

Jenkins build for 16d0ecfca0cb49846a295be13460936db4e8c1a6 commit is in progress
Links: Blue Ocean view / Build artifacts

…est_extra_cuda_context test to match PR comment.

rocm-repo-management-api bot commented Jul 11, 2025

Jenkins build for 9bfc384bacc57baec5752f9beac6056288a2b7c7 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@akashveramd akashveramd merged commit 71a21d9 into rocm6.4_internal_testing Jul 14, 2025
0 of 5 checks passed
@akashveramd akashveramd deleted the av_rocm6.4_internal_testing_jira_535305 branch July 14, 2025 16:56
@pragupta

! cherry-pick --onto release/2.5 release/2.6 release/2.7 rocm7.0_internal_testing

@pragupta pragupta restored the av_rocm6.4_internal_testing_jira_535305 branch July 16, 2025 16:43
okakarpa pushed a commit that referenced this pull request Jul 16, 2025
…xt` in `test_c10d_nccl.py` and refactored is_navi3_arch function (#2341)

@okakarpa
Collaborator

Created branch autogenerated/release/2.5_cherry-pick_pr-2341 and #2380. It contains a merge conflict. Please resolve it

Created branch autogenerated/release/2.6_cherry-pick_pr-2341 and #2381

Created branch autogenerated/release/2.7_cherry-pick_pr-2341 and #2382. It contains a merge conflict. Please resolve it

Created branch autogenerated/rocm7.0_internal_testing_cherry-pick_pr-2341 and #2383. It contains a merge conflict. Please resolve it
