[AUTOGENERATED] [release/2.6] [rocm6.4_internal_testing][SWDEV-535305] Fixed test_extra_cuda_context in test_c10d_nccl.py and refactored is_navi3_arch function #2381

Draft
wants to merge 1 commit into
base: release/2.6
from

Conversation

okakarpa
Collaborator

Cherry-pick of #2341

…xt` in `test_c10d_nccl.py` and refactored is_navi3_arch function (#2341)

This PR adds a sleep statement before the collectives. The extra sleep is
needed on NAVI_ARCH because rccl_init inside init_process_group runs in a
separate process and takes longer to finish on NAVI_ARCH. Sleeping here
ensures that the init has completed successfully and that mem_get_info
returns stable numbers. Note that the test already had a sleep statement
after the collectives.
This PR also refactors the is_navi3_arch function into is_arch, which takes
a list of architectures as an argument and compares it against the
architecture of the current device.
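
For readers without the diff at hand, here is a rough sketch of the two ideas described above: an arch check that takes the list to match against, and a sleep gated on that check. It is not the code from this PR. The names `sleep_before_collectives` and `gcn_arch_name`, the contents of `NAVI_ARCH`, and the 5-second default are all assumptions made for illustration, and the device arch string is passed in as a parameter so the sketch runs without a GPU.

```python
import time

# Hypothetical arch list for illustration only; the real list is defined
# in the PyTorch test utilities and may differ.
NAVI_ARCH = ("gfx1100", "gfx1101")


def is_arch(arch_list, gcn_arch_name):
    """Return True if the device's base gfx arch appears in arch_list.

    gcn_arch_name is assumed to look like "gfx1100" or
    "gfx90a:sramecc+:xnack-"; only the part before the first ':' is compared.
    """
    if not gcn_arch_name:
        return False
    base_arch = gcn_arch_name.split(":")[0]
    return base_arch in arch_list


def sleep_before_collectives(arch_list, gcn_arch_name, seconds=5.0, sleep=time.sleep):
    """Sleep before running collectives if the device arch is in arch_list.

    The duration is an assumed placeholder. Returns True if a sleep happened.
    """
    if is_arch(arch_list, gcn_arch_name):
        sleep(seconds)
        return True
    return False
```

Passing the arch list as an argument means the same helper can serve other architectures without adding a new function per family.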

Tested with docker image:

compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-rel-6.4:108_ubuntu22.04_py3.10_pytorch_rocm6.4_internal_testing_aeb5d79
@rocm-repo-management-api

rocm-repo-management-api bot commented Jul 16, 2025

Jenkins build for 6f7c92e6bed5d46e7f536096e2ffa96777d7103c commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@akashveramd - I am seeing this UT fail on MI350 as well. Let's remove the check for navi arch.


3 participants