[rocm6.4_internal_testing][SWDEV-535305] Fixed test_extra_cuda_context in test_c10d_nccl.py and refactored is_navi3_arch function #2341
…da_context test in distributed.test_c10d_nccl to pass. Refactored is_navi3_arch function to is_arch that takes arch list as an argument and compares with existing arch.
…est_extra_cuda_context test.
…est_extra_cuda_context test.
Jenkins build for 16d0ecfca0cb49846a295be13460936db4e8c1a6 commit is in progress
…est_extra_cuda_context test to match PR comment.
Jenkins build for 9bfc384bacc57baec5752f9beac6056288a2b7c7 commit finished as FAILURE
! cherry-pick --onto release/2.5 release/2.6 release/2.7 rocm7.0_internal_testing
…xt` in `test_c10d_nccl.py` and refactored is_navi3_arch function (#2341)
Created branch autogenerated/release/2.5_cherry-pick_pr-2341 and #2380. It contains a merge conflict. Please resolve it.
Created branch autogenerated/release/2.6_cherry-pick_pr-2341 and #2381.
Created branch autogenerated/release/2.7_cherry-pick_pr-2341 and #2382. It contains a merge conflict. Please resolve it.
Created branch autogenerated/rocm7.0_internal_testing_cherry-pick_pr-2341 and #2383. It contains a merge conflict. Please resolve it.
This PR adds a sleep statement before the collectives. The extra sleep is needed on NAVI_ARCH because rccl_init inside init_process_group runs in a separate process and takes longer to finish there. Sleeping at this point ensures that the init has completed and that mem_get_info returns stable numbers.
Note that the test already had a sleep statement after the collectives.
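As a rough illustration of the "sleep, then measure" idea described above, here is a minimal sketch. It is not the code from this PR: `measure_after_settle`, `needs_settle`, and `settle_seconds` are hypothetical names, the wait duration is a placeholder (the PR text does not state the value used), and the measurement is a stand-in callable so the snippet runs without a GPU. In the real test the measurement would be `mem_get_info`.

```python
import time


def measure_after_settle(measure, needs_settle, settle_seconds, sleep=time.sleep):
    """Optionally wait for background initialization, then take a measurement.

    measure        -- zero-argument callable returning the value of interest
                      (a stand-in for a memory query such as mem_get_info)
    needs_settle   -- True when the current architecture needs the extra wait
    settle_seconds -- how long to wait (hypothetical; not taken from the PR)
    sleep          -- injectable sleep function, so the sketch can be tested
    """
    if needs_settle:
        sleep(settle_seconds)
    return measure()


# Usage with stand-ins: record the requested sleep instead of really waiting.
recorded_sleeps = []
free_bytes = measure_after_settle(
    measure=lambda: 1024,          # placeholder for a real memory query
    needs_settle=True,             # e.g. the result of an architecture check
    settle_seconds=0.01,
    sleep=recorded_sleeps.append,
)
```

Injecting `sleep` keeps the sketch fast to run; the actual test presumably calls `time.sleep` directly.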
Also refactored the is_navi3_arch function into is_arch, which takes a list of architectures as an argument and compares it against the current device's architecture.
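A minimal sketch of what such a generalized check could look like follows. It is an assumption-laden illustration, not the function from this PR: per the description, the real `is_arch` takes only the arch list and reads the current architecture itself, whereas here `current_arch` is passed in explicitly so the snippet runs without a GPU. The `NAVI3_ARCH` tuple and the handling of `:`-separated feature suffixes are also assumptions.

```python
def is_arch(arch_list, current_arch):
    """Return True if current_arch matches one of the names in arch_list.

    current_arch is passed in explicitly here (an assumption made for this
    sketch); an architecture string may carry feature suffixes such as
    "gfx90a:sramecc+:xnack-", so only the part before the first ':' is compared.
    """
    base_name = current_arch.split(":")[0]
    return base_name in arch_list


# Illustrative list only; the PR text does not show the actual arch list.
NAVI3_ARCH = ("gfx1100", "gfx1101")

print(is_arch(NAVI3_ARCH, "gfx1100"))                 # True
print(is_arch(NAVI3_ARCH, "gfx90a:sramecc+:xnack-"))  # False
```

Compared with a hard-coded `is_navi3_arch`, a list-taking helper lets each test state which architectures it cares about without adding a new function per GPU family.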
Tested with Docker image:
compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-rel-6.4:108_ubuntu22.04_py3.10_pytorch_rocm6.4_internal_testing_aeb5d79
Cherry-picked to release/2.5 branch via #2380
Cherry-picked to release/2.6 branch via #2381
Cherry-picked to release/2.7 branch via #2382
Cherry-picked to rocm7.0_internal_testing branch via #2383