@k-artem k-artem commented Jul 23, 2025

Fixes for regular and distributed unit tests, including CUDA and native code.
Also includes a partial cherry-pick of [release/2.5][SWDEV-489778] NAVI4x UT parity for distributed config (#2327).
Fixes #SWDEV-523736

rocm-repo-management-api bot commented Jul 23, 2025

Jenkins build for 7a4c23426839d6a6e563e2a289d840b45ab1f054 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

rocm-repo-management-api bot commented Jul 24, 2025

Jenkins build for 233a668bd590fe54f865050c15bd24ef32c8b15a commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

rocm-repo-management-api bot commented Jul 27, 2025

Jenkins build for b934af2855b0de92ef517c18433a2e90d293e619 commit finished as NOT_BUILT
Links: Blue Ocean view / Build artifacts

rocm-repo-management-api bot commented Jul 27, 2025

Jenkins build for b934af2855b0de92ef517c18433a2e90d293e619 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

rocm-repo-management-api bot commented Jul 28, 2025

Jenkins build for b30a77e88246eae96edf436215a6825f8e6e2dd9 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

pragupta added 2 commits July 28, 2025 16:21
[release/2.5][SWDEV-489778] NAVI4x UT parity for distributed config (#2327)
I did a sweep of all the distributed failures on NAVI4x. At a high
level, we were running into the following issues:
- MEM_EFF_ATTENTION is not supported on NAVI4x for 2.5, causing
"tensors not alike" failures
- Some UTs pass in future releases, so those were skipped
- Some needed slight tolerance adjustments because this branch uses
hipblas, whereas future branches use hipblaslt

Fixes #ISSUE_NUMBER
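
The tolerance point in the commit message above can be illustrated with a small stand-alone sketch. It uses only the Python standard library, not the PyTorch test helpers, and the helper name `all_close` and the tolerance values are assumptions chosen for illustration. The idea is that two backends may accumulate the same values in a different order, so results differ in the last bits and tests compare within a tolerance rather than exactly.

```python
import math


def all_close(actual, expected, rel_tol=1e-9, abs_tol=0.0):
    """Return True if every pair of values agrees within the given tolerances.

    Hypothetical helper for illustration; real tests would use the
    framework's own comparison utilities.
    """
    return len(actual) == len(expected) and all(
        math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, e in zip(actual, expected)
    )


# The same three values summed in two different orders, standing in for
# two math libraries that accumulate in a different order.
forward = (0.1 + 0.2) + 0.3
backward = (0.3 + 0.2) + 0.1

exactly_equal = forward == backward          # bitwise comparison
within_tolerance = all_close([forward], [backward])
```

An exact comparison is too strict here, while a small relative tolerance accepts both results.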
@k-artem k-artem requested a review from pruthvistony July 31, 2025 13:52
k-artem commented Jul 31, 2025

@pruthvistony please review

rocm-repo-management-api bot commented Jul 31, 2025

Jenkins build for e059c6cae3043510c49bc05ce343ea28e90935cf commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

rocm-repo-management-api bot commented Aug 1, 2025

Jenkins build for 66fdc1da3e3264605c13f3341f220100b34ac937 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

'subgraph_1', (y0, x0)); invoke_subgraph_3 = None
invoke_subgraph = torch.ops.higher_order.invoke_subgraph(subgraph_0, \
'subgraph_0', (l_y_, l_x_))
invoke_subgraph_3 = torch.ops.higher_order.invoke_subgraph(subgraph_1, 'subgraph_1', (x0, y0)); invoke_subgraph_3 = None
Collaborator

Why were all these changes required?
Maintenance will be a problem going forward if we make these changes.

Author

@k-artem k-artem Aug 3, 2025

release/2.6 is missing commit 5c3996c#diff-356a32190ce9ad6efc3e6fd7cb30a179394ce2170c0472b79edc7b24a47b7e95R271, which sorts the graph regions (without it an assert triggers), so after these changes the code will be in sync with the main and release/2.7 branches. Rerunning the tests with EXPECTTEST_ACCEPT=1 also helps adjust the expected values automatically to match the main code changes.
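
The EXPECTTEST_ACCEPT=1 workflow mentioned above can be pictured with a toy model. This is not the expecttest library, which rewrites the expected string inside the test source file; the function `check_expected` is a hypothetical stand-in, and treating the value "1" as the switch is an assumption of this sketch.

```python
import os
from typing import Optional


def check_expected(actual: str, expected: str, accept: Optional[bool] = None) -> str:
    """Compare actual with expected and return the expectation to keep.

    In normal mode a mismatch raises AssertionError. In accept mode the
    mismatch is resolved by adopting the actual value as the new expectation.
    """
    if accept is None:
        # Assumption: "1" enables accept mode, mirroring EXPECTTEST_ACCEPT=1.
        accept = os.environ.get("EXPECTTEST_ACCEPT") == "1"
    if actual == expected:
        return expected
    if accept:
        return actual
    raise AssertionError(f"expected {expected!r}, got {actual!r}")
```

Passing `accept=True` explicitly, or setting the environment variable before the call, turns a failing comparison into an updated expectation, which is why regenerated expected values should still be reviewed in the diff.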

@@ -15,6 +15,7 @@
from torch.testing import FileCheck
from torch.testing._internal.common_cuda import xfailIfSM89
from torch.testing._internal.common_device_type import expectedFailureXPU
from torch.testing._internal.common_utils import skipIfRocm
Collaborator

Not required

Author

It is required for the next test case in this file, at line 431:

    @skipIfRocm #This test requires triton version 3.3+
    def test_split_scan(self):
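
For readers unfamiliar with this kind of decorator, here is a minimal stand-alone sketch of a platform skip built on the standard `unittest` module. It is not the implementation of `skipIfRocm` from `torch.testing._internal.common_utils`; the flag `RUNNING_ON_ROCM`, the decorator `skip_if_rocm`, and the test class are hypothetical names used only for illustration.

```python
import unittest

# Hypothetical flag; a real suite would detect the platform at import time.
RUNNING_ON_ROCM = True


def skip_if_rocm(reason):
    """Return a decorator that skips the test when RUNNING_ON_ROCM is set."""
    return unittest.skipIf(RUNNING_ON_ROCM, reason)


class ScanTests(unittest.TestCase):
    @skip_if_rocm("This test requires triton version 3.3+")
    def test_split_scan(self):
        # Never executed while the flag is True; reported as skipped instead.
        self.fail("should have been skipped")

    def test_always_runs(self):
        self.assertEqual(1 + 1, 2)


def run_suite():
    """Run ScanTests and return the collected unittest.TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ScanTests)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

The skipped test is recorded with its reason rather than counted as a failure, so the missing import would otherwise surface as a `NameError` when the test module is loaded.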

using dtype = OpaqueType<sizeof(scalar_t)>;
index_kernel_impl<dtype>(iter, index_size, index_stride);
});
static void index_kernel(
Collaborator

This change needs to be done for Navi only.

Author

@k-artem k-artem Aug 3, 2025

In the main branch it is done without a Navi-only restriction -> d02c396. Do you expect that it could break MI on 2.6?

Collaborator

Oh, this is FP8 enablement for the index operator.
Why is it required to cherry-pick this feature into rel/2.6? Feature backporting shouldn't be done unless it is a very important customer request. So can we skip this change?

Author

@k-artem k-artem Aug 5, 2025

I think so; there is no feature request from a customer, so I will drop it. Done.

rocm-repo-management-api bot commented Aug 6, 2025

Jenkins build for 870cde981d39ed1850340042f3aff0112816dc96 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

Detected error during Pytorch building:

[7469/8040] Building CXX object caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/CudaDMAConnectivity.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[7470/8040] Building CXX object caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[7471/8040] Building CXX object caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/UCCUtils.cpp.o
FAILED: caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/UCCUtils.cpp.o 
/opt/cache/bin/sccache /opt/cache/bin/c++ -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DPYTORCH_LAYERNORM_FAST_RECIPROCAL -DROCM_VERSION=60401 -DTORCH_ENABLE_LLVM -DTORCH_HIP_BUILD_MAIN_LIB -DTORCH_HIP_VERSION=604 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_PROF_API=1 -DUSE_ROCM -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_AMD__=1 -Dtorch_hip_EXPORTS -I/var/lib/jenkins/pytorch/build/aten/src -I/var/lib/jenkins/pytorch/aten/src -I/var/lib/jenkins/pytorch/build -I/var/lib/jenkins/pytorch -I/var/lib/jenkins/pytorch/cmake/../third_party/benchmark/include -I/opt/llvm/include -I/var/lib/jenkins/pytorch/third_party/onnx -I/var/lib/jenkins/pytorch/build/third_party/onnx -I/var/lib/jenkins/pytorch/nlohmann -I/opt/rocm/hcc/include -I/opt/rocm/rocblas/include -I/opt/rocm/hipsparse/include -I/opt/rocm/include/rccl -I/var/lib/jenkins/pytorch/aten/src/THH -I/var/lib/jenkins/pytorch/aten/src/ATen/hip -I/var/lib/jenkins/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include -I/var/lib/jenkins/pytorch/aten/src/ATen/../../../third_party/composable_kernel/library/include -I/var/lib/jenkins/pytorch/third_party/fmt/include -I/var/lib/jenkins/pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck -I/var/lib/jenkins/pytorch/build/caffe2/aten/src -I/var/lib/jenkins/pytorch/aten/src/ATen/.. -I/var/lib/jenkins/pytorch/torch/include -I/var/lib/jenkins/pytorch/c10/hip/../.. -I/var/lib/jenkins/pytorch/c10/.. 
-I/var/lib/jenkins/pytorch/torch/csrc/api -I/var/lib/jenkins/pytorch/torch/csrc/api/include -I/var/lib/jenkins/pytorch/build/third_party/gloo/hip -isystem /opt/rocm-6.4.1/include -isystem /var/lib/jenkins/pytorch/build/third_party/gloo -isystem /var/lib/jenkins/pytorch/cmake/../third_party/gloo -isystem /var/lib/jenkins/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/googletest/googletest/include -isystem /var/lib/jenkins/pytorch/third_party/protobuf/src -isystem /opt/conda/envs/py_3.12/include -isystem /var/lib/jenkins/pytorch/third_party/XNNPACK/include -isystem /var/lib/jenkins/pytorch/third_party/ittapi/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/eigen -isystem /var/lib/jenkins/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /var/lib/jenkins/pytorch/third_party/ideep/include -isystem /var/lib/jenkins/pytorch/INTERFACE -isystem /var/lib/jenkins/pytorch/third_party/nlohmann/include -isystem /opt/rocm/include -isystem /opt/rocm-6.4.1/include/hiprand -isystem /opt/rocm-6.4.1/include/rocrand -isystem /opt/rocm/magma/include -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format 
-Wno-error=dangling-reference -Wno-error=redundant-move -Wno-stringop-overflow -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3 -DNDEBUG -DNDEBUG -std=gnu++17 -fPIC -DMKL_HAS_SBGEMM -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -fvisibility=hidden -O2 -fPIC -D__HIP_PLATFORM_AMD__=1 -DCUDA_HAS_FP16=1 -DUSE_ROCM -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DTORCH_HIP_VERSION=604 -Wno-shift-count-negative -Wno-shift-count-overflow -Wno-duplicate-decl-specifier -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIPBLAS_V2 -DHIPBLASLT_VEC_EXT -D_GLIBCXX_USE_CXX11_ABI=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS -DHIP_VERSION=6 -DUSE_MIOPEN -MD -MT caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/UCCUtils.cpp.o -MF caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/UCCUtils.cpp.o.d -o caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/UCCUtils.cpp.o -c /var/lib/jenkins/pytorch/torch/csrc/distributed/c10d/UCCUtils.cpp
sccache: encountered fatal error
sccache: error : Invalid checksum
sccache:  cause: Invalid checksum
[7472/8040] Building CXX object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/Distributions.cpp.o

@k-artem k-artem requested a review from pruthvistony August 7, 2025 08:20
rocm-repo-management-api bot commented Aug 11, 2025

Jenkins build for 8e6b0ab5d4751dafdddc28e963d7bec14331aba5 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@fjankovi

@pruthvistony Can you please review?

k-artem commented Aug 19, 2025

@pruthvistony please review.
