Description
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
main b79b3e9
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status
08e41ed 3rd-party/openpmix (v1.1.3-4067-g08e41ed5)
30cadc6746ebddd69ea42ca78b964398f782e4e3 3rd-party/prrte (psrvr-v2.0.0rc1-4839-g30cadc6746)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (dfff675)
Please describe the system on which you are running
- Operating system/version: SLES
- Computer hardware: x86, using CPU only
- Network type: Single node, using --mca btl tcp,sm,self
Details of the problem
MPIX_Comm_shrink intermittently hangs and never returns, even though all ranks have entered the call (verified with GDB). I'm using these MCA parameters:
--mca opal_base_help_aggregate 0 --mca coll ^han --mca btl tcp,sm,self --mca pml ob1 --mca opal_abort_delay -1
and these configuration flags:
--enable-static --enable-shared --disable-oshmem --disable-mpi-fortran --with-slurm --with-ft --with-libevent=internal --with-hwloc=internal --with-prrte=internal --with-pmix=internal --disable-sphinx
This is true with and without configuring with --debug, but with debug I also get the following error from a nondeterministic rank:
../../../../opal/class/opal_list.h:545: _opal_list_append: Assertion `0 == item->opal_list_item_refcount' failed.
Here's the stack trace from a rank that failed that assertion:
#0 0x00007fd5efcbf121 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007fd5efcc4e43 in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007fd5efcc4d5a in sleep () from /lib64/libc.so.6
No symbol table info available.
#3 0x00007fd5efa92d2a in opal_delay_abort () at ../../../../../opal/util/error.c:230
delay = -1
pid = 30618
msg = "[nid00"...
#4 0x00007fd5efa9d649 in show_stackframe (signo=6, info=0x7ffd695af470, p=0x7ffd695af340) at ../../../../../opal/util/stacktrace.c:498
print_buffer = "[nid00"...
tmp = 0x7ffd695aef3b ""
size = 949
ret = 47
si_code_str = 0x7fd5efb4be78 ""
#5 <signal handler called>
No symbol table info available.
#6 0x00007fd5efc2dd2b in raise () from /lib64/libc.so.6
No symbol table info available.
#7 0x00007fd5efc2f3e5 in abort () from /lib64/libc.so.6
No symbol table info available.
#8 0x00007fd5efc25c6a in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#9 0x00007fd5efc25cf2 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#10 0x00007fd5f08564a6 in _opal_list_append (list=0x7fd5f0d5bda0 <ompi_comm_requests_active>, item=0x6eaa78, FILE_NAME=0x7fd5f0c10db8 "../.."..., LINENO=186) at ../../../../opal/class/opal_list.h:545
sentinel = 0x7fd5f0d5bdc8 <ompi_comm_requests_active+40>
__PRETTY_FUNCTION__ = "_opal"...
#11 0x00007fd5f0857612 in ompi_comm_request_start (request=0x6eaa78) at ../../../../ompi/communicator/comm_request.c:186
No locals.
#12 0x00007fd5f0855c30 in ompi_comm_ft_allreduce_intra_nb (inbuf=0x71c894, outbuf=0x71c890, count=1, op=0x432580 <ompi_mpi_op_max>, cid_context=0x71c840, req=0x7ffd695b0460) at ../../../../ompi/communicator/comm_cid.c:1723
rc = 0
context = 0x7581e0
request = 0x6eaa78
subreq = 0x72d080
comm = 0x47c690
__PRETTY_FUNCTION__ = "ompi_"...
failed_group = 0x758208
#13 0x00007fd5f0852589 in ompi_comm_allreduce_getnextcid (request=0x6eaa78) at ../../../../ompi/communicator/comm_cid.c:688
context = 0x71c840
my_id = 34359738368
subreq = 0x7ffd695b0490
flag = true
ret = 0
participate = 1
#14 0x00007fd5f085740f in ompi_comm_request_progress () at ../../../../ompi/communicator/comm_request.c:154
request_item = 0x754fd0
item_complete = 1
rc = 0
request = 0x6eaa78
next = 0x7fd5f0d5bdc8 <ompi_comm_requests_active+40>
progressing = 1
completed = 0
__PRETTY_FUNCTION__ = "ompi_"...
#15 0x00007fd5efa590ed in opal_progress () at ../../../../opal/runtime/opal_progress.c:224
num_calls = 553242877
i = 3
events = 1
#16 0x00007fd5f0850058 in ompi_request_wait_completion (req=0x6eaa78) at ../../../../ompi/request/request.h:493
__PRETTY_FUNCTION__ = "ompi_"...
#17 0x00007fd5f0852366 in ompi_comm_nextcid (newcomm=0x754df0, comm=0x47c690, bridgecomm=0x0, arg0=0x0, arg1=0x0, send_first=true, mode=2048) at ../../../../ompi/communicator/comm_cid.c:632
req = 0x6eaa78
rc = 0
#18 0x00007fd5f08590e0 in ompi_comm_shrink_internal (comm=0x47c690, newcomm=0x7ffd695b0708) at ../../../../ompi/communicator/ft/comm_ft.c:342
rc = 0
exit_status = 0
flag = 1
failed_group = 0x70ac80
comm_group = 0x751fe0
alive_group = 0x755040
alive_rgroup = 0x0
newcomp = 0x754df0
mode = 2048
start = 49.176719257000002
stop = 49.176719192999997
__PRETTY_FUNCTION__ = "ompi_"...
#19 0x00007fd5f0c06ef9 in MPIX_Comm_shrink (comm=0x47c690, newcomm=0x7ffd695b0708) at ../../../../../../../ompi/mpiext/ftmpi/c/comm_shrink.c:48
rc = 0
...
Note that in frame 14 we are working on progress for request 0x6eaa78, but in frame 12 that same request was pulled from the free list and returned by ompi_comm_request_get. So I think the problem is that, somehow, the request is in both the ompi_comm_requests free list and the ompi_comm_requests_active list at the same time. You can also see that this request is first obtained in frame 17 by ompi_comm_nextcid.
I'll try to get a good MWE going, but I'm hoping this info helps point in the right direction.
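In the meantime, here is a minimal sketch of the kind of shrink-after-failure pattern this code path serves. To be clear, this is not my reproducer, just the textbook ULFM sequence (kill one rank, revoke, then have all survivors call MPIX_Comm_shrink); the actual MWE will follow.

```c
/* Minimal sketch of the usual ULFM shrink-after-failure sequence (illustrative
 * only, not my reproducer). Build against an FT-enabled Open MPI and launch
 * with at least two ranks, e.g. with the mpirun options listed above. */
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_Comm_revoke / MPIX_Comm_shrink */
#include <signal.h>

int main(int argc, char **argv)
{
    int rank, size, rc;
    MPI_Comm shrunk;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Process failures must be reported to the caller, not abort the job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (rank == size - 1) {
        raise(SIGKILL);                    /* simulate a process failure */
    }

    rc = MPI_Barrier(MPI_COMM_WORLD);      /* survivors may see a failure error */
    if (MPI_SUCCESS != rc) {
        MPIX_Comm_revoke(MPI_COMM_WORLD);  /* make sure every survivor notices */
    }

    /* All survivors participate; this is the call that intermittently hangs. */
    MPIX_Comm_shrink(MPI_COMM_WORLD, &shrunk);

    MPI_Comm_free(&shrunk);
    MPI_Finalize();
    return 0;
}
```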