
IcoswISC240_WOA23_performance_test failing on Chicoma #882

Open
@altheaden

Description

IcoswISC240_WOA23_performance_test is failing on Chicoma.
Test log:

compass calling: compass.ocean.tests.global_ocean.performance_test.PerformanceTest.run()
  inherited from: compass.testcase.TestCase.run()
  in /users/althea/code/compass/main/compass/testcase.py

compass calling: compass.run.serial._run_test()
  in /users/althea/code/compass/main/compass/run/serial.py

Running steps:
  prognostic_ice_shelf_melt
  data_ice_shelf_melt

  * step: prognostic_ice_shelf_melt

compass calling: compass.ocean.tests.global_ocean.forward.ForwardStep.runtime_setup()
  in /users/althea/code/compass/main/compass/ocean/tests/global_ocean/forward.py

Warning: replacing namelist options in namelist.ocean
config_dt = 02:00:00
config_btr_dt = 00:06:00

compass calling: compass.ocean.tests.global_ocean.forward.ForwardStep.run()
  in /users/althea/code/compass/main/compass/ocean/tests/global_ocean/forward.py

Warning: replacing namelist options in namelist.ocean
config_pio_num_iotasks = 1
config_pio_stride = 36
Running: gpmetis graph.info 36
******************************************************************************
METIS 5.0 Copyright 1998-13, Regents of the University of Minnesota
 (HEAD: , Built on: Jan  8 2025, 16:43:49)
 size of idx_t: 64bits, real_t: 64bits, idx_t *: 64bits

Graph Information -----------------------------------------------------------
 Name: graph.info, #Vertices: 7301, #Edges: 21002, #Parts: 36

Options ---------------------------------------------------------------------
 ptype=kway, objtype=cut, ctype=shem, rtype=greedy, iptype=metisrb
 dbglvl=0, ufactor=1.030, no2hop=NO, minconn=NO, contig=NO, nooutput=NO
 seed=-1, niter=10, ncuts=1

Direct k-way Partitioning ---------------------------------------------------
 - Edgecut: 1446, communication volume: 1535.

 - Balance:
     constraint #0:  1.026 out of 0.005

 - Most overweight partition:
     pid: 25, actual: 208, desired: 202, ratio: 1.03.

 - Subdomain connectivity: max: 6, min: 2, avg: 4.33

 - Each partition is contiguous.

Timing Information ----------------------------------------------------------
  I/O:          		   0.004 sec
  Partitioning: 		   0.016 sec   (METIS time)
  Reporting:    		   0.001 sec

Memory Information ----------------------------------------------------------
  Max memory used:		   1.575 MB
******************************************************************************

Running: srun -c 1 -N 1 -n 36 ./ocean_model -n namelist.ocean -s streams.ocean
PE 0: MPICH processor detected:
PE 0:   AMD Rome (23:49:0) (family:model:stepping)
MPI VERSION    : CRAY MPICH version 8.1.28.29 (ANL base 3.4a2)
MPI BUILD INFO : Wed Nov 15 20:57 2023 (git hash 1cde46f) (CH4)
PE 0: MPICH environment settings =====================================
PE 0:   MPICH_ENV_DISPLAY                              = 1
PE 0:   MPICH_VERSION_DISPLAY                          = 1
PE 0:   MPICH_ABORT_ON_ERROR                           = 0
PE 0:   MPICH_CPUMASK_DISPLAY                          = 0
PE 0:   MPICH_STATS_DISPLAY                            = 0
PE 0:   MPICH_RANK_REORDER_METHOD                      = 1
PE 0:   MPICH_RANK_REORDER_DISPLAY                     = 0
PE 0:   MPICH_MEMCPY_MEM_CHECK                         = 0
PE 0:   MPICH_USE_SYSTEM_MEMCPY                        = 0
PE 0:   MPICH_OPTIMIZED_MEMCPY                         = 1
PE 0:   MPICH_ALLOC_MEM_PG_SZ                          = 4096
PE 0:   MPICH_ALLOC_MEM_POLICY                         = PREFERRED
PE 0:   MPICH_ALLOC_MEM_AFFINITY                       = SYS_DEFAULT
PE 0:   MPICH_MALLOC_FALLBACK                          = 0
PE 0:   MPICH_MEM_DEBUG_FNAME                          = 
PE 0:   MPICH_INTERNAL_MEM_AFFINITY                    = SYS_DEFAULT
PE 0:   MPICH_NO_BUFFER_ALIAS_CHECK                    = 0
PE 0:   MPICH_COLL_SYNC                                = MPI_Bcast
PE 0:   MPICH_SINGLE_HOST_ENABLED                        = 1
PE 0:   MPICH_USE_PERSISTENT_TOPS                      = 0
PE 0:   MPICH_DISABLE_PERSISTENT_RECV_TOPS             = 0
PE 0:   MPICH_MAX_TOPS_COUNTERS                        = 0
PE 0:   MPICH_ENABLE_ACTIVE_WAIT                       = 0
PE 0: MPICH/RMA environment settings =================================
PE 0:   MPICH_RMA_MAX_PENDING                          = 128
PE 0:   MPICH_RMA_SHM_ACCUMULATE                       = 0
PE 0: MPICH/Dynamic Process Management environment settings ==========
PE 0:   MPICH_DPM_DIR                                  = 
PE 0:   MPICH_LOCAL_SPAWN_SERVER                       = 0
PE 0:   MPICH_SPAWN_USE_RANKPOOL                       = 0
PE 0: MPICH/SMP environment settings =================================
PE 0:   MPICH_SMP_SINGLE_COPY_MODE                     = XPMEM
PE 0:   MPICH_SMP_SINGLE_COPY_SIZE                     = 8192
PE 0:   MPICH_SHM_PROGRESS_MAX_BATCH_SIZE              = 8
PE 0: MPICH/COLLECTIVE environment settings ==========================
PE 0:   MPICH_COLL_OPT_OFF                             = 0
PE 0:   MPICH_BCAST_ONLY_TREE                          = 1
PE 0:   MPICH_BCAST_INTERNODE_RADIX                    = 4
PE 0:   MPICH_BCAST_INTRANODE_RADIX                    = 4
PE 0:   MPICH_ALLTOALL_SHORT_MSG                       = 64-512
PE 0:   MPICH_ALLTOALL_SYNC_FREQ                       = 1-24
PE 0:   MPICH_ALLTOALLV_THROTTLE                       = 8
PE 0:   MPICH_ALLGATHER_VSHORT_MSG                     = 1024-4096
PE 0:   MPICH_ALLGATHERV_VSHORT_MSG                    = 1024-4096
PE 0:   MPICH_GATHERV_SHORT_MSG                        = 131072
PE 0:   MPICH_GATHERV_MIN_COMM_SIZE                    = 64
PE 0:   MPICH_GATHERV_MAX_TMP_SIZE                     = 536870912
PE 0:   MPICH_GATHERV_SYNC_FREQ                        = 16
PE 0:   MPICH_IGATHERV_MIN_COMM_SIZE                   = 1000
PE 0:   MPICH_IGATHERV_SYNC_FREQ                       = 100
PE 0:   MPICH_IGATHERV_RAND_COMMSIZE                   = 2048
PE 0:   MPICH_IGATHERV_RAND_RECVLIST                   = 0
PE 0:   MPICH_SCATTERV_SHORT_MSG                       = 2048-8192
PE 0:   MPICH_SCATTERV_MIN_COMM_SIZE                   = 64
PE 0:   MPICH_SCATTERV_MAX_TMP_SIZE                    = 536870912
PE 0:   MPICH_SCATTERV_SYNC_FREQ                       = 16
PE 0:   MPICH_SCATTERV_SYNCHRONOUS                     = 0
PE 0:   MPICH_ALLREDUCE_MAX_SMP_SIZE                   = 262144
PE 0:   MPICH_ALLREDUCE_BLK_SIZE                       = 716800
PE 0:   MPICH_GPU_ALLGATHER_VSHORT_MSG_ALGORITHM       = 1
PE 0:   MPICH_GPU_ALLREDUCE_USE_KERNEL                 = 0
PE 0:   MPICH_GPU_COLL_STAGING_BUF_SIZE                = 1048576
PE 0:   MPICH_GPU_ALLREDUCE_STAGING_THRESHOLD          = 256
PE 0:   MPICH_ALLREDUCE_NO_SMP                         = 0
PE 0:   MPICH_REDUCE_NO_SMP                            = 0
PE 0:   MPICH_REDUCE_SCATTER_COMMUTATIVE_LONG_MSG_SIZE = 524288
PE 0:   MPICH_REDUCE_SCATTER_MAX_COMMSIZE              = 1000
PE 0:   MPICH_SHARED_MEM_COLL_OPT                      = 1
PE 0:   MPICH_SHARED_MEM_COLL_NCELLS                   = 8
PE 0:   MPICH_SHARED_MEM_COLL_CELLSZ                   = 256
PE 0: MPICH MPIIO environment settings ===============================
PE 0:   MPICH_MPIIO_HINTS_DISPLAY                      = 0
PE 0:   MPICH_MPIIO_HINTS                              = NULL
PE 0:   MPICH_MPIIO_ABORT_ON_RW_ERROR                  = disable
PE 0:   MPICH_MPIIO_CB_ALIGN                           = 2
PE 0:   MPICH_MPIIO_DVS_MAXNODES                       = -1
PE 0:   MPICH_MPIIO_AGGREGATOR_PLACEMENT_DISPLAY       = 0
PE 0:   MPICH_MPIIO_AGGREGATOR_PLACEMENT_STRIDE        = -1
PE 0:   MPICH_MPIIO_MAX_NUM_IRECV                      = 50
PE 0:   MPICH_MPIIO_MAX_NUM_ISEND                      = 50
PE 0:   MPICH_MPIIO_MAX_SIZE_ISEND                     = 10485760
PE 0:   MPICH_MPIIO_OFI_STARTUP_CONNECT                = disable
PE 0:   MPICH_MPIIO_OFI_STARTUP_NODES_AGGREGATOR        = 2
PE 0: MPICH MPIIO statistics environment settings ====================
PE 0:   MPICH_MPIIO_STATS                              = 0
PE 0:   MPICH_MPIIO_TIMERS                             = 0
PE 0:   MPICH_MPIIO_WRITE_EXIT_BARRIER                 = 1
PE 0: MPICH Thread Safety settings ===================================
PE 0:   MPICH_ASYNC_PROGRESS                           = 0
PE 0:   MPICH_OPT_THREAD_SYNC                          = 1
PE 0:   rank 0 required = funneled, was provided = funneled
MPICH ERROR [Rank 0] [job id 21208684.35] [Fri Jan 10 09:53:02 2025] [nid001265] - Abort(1734831948) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1734831948) - process 0

aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1734831948) - process 0
srun: error: nid001265: task 0: Exited with exit code 255
srun: Terminating StepId=21208684.35
slurmstepd: error: *** STEP 21208684.35 ON nid001265 CANCELLED AT 2025-01-10T09:53:02 ***
srun: error: nid001265: tasks 1-35: Terminated
srun: Force Terminated StepId=21208684.35

      Failed
Exception raised while running the steps of the test case
Traceback (most recent call last):
  File "/users/althea/code/compass/main/compass/run/serial.py", line 322, in _log_and_run_test
    _run_test(test_case, available_resources)
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/althea/code/compass/main/compass/run/serial.py", line 419, in _run_test
    _run_step(test_case, step, test_case.new_step_log_file,
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
              available_resources)
              ^^^^^^^^^^^^^^^^^^^^
  File "/users/althea/code/compass/main/compass/run/serial.py", line 470, in _run_step
    step.run()
    ~~~~~~~~^^
  File "/users/althea/code/compass/main/compass/ocean/tests/global_ocean/forward.py", line 224, in run
    run_model(self, update_pio=update_pio)
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/althea/code/compass/main/compass/model.py", line 60, in run_model
    run_command(args=args, cpus_per_task=cpus_per_task, ntasks=ntasks,
    ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                openmp_threads=openmp_threads, config=config, logger=logger)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/althea/code/compass/main/compass/parallel.py", line 149, in run_command
    check_call(command_line_args, logger, env=env)
    ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/althea/miniforge3/envs/dev_compass_1.7.0-alpha.1/lib/python3.13/site-packages/mpas_tools/logging.py", line 59, in check_call
    raise subprocess.CalledProcessError(process.returncode,
                                        print_args)
subprocess.CalledProcessError: Command 'srun -c 1 -N 1 -n 36 ./ocean_model -n namelist.ocean -s streams.ocean' returned non-zero exit status 143.
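For reference, exit status 143 is 128 + 15 (SIGTERM): rank 0 calls MPI_Abort and exits with code 255, and Slurm then terminates the remaining tasks of the step, which is the status check_call ends up reporting. A minimal sketch of re-running the failing command by hand, assuming you are in the step's work directory on a Chicoma compute node with the same module environment the test used (the srun arguments are copied verbatim from the log above; the environment setup itself is not shown in the log):

import subprocess

# srun invocation copied from the failing step's log
args = ['srun', '-c', '1', '-N', '1', '-n', '36',
        './ocean_model', '-n', 'namelist.ocean', '-s', 'streams.ocean']
try:
    subprocess.run(args, check=True)
except subprocess.CalledProcessError as err:
    # 143 = 128 + SIGTERM(15): Slurm cancels the step after rank 0 aborts
    print(f'ocean_model failed with return code {err.returncode}')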
