Jobstart-related software and information
A. Get Slurm deploy scripts:
- Go to the root directory of the experiment:
$ cd <rootdir>
- Clone the deploy scripts:
$ git clone https://github.com/artpol84/jobstart.git
- Go to the deploy directory:
$ cd jobstart/slurm_deploy/
- Set up the configuration in deploy_ctl.conf.
NOTE: You need to set INSTALL_DIR to a directory that is unique for each node (like /tmp/slurm_deploy). Otherwise the Slurm daemon instances will conflict over the common files.
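One way to sanity-check a candidate INSTALL_DIR is to look at the filesystem type of the directory that holds it. The sketch below is only an illustration and is not part of the jobstart scripts: the helper name `is_node_local_fstype` is hypothetical, the list of shared filesystem types is an assumption and not exhaustive, and `stat -f -c %T` is GNU coreutils syntax.

```shell
#!/bin/sh
# Hypothetical helper (not part of the jobstart repo): fails for filesystem
# type names that are usually shared between nodes, succeeds otherwise.
# The list is an assumption; extend it for your cluster.
is_node_local_fstype() {
    case "$1" in
        nfs|lustre|gpfs|cifs|ceph) return 1 ;;
        *) return 0 ;;
    esac
}

# Inspect the filesystem behind a candidate directory (defaults to /tmp).
# GNU `stat -f -c %T` prints the filesystem type name, e.g. tmpfs or nfs.
candidate=${INSTALL_DIR:-/tmp}
fstype=$(stat -f -c %T "$candidate" 2>/dev/null || echo unknown)

if is_node_local_fstype "$fstype"; then
    echo "$candidate: filesystem type '$fstype' is not in the shared list"
else
    echo "$candidate: filesystem type '$fstype' looks shared; choose another INSTALL_DIR"
fi
```

Run it as `INSTALL_DIR=/tmp/slurm_deploy sh check_dir.sh` (the script name is arbitrary) on one of the nodes. A type outside the list only means no known shared filesystem was detected, so treat the result as a hint rather than a guarantee.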
B. Build and start the installation
- Allocate resources:
$ salloc -N <x> -t <y>
- Download all of the packages:
$ ./deploy_cmd.sh source_prepare
- Build and install all of the packages:
$ ./deploy_cmd.sh build_all
- Distribute everything:
$ ./deploy_cmd.sh distribute_all
- Configure Slurm. See jobstart/slurm_deploy/files/slurm.conf.in for the general configuration, and provide a customization file <local.conf> that describes the control machine and the partitions (see jobstart/slurm_deploy/files/local.conf as an example):
$ ./deploy_cmd.sh slurm_config ./files/local.conf
- Start the Slurm instance:
$ ./deploy_cmd.sh slurm_start
C. Check the installation
NOTE: From another terminal!
- Check that the deployment is functional:
$ export SLURMDEP_INST=<INSTALL_DIR from deploy_ctl.conf>
$ cd $SLURMDEP_INST/slurm/bin
$ ./sinfo
<check that the output is correct>
- Allocate nodes inside the deployed Slurm installation:
$ ./salloc -N <X> <other options>
- Run hostname to test:
$ ./srun hostname
- Run hostname with the pmix plugin:
$ ./srun --mpi=pmix hostname
D. Check with the distributed application
NOTE: Run these steps from the allocation of the deployed Slurm (same terminal as C.)
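A quick way to confirm that the current shell is inside an allocation is to test `SLURM_JOB_ID`, which salloc exports to the shell it starts. This is a minimal sketch and the function name `in_slurm_allocation` is hypothetical. Note that the variable alone does not tell you whether the allocation belongs to the deployed Slurm or to the system one.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the shell runs inside a Slurm job,
# detected through the SLURM_JOB_ID variable that salloc exports.
in_slurm_allocation() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

if in_slurm_allocation; then
    echo "inside Slurm job ${SLURM_JOB_ID}"
else
    echo "no Slurm allocation detected; run ./salloc from step C first"
fi
```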
- Go to the test app directory:
$ cd <rootdir>/jobstart/shmem/
- Compile the program:
$ $SLURMDEP_INST/ompi/bin/oshcc -o hello_oshmem_c -g hello_oshmem_c.c # SLURMDEP_INST is INSTALL_DIR from deploy_ctl.conf
- Launch the application:
$ cd <rootdir>/jobstart/launch/
$ ./run.sh {dtcp|ducx|sapi} [early|noearly] [openib] [timing] -N <nnodes> -n <nprocs> <other-slurm-opts> ./hello_oshmem_c
The following set of commands can be used to re-deploy Slurm after the initial allocation was lost:
export SLURMDEP_INST=<INSTALL_DIR from deploy_ctl.conf>
./deploy_cmd.sh slurm_stop
./deploy_cmd.sh cleanup_remote
rm --preserve-root ${SLURMDEP_INST}/slurm/tmp/*
rm --preserve-root ${SLURMDEP_INST}/slurm/var/*
./deploy_cmd.sh distribute_all
./deploy_cmd.sh slurm_config
./deploy_cmd.sh slurm_start
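Because the two rm commands expand `${SLURMDEP_INST}`, an unset or empty variable would make them point at /slurm/tmp/* and /slurm/var/*. The sketch below wraps the same sequence in a function that first checks the variable. The names `valid_inst_dir` and `redeploy_slurm` are hypothetical and not part of the deploy scripts, and the check is deliberately minimal.

```shell
#!/bin/sh
# Hypothetical guard: accept only a non-empty absolute path other than "/".
valid_inst_dir() {
    case "$1" in
        ""|/) return 1 ;;
        /*) return 0 ;;
        *) return 1 ;;
    esac
}

# Runs the same re-deploy commands as listed above, but refuses to start
# when SLURMDEP_INST is unset, empty, relative, or "/".
redeploy_slurm() {
    valid_inst_dir "${SLURMDEP_INST:-}" || {
        echo "SLURMDEP_INST must be a non-root absolute path" >&2
        return 1
    }
    ./deploy_cmd.sh slurm_stop
    ./deploy_cmd.sh cleanup_remote
    rm --preserve-root "${SLURMDEP_INST}"/slurm/tmp/*
    rm --preserve-root "${SLURMDEP_INST}"/slurm/var/*
    ./deploy_cmd.sh distribute_all
    ./deploy_cmd.sh slurm_config
    ./deploy_cmd.sh slurm_start
}
```

To use it, source the file from jobstart/slurm_deploy/, export SLURMDEP_INST as in step C, and call `redeploy_slurm`.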