Releases: aws/aws-parallelcluster-cookbook

AWS ParallelCluster v2.10.0

18 Nov 16:21

We're excited to announce the release of AWS ParallelCluster Cookbook 2.10.0.

This is associated with AWS ParallelCluster v2.10.0.

ENHANCEMENTS

  • Add support for CentOS 8.
  • Add support for instance types with multiple network cards (e.g. p4d.24xlarge).
  • Enable FSx Lustre in China regions.
  • Add validation step for AMI creation process to fail when using a base AMI created by a different version of
    ParallelCluster.
  • Add validation step for AMI creation process to fail if the selected OS and the base AMI OS are not consistent.
  • Add the ability to use a post-installation script when building an AMI.
  • Install NVIDIA Fabric Manager to enable NVIDIA NVSwitch on supported platforms.
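
The two AMI validation steps listed above amount to comparing what is known about the base AMI with what the build was asked to produce. The following is a minimal sketch, not the cookbook's actual check: the names AmiInfo and validate_base_ami are hypothetical, and it assumes that a base AMI never built by ParallelCluster is acceptable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AmiInfo:
    """Hypothetical record of what is known about a base AMI."""
    os: str                                  # e.g. "centos8"
    parallelcluster_version: Optional[str]   # None if not built by ParallelCluster


def validate_base_ami(base: AmiInfo, selected_os: str, current_version: str) -> None:
    """Raise ValueError if the base AMI should not be used for this build."""
    # Assumption: an AMI with no recorded ParallelCluster version is a plain OS image.
    if base.parallelcluster_version not in (None, current_version):
        raise ValueError(
            f"base AMI was created by ParallelCluster {base.parallelcluster_version}, "
            f"but this build uses {current_version}"
        )
    if base.os != selected_os:
        raise ValueError(
            f"selected OS is {selected_os}, but the base AMI OS is {base.os}"
        )


if __name__ == "__main__":
    try:
        validate_base_ami(AmiInfo("centos8", "2.9.1"), "centos8", "2.10.0")
    except ValueError as error:
        print(f"AMI creation would fail: {error}")
```

Failing early on either mismatch avoids spending a full build on an image that mixes software from two releases or two operating systems.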

CHANGES

  • Upgrade EFA installer to version 1.10.1
    • EFA configuration: efa-config-1.5 (from efa-config-1.4)
    • EFA profile: efa-profile-1.1 (from efa-profile-1.0.0)
    • EFA kernel module: efa-1.10.2 (from efa-1.6.0)
    • RDMA core: rdma-core-31.amzn0 (from rdma-core-28.amzn0)
    • Libfabric: libfabric-1.11.1amzn1.1 (from libfabric-1.10.1amzn1.1)
    • Open MPI: openmpi40-aws-4.0.5 (from openmpi40-aws-4.0.3)
    • Unifies installer runtime options across x86 and aarch64
    • Introduces -g/--enable-gdr switch to install packages with GPUDirect RDMA support
    • Updates to OMPI collectives decision file packaging, migrated from efa-config to efa-profile
    • Introduces CentOS 8 support
  • CentOS 6 is no longer supported.
  • Upgrade NVIDIA driver to version 450.80.02.
  • Upgrade Intel Parallel Studio XE Runtime to version 2020.2.
  • Upgrade Munge to version 0.5.14.
  • Retrieve FSx Lustre DNS name dynamically.
  • Slurm: change SlurmctldPort to 6820-6829 to not overlap with default slurmdbd port (6819).
  • Slurm: add compute_resource name and efa as node features.
  • Improve Slurm and Munge installation process by cleaning up existing installations from OS repositories.
  • Install Python 3 version of aws-cfn-bootstrap scripts.
  • Do not force compute fleet into STOPPED state when performing a cluster update. This allows updating the queue
    size without forcing termination of the existing instances.
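
The SlurmctldPort change listed above can be checked with a few lines of arithmetic: the new range 6820-6829 holds ten ports and excludes the default slurmdbd port 6819. This is an illustrative sketch only; parse_port_range is a hypothetical helper, not part of Slurm or the cookbook.

```python
def parse_port_range(spec: str) -> range:
    """Parse a 'low-high' (or single-port) string into an inclusive range."""
    low, _, high = spec.partition("-")
    start = int(low)
    end = int(high) if high else start
    if end < start:
        raise ValueError(f"invalid port range: {spec!r}")
    return range(start, end + 1)


SLURMDBD_DEFAULT_PORT = 6819  # default slurmdbd port named in the release note

slurmctld_ports = parse_port_range("6820-6829")
print(len(slurmctld_ports))                      # 10
print(SLURMDBD_DEFAULT_PORT in slurmctld_ports)  # False
```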

BUG FIXES

  • Fix ephemeral drives setup to avoid failures when partition changes require a reboot.
  • Fix Chrony service management.
  • Retrieve the right number of compute instance slots when instance type is updated.
  • Fix compute fleet status initialization to be configured before daemons are started by supervisord.

AWS ParallelCluster v2.9.1

15 Sep 14:38

We're excited to announce the release of AWS ParallelCluster Cookbook 2.9.1.

This is associated with AWS ParallelCluster v2.9.1.

CHANGES

  • There were no notable changes for this version.

AWS ParallelCluster v2.9.0

11 Sep 18:53

We're excited to announce the release of AWS ParallelCluster Cookbook 2.9.0.

This is associated with AWS ParallelCluster v2.9.0.

ENHANCEMENTS

  • Add support for multiple queues and multiple instance types feature with the Slurm scheduler.
  • Extend NICE DCV support to ARM instances.
  • Extend support to disable hyperthreading on instances (like *.metal) that don't support CpuOptions in
    LaunchTemplate.
  • Enable support for NFS 4 for the filesystems shared from the head node.
  • Add script wrapper to support Torque-like commands with the Slurm scheduler.

CHANGES

  • A Route53 private hosted zone is now created together with the cluster and used in DNS resolution inside cluster nodes
    when using Slurm scheduler.
  • Upgrade EFA installer to version 1.9.5:
    • EFA configuration: efa-config-1.4 (from efa-config-1.3)
    • EFA profile: efa-profile-1.0.0
    • EFA kernel module: efa-1.6.0 (no change)
    • RDMA core: rdma-core-28.amzn0 (no change)
    • Libfabric: libfabric-1.10.1amzn1.1 (no change)
    • Open MPI: openmpi40-aws-4.0.3 (no change)
  • Upgrade Slurm to version 20.02.4.
  • Apply the following changes to Slurm configuration:
    • Assign a range of 10 ports to Slurmctld in order to better perform with large cluster settings
    • Configure cloud scheduling logic
    • Set ReconfigFlags=KeepPartState
    • Set MessageTimeout=60
    • Set TaskPlugin=task/affinity,task/cgroup together with TaskAffinity=no and ConstrainCores=yes in cgroup.conf
  • Upgrade NICE DCV to version 2020.1-9012.
  • Use private IP instead of master node hostname when mounting shared NFS drives.
  • Add new log streams to CloudWatch: chef-client, clustermgtd, computemgtd, slurm_resume, slurm_suspend.
  • Remove dependency on cfn-init in compute nodes bootstrap.
  • Add support for queue names in pre/post install scripts.
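
The Slurm settings named in the list above can be written out as key=value lines. The sketch below renders only the values stated in this release note; the render_conf helper is hypothetical, and the real slurm.conf and cgroup.conf shipped by the cookbook contain many more settings.

```python
def render_conf(settings: dict) -> str:
    """Render a mapping as 'Key=Value' lines, one per entry, in insertion order."""
    return "".join(f"{key}={value}\n" for key, value in settings.items())


# Values taken from the 2.9.0 release note.
SLURM_CONF = {
    "ReconfigFlags": "KeepPartState",
    "MessageTimeout": 60,
    "TaskPlugin": "task/affinity,task/cgroup",
}

CGROUP_CONF = {
    "TaskAffinity": "no",
    "ConstrainCores": "yes",
}

if __name__ == "__main__":
    print("# slurm.conf (fragment)")
    print(render_conf(SLURM_CONF), end="")
    print("# cgroup.conf (fragment)")
    print(render_conf(CGROUP_CONF), end="")
```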

BUG FIXES

  • Solve dpkg lock issue with Ubuntu that prevented custom AMI creation in some cases.

AWS ParallelCluster v2.8.1

04 Aug 15:49

We're excited to announce the release of AWS ParallelCluster Cookbook 2.8.1.

This is associated with AWS ParallelCluster v2.8.1.

CHANGES

  • Disable screen lock for DCV desktop sessions to prevent users from being locked out.

AWS ParallelCluster v2.8.0

24 Jul 00:52

We're excited to announce the release of AWS ParallelCluster Cookbook 2.8.0.

This is associated with AWS ParallelCluster v2.8.0.

ENHANCEMENTS

  • Enable support for ARM instances on Ubuntu 18.04 and Amazon Linux 2.
  • Install PMIx v3.1.5 and provide Slurm support for it on all supported operating systems except for
    CentOS 6.
  • Install glibc-static, which is required to support certain options for the Intel MPI compiler.

CHANGES

  • Disable libvirtd service on CentOS 7. Virtual bridge interfaces are incorrectly detected by Open MPI and
    cause MPI applications to hang; see https://www.open-mpi.org/faq/?category=tcp#tcp-selection for details.
  • Use CINC instead of Chef for provisioning instances. See https://cinc.sh/about/ for details.
  • Retry when mounting an NFS mount fails.
  • Install the pyenv virtual environments used by ParallelCluster cookbook and node daemon code under
    /opt/parallelcluster instead of under /usr/local.
  • Avoid downloading the source for env2 at installation time.
  • Drop dependency on the gems ridley and ffi-libarchive.
  • Vendor cookbooks as part of instance provisioning, rather than doing so before copying the cookbook into an
    instance. Users no longer need to have berks installed locally.
  • Drop the dependencies on the poise-python, tar and hostname third-party cookbooks.
  • Use the new official CentOS 7 AMI as the base images for ParallelCluster AMI.
  • Upgrade NVIDIA driver to Tesla version 440.95.01 on CentOS 6 and version 450.51.05 on all other distros.
  • Upgrade CUDA library to version 11.0 on all distros besides CentOS 6.
  • Install third-party cookbook dependencies via local source, rather than using the Chef supermarket.
  • Use https wherever possible in download URLs.
  • Upgrade EFA installer to version 1.9.4:
    • Kernel module: efa-1.6.0 (from efa-1.5.1)
    • RDMA core: rdma-core-28.amzn0 (from rdma-core-25.0)
    • Libfabric: libfabric-1.10.1amzn1.1 (updated from libfabric-aws-1.9.0amzn1.1)
    • Open MPI: openmpi40-aws-4.0.3 (no change)
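
The "retry when mounting an NFS mount fails" change listed above follows a common pattern: attempt the operation, wait, and try again a bounded number of times. The sketch below shows that pattern in general form. The retry and mount_nfs helpers, the attempt count, and the delay are all assumptions for illustration; the cookbook's own retry settings are not stated in this release note.

```python
import subprocess
import time
from typing import Callable


def retry(action: Callable[[], None], attempts: int = 3,
          delay_seconds: float = 5.0, sleep: Callable[[float], None] = time.sleep) -> None:
    """Run action until it succeeds, re-raising the last error after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            action()
            return
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the original failure
            sleep(delay_seconds)


def mount_nfs(source: str, target: str) -> None:
    """Hypothetical helper: mount an NFS export, raising CalledProcessError on failure."""
    subprocess.run(["mount", "-t", "nfs", source, target], check=True)


# Example call (not executed here; requires root and a reachable NFS server):
# retry(lambda: mount_nfs("10.0.0.10:/shared", "/shared"), attempts=3, delay_seconds=5)
```

Bounding the number of attempts keeps a permanently unreachable server from stalling node bootstrap indefinitely.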

BUG FIXES

  • Fix issue that was preventing concurrent use of custom node and pcluster CLI packages.
  • Use the correct domain name when contacting AWS services from the China partition.
  • Avoid pinning to a specific release of the Intel HPC platform.

AWS ParallelCluster v2.7.0

19 May 08:25

We're excited to announce the release of AWS ParallelCluster Cookbook 2.7.0.

This is associated with AWS ParallelCluster v2.7.0.

CHANGES

  • Upgrade NICE DCV to version 2020.0-8428.
  • Upgrade Intel MPI to version U7.
  • Upgrade NVIDIA driver to version 440.64.00.
  • Upgrade EFA installer to version 1.8.4:
    • Kernel module: efa-1.5.1 (no change)
    • RDMA core: rdma-core-25.0 (no change)
    • Libfabric: libfabric-aws-1.9.0amzn1.1 (no change)
    • Open MPI: openmpi40-aws-4.0.3 (updated from openmpi40-aws-4.0.2)
  • Upgrade CentOS 7 AMI to version 7.8.

BUG FIXES

  • Fix recipes installation at runtime by adding the bootstrapped file at the end of the last chef run.
  • Fix installation of Lustre client on CentOS 7.
  • FSx Lustre: Exit with error when failing to retrieve FSx mountpoint.

AWS ParallelCluster v2.6.1

09 Apr 23:02

We're excited to announce the release of AWS ParallelCluster 2.6.1.

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

ENHANCEMENTS

  • Change ProctrackType from proctrack/pgid to proctrack/cgroup in slurm.conf in order to better handle termination of
    stray processes when running MPI applications. This also includes the creation of a cgroup Slurm configuration in
    order to enable the cgroup plugin.
  • Skip execution, at node bootstrap time, of all those install recipes that are already applied at AMI creation time.
    The old behaviour can be restored by setting the property "skip_install_recipes" to "no" through extra_json. The old
    behaviour is required when a custom_node_package is specified, and may be needed when custom_cookbook is used
    (depending on whether the custom cookbook changes any *_install recipes).
  • Start CloudWatch agent earlier in the node bootstrapping phase so that cookbook execution failures are correctly
    uploaded and are available for troubleshooting.
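
The "skip_install_recipes" property mentioned above is passed through extra_json, which takes a JSON document. The sketch below builds such a value. Nesting the property under a "cluster" key is an assumption based on how other extra_json attributes are commonly written, and "yes" as the opposite of "no" is also assumed; check the ParallelCluster configuration documentation for the exact form.

```python
import json


def build_extra_json(skip_install_recipes: bool) -> str:
    """Build an extra_json value; the "cluster" nesting is an assumption."""
    value = "yes" if skip_install_recipes else "no"
    return json.dumps({"cluster": {"skip_install_recipes": value}})


# Restore the old behaviour (run install recipes at node bootstrap time).
print(build_extra_json(False))
# {"cluster": {"skip_install_recipes": "no"}}
```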

CHANGES

  • FSx Lustre: remove x-systemd.requires=lnet.service from mount options in order to rely on default lnet setup
    provided by Lustre.
  • Enforce Packer version to be >= 1.4.0 when building an AMI. This is also required for customers using the pcluster createami command.
  • Remove /tmp/proxy.sh file. Proxy configuration is now written into /etc/profile.d/proxy.sh
  • Omit cfn-init-cmd and cfn-wire from the files stored in CloudWatch logs.
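
The Packer >= 1.4.0 requirement listed above is a numeric version comparison, which is easy to get wrong with plain string comparison ("1.10.0" sorts before "1.4.0" as text). The sketch below compares versions as integer tuples. The helper names are hypothetical, and it assumes the version string has already been extracted from Packer's output.

```python
MIN_PACKER_VERSION = (1, 4, 0)  # minimum stated in the release note


def parse_version(text: str) -> tuple:
    """Turn '1.4.0' or 'v1.4.0' into a tuple of ints such as (1, 4, 0)."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def packer_is_supported(version_text: str) -> bool:
    """Return True if the given Packer version meets the minimum."""
    return parse_version(version_text) >= MIN_PACKER_VERSION


print(packer_is_supported("1.3.5"))   # False
print(packer_is_supported("1.4.0"))   # True
print(packer_is_supported("1.10.2"))  # True
```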

BUG FIXES

  • Fix installation of Intel Parallel Studio XE Runtime that requires yum4 since version 2019.5.
  • Fix compilation of Torque scheduler on Ubuntu 18.04.

Support

Need help / have a feature request?
AWS Support: https://console.aws.amazon.com/support/home
ParallelCluster Issues tracker on GitHub: https://github.com/aws/aws-parallelcluster
The HPC Forum on the AWS Forums page: https://forums.aws.amazon.com/forum.jspa?forumID=192

AWS ParallelCluster v2.6.0

26 Feb 20:41

We're excited to announce the release of AWS ParallelCluster Cookbook 2.6.0.

This is associated with AWS ParallelCluster v2.6.0.

ENHANCEMENTS

  • Add support for Amazon Linux 2
  • Install and setup CloudWatch agent for logging capability
  • Install NICE DCV on Ubuntu 18.04 (this includes ubuntu-desktop, lightdm, mesa-util packages)
  • Install and setup Amazon Time Sync on all OSs
  • Enable accounting plugin in Slurm for all OSes. Note: accounting is neither enabled nor configured by default
  • Enable FSx Lustre on Ubuntu 18.04 and Ubuntu 16.04

CHANGES

  • Upgrade Slurm to version 19.05.5
  • Upgrade Intel MPI to version U6
  • Upgrade EFA installer to version 1.8.3:
    • Kernel module: efa-1.5.1 (updated from efa-1.4.1)
    • RDMA core: rdma-core-25.0 (distributed only) (no change)
    • Libfabric: libfabric-aws-1.9.0amzn1.1 (updated from libfabric-aws-1.8.1amzn1.3)
    • Open MPI: openmpi40-aws-4.0.2 (no change)
  • Add SHA256 checksum verification to verify integrity of NICE DCV packages
  • Install Python 2.7.17 on CentOS 6 and set it as default through pyenv
  • Install Ganglia from repository on Amazon Linux, Amazon Linux 2, CentOS 6 and CentOS 7
  • Disable StrictHostKeyChecking for SSH client when target host is inside cluster VPC for all OSs except CentOS 6
  • Pin Intel Python 2 and Intel Python 3 to version 2019.4
  • Automatically disable ptrace protection on Ubuntu 18.04 and Ubuntu 16.04 compute nodes when EFA is enabled
  • Packer version >= 1.4.0 is required for AMI creation

BUG FIXES

  • Fix issue with slurmd daemon not being restarted correctly when a compute node is rebooted
  • Fix errors that left Torque unable to locate jobs by setting server_name to the FQDN on the master node
  • Fix Torque issue that was limiting the max number of running jobs to the max size of the cluster
  • Slurm: configure StateSaveLocation and SlurmdSpoolDir directories to be writable only by the slurm user

AWS ParallelCluster v2.5.1

13 Dec 16:34

We're excited to announce the release of AWS ParallelCluster Cookbook 2.5.1.

This is associated with AWS ParallelCluster v2.5.1.

Changes

  • Upgrade NVIDIA driver to Tesla version 440.33.01.
  • Upgrade CUDA library to version 10.2.
  • Upgrade EFA installer to version 1.7.1:
    • Kernel module: efa-1.4.1
    • RDMA core: rdma-core-25.0
    • Libfabric: libfabric-aws-1.8.1amzn1.3
    • Open MPI: openmpi40-aws-4.0.2

Bug Fixes

  • Fix installation of NVIDIA drivers on Ubuntu 18.
  • Fix installation of CUDA toolkit on CentOS 6.
  • Fix installation of Munge on Amazon Linux, CentOS 6, CentOS 7 and Ubuntu 16.
  • Export shared directories to all CIDR blocks in a VPC rather than just the first one.
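
The last fix above concerns VPCs with more than one CIDR block: each block needs its own export entry. The sketch below generates one exports-style line per CIDR. The exports_lines helper, the export options, and the example addresses are assumptions for illustration; the options the cookbook actually uses are not stated in this release note.

```python
import ipaddress
from typing import List


def exports_lines(shared_dir: str, vpc_cidrs: List[str],
                  options: str = "rw,sync,no_root_squash") -> List[str]:
    """Return one /etc/exports-style line per VPC CIDR block."""
    lines = []
    for cidr in vpc_cidrs:
        network = ipaddress.ip_network(cidr)  # raises ValueError on a malformed CIDR
        lines.append(f"{shared_dir} {network}({options})")
    return lines


for line in exports_lines("/shared", ["10.0.0.0/16", "10.1.0.0/16"]):
    print(line)
```

Exporting only to the first block would leave compute nodes in a secondary CIDR unable to mount the shared directories.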

AWS ParallelCluster v2.5.0

15 Nov 22:38
1f8ab59

We're excited to announce the release of AWS ParallelCluster Cookbook 2.5.0.

This is associated with AWS ParallelCluster v2.5.0.

Enhancements

  • Install NICE DCV on CentOS 7 (this includes Gnome and Xorg packages).
  • Install Intel Parallel Studio 2019.5 Runtime in CentOS 7 AMI and share /opt/intel over NFS.
  • Add support for Ubuntu 18.

Changes

  • Remove support for Ubuntu 14.
  • Upgrade Intel MPI to version U5.
  • Upgrade EFA Installer to version 1.6.2, which also upgrades Open MPI to 4.0.2.
  • Upgrade NVIDIA driver to Tesla version 418.87.
  • Upgrade CUDA library to version 10.1.
  • Upgrade Slurm to version 19.05.3-2.
  • Slurm: change the following parameters in the global configuration:
    • SelectType=select/cons_tres, SelectTypeParameters=CR_CPU_Memory, GresTypes=gpu: needed to enable support for GPU scheduling.
    • EnforcePartLimits=ALL: jobs which exceed a partition's size and/or time limits will be rejected at submission time.
    • Removed FastSchedule since deprecated.
    • SlurmdTimeout=180, UnkillableStepTimeout=180: to allow scheduler to recover especially when under heavy load.
  • Echo compute instance type and memory information in COMPUTE_READY message
  • Changes to sshd config:
  • Increase default root volume to 25GB.
  • Enable flock user_xattr noatime Lustre options by default everywhere and
    x-systemd.automount x-systemd.requires=lnet.service for systemd based systems.
  • Install EFA in China AMIs.
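
The Lustre mount-option change listed above distinguishes between options applied everywhere and options applied only on systemd-based systems. The sketch below assembles the comma-separated option string for each case using the option names from this release note. The lustre_mount_options helper is hypothetical, and how the cookbook detects systemd is not covered here.

```python
# Options named in the 2.5.0 release note.
BASE_OPTIONS = ["flock", "user_xattr", "noatime"]
SYSTEMD_OPTIONS = ["x-systemd.automount", "x-systemd.requires=lnet.service"]


def lustre_mount_options(uses_systemd: bool) -> str:
    """Return the comma-separated mount options for a Lustre filesystem."""
    options = list(BASE_OPTIONS)
    if uses_systemd:
        options.extend(SYSTEMD_OPTIONS)
    return ",".join(options)


print(lustre_mount_options(uses_systemd=False))
# flock,user_xattr,noatime
print(lustre_mount_options(uses_systemd=True))
# flock,user_xattr,noatime,x-systemd.automount,x-systemd.requires=lnet.service
```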

Bug Fixes

  • Fix Ganglia not starting on Ubuntu 16.
  • Fix bug that was preventing nodes from mounting partitioned EBS volumes.
