Upgrade xpk version from v0.8.0 to v0.10.1 #1608

Merged 13 commits on Aug 11, 2025
42 changes: 27 additions & 15 deletions .github/actions/gke-xpk/action.yml
@@ -70,7 +70,7 @@ inputs:
   XPK_VERSION:
     description: 'XPK release tag'
     required: false
-    default: 'v0.8.0'
+    default: 'v0.10.1'
     type: string
   XPK_PYTHON:
     description: 'Python version for XPK'
@@ -119,9 +119,8 @@ runs:
       shell: bash -x -e -u {0}
       run: |
         sed -i 's/{{ IMAGE_PULL_SECRET_NAME }}/${{ inputs.IMAGE_PULL_SECRET_NAME }}/g' .github/gke-workflow/xpk/${{ inputs.XPK_VERSION}}/workload.patch
-        git apply --unsafe-paths .github/gke-workflow/xpk/${{ inputs.XPK_VERSION}}/tcpxo_decorator.patch --directory ${WORKLOAD_NAME}/xpk
-        git apply --unsafe-paths .github/gke-workflow/xpk/${{ inputs.XPK_VERSION}}/docker_resources.patch --directory ${WORKLOAD_NAME}/xpk
-        git apply --unsafe-paths .github/gke-workflow/xpk/${{ inputs.XPK_VERSION}}/workload.patch --directory ${WORKLOAD_NAME}/xpk
+        PATCH_PATH=.github/gke-workflow/xpk/${{ inputs.XPK_VERSION}}
+        ls ${PATCH_PATH} | xargs -I {} git apply --unsafe-paths ${PATCH_PATH}/{} --directory ${WORKLOAD_NAME}/xpk

     - name: Set workload commands
       shell: bash -x -e -u {0}
@@ -158,18 +157,31 @@ runs:
       run: |
         source ${WORKLOAD_NAME}/.venv/bin/activate
         cd ${WORKLOAD_NAME}/xpk
+
+        args=(
+          --project=${{ inputs.GCP_PROJECT }}
+          --cluster=${{ inputs.GKE_CLUSTER }}
+          --zone=${{ inputs.GCP_ZONE }}
+          --workload=${WORKLOAD_NAME}
+          --docker-image=${{ inputs.IMAGE }}
+          --device-type=${{ inputs.CLUSTER_DEVICE }}
+          --num-nodes=${{ inputs.NUM_NODES }}
+          --num-slices=${{ inputs.NUM_NODES }}
+          --priority=high
+          --scheduler=gke.io/topology-aware-auto
+        )
+
+        if [[ "${{ inputs.XPK_VERSION }}" == "v0.10.1" ]]; then
+          args+=(
+            --docker-image-pull-secret=${{ inputs.IMAGE_PULL_SECRET_NAME }}
+            --env="JAX_COORDINATOR_PORT=3389"
+            --env="JAX_COORDINATOR_ADDRESS=\$(JOBSET_NAME)-\$(REPLICATED_JOB_NAME)-0-0.\$(JOBSET_NAME):3389"
+          )
+        fi
+
         python xpk.py workload create \
-          --project ${{ inputs.GCP_PROJECT }} \
-          --cluster ${{ inputs.GKE_CLUSTER }} \
-          --zone ${{ inputs.GCP_ZONE }} \
-          --workload ${WORKLOAD_NAME} \
-          --docker-image ${{ inputs.IMAGE }} \
-          --device-type ${{ inputs.CLUSTER_DEVICE }} \
-          --num-nodes ${{ inputs.NUM_NODES }} \
-          --num-slices ${{ inputs.NUM_NODES }} \
-          --priority=high \
-          --scheduler=gke.io/topology-aware-auto \
-          --command "${PRELUDE} ${CMD} ${POSTLUDE}"
+          ${args[@]} \
+          --command="${PRELUDE} ${CMD} ${POSTLUDE}"

     - name: Wait for JobSet to unsuspend on cluster
       shell: bash -u {0}
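
Note on the hunk above: the workload-create step now collects its flags in a bash array and appends the version-specific ones conditionally. A minimal standalone sketch of that pattern follows. The variable values are hypothetical placeholders rather than values from this PR, and the sketch prints the arguments instead of invoking `xpk.py`. It also quotes the expansion as `"${args[@]}"`, which is an assumption about what is safest, not what the hunk does.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-ins for the workflow inputs (not values from the PR).
XPK_VERSION="v0.10.1"
IMAGE_PULL_SECRET_NAME="example-secret"
WORKLOAD_NAME="example-workload"

# Flags shared by every supported XPK version.
args=(
  --workload="${WORKLOAD_NAME}"
  --priority=high
)

# Flags appended only for the newer release.
if [[ "${XPK_VERSION}" == "v0.10.1" ]]; then
  args+=(
    --docker-image-pull-secret="${IMAGE_PULL_SECRET_NAME}"
    --env="JAX_COORDINATOR_PORT=3389"
  )
fi

# Quoting the expansion passes each element as exactly one argument,
# even if a value contains whitespace.
printf '%s\n' "${args[@]}"
```

With an unquoted `${args[@]}`, any element containing whitespace would be split into several arguments, which is why the quoted form is generally preferred when values come from user-supplied inputs.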
13 changes: 13 additions & 0 deletions .github/gke-workflow/xpk/v0.10.1/tcpxo_decorator.patch
@@ -0,0 +1,13 @@
diff --git a/src/xpk/core/workload_decorators/tcpxo_decorator.py b/src/xpk/core/workload_decorators/tcpxo_decorator.py
index 771cafe..e455f2a 100644
--- a/src/xpk/core/workload_decorators/tcpxo_decorator.py
+++ b/src/xpk/core/workload_decorators/tcpxo_decorator.py
@@ -180,7 +180,7 @@ def update_gpu_containers(job_manifest):
if 'nvidia.com/gpu' in container.get('resources', {}).get('limits', {}):
container.setdefault('env', [])
container['env'].append(
- {'name': 'LD_LIBRARY_PATH', 'value': '/usr/local/nvidia/lib64'}
+ {'name': 'LD_LIBRARY_PATH', 'value': '/opt/nvidia/nccl/lib:/usr/local/nvidia/lib64'}
)
container['env'].append({
'name': 'NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY',
16 changes: 16 additions & 0 deletions .github/gke-workflow/xpk/v0.10.1/workload.patch
@@ -0,0 +1,16 @@
diff --git a/src/xpk/commands/workload.py b/src/xpk/commands/workload.py
index 0231bab..25e34eb 100644
--- a/src/xpk/commands/workload.py
+++ b/src/xpk/commands/workload.py
@@ -482,8 +482,9 @@ def workload_create(args) -> None:
flex=True if capacity_type == CapacityType.FLEX_START else False,
)
else (
- 'kueue.x-k8s.io/podset-preferred-topology:'
- ' "cloud.google.com/gce-topology-host"'
+ """
+ kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-host"
+ """
)
)
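
Note on how these patch files are consumed: the action applies every file found in the version directory by piping `ls` into `xargs`. A possible alternative, sketched here under the assumption that only files ending in `.patch` should be applied, is a glob loop. `list_patches` is a hypothetical helper name, and the demo runs on a throwaway directory rather than the real `.github/gke-workflow/xpk/<version>` path.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the *.patch files in a directory, one per line.
list_patches() {
  local dir="$1"
  local patch
  for patch in "${dir}"/*.patch; do
    # An unmatched glob stays literal, so skip anything that is not a file.
    [[ -f "${patch}" ]] || continue
    printf '%s\n' "${patch}"
  done
}

# Demo on a temporary directory holding two patches and one unrelated file.
demo_dir="$(mktemp -d)"
touch "${demo_dir}/workload.patch" "${demo_dir}/tcpxo_decorator.patch" "${demo_dir}/README.md"

patches="$(list_patches "${demo_dir}")"
printf '%s\n' "${patches}"
```

In the workflow, each printed path could then be handed to `git apply --unsafe-paths "${patch}" --directory "${WORKLOAD_NAME}/xpk"`, as the existing step does. The glob skips non-patch files and tolerates an empty directory, which the `ls | xargs` form does not guarantee.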