From 006d6e63a63ca5a37ccfc337baf759dac3ba6f97 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Mon, 2 Jun 2025 13:02:17 +0800 Subject: [PATCH 1/7] mutable-pv-affinity init --- .../5381-mutable-pv-affinity/README.md | 872 ++++++++++++++++++ .../5381-mutable-pv-affinity/kep.yaml | 48 + 2 files changed, 920 insertions(+) create mode 100644 keps/sig-storage/5381-mutable-pv-affinity/README.md create mode 100644 keps/sig-storage/5381-mutable-pv-affinity/kep.yaml diff --git a/keps/sig-storage/5381-mutable-pv-affinity/README.md b/keps/sig-storage/5381-mutable-pv-affinity/README.md new file mode 100644 index 00000000000..f586fe5c7ae --- /dev/null +++ b/keps/sig-storage/5381-mutable-pv-affinity/README.md @@ -0,0 +1,872 @@ + +# KEP-5381: Mutable PersistentVolume Node Affinity + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - 
[Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +This KEP proposes to make 
`PersistentVolume.spec.nodeAffinity` field mutable,
+making it possible to change the affinity with a VolumeAttributeClass.
+This allows users to migrate data or enable features without
+interrupting workloads.
+
+## Motivation
+
+Currently, `PersistentVolume.spec.nodeAffinity` is set at creation time and cannot be changed.
+But users may want to modify a volume to
+take advantage of new features provided by the storage provider,
+or to accommodate changing business requirements.
+These modifications can be expressed by a `VolumeAttributeClass` in Kubernetes.
+But sometimes a modification to a volume comes with a change to its accessibility, such as:
+1. migration of data from one zone to regional storage;
+2. enabling features that are not supported by all the client nodes.
+
+In these scenarios, the `nodeAffinity` becomes inaccurate,
+causing the scheduler to make decisions based on outdated information.
+This results in pods:
+* being scheduled to nodes that cannot access the volume, getting stuck in a `ContainerCreating` state;
+* or being rejected from nodes that actually can access the volume, getting stuck in a `Pending` state.
+
+By making the `PersistentVolume.spec.nodeAffinity` field mutable,
+we give storage providers a chance to propagate the latest accessibility requirement to the scheduler,
+improving the reliability of stateful pod scheduling.
+
+### Goals
+
+- Make `PersistentVolume.spec.nodeAffinity` field mutable.
+- Enable CSI drivers to return a new accessibility requirement on ControllerModifyVolume.
+
+### Non-Goals
+
+- Modifying the core scheduling logic of Kubernetes.
+- Implementing cloud provider-specific solutions within Kubernetes core.
+- Re-scheduling running pods with volumes being modified,
+  or interrupting workloads in any way.
+
+## Proposal
+
+1. Change APIServer validation to allow `PersistentVolume.spec.nodeAffinity` to be mutable.
+
+2. 
Change CSI Specification to allow `ControllerModifyVolume` to return a new accessibility requirement. + +3. Change external-resizer to set `PersistentVolume.spec.nodeAffinity` to the new accessibility requirement if returned by CSI driver. + +### User Stories (Optional) + +#### Story 1 + +As the owner of a stateful workload, I want to take advantage of the new +regional storage provided by the storage provider, +to improve the availability of my application. +I need a way to tell scheduler that the volume is now accessible in every zone, +so that the pod can be scheduled to another zone when the current zone is down. + +#### Story 2 + +As a cluster operator, I'm conducting an upgrade to new storage category provided by our storage provider. +However, once upgraded, the volume cannot be attached to some legacy nodes in the cluster. +I need a way to convey this new requirement to the scheduler, +so that my pod will not getting stuck in a `ContainerCreating` state. + +### Notes/Constraints/Caveats (Optional) + + + +`PersistentVolume.spec.nodeAffinity.required` should works like +`Pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution`. +It is never re-evaluated when the pod is already running. +It is storage provider's responsibility to ensure that the running workload is not interrupted. + +Because of the limitation of protobuf (used by CSI spec), we cannot differentiate between +unset or empty repeated fields. So we cannot clear the nodeAffinity field. +If the storage provider want to indicate the volume is accessible everywhere, +it may return a very generic nodeAffinity, such as: topology.kubernetes.io/region. + +### Risks and Mitigations + + + +## Design Details + + + +We will extend CSI specification to add: +```protobuf +message ControllerModifyVolumeResponse { + option (alpha_message) = true; + + // Specifies where (regions, zones, racks, etc.) the modified + // volume is accessible from. 
Replaces the accessible_topology + // returned by CreateVolume. + // A plugin that returns this field MUST also set the + // VOLUME_ACCESSIBILITY_CONSTRAINTS plugin capability. + // An SP MAY specify multiple topologies to indicate the volume is + // accessible from multiple locations. + // COs MAY use this information along with the topology information + // returned by NodeGetInfo to ensure that a given volume is accessible + // from a given node when scheduling workloads. + // This field is OPTIONAL. If it is not specified, the CO SHOULD assume + // the topology is not changed by this modify volume request. + repeated Topology accessible_topology = 5; +} +``` + +When this new field is set, external-resizer will set `PersistentVolume.spec.nodeAffinity` accordingly, before it updates the PV status. + +A new error condition of `ControllerModifyVolume` is added to CSI spec: + +| Condition | gRPC Code | Description | Recovery Behavior | +|-----------|-----------|-------------|-------------------| +| Topology conflict | 9 FAILED_PRECONDITION | Indicates that the CO has requested a modification that would make the volume inaccessible to some already attached nodes. | Caller MAY detach the volume from the nodes that are in conflict and retry. | + +But this KEP does not cover the automatic correction. Kubernetes should only retry with exponential backoff. + + +Scheduler Enhancements: make sure the Pod is re-queued when the PV is updated. + +### Test Plan + + + +[x] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. 
+ +##### Prerequisite testing updates + + + +##### Unit tests + + + + + +- `pkg/apis/core/validation`: 2025-06-02 - 84.7% + +##### Integration tests + + + + + +- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) + +##### e2e tests + + + +- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/e2e/...): [SIG ...](https://testgrid.k8s.io/sig-...?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) + +- Test modifing VolumeAttributeClass can properly update PV `nodeAffinity`. + +### Graduation Criteria + + + +### Upgrade / Downgrade Strategy + + + +### Version Skew Strategy + + + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [ ] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: + - Components depending on the feature gate: +- [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? + +###### Does enabling the feature change any default behavior? + + + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +###### What happens if we reenable the feature if it was previously rolled back? + +###### Are there any tests for feature enablement/disablement? + + + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? 
+ + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +###### Will enabling / using this feature result in introducing new API types? + + + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? 
+ + + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +## Drawbacks + + + +## Alternatives + + + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-storage/5381-mutable-pv-affinity/kep.yaml b/keps/sig-storage/5381-mutable-pv-affinity/kep.yaml new file mode 100644 index 00000000000..a2e457f1f79 --- /dev/null +++ b/keps/sig-storage/5381-mutable-pv-affinity/kep.yaml @@ -0,0 +1,48 @@ +title: Mutable PersistentVolume Node Affinity +kep-number: 5381 +authors: + - "@huww98" +owning-sig: sig-storage +participating-sigs: + - sig-api-machinery + - sig-scheduling + - sig-storage +status: provisional +creation-date: 2025-06-02 +reviewers: + - "@gnuified" + - "@jsafrane" + - "@msau42" + - "@xing-yang" +approvers: + - "@xing-yang" + - "@msau42" + +# The target maturity stage in the current dev cycle for this KEP. +# If the purpose of this KEP is to deprecate a user-visible feature +# and a Deprecated feature gates are added, they should be deprecated|disabled|removed. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.34" + +# The milestone at which this feature was, or is targeted to be, at each stage. 
+milestone: + alpha: "v1.34" + beta: "v1.35" + stable: "v1.37" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: MutablePVAffinity + components: + - kube-apiserver + - external-resizer +disable-supported: true + +# The following PRR answers are required at beta release +# metrics: +# - my_feature_metric From 0602a5f744b8e4e201d7bd90eb69e67f1b9baf62 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=83=A1=E7=8E=AE=E6=96=87?= Date: Fri, 13 Jun 2025 18:01:58 +0800 Subject: [PATCH 2/7] address comments from sig-storage meeting --- .../5381-mutable-pv-affinity/README.md | 143 ++++++++++++++++-- 1 file changed, 134 insertions(+), 9 deletions(-) diff --git a/keps/sig-storage/5381-mutable-pv-affinity/README.md b/keps/sig-storage/5381-mutable-pv-affinity/README.md index f586fe5c7ae..b95e65575bf 100644 --- a/keps/sig-storage/5381-mutable-pv-affinity/README.md +++ b/keps/sig-storage/5381-mutable-pv-affinity/README.md @@ -212,6 +212,36 @@ to improve the availability of my application. I need a way to tell scheduler that the volume is now accessible in every zone, so that the pod can be scheduled to another zone when the current zone is down. +In this case, the old affinity would be: +```yaml +required: + nodeSelectorTerms: + - matchExpressions: + - key: topology.kubernetes.io/zone + operator: In + values: + - cn-beijing-g +``` +The updated affinity would be: +```yaml +required: + nodeSelectorTerms: + - matchExpressions: + - key: topology.kubernetes.io/region + operator: In + values: + - cn-beijing +``` + +Or in the view of CSI accessibility requirement, from: +```json +[{"topology.kubernetes.io/zone": "cn-beijing-g"}] +``` +to: +```json +[{"topology.kubernetes.io/region": "cn-beijing"}] +``` + #### Story 2 As a cluster operator, I'm conducting an upgrade to new storage category provided by our storage provider. 
@@ -219,6 +249,41 @@ However, once upgraded, the volume cannot be attached to some legacy nodes in th I need a way to convey this new requirement to the scheduler, so that my pod will not getting stuck in a `ContainerCreating` state. +In this case, the old affinity would be: +```yaml +required: + nodeSelectorTerms: + - matchExpressions: + - key: provider.com/disktype.cloud_ssd + operator: In + values: + - available +``` +The updated affinity would be: +```yaml +required: + nodeSelectorTerms: + - matchExpressions: + - key: provider.com/disktype.cloud_essd + operator: In + values: + - available +``` + +Or in the view of CSI accessibility requirement, from: +```json +[{"provider.com/disktype.cloud_ssd": "available"}] +``` +to: +```json +[{"provider.com/disktype.cloud_essd": "available"}] +``` + +Type A node only supports cloud_ssd, while Type B node supports both cloud_ssd and cloud_essd. +We will only allow the modification if the volume is attached to type B nodes. +And I want to make sure the Pods using new cloud_essd volume not to be scheduled to type A nodes. 
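The scheduling effect in this story can be sketched with a simplified matcher: a node is feasible for the pod only if its labels satisfy every requirement in the PV's updated node affinity. This is an illustrative Go sketch covering only the `In` operator, not the actual scheduler or VolumeBinding plugin code:

```go
package main

import "fmt"

// requirement models a single matchExpressions entry using the In operator
// (simplified from the Kubernetes NodeSelectorRequirement type).
type requirement struct {
	Key    string
	Values []string
}

// matches reports whether the node's labels satisfy every requirement,
// i.e. whether the node can access the volume per its nodeAffinity.
func matches(nodeLabels map[string]string, reqs []requirement) bool {
	for _, r := range reqs {
		v, ok := nodeLabels[r.Key]
		if !ok {
			return false
		}
		found := false
		for _, want := range r.Values {
			if v == want {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	// After the modification, the PV requires cloud_essd support.
	reqs := []requirement{{Key: "provider.com/disktype.cloud_essd", Values: []string{"available"}}}

	typeA := map[string]string{"provider.com/disktype.cloud_ssd": "available"}
	typeB := map[string]string{
		"provider.com/disktype.cloud_ssd":  "available",
		"provider.com/disktype.cloud_essd": "available",
	}
	// Type A nodes are rejected; type B nodes remain feasible.
	fmt.Println(matches(typeA, reqs), matches(typeB, reqs))
}
```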
+ + ### Notes/Constraints/Caveats (Optional) ## Alternatives +Instead of adding new fields to CSI GRPC `ControllerModifyVolume`, we could add a new GRPC `ControllerModifyVolumeTopology`: + +```protobuf +rpc ControllerModifyVolumeTopology (ControllerModifyVolumeTopologyRequest) + returns (ControllerModifyVolumeTopologyResponse) { + option (alpha_method) = true; + } + +message ControllerModifyVolumeTopologyRequest { + option (alpha_message) = true; + string volume_id = 1; + map secrest = 2 [(csi_secret) = true]; + map mutable_parameters = 3; + TopologyRequirement accessibility_requirements = 4; +} + +message ControllerModifyVolumeTopologyResponse { + option (alpha_message) = true; + repeated Topology accessible_topology = 1; + bool in_progress = 2; +} +``` + +The workflow of this new GRPC is essentially the same as the current `ControllerModifyVolume` GRPC, but it allows SPs to mutate the accessible +topologies of volumes by default. + +SPs with the `MODIFY_VOLUME_TOPOLOGY` controller capability should use this new GRPC instead of `ControllerModifyVolume` when modifying volumes. + +Comparison between these two approaches: +| Criteria | PR 592 (Extended GRPC) | PR 593 (New GRPC) | +| -------- | ---------------------- | ----------------- | +| Maintenance Difficulty | ✅ Low | ⚠️ High, need to also modify ControllerModifyVolumeTopology when making changes to ControllerModifyVolume | +| Implementation Complexity | ✅ Low | ⚠️ High, SPs will have to implement a new GRPC if they want to support topology modification even if they have implemented ControllerModifyVolume | +| Side Effects | ⚠️ Will impede the GA process of K8s VAC | ✅ No influence on other features | @@ -368,22 +370,6 @@ message ControllerModifyVolumeRequest { // Indicates whether the CO allows the SP to update the topology // as a part of the modification. bool allow_topology_updates = 4; - - // Specifies where (regions, zones, racks, etc.) the - // volume MUST be accessible from after modification. 
- // COs SHALL only specify topological accessibility - // information supported by the SP. - // This field is OPTIONAL. - // This field SHALL NOT be specified unless the SP has the - // VOLUME_ACCESSIBILITY_CONSTRAINTS plugin capability, the - // MODIFY_VOLUME_TOPOLOGY controller capability and - // allow_topology_updates is set to true. - // If this field is not specified and the SP has the - // VOLUME_ACCESSIBILITY_CONSTRAINTS plugin capability, the - // MODIFY_VOLUME_TOPOLOGY controller capability and - // allow_topology_updates is set to true, the SP MAY - // modify the volume topology according to the mutable_parameters. - TopologyRequirement accessibility_requirements = 5; } message ControllerModifyVolumeResponse { @@ -997,7 +983,10 @@ Why should this KEP _not_ be implemented? --> ## Alternatives -Instead of adding new fields to CSI GRPC `ControllerModifyVolume`, we could add a new GRPC `ControllerModifyVolumeTopology`: + +### New GRPC + +Instead of adding new fields to CSI GRPC `ControllerModifyVolume`, we could add a new GRPC `ControllerModifyVolumeTopology` (Other candidate names: `ControllerMigrateVolume`): ```protobuf rpc ControllerModifyVolumeTopology (ControllerModifyVolumeTopologyRequest) @@ -1010,7 +999,6 @@ message ControllerModifyVolumeTopologyRequest { string volume_id = 1; map secrest = 2 [(csi_secret) = true]; map mutable_parameters = 3; - TopologyRequirement accessibility_requirements = 4; } message ControllerModifyVolumeTopologyResponse { @@ -1023,15 +1011,35 @@ message ControllerModifyVolumeTopologyResponse { The workflow of this new GRPC is essentially the same as the current `ControllerModifyVolume` GRPC, but it allows SPs to mutate the accessible topologies of volumes by default. -SPs with the `MODIFY_VOLUME_TOPOLOGY` controller capability should use this new GRPC instead of `ControllerModifyVolume` when modifying volumes. 
+SPs with the `MODIFY_VOLUME_TOPOLOGY` controller capability should implement both this new GRPC and `ControllerModifyVolume`. +New COs that support modify volume topology (i.e. external-resizer) should only call the new GRPC when modifying volumes. +Old COs can continue to call `ControllerModifyVolume`. SPs should reject such requests if topology will be changed. Comparison between these two approaches: -| Criteria | PR 592 (Extended GRPC) | PR 593 (New GRPC) | +| Criteria | [PR 592](https://github.com/container-storage-interface/spec/pull/592) (Extended GRPC) | [PR 593](https://github.com/container-storage-interface/spec/pull/593) (New GRPC) | | -------- | ---------------------- | ----------------- | | Maintenance Difficulty | ✅ Low | ⚠️ High, need to also modify ControllerModifyVolumeTopology when making changes to ControllerModifyVolume | | Implementation Complexity | ✅ Low | ⚠️ High, SPs will have to implement a new GRPC if they want to support topology modification even if they have implemented ControllerModifyVolume | | Side Effects | ⚠️ Will impede the GA process of K8s VAC | ✅ No influence on other features | +### User Specified Topology Requirement + +Currently we don't support user specified topology requirement. +We've considered a design: +* Add `accessibility_requirements` in `ModifyVolumeRequest`, like that in `CreateVolumeRequest` +* Add `allowedTopologies` in `VolumeAttributeClass`, like that in `StorageClass` + +But facing a lot of unresolved questions: +* How to merge `allowedTopologies` from `VolumeAttributeClass`, `StorageClass`? +* Should we use `allowedTopologies` from `StorageClass` if it is not specified in `VolumeAttributeClass`? +* Should we consider the topology of the currently attached nodes? +* Should we consider the topology of all the nodes in the cluster? + +In most cases, SP can determine the desired topology of a volume from the `mutable_parameters`, or from the currently attached nodes. 
+An exception could be: modifying a volume from regional to zonal, and it is not attached to any node. +In this case, SP will need more information from the CO to determine the desired zone. +But we don't have such use case now, we decided leave it as a future work. + @@ -224,25 +226,40 @@ required: values: - cn-beijing-g ``` -The updated affinity would be: -```yaml -required: - nodeSelectorTerms: - - matchExpressions: - - key: topology.kubernetes.io/region - operator: In - values: - - cn-beijing -``` - -Or in the view of CSI accessibility requirement, from: +Or in the view of CSI accessibility requirement: ```json [{"topology.kubernetes.io/zone": "cn-beijing-g"}] ``` -to: -```json -[{"topology.kubernetes.io/region": "cn-beijing"}] -``` + +The workflow: +1. User create a `VolumeAttributeClass`: + ```yaml + apiVersion: storage.k8s.io/v1beta1 + kind: VolumeAttributesClass + metadata: + name: regional + driverName: csi.provider.com + parameters: + type: regional + ``` +2. User modify the `volumeAttributesClassName` in the PVC to `regional` +3. external-resizer initiate ControllerModifyVolume with `allow_topology_updates` set to true, `mutable_parameters` set to `{"type": "regional"}` +4. CSI driver blocks until the modification finished, then return with `accessible_topology` set to + ```json + [{"topology.kubernetes.io/region": "cn-beijing"}] + ``` +5. external-resizer sets `PersistentVolume.spec.nodeAffinity` to + ```yaml + required: + nodeSelectorTerms: + - matchExpressions: + - key: topology.kubernetes.io/region + operator: In + values: + - cn-beijing + ``` + then update the PV status to indicate the modification is successful. 
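Step 5 of the workflow above, converting the CSI `accessible_topology` into a node affinity selector, can be sketched as follows: each CSI `Topology` becomes one node selector term (terms are ORed by the scheduler), and each segment key/value pair becomes an `In` expression (ANDed within a term). This is an illustrative sketch, not the actual external-resizer implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// term renders one NodeSelectorTerm as "key In [value]" strings
// (simplified from the Kubernetes API types for illustration).
type term []string

// topologyToTerms converts CSI accessible_topology segments (one map per
// Topology message) into node selector terms. Terms are ORed by the
// scheduler; expressions within a term are ANDed.
func topologyToTerms(topologies []map[string]string) []term {
	terms := make([]term, 0, len(topologies))
	for _, segments := range topologies {
		keys := make([]string, 0, len(segments))
		for k := range segments {
			keys = append(keys, k)
		}
		sort.Strings(keys) // deterministic expression order
		t := make(term, 0, len(keys))
		for _, k := range keys {
			t = append(t, fmt.Sprintf("%s In [%s]", k, segments[k]))
		}
		terms = append(terms, t)
	}
	return terms
}

func main() {
	// The regional topology returned by the driver in step 4.
	got := topologyToTerms([]map[string]string{
		{"topology.kubernetes.io/region": "cn-beijing"},
	})
	fmt.Println(got)
}
```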
+ #### Story 2 @@ -261,30 +278,51 @@ required: values: - available ``` -The updated affinity would be: -```yaml -required: - nodeSelectorTerms: - - matchExpressions: - - key: provider.com/disktype.cloud_essd - operator: In - values: - - available -``` - -Or in the view of CSI accessibility requirement, from: +Or in the view of CSI accessibility requirement: ```json [{"provider.com/disktype.cloud_ssd": "available"}] ``` -to: -```json -[{"provider.com/disktype.cloud_essd": "available"}] -``` Type A node only supports cloud_ssd, while Type B node supports both cloud_ssd and cloud_essd. We will only allow the modification if the volume is attached to type B nodes. And I want to make sure the Pods using new cloud_essd volume not to be scheduled to type A nodes. +In this case, it takes long to modify the volume, the new topology is not strictly less restrictive, +and SP wants to minimize the time window of the race condition: + +The workflow: +1. User create a `VolumeAttributeClass`: + ```yaml + apiVersion: storage.k8s.io/v1beta1 + kind: VolumeAttributesClass + metadata: + name: essd + driverName: csi.provider.com + parameters: + type: cloud_essd + ``` +2. User modify the `volumeAttributesClassName` in the PVC to `essd` +3. external-resizer initiate ControllerModifyVolume with `allow_topology_updates` set to true, `mutable_parameters` set to `{"type": "cloud_essd"}` +4. CSI driver returns with `in_progress` set to true, and `accessible_topology` set to + ```json + [{"provider.com/disktype.cloud_essd": "available"}] + ``` +5. external-resizer sets `PersistentVolume.spec.nodeAffinity` to + ```yaml + required: + nodeSelectorTerms: + - matchExpressions: + - key: provider.com/disktype.cloud_essd + operator: In + values: + - available + ``` + but the PV status is not updated yet. + From now on, the new Pod will be scheduled to nodes with `provider.com/disktype.cloud_essd: available`, + maybe they will stuck in `ContainerCreating` state until the modification finishes. +6. 
external-resizer go back to step 3, retries until `in_progress` is set to false. +7. external-resizer update the PV status to indicate the modification is successful. + ### Notes/Constraints/Caveats (Optional) @@ -390,11 +428,14 @@ message ControllerModifyVolumeResponse { // If it is not specified, the CO MAY assume // the volume is equally accessible from all nodes in the cluster and // MAY schedule workloads referencing the volume on any available - // node. + // node. // // SP MUST only set this field if allow_topology_updates is set // in the request. SP SHOULD fail the request if it needs to update // topology but is not allowed by CO. + // + // SP SHOULD only return topology that is a super-set of the + // original topology to avoid race conditions when scheduling. repeated Topology accessible_topology = 1; // Indicates whether the modification is still in progress. @@ -422,42 +463,6 @@ But this KEP does not cover the automatic correction. Kubernetes should only ret Scheduler Enhancements: make sure the Pod is re-queued when the PV is updated. -A typical workflow is (taking the user story 1 as an example): -1. User create a `VolumeAttributeClass`: - ```yaml - apiVersion: storage.k8s.io/v1beta1 - kind: VolumeAttributesClass - metadata: - name: regional - driverName: csi.provider.com - parameters: - type: regional - ``` -2. User modify the `volumeAttributesClassName` in the PVC to `regional` -3. external-resizer initiate ControllerModifyVolume with `allow_topology_updates` set to true, `mutable_parameters` set to `{"type": "regional"}` -4. CSI driver blocks until the modification finished, then return with `accessible_topology` set to `[{"topology.kubernetes.io/region": "cn-beijing"}]` -5. external-resizer sets `PersistentVolume.spec.nodeAffinity` accordingly, then update the PV status to indicate the modification is successful. 
- -If it takes long to modify the volume, the new topology is not strictly less restrictive, -and SP wants to minimize the time window of the race condition (taking the user story 2 as an example): -1. User create a `VolumeAttributeClass`: - ```yaml - apiVersion: storage.k8s.io/v1beta1 - kind: VolumeAttributesClass - metadata: - name: essd - driverName: csi.provider.com - parameters: - type: cloud_essd - ``` -2. User modify the `volumeAttributesClassName` in the PVC to `essd` -3. external-resizer initiate ControllerModifyVolume with `allow_topology_updates` set to true, `mutable_parameters` set to `{"type": "cloud_essd"}` -4. CSI driver returns with `in_progress` set to true, and `accessible_topology` set to `[{"provider.com/disktype.cloud_essd": "available"}]` -5. external-resizer sets `PersistentVolume.spec.nodeAffinity` accordingly, but the PV status is not updated yet. - From now on, the new Pod will be scheduled to nodes with `provider.com/disktype.cloud_essd: available`, - maybe they will stuck in `ContainerCreating` state until the modification finishes. -6. external-resizer go back to step 3, retries until `in_progress` is set to false. -7. external-resizer update the PV status to indicate the modification is successful. ### Test Plan @@ -1035,7 +1040,7 @@ rpc ControllerModifyVolumeTopology (ControllerModifyVolumeTopologyRequest) message ControllerModifyVolumeTopologyRequest { option (alpha_message) = true; string volume_id = 1; - map secrest = 2 [(csi_secret) = true]; + map secret = 2 [(csi_secret) = true]; map mutable_parameters = 3; } @@ -1067,16 +1072,100 @@ We've considered a design: * Add `accessibility_requirements` in `ModifyVolumeRequest`, like that in `CreateVolumeRequest` * Add `allowedTopologies` in `VolumeAttributeClass`, like that in `StorageClass` -But facing a lot of unresolved questions: +We determine this lacks vaild use cases. 
+
+In most cases, the SP can determine the desired topology of a volume from the `mutable_parameters`, or from the currently attached nodes.
+An exception could be modifying a volume from regional to zonal while it is not attached to any node.
+In this case, the SP may need more information from the CO to determine the desired zone.
+But we don't want to force users to create a separate VAC for each zone.
+Instead, it may be easier for the user to simply attach the volume to a node, so that the SP can determine the desired zone.
+
+For the other case (User Story 2), the topology key (provider.com/disktype.cloud_essd) is actually not intended as a user-facing API.
+Users just want to modify the disk type; we implement the underlying limitations as topology.
+So we don't want to let users specify this topology key directly.
+
+Besides, this design faces a lot of unresolved questions:
 * How to merge `allowedTopologies` from `VolumeAttributeClass`, `StorageClass`?
 * Should we use `allowedTopologies` from `StorageClass` if it is not specified in `VolumeAttributeClass`?
 * Should we consider the topology of the currently attached nodes?
 * Should we consider the topology of all the nodes in the cluster?
 
-In most cases, SP can determine the desired topology of a volume from the `mutable_parameters`, or from the currently attached nodes.
-An exception could be: modifying a volume from regional to zonal, and it is not attached to any node.
-In this case, SP will need more information from the CO to determine the desired zone.
-But we don't have such use case now, we decided leave it as a future work.
+We may consider this again when valid use cases emerge.
+
+### Support for SPs that don't Know Attached Nodes
+
+There may be some SPs that don't know the currently attached nodes,
+so they cannot determine whether a topology change will break any workload.
+
+Some kinds of storage do not keep a persistent connection between client and server, such as object storage like S3.
+They may fall into this category of SP.
+But as network-attached storage, they can be accessed from wherever the network reaches. +So these SPs typically do not use the topology feature at all. + +So, we decided that for an SP to support this feature, +it is required to properly detect potential breakage of existing workloads. + +That said, the candidate design looks like this: +Add a new `dry_run` parameter to the `ControllerModifyVolumeRequest`. +The CO first calls `ControllerModifyVolume` with `dry_run=true` to fetch the new topology, +determines if the new topology is compatible with the existing workloads, +then decides whether to proceed with the modification with `dry_run=false`. + +Another way to get the new topology is to further extend the "User Specified Topology Requirement" design, +making it required for the user to explicitly specify the new topology in the VAC and +removing `accessible_topology` from `ControllerModifyVolumeResponse`. +That is to say, the SP must accept the new topology specified by the user, or it should reject the request. +The workflow will become: +1. User creates a `VolumeAttributeClass`: + ```yaml + apiVersion: storage.k8s.io/v1beta1 + kind: VolumeAttributesClass + metadata: + name: regional + driverName: csi.provider.com + parameters: + type: regional + allowedTopologies: + - matchLabelExpressions: + - key: topology.kubernetes.io/region + values: + - cn-beijing + ``` +2. User modifies the `volumeAttributesClassName` in the PVC to `regional` +3. external-resizer verifies that all the nodes with this volume attached satisfy the `allowedTopologies` +4. external-resizer sets `PersistentVolume.spec.nodeAffinity` to + ```yaml + required: + nodeSelectorTerms: + - matchExpressions: + - key: topology.kubernetes.io/region + operator: In + values: + - cn-beijing + ``` +5. external-resizer initiates ControllerModifyVolume with `allow_topology_updates` set to true, `mutable_parameters` set to `{"type": "regional"}` +6. CSI driver blocks until the modification finishes +7. 
external-resizer then updates the PV status to indicate the modification is successful. + +Besides the reasons mentioned above, we also face a critical drawback in this design: +Topology can have many orthogonal aspects, such as the above-mentioned zone/region and disk type. +If the SP cannot return the topology, users will need to be aware of all aspects of topology used by the SP. +And the SP will not be able to extend the topology in the future, since VAC is immutable. + +Note that the above designs are also racy. +We may still break some workloads that start after the check but before the modification. + +### Try to Eliminate Race Condition + +We haven't figured out a reasonable way to do this. +Basically, we need to +1. pause scheduling of the pods that are using the volume; +2. get an ack from the scheduler that the pause is effective; +3. conduct ModifyVolume; +4. retrieve the new topology, then resume scheduling. + +We think this is too complex and not worth it. +A similar failure is already possible with KEP-4876 when the CSINode allocatable is out-of-date. @@ -180,10 +177,10 @@ improving the reliability of stateful pod scheduling. ### Goals - Make `PersistentVolume.spec.nodeAffinity` field mutable. -- Enable CSI drivers to return a new accessibility requirement on ControllerModifyVolume. ### Non-Goals +- Enable CSI drivers to return a new accessibility requirement on ControllerModifyVolume. - Modifying the core scheduling logic of Kubernetes. - Implementing cloud provider-specific solutions within Kubernetes core. - Re-scheduling running pods with volumes being modified, @@ -201,10 +198,7 @@ nitty-gritty. --> 1. Change APIServer validation to allow `PersistentVolume.spec.nodeAffinity` to be mutable. - -2. Change CSI Specification to allow `ControllerModifyVolume` to return a new accessibility requirement. - -3. Change external-resizer to set `PersistentVolume.spec.nodeAffinity` to the new accessibility requirement if returned by CSI driver. +2. 
Ensure the scheduler re-schedules pending pods that use a PV that has been updated. ### User Stories (Optional) @@ -226,40 +220,17 @@ required: values: - cn-beijing-g ``` -Or in the view of CSI accessibility requirement: -```json -[{"topology.kubernetes.io/zone": "cn-beijing-g"}] +We would like to change it to: +```yaml +required: + nodeSelectorTerms: + - matchExpressions: + - key: topology.kubernetes.io/region + operator: In + values: + - cn-beijing ``` - -The workflow: -1. User create a `VolumeAttributeClass`: - ```yaml - apiVersion: storage.k8s.io/v1beta1 - kind: VolumeAttributesClass - metadata: - name: regional - driverName: csi.provider.com - parameters: - type: regional - ``` -2. User modify the `volumeAttributesClassName` in the PVC to `regional` -3. external-resizer initiate ControllerModifyVolume with `allow_topology_updates` set to true, `mutable_parameters` set to `{"type": "regional"}` -4. CSI driver blocks until the modification finished, then return with `accessible_topology` set to - ```json - [{"topology.kubernetes.io/region": "cn-beijing"}] - ``` -5. external-resizer sets `PersistentVolume.spec.nodeAffinity` to - ```yaml - required: - nodeSelectorTerms: - - matchExpressions: - - key: topology.kubernetes.io/region - operator: In - values: - - cn-beijing - ``` - then update the PV status to indicate the modification is successful. - +manually for now, and hopefully this will be integrated into CSI in the future. #### Story 2 @@ -278,50 +249,21 @@ required: values: - available ``` -Or in the view of CSI accessibility requirement: -```json -[{"provider.com/disktype.cloud_ssd": "available"}] -``` Type A node only supports cloud_ssd, while Type B node supports both cloud_ssd and cloud_essd. We will only allow the modification if the volume is attached to type B nodes. And I want to make sure the Pods using new cloud_essd volume not to be scheduled to type A nodes.
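The manual affinity update that these user stories rely on could be applied with a JSON merge patch once `spec.nodeAffinity` becomes mutable. A minimal sketch, where the PV name `data-pv` and the regional affinity from Story 1 are illustrative assumptions (apply with `kubectl patch pv data-pv --type merge --patch-file pv-affinity-patch.yaml`):

```yaml
# pv-affinity-patch.yaml -- hypothetical merge patch; the PV name and the
# region value are assumptions for illustration.
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/region
          operator: In
          values:
          - cn-beijing
```

With a JSON merge patch, the `nodeSelectorTerms` list is replaced wholesale, which is the intent here: the old zonal term is dropped in favor of the regional one.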
-In this case, it takes long to modify the volume, the new topology is not strictly less restrictive, -and SP wants to minimize the time window of the race condition: - -The workflow: -1. User create a `VolumeAttributeClass`: - ```yaml - apiVersion: storage.k8s.io/v1beta1 - kind: VolumeAttributesClass - metadata: - name: essd - driverName: csi.provider.com - parameters: - type: cloud_essd - ``` -2. User modify the `volumeAttributesClassName` in the PVC to `essd` -3. external-resizer initiate ControllerModifyVolume with `allow_topology_updates` set to true, `mutable_parameters` set to `{"type": "cloud_essd"}` -4. CSI driver returns with `in_progress` set to true, and `accessible_topology` set to - ```json - [{"provider.com/disktype.cloud_essd": "available"}] - ``` -5. external-resizer sets `PersistentVolume.spec.nodeAffinity` to - ```yaml - required: - nodeSelectorTerms: - - matchExpressions: - - key: provider.com/disktype.cloud_essd - operator: In - values: - - available - ``` - but the PV status is not updated yet. - From now on, the new Pod will be scheduled to nodes with `provider.com/disktype.cloud_essd: available`, - maybe they will stuck in `ContainerCreating` state until the modification finishes. -6. external-resizer go back to step 3, retries until `in_progress` is set to false. -7. external-resizer update the PV status to indicate the modification is successful. +We would like to change the affinity to: +```yaml +required: + nodeSelectorTerms: + - matchExpressions: + - key: provider.com/disktype.cloud_essd + operator: In + values: + - available +``` ### Notes/Constraints/Caveats (Optional) @@ -341,22 +283,14 @@ It is storage provider's responsibility to ensure that the running workload is n **Possible race condition** There is a race condition between volume modification and pod scheduling: -1. User updates PVC VolumeAttributeClassName. -2. external-resizer initiate ControllerModifyVolume. -3. 
User creates a new Pod and scheduler schedules it with the old affinity. -4. external-resizer sets a new affinity to the PV. +1. User modifies the volume via the storage provider. +2. A new Pod is created and the scheduler schedules it with the old affinity. +3. User sets the new affinity to the PV. 5. KCM/external-attacher attaches the volume to the node, and find the affinity mismatch. If this happens, the pod will be stuck in a `ContainerCreating` state. User will have to manually delete the pod, or using Kubernetes [descheduler](https://github.com/kubernetes-sigs/descheduler) or similar. -We propose to added an `in_progress` field in `ControllerModifyVolumeResponse` to allow more timely affinity update, -in order to reduce the time window of the race condition. -If the volume is not attachable during the modification, SPs can also return `in_progress` -with an impossible topology requirement, so that new pods will stay pending. -```json -[{"provider.com/volume-status": "modifying"}] -``` ### Risks and Mitigations @@ -381,89 +315,6 @@ required) or even code snippets. If there's any ambiguity about HOW your proposal will be implemented, this is the place to discuss them. --> -We will extend CSI specification to add: -```protobuf -// Specifies a capability of the controller service. -message ControllerServiceCapability { - message RPC { - enum Type { - ... - // TODO - MODIFY_VOLUME_TOPOLOGY = 16 [(alpha_enum_value) = true]; - } - - Type type = 1; - } - - oneof type { - // RPC that the controller supports. - RPC rpc = 1; - } -} - -message ControllerModifyVolumeRequest { - option (alpha_message) = true; - ... - - // Indicates whether the CO allows the SP to update the topology - // as a part of the modification. - bool allow_topology_updates = 4; -} - -message ControllerModifyVolumeResponse { - option (alpha_message) = true; - - // Specifies where (regions, zones, racks, etc.) the modified - // volume is accessible from. 
- // A plugin that returns this field MUST also set the - // VOLUME_ACCESSIBILITY_CONSTRAINTS plugin capability. - // An SP MAY specify multiple topologies to indicate the volume is - // accessible from multiple locations. - // COs MAY use this information along with the topology information - // returned by NodeGetInfo to ensure that a given volume is accessible - // from a given node when scheduling workloads. - // This field is OPTIONAL. It is effective and replaces the - // accessible_topology returned by CreateVolume if the plugin has - // MODIFY_VOLUME_TOPOLOGY controller capability. - // If it is not specified, the CO MAY assume - // the volume is equally accessible from all nodes in the cluster and - // MAY schedule workloads referencing the volume on any available - // node. - // - // SP MUST only set this field if allow_topology_updates is set - // in the request. SP SHOULD fail the request if it needs to update - // topology but is not allowed by CO. - // - // SP SHOULD only return topology that is a super-set of the - // original topology to avoid race conditions when scheduling. - repeated Topology accessible_topology = 1; - - // Indicates whether the modification is still in progress. - // SPs MAY set in_progress to update the accessible_topology - // before the modification finishes to reduce possible race conditions. - // COs SHOULD retry the request if in_progress is set to true, - // until in_progress is set to false. - bool in_progress = 2; -} -``` - -When this new field is set, external-resizer will set `PersistentVolume.spec.nodeAffinity` accordingly, before it updates the PV status. - -When anything unexpected happens (race between multiple resizer instances, crashes) and we lost track of the latest topology. -external-resizer will invoke `ControllerGetVolume` again with the desired `mutable_parameters` to fetch the latest topology. 
- -A new error condition of `ControllerModifyVolume` is added to CSI spec: - -| Condition | gRPC Code | Description | Recovery Behavior | -|-----------|-----------|-------------|-------------------| -| Topology conflict | 9 FAILED_PRECONDITION | Indicates that the CO has requested a modification that would make the volume inaccessible to some already attached nodes. | Caller MAY detach the volume from the nodes that are in conflict and retry. | - -But this KEP does not cover the automatic correction. Kubernetes should only retry with exponential backoff. - - -Scheduler Enhancements: make sure the Pod is re-queued when the PV is updated. - ### Test Plan @@ -1027,145 +878,13 @@ Why should this KEP _not_ be implemented? ## Alternatives -### New GRPC +### Integrate with the CSI spec and VolumeAttributeClass -Instead of adding new fields to CSI GRPC `ControllerModifyVolume`, we could add a new GRPC `ControllerModifyVolumeTopology` (Other candidate names: `ControllerMigrateVolume`): - -```protobuf -rpc ControllerModifyVolumeTopology (ControllerModifyVolumeTopologyRequest) - returns (ControllerModifyVolumeTopologyResponse) { - option (alpha_method) = true; - } - -message ControllerModifyVolumeTopologyRequest { - option (alpha_message) = true; - string volume_id = 1; - map secret = 2 [(csi_secret) = true]; - map mutable_parameters = 3; -} - -message ControllerModifyVolumeTopologyResponse { - option (alpha_message) = true; - repeated Topology accessible_topology = 1; - bool in_progress = 2; -} -``` +We proposed a plan for this integration in a previous version of this KEP. +But we did not reach consensus, due to a lack of SPs wanting to implement this feature. +The main concerns were about the race condition between scheduling and updating the PV. -The workflow of this new GRPC is essentially the same as the current `ControllerModifyVolume` GRPC, but it allows SPs to mutate the accessible -topologies of volumes by default.
- -SPs with the `MODIFY_VOLUME_TOPOLOGY` controller capability should implement both this new GRPC and `ControllerModifyVolume`. -New COs that support modify volume topology (i.e. external-resizer) should only call the new GRPC when modifying volumes. -Old COs can continue to call `ControllerModifyVolume`. SPs should reject such requests if topology will be changed. - -Comparison between these two approaches: -| Criteria | [PR 592](https://github.com/container-storage-interface/spec/pull/592) (Extended GRPC) | [PR 593](https://github.com/container-storage-interface/spec/pull/593) (New GRPC) | -| -------- | ---------------------- | ----------------- | -| Maintenance Difficulty | ✅ Low | ⚠️ High, need to also modify ControllerModifyVolumeTopology when making changes to ControllerModifyVolume | -| Implementation Complexity | ✅ Low | ⚠️ High, SPs will have to implement a new GRPC if they want to support topology modification even if they have implemented ControllerModifyVolume | -| Side Effects | ⚠️ Will impede the GA process of K8s VAC | ✅ No influence on other features | - -### User Specified Topology Requirement - -Currently we don't support user specified topology requirement. -We've considered a design: -* Add `accessibility_requirements` in `ModifyVolumeRequest`, like that in `CreateVolumeRequest` -* Add `allowedTopologies` in `VolumeAttributeClass`, like that in `StorageClass` - -We determine this lacks vaild use cases. - -In most cases, SP can determine the desired topology of a volume from the `mutable_parameters`, or from the currently attached nodes. -An exception could be: modifying a volume from regional to zonal, and it is not attached to any node. -In this case, SP may need more information from the CO to determine the desired zone. -But we don't want user to create a separate VAC for each zone. -Instead, it maybe easier for user to just attach it to a node, so that SP can determine the desired zone. 
- -For the other case (User Story 2), the topology (provider.com/disktype.cloud_essd) is actually not intended for user as an API. -User just want to modify the disk type, and we implement the underlying limitations as topology. -So we don't want to let user to specify this topology key directly. - -Besides, this facing a lot of unresolved questions: -* How to merge `allowedTopologies` from `VolumeAttributeClass`, `StorageClass`? -* Should we use `allowedTopologies` from `StorageClass` if it is not specified in `VolumeAttributeClass`? -* Should we consider the topology of the currently attached nodes? -* Should we consider the topology of all the nodes in the cluster? - -We may consider this again with vaild use cases. - -### Support for SPs that don't Know Attached Nodes - -Maybe there are some SPs that don't know the currently attached nodes, -so they cannot determine whether the topology change will break any workload. - -Some kind of storage does not have persistent connection between client and server, such as object storage like S3. -They may fall into this category of SP. -But as network attached storage, they can be accessed wherever the network can reach. -So these SPs typically do not use the topology feature at all. - -So, we decide that for an SP to support this feature, -they are required to properly detect potential breaking for existing workloads. - -That said, the candidate design looks: -Add a new `dry_run` parameter to the `ControllerModifyVolumeRequest`. -CO first call `ControllerModifyVolume` with `dry_run=true` to fetch the new topology, -determine if the new topology is compatible with the existing workloads, -then decide whether to proceed the modification with `dry_run=false`. - -Another way to get the new topology is further extending the "User Specified Topology Requirement" section, -Making it required for user to explicitly specify the new topology in the VAC and -remove `accessible_topology` from `ControllerModifyVolumeResponse`. 
-That is to say, SP must accept the new topology specified by user or it should reject the request. -The workflow will become: -1. User create a `VolumeAttributeClass`: - ```yaml - apiVersion: storage.k8s.io/v1beta1 - kind: VolumeAttributesClass - metadata: - name: regional - driverName: csi.provider.com - parameters: - type: regional - allowedTopologies: - - matchLabelExpressions: - - key: topology.kubernetes.io/region - values: - - cn-beijing - ``` -2. User modify the `volumeAttributesClassName` in the PVC to `regional` -3. external-resizer verifies all the nodes that all the nodes with this volume attached satisfy the `allowedTopologies` -4. external-resizer sets `PersistentVolume.spec.nodeAffinity` to - ```yaml - required: - nodeSelectorTerms: - - matchExpressions: - - key: topology.kubernetes.io/region - operator: In - values: - - cn-beijing - ``` -5. external-resizer initiate ControllerModifyVolume with `allow_topology_updates` set to true, `mutable_parameters` set to `{"type": "regional"}` -6. CSI driver blocks until the modification finished -7. external-resizer then update the PV status to indicate the modification is successful. - -Besides the reasons mentioned above, we also facing a critical drawback for this design: -Topology can have many orthogonal aspects, such as above mentioned zone/region and disk type. -If SP cannot return the topology, user will need to be aware of all aspects of topology used by SP. -And SP will not able to extend the topology in the future, since VAC is immutable. - -Note that the above designs are also racy. -We may still break some workloads that just started after the check but before the modification. - -### Try to Eliminate Race Condition - -Haven't figured out a reasonable way to do this. -Basically, we need to -1. pause scheduling of the pods that are using the volume; -2. get ack from scheduler that pause is effective; -3. conduct ModifyVolume; -4. retrive the new topology then resume scheduling. 
- -I think it is too complex and not worth it. -Similar failure is already possible with KEP-4876 when CSINode allocatable is out-of-date. +We will try this again in the future.