
KEP-5328: Node Capabilities #5347


Open
wants to merge 1 commit into
base: master
from


pravk03

@pravk03 pravk03 commented May 28, 2025

  • One-line PR description: Add the initial KEP for KEP 5328: Node Capabilities
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels May 28, 2025
@k8s-ci-robot
Contributor

Welcome @pravk03!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 28, 2025
@k8s-ci-robot
Contributor

Hi @pravk03. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 28, 2025
@pravk03 pravk03 marked this pull request as draft May 28, 2025 00:47
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 28, 2025
@pravk03 pravk03 force-pushed the node-capabilities branch 2 times, most recently from 59e7e54 to 4719180 Compare May 28, 2025 00:59
Member

@wojtek-t wojtek-t left a comment


@pravk03 pravk03 force-pushed the node-capabilities branch 3 times, most recently from 4c11e06 to 9254f9b Compare May 28, 2025 23:11
@pravk03 pravk03 changed the title KEP-5328: Node Capability Aware Scheduling KEP-5328: Node Capabilities May 28, 2025
@pravk03 pravk03 marked this pull request as ready for review May 28, 2025 23:14
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 28, 2025
@k8s-ci-robot k8s-ci-robot requested a review from mrunalp May 28, 2025 23:14
@pravk03
Author

pravk03 commented May 29, 2025

/cc @tallclair @yujuhong

@pravk03 pravk03 force-pushed the node-capabilities branch from 9254f9b to f8291a4 Compare May 29, 2025 01:06
@sanposhiho
Member

/sig scheduling

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label May 29, 2025
@SergeyKanzhelev
Member

Can we have any examples listed that will justify this? Right now the KEP suggests using it for FG-related capabilities, while not giving good examples of where it would be non-FG related.

The guaranteedQOSPodCPUResize example used in the KEP isn't purely a feature gate; it's a logical capability derived from a combination of feature gates and the Kubelet's cpuManagerPolicy configuration.

While this is still in early stages, this recent discussion about making the pod requirement for exclusive resources more explicit also indicates a need for non-FG capabilities. The API field itself should be forward-facing enough to support such potential use cases?

Those are all examples of FG-related capabilities. Not the generic long-term capabilities.

@pravk03 pravk03 force-pushed the node-capabilities branch from 5fb093d to a3e1436 Compare June 18, 2025 20:36
@tallclair
Member

It seems like most of the concerns with this are around the specific capabilities being added, but this KEP doesn't actually propose adding any capabilities. The examples given are hypothetical examples based on features currently in development, but no new features will be able to depend on capabilities until it goes to beta. This creates a bit of a chicken-and-egg situation, where it's hard to point to exactly how capabilities will be used until we have users lined up, but we can't line up users yet.

@SergeyKanzhelev
Member

SergeyKanzhelev commented Jun 18, 2025

It seems like most of the concerns with this are around the specific capabilities being added, but this KEP doesn't actually propose adding any capabilities. The examples given are hypothetical examples based on features currently in development, but no new features will be able to depend on capabilities until it goes to beta. This creates a bit of a chicken-and-egg situation, where it's hard to point to exactly how capabilities will be used until we have users lined up, but we can't line up users yet.

We kind of need to know what the expected use cases will be. Maybe past examples or hypothetical examples thought thru end-to-end. Right now this KEP is limited to just set of name/value pairs and a scenario of FG discoverability. But already we are thinking there MAY be need to support capabilities for node selection, ability to declare tolerations for capabilities, ability to have node-restricted capabilities. Knowing the scope would help us understand whether the proposed API is needed (among alternatives, if the set of use cases is limited) and, if it is needed, what shape it should have.

@pravk03 pravk03 force-pushed the node-capabilities branch from a3e1436 to f069f62 Compare June 18, 2025 23:55
@pravk03 pravk03 force-pushed the node-capabilities branch 3 times, most recently from 8d6230d to cd6d67e Compare June 19, 2025 17:18
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 19, 2025
@pravk03
Author

pravk03 commented Jun 19, 2025

Maybe past examples or hypothetical examples thought thru end-to-end

RuntimeClass was intended as a past example to illustrate non-FG-related runtime capabilities in the earlier version of the proposal. I agree that it had some missing details, and thanks for highlighting them in your comment.

  1. Runtime handlers as a list of handlers is also not a good fit. Default handler runc is not specified in pod spec. So it will not be used by scheduler and by definition must not be added to capabilities. Non-default handlers may need more details on what it is. And names list may not fit into the value length limits. Special object representing the runtime is a better choice here.

I have tried to address these in the Case Study section.

@pravk03 pravk03 requested a review from wojtek-t June 20, 2025 16:01
@tallclair
Member

Maybe past examples or hypothetical examples thought thru end-to-end. Right now this KEP is limited to just set of name/value pairs and a scenario of FG discoverability.

I feel like we've discussed these options in depth already. Yes, these are all somewhat hypothetical because we've had to work around them in other ways. I'm sure we can dig up more examples from past KEPs, but is that necessary?

Capabilities that are not limited to just feature gates:

  • swap enabled
  • static CPU / memory manager enabled
  • user namespace support

Feature gate capabilities:

  • pod-level resources
  • TLS for gRPC probes
  • in-place resize (+IPPR for pod-level resources, IPPR for static CPU assignment, etc)

But already we are thinking there MAY be need to support capabilities for node selection, ability to declare tolerations for capabilities,

Not sure what node selection means, but we've explicitly said tolerations are out of scope.

ability to have node-restricted capabilities.

Where did this come in? Capabilities are just added by the node, so I'm not sure what this would even mean.

@pravk03
Author

pravk03 commented Jun 20, 2025

We discussed this KEP today and decided to reconsider it for the 1.35 release cycle. The primary reason is to get input from sig-arch on using this capability-based framework as a general strategy for managing version skew.

A few more things that were discussed and could be refined in the proposal:

  1. Evaluate the strategy for managing capabilities with bounded lifetime. Define a clear lifecycle and deprecation path for capabilities tied to features that graduate to GA.
  2. We would need a better use-case to consider long-term capabilities in-scope. It can be considered a future enhancement once a clear use case arises.
  3. Further explore SemVer based filtering in Node Selectors as a potential alternative.

cc @tallclair @SergeyKanzhelev @dchen1107 @yujuhong

@pravk03
Author

pravk03 commented Jul 1, 2025

This proposal was discussed in the SIG-Arch community meeting on June 26th (recording). It was generally seen as a beneficial strategy for managing version skew. The key takeaways and action items from the discussion are as follows:

  1. The KEP can be scoped for temporary capabilities to solve version skew use cases. The configuration skew (like operating systems etc.) use-cases are less clear at the moment and may need explicit fields (like pod.spec.os).
  2. Make the temporary nature of capabilities more obvious. This can prevent other components (webhooks etc.) from depending on this API. Obfuscating the fields can be explored if required.
  3. A common library should be used to encapsulate the logic for inferring the capability requirements. This will ensure consistency between all consuming components, such as the kube-scheduler and admission controllers.
  4. Autoscaler support: The "client-side" problem (determining a pod's requirements) can be solved with the proposed shared library. The problem of how to scale up nodes (especially scaling up from 0 nodes) is still a challenge and needs further exploration. This is already covered in the future considerations section of the KEP.

I will incorporate this feedback into the KEP and reach out when it's ready for review.

@pravk03 pravk03 force-pushed the node-capabilities branch from cd6d67e to eff4d75 Compare July 8, 2025 15:17
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jul 8, 2025
@pravk03 pravk03 force-pushed the node-capabilities branch from eff4d75 to 1313dd7 Compare July 8, 2025 15:17
@pravk03 pravk03 force-pushed the node-capabilities branch from 1313dd7 to fc48bfb Compare July 9, 2025 01:09
@pravk03
Author

pravk03 commented Jul 9, 2025

I've updated the KEP based on the SIG Architecture feedback (#5347 (comment)). The new version focuses more on capabilities tied to the feature lifecycle and expands on the deprecation strategy.

@tallclair @SergeyKanzhelev @haircommander @wojtek-t PTAL when you get a chance.

- '@wojtek-t'
- '@dom4ha'
- '@macsko'
approvers:
Member


I'd suggest adding somebody from SIG Scheduling as an approver here.

2. Introduce a shared library to encapsulate the logic for inferring a pod's requirements and matching them against node capabilities, ensuring consistency between control plane components that depend on capabilities.
3. Enhance the kube-scheduler to filter nodes based on the pod's requirements.
4. Enable API admission controllers to validate requests for operations against a node's actual feature support.
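The shared-library goal in the list above could be sketched roughly as follows. This is only an illustration, not the KEP's API: the `PodFeatures` type, the capability names, and every function name are hypothetical stand-ins, and a real implementation would inspect the actual pod spec and `Node` status.

```go
package main

import "sort"

// Hypothetical capability names. The KEP leaves the exact mapping of a
// feature to a capability to the features that use the framework.
const (
	CapInPlaceResize     = "InPlacePodVerticalScaling"
	CapPodLevelResources = "PodLevelResources"
)

// PodFeatures is a simplified, assumed stand-in for the parts of a pod
// spec that imply capability requirements.
type PodFeatures struct {
	UsesPodLevelResources bool
	RequestsInPlaceResize bool
}

// InferRequirements returns the sorted capability names a pod needs.
// Keeping this in one library lets every consumer infer the same set.
func InferRequirements(p PodFeatures) []string {
	var reqs []string
	if p.UsesPodLevelResources {
		reqs = append(reqs, CapPodLevelResources)
	}
	if p.RequestsInPlaceResize {
		reqs = append(reqs, CapInPlaceResize)
	}
	sort.Strings(reqs)
	return reqs
}

// MissingCapabilities returns the required capabilities that the node
// does not report, in the order they appear in reqs.
func MissingCapabilities(reqs []string, nodeCaps map[string]bool) []string {
	var missing []string
	for _, r := range reqs {
		if !nodeCaps[r] {
			missing = append(missing, r)
		}
	}
	return missing
}

// Fits reports whether the node reports every capability the pod needs.
// A scheduler filter and an admission check could both call this.
func Fits(p PodFeatures, nodeCaps map[string]bool) bool {
	return len(MissingCapabilities(InferRequirements(p), nodeCaps)) == 0
}
```

A scheduler-style filter would call `Fits` once per candidate node, while an admission check could call `MissingCapabilities` to build an error message naming what the node lacks.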

Member


Suggested change
5. Enable kubelet admission plugin to check the Pod is compatible with the Node's features

Considered approaches:

1. Have the autoscaler inspect a running node in the target node pool and assume all new nodes will be identical. This would work only if a running node exists and fails for the "scale-from-zero" conditions.
2. This problem is fundamentally the same as what [kubernetes/autoscaler#7799](https://github.com/kubernetes/autoscaler/issues/7799) is tracking to support DRA use cases. The cluster-autoscaler currently does not consider DRA resources while scaling up and the long term solution would likely involve a new API surface to specify and/or modify autoscaler predictions.
Member


Is it the same? I thought that DRA is unique, since DRA is not a part of the Node status, so it is harder to add those to the templates.

**Node Capabilities Requirements:**

1. Every capability must be associated with a Kubernetes feature graduating through the Alpha/Beta/GA process. This ensures capabilities are not used as permanent node attributes and are automatically removed after the feature is stable (after the supported version skew period)
2. Must be derived from node's static configuration, which the Kubelet evaluates during bootstrap. Reporting new or changed capabilities requires a Kubelet restart to take effect.
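As a rough illustration of the second requirement, a capability can be computed once from static configuration during startup. The sketch below is an assumption-laden example, not the KEP's implementation: `KubeletConfig` is a simplified stand-in for the real Kubelet configuration, and the particular combination of feature gates used to derive `guaranteedQOSPodCPUResize` is a guess for illustration only.

```go
package main

// KubeletConfig is a simplified, assumed stand-in for the static
// configuration the Kubelet reads at bootstrap.
type KubeletConfig struct {
	FeatureGates     map[string]bool
	CPUManagerPolicy string
}

// DeriveCapabilities computes the capability set once from static
// configuration. Because nothing here depends on runtime state, the
// result only changes when the Kubelet restarts with a new config.
func DeriveCapabilities(cfg KubeletConfig) map[string]bool {
	caps := map[string]bool{}

	// Hypothetical mapping: the capability is treated as a logical AND
	// of feature gates and the CPU manager policy. The exact gates are
	// an assumption made for this sketch.
	if cfg.FeatureGates["InPlacePodVerticalScaling"] &&
		cfg.FeatureGates["InPlacePodVerticalScalingExclusiveCPUs"] &&
		cfg.CPUManagerPolicy == "static" {
		caps["guaranteedQOSPodCPUResize"] = true
	}
	return caps
}
```

Calling `DeriveCapabilities` during startup and caching the result keeps the reported set stable for the lifetime of the process.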
Member


Also, capabilities must be calculated BEFORE pod admission. Otherwise, pod admission will fail on node restart.

Comment on lines +314 to +315
* Graduation (GA): When the feature graduates to GA, the Kubelet continues to report the capability. This is necessary to manage version skew, allowing the control plane to correctly identify older nodes that do not yet have the GA feature.
* Automated Deprecation (Post-GA): The Kubelet automatically stops reporting the capability after the feature has been GA for a duration that exceeds the cluster's supported version skew. The capability check is bypassed in the shared library based on the consumer component's (e.g., kube-scheduler) version and the feature gate's graduation version.
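One possible reading of these two lifecycle bullets, expressed with minor version numbers, is sketched below. It is an interpretation under stated assumptions rather than the KEP's definition: `MaxNodeSkew` assumes a Kubelet may be up to three minor versions older than the control plane and never newer, and both function names are hypothetical.

```go
package main

// MaxNodeSkew is an assumed supported skew, in minor versions, between
// the control plane and the oldest Kubelet it may talk to.
const MaxNodeSkew = 3

// CheckRequired reports whether a control plane component at
// controlPlaneMinor still needs to check the capability for a feature
// that went GA in gaMinor. The oldest node it can see runs
// controlPlaneMinor-MaxNodeSkew; once that version already has the GA
// feature, every node supports it and the check can be bypassed.
func CheckRequired(controlPlaneMinor, gaMinor int) bool {
	return controlPlaneMinor-MaxNodeSkew < gaMinor
}

// KubeletReports reports whether a Kubelet at kubeletMinor should still
// report the capability. Because a Kubelet is assumed never to be newer
// than the control plane, the oldest control plane that can talk to it
// runs kubeletMinor, so the same window applies.
func KubeletReports(kubeletMinor, gaMinor int) bool {
	return CheckRequired(kubeletMinor, gaMinor)
}
```

Under these assumptions, a feature that goes GA in 1.36 would still be checked by a 1.38 control plane (which may see 1.35 nodes) and would stop being checked and reported from 1.39 onward.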
Member


This requires clarification, perhaps with specific versions showing how the supported version skew is calculated. Does this statement suggest that the capability will be removed only after GA + 3 versions? And after this, is the logic removed from both the control plane and the kubelet at the same time?

Member


Clarification is needed mostly for when it is removed from the control plane.

1. Replace Taints/Tolerations or Node Labels/Selectors/Affinity.
2. Serve as a reporting mechanism for permanent static node attributes (like architecture, or specific hardware).
3. To define the exact mapping of a feature to a capability. This KEP proposes the framework that establishes the mechanism; specific mappings will be defined with the features that use them.
4. To include full Cluster Autoscaler integration in the initial Alpha stage. The autoscaler makes scaling decisions based on node templates, which lack the capability information. Defining an integration strategy is deferred as a [future enhancement](#cluster-autoscaler-integration).
Member


This will delay adoption. Perhaps it can be solved in alpha.

Labels
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • kind/kep - Categorizes KEP tracking issues and PRs modifying the KEP directory
  • lead-opted-in - Denotes that an issue has been opted in to a release
  • ok-to-test - Indicates a non-member PR verified by an org member that is safe to test.
  • sig/cli - Categorizes an issue or PR as relevant to SIG CLI.
  • sig/node - Categorizes an issue or PR as relevant to SIG Node.
  • sig/scheduling - Categorizes an issue or PR as relevant to SIG Scheduling.
  • size/XL - Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
Status: Needs Triage