KEP-5729: DRA: ResourceClaim Support for Workloads #5736
Conversation
nojnhuh commented Dec 12, 2025
- One-line PR description: Add initial draft for KEP-5729: DRA: ResourceClaim Support for Workloads
- Issue link: DRA: ResourceClaim Support for Workloads #5729
- Other comments:
Add KEP-5729: DRA: ResourceClaim Support for Workloads
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: nojnhuh. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
I'm ready for feedback on the Summary and Motivation sections. Mostly I want to make sure this is appropriately scoped w.r.t. #5732.
/cc @wojtek-t @erictune @44past4 @helayoty @johnbelamaric
I'm still working on the meat of the proposal and hope to have a draft of that ready to share by the middle of next week, before I go on vacation through the end of the year.
Add API
I've added the first draft of the API. I'm still working through the rest of the KEP, but I think that section is ready to start getting some feedback.
/label api-review
@nojnhuh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
- Allow users to express sets of DRA resources to be replicated for each
  instance of a PodGroup, and shared by each Pod in the PodGroup.
- Automatically create and delete PodGroups' ResourceClaims as needed.
An important problem to solve is defining the exact moment in time when deallocation of a PodGroup's ResourceClaims will occur. This will be challenging because PodGroups do not have a well-defined lifecycle. Should this be added as a separate goal here?
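For concreteness, a minimal sketch of what the first goal could look like, reusing the draft Workload field names that appear later in this thread (podGroups, resourceClaims, resourceClaimTemplateName); these names come from the draft API and may change. The intent is that each PodGroup replica gets its own ResourceClaim created from the template, shared by that replica's Pods:

apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: sketch-workload
  namespace: default
spec:
  podGroups:
  - name: group-1
    resourceClaims:
    # Hypothetical: one ResourceClaim per PodGroup replica, created from
    # the referenced template and shared by every Pod in that replica.
    - name: shared-gpu
      resourceClaimTemplateName: gpu-claim-template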
proposal will be implemented, this is the place to discuss them.
-->
|
### API
An alternative/complementary solution could include an option to define optional ResourceClaim constraints (similar to ResourceClaim.spec.devices.constraints) at the level of the PodGroup, which would apply to all ResourceClaims used by the PodGroup's Pods that come from a given ResourceClaimTemplate.
This way each Pod could have its own ResourceClaim created from a given template, but when scheduling those Pods as part of a PodGroup, kube-scheduler would make sure that the additional PodGroup-level ResourceClaim constraint is fulfilled across all devices allocated to all Pods in the PodGroup. This would provide, for instance, an easy way to specify that all GPUs allocated to Pods in a given PodGroup need to come from a single block, i.e. they need to have the same value for a given device attribute, like gpu/block-name in the example below:
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: my-workload
  namespace: default
spec:
  podGroups:
  - name: group-1
    policy:
      basic: {}
    resourceClaims:
    - name: wl-claim
      resourceClaimTemplateName: gpu-claim-template
      # PodGroup-level constraint: applies across all claims created from
      # gpu-claim-template for Pods in this PodGroup.
      constraints:
      - matchAttribute: gpu/block-name
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim-template
  namespace: default
spec:
  spec:
    devices:
      requests:
      - name: my-device
        exactly:
          deviceClassName: gpu
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-claim-example-1
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: wl-claim-example-1
  template:
    metadata:
      labels:
        app: wl-claim-example-1
    spec:
      containers:
      - name: pause
        image: "registry.k8s.io/pause:3.6"
        resources:
          claims:
          - name: resource-1
      resourceClaims:
      - name: resource-1
        resourceClaimTemplateName: gpu-claim-template
      workloadRef:
        name: my-workload
        podGroup: group-1
        podGroupReplicaKey: "1"
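For comparison, the matchAttribute constraint already exists today at the single-claim level (ResourceClaim.spec.devices.constraints in resource.k8s.io/v1), but it only spans the requests within one claim, not the claims of other Pods. A minimal example of that existing mechanism (claim and request names here are illustrative):

apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: two-gpus-same-block
  namespace: default
spec:
  devices:
    requests:
    - name: gpu-a
      exactly:
        deviceClassName: gpu
    - name: gpu-b
      exactly:
        deviceClassName: gpu
    constraints:
    # Both devices allocated for this claim must report the same value
    # for the gpu/block-name attribute.
    - requests: ["gpu-a", "gpu-b"]
      matchAttribute: gpu/block-name

The PodGroup-level constraints field proposed above would extend this same matching semantics across the ResourceClaims of all Pods in the PodGroup.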
This alternative/complementary solution is discussed in this document: https://docs.google.com/document/d/11rC_qDtArIOx_ZQfM-G4H5qIYFm0rX_GUz1yPGPrv3k/edit?usp=sharing