
Conversation

@macsko (Member) commented Dec 10, 2025

  • One-line PR description: Update KEP-4671 for beta in v1.36. Introduce Workload Scheduling Cycle phase.
  • Other comments: This PR shows an initial idea of the new workload scheduling phase. Further changes related to graduation to beta will be added gradually.

@k8s-ci-robot added the do-not-merge/work-in-progress, kind/kep, and sig/scheduling labels Dec 10, 2025
@k8s-ci-robot added the size/XL and cncf-cla: yes labels Dec 10, 2025
@macsko (Member Author) commented Dec 10, 2025


Alternative 1 (Modify sorting logic):

Modify the sorting logic within the existing `PriorityQueue` to put all pods
We would need to modify it once we have workload priority, right? Otherwise we can keep the current sorting algorithm and pull all gang members from the queue once we encounter the first member of the gang (and switch to workload scheduling), right?

Member Author

Right, we could keep the current algorithm, but since the gang members can be in any internal queue, removing them from these queues would require some effort (running PreEnqueue, registering metrics, filling in internal structures used for queuing hints, etc.). In fact, this effort could instead go toward properly implementing queueing with a more future-proof alternative.


If we decide that, when there is no workload priority, priority = min(pod priorities) (as Wojtek suggested), then this approach has a big problem.

Member Author

Yes, if we choose that priority calculation, then this alternative is just a bad idea

Member

I'd also not modify the queue sorting, but "remove" PodGroup pods from it when the PodGroup itself is unschedulable, by simply making them unschedulable.

Member Author

We can't "simply" make them unschedulable. What about event handling? What should make the PodGroup schedulable after a failure? What should we do with the observability (metrics, logs, etc.) that is aligned to pod-by-pod queueing now? It's hard to say whether we could easily do that or not. I need to analyze the code more deeply and come back with some thoughts.

Anyway, the modification of the queue sorting will be needed when pod group priority is introduced by the workload-aware preemption KEP. We should consider that when designing the queueing part of this KEP.

@erictune (Contributor) left a comment

This looks great overall.

I have one proposal to consider renaming part of the API, and some clarifying questions.

// +required
Name string
// PodGroup is the name of the PodGroup within the Workload that this Pod
Contributor

The term PodGroup has become ambiguous. It can mean:

  1. a list item in Workload.podGroups
  2. a specific set of pods having the same PodGroupReplicaKey

These are often the same thing, but not always.

If we want to be clear what we mean, we can either:

  • Always call (1) a PodGroup and always call (2) a PodGroupReplica, including in the case when podGroupReplicaKey is nil
  • Or, rename PodGroup to PodGroup Template, and call (2) a PodGroup

Contributor

This may seem pedantic, but when reviewing this KEP, I felt that the current double meaning made the text imprecise in a number of places. We do have a chance to fix the naming with v1beta1, but I don't think we are supposed to rename when we go to GA.

Member

My personal preference: when talking about (2), I would just call it either PodGroupReplica or, alternatively (which is what I was using), PodGroup instance. With that, I wouldn't rename it - but these are my personal preferences.

Member

If we rename it to PodGroupReplica, are we planning to introduce PodSubGroup under it, as we discussed here: #5711 (comment) ? And what about PodSubGroupReplica?

Contributor

Agree with andreyvelich, PodSubGroupReplica and PodSetReplica are confusing.

This would be less confusing:

  • 1 Workload object in the API server names exactly one group of pods at runtime.
  • 1 Workload.podGroupTemplates[i] is a template for N independent PodGroups which are distinguished by PodGroupReplicaKey

Later, if PodSubGroup is added:

  • 1 Workload.podGroupTemplate[i].podSubGroup[j] names exactly one PodSubGroup within a single group of pods (PodGroup) defined as above.

Later, if PodSet is added under PodSubGroup:

  • 1 Workload.podGroupTemplate[i].podSubGroup[j].podSet[k] names exactly one PodSet within a PodSubGroup, defined as above (see the sketch below).
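
To visualize this, a rough Go sketch of the naming scheme described above; all type and field names here are hypothetical, not the actual KEP-4671 API:

```go
package example

// Workload holds templates only; it does not enumerate runtime replicas.
type Workload struct {
	Name string
	// Each template stamps out N independent PodGroups at runtime,
	// distinguished by a replica key carried on the pods.
	PodGroupTemplates []PodGroupTemplate
}

type PodGroupTemplate struct {
	Name     string
	MinCount int // gang lower bound for each PodGroup from this template
}

// PodGroupID identifies one runtime PodGroup (what option (2) above calls
// a PodGroupReplica or "PodGroup instance").
type PodGroupID struct {
	Workload   string
	Template   string
	ReplicaKey string // empty when the template has exactly one replica
}
```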

@erictune (Contributor) commented Dec 18, 2025

This needs a longer discussion than belongs in the review comments of this PR.

Suggestion to KEP author:

  • Keep Workload Scheduling Cycle proposal in this PR.
  • Keep Basic policy enhancements in this PR.
  • Remove advancement to beta.

Let's all as reviewers approve the above. And we can take a week to discuss the "PodGroup as separate object" proposal, which may affect the alpha/beta status of Workload.

Member

Agree with @johnbelamaric and @erictune that we should separate discussion whether we need separate object for PodGroup.

> Remove advancement to beta.

Is my understanding correct that we will create another KEP for beta graduation in that case?

Member

I'm going back and forth on whether we should separate PodGroup into a separate object (in fact, my original counter-proposal was going that way, though for different reasons: https://docs.google.com/document/d/13UkLjVMj_edMh7biqVU6SVyNNTIfGAT35p6-pNsF5AY/edit?resourcekey=0-dqUEiwiXWICLwAg6Tqkupw&tab=t.9le0fmf90j3w#heading=h.vf43rjyfidc6)

I agree it's the last moment to change that before going to Beta.

But I also agree that pretty much none of the proposed changes (basicPolicy, workload scheduling cycle, scheduling algorithm, ...) depends on this decision, and they can proceed independently.

So I'm heavily +1 on Eric's suggestion above to focus this PR on those changes but still leave the KEP in Alpha after this PR.

And have a separate PR that will be focused on the API itself (it can't be a separate KEP, but it can be a separate PR and discussion).
Also - we should probably start a dedicated document for that, to clearly describe the pros/cons of both options and make a more data-driven decision.

@macsko - I'll be OOO for the next 2 weeks; would you be able to start such a doc? I'm happy to contribute once I'm back.

@macsko (Member Author) commented Dec 19, 2025

Okay, so I'll remove the beta graduation part from this PR, and the discussion about the API can be moved to a new document.

Member

So I went ahead and started a doc: https://docs.google.com/document/d/1zVdNyMGuSi861Uw16LAKXzKkBgZaICOWdPRQB9YAwTk/edit?resourcekey=0-bD8cjW_B6ZfOpSGgDrU6Mg&tab=t.0#heading=h.c4vrtnmf9f4o

It's only a starter and requires a lot of work, so I would appreciate all contributions, especially given I will be OOO until Jan 7th.

// WorkloadSpec describes a workload in a portable way that scheduler and related
// tools can understand.
// WorkloadMaxPodGroups is the maximum number of pod groups per Workload.
Contributor

This is the number of items in the `PodGroups` list.

Member

FWIW - these updates are based on what was merged in the code.
I'm happy for these to be updated, but for these specifically maybe we should open a dedicated PR in k/k.

Member Author

Agree with @wojtek-t. We can defer API description discussions to the k/k PR with the API graduation (when it is created).

ControllerRef *TypedLocalObjectReference
// PodGroups is the list of pod groups that make up the Workload.
// The maximum number of pod groups is 8. This field is immutable.
Contributor

But the maximum number of PodGroup Replicas is not limited.

Member Author

That's right. It would even be hard to enforce such a limitation in the current form of replication, since it's based solely on pods' workloadRefs.

Contributor

I was suggesting making this clearer in the code comment.

Member

Splitting PodGroup into its own resource would make it a lot clearer that there is a distinction between the template and the instance, and that the limit applies to defined templates, not instances.


When the scheduler pops a Pod from the active queue, it checks if that Pod
belongs to an unscheduled `PodGroup` with a `Gang` policy. If so, the scheduler
initiates the Workload Scheduling Cycle.
Contributor

If a pod belongs to an already scheduled PodGroup, it is not clear what to do. We could:

  • Hold it back, on the assumption that it is a replacement for an already scheduled pod. When minCount additional pods show up, then handle all those at once.
  • Treat it as if it is a "gang scale up", and try to place it too (on a best-effort basis). If we do this, does it go through Workload cycle, or just normal pod-at-a-time path?

I am thinking about races that could happen when a workload is failing and getting pods replaced. And I am thinking about the case when a workload wants to tolerate a small number of failures by having actual count > minCount.

Member

Taking a step back from the implementation, we should think about what we really need to do conceptually in that case.

If the pod is part of PodGroup, then what we conceptually want is to schedule that in the context of its PodGroup instance. I think there are effectively three cases here:

  1. this PodGroup instance doesn't satisfy its minCount now, but with this pod it will satisfy it
  2. this PodGroup instance doesn't satisfy its minCount now and with that pod it won't satisfy it either
  3. this PodGroup instance already satisfies its minCount even without this pod

It's worth noting at this point that, with topology-aware scheduling introduced, a PodGroup instance could have TAS requirements too, which matters when thinking about what we should do with it.

The last point in my opinion means that we effectively should always go through the Workload cycle, because we always want to consider the pod in the context of the whole workload.
The primary question is whether we want to kick this workload cycle off with an individual pod or wait for more, and if so, based on what criteria.
I think that the exact criteria will be evolving in the future (I can definitely imagine them depending on the "preemption unit" that we're introducing in KEP-5710). So for now, I wouldn't try to settle on the final solution.

I would suggest starting with:

  • always go through the Workload cycle (to keep the whole PodGroup instance in mind as context)
  • for now, always schedule the individual pod for the sake of making progress here (we will most probably adjust it in the future, but it will be a pretty localized change and I think it's fine); see the sketch below
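
A minimal sketch of the three cases above (`scheduled` counts the group's already-placed pods, the arriving pod is the +1; the function name is illustrative):

```go
package example

// classifyArrival maps the state of a PodGroup instance, at the moment a
// new member pod arrives, to the three conceptual cases above.
func classifyArrival(scheduled, minCount int) string {
	switch {
	case scheduled >= minCount:
		return "case 3: quorum already met without this pod"
	case scheduled+1 >= minCount:
		return "case 1: this pod completes the quorum"
	default:
		return "case 2: quorum still not met even with this pod"
	}
}
```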

Member Author

It seems that the easiest way is to do what @wojtek-t described, i.e., go best-effort (as for the basic policy): take as many pods as we have available and try to schedule them in the workload cycle. Only pods that have passed the workload cycle will be able to move on to the pod-by-pod cycle. In the future, we can try to create more intelligent grouping, but for now, let's focus on delivering a good-enough implementation.

Member

What is the plan if PodGroup has been scheduled with TAS constraints based on minCount, and a new Pod comes in that is part of the PodGroup but doesn't fit in the TAS constraints?

Member Author

That's a good point. I suppose the new Pod should go through a workload scheduling cycle, and the TAS algorithm there should take the already scheduled pods from the gang into consideration. If the pod doesn't fit, it will remain unschedulable until something changes in the cluster.

They are then effectively "reserved" on those nodes in the
scheduler's internal cache. Pods are then pushed to the
active queue (restoring their original timestamps to ensure fairness)
to pass through the standard scheduling and binding cycle,
Contributor

IIUC, there are some cases where a nominated pod fails to pass through the standard scheduling cycle:

  • differences in the algorithm between the workload cycle and the standard cycle
  • a refreshed snapshot changes node eligibility
  • a higher-priority pod jumps ahead in the active queue

Member Author

Yes, these are the cases, but we are okay when they occur, as long as the new placement is still valid. If it isn't anymore, scheduling of the gang will fail (at WaitOnPermit) and the gang will be retried.

Member

There should be no differences between the workload and standard cycles in terms of pod feasibility.

@wojtek-t (Member) left a comment

Added a few comments, but they are pretty localized - overall this is a great proposal, pretty well aligned with how I was thinking about it too.


Can report dedicated logs and metrics with less confusion to the user.
* *Cons:* Significant and non-trivial architectural change to the scheduling queue
and `scheduleOne` loop.
<<[/UNRESOLVED]>>
Member

I would value opinions from people more recently hands-on with the scheduler code over my own here, but on paper the third alternative seems preferred to me:

  1. Alternative 1 - it will be extremely hard to reason about if pods sit in different queues (backoff, active, ...) independently, and the evolution point is extremely important imho - we know we will be evolving this a lot, and preparing for that is super important.

  2. Alternative 2 - I share the concerns about corner cases (pod deletion is exactly what I have on my mind). Once we get to rescheduling cases (new pods appear but they should trigger removing and recreating some of the existing pods), it will get even more complicated, with even harder corner cases. Given that we know that rescheduling is super important, I'm also reluctant about it.

  3. Alternative 3 - it's clearly the cleanest option. The primary drawback, the need for non-trivial changes, is imho justified given we know that we will be evolving this significantly in the future.

So I have quite a strong preference towards (3) at this point, but I'm happy to be challenged by implementation-specific counter-arguments.

Member Author

That was my thought as well. I think the effort involved in adding workload queuing will be comparable to modifying the code for the previous alternatives, but maybe I'm missing some significant drawbacks.


For the third alternative, do we want to introduce only the queue for PodGroups? If so, then we have the same problem of pulling pods from the backoff and unschedulable queues, right? Or do we mean by the workload queue a structure that will hold all the pods related to workload scheduling?

Let's say that we add a Pod that is part of a group but does not make us meet the minCount. Should it land in the unschedulable queue as it does right now? Or in some additional structure?

If the pod group failed scheduling, where do we put its pods? We need to have them somewhere, so we can check them against cluster events, in case some event makes a pod schedulable and thus potentially makes the whole pod group schedulable.

@macsko (Member Author) commented Dec 12, 2025

I think one way might be to introduce both:

  • a queue for pod groups (let's say podGroupsQ). This will contain the pod groups that can be popped for workload scheduling
  • a data structure for unscheduled pods (let's say workloadPods) that have to wait for workload scheduling to finish to proceed with their pod-by-pod cycles.

I imagine the pod transitions (for pods that have a workloadRef) would be:

pod is created -> when PreEnqueue passes for the pod, add it to workloadPods, otherwise add it to unschedulablePods -> [for a gang pod group] if >= minCount pods for the group are in workloadPods, the group is added to podGroupsQ -> the pod group is processed and the workload scheduling cycle is executed

When the workload cycle finishes successfully:

pod gets the NNN and is moved to the activeQ -> pod is popped and goes through its pod-by-pod scheduling cycle -> ...

When the workload cycle fails:

pod is moved to unschedulablePods, where it waits for a cluster change to happen -> when a change happens for this pod or another one from its pod group, the pod(s) are moved to workloadPods -> the workload manager detects that the group should be retried and adds the group to podGroupsQ -> processing continues as previously
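
To make this concrete, a minimal Go sketch of the two structures under the naming assumed in this comment (podGroupsQ, workloadPods); an illustration of the idea, not existing scheduler code:

```go
package example

import (
	v1 "k8s.io/api/core/v1"
)

// groupKey identifies a gang PodGroup instance.
type groupKey string

type workloadQueues struct {
	// podGroupsQ holds groups that are ready for a workload scheduling
	// cycle (>= minCount members observed and past PreEnqueue).
	podGroupsQ []groupKey
	// workloadPods holds pods waiting for their group's workload cycle
	// to finish before starting their pod-by-pod cycles.
	workloadPods map[groupKey][]*v1.Pod
}

// enqueue adds a pod that passed PreEnqueue and, once the gang quorum is
// reached, makes its group eligible for a workload scheduling cycle.
func (q *workloadQueues) enqueue(key groupKey, pod *v1.Pod, minCount int) {
	q.workloadPods[key] = append(q.workloadPods[key], pod)
	if len(q.workloadPods[key]) == minCount {
		q.podGroupsQ = append(q.podGroupsQ, key)
	}
}
```

On the failure path described above, pods would move back out of workloadPods into unschedulablePods, and queueing hints would later re-add them and re-queue the group.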

Member

Again, alternative 3 is the obvious choice IMO if PodGroup is a separate resource.


In Koordinator, I used Alternative 2. The desired effect is that, in ActiveQ/BackoffQ, a PodGroup has only one item, and this item can carry all queuing attributes (Priority, LastScheduleTime, BackoffTime, etc.). Whether this item is a representative pod or a real PodGroup is less important.

Another important point is that when a representative Pod or PodGroup is dequeued, its sibling Pods are also dequeued. In Koordinator, we implemented our own NextPod (koordinator-sh/koordinator#2417) to achieve this.

Regarding this KEP, I feel that a fusion of Alternative 2 and Alternative 3 might be a better solution:

  1. Use a QueuedPodGroupInfo that aggregates the queuing attributes of all member Pods to flow PodGroups between ActiveQ, BackoffQ, and UnschedulableQ, allowing the previous QueueHint mechanism to seamlessly integrate with PodGroups.

  2. Use a separate map to store the Pods belonging to a QueuedPodGroupInfo, for easy indexing (see the sketch below).
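
A rough sketch of what such a fused queue item might look like (names assumed here, following this comment rather than Koordinator's or kube-scheduler's actual types):

```go
package example

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// QueuedPodGroupInfo is a single queue item per PodGroup that aggregates
// the queuing attributes of its member pods, so the group can flow between
// ActiveQ, BackoffQ, and UnschedulableQ like a pod does today.
type QueuedPodGroupInfo struct {
	GroupKey         string
	Priority         int32     // e.g., workload priority or min of members
	LastScheduleTime time.Time // earliest member timestamp, for fairness
	BackoffExpiry    time.Time // latest member backoff, gating backoffQ exit
	Attempts         int       // scheduling attempts for the whole group
}

// Member pods live outside the queues, so popping the group item can also
// "dequeue" all of its sibling pods in one step.
type memberIndex map[string][]*v1.Pod
```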

Member

Consider that we should have PodGroups without pods at some point in time and we should be able to schedule them somehow.

However, before we get there, I'd consider scheduling a PodGroup simply whenever we encounter the first pod that refers to it and the PodGroup wasn't scheduled yet. It's similar to option 1, but without modifying the sorting function.

I think it would be the simplest implementation, as all pods stay in the active queue unless an attempt to schedule a PodGroup they belong to makes the PodGroup unschedulable. Obviously the question is when to reconsider an unschedulable PodGroup. At the beginning I'd make it periodically schedulable again (after some timeout), without defining a PodGroup queue yet.

I'm not sure how Workload-aware preemption may interact with it yet.

* If preemption is successful, the pod is nominated on the selected node.
* If preemption fails, the pod is considered unscheduled for this cycle.

The phase can effectively stop once `minCount` pods have a placement,
Member

From the POV of optimizing locally, we should continue scheduling all pods (and not stop after minCount) - what changes is that the unschedulability of further pods shouldn't make the whole group unschedulable.

Given that we have minCount at the whole-PodGroup level, the local optimization is actually what we should probably do anyway (it would be a different story if we had a separate minCount per signature, but that's not the case).

Contributor

If the number of pods in the PodGroup is larger than minCount, there is also a question whether we should stop at the first pod scheduling failure, or maybe only when the number of failures is larger than the difference between the size of the PodGroup and minCount.

My intuition is that if we encounter a pod scheduling failure, we should probably skip scheduling the remaining pods in the given sub-group, treating them as unschedulable, but if there are more sub-groups and there is still a chance that we will be able to schedule minCount pods, we should probably try scheduling pods from the subsequent sub-groups.

Member Author

> My intuition is that if we encounter a pod scheduling failure, we should probably skip scheduling the remaining pods in the given sub-group, treating them as unschedulable, but if there are more sub-groups and there is still a chance that we will be able to schedule minCount pods, we should probably try scheduling pods from the subsequent sub-groups.

I agree and that was the idea. Added it explicitly to the KEP
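
A small Go sketch of that agreed intuition (`pod` and `tryPlace` are assumed placeholders for the real types and the per-pod feasibility check):

```go
package example

type pod struct{ name string }

// tryPlace stands in for running Filter/Score for one pod against the
// current snapshot; it is an assumed helper, not a real scheduler API.
func tryPlace(p pod) bool { return true }

// scheduleGang skips the rest of a sub-group after its first failure
// (same signature, so siblings would fail the same way), but keeps trying
// subsequent sub-groups while minCount can still be reached.
func scheduleGang(subGroups [][]pod, minCount int) bool {
	scheduled, remaining := 0, 0
	for _, sg := range subGroups {
		remaining += len(sg)
	}
	for _, sg := range subGroups {
		for i, p := range sg {
			if !tryPlace(p) {
				remaining -= len(sg) - i // skip the rest of this sub-group
				break
			}
			scheduled++
			remaining--
		}
		if scheduled+remaining < minCount {
			return false // quorum can no longer be reached
		}
	}
	return scheduled >= minCount
}
```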

@wojtek-t self-assigned this Dec 11, 2025

@macsko force-pushed the gang_scheduling_beta branch from a63f0ef to 71ef75a on December 12, 2025 12:23
@k8s-ci-robot added the size/L label and removed the size/XL label Dec 12, 2025
@macsko force-pushed the gang_scheduling_beta branch from 71ef75a to 9ce3fc7 on December 12, 2025 12:23
@macsko force-pushed the gang_scheduling_beta branch from 9ce3fc7 to eae3ddb on December 12, 2025 15:30
@sanposhiho (Member) left a comment

/assign

// +required
Name string
// PodGroup is the name of the PodGroup within the Workload that this Pod
Member

@thockin - I would appreciate your feedback from the API approver perspective.

the standard pod-by-pod scheduling cycle.

When the scheduler pops a Pod from the active queue, it checks if that Pod
belongs to an unscheduled `PodGroup` with a `Gang` policy. If so, the scheduler
Member

As discussed in the Basic policy section above, I actually think that all pods that belong to workloads should go through this phase (whether they form gangs or are just from the basic policy).

I acknowledge that for basic it will be kind of best-effort (if more pods were created, we will get all of them; if we only observed 1, it will be just one), but that better opens doors for future extensions once we have pod templates etc. defined in Workload.

Member

+1
I'd say that Workload scheduling is a phase scheduling PodGroups. Depending on what policy type the group is, the logic is different.

In case of the Basic policy, the group becomes scheduled unconditionally; still, pods belonging to that group cannot be scheduled until the PG itself is scheduled.

Member Author

Removed "Gang" from this sentence. I think we are aligned that the basic-policy pods should be scheduled by this phase


*Proposed:* Implement it on PodGroup Level for Beta. However, future migration
to the Workload Level might necessitate non-trivial changes to the phase
introduced by this KEP.
Member

I suggest resolving that (and potentially mentioning the risk of non-trivial changes in the Risks section).

2. The scheduler nominates the victims for preemption and the gang Pod
for scheduling on their place. This way, the gang can be attempted
without making any intermediate disruptions to the cluster.
* If the quorum is met, the scheduler continues scheduling the gang Pods pod-by-pod.
Member

This should be adjusted once we settle down on the details in #5711

Currently these are not aligned :)

[It's a comment for myself too]

@johnbelamaric (Member) left a comment

I really think we should take the opportunity this release to shift to top-level PodGroup instances, with the templates living in the Workload. I believe this clears up a number of things and will make building on top of PodGroups much easier for the rest of the project.

// +required
Name string
// PodGroup is the name of the PodGroup within the Workload that this Pod
Member

Embedding PodGroups in Workload as done today makes it much harder for clients to track lifecycle events (creation, deletion, resize, etc.) of individual PodGroup replicas. And given that (AIUI) individual replica keys are not all stored in the workload, it may not even be possible to track PodGroup replica creation and deletion today without some sort of watch on Pods, inferring them from that.

I don't think it's too late to pull PodGroup out into its own top-level resource, but it is our last chance to do so. I think it's the better design and we should take this opportunity. With that update, clients can use all the standard API machinery rather than something that calculates deltas from changes in workload objects and/or watching Pods.

In that case Workload may contain PodGroupTemplates but the actual instances of PodGroups would be separate objects with a reference back to the workload and template. So we clearly separate lifecycle of the policy configuration and instances of groups based on that policy configuration.

This then would probably need to change from a WorkloadRef to a PodGroup resource name.
Barring that, I prefer what Eric suggests above.
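
For illustration, a hypothetical shape of that split (names invented for this sketch, not a real Kubernetes API):

```go
package example

// Workload keeps only the policy configuration (templates).
type Workload struct {
	Name              string
	PodGroupTemplates []PodGroupTemplate
}

type PodGroupTemplate struct {
	Name     string
	MinCount int
}

// PodGroup is a separate top-level object, so clients can watch
// create/delete/resize of individual groups with standard API machinery.
type PodGroup struct {
	Name         string
	WorkloadName string // reference back to the owning Workload
	TemplateName string // which template this group was stamped from
	Status       PodGroupStatus
}

type PodGroupStatus struct {
	Scheduled bool // set once the group-level placement has succeeded
}
```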



*Proposed:* Implement it on PodGroup Level for Beta. However, future migration
to the Workload Level might necessitate non-trivial changes to the phase
introduced by this KEP.
Member

+1

Complex true workload controllers should implement any orchestration themselves. In other words, if there really are, say, dependencies between PodGroups, I think we should leave that complexity in the upper-layer controller and have the scheduler just deal in PodGroups. So the controller would wait to create the second PodGroup until after it made sure the first one got scheduled. This does mean we don't have full workload atomic scheduling, only PodGroup atomic scheduling. But that level of complexity probably needs something more like Reservation.

end-to-end Pod scheduling flow, it is planned to place this new phase *before*
the standard pod-by-pod scheduling cycle.

When the scheduler pops a Pod from the active queue, it checks if that Pod
Member

If PodGroup is a separate resource, we can watch for unscheduled PodGroups, and defer scheduling any Pod that references a PodGroup until that PodGroup has been scheduled. I think that's a cleaner design than going through the Pod indirection.


and enables the scheduler to optimize the placement of such pod groups by taking the desired state
into account. Ideally, the scheduler should prefer placements that can accommodate
the full `desiredCount`, even if not all pods are created yet.

Contributor

Add something like this:

When desiredCount is specified, the scheduler can delay scheduling the first pod it sees for a short amount of time, in order to wait for more pods to be created/noticed.

Member Author

Added

Note that the implementation of this specific logic might follow in a Beta stage
of this API field.

#### Delayed Preemption


The deletion of a victim takes some time, due to resource release and other reasons. During this period, how can the binding process of the preemptor be blocked, to prevent the kubelet from rejecting the Pod due to resource over-provisioning on the node?

Member Author

The exact design of delayed preemption is proposed in #5711. Let's move this discussion there

# If the purpose of this KEP is to deprecate a user-visible feature
# and a Deprecated feature gates are added, they should be deprecated|disabled|removed.
stage: alpha
stage: beta
Contributor

I don't completely follow everything in this proposal, but promoting to beta and introducing two new feature gates raises a lot of eyebrows for me.

Contributor

And now we have 4 separate feature gates for this feature...

What are the implications of these feature gates in relation to each other?

I.e., if GenericWorkload is disabled and GangScheduling is enabled, what do we expect to happen?

Will any of these graduate at a different time than the other feature gates?

Member Author

> I don't completely follow everything in this proposal, but promoting to beta and introducing two new feature gates raises a lot of eyebrows for me.

One of the gates (WorkloadBasicPolicyDesiredCount) will start in alpha, but the change is small enough that creating a new KEP for it would be overhead, so we decided to put it in this KEP.

> And now we have 4 separate feature gates for this feature...

Tentatively removed one feature gate from the list. I think the workload scheduling cycle could be covered using the gang scheduling gate.

> What are the implications of these feature gates in relation to each other?

GangScheduling requires GenericWorkload to work, as the latter defines the API that enables gang scheduling. GenericWorkload itself can be enabled and used to express the "true" workload without a need for it to be gang scheduled by the kube-scheduler.

> If GenericWorkload is disabled and GangScheduling is enabled, what do we expect to happen?

The GangScheduling gate requires GenericWorkload to be enabled (this is enforced by the feature gate validation).
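
A minimal sketch of that dependency check (the gate names follow the KEP; the validation function itself is illustrative):

```go
package example

import "fmt"

// validateGates enforces that GangScheduling cannot be enabled without
// GenericWorkload, which defines the API it depends on.
func validateGates(genericWorkload, gangScheduling bool) error {
	if gangScheduling && !genericWorkload {
		return fmt.Errorf("GangScheduling requires GenericWorkload to be enabled")
	}
	return nil
}
```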

@kannon92 (Contributor) commented

Honestly I would really like to see some k8s workloads adopting WAS.

We have #5547 but that seems decoupled from this KEP.

But if we promote the API to beta, then we are going to discourage breaking API changes.

Would it be possible to make beta promotion contingent on at least a few workload APIs proving out this design?

I don't want to get into a situation where this API gets promoted to beta/GA, and then workload authors figure out issues with the API, and then we have to revisit the API but we couldn't break it.



*Proposed:* Implement it on PodGroup Level for Beta. However, future migration
to the Workload Level might necessitate non-trivial changes to the phase
introduced by this KEP.
Member

+1
IMO we need to schedule PodGroups themselves, as they define scheduling constraints at the level of a group of pods, which means the pods cannot be scheduled individually.

If we introduce a group of groups, then this phase would have to schedule such a group as a whole, since individual PodGroups could no longer be scheduled individually. Whenever we add them, the extension of the workload scheduling phase would be incremental.

If we ever had cross-workload scheduling constraints, I bet we'd need to schedule such workloads in one cycle as well, so a single workload may not necessarily be a boundary for this phase.

scheduler's internal cache. Pods are then pushed to the
active queue (restoring their original timestamps to ensure fairness)
to pass through the standard scheduling and binding cycle,
which will respect the nomination.
Member

Can you elaborate on what "respect" means? Do you mean "consider", like currently, or something more? The alternative that I see is to consider the nomination as required; inability to follow the nomination would make the PodGroup unschedulable.

However, as the first implementation I'd stick to the current logic and use the word "consider". That means that Pods could pick a different node, but only within a constraint selected for the PodGroup. In other words, pod-by-pod cannot reschedule the PodGroup itself (assuming scheduling a PodGroup has "an effect", for instance picking a specific topology option).

Contributor

Checking the validity of the PodGroup-level constraint may not be that easy, and may require checking the state of other pods in the PodGroup or the state of DRA allocations. In the context of topology-aware workload scheduling, I believe it would be much easier to consider the nomination to be a hard requirement.

Member Author

I meant "consider" here, but @44past4's point is valid. If we want to have a nomination semantic here, PodGroup-level plugins such as TAS would need to provide Filter extension points that can verify the correctness of the newly chosen node, or be able to provide the placement to the pod-by-pod cycle, limiting that phase to the chosen topology/assignment.

Contributor

To provide a TAS Filter extension, we would need to store information about the selected placement (like the name of the Node label and its value) for the PodGroup. For this we would probably need to use the Workload status. This is doable, but it is not in scope of the current TAS KEP proposal.

@macsko (Member Author) commented Dec 22, 2025

TAS could check in PreFilter where the pod group's previous pods were scheduled and, based on that, reject or allow the nodes in Filter. Obviously, it won't be very efficient, but maybe it's good enough for now.
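
A framework-agnostic Go sketch of that idea (the real scheduler plugin interfaces differ; all names here are illustrative): PreFilter records the topology domain of the gang's already-scheduled pods, and Filter rejects nodes outside it.

```go
package example

// tasState carries PreFilter's finding into Filter for one scheduling cycle.
type tasState struct {
	domain string // e.g., value of the chosen topology label
}

// preFilter inspects the domains of the group's already-scheduled pods and
// remembers the chosen one; nil means no constraint has been fixed yet.
func preFilter(scheduledPodDomains []string) *tasState {
	if len(scheduledPodDomains) == 0 {
		return nil // nothing scheduled yet; any node is acceptable
	}
	return &tasState{domain: scheduledPodDomains[0]}
}

// filter allows a node only if it sits in the recorded domain.
func filter(s *tasState, nodeDomain string) bool {
	return s == nil || s.domain == nodeDomain
}
```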

reducing overall scheduling time.

* If a pod fits, it is tentatively nominated.
* If a pod cannot fit, the scheduler tries preemption by running
Member

I wouldn't use the PostFilter phase in the workload scheduling phase, but rather "wait" for the proper workload preemption feature. Note that once we have pods that are batched (same signature), Opportunistic Batching does not have a way to find feasible-nodes-after-preemption-attempt anyway.

Member Author

But if workload-aware preemption is not there (or not enabled), we need to provide a way for gangs to perform preemptions. Otherwise, what is the point of having delayed preemption in the beta graduation criteria for gang scheduling?

Opportunistic batching just optimizes the default scheduling algorithm. I don't think preemption would be a big problem anyway, because a subsequent pod from the homogeneous sub-group can generate new placements for the rest of the batch.


* If `schedulableCount >= minCount`, the cycle succeeds. Pods are pushed
to the active queue and will soon attempt to be scheduled on their
nominated nodes in their own, pod-by-pod cycles. If a pod selects a
Member

I don't see any strong reason why PostFilter should not run in pod-by-pod scheduling if we allow the nomination to be changed. Yes, it's indeed an open question whether we want to allow changing the nomination itself, but if we do, then we should allow running PostFilter as well.

Member Author

I think that allowing such disruptions in pod-by-pod scheduling would make workload scheduling hard to reason about. The reason we'll use the workload-aware preemption is to enable efficient and effective preemption. Otherwise, each pod in the group could preempt some pods or even workloads based on its own needs rather than the needs of the entire pod group. I don't see many advantages to allowing PostFilter to run in pod-by-pod scheduling.





Would need to inject the workload priority into each of the Pods
or somehow apply the lowest pod's priority to the rest of the group.

Alternative 2 (Store a gang representative):
Contributor

I guess that one more option, which would fall somewhere in between Alternative 2 and Alternative 3, could be changing the implementation of activeQ to operate on groups of Pods, treating single pods as groups containing a single pod only. Does this option make sense? Have you considered it?
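
A tiny sketch of that in-between option (names assumed): the queue stores groups, and an ordinary pod is wrapped as a group of size one.

```go
package example

import v1 "k8s.io/api/core/v1"

// queueItem is what the modified activeQ would hold: a group of pods,
// where an ordinary pod is just a group of size one.
type queueItem struct {
	pods []*v1.Pod
}

func singleton(p *v1.Pod) queueItem { return queueItem{pods: []*v1.Pod{p}} }

func gang(pods []*v1.Pod) queueItem { return queueItem{pods: pods} }
```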

optional. In the `Beta` timeframe, we may opportunistically apply this cycle to
`Basic` pod groups to leverage the batching performance benefits, but the
"all-or-nothing" (`minCount`) checks will be skipped; i.e., we will try to
schedule as many pods from such PodGroup as possible.
Contributor

It is not clear what will happen in case of a pod scheduling failure here. Do we stop at the first failure or do we continue? If we continue, do we continue from a subsequent pod, or maybe from a subsequent sub-group?

Member Author

Extended this part. The algorithm will continue from a pod in a subsequent sub-group.

@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: macsko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Dec 22, 2025
@k8s-ci-robot (Contributor) commented

@macsko: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-enhancements-verify | 43b5aa9 | link | true | /test pull-enhancements-verify |


@macsko changed the title from "WIP: KEP-4671: Introduce Workload Scheduling Cycle, graduate Workload API and gang scheduling to beta" to "KEP-4671: Introduce Workload Scheduling Cycle, extend basic policy with desiredCount" Dec 22, 2025
@k8s-ci-robot removed the do-not-merge/work-in-progress label Dec 22, 2025
@macsko (Member Author) commented Dec 22, 2025

/hold
To make sure we don't accidentally merge this PR

@k8s-ci-robot added the do-not-merge/hold label Dec 22, 2025
* If the group (i.e., at least `minCount` Pods) can be placed,
these Pods have the `.status.nominatedNodeName` set.
They are then effectively "reserved" on those nodes in the
scheduler's internal cache. Pods are then pushed to the


While reserving Pods using NominatedNodes eliminates the need to consider Pod dequeueing order, too many NominatedNodes can still degrade scheduling performance. A mechanism is needed to ensure that Pods can be dequeued as quickly as possible after a PodGroup resource is nominated.

opportunistic batching itself will provide significant improvements.
Future features like Topology Aware Scheduling can further improve other subsets of use cases.

#### Interaction with Basic Policy
Contributor

IIUC, the all-or-nothing semantics of minCount are skipped with WSC enabled for the Basic policy, and desiredCount is only a hint for feasibility checks.

Given that, is it in scope (maybe in the future) to support using both minCount and desiredCount together to express more elastic gang behavior (e.g., minCount as a hard lower bound, desiredCount as the target batch size)?
