
Conversation

@macsko (Member) commented Dec 10, 2025

  • One-line PR description: Update KEP-4671 for beta in v1.36. Introduce Workload Scheduling Cycle phase.
  • Other comments: This PR shows an initial idea of the new workload scheduling phase. Further changes related to graduation to beta will be added gradually.

@k8s-ci-robot added the do-not-merge/work-in-progress, kind/kep, and sig/scheduling labels Dec 10, 2025
@k8s-ci-robot added the size/XL and cncf-cla: yes labels Dec 10, 2025
@macsko (Member Author) commented Dec 10, 2025


Alternative 1 (Modify sorting logic):

Modify the sorting logic within the existing `PriorityQueue` to put all pods
We would need to modify it once we have workload priority, right? Otherwise we can keep the current sorting algorithm and pull all gang members from the queue once we encounter the first member of the gang (and switch to workload scheduling), right?

Member Author

Right, we could keep the current algorithm, but since the gang members can be in any internal queue, removing them from these queues would require some effort (running PreEnqueue, registering metrics, filling in internal structures used for queuing hints, etc.). In fact, this effort could instead go toward properly implementing queueing with a more future-proof alternative.


If we decide that, when there is no workload priority, priority = min(pod priorities) (as Wojtek suggested), then this approach has a big problem.

Member Author

Yes, if we choose that priority calculation, then this alternative is just a bad idea

Member

I'd also not modify the queue sorting, but "remove" PodGroup pods from it when the PodGroup itself is unschedulable, by simply making them unschedulable.

Member Author

We can't "simply" make them unschedulable. What about event handling? What should make the PodGroup schedulable after a failure? What should we do with the observability (metrics, logs, etc.) that is aligned to pod-by-pod queueing now? It's hard to say whether we could easily do that or not. I need to analyze the code more deeply and come back with some thoughts.

Anyway, the modification of the queue sorting will be needed when pod group priority is introduced by the workload-aware preemption KEP. We should consider that when designing the queueing part of this KEP.

@erictune (Contributor) left a comment

This looks great overall.

I have one proposal to consider renaming part of the API, and some clarifying questions.

// +required
Name string
// PodGroup is the name of the PodGroup within the Workload that this Pod
Contributor

The term PodGroup has become ambiguous. It can mean:

  1. a list item in Workload.podGroups
  2. a specific set of pods having the same PodGroupReplicaKey

These are often the same thing, but not always.

If we want to be clear what we mean, we can either:

  • Always call (1) a PodGroup and always call (2) a PodGroupReplica, including in the case when podGroupReplicaKey is nil
  • Or, rename PodGroup to PodGroup Template, and call (2) a PodGroup

Contributor

This may seem pedantic, but when reviewing this KEP, I felt that the current double meaning made the text imprecise in a number of places. We do have a chance to fix the naming with v1beta1, but I don't think we are supposed to rename when we go to GA.

Member

My personal preference: when talking about (2), I would just call it either PodGroupReplica or, alternatively (which is what I was using), PodGroup instance. With that, I wouldn't rename it - but these are my personal preferences.

Member

If we rename it to PodGroupReplica, are we planning to introduce PodSubGroup under it, as we discussed here: #5711 (comment) ? And what about PodSubGroupReplica?

Contributor

Agree with andreyvelich, PodSubGroupReplica and PodSetReplica are confusing.

This would be less confusing:

  • 1 Workload object in the API server names exactly one group of pods at runtime.
  • 1 Workload.podGroupTemplates[i] is a template for N independent PodGroups which are distinguished by PodGroupReplicaKey

Later, if PodSubGroup is added:

  • 1 Workload.podGroupTemplate[i].podSubGroup[j] names exactly one PodSubGroup within a single group of pods (PodGroup) defined as above.

Later, if PodSet is added under PodSubGroup:

  • 1 Workload.podGroupTemplate[i].podSubGroup[j].podSet[k] names exactly one PodSet within a PodSubGroup, defined as above (see the sketch below).
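
To visualize this, a rough Go sketch of the naming scheme described above; all type and field names here are hypothetical, not the actual KEP-4671 API:

```go
package example

// Workload holds templates only; it does not enumerate runtime replicas.
type Workload struct {
	Name string
	// Each template stamps out N independent PodGroups at runtime,
	// distinguished by a replica key carried on the pods.
	PodGroupTemplates []PodGroupTemplate
}

type PodGroupTemplate struct {
	Name     string
	MinCount int // gang lower bound for each PodGroup from this template
}

// PodGroupID identifies one runtime PodGroup (what option (2) above calls
// a PodGroupReplica or "PodGroup instance").
type PodGroupID struct {
	Workload   string
	Template   string
	ReplicaKey string // empty when the template has exactly one replica
}
```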

@erictune (Contributor) commented Dec 18, 2025

This needs a longer discussion than belongs in the review comments of this PR.

Suggestion to KEP author:

  • Keep Workload Scheduling Cycle proposal in this PR.
  • Keep Basic policy enhancements in this PR.
  • Remove advancement to beta.

Let's all as reviewers approve the above. And we can take a week to discuss the "PodGroup as separate object" proposal, which may affect the alpha/beta status of Workload.

Member

Agree with @johnbelamaric and @erictune that we should separate discussion whether we need separate object for PodGroup.

> Remove advancement to beta.

Is my understanding correct that we will create another KEP for beta graduation in that case?

Member

I'm going back and forth on whether we should separate PodGroup into a separate object (in fact, my original counter-proposal was going that way, though for different reasons: https://docs.google.com/document/d/13UkLjVMj_edMh7biqVU6SVyNNTIfGAT35p6-pNsF5AY/edit?resourcekey=0-dqUEiwiXWICLwAg6Tqkupw&tab=t.9le0fmf90j3w#heading=h.vf43rjyfidc6)

I agree it's the last moment to change that before going to Beta.

But I also agree that pretty much none of the proposed changes (basicPolicy, workload scheduling cycle, scheduling algorithm, ...) depends on this decision, and they can proceed independently.

So I'm heavily +1 on Eric's suggestion above to focus this PR on those changes but still leave the KEP in Alpha after this PR.

And have a separate PR that will be focused on the API itself (it can't be a separate KEP, but it can be a separate PR and discussion).
Also - we should probably start a dedicated document for that, to clearly describe the pros/cons of both options and make a more data-driven decision.

@macsko - I'll be OOO for the next 2 weeks; would you be able to start such a doc? I'm happy to contribute once I'm back.

@macsko (Member Author) commented Dec 19, 2025

Okay, so I'll remove the beta graduation part from this PR, and the discussion about the API can be moved to a new document.

Member

So I went ahead and started a doc: https://docs.google.com/document/d/1zVdNyMGuSi861Uw16LAKXzKkBgZaICOWdPRQB9YAwTk/edit?resourcekey=0-bD8cjW_B6ZfOpSGgDrU6Mg&tab=t.0#heading=h.c4vrtnmf9f4o

It's only a starter and requires a lot of work, so I would appreciate all contributions, especially given I will be OOO until Jan 7th.

// WorkloadSpec describes a workload in a portable way that scheduler and related
// tools can understand.
// WorkloadMaxPodGroups is the maximum number of pod groups per Workload.
Contributor

This is the number of items in the `PodGroups` list.

Member

FWIW - these updates are based on what was merged in the code.
I'm happy for these to be updated, but for these specifically maybe we should open a dedicated PR in k/k.

Member Author

Agree with @wojtek-t. We can defer API description discussions to the k/k PR with the API graduation (when it is created).

ControllerRef *TypedLocalObjectReference
// PodGroups is the list of pod groups that make up the Workload.
// The maximum number of pod groups is 8. This field is immutable.
Contributor

But the maximum number of PodGroup Replicas is not limited.

Member Author

That's right. It would even be hard to enforce such a limitation in the current form of replication, since it's based solely on pods' workloadRefs.

Contributor

I was suggesting making this clearer in the code comment.

Member

Splitting PodGroup into its own resource would make it a lot clearer that there is a distinction between the template and the instance, and that the limit applies to defined templates, not instances.


When the scheduler pops a Pod from the active queue, it checks if that Pod
belongs to an unscheduled `PodGroup` with a `Gang` policy. If so, the scheduler
initiates the Workload Scheduling Cycle.
Contributor

If a pod belongs to an already scheduled PodGroup, it is not clear what to do. We could:

  • Hold it back, on the assumption that it is a replacement for an already scheduled pod. When minCount additional pods show up, then handle all those at once.
  • Treat it as if it is a "gang scale up", and try to place it too (on a best-effort basis). If we do this, does it go through Workload cycle, or just normal pod-at-a-time path?

I am thinking about races that could happen when a workload is failing and getting pods replaced. And I am thinking about the case when a workload wants to tolerate a small number of failures by having actual count > minCount.

Member

Taking a step back from the implementation, we should think about what we really need to do conceptually in that case.

If the pod is part of PodGroup, then what we conceptually want is to schedule that in the context of its PodGroup instance. I think there are effectively three cases here:

  1. this PodGroup instance doesn't satisfy its minCount now, but with this pod it will satisfy it
  2. this PodGroup instance doesn't satisfy its minCount now and with that pod it won't satisfy it either
  3. this PodGroup instance already satisfies its minCount even without this pod

It's worth noting at this point that, with topology-aware scheduling introduced, a PodGroup instance could have TAS requirements too, which matters when thinking about what we should do with it.

The last point in my opinion means that we effectively should always go through the Workload cycle, because we always want to consider the pod in the context of the whole workload.
The primary question is whether we want to kick this workload cycle off with an individual pod or wait for more, and if so, based on what criteria.
I think that the exact criteria will be evolving in the future (I can definitely imagine them depending on the "preemption unit" that we're introducing in KEP-5710). So for now, I wouldn't try to settle on the final solution.

I would suggest starting with:

  • always go through the Workload cycle (to keep the whole PodGroup instance in mind as context)
  • for now, always schedule the individual pod for the sake of making progress here (we will most probably adjust it in the future, but it will be a pretty localized change and I think it's fine); see the sketch below
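
A minimal sketch of the three cases above (`scheduled` counts the group's already-placed pods, the arriving pod is the +1; the function name is illustrative):

```go
package example

// classifyArrival maps the state of a PodGroup instance, at the moment a
// new member pod arrives, to the three conceptual cases above.
func classifyArrival(scheduled, minCount int) string {
	switch {
	case scheduled >= minCount:
		return "case 3: quorum already met without this pod"
	case scheduled+1 >= minCount:
		return "case 1: this pod completes the quorum"
	default:
		return "case 2: quorum still not met even with this pod"
	}
}
```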

Member Author

It seems that the easiest way is to do what @wojtek-t described, i.e., go best-effort (as for the basic policy): take as many pods as we have available and try to schedule them in the workload cycle. Only pods that have passed the workload cycle will be able to move on to the pod-by-pod cycle. In the future, we can try to create more intelligent grouping, but for now, let's focus on delivering a good-enough implementation.

Member

What is the plan if PodGroup has been scheduled with TAS constraints based on minCount, and a new Pod comes in that is part of the PodGroup but doesn't fit in the TAS constraints?

Member Author

That's a good point. I suppose the new Pod should go through a workload scheduling cycle, and the TAS algorithm there should take the already scheduled pods from the gang into consideration. If the pod doesn't fit, it will remain unschedulable until something changes in the cluster.

They are then effectively "reserved" on those nodes in the
scheduler's internal cache. Pods are then pushed to the
active queue (restoring their original timestamps to ensure fairness)
to pass through the standard scheduling and binding cycle,
Contributor

IIUC, there are some cases where a nominated pod fails to pass through the standard scheduling cycle:

  • differences in the algorithm between the workload cycle and the standard cycle
  • a refreshed snapshot changes node eligibility
  • a higher-priority pod jumps ahead in the active queue

Member Author

Yes, these are the cases, but we are okay when they occur, as long as the new placement is still valid. If it isn't anymore, scheduling of the gang will fail (at WaitOnPermit) and the gang will be retried.

Member

There should be no differences between the workload and standard cycles in terms of pod feasibility.

@wojtek-t (Member) left a comment

Added a few comments, but they are pretty localized - overall this is a great proposal, pretty well aligned with how I was thinking about it too.


Can report dedicated logs and metrics with less confusion to the user.
* *Cons:* Significant and non-trivial architectural change to the scheduling queue
and `scheduleOne` loop.
<<[/UNRESOLVED]>>
Member

I would value opinions from people more recently hands-on with the scheduler code over my own here, but on paper the third alternative seems preferred to me:

  1. Alternative 1 - it will be extremely hard to reason about if pods sit in different queues (backoff, active, ...) independently, and the evolution point is extremely important imho - we know we will be evolving this a lot, and preparing for that is super important.

  2. Alternative 2 - I share the concerns about corner cases (pod deletion is exactly what I have on my mind). Once we get to rescheduling cases (new pods appear but they should trigger removing and recreating some of the existing pods), it will get even more complicated, with even harder corner cases. Given that we know that rescheduling is super important, I'm also reluctant about it.

  3. Alternative 3 - it's clearly the cleanest option. The primary drawback, the need for non-trivial changes, is imho justified given we know that we will be evolving this significantly in the future.

So I have quite a strong preference towards (3) at this point, but I'm happy to be challenged by implementation-specific counter-arguments.

Member Author

That was my thought as well. I think the effort involved in adding workload queuing will be comparable to modifying the code for the previous alternatives, but maybe I'm missing some significant drawbacks.


For the third alternative, do we want to introduce only the queue for PodGroups? If so, then we have the same problem of pulling pods from the backoff and unschedulable queues, right? Or do we mean by the workload queue a structure that will hold all the pods related to workload scheduling?

Let's say that we add a Pod that is part of a group but does not make us meet the minCount. Should it land in the unschedulable queue as it does right now? Or in some additional structure?

If the pod group failed scheduling, where do we put its pods? We need to have them somewhere, so we can check them against cluster events, in case some event makes a pod schedulable and thus potentially makes the whole pod group schedulable.

@macsko (Member Author) commented Dec 12, 2025

I think one way might be to introduce both:

  • a queue for pod groups (let's say podGroupsQ). This will contain the pod groups that can be popped for workload scheduling
  • a data structure for unscheduled pods (let's say workloadPods) that have to wait for workload scheduling to finish to proceed with their pod-by-pod cycles.

I imagine the pod transitions (for pods that have a workloadRef) would be:

pod is created -> when PreEnqueue passes for the pod, add it to workloadPods, otherwise add it to unschedulablePods -> [for a gang pod group] if >= minCount pods for the group are in workloadPods, the group is added to podGroupsQ -> the pod group is processed and the workload scheduling cycle is executed

When the workload cycle finishes successfully:

pod gets the NNN and is moved to the activeQ -> pod is popped and goes through its pod-by-pod scheduling cycle -> ...

When the workload cycle fails:

pod is moved to unschedulablePods, where it waits for a cluster change to happen -> when a change happens for this pod or another one from its pod group, the pod(s) are moved to workloadPods -> the workload manager detects that the group should be retried and adds the group to podGroupsQ -> processing continues as previously
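
To make this concrete, a minimal Go sketch of the two structures under the naming assumed in this comment (podGroupsQ, workloadPods); an illustration of the idea, not existing scheduler code:

```go
package example

import (
	v1 "k8s.io/api/core/v1"
)

// groupKey identifies a gang PodGroup instance.
type groupKey string

type workloadQueues struct {
	// podGroupsQ holds groups that are ready for a workload scheduling
	// cycle (>= minCount members observed and past PreEnqueue).
	podGroupsQ []groupKey
	// workloadPods holds pods waiting for their group's workload cycle
	// to finish before starting their pod-by-pod cycles.
	workloadPods map[groupKey][]*v1.Pod
}

// enqueue adds a pod that passed PreEnqueue and, once the gang quorum is
// reached, makes its group eligible for a workload scheduling cycle.
func (q *workloadQueues) enqueue(key groupKey, pod *v1.Pod, minCount int) {
	q.workloadPods[key] = append(q.workloadPods[key], pod)
	if len(q.workloadPods[key]) == minCount {
		q.podGroupsQ = append(q.podGroupsQ, key)
	}
}
```

On the failure path described above, pods would move back out of workloadPods into unschedulablePods, and queueing hints would later re-add them and re-queue the group.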

Member

Again, alternative 3 is the obvious choice IMO if PodGroup is a separate resource.


In Koordinator, I used Alternative 2. The desired effect is that, in ActiveQ/BackoffQ, a PodGroup has only one item, and this item can carry all queuing attributes (Priority, LastScheduleTime, BackoffTime, etc.). Whether this item is a representative pod or a real PodGroup is less important.

Another important point is that when a representative Pod or PodGroup is dequeued, its sibling Pods are also dequeued. In Koordinator, we implemented our own NextPod (koordinator-sh/koordinator#2417) to achieve this.

Regarding this KEP, I feel that a fusion of Alternative 2 and Alternative 3 might be a better solution:

  1. Use a QueuedPodGroupInfo that aggregates the queuing attributes of all member Pods to flow PodGroups between ActiveQ, BackoffQ, and UnschedulableQ, allowing the previous QueueHint mechanism to seamlessly integrate with PodGroups.

  2. Use a separate map to store the Pods belonging to a QueuedPodGroupInfo, for easy indexing (see the sketch below).
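
A rough sketch of what such a fused queue item might look like (names assumed here, following this comment rather than Koordinator's or kube-scheduler's actual types):

```go
package example

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// QueuedPodGroupInfo is a single queue item per PodGroup that aggregates
// the queuing attributes of its member pods, so the group can flow between
// ActiveQ, BackoffQ, and UnschedulableQ like a pod does today.
type QueuedPodGroupInfo struct {
	GroupKey         string
	Priority         int32     // e.g., workload priority or min of members
	LastScheduleTime time.Time // earliest member timestamp, for fairness
	BackoffExpiry    time.Time // latest member backoff, gating backoffQ exit
	Attempts         int       // scheduling attempts for the whole group
}

// Member pods live outside the queues, so popping the group item can also
// "dequeue" all of its sibling pods in one step.
type memberIndex map[string][]*v1.Pod
```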

Member

Consider that we should have PodGroups without pods at some point in time and we should be able to schedule them somehow.

However, before we get there, I'd consider scheduling a PodGroup simply whenever we encounter the first pod that refers to it and the PodGroup wasn't scheduled yet. It's similar to option 1, but without modifying the sorting function.

I think it would be the simplest implementation, as all pods stay in the active queue unless an attempt to schedule a PodGroup they belong to makes the PodGroup unschedulable. Obviously the question is when to reconsider an unschedulable PodGroup. At the beginning I'd make it periodically schedulable again (after some timeout), without defining a PodGroup queue yet.

I'm not sure how Workload-aware preemption may interact with it yet.

* If preemption is successful, the pod is nominated on the selected node.
* If preemption fails, the pod is considered unscheduled for this cycle.

The phase can effectively stop once `minCount` pods have a placement,
Member

From the POV of optimizing locally, we should continue scheduling all pods (and not stop after minCount) - what changes is that the unschedulability of further pods shouldn't make the whole group unschedulable.

Given that we have minCount at the whole-PodGroup level, the local optimization is actually what we should probably do anyway (it would be a different story if we had a separate minCount per signature, but that's not the case).

Contributor

If the number of pods in the PodGroup is larger than minCount, there is also a question whether we should stop at the first pod scheduling failure, or maybe only when the number of failures is larger than the difference between the size of the PodGroup and minCount.

My intuition is that if we encounter a pod scheduling failure, we should probably skip scheduling the remaining pods in the given sub-group, treating them as unschedulable, but if there are more sub-groups and there is still a chance that we will be able to schedule minCount pods, we should probably try scheduling pods from the subsequent sub-groups.

Member Author

> My intuition is that if we encounter a pod scheduling failure, we should probably skip scheduling the remaining pods in the given sub-group, treating them as unschedulable, but if there are more sub-groups and there is still a chance that we will be able to schedule minCount pods, we should probably try scheduling pods from the subsequent sub-groups.

I agree and that was the idea. Added it explicitly to the KEP
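
A small Go sketch of that agreed intuition (`pod` and `tryPlace` are assumed placeholders for the real types and the per-pod feasibility check):

```go
package example

type pod struct{ name string }

// tryPlace stands in for running Filter/Score for one pod against the
// current snapshot; it is an assumed helper, not a real scheduler API.
func tryPlace(p pod) bool { return true }

// scheduleGang skips the rest of a sub-group after its first failure
// (same signature, so siblings would fail the same way), but keeps trying
// subsequent sub-groups while minCount can still be reached.
func scheduleGang(subGroups [][]pod, minCount int) bool {
	scheduled, remaining := 0, 0
	for _, sg := range subGroups {
		remaining += len(sg)
	}
	for _, sg := range subGroups {
		for i, p := range sg {
			if !tryPlace(p) {
				remaining -= len(sg) - i // skip the rest of this sub-group
				break
			}
			scheduled++
			remaining--
		}
		if scheduled+remaining < minCount {
			return false // quorum can no longer be reached
		}
	}
	return scheduled >= minCount
}
```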

@wojtek-t self-assigned this Dec 11, 2025

@macsko force-pushed the gang_scheduling_beta branch from a63f0ef to 71ef75a on December 12, 2025 12:23
@k8s-ci-robot added the size/L label and removed the size/XL label Dec 12, 2025
@macsko force-pushed the gang_scheduling_beta branch from 71ef75a to 9ce3fc7 on December 12, 2025 12:23
@macsko force-pushed the gang_scheduling_beta branch from 9ce3fc7 to eae3ddb on December 12, 2025 15:30
@sanposhiho (Member) left a comment

/assign

// +required
Name string
// PodGroup is the name of the PodGroup within the Workload that this Pod
Member

@thockin - I would appreciate your feedback from the API approver perspective.

the standard pod-by-pod scheduling cycle.

When the scheduler pops a Pod from the active queue, it checks if that Pod
belongs to an unscheduled `PodGroup` with a `Gang` policy. If so, the scheduler
Member

As discussed in the Basic policy section above, I actually think that all pods that belong to workloads should go through this phase (whether they form gangs or are just from the basic policy).

I acknowledge that for basic it will be kind of best-effort (if more pods were created, we will get all of them; if we only observed 1, it will be just one), but that better opens doors for future extensions once we have pod templates etc. defined in Workload.

Member

+1
I'd say that Workload scheduling is a phase scheduling PodGroups. Depending on what policy type the group is, the logic is different.

In case of the Basic policy, the group becomes scheduled unconditionally; still, pods belonging to that group cannot be scheduled until the PG itself is scheduled.

Member Author

Removed "Gang" from this sentence. I think we are aligned that the basic-policy pods should be scheduled by this phase


*Proposed:* Implement it on PodGroup Level for Beta. However, future migration
to the Workload Level might necessitate non-trivial changes to the phase
introduced by this KEP.
Member

I suggest resolving that (and potentially mentioning the risk of non-trivial changes in the Risks section).

2. The scheduler nominates the victims for preemption and the gang Pod
for scheduling on their place. This way, the gang can be attempted
without making any intermediate disruptions to the cluster.
* If the quorum is met, the scheduler continues scheduling the gang Pods pod-by-pod.
Member

This should be adjusted once we settle down on the details in #5711

Currently these are not aligned :)

[It's a comment for myself too]

@johnbelamaric (Member) left a comment

I really think we should take the opportunity this release to shift to top-level PodGroup instances, with the templates living in the Workload. I believe this clears up a number of things and will make building on top of PodGroups much easier for the rest of the project.

// +required
Name string
// PodGroup is the name of the PodGroup within the Workload that this Pod
Member

Embedding PodGroups in Workload as done today makes it much harder for clients to track lifecycle events (creation, deletion, resize, etc.) of individual PodGroup replicas. And given that (AIUI) individual replica keys are not all stored in the workload, it may not even be possible to track PodGroup replica creation and deletion today without some sort of watch on Pods, inferring them from that.

I don't think it's too late to pull PodGroup out into its own top-level resource, but it is our last chance to do so. I think it's the better design and we should take this opportunity. With that update, clients can use all the standard API machinery rather than something that calculates deltas from changes in workload objects and/or watching Pods.

In that case Workload may contain PodGroupTemplates but the actual instances of PodGroups would be separate objects with a reference back to the workload and template. So we clearly separate lifecycle of the policy configuration and instances of groups based on that policy configuration.

This then would probably need to change from a WorkloadRef to a PodGroup resource name.
Barring that, I prefer what Eric suggests above.
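
For illustration, a hypothetical shape of that split (names invented for this sketch, not a real Kubernetes API):

```go
package example

// Workload keeps only the policy configuration (templates).
type Workload struct {
	Name              string
	PodGroupTemplates []PodGroupTemplate
}

type PodGroupTemplate struct {
	Name     string
	MinCount int
}

// PodGroup is a separate top-level object, so clients can watch
// create/delete/resize of individual groups with standard API machinery.
type PodGroup struct {
	Name         string
	WorkloadName string // reference back to the owning Workload
	TemplateName string // which template this group was stamped from
	Status       PodGroupStatus
}

type PodGroupStatus struct {
	Scheduled bool // set once the group-level placement has succeeded
}
```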



*Proposed:* Implement it on PodGroup Level for Beta. However, future migration
to the Workload Level might necessitate non-trivial changes to the phase
introduced by this KEP.
Member

+1

Complex true workload controllers should implement any orchestration themselves. In other words, if there really are, say, dependencies between PodGroups, I think we should leave that complexity in the upper-layer controller and have the scheduler just deal in PodGroups. So the controller would wait to create the second PodGroup until after it made sure the first one got scheduled. This does mean we don't have full workload atomic scheduling, only PodGroup atomic scheduling. But that level of complexity probably needs something more like Reservation.

end-to-end Pod scheduling flow, it is planned to place this new phase *before*
the standard pod-by-pod scheduling cycle.

When the scheduler pops a Pod from the active queue, it checks if that Pod
Member

If PodGroup is a separate resource, we can watch for unscheduled PodGroups, and defer scheduling any Pod that references a PodGroup until that PodGroup has been scheduled. I think that's a cleaner design than going through the Pod indirection.


and enables the scheduler to optimize the placement of such pod groups by taking the desired state
into account. Ideally, the scheduler should prefer placements that can accommodate
the full `desiredCount`, even if not all pods are created yet.

Contributor

Add something like this:

When desiredCount is specified, the scheduler can delay scheduling the first pod it sees for a short amount of time, in order to wait for more pods to be created/noticed.

Member Author

Added

Note that the implementation of this specific logic might follow in a Beta stage
of this API field.

#### Delayed Preemption


The deletion of a victim takes some time, due to resource release and other reasons. During this period, how can the binding process of the preemptor be blocked, to prevent the kubelet from rejecting the Pod due to resource over-provisioning on the node?

Member Author

The exact design of delayed preemption is proposed in #5711. Let's move this discussion there

# If the purpose of this KEP is to deprecate a user-visible feature
# and a Deprecated feature gates are added, they should be deprecated|disabled|removed.
stage: alpha
stage: beta
Contributor

I don't completely follow everything in this proposal, but promoting to beta and introducing two new feature gates raises a lot of eyebrows for me.

Contributor

And now we have 4 separate feature gates for this feature...

What are the implications of these feature gates in relation to each other?

I.e., if GenericWorkload is disabled and GangScheduling is enabled, what do we expect to happen?

Will any of these graduate at a different time than the other feature gates?

Member Author

> I don't completely follow everything in this proposal, but promoting to beta and introducing two new feature gates raises a lot of eyebrows for me.

One of the gates (WorkloadBasicPolicyDesiredCount) will start in alpha, but the change is small enough that creating a new KEP for it would be overhead, so we decided to put it in this KEP.

> And now we have 4 separate feature gates for this feature...

Tentatively removed one feature gate from the list. I think the workload scheduling cycle could be covered using the gang scheduling gate.

> What are the implications of these feature gates in relation to each other?

GangScheduling requires GenericWorkload to work, as the latter defines the API that enables gang scheduling. GenericWorkload itself can be enabled and used to express the "true" workload without a need for it to be gang scheduled by the kube-scheduler.

> If GenericWorkload is disabled and GangScheduling is enabled, what do we expect to happen?

The GangScheduling gate requires GenericWorkload to be enabled (this is enforced by the feature gate validation).
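
A minimal sketch of that dependency check (the gate names follow the KEP; the validation function itself is illustrative):

```go
package example

import "fmt"

// validateGates enforces that GangScheduling cannot be enabled without
// GenericWorkload, which defines the API it depends on.
func validateGates(genericWorkload, gangScheduling bool) error {
	if gangScheduling && !genericWorkload {
		return fmt.Errorf("GangScheduling requires GenericWorkload to be enabled")
	}
	return nil
}
```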

@kannon92 (Contributor) commented

Honestly I would really like to see some k8s workloads adopting WAS.

We have #5547 but that seems decoupled from this KEP.

But if we promote the API to beta, then we are going to discourage breaking API changes.

Would it be possible to make beta promotion contingent on at least a few workload APIs proving out this design?

I don't want to get into a situation where this API gets promoted to beta/GA, and then workload authors figure out issues with the API, and then we have to revisit the API but we couldn't break it.



*Proposed:* Implement it on PodGroup Level for Beta. However, future migration
to the Workload Level might necessitate non-trivial changes to the phase
introduced by this KEP.
Member

+1
IMO we need to schedule PodGroups themselves, as they define scheduling constraints at the level of a group of pods, which means the pods cannot be scheduled individually.

If we introduce a group of groups, then this phase would have to schedule such a group as a whole, since individual PodGroups could no longer be scheduled individually. Whenever we add them, the extension of the workload scheduling phase would be incremental.

If we ever had cross-workload scheduling constraints, I bet we'd need to schedule such workloads in one cycle as well, so a single workload may not necessarily be a boundary for this phase.

scheduler's internal cache. Pods are then pushed to the
active queue (restoring their original timestamps to ensure fairness)
to pass through the standard scheduling and binding cycle,
which will respect the nomination.
Member

Can you elaborate on what "respect" means? Do you mean "consider", like currently, or something more? The alternative that I see is to consider the nomination as required; inability to follow the nomination would make the PodGroup unschedulable.

However, as the first implementation I'd stick to the current logic and use the word "consider". That means that Pods could pick a different node, but only within a constraint selected for the PodGroup. In other words, pod-by-pod cannot reschedule the PodGroup itself (assuming scheduling a PodGroup has "an effect", for instance picking a specific topology option).

Contributor

Checking the validity of the PodGroup-level constraint may not be that easy, and may require checking the state of other pods in the PodGroup or the state of DRA allocations. In the context of topology-aware workload scheduling, I believe it would be much easier to consider the nomination to be a hard requirement.

Member Author

I meant "consider" here, but @44past4's point is valid. If we want to have a nomination semantic here, PodGroup-level plugins such as TAS would need to provide Filter extension points that can verify the correctness of the newly chosen node, or be able to provide the placement to the pod-by-pod cycle, limiting that phase to the chosen topology/assignment.

Contributor

To provide a TAS Filter extension, we would need to store information about the selected placement (like the name of the Node label and its value) for the PodGroup. For this we would probably need to use the Workload status. This is doable, but it is not in scope of the current TAS KEP proposal.

@macsko (Member Author) commented Dec 22, 2025

TAS could check in PreFilter where the pod group's previous pods were scheduled and, based on that, reject or allow the nodes in Filter. Obviously, it won't be very efficient, but maybe it's good enough for now.
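
A framework-agnostic Go sketch of that idea (the real scheduler plugin interfaces differ; all names here are illustrative): PreFilter records the topology domain of the gang's already-scheduled pods, and Filter rejects nodes outside it.

```go
package example

// tasState carries PreFilter's finding into Filter for one scheduling cycle.
type tasState struct {
	domain string // e.g., value of the chosen topology label
}

// preFilter inspects the domains of the group's already-scheduled pods and
// remembers the chosen one; nil means no constraint has been fixed yet.
func preFilter(scheduledPodDomains []string) *tasState {
	if len(scheduledPodDomains) == 0 {
		return nil // nothing scheduled yet; any node is acceptable
	}
	return &tasState{domain: scheduledPodDomains[0]}
}

// filter allows a node only if it sits in the recorded domain.
func filter(s *tasState, nodeDomain string) bool {
	return s == nil || s.domain == nodeDomain
}
```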

reducing overall scheduling time.

* If a pod fits, it is tentatively nominated.
* If a pod cannot fit, the scheduler tries preemption by running
Member

I wouldn't use the PostFilter phase in the workload scheduling phase, but rather "wait" for the proper workload preemption feature. Note that once we have pods that are batched (same signature), Opportunistic Batching does not have a way to find feasible-nodes-after-preemption-attempt anyway.

Member Author

But if workload-aware preemption is not there (or not enabled), we need to provide a way for gangs to perform preemptions. Otherwise, what is the point of having delayed preemption in the beta graduation criteria for gang scheduling?

Opportunistic batching just optimizes the default scheduling algorithm. I don't think preemption would be a big problem anyway, because a subsequent pod from the homogeneous sub-group can generate new placements for the rest of the batch.


* If `schedulableCount >= minCount`, the cycle succeeds. Pods are pushed
to the active queue and will soon attempt to be scheduled on their
nominated nodes in their own, pod-by-pod cycles. If a pod selects a
Member

I don't see any strong reason why PostFilter should not run in pod-by-pod scheduling if we allow the nomination to be changed. Yes, it's indeed an open question whether we want to allow changing the nomination itself, but if we do, then we should allow running PostFilter as well.

Member Author

I think that allowing such disruptions in pod-by-pod scheduling would make workload scheduling hard to reason about. The reason we'll use the workload-aware preemption is to enable efficient and effective preemption. Otherwise, each pod in the group could preempt some pods or even workloads based on its own needs rather than the needs of the entire pod group. I don't see many advantages to allowing PostFilter to run in pod-by-pod scheduling.





Would need to inject the workload priority into each of the Pods
or somehow apply the lowest pod's priority to the rest of the group.

Alternative 2 (Store a gang representative):
Contributor

I guess that one more option, which would fall somewhere in between Alternative 2 and Alternative 3, could be changing the implementation of activeQ to operate on groups of Pods, treating single pods as groups containing a single pod only. Does this option make sense? Have you considered it?
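
A tiny sketch of that in-between option (names assumed): the queue stores groups, and an ordinary pod is wrapped as a group of size one.

```go
package example

import v1 "k8s.io/api/core/v1"

// queueItem is what the modified activeQ would hold: a group of pods,
// where an ordinary pod is just a group of size one.
type queueItem struct {
	pods []*v1.Pod
}

func singleton(p *v1.Pod) queueItem { return queueItem{pods: []*v1.Pod{p}} }

func gang(pods []*v1.Pod) queueItem { return queueItem{pods: pods} }
```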

optional. In the `Beta` timeframe, we may opportunistically apply this cycle to
`Basic` pod groups to leverage the batching performance benefits, but the
"all-or-nothing" (`minCount`) checks will be skipped; i.e., we will try to
schedule as many pods from such PodGroup as possible.
Contributor

It is not clear what will happen in case of a pod scheduling failure here. Do we stop at the first failure or do we continue? If we continue, do we continue from a subsequent pod, or maybe from a subsequent sub-group?

Member Author

Extended this part. The algorithm will continue from a pod in a subsequent sub-group.

@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: macsko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Dec 22, 2025
@k8s-ci-robot (Contributor) commented

@macsko: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-enhancements-verify | 43b5aa9 | link | true | /test pull-enhancements-verify |


@macsko changed the title from "WIP: KEP-4671: Introduce Workload Scheduling Cycle, graduate Workload API and gang scheduling to beta" to "KEP-4671: Introduce Workload Scheduling Cycle, extend basic policy with desiredCount" Dec 22, 2025
@k8s-ci-robot removed the do-not-merge/work-in-progress label Dec 22, 2025
@macsko (Member Author) commented Dec 22, 2025

/hold
To make sure we don't accidentally merge this PR

@k8s-ci-robot added the do-not-merge/hold label Dec 22, 2025
* If the group (i.e., at least `minCount` Pods) can be placed,
these Pods have the `.status.nominatedNodeName` set.
They are then effectively "reserved" on those nodes in the
scheduler's internal cache. Pods are then pushed to the


While reserving Pods using NominatedNodes eliminates the need to consider Pod dequeueing order, too many NominatedNodes can still degrade scheduling performance. A mechanism is needed to ensure that Pods can be dequeued as quickly as possible after a PodGroup resource is nominated.

opportunistic batching itself will provide significant improvements.
Future features like Topology Aware Scheduling can further improve other subsets of use cases.

#### Interaction with Basic Policy
Contributor

IIUC, the all-or-nothing semantics of minCount are skipped with WSC enabled for the Basic policy, and desiredCount is only a hint for feasibility checks.

Given that, is it in scope (maybe in the future) to support using both minCount and desiredCount together to express more elastic gang behavior (e.g., minCount as a hard lower bound, desiredCount as the target batch size)?
