Kubernetes v1.36 Refines Scheduling with Decoupled Workload and PodGroup APIs

Introduction

Kubernetes continues to evolve its scheduling capabilities to meet the demands of complex workloads, including AI/ML and batch processing. Version 1.35 introduced the initial Workload API and basic gang scheduling support. Kubernetes v1.36 builds on that foundation by separating the Workload and PodGroup APIs, which makes workload-aware scheduling more efficient and scalable. This article covers the key enhancements in v1.36: the decoupled API design, the new PodGroup scheduling cycle, topology-aware scheduling, workload-aware preemption, Dynamic Resource Allocation (DRA) support, and integration with the Job controller.

Workload and PodGroup API Redesign

In Kubernetes v1.36, the Workload and PodGroup APIs have been completely restructured. The previous v1alpha1 version combined pod group definitions and their runtime state within a single Workload resource. The new v1alpha2 API version separates these concerns: the Workload now serves as a static template, while the PodGroup manages the runtime state. Because each replica's status lives in its own PodGroup object, status updates are sharded per replica instead of contending on a single Workload, which improves performance and scalability. The kube-scheduler can also read the PodGroup object directly without parsing the Workload template, which simplifies its logic and reduces overhead.

Updated Configuration Example

Workload controllers, such as the Job controller, define a Workload object that contains podGroupTemplates. Each template specifies the scheduling policy, for example, gang scheduling with a minimum number of Pods. The controller then stamps out runtime PodGroup instances based on these templates. The PodGroup holds the actual scheduling policy, a reference to the originating template, and a status field that reflects the overall scheduling state of the group through conditions mirroring individual Pod states.

apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  podGroupTemplates:
  - name: workers
    schedulingPolicy:
      gang:
        minCount: 4

This design enhances modularity and makes it easier to scale status tracking across many replicas.
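
The template-to-instance relationship described above can be illustrated with a small in-memory model. This is only a sketch: the class names, field names, the `stamp_pod_groups` helper, the PodGroup naming scheme, and the `Scheduled` condition key are all hypothetical and are not the real scheduling.k8s.io/v1alpha2 types or the Job controller's code.

```python
from dataclasses import dataclass, field

# Toy model only: names here are illustrative assumptions,
# not the real scheduling.k8s.io/v1alpha2 API types.

@dataclass(frozen=True)
class PodGroupTemplate:
    name: str
    min_count: int  # gang policy: minimum number of Pods in the group


@dataclass(frozen=True)
class Workload:
    name: str
    templates: tuple  # static template data, never mutated at runtime


@dataclass
class PodGroup:
    name: str
    workload_name: str   # reference back to the originating Workload
    template_name: str   # reference to the template inside that Workload
    min_count: int       # scheduling policy copied from the template
    conditions: dict = field(default_factory=dict)  # runtime state lives here


def stamp_pod_groups(workload, replicas):
    """Create one runtime PodGroup per template for each replica."""
    groups = []
    for replica in range(replicas):
        for tmpl in workload.templates:
            groups.append(PodGroup(
                name=f"{workload.name}-{tmpl.name}-{replica}",  # hypothetical naming
                workload_name=workload.name,
                template_name=tmpl.name,
                min_count=tmpl.min_count,
            ))
    return groups


workload = Workload("training-job-workload", (PodGroupTemplate("workers", 4),))
groups = stamp_pod_groups(workload, replicas=2)

# A status update touches only one replica's PodGroup; the Workload is untouched.
groups[0].conditions["Scheduled"] = "True"  # hypothetical condition key
```

In this model, writing the status of one replica never modifies the shared Workload or any other replica's PodGroup, which is the property the per-replica sharding relies on.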

Scheduler Enhancements for PodGroups

The kube-scheduler in v1.36 introduces a new PodGroup scheduling cycle that enables atomic processing of entire workload groups. This paves the way for future enhancements like advanced co-scheduling and batch job optimizations. The scheduler can now reason about groups of Pods as a single unit, which is critical for distributed training jobs where all workers must start together.
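
To make the idea of atomic group processing concrete, here is a minimal all-or-nothing placement sketch. It is not the kube-scheduler's algorithm: the `schedule_group` function, the CPU-only resource model, and the first-fit strategy are assumptions chosen to keep the example short.

```python
def schedule_group(pod_cpu_requests, node_free_cpu, min_count):
    """Toy gang placement: return {pod_index: node_name} if at least
    min_count Pods fit, otherwise None (nothing is committed)."""
    free = dict(node_free_cpu)  # work on a copy so a failed attempt changes nothing
    placement = {}
    for i, cpu in enumerate(pod_cpu_requests):
        # First-fit: take the first node with enough free CPU for this Pod.
        for node, avail in free.items():
            if avail >= cpu:
                free[node] = avail - cpu
                placement[i] = node
                break
    if len(placement) < min_count:
        return None  # the group is treated as a unit: all or nothing
    return placement


nodes = {"node-a": 4, "node-b": 2}
fits = schedule_group([2, 2, 2], nodes, min_count=3)
too_big = schedule_group([2, 2, 2, 2], nodes, min_count=4)
```

Here the three-Pod group is placed in full, while the four-Pod group is rejected as a whole even though three of its Pods would fit individually.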

Topology-Aware Scheduling

Kubernetes v1.36 also debuts the first iteration of topology-aware scheduling for PodGroups. This feature allows the scheduler to consider topology constraints (such as node affinity, zone spread, or device placement) when scheduling an entire group. For example, a machine learning training job might require all Pods to be placed on nodes within the same rack to minimize network latency. Topology-aware scheduling helps achieve that by coordinating placement across the whole group rather than one Pod at a time.
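
The same-rack example can be sketched as a search for a single topology domain with room for the whole group. The `pick_rack` helper, the slot-based capacity model, and the tightest-fit heuristic are assumptions for illustration; the source does not describe how the scheduler actually chooses a domain.

```python
from collections import defaultdict


def pick_rack(pod_count, nodes):
    """Toy topology-aware choice.

    nodes: list of (node_name, rack_label, free_pod_slots) tuples.
    Returns the rack that can hold the entire group, or None.
    """
    slots_per_rack = defaultdict(int)
    for _name, rack, free_slots in nodes:
        slots_per_rack[rack] += free_slots
    # Keep only racks that fit the whole group, then prefer the tightest fit
    # (an assumed heuristic that leaves larger racks free for larger groups).
    candidates = [(slots, rack) for rack, slots in slots_per_rack.items()
                  if slots >= pod_count]
    return min(candidates)[1] if candidates else None


cluster = [
    ("n1", "rack-a", 2),
    ("n2", "rack-a", 1),
    ("n3", "rack-b", 3),
    ("n4", "rack-b", 3),
]
```

With this data, a group of 3 Pods lands on `rack-a`, a group of 4 only fits on `rack-b`, and a group of 7 has no single rack that can host it.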

Workload-Aware Preemption

Another new capability is workload-aware preemption. When resources are scarce, the scheduler can preempt lower-priority Pods with awareness of the workload group they belong to, rather than treating each Pod in isolation. The aim is to avoid preemption decisions that leave a group partially scheduled, since a gang that cannot run in full holds resources without making progress. That in turn improves overall cluster utilization and job completion times.
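
One way to picture group-aware victim selection is to treat each running group as the unit of eviction. The sketch below is a simplified assumption, not the scheduler's real preemption logic: `select_victim_groups`, the dictionary layout, and the lowest-priority-first ordering are all hypothetical.

```python
def select_victim_groups(running_groups, cpu_needed, preemptor_priority):
    """Toy workload-aware preemption.

    running_groups: list of dicts with 'name', 'priority', 'pod_cpus'.
    Returns the names of whole groups to evict, or [] if evicting every
    lower-priority group would still not free enough CPU.
    """
    eligible = sorted(
        (g for g in running_groups if g["priority"] < preemptor_priority),
        key=lambda g: g["priority"],  # evict the lowest priority first
    )
    victims, freed = [], 0
    for group in eligible:
        if freed >= cpu_needed:
            break
        victims.append(group["name"])
        freed += sum(group["pod_cpus"])  # every Pod of the group goes together
    return victims if freed >= cpu_needed else []


running = [
    {"name": "batch-a", "priority": 10, "pod_cpus": [2, 2]},
    {"name": "batch-b", "priority": 5, "pod_cpus": [1, 1]},
    {"name": "serving", "priority": 100, "pod_cpus": [4]},
]
```

Because groups are never split, a victim group is either left fully running or fully evicted, and higher-priority groups such as `serving` are never considered.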

Dynamic Resource Allocation for Workloads

ResourceClaim support for workloads now unlocks Dynamic Resource Allocation (DRA) for PodGroups. With DRA, Pods can request specialized resources (e.g., GPUs, FPGAs) on demand. In v1.36, the PodGroup can carry ResourceClaims, enabling the scheduler to allocate resources to the entire group atomically. This is especially valuable for AI/ML workloads that require consistent resource types across all replicas.

Job Controller Integration

To demonstrate real-world readiness, v1.36 delivers the first phase of integration between the Job controller and the new scheduling API. Jobs that use the updated Workload and PodGroup APIs can now be scheduled with the enhanced gang scheduling and topology awareness. This paves the way for future versions to fully leverage the new capabilities for batch and MPI workloads, reducing the need for custom schedulers.

Conclusion

Kubernetes v1.36 marks a major milestone in workload-aware scheduling by decoupling the Workload and PodGroup APIs, introducing a new PodGroup scheduling cycle, and adding topology-aware scheduling, workload-aware preemption, and DRA support for PodGroups. These enhancements improve performance, scalability, and usability for complex workloads like AI/ML and batch processing. The first phase of Job controller integration validates the approach and sets the stage for broader adoption. As Kubernetes continues to evolve, these foundations will enable more sophisticated scheduling strategies in future releases.
