Description: KAI Scheduler is an open-source, Kubernetes-native scheduler for AI workloads at large scale
View nvidia/kai-scheduler on GitHub
KAI Scheduler is an open-source Kubernetes scheduler developed by NVIDIA, designed to optimize resource allocation for accelerated computing workloads, particularly those using GPUs. It addresses limitations of the default Kubernetes scheduler when dealing with complex hardware topologies and the diverse workload requirements common in AI, machine learning, and high-performance computing (HPC) environments. Rather than replacing the core Kubernetes scheduler, KAI Scheduler runs as an additional, priority-aware scheduler that coexists and cooperates with the default one: it handles the workloads explicitly assigned to it, while the default scheduler continues to manage everything else.
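In practice, a workload opts into a secondary scheduler like this by naming it in its pod spec. The sketch below follows that convention; the queue label key and value are assumptions for illustration and may differ between KAI Scheduler versions.

```yaml
# Sketch of a pod routed to KAI Scheduler instead of the default scheduler.
# The queue label shown is illustrative and may vary by version.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
  labels:
    kai.scheduler/queue: default-queue   # assumed queue label key/value
spec:
  schedulerName: kai-scheduler           # hand this pod to KAI Scheduler
  containers:
    - name: trainer
      image: my-training-image:latest    # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```

Pods without this `schedulerName` keep going through the default scheduler, which is what allows the two schedulers to share a cluster.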
The core problem KAI Scheduler solves is efficient GPU resource allocation. The standard Kubernetes scheduler often struggles to pack GPU workloads well when factors such as GPU memory, interconnect bandwidth (e.g., NVLink), and GPU affinity matter. KAI Scheduler introduces a more sophisticated scheduling algorithm that understands these hardware characteristics and makes placement decisions that maximize GPU utilization and minimize communication overhead. It does this through a "filter and score" framework, similar to the default scheduler's, but with custom filters and scoring functions tailored to accelerated workloads: filters eliminate unsuitable nodes, and scoring functions rank the remaining nodes by their suitability for the pod.
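The filter-and-score pattern can be sketched in a few lines. This is a hypothetical illustration of the general framework, with made-up node attributes and scoring weights, not KAI Scheduler's actual implementation:

```python
# Hypothetical filter-and-score pass: filters reject infeasible nodes,
# then a scoring function ranks the survivors. All names and weights
# here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_gpus: int
    free_gpu_mem_gib: int   # free memory per GPU device
    has_nvlink: bool        # fast GPU interconnect present

@dataclass
class PodRequest:
    gpus: int
    gpu_mem_gib: int
    wants_nvlink: bool

def filter_nodes(nodes, req):
    """Filter phase: drop nodes that cannot run the pod at all."""
    return [n for n in nodes
            if n.free_gpus >= req.gpus and n.free_gpu_mem_gib >= req.gpu_mem_gib]

def score_node(node, req):
    """Score phase: prefer tight packing (fewer leftover GPUs) and
    honor the pod's interconnect preference."""
    score = 100 - (node.free_gpus - req.gpus) * 10   # bin-packing bias
    if req.wants_nvlink and node.has_nvlink:
        score += 50                                   # topology bonus
    return score

def schedule(nodes, req):
    """Return the best node's name, or None if the pod must stay pending."""
    feasible = filter_nodes(nodes, req)
    if not feasible:
        return None
    return max(feasible, key=lambda n: score_node(n, req)).name
```

Note how the bin-packing bias steers a 2-GPU pod onto a node with exactly 2 free GPUs rather than fragmenting a fully idle 8-GPU node, which is the kind of placement decision the paragraph above describes.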
Key features of KAI Scheduler include gang scheduling for distributed jobs, queue-based resource fairness between teams, and support for GPU sharing technologies such as Multi-Instance GPU (MIG) and fractional/virtual GPUs, enabling finer-grained resource allocation and higher utilization.
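GPU sharing is typically expressed declaratively on the pod. The fragment below is a sketch based on the project's published examples; the `gpu-fraction` annotation key and its semantics are assumptions that may change between versions.

```yaml
# Sketch of fractional GPU sharing: the annotation requests half a GPU,
# so two such pods can share one physical device. Annotation key and
# behavior are assumed from project examples and may vary by version.
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-inference
  annotations:
    gpu-fraction: "0.5"          # assumed key: request half of one GPU
spec:
  schedulerName: kai-scheduler
  containers:
    - name: inference
      image: my-inference-image:latest   # placeholder image
```

Note that a fractional pod would not also set an `nvidia.com/gpu` resource limit, since the annotation, not the resource request, expresses the share.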