Overview
Kueue is an open-source, cloud-native job queueing system designed for Kubernetes, catering to batch, HPC, AI/ML, and similar workloads. It enables organizations to create multi-tenant batch services with resource quotas and hierarchical sharing, ensuring efficient resource allocation across teams. Kueue integrates seamlessly with Kubernetes tools like kube-scheduler and cluster-autoscaler, supporting both on-premises and cloud environments with dynamic and heterogeneous resources. The project welcomes contributions via GitHub Pull Requests.
What is Kueue? 🚀
Kueue is a Kubernetes-native job queueing system built to manage batch workloads, high-performance computing (HPC), AI/ML training, and other similar applications. It helps organizations optimize resource usage by enforcing quotas, prioritizing jobs, and ensuring fair sharing across multiple teams or tenants.
Key Features of Kueue ✨
- Multi-Tenancy & Resource Quotas
  - Define resource quotas (CPU, memory, GPUs) for teams or departments.
  - Let teams borrow each other's unused quota through hierarchical sharing, which prevents resource starvation and keeps allocation fair.
  - Example: on a 100-CPU cluster, Team A gets a quota of 50 CPUs, Team B 30, and Team C 20. Kueue quotas are absolute quantities rather than percentages.
- Job Scheduling & Prioritization
  - Decides when a job should wait (queueing) and when it is admitted to start.
  - Works alongside Kubernetes' kube-scheduler, which places the admitted job's pods on nodes.
  - Supports preemption (higher-priority jobs can interrupt lower-priority ones).
- Dynamic & Heterogeneous Resource Support
  - Works in on-premises and cloud environments.
  - Handles heterogeneous resources (e.g., different GPU types, spot instances).
  - Integrates with cluster-autoscaler to dynamically provision resources as needed.
- Seamless Kubernetes Integration
  - Built to work with standard Kubernetes tools (e.g., kube-scheduler, cluster-autoscaler).
  - No custom scheduler is needed: Kueue extends existing Kubernetes functionality.
- Open-Source & Community-Driven
  - Hosted on GitHub with a Pull Request-based contribution model.
  - Welcomes new contributors and users to improve the project.
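To make the quota example from the feature list concrete, here is a minimal Python sketch that turns a 50/30/20 split into Kueue queue manifests. It assumes the `kueue.x-k8s.io/v1beta1` API (check which version your Kueue release serves); the team names, the `shared-pool` cohort, the `default-flavor` flavor, and the cluster totals are all hypothetical values chosen for illustration.

```python
import json

# Assumption: v1beta1 is the API version served by your Kueue release.
API_VERSION = "kueue.x-k8s.io/v1beta1"


def cluster_queue(name: str, cohort: str, cpu: int, memory_gi: int, flavor: str) -> dict:
    """Build a ClusterQueue manifest with absolute CPU and memory quotas."""
    return {
        "apiVersion": API_VERSION,
        "kind": "ClusterQueue",
        "metadata": {"name": name},
        "spec": {
            # Queues in the same cohort can borrow each other's unused quota.
            "cohort": cohort,
            # Empty selector: accept workloads from any namespace.
            "namespaceSelector": {},
            "resourceGroups": [{
                "coveredResources": ["cpu", "memory"],
                "flavors": [{
                    "name": flavor,
                    "resources": [
                        {"name": "cpu", "nominalQuota": str(cpu)},
                        {"name": "memory", "nominalQuota": f"{memory_gi}Gi"},
                    ],
                }],
            }],
        },
    }


def local_queue(namespace: str, name: str, target: str) -> dict:
    """Build a namespaced LocalQueue that points at a ClusterQueue."""
    return {
        "apiVersion": API_VERSION,
        "kind": "LocalQueue",
        "metadata": {"namespace": namespace, "name": name},
        "spec": {"clusterQueue": target},
    }


def build_bundle(shares: dict[str, int], total_cpu: int, total_memory_gi: int) -> dict:
    """Convert percentage shares into absolute quotas, one queue pair per team."""
    flavor = "default-flavor"  # hypothetical flavor name
    items = [{"apiVersion": API_VERSION, "kind": "ResourceFlavor",
              "metadata": {"name": flavor}}]
    for team, pct in shares.items():
        cpu = total_cpu * pct // 100
        mem = total_memory_gi * pct // 100
        items.append(cluster_queue(f"{team}-cq", "shared-pool", cpu, mem, flavor))
        # Assumes a namespace named after the team already exists.
        items.append(local_queue(team, f"{team}-queue", f"{team}-cq"))
    return {"apiVersion": "v1", "kind": "List", "items": items}


if __name__ == "__main__":
    shares = {"team-a": 50, "team-b": 30, "team-c": 20}
    print(json.dumps(build_bundle(shares, total_cpu=100, total_memory_gi=400), indent=2))
```

Since kubectl accepts JSON as well as YAML, the output can be piped into `kubectl apply -f -` on a cluster where Kueue is installed.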
How Kueue Works 🔧
- Job Submission
  - Users submit batch jobs (e.g., AI training, data processing) to the cluster and name the queue each job should go to.
  - Queues are typically set up per team, so a job's queue determines which quota it is charged against.
- Quota Enforcement
  - Kueue checks if the job fits within the available quota for the team/tenant.
  - If resources are available, the job proceeds; otherwise, it waits in the queue.
- Scheduling & Execution
  - Once Kueue admits a job, kube-scheduler places its pods on suitable nodes.
  - If resources are scarce, preemption may occur (higher-priority jobs take precedence).
- Dynamic Scaling (Optional)
  - If integrated with cluster-autoscaler, Kueue can trigger auto-scaling to meet demand.
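The queue-then-admit flow above can be illustrated with a toy model. This Python sketch is not Kueue's actual algorithm: it tracks a single CPU quota per team, ignores borrowing, preemption, and autoscaling, and every name in it (`Job`, `TeamQueue`, `submit`, `finish`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    cpu: int
    priority: int = 0


@dataclass
class TeamQueue:
    quota: int                                   # CPUs this team may use
    used: int = 0                                # CPUs held by admitted jobs
    pending: list[Job] = field(default_factory=list)
    running: list[Job] = field(default_factory=list)


def admit_pending(q: TeamQueue) -> list[str]:
    """Admit waiting jobs that fit the remaining quota, highest priority first."""
    admitted, still_waiting = [], []
    # sorted() is stable, so jobs with equal priority keep submission order.
    for job in sorted(q.pending, key=lambda j: -j.priority):
        if q.used + job.cpu <= q.quota:
            q.used += job.cpu
            q.running.append(job)
            admitted.append(job.name)
        else:
            still_waiting.append(job)
    q.pending = still_waiting
    return admitted


def submit(q: TeamQueue, job: Job) -> list[str]:
    """Queue a job, then return the names of jobs admitted as a result."""
    q.pending.append(job)
    return admit_pending(q)


def finish(q: TeamQueue, job: Job) -> list[str]:
    """Release a finished job's quota and admit whatever now fits."""
    q.running.remove(job)
    q.used -= job.cpu
    return admit_pending(q)


if __name__ == "__main__":
    q = TeamQueue(quota=8)
    train_a = Job("train-a", cpu=6)
    print(submit(q, train_a))                              # fits: admitted
    print(submit(q, Job("train-b", cpu=4, priority=10)))   # 6 + 4 > 8: waits
    print(submit(q, Job("etl", cpu=2)))                    # 6 + 2 <= 8: admitted
    print(finish(q, train_a))                              # frees 6 CPUs: train-b admitted
```

In this model a small job can be admitted while a larger, higher-priority job keeps waiting. Whether that is acceptable is a policy decision, which is why a real queueing system makes the ordering strategy configurable.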
Use Cases 🎯
- AI/ML Training
  - Manage GPU/TPU resources efficiently across multiple teams.
  - Prevent resource hogging by enforcing quotas.
- High-Performance Computing (HPC)
  - Schedule large-scale simulations or data processing jobs.
  - Ensure fair sharing in shared HPC clusters.
- Batch Processing
  - Run ETL (Extract, Transform, Load) jobs at scale.
  - Optimize resource usage for cost savings.
- Multi-Tenant Kubernetes Clusters
  - Share a single cluster among multiple teams without conflicts.
  - Enforce resource limits to prevent noisy neighbors.
Getting Started with Kueue 🛠️
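- Installation
  - Kueue can be installed via Helm or kubectl (YAML manifests).
  - The minimum Kubernetes version depends on the Kueue release (1.22+ for early releases, newer for recent ones), so check the installation guide for your version.
- Configuration
  - Define resource quotas (e.g., CPU, memory, GPUs).
  - Set up queues for different teams or job types.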
- Submitting Jobs
  - Jobs are submitted as regular Kubernetes Jobs (or another supported job type) that name their target queue in the kueue.x-k8s.io/queue-name label.
  - Example: kubectl apply -f job.yaml
- Monitoring & Logging
  - Use Kubernetes-native tools (e.g., kubectl get jobs, Prometheus, Grafana).
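As a rough illustration of the submission step, the following Python sketch generates a Job manifest that targets a Kueue queue. The `kueue.x-k8s.io/queue-name` label and the `batch/v1` Job fields are standard; the job name, queue name (`team-a-queue`), namespace, image, and resource requests are hypothetical placeholders to adapt.

```python
import json


def queued_job(name: str, queue: str, image: str, command: list[str],
               cpu: str, memory: str, namespace: str = "default") -> dict:
    """Build a batch/v1 Job manifest that is handed to a Kueue queue."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # This label tells Kueue which LocalQueue the job belongs to.
            "labels": {"kueue.x-k8s.io/queue-name": queue},
        },
        "spec": {
            # Created suspended; Kueue unsuspends the job when it is admitted.
            "suspend": True,
            "parallelism": 1,
            "completions": 1,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "main",
                        "image": image,
                        "command": command,
                        # Requests are what the job is charged against the quota.
                        "resources": {"requests": {"cpu": cpu, "memory": memory}},
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    job = queued_job(
        name="sample-job",
        queue="team-a-queue",      # hypothetical LocalQueue name
        image="busybox",
        command=["sleep", "30"],
        cpu="1",
        memory="200Mi",
        namespace="team-a",        # hypothetical namespace
    )
    print(json.dumps(job, indent=2))
```

The output can be saved as `job.json` and applied with `kubectl apply -f job.json`, or piped directly into `kubectl apply -f -`. While the job waits for quota it remains suspended, which `kubectl get jobs` and `kubectl get workloads` in the job's namespace should reflect.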
Why Choose Kueue? 🏆
✅ Built for Kubernetes – No need for external job schedulers.
✅ Multi-Tenancy Support – Fair resource sharing across teams.
✅ Dynamic & Cloud-Friendly – Works with autoscaling and heterogeneous resources.
✅ Open-Source & Extensible – Customize and contribute to the project.
✅ Cost-Effective – Optimize resource usage to reduce cloud spending.
Limitations & Considerations ⚠️
- Learning Curve – Requires familiarity with Kubernetes concepts.
- Not a Replacement for All Schedulers – Best suited for batch/HPC/AI workloads (not real-time apps).
- Community-Driven – Features depend on contributions and community adoption.
Extra links
- https://github.com/kubernetes-sigs/kueue
- https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/
- https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
- https://helm.sh/docs/intro/install/