Kueue - Xavki

Overview

Kueue is an open-source, cloud-native job queueing system designed for Kubernetes, catering to batch, HPC, AI/ML, and similar workloads. It enables organizations to create multi-tenant batch services with resource quotas and hierarchical sharing, ensuring efficient resource allocation across teams. Kueue integrates seamlessly with Kubernetes tools like kube-scheduler and cluster-autoscaler, supporting both on-premises and cloud environments with dynamic and heterogeneous resources. The project welcomes contributions via GitHub Pull Requests.

What is Kueue? 🚀

Kueue is a Kubernetes-native job queueing system built to manage batch workloads, high-performance computing (HPC), AI/ML training, and other similar applications. It helps organizations optimize resource usage by enforcing quotas, prioritizing jobs, and ensuring fair sharing across multiple teams or tenants.

Key Features of Kueue ✨

Multi-Tenancy & Resource Quotas
- Define resource quotas (CPU, memory, GPUs) for teams or departments.
- Enforce hierarchical sharing to prevent resource starvation and ensure fairness.
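- Example: Team A gets 50% of cluster resources, Team B gets 30%, and Team C gets 20%. Kueue quotas are set as absolute quantities per resource (for example, CPU cores or GPUs), so a split like this is written as concrete amounts for each team.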
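Because quotas are written as absolute quantities, a percentage split has to be converted before it can be configured. A minimal sketch of that arithmetic, assuming a hypothetical cluster of 200 CPU cores; the function name `split_quota` and the team names are illustrative and not part of Kueue:

```python
def split_quota(total: int, shares: dict[str, int]) -> dict[str, int]:
    """Convert percentage shares into absolute per-team quotas.

    `shares` maps a team name to an integer percentage; they must sum to 100.
    Fractions are rounded down and any remainder goes to the largest share.
    """
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100")
    quotas = {team: total * pct // 100 for team, pct in shares.items()}
    leftover = total - sum(quotas.values())
    largest = max(shares, key=shares.get)
    quotas[largest] += leftover
    return quotas


# Hypothetical cluster size (200 cores); the 50/30/20 shares come from the example.
quotas = split_quota(200, {"team-a": 50, "team-b": 30, "team-c": 20})
print(quotas)
```

The resulting numbers are what would go into each team's quota definition.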
Job Scheduling & Prioritization
- Decides when a job should wait in a queue, when it is admitted to start, and when it should be preempted.
- Leaves pod placement to Kubernetes' kube-scheduler once a job has been admitted.
- Supports preemption (higher-priority jobs can interrupt lower-priority ones).
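The queueing side can be pictured as a priority queue: higher-priority jobs come out first, and older jobs come out first among equals. The sketch below is a toy model of that ordering only, not Kueue's implementation; `PendingQueue` and the job names are hypothetical.

```python
import heapq
import itertools


class PendingQueue:
    """Toy pending-job queue: highest priority first, FIFO among equal priorities."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._arrival = itertools.count()  # increasing counter records arrival order

    def push(self, name: str, priority: int) -> None:
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), name))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = PendingQueue()
queue.push("nightly-etl", priority=10)
queue.push("model-training", priority=100)
queue.push("report", priority=10)

order = [queue.pop() for _ in range(len(queue))]
print(order)
```

The arrival counter is what keeps two jobs of equal priority in submission order.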
Dynamic & Heterogeneous Resource Support
- Works in on-premises and cloud environments.
- Handles heterogeneous resources (e.g., different GPU types, spot instances).
- Integrates with cluster-autoscaler to dynamically provision resources as needed.
Seamless Kubernetes Integration
- Built to work with standard Kubernetes tools (e.g., kube-scheduler, cluster-autoscaler).
- No need for custom schedulers—Kueue extends existing Kubernetes functionality.
Open-Source & Community-Driven
- Hosted on GitHub with a Pull Request-based contribution model.
- Welcomes new contributors and users to improve the project.

How Kueue Works 🔧

Job Submission
- Users submit batch jobs (e.g., AI training, data processing) to the Kubernetes API as usual, naming the target queue on the job.
- Queues are typically organized by team, priority, or resource requirements.
Quota Enforcement
- Kueue checks if the job fits within the available quota for the team/tenant.
- If resources are available, the job proceeds; otherwise, it waits in the queue.
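The quota check described above can be sketched as a small model: a job is admitted only if every resource it requests still fits within the team's quota, and otherwise it waits. This is an illustrative simplification under assumed names (`TeamQueue`, `fits`, `submit`), not Kueue's actual admission logic.

```python
from dataclasses import dataclass, field


@dataclass
class TeamQueue:
    """Toy model of one team's quota; class and field names are hypothetical."""

    quota: dict[str, int]  # e.g. {"cpu": 8, "gpu": 2}
    used: dict[str, int] = field(default_factory=dict)
    waiting: list[str] = field(default_factory=list)

    def fits(self, request: dict[str, int]) -> bool:
        # A job fits only if every requested resource stays within quota.
        return all(
            self.used.get(res, 0) + amount <= self.quota.get(res, 0)
            for res, amount in request.items()
        )

    def submit(self, name: str, request: dict[str, int]) -> bool:
        """Admit the job if it fits, otherwise leave it waiting in the queue."""
        if self.fits(request):
            for res, amount in request.items():
                self.used[res] = self.used.get(res, 0) + amount
            return True
        self.waiting.append(name)
        return False


team = TeamQueue(quota={"cpu": 8, "gpu": 2})
first = team.submit("train-1", {"cpu": 4, "gpu": 2})   # fits within quota
second = team.submit("train-2", {"cpu": 2, "gpu": 1})  # GPU quota already used up
print(first, second, team.waiting)
```

Note that the second job waits even though CPU is still free: one exhausted resource is enough to hold a job back.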
Scheduling & Execution
- Once Kueue admits a job, kube-scheduler places its pods on suitable nodes.
- If resources are scarce, preemption may occur (higher-priority jobs take precedence).
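One way to picture preemption is as a victim-selection step: when a higher-priority job needs capacity, pick the lowest-priority running jobs until enough is freed. The sketch below assumes a single resource (CPU cores) and a hypothetical helper `pick_victims`; Kueue's real preemption rules are configurable and more involved.

```python
def pick_victims(running, needed, incoming_priority):
    """Return names of lower-priority jobs to preempt, or [] if that cannot free enough.

    running: list of (name, priority, cpus) tuples for currently running jobs.
    needed: CPU cores the incoming job still lacks.
    """
    # Only jobs with strictly lower priority are candidates, lowest priority first.
    candidates = sorted(
        (job for job in running if job[1] < incoming_priority),
        key=lambda job: job[1],
    )
    victims, freed = [], 0
    for name, _priority, cpus in candidates:
        if freed >= needed:
            break
        victims.append(name)
        freed += cpus
    # Preempt nothing if even all candidates together would not be enough.
    return victims if freed >= needed else []


running = [("batch-a", 10, 4), ("batch-b", 50, 4), ("critical", 100, 8)]
print(pick_victims(running, needed=6, incoming_priority=80))
print(pick_victims(running, needed=20, incoming_priority=80))
```

Returning an empty list when the shortfall cannot be covered avoids interrupting jobs for no benefit.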
Dynamic Scaling (Optional)
- If integrated with cluster-autoscaler, Kueue can trigger auto-scaling to meet demand.

Use Cases 🎯

AI/ML Training
- Manage GPU/TPU resources efficiently across multiple teams.
- Prevent resource hogging by enforcing quotas.
High-Performance Computing (HPC)
- Schedule large-scale simulations or data processing jobs.
- Ensure fair sharing in shared HPC clusters.
Batch Processing
- Run ETL (Extract, Transform, Load) jobs at scale.
- Optimize resource usage for cost savings.
Multi-Tenant Kubernetes Clusters
- Share a single cluster among multiple teams without conflicts.
- Enforce resource limits to prevent noisy neighbors.

Getting Started with Kueue 🛠️

Installation
- Kueue can be installed via Helm or kubectl (YAML manifests).
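- The minimum Kubernetes version depends on the Kueue release (early releases required 1.22+), so check the installation guide for the release you install.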
Configuration
- Define resource quotas (e.g., CPU, memory, GPUs) in a ClusterQueue.
- Set up a LocalQueue for each team namespace or job type, pointing at the ClusterQueue it draws from.
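As a rough illustration of the configuration step, the sketch below builds the three Kueue objects usually involved (a ResourceFlavor, a ClusterQueue holding the quota, and a LocalQueue in a team namespace) as Python dictionaries and prints them as JSON. The API version shown is an assumption that may differ in your release, and all object names, namespaces, and quota values are hypothetical.

```python
import json

# Assumption: verify which API version your Kueue release serves.
API_VERSION = "kueue.x-k8s.io/v1beta1"


def resource_flavor(name):
    return {"apiVersion": API_VERSION, "kind": "ResourceFlavor", "metadata": {"name": name}}


def cluster_queue(name, flavor, quotas):
    """Build a ClusterQueue with one resource group covering the given quotas."""
    return {
        "apiVersion": API_VERSION,
        "kind": "ClusterQueue",
        "metadata": {"name": name},
        "spec": {
            "namespaceSelector": {},  # empty selector: accept jobs from all namespaces
            "resourceGroups": [{
                "coveredResources": list(quotas),
                "flavors": [{
                    "name": flavor,
                    "resources": [
                        {"name": res, "nominalQuota": qty} for res, qty in quotas.items()
                    ],
                }],
            }],
        },
    }


def local_queue(name, namespace, cluster_queue_name):
    return {
        "apiVersion": API_VERSION,
        "kind": "LocalQueue",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"clusterQueue": cluster_queue_name},
    }


manifests = [
    resource_flavor("default-flavor"),
    cluster_queue("team-a-cq", "default-flavor", {"cpu": "16", "memory": "64Gi"}),
    local_queue("team-a-queue", "team-a", "team-a-cq"),
]
# kubectl accepts JSON; wrapping the objects in a v1 List keeps them in one document.
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

If the script is saved as, say, `queues.py`, its output could be piped to `kubectl apply -f -`.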
Submitting Jobs
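- Jobs are submitted as ordinary Kubernetes Job objects (or another supported job type) carrying the kueue.x-k8s.io/queue-name label, which names the target queue. Kueue's own custom resources, such as ClusterQueue and LocalQueue, describe the queues rather than the jobs.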
- Example: kubectl apply -f job.yaml
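To make the submission step concrete, here is a minimal sketch that builds a standard batch/v1 Job pointing at a queue through the `kueue.x-k8s.io/queue-name` label and prints it as JSON, which kubectl accepts alongside YAML. The helper `queued_job`, the queue name `team-a-queue`, the namespace, and the container image are all hypothetical.

```python
import json


def queued_job(name, namespace, queue, image, command, cpu="1", memory="1Gi"):
    """Build a batch/v1 Job that Kueue manages through its queue-name label."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {"kueue.x-k8s.io/queue-name": queue},
        },
        "spec": {
            "suspend": True,  # start suspended; Kueue unsuspends the Job once admitted
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "main",
                        "image": image,
                        "command": command,
                        # Requests are what gets counted against the queue's quota.
                        "resources": {"requests": {"cpu": cpu, "memory": memory}},
                    }],
                },
            },
        },
    }


job = queued_job("sample-job", "team-a", "team-a-queue", "busybox", ["sleep", "30"])
print(json.dumps(job, indent=2))
```

Saving the output to a file and running `kubectl apply -f` on it submits the job; it stays suspended until the queue has room for it.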
Monitoring & Logging
- Use Kubernetes-native tools (e.g., kubectl get jobs, Prometheus, Grafana).

Why Choose Kueue? 🏆

✅ Built for Kubernetes – No need for external job schedulers.
✅ Multi-Tenancy Support – Fair resource sharing across teams.
✅ Dynamic & Cloud-Friendly – Works with autoscaling and heterogeneous resources.
✅ Open-Source & Extensible – Customize and contribute to the project.
✅ Cost-Effective – Optimize resource usage to reduce cloud spending.

Limitations & Considerations ⚠️

Learning Curve – Requires familiarity with Kubernetes concepts.
Not a Replacement for All Schedulers – Best suited for batch/HPC/AI workloads (not real-time apps).
Community-Driven – Features depend on contributions and community adoption.

Extra links

https://github.com/kubernetes-sigs/kueue
https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/
https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
https://helm.sh/docs/intro/install/
