Kueue

Overview

Kueue is an open-source, cloud-native job queueing system for Kubernetes, aimed at batch, HPC, AI/ML, and similar workloads. It lets organizations build multi-tenant batch services with resource quotas and hierarchical sharing, so that cluster capacity is divided fairly across teams. Kueue works alongside standard Kubernetes components such as kube-scheduler and cluster-autoscaler, and supports both on-premises and cloud environments where resources are dynamic and heterogeneous. The project welcomes contributions via GitHub Pull Requests.

What is Kueue? 🚀

Kueue is a Kubernetes-native job queueing system built to manage batch workloads, high-performance computing (HPC), AI/ML training, and other similar applications. It helps organizations optimize resource usage by enforcing quotas, prioritizing jobs, and ensuring fair sharing across multiple teams or tenants.


Key Features of Kueue ✨

  • Multi-Tenancy & Resource Quotas

    • Define resource quotas (CPU, memory, GPUs) for teams or departments.
    • Enforce hierarchical sharing to prevent resource starvation and ensure fairness.
    • Example: on a 100-CPU cluster, Team A gets a quota of 50 CPUs, Team B gets 30, and Team C gets 20 (quotas are set as absolute quantities rather than percentages).
  • Job Scheduling & Prioritization

    • Decides when a job should wait (queueing) and when it is admitted to start.
    • Leaves pod-to-node placement to Kubernetes' kube-scheduler, which it works alongside.
    • Supports preemption (higher-priority jobs can interrupt lower-priority ones).
  • Dynamic & Heterogeneous Resource Support

    • Works in on-premises and cloud environments.
    • Handles heterogeneous resources (e.g., different GPU types, spot instances).
    • Integrates with cluster-autoscaler to dynamically provision resources as needed.
  • Seamless Kubernetes Integration

    • Built to work with standard Kubernetes tools (e.g., kube-scheduler, cluster-autoscaler).
    • No custom scheduler is needed: Kueue extends existing Kubernetes functionality.
  • Open-Source & Community-Driven

    • Hosted on GitHub with a Pull Request-based contribution model.
    • Welcomes new contributors and users to improve the project.
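
As a rough illustration of the quota example above, the following Python sketch turns team shares into absolute quotas and emits one ClusterQueue manifest per team, all in a shared cohort so unused quota can be lent out. It assumes the kueue.x-k8s.io/v1beta1 API; the cluster size (100 CPUs, 400Gi), the team names, the cohort name "shared-pool", and the flavor name "default-flavor" are all hypothetical.

```python
import json

# Assumed API version; other Kueue releases may use a different one.
API_VERSION = "kueue.x-k8s.io/v1beta1"


def resource_flavor(name):
    """Minimal ResourceFlavor with no node labels, so it matches any node."""
    return {"apiVersion": API_VERSION, "kind": "ResourceFlavor",
            "metadata": {"name": name}}


def cluster_queue(name, cohort, flavor, cpu, memory_gi):
    """ClusterQueue holding absolute CPU/memory quotas, grouped into a cohort."""
    return {
        "apiVersion": API_VERSION,
        "kind": "ClusterQueue",
        "metadata": {"name": name},
        "spec": {
            "cohort": cohort,
            "namespaceSelector": {},  # empty selector: any namespace may use it
            "resourceGroups": [{
                "coveredResources": ["cpu", "memory"],
                "flavors": [{
                    "name": flavor,
                    "resources": [
                        {"name": "cpu", "nominalQuota": str(cpu)},
                        {"name": "memory", "nominalQuota": f"{memory_gi}Gi"},
                    ],
                }],
            }],
        },
    }


def quotas_from_shares(total_cpu, total_memory_gi, shares):
    """Convert percentage shares into absolute quotas (rounded down)."""
    manifests = [resource_flavor("default-flavor")]
    for team, pct in shares.items():
        manifests.append(cluster_queue(
            name=f"{team}-cq", cohort="shared-pool", flavor="default-flavor",
            cpu=total_cpu * pct // 100,
            memory_gi=total_memory_gi * pct // 100))
    return manifests


# Hypothetical cluster: 100 CPUs and 400Gi of memory split 50/30/20.
manifests = quotas_from_shares(100, 400, {"team-a": 50, "team-b": 30, "team-c": 20})
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

The printed JSON can be saved to a file and passed to kubectl apply -f. Field names follow the v1beta1 schema and should be checked against the Kueue release actually installed.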

How Kueue Works 🔧

  1. Job Submission

    • Users submit batch jobs (e.g., AI training, data processing) to the cluster and name the Kueue queue each job should go through.
    • Queues are typically set up per team or job type, and each queue maps to a quota.
  2. Quota Enforcement

    • Kueue checks if the job fits within the available quota for the team/tenant.
    • If quota is available, Kueue admits the job; otherwise, the job stays suspended and waits in the queue.
  3. Scheduling & Execution

    • Once a job is admitted, Kueue lets it start and kube-scheduler places its pods on suitable nodes.
    • If resources are scarce, preemption may occur (higher-priority jobs take precedence).
  4. Dynamic Scaling (Optional)

    • If integrated with cluster-autoscaler, Kueue can trigger auto-scaling to meet demand.
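
The flow above (submit, check quota, admit or wait, preempt when needed) can be sketched with a small toy model. This is not Kueue's actual algorithm or API: the Job and ToyQueue classes, the single CPU dimension, and the preemption rule (evict strictly lower-priority jobs, lowest first) are simplifying assumptions made only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cpu: int
    priority: int = 0


class ToyQueue:
    """Toy model of one tenant's queue with a fixed CPU quota (not Kueue code)."""

    def __init__(self, cpu_quota):
        self.cpu_quota = cpu_quota
        self.running = []  # admitted jobs
        self.pending = []  # jobs waiting for quota

    def used(self):
        return sum(j.cpu for j in self.running)

    def submit(self, job):
        self.pending.append(job)
        self.reconcile()

    def finish(self, name):
        self.running = [j for j in self.running if j.name != name]
        self.reconcile()

    def reconcile(self):
        # Highest priority first; sorted() is stable, so ties keep submission order.
        for job in sorted(self.pending, key=lambda j: -j.priority):
            if self._fits(job) or self._preempt_for(job):
                self.pending.remove(job)
                self.running.append(job)

    def _fits(self, job):
        return self.used() + job.cpu <= self.cpu_quota

    def _preempt_for(self, job):
        # Pick strictly lower-priority victims, lowest priority first,
        # and evict them only if that frees enough quota for the job.
        victims, freed = [], self.cpu_quota - self.used()
        for v in sorted(self.running, key=lambda j: j.priority):
            if freed >= job.cpu or v.priority >= job.priority:
                break
            victims.append(v)
            freed += v.cpu
        if freed < job.cpu:
            return False
        for v in victims:
            self.running.remove(v)
            self.pending.append(v)  # preempted jobs go back to waiting
        return True


q = ToyQueue(cpu_quota=8)
q.submit(Job("etl", cpu=4))
q.submit(Job("train", cpu=4))
q.submit(Job("report", cpu=2))                # no quota left: waits
q.submit(Job("urgent", cpu=4, priority=10))   # preempts "etl"
print([j.name for j in q.running], [j.name for j in q.pending])
```

In this run, "urgent" displaces "etl", while "report" keeps waiting because it cannot preempt jobs of equal priority. Calling q.finish("train") afterwards frees quota and admits "report".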

Use Cases 🎯

  • AI/ML Training

    • Manage GPU/TPU resources efficiently across multiple teams.
    • Prevent resource hogging by enforcing quotas.
  • High-Performance Computing (HPC)

    • Schedule large-scale simulations or data processing jobs.
    • Ensure fair sharing in shared HPC clusters.
  • Batch Processing

    • Run ETL (Extract, Transform, Load) jobs at scale.
    • Optimize resource usage for cost savings.
  • Multi-Tenant Kubernetes Clusters

    • Share a single cluster among multiple teams without conflicts.
    • Enforce resource limits to prevent noisy neighbors.

Getting Started with Kueue 🛠️

  • Installation

    • Kueue can be installed via Helm or kubectl (YAML manifests).
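    • Requires a minimum Kubernetes version that depends on the Kueue release (1.22+ is the figure cited here; recent releases require a newer version, so check the installation guide).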
  • Configuration

    • Define resource quotas (e.g., CPU, memory, GPUs).
    • Set up queues for different teams or job types.
  • Submitting Jobs

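    • Jobs are submitted as regular Kubernetes objects (e.g., a batch/v1 Job) labeled with kueue.x-k8s.io/queue-name; Kueue tracks each one through a Workload custom resource.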
    • Example: kubectl apply -f job.yaml
  • Monitoring & Logging

    • Use Kubernetes-native tools (e.g., kubectl get jobs, kubectl get workloads, Prometheus, Grafana).
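
To make the configuration and submission steps more concrete, here is a minimal Python sketch that builds a LocalQueue and a batch/v1 Job pointing at it, then prints them as JSON that kubectl can apply. It assumes the kueue.x-k8s.io/v1beta1 API and a ClusterQueue named "team-a-cq" that already exists; the namespace, queue, job, and image names are hypothetical.

```python
import json


def local_queue(name, namespace, cluster_queue):
    """Namespaced LocalQueue that forwards jobs to a ClusterQueue."""
    return {
        "apiVersion": "kueue.x-k8s.io/v1beta1",  # assumed API version
        "kind": "LocalQueue",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"clusterQueue": cluster_queue},
    }


def queued_job(name, namespace, queue, cpu="1", memory="200Mi"):
    """Plain batch/v1 Job; the queue-name label hands it to Kueue."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {"kueue.x-k8s.io/queue-name": queue},
        },
        "spec": {
            "suspend": True,  # created suspended; started once Kueue admits it
            "template": {"spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "main",
                    "image": "busybox",  # placeholder image
                    "command": ["sh", "-c", "echo hello; sleep 30"],
                    # Resource requests are what gets counted against quota.
                    "resources": {"requests": {"cpu": cpu, "memory": memory}},
                }],
            }},
        },
    }


queue = local_queue("team-a-queue", "team-a", "team-a-cq")
job = queued_job("sample-job", "team-a", "team-a-queue")
bundle = {"apiVersion": "v1", "kind": "List", "items": [queue, job]}
print(json.dumps(bundle, indent=2))
```

Redirecting the output to job.json and running kubectl apply -f job.json would create both objects, assuming the team-a namespace exists. kubectl get workloads -n team-a should then show whether the job has been admitted.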

Why Choose Kueue? 🏆

  • Built for Kubernetes – No need for external job schedulers.
  • Multi-Tenancy Support – Fair resource sharing across teams.
  • Dynamic & Cloud-Friendly – Works with autoscaling and heterogeneous resources.
  • Open-Source & Extensible – Customize and contribute to the project.
  • Cost-Effective – Optimize resource usage to reduce cloud spending.

Limitations & Considerations ⚠️

  • Learning Curve – Requires familiarity with Kubernetes concepts.
  • Not a Replacement for All Schedulers – Best suited for batch/HPC/AI workloads (not real-time apps).
  • Community-Driven – Features depend on contributions and community adoption.

Extra links

  • https://github.com/kubernetes-sigs/kueue
  • https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/
  • https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
  • https://helm.sh/docs/intro/install/