@agrawalkhushi18 agrawalkhushi18 commented Dec 16, 2025

This PR introduces optional Kueue support to the GKE TPU v6 (examples/gke-tpu-v6) and GKE TPU 7x
(examples/gke-tpu-7x) blueprints. This integration enables advanced, Kubernetes-native job queuing
and quota management, which is essential for multi-tenant AI/ML clusters and for managing resources
in environments with limited TPU availability.

Key Changes:

  1. New Configuration Templates:
  • Added kueue-configuration.yaml.tftpl for both v6 and 7x.
  • Configures a ClusterQueue specifically for google.com/tpu resources.
  • Uses a dynamic ResourceFlavor that targets nodes via the cloud.google.com/gke-tpu-accelerator label, ensuring compatibility with specific hardware generations.
  2. Blueprint Updates:
  • Modified gke-tpu-v6-advanced.yaml and gke-tpu-7x-advanced.yaml to include the kueue installation block within the workload-manager-install module.
  • Dynamic Quota Calculation: The tpu_quota for the queue is automatically calculated based on the deployment's static capacity (num_slices * static_node_count * chips_per_node). This ensures the logical queue limit matches the physical cluster size by default.
  3. Sample Workloads:
  • Added kueue-job-sample.yaml for both blueprints.
  • These samples demonstrate how to submit a distributed JobSet to the configured Kueue queue.
  • Configured with correct default topologies (4x4 for v6e, 2x2x1 for 7x) and resource limits to run out-of-the-box on standard deployments.
  4. Documentation:
  • Updated README.md for both examples with a new "Advanced Scheduling with Kueue" section, detailing setup instructions and job submission steps.
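
For reviewers unfamiliar with Kueue, the objects described under "New Configuration Templates" and "Blueprint Updates" can be sketched roughly as follows. This is an illustrative sketch, not the contents of kueue-configuration.yaml.tftpl: the object names (tpu-flavor, tpu-cluster-queue, tpu-local-queue), the default namespace, the label value tpu-v6e-slice, and the example inputs to the quota formula (1 slice, 4 nodes, 4 chips per node) are all assumptions chosen for a v6e 4x4 deployment.

```yaml
# Illustrative sketch only; names and values below are assumptions,
# not the rendered output of the PR's template.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: tpu-flavor                 # hypothetical name
spec:
  nodeLabels:
    # Ties the flavor to nodes of one TPU generation (value assumed for v6e).
    cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: tpu-cluster-queue          # hypothetical name
spec:
  namespaceSelector: {}            # accept workloads from all namespaces
  resourceGroups:
  - coveredResources: ["google.com/tpu"]
    flavors:
    - name: tpu-flavor
      resources:
      - name: "google.com/tpu"
        # tpu_quota = num_slices * static_node_count * chips_per_node
        # Assumed example inputs: 1 * 4 * 4 = 16
        nominalQuota: 16
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: tpu-local-queue            # hypothetical name
  namespace: default               # assumed namespace
spec:
  clusterQueue: tpu-cluster-queue
```

With the quota set to the product of the three capacity values, Kueue should admit workloads only up to the number of chips physically present, which is the "logical limit matches physical size" property described above.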

Testing:

  1. Deployment: Validated successful deployment of both TPU v6e and TPU 7x clusters using the updated
    advanced blueprints.
  2. Verification: Confirmed that ClusterQueue and LocalQueue resources were correctly created and
    healthy.
  3. Job Submission: Submitted the sample kueue-job-sample.yaml on both clusters. Verified that jobs
    were successfully:
    • Admitted by Kueue (Admitted: True in workload status).
    • Scheduled onto TPU nodes.
    • Completed successfully (verified via Pod logs and status).
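
The job submission step relies on a JobSet that names the target queue through a label. A minimal sketch of such a JobSet for a v6e 4x4 slice is shown here; it is not the kueue-job-sample.yaml from this PR. The JobSet name, queue name, container image, command, and the per-node chip count of 4 are assumptions for illustration.

```yaml
# Illustrative sketch only; not the sample file added by this PR.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: tpu-kueue-sample                         # hypothetical name
  labels:
    # Kueue uses this label to route the JobSet to a LocalQueue.
    kueue.x-k8s.io/queue-name: tpu-local-queue   # hypothetical queue name
spec:
  replicatedJobs:
  - name: workers
    replicas: 1                                  # one slice
    template:
      spec:
        parallelism: 4                           # one Pod per node (assumed 4 nodes)
        completions: 4
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice   # assumed value
              cloud.google.com/gke-tpu-topology: 4x4
            containers:
            - name: worker
              image: python:3.11-slim            # placeholder image
              command: ["python", "-c", "print('hello from a TPU worker')"]
              resources:
                limits:
                  google.com/tpu: 4              # assumed chips per node
```

Under these assumptions the JobSet requests 4 Pods x 4 chips = 16 google.com/tpu in total, so it would fit a queue whose quota was sized for a single 4x4 slice.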

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@agrawalkhushi18 agrawalkhushi18 added the release-key-new-features Added to release notes under the "Key New Features" heading. label Dec 16, 2025
@gemini-code-assist

Summary of Changes

Hello @agrawalkhushi18, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates Kueue, a Kubernetes-native job queuing and quota management system, into the GKE TPU v6 and TPU 7x blueprints. This enhancement provides advanced scheduling capabilities that are crucial for multi-tenant AI/ML environments and for making efficient use of limited TPU capacity. The changes include new configuration templates, updates to existing blueprints to enable Kueue installation, and the addition of sample workloads to demonstrate its usage.

Highlights

  • New Configuration Templates: Added kueue-configuration.yaml.tftpl for both GKE TPU v6 and TPU 7x blueprints. These templates configure a ClusterQueue for google.com/tpu resources and use a dynamic ResourceFlavor targeting nodes via the cloud.google.com/gke-tpu-accelerator label for hardware compatibility.
  • Blueprint Updates: Modified gke-tpu-v6-advanced.yaml and gke-tpu-7x-advanced.yaml to include the Kueue installation block within the workload-manager-install module. The tpu_quota for the queue is now dynamically calculated based on the deployment's static capacity, ensuring the logical queue limit matches the physical cluster size.
  • Sample Workloads: Introduced kueue-job-sample.yaml for both blueprints. These samples demonstrate how to submit a distributed JobSet to the configured Kueue queue, with correct default topologies (4x4 for v6e, 2x2x1 for 7x) and resource limits for out-of-the-box execution.
  • Documentation Updates: Updated the README.md files for both examples with a new 'Advanced Scheduling with Kueue' section, providing detailed setup instructions and job submission steps.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces valuable Kueue support for the GKE TPU v6 and TPU 7x blueprints, enhancing job queuing and resource management for multi-tenant AI/ML clusters. The changes are well-implemented, including new configuration templates, blueprint modifications for Kueue integration, and sample workloads. The documentation has also been updated accordingly. My review includes a couple of minor suggestions to improve documentation accuracy and code comment clarity. Overall, this is a strong contribution to the project.

@agrawalkhushi18 agrawalkhushi18 marked this pull request as ready for review December 16, 2025 19:08
@agrawalkhushi18 agrawalkhushi18 requested review from a team and samskillman as code owners December 16, 2025 19:08
bytetwin previously approved these changes Dec 18, 2025

@SwarnaBharathiMantena SwarnaBharathiMantena left a comment

LGTM

@agrawalkhushi18 agrawalkhushi18 merged commit 81907ae into GoogleCloudPlatform:develop Dec 18, 2025
12 of 72 checks passed