111 changes: 111 additions & 0 deletions Ironwood/guides/automation/autoscaling/README.md
@@ -0,0 +1,111 @@
# Ironwood Benchmark Automation with CCC for Node Pool Creation

This directory contains the automation framework for running TPU microbenchmarks (HBM, Host-Device, Collectives, etc.) on GKE clusters with autoscaling enabled through CCC. The tool simplifies the workflow of launching multiple benchmark jobs via [Kueue](https://kueue.sigs.k8s.io/), monitoring their status, handling retries, and aggregating the final results into a unified format.

The autoscaling version of the script uses Custom Compute Classes (CCC) to automatically create and delete the required node pools based on the submitted workloads.
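For context, each CCC template is a GKE `ComputeClass` manifest. The sketch below is purely illustrative (the real templates are installed by `create_ccc_templates.sh`; the metadata name, accelerator type, and chip count here are assumptions):

```yaml
# Illustrative sketch only; the actual templates come from create_ccc_templates.sh.
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: tpu7x-2x2x1          # assumed naming scheme, one class per topology
spec:
  priorities:
    - tpu:
        type: tpu7x          # assumed accelerator type name
        count: 4             # assumed chips per node for this topology
        topology: 2x2x1
  nodePoolAutoCreation:
    enabled: true            # lets GKE create and delete node pools on demand
```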

## Overview

The automation workflow consists of three main stages:
1. **Launch**: Submits Kubernetes Jobs for various benchmark configurations (e.g., different topologies like 2x2x1, 2x2x2) using Kueue for queue management.
2. **Monitor & Retry**: Watches the jobs until completion. If any jobs fail, they are automatically retried (up to 3 times by default).
3. **Aggregate**: Once all jobs succeed, an aggregator job is launched to collect all intermediate results from GCS and consolidate them into summary TSV files.

## Prerequisites

Before running the automation script, ensure the following requirements are met:

### 1. Environment Setup
* **GKE Cluster**: You must have a GKE cluster (an example creation command follows this list).
* **Kubectl**: Ensure `kubectl` is installed and authenticated to your cluster.
* **GCS Bucket**: A Google Cloud Storage bucket is required to store intermediate and final aggregated results.
```bash
gcloud storage buckets create gs://my-unique-bucket-name --location=us-central1
```
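
If you do not have a cluster yet, a minimal creation command looks like this (the cluster name and location are placeholders; your project may need additional flags for TPU capacity):
```bash
# Placeholder values; substitute your own cluster name and location.
gcloud container clusters create my-benchmark-cluster \
    --location=us-central1 \
    --release-channel=rapid
```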

### 2. Install Kueue
The automation relies on Kueue for job queuing. Check if it's already installed:

```bash
kubectl get namespace kueue-system
```

If you see `Error from server (NotFound)`, install it with:

```bash
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.16.0/manifests.yaml
```
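
To confirm the controller came up, check the pods in the Kueue namespace (pod names vary by release):

```bash
kubectl get pods -n kueue-system
```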

### 3. Verify requirements for CCC
For CCC to work, the correct set of CCC templates must be created. If you have not already done so, allowing the pre-flight checks to run when the script prompts for them will install all the required CCC templates (one per TPU topology: 2x2x1, 2x2x2, etc.).
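
You can also create the templates ahead of time by running the creation script directly (see the directory structure below; fill in its required variables first):

```bash
bash Ironwood/guides/automation/autoscaling/create_ccc_templates.sh
```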

## Directory Structure

* `automation_launch.sh`: The main entry point script. Manages the full lifecycle of the benchmark run.
* `check_ccc_resources.sh`: Validation script that makes sure all CCC-related resources exist.
* `create_ccc_templates.sh`: Creates the required CCC-related resources.
* `../aggregator.py`: Python script that downloads results from GCS and produces summary tables.
* `../aggregator.yaml`: Kubernetes Job definition for running the aggregator.
* `job-queue-CCC.yaml`: Kueue resource definitions (ClusterQueue, LocalQueue).
* `*.yaml`: Benchmark job configurations (e.g., `tpu7x-2x2x1-hbm.yaml`).

## Configuration

You can configure the behavior using the following environment variable:

| Variable | Description | Required | Default |
| :--- | :--- | :--- | :--- |
| `GCS_BUCKET_ROOT_DIR` | The root GCS path where results will be stored. Must start with `gs://`. | **Yes** | `gs://example-microbenchmark` (Change this!) |

## Usage Guide

1. **Clone the Repository**:
```bash
git clone https://github.com/google/accelerator-microbenchmarks.git
cd accelerator-microbenchmarks
# Switch to the correct branch if necessary
git checkout tpu7x-auto
```

2. **Set the GCS Bucket**:
Export the path to your GCS bucket. This is where all results will be saved.
```bash
export GCS_BUCKET_ROOT_DIR="gs://your-unique-bucket-name/benchmark_runs/$(date +%Y%m%d_%H%M%S)"
```

3. **Run the Automation Script**:
Execute the launch script from the root of the repository.
```bash
bash Ironwood/guides/automation/autoscaling/automation_launch.sh
```

**What happens next?**
* If pre-flight checks are enabled, the script validates the CCC resources (creating them if needed) and checks GCS permissions.
* It applies the Kueue job queue.
* It submits the benchmark jobs defined in the script (e.g., HBM tests).
* It waits for jobs to finish, retrying any failures up to 3 times.
* Finally, it launches the `aggregator` job.
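
While the script runs, you can follow job status yourself from a second terminal:

```bash
kubectl get jobs --watch
```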

## Output

After the automation completes, check your GCS bucket (`GCS_BUCKET_ROOT_DIR`). You will find:

* **`aggregated_results/`**: Contains the final summary TSV files (e.g., `hbm.tsv`, `collectives.tsv`).
* **`<job-name>/`**: Directories for each individual job containing intermediate results.
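
You can list the output directly, for example:

```bash
gcloud storage ls "${GCS_BUCKET_ROOT_DIR}/aggregated_results/"
```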

## Troubleshooting

### Job Failures
If jobs fail even after retries:
1. Check the script output to see which specific jobs failed.
2. Inspect the logs of a failed job using `kubectl logs job/<job-name>`.
3. Manually retry a specific job if needed using the command printed by the script at the end of the run.
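
If a job's pods never started (for example, CCC could not provision the requested topology), `kubectl logs` may return nothing; describing the job and its pods usually surfaces the scheduling reason:

```bash
kubectl describe job/<job-name>
# Jobs label their pods with job-name, so this finds the pods of a specific job:
kubectl get pods -l job-name=<job-name>
```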

### Missing Results
If the `aggregated_results` folder is empty:
1. Check the logs of the aggregator job:
```bash
kubectl logs job/aggregator
```
2. Ensure `GCS_BUCKET_ROOT_DIR` is accessible by the pods (check Workload Identity or service account permissions if running in a restricted project).
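
One way to sanity-check bucket access is to inspect the bucket's IAM policy and confirm the configured service account holds a writer role:

```bash
gcloud storage buckets get-iam-policy gs://your-unique-bucket-name
```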
267 changes: 267 additions & 0 deletions Ironwood/guides/automation/autoscaling/automation_launch.sh
@@ -0,0 +1,267 @@
#!/usr/bin/env bash

######################################################################
# automation_launch.sh: Run a series of TPU microbenchmark jobs
######################################################################
# This script automates the process of launching multiple TPU microbenchmark
# jobs defined in various YAML files. It handles:
# - Pre-flight checks for necessary CCC resources and GCS permissions.
# - Applying job YAMLs to a Kubernetes cluster.
# - Waiting for jobs to complete, with a timeout.
# - Retrying failed jobs up to a configurable number of times.
# - Aggregating results using a separate aggregator job.
# - Reporting on any jobs that ultimately failed.
#
# User-configurable variables are at the top of the script.
######################################################################

######################################################################
# USER INPUT
######################################################################
TIMESTAMP=$(date +%Y-%m-%d_%H-%M-%S)
export GCS_BUCKET_ROOT_DIR="${GCS_BUCKET_ROOT_DIR:-gs://example-microbenchmark}" # Change this, or export it before running!
export GCS_SA_NAME="gcs-writer" # Service account with write access to GCS_BUCKET_ROOT_DIR
export PROJECT_ID=$(gcloud config get-value project 2>/dev/null)
MAX_RETRIES=3       # Rounds of retries for failed jobs
TIMEOUT_SECOND=3600 # Per-batch timeout, in seconds

yaml_names=(
"tpu7x-2x2x1-hbm.yaml"
"tpu7x-2x4x4-collectives.yaml"
"tpu7x-2x2x1-gemm_all_reduce.yaml"
)

################################################################################
# COLOR OUTPUT
################################################################################

RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

function print_success() {
echo -e "${GREEN}✅ $1${NC}"
}

function print_error() {
echo -e "${RED}❌ $1${NC}"
}

function print_info() {
echo -e "${BLUE}ℹ️ $1${NC}"
}

function print_warning() {
echo -e "${YELLOW}⚠️ $1${NC}"
}

######################################################################
# VALIDATION & SETUP
######################################################################

if [[ -z "${GCS_BUCKET_ROOT_DIR}" || "${GCS_BUCKET_ROOT_DIR}" != "gs://"* ]]; then
print_error "GCS_BUCKET_ROOT_DIR must be set and start with gs://"
exit 1
fi

print_info "The intermediate result will be written to ${GCS_BUCKET_ROOT_DIR}"

read -p "Run pre-flight checks (CCC resource validation & GCS permissions)? (y/n): " run_checks

if [[ "$run_checks" == "y" ]]; then
print_info "Running CCC resource validation..."
required_topologies=($(printf "%s\n" "${yaml_names[@]}" | grep -oE '[0-9]+x[0-9]+x[0-9]+' | sort -u))
SCRIPT_DIR="$(dirname "$(realpath "$0")")"
if ! bash "${SCRIPT_DIR}/check_ccc_resources.sh"; then
print_error "Some required CCC resources are missing. Please run create_ccc_templates.sh first. Make sure to fill the requierd variables."
exit 1
fi

print_info "Running GCS permission check..."
export SA_NAME="${GCS_SA_NAME}" # check_gcs_permissions.sh expects SA_NAME; PROJECT_ID is already exported above
if ! bash "${SCRIPT_DIR}/../check_gcs_permissions.sh"; then
print_error "GCS Permission Check Failed. Exiting."
exit 1
fi
else
print_warning "Skipping pre-flight checks."
fi

kubectl apply -f "${SCRIPT_DIR}/job-queue-CCC.yaml"

######################################################################
# LAUNCH JOBS & WAIT FOR COMPLETION
######################################################################


# Function to wait for a job to complete or fail
wait_for_job_completion() {
local job_name="$1"
local timeout="$2"
local start_time=$(date +%s)
local end_time=$((start_time + timeout))

while true; do
current_time=$(date +%s)
if [[ $current_time -gt $end_time ]]; then
print_error "Timeout waiting for job ${job_name}"
return 2
fi

# Check for Complete condition
if kubectl get job "${job_name}" -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' 2>/dev/null | grep -q "True"; then
print_success "Job ${job_name} completed successfully!"
return 0
fi

# Check for Failed condition
if kubectl get job "${job_name}" -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' 2>/dev/null | grep -q "True"; then
print_error "Job ${job_name} FAILED!"
return 1
fi

sleep 5
done
}

# Function to apply jobs and wait for them to complete
# Returns a list of failed yaml files in the variable FAILED_JOBS
apply_and_wait() {
local yaml_files=("$@")
local job_names_in_batch=()
FAILED_JOBS=()

print_info "Processing batch of ${#yaml_files[@]} jobs..."

# Launch all jobs
for yaml_file in "${yaml_files[@]}"; do
local filepath="${SCRIPT_DIR}/${yaml_file}"
# Derive job name: remove .yaml, lowercase, replace _ with -
local job_name=$(basename "${yaml_file}" .yaml | tr '[:upper:]' '[:lower:]' | tr '_' '-')
local random_suffix
random_suffix=$(tr -dc 'a-z0-9' < /dev/urandom | head -c 5)
export JOB_NAME="${job_name}-${random_suffix}"
export GCS_PATH="${GCS_BUCKET_ROOT_DIR}/${job_name}"

print_info "Launching job: ${filepath} (name: ${JOB_NAME})"
envsubst '${JOB_NAME} ${GCS_PATH} ${GCS_SA_NAME}' < "${filepath}" | kubectl apply -f -
job_names_in_batch+=("${JOB_NAME}")
done

# Monitor jobs
local start_time=$(date +%s)
local end_time=$((start_time + TIMEOUT_SECOND))
local last_print_time=0

while true; do
local current_time=$(date +%s)
if [[ $current_time -gt $end_time ]]; then
print_error "Timeout waiting for batch completion"
break
fi

# Identify active jobs
local active_jobs=()
for job_name in "${job_names_in_batch[@]}"; do
# Check for Complete
if kubectl get job "${job_name}" -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' 2>/dev/null | grep -q "True"; then
continue
fi

# Check for Failed
if kubectl get job "${job_name}" -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' 2>/dev/null | grep -q "True"; then
continue
fi

# If neither, it's pending/running
active_jobs+=("${job_name}")
done

if [[ ${#active_jobs[@]} -eq 0 ]]; then
break
fi

# Dashboard View - Print every 60 seconds
if [[ $((current_time - last_print_time)) -ge 60 ]]; then
print_info "======================================================================"
date "+%Y-%m-%d %H:%M:%S"
print_info "----------------------------------------------------------------------"
kubectl get jobs "${active_jobs[@]}"
print_info "======================================================================"
last_print_time=$current_time
fi

sleep 10
done

# Collect results and cleanup
FAILED_JOBS=()
for i in "${!yaml_files[@]}"; do
local yaml_file="${yaml_files[$i]}"
local job_name="${job_names_in_batch[$i]}"
local filepath="${SCRIPT_DIR}/${yaml_file}"

# Check if failed or still running (timeout)
if ! kubectl get job "${job_name}" -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' 2>/dev/null | grep -q "True"; then
FAILED_JOBS+=("${yaml_files[$i]}")
fi

export JOB_NAME="${job_name}"
export GCS_PATH="${GCS_BUCKET_ROOT_DIR}/${job_name}"
envsubst '${JOB_NAME} ${GCS_PATH} ${GCS_SA_NAME}' < "${filepath}" | kubectl delete -f - &> /dev/null
done
}

# Retry loop
current_batch=("${yaml_names[@]}")

for (( retry=1; retry<=MAX_RETRIES; retry++ )); do
apply_and_wait "${current_batch[@]}"

if [[ ${#FAILED_JOBS[@]} -eq 0 ]]; then
print_success "All jobs completed successfully in Round ${retry}!"
break
fi

print_error "Round ${retry} finished. ${#FAILED_JOBS[@]} jobs failed."
current_batch=("${FAILED_JOBS[@]}")

if [[ ${retry} -lt ${MAX_RETRIES} ]]; then
print_info "Retrying failed jobs..."
print_info "========================================"
print_info "$((retry + 1)) / ${MAX_RETRIES} max retries"
print_info "========================================"
else
print_error "Max retries reached."
fi
done

echo ""
print_info "Jobs completed. Aggregating results..."
echo ""

# Ensure cleanup of any previous aggregator job to avoid immutable field errors
kubectl delete job aggregator --ignore-not-found=true

envsubst '${GCS_BUCKET_ROOT_DIR} ${GCS_SA_NAME}' < "${SCRIPT_DIR}/../aggregator.yaml" | kubectl apply -f -
wait_for_job_completion "aggregator" "${TIMEOUT_SECOND}"
envsubst '${GCS_BUCKET_ROOT_DIR} ${GCS_SA_NAME}' < "${SCRIPT_DIR}/../aggregator.yaml" | kubectl delete -f -

# Print the failed jobs at the end for better visibility.

if [[ ${#FAILED_JOBS[@]} -gt 0 ]]; then
print_error "The following jobs finally failed after ${MAX_RETRIES} rounds:"
printf '%s\n' "${FAILED_JOBS[@]}"

echo -e "\nTo retry manually, run:"
for yaml_file in "${FAILED_JOBS[@]}"; do
job_name=$(basename "${yaml_file}" .yaml | tr '[:upper:]' '[:lower:]' | tr '_' '-')
GCS_PATH="${GCS_BUCKET_ROOT_DIR}/${job_name}"
echo "JOB_NAME=\"${job_name}\" GCS_PATH=\"${GCS_PATH}\" envsubst '\${JOB_NAME} \${GCS_PATH}' < \"${SCRIPT_DIR}/${yaml_file}\" | kubectl apply -f -"
done
else
print_success "Success! All jobs finished."
fi