diff --git a/Ironwood/guides/automation/autoscaling/README.md b/Ironwood/guides/automation/autoscaling/README.md new file mode 100644 index 0000000..e8bbde5 --- /dev/null +++ b/Ironwood/guides/automation/autoscaling/README.md @@ -0,0 +1,111 @@ +# Ironwood Benchmark Automation with CCC for Nodepool Creation + +This directory contains the automation framework for running TPU microbenchmarks (HBM, Host-Device, Collectives, etc.) on GKE clusters with autoscaling enabled through CCC. The tool simplifies the workflow of launching multiple benchmark jobs via [Kueue](https://kueue.sigs.k8s.io/), monitoring their status, handling retries, and aggregating the final results into a unified format. + +The autoscaling version of the script uses CustomComputeClass (CCC) to manage the creation and deletion of the required nodepools automatically, based on the submitted workloads. + +## Overview + +The automation workflow consists of three main stages: +1. **Launch**: Submits Kubernetes Jobs for various benchmark configurations (e.g., different topologies like 2x2x1, 2x2x2) using Kueue for queue management. +2. **Monitor & Retry**: Watches the jobs until completion. Failed jobs are automatically retried (up to 3 times by default). +3. **Aggregate**: Once all jobs succeed, an aggregator job is launched to collect all intermediate results from GCS and consolidate them into summary TSV files. + +## Prerequisites + +Before running the automation script, ensure the following requirements are met: + +### 1. Environment Setup +* **GKE Cluster**: You must have a GKE cluster. +* **Kubectl**: Ensure `kubectl` is installed and authenticated to your cluster. +* **GCS Bucket**: A Google Cloud Storage bucket is required to store intermediate and final aggregated results. + ```bash + gcloud storage buckets create gs://my-unique-bucket-name --location=us-central1 + ``` + +### 2. Install Kueue +The automation relies on Kueue for job queuing. 
Check if it's already installed:

```bash
kubectl get namespace kueue-system
```

If you see `Error from server (NotFound)`, install it with:

```bash
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.16.0/manifests.yaml
```

### 3. Verify requirements for CCC
For CCC to work, the correct set of CCC templates must exist in the cluster. If you have not created them already, allow the pre-flight checks to run when the script prompts for them; this installs all the required CCC templates (one per TPU topology: 2x2x1, 2x2x2, etc.).

## Directory Structure

* `automation_launch.sh`: The main entry point script. Manages the full lifecycle of the benchmark run.
* `check_ccc_resources.sh`: Validation script that verifies all CCC-related resources exist.
* `create_ccc_templates.sh`: Creates the required CCC-related resources.
* `../aggregator.py`: Python script that downloads results from GCS and produces summary tables.
* `../aggregator.yaml`: Kubernetes Job definition for running the aggregator.
* `job-queue-CCC.yaml`: Kueue resource definitions (ClusterQueue, LocalQueue).
* `*.yaml`: Benchmark job configurations (e.g., `tpu7x-2x2x1-hbm.yaml`).

## Configuration

You can configure the behavior using the following environment variable:

| Variable | Description | Required | Default |
| :--- | :--- | :--- | :--- |
| `GCS_BUCKET_ROOT_DIR` | The root GCS path where results will be stored. Must start with `gs://`. | **Yes** | `gs://example-microbenchmark` (Change this!) |

## Usage Guide

1. **Clone the Repository**:
   ```bash
   git clone https://github.com/AI-Hypercomputer/accelerator-microbenchmarks.git
   cd accelerator-microbenchmarks
   # Switch to the correct branch if necessary
   git checkout tpu7x-auto
   ```

2. **Set the GCS Bucket**:
   Export the path to your GCS bucket. This is where all results will be saved. 
   ```bash
   export GCS_BUCKET_ROOT_DIR="gs://your-unique-bucket-name/benchmark_runs/$(date +%Y%m%d_%H%M%S)"
   ```

3. **Run the Automation Script**:
   Execute the launch script from the root of the repository.
   ```bash
   bash Ironwood/guides/automation/autoscaling/automation_launch.sh
   ```

   **What happens next?**
   * If pre-flight checks are enabled, the script validates the CCC resources (creating any that are missing) and checks GCS permissions.
   * It applies the Kueue job queue.
   * It submits the benchmark jobs defined in the script (e.g., HBM tests).
   * It waits for jobs to finish, retrying any failures up to 3 times.
   * Finally, it launches the `aggregator` job.

## Output

After the automation completes, check your GCS bucket (`GCS_BUCKET_ROOT_DIR`). You will find:

* **`aggregated_results/`**: Contains the final summary CSV/TSV files (e.g., `hbm.tsv`, `collectives.tsv`).
* **`<job-name>/`**: One directory per individual job, containing intermediate results.

## Troubleshooting

### Job Failures
If jobs fail even after retries:
1. Check the script output to see which specific jobs failed.
2. Inspect the logs of a failed job using `kubectl logs job/<job-name>`.
3. Manually retry a specific job if needed using the command printed by the script at the end of the run.

### Missing Results
If the `aggregated_results` folder is empty:
1. Check the logs of the aggregator job:
   ```bash
   kubectl logs job/aggregator
   ```
2. Ensure `GCS_BUCKET_ROOT_DIR` is accessible to the pods (check Workload Identity or service account permissions if running in a restricted project). 
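### Reconstructing a Job Name

The manual retry command printed by the script uses a job name derived from the YAML file name. If you need to reconstruct one yourself, the sketch below mirrors the derivation logic in `automation_launch.sh` (the file name here is just an example):

```shell
# Derive the Kubernetes job name from a benchmark YAML file name,
# as automation_launch.sh does: strip .yaml, lowercase, map '_' to '-'.
yaml_file="tpu7x-2x2x1-gemm_all_reduce.yaml"
job_name=$(basename "${yaml_file}" .yaml | tr '[:upper:]' '[:lower:]' | tr '_' '-')

# The launcher also appends a 5-character random suffix so reruns get unique names.
random_suffix=$(head /dev/urandom | tr -dc a-z0-9 | head -c 5)
echo "${job_name}-${random_suffix}"
```

The per-job GCS path is then `${GCS_BUCKET_ROOT_DIR}/${job_name}` (without the random suffix).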
diff --git a/Ironwood/guides/automation/autoscaling/automation_launch.sh b/Ironwood/guides/automation/autoscaling/automation_launch.sh new file mode 100755 index 0000000..2823a4e --- /dev/null +++ b/Ironwood/guides/automation/autoscaling/automation_launch.sh @@ -0,0 +1,267 @@ +#!/usr/bin/env bash + +###################################################################### +# automation_launch.sh: Run a series of TPU microbenchmark jobs +###################################################################### +# This script automates the process of launching multiple TPU microbenchmark +# jobs defined in various YAML files. It handles: +# - Pre-flight checks for necessary CCC resources and GCS permissions. +# - Applying job YAMLs to a Kubernetes cluster. +# - Waiting for jobs to complete, with a timeout. +# - Retrying failed jobs up to a configurable number of times. +# - Aggregating results using a separate aggregator job. +# - Reporting on any jobs that ultimately failed. +# +# User-configurable variables are at the top of the script. 
+######################################################################

######################################################################
# USER INPUT
######################################################################
TIMESTAMP=$(date +%Y-%m-%d_%H-%M-%S)
export GCS_BUCKET_ROOT_DIR="gs://pulasthi-ccc-testb1/test5" # CHANGE ME: set to your own GCS bucket/path
export GCS_SA_NAME="gcs-writer" # Service account with write access to GCS_BUCKET_ROOT_DIR
export PROJECT_ID=$(gcloud config get-value project 2>/dev/null)
MAX_RETRIES=3
TIMEOUT_SECOND=3600

yaml_names=(
  "tpu7x-2x2x1-hbm.yaml"
  "tpu7x-2x4x4-collectives.yaml"
  "tpu7x-2x2x1-gemm_all_reduce.yaml"
)

################################################################################
# COLOR OUTPUT
################################################################################

RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

function print_success() {
  echo -e "${GREEN}✅ $1${NC}"
}

function print_error() {
  echo -e "${RED}❌ $1${NC}"
}

function print_info() {
  echo -e "${BLUE}ℹ️ $1${NC}"
}

function print_warning() {
  echo -e "${YELLOW}⚠️ $1${NC}"
}

######################################################################
# VALIDATION & SETUP
######################################################################

if [[ -z "${GCS_BUCKET_ROOT_DIR}" || "${GCS_BUCKET_ROOT_DIR}" != "gs://"* ]]; then
  print_error "GCS_BUCKET_ROOT_DIR must be set and start with gs://"
  exit 1
fi

print_info "Intermediate results will be written to ${GCS_BUCKET_ROOT_DIR}"

read -p "Run pre-flight checks (CCC resource validation & GCS permissions)? (y/n): " run_checks

if [[ "$run_checks" == "y" ]]; then
  print_info "Running CCC resource validation..."
  required_topologies=($(printf "%s\n" "${yaml_names[@]}" | grep -oE '[0-9]+x[0-9]+x[0-9]+' | sort -u))
  SCRIPT_DIR="$(dirname "$(realpath "$0")")"
  if ! 
bash "${SCRIPT_DIR}/check_ccc_resources.sh"; then
    print_error "Some required CCC resources are missing. Please run create_ccc_templates.sh first. Make sure to fill in the required variables."
    exit 1
  fi

  print_info "Running GCS permission check..."
  export SA_NAME="${GCS_SA_NAME}"
  export PROJECT_ID="${PROJECT_ID}"
  if ! bash "${SCRIPT_DIR}/../check_gcs_permissions.sh"; then
    print_error "GCS Permission Check Failed. Exiting."
    exit 1
  fi
else
  print_warning "Skipping pre-flight checks."
fi

SCRIPT_DIR="$(dirname "$(realpath "$0")")"
kubectl apply -f "${SCRIPT_DIR}/job-queue-CCC.yaml"

######################################################################
# LAUNCH JOBS & WAIT FOR COMPLETION
######################################################################


# Function to wait for a job to complete or fail
wait_for_job_completion() {
  local job_name="$1"
  local timeout="$2"
  local start_time=$(date +%s)
  local end_time=$((start_time + timeout))

  while true; do
    current_time=$(date +%s)
    if [[ $current_time -gt $end_time ]]; then
      print_error "Timeout waiting for job ${job_name}"
      return 2
    fi

    # Check for Complete condition
    if kubectl get job "${job_name}" -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' 2>/dev/null | grep -q "True"; then
      print_success "Job ${job_name} completed successfully!"
      return 0
    fi

    # Check for Failed condition
    if kubectl get job "${job_name}" -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' 2>/dev/null | grep -q "True"; then
      print_error "Job ${job_name} FAILED!"
      return 1
    fi

    sleep 5
  done
}

# Function to apply jobs and wait for them to complete
# Returns a list of failed yaml files in the variable FAILED_JOBS
apply_and_wait() {
  local yaml_files=("$@")
  local job_names_in_batch=()
  FAILED_JOBS=()

  print_info "Processing batch of ${#yaml_files[@]} jobs..." 
+ + # Launch all jobs + for yaml_file in "${yaml_files[@]}"; do + local filepath="${SCRIPT_DIR}/${yaml_file}" + # Derive job name: remove .yaml, lowercase, replace _ with - + local job_name=$(basename "${yaml_file}" .yaml | tr '[:upper:]' '[:lower:]' | tr '_' '-') + random_suffix=$(head /dev/urandom | tr -dc a-z0-9 | head -c 5) + export JOB_NAME="${job_name}-${random_suffix}" + export GCS_PATH="${GCS_BUCKET_ROOT_DIR}/${job_name}" + + print_info "Launching job: ${filepath} (name: ${JOB_NAME})" + envsubst '${JOB_NAME} ${GCS_PATH} ${GCS_SA_NAME}' < "${filepath}" | kubectl apply -f - + job_names_in_batch+=("${JOB_NAME}") + done + + # Monitor jobs + local start_time=$(date +%s) + local end_time=$((start_time + TIMEOUT_SECOND)) + local last_print_time=0 + + while true; do + local current_time=$(date +%s) + if [[ $current_time -gt $end_time ]]; then + print_error "Timeout waiting for batch completion" + break + fi + + # Identify active jobs + local active_jobs=() + for job_name in "${job_names_in_batch[@]}"; do + # Check for Complete + if kubectl get job "${job_name}" -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' 2>/dev/null | grep -q "True"; then + continue + fi + + # Check for Failed + if kubectl get job "${job_name}" -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' 2>/dev/null | grep -q "True"; then + continue + fi + + # If neither, it's pending/running + active_jobs+=("${job_name}") + done + + if [[ ${#active_jobs[@]} -eq 0 ]]; then + break + fi + + # Dashboard View - Print every 60 seconds + if [[ $((current_time - last_print_time)) -ge 60 ]]; then + print_info "======================================================================" + date "+%Y-%m-%d %H:%M:%S" + print_info "----------------------------------------------------------------------" + kubectl get jobs "${active_jobs[@]}" + print_info "======================================================================" + last_print_time=$current_time + fi + + sleep 10 + done + + # 
Collect results and cleanup + FAILED_JOBS=() + for i in "${!yaml_files[@]}"; do + local yaml_file="${yaml_files[$i]}" + local job_name="${job_names_in_batch[$i]}" + local filepath="${SCRIPT_DIR}/${yaml_file}" + + # Check if failed or still running (timeout) + if ! kubectl get job "${job_name}" -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' 2>/dev/null | grep -q "True"; then + FAILED_JOBS+=("${yaml_files[$i]}") + fi + + export JOB_NAME="${job_name}" + export GCS_PATH="${GCS_BUCKET_ROOT_DIR}/${job_name}" + envsubst '${JOB_NAME} ${GCS_PATH}' < "${filepath}" | kubectl delete -f - &> /dev/null + done +} + +# Retry loop +current_batch=("${yaml_names[@]}") + +for (( retry=1; retry<=MAX_RETRIES; retry++ )); do + apply_and_wait "${current_batch[@]}" + + if [[ ${#FAILED_JOBS[@]} -eq 0 ]]; then + print_success "All jobs completed successfully in Round ${retry}!" + break + fi + + print_error "Round ${retry} finished. ${#FAILED_JOBS[@]} jobs failed." + current_batch=("${FAILED_JOBS[@]}") + + if [[ ${retry} -lt ${MAX_RETRIES} ]]; then + print_info "Retrying failed jobs..." + print_info "========================================" + print_info "$((retry + 1)) / ${MAX_RETRIES} max retries" + print_info "========================================" + else + print_error "Max retries reached." + fi +done + +echo "" +print_info "Jobs completed. Aggregating results..." +echo "" + +# Ensure cleanup of any previous aggregator job to avoid immutable field errors +kubectl delete job aggregator --ignore-not-found=true + +envsubst '${GCS_BUCKET_ROOT_DIR} ${GCS_SA_NAME}' < ${SCRIPT_DIR}/../aggregator.yaml | kubectl apply -f - +wait_for_job_completion "aggregator" ${TIMEOUT_SECOND} +envsubst '${GCS_BUCKET_ROOT_DIR} ${GCS_SA_NAME}' < ${SCRIPT_DIR}/../aggregator.yaml | kubectl delete -f - + +# Print the failed jobs at the end for better visibility. 
+
if [[ ${#FAILED_JOBS[@]} -gt 0 ]]; then
  print_error "The following jobs failed after ${MAX_RETRIES} rounds:"
  printf '%s\n' "${FAILED_JOBS[@]}"

  echo -e "\nTo retry manually, run:"
  for yaml_file in "${FAILED_JOBS[@]}"; do
    job_name=$(basename "${yaml_file}" .yaml | tr '[:upper:]' '[:lower:]' | tr '_' '-')
    GCS_PATH="${GCS_BUCKET_ROOT_DIR}/${job_name}"
    echo "JOB_NAME=\"${job_name}\" GCS_PATH=\"${GCS_PATH}\" GCS_SA_NAME=\"${GCS_SA_NAME}\" envsubst '\${JOB_NAME} \${GCS_PATH} \${GCS_SA_NAME}' < \"${SCRIPT_DIR}/${yaml_file}\" | kubectl apply -f -"
  done
else
  print_success "Success! All jobs finished."
fi diff --git a/Ironwood/guides/automation/autoscaling/check_ccc_resources.sh b/Ironwood/guides/automation/autoscaling/check_ccc_resources.sh new file mode 100644 index 0000000..cb085d6 --- /dev/null +++ b/Ironwood/guides/automation/autoscaling/check_ccc_resources.sh @@ -0,0 +1,82 @@ +#!/bin/bash + +###################################################################### +# check_ccc_resources.sh: Validate existence of CCC resources +###################################################################### +# This script checks if the required Google Cloud Compute resource policies +# and Kubernetes Custom Compute Class (CCC) manifests exist for a given +# list of TPU topologies. +# +# It iterates through the provided TOPOLOGIES array: +# - For multi-host topologies, it verifies the presence of the +# expected workload policy using gcloud. +# - It checks for the existence of the Custom Compute Class resource +# in the Kubernetes cluster using kubectl. +# +# The script exits with status 1 if any required resource is missing, +# and status 0 if all resources are found. 
+###################################################################### + +export TOPOLOGIES=(2x2x1 2x2x2 2x2x4 2x4x4 4x4x4 4x4x8) +PROJECT_ID="${PROJECT_ID:-$(gcloud config get-value project 2>/dev/null)}" +export REGION=$(kubectl get nodes -o jsonpath='{.items[0].metadata.labels.topology\.kubernetes\.io/region}') +CLUSTER_NAME=$(kubectl config current-context | cut -d '_' -f 4) +export RESOURCE_NAME=${CLUSTER_NAME%-gke} + +################################################################################ +# COLOR OUTPUT +################################################################################ + +RED='\033[0;31m' +GREEN='\033[0;32m' +BLUE='\033[0;34m' +NC='\033[0m' # No Color + +function print_success() { + echo -e "${GREEN}✅ $1${NC}" +} + +function print_error() { + echo -e "${RED}❌ $1${NC}" +} + +function print_info() { + echo -e "${BLUE}ℹ️ $1${NC}" +} + +print_info "Checking CCC resources for all topologies" +missing_resources=false + +for TOPOLOGY in "${TOPOLOGIES[@]}" +do + print_info "Checking resources for topology: ${TOPOLOGY}" + # Check workload policy for multi-host topologies + if [[ "${TOPOLOGY}" != "2x2x1" ]]; then + WORKLOAD_POLICY_NAME="${RESOURCE_NAME}-workload-policy${TOPOLOGY}" + if gcloud compute resource-policies describe ${WORKLOAD_POLICY_NAME} --project=${PROJECT_ID} --region=${REGION} &> /dev/null; then + print_success "Workload policy ${WORKLOAD_POLICY_NAME} exists." + else + print_error "Workload policy ${WORKLOAD_POLICY_NAME} is MISSING." + missing_resources=true + fi + else + print_info "Skipping workload policy check for single-host topology ${TOPOLOGY}." + fi + + # Check Custom Compute Class + CCC_NAME="tpuv7-${TOPOLOGY}-class" + if kubectl get computeclass ${CCC_NAME} &> /dev/null; then + print_success "Custom Compute Class ${CCC_NAME} exists." + else + print_error "Custom Compute Class ${CCC_NAME} is MISSING." 
+ missing_resources=true
  fi
done

if [[ "${missing_resources}" == "true" ]]; then
  print_error "One or more required resources are missing. Please create them."
  exit 1
else
  print_success "All required CCC resources exist."
  exit 0
fi diff --git a/Ironwood/guides/automation/autoscaling/create_ccc_templates.sh b/Ironwood/guides/automation/autoscaling/create_ccc_templates.sh new file mode 100755 index 0000000..5630885 --- /dev/null +++ b/Ironwood/guides/automation/autoscaling/create_ccc_templates.sh @@ -0,0 +1,98 @@ +#!/bin/bash + +###################################################################### +# create_ccc_templates.sh: Create Custom Compute Class templates +###################################################################### +# This script creates the necessary Google Cloud Compute resource policies +# and Kubernetes Custom Compute Class (CCC) manifests for various TPU +# topologies. +# +# It iterates through a predefined list of TOPOLOGIES: +# - For multi-host topologies, it creates a HIGH_THROUGHPUT +# workload policy if it doesn't already exist. +# - It then uses envsubst to populate a template YAML +# (tpu-ccc-template.yaml) with the correct TPU_TOPOLOGY, +# RESERVATION_NAME, PROJECT_ID, and POLICY_NAME. +# - The resulting manifest is applied to the Kubernetes cluster using +# kubectl apply. +# +# Variables: +# - RESERVATION_NAME: The name of the GCE reservation to use. Must be set +#   at the top of this script before running it. +# - PROJECT_ID, REGION, RESOURCE_NAME: Derived automatically from the +#   gcloud and kubectl configuration if not already set. 
+######################################################################

export RESERVATION_NAME="" # REQUIRED: set this to your GCE reservation name


export TOPOLOGIES=(2x2x1 2x2x2 2x2x4 2x4x4 4x4x4 4x4x8)
SCRIPT_DIR="$(dirname "$(realpath "$0")")"
PROJECT_ID="${PROJECT_ID:-$(gcloud config get-value project 2>/dev/null)}"
export REGION=$(kubectl get nodes -o jsonpath='{.items[0].metadata.labels.topology\.kubernetes\.io/region}')
CLUSTER_NAME=$(kubectl config current-context | cut -d '_' -f 4)
export RESOURCE_NAME=${CLUSTER_NAME%-gke} # assumes cluster was created with setup script which creates cluster with ${RESOURCE_NAME}-gke as name
################################################################################
# COLOR OUTPUT
################################################################################

RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

function print_header() {
  echo -e "\n${BLUE}========================================${NC}"
  echo -e "${BLUE}$1${NC}"
  echo -e "${BLUE}========================================${NC}\n"
}

function print_success() {
  echo -e "${GREEN}✅ $1${NC}"
}

function print_error() {
  echo -e "${RED}❌ $1${NC}"
}

function print_warning() {
  echo -e "${YELLOW}⚠️ $1${NC}"
}

function print_info() {
  echo -e "${BLUE}ℹ️ $1${NC}"
}

print_info "Creating CCC templates for all topologies"
# Create workload policy
for TOPOLOGY in "${TOPOLOGIES[@]}"
do
  export TPU_TOPOLOGY=${TOPOLOGY}
  if [[ "${TOPOLOGY}" == "2x2x1" ]]; then
    print_warning "Skipping workload policy creation for ${TOPOLOGY} as it is not needed for single host topologies."
    export POLICY_NAME="" # No policy for single host
  else
    WORKLOAD_POLICY_NAME="${RESOURCE_NAME}-workload-policy${TOPOLOGY}"
    if gcloud compute resource-policies describe ${WORKLOAD_POLICY_NAME} --project=${PROJECT_ID} --region=${REGION} &> /dev/null; then
      print_info "Workload policy ${WORKLOAD_POLICY_NAME} already exists." 
+ else
      print_info "Creating workload policy ${WORKLOAD_POLICY_NAME}..."
      gcloud compute resource-policies create workload-policy ${WORKLOAD_POLICY_NAME} \
        --type HIGH_THROUGHPUT \
        --accelerator-topology ${TOPOLOGY} \
        --project ${PROJECT_ID} \
        --region ${REGION}
      print_success "Workload policy ${WORKLOAD_POLICY_NAME} created."
    fi
    export POLICY_NAME=${WORKLOAD_POLICY_NAME}
  fi

  print_info "Applying template: topology=${TPU_TOPOLOGY}, reservation=${RESERVATION_NAME}, project=${PROJECT_ID}, policy=${POLICY_NAME}"
  if [[ "${TOPOLOGY}" == "2x2x1" ]]; then
    envsubst '${TPU_TOPOLOGY} ${RESERVATION_NAME} ${PROJECT_ID}' < ${SCRIPT_DIR}/tpu-ccc-template.yaml | sed '/placement:/,/policyName:/d' | kubectl apply -f -
  else
    envsubst '${TPU_TOPOLOGY} ${RESERVATION_NAME} ${PROJECT_ID} ${POLICY_NAME}' < ${SCRIPT_DIR}/tpu-ccc-template.yaml | kubectl apply -f -
  fi
  print_success "Applied TPU Compute Class for ${TOPOLOGY}"
done diff --git a/Ironwood/guides/automation/autoscaling/job-queue-CCC.yaml b/Ironwood/guides/automation/autoscaling/job-queue-CCC.yaml new file mode 100644 index 0000000..c1c155c --- /dev/null +++ b/Ironwood/guides/automation/autoscaling/job-queue-CCC.yaml @@ -0,0 +1,40 @@ +apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: "flavor-tpu7x"
spec:
  nodeLabels:
    cloud.google.com/gke-tpu-accelerator: tpu7x
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: cluster-queue-tpu7x
spec:
  flavorFungibility:
    whenCanBorrow: MayStopSearch
    whenCanPreempt: TryNextFlavor
  namespaceSelector: {}
  preemption:
    borrowWithinCohort:
      policy: Never
    reclaimWithinCohort: Never
    withinClusterQueue: LowerPriority
  queueingStrategy: BestEffortFIFO
  resourceGroups:
  - coveredResources:
    - google.com/tpu
    flavors:
    - name: flavor-tpu7x
      resources:
      - name: google.com/tpu
        nominalQuota: 128
  stopPolicy: None
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  namespace: default
  name: "user-queue-tpu7x"
spec:
  clusterQueue: 
"cluster-queue-tpu7x" \ No newline at end of file diff --git a/Ironwood/guides/automation/autoscaling/tpu-ccc-template.yaml b/Ironwood/guides/automation/autoscaling/tpu-ccc-template.yaml new file mode 100644 index 0000000..100175f --- /dev/null +++ b/Ironwood/guides/automation/autoscaling/tpu-ccc-template.yaml @@ -0,0 +1,19 @@ +apiVersion: cloud.google.com/v1 +kind: ComputeClass +metadata: + name: tpuv7-${TPU_TOPOLOGY}-class +spec: + priorities: + - tpu: + type: tpu7x + topology: ${TPU_TOPOLOGY} + count: 4 + reservations: + specific: + - name: ${RESERVATION_NAME} + project: ${PROJECT_ID} + affinity: Specific + placement: + policyName: ${POLICY_NAME} + nodePoolAutoCreation: + enabled: true \ No newline at end of file diff --git a/Ironwood/guides/automation/autoscaling/tpu7x-2x2x1-bmm.yaml b/Ironwood/guides/automation/autoscaling/tpu7x-2x2x1-bmm.yaml new file mode 100644 index 0000000..1a4c777 --- /dev/null +++ b/Ironwood/guides/automation/autoscaling/tpu7x-2x2x1-bmm.yaml @@ -0,0 +1,62 @@ +apiVersion: v1 +kind: Service +metadata: + name: headless-svc-${JOB_NAME} +spec: + clusterIP: None + selector: + job-name: ${JOB_NAME} +--- +apiVersion: batch/v1 +kind: Job +metadata: + name: ${JOB_NAME} + labels: + kueue.x-k8s.io/queue-name: user-queue-tpu7x +spec: + completionMode: Indexed + suspend: true + parallelism: 1 + completions: 1 + backoffLimit: 0 + template: + spec: + subdomain: headless-svc-${JOB_NAME} + serviceAccountName: ${GCS_SA_NAME} + restartPolicy: Never + nodeSelector: + cloud.google.com/compute-class: tpuv7-2x2x1-class + cloud.google.com/gke-tpu-accelerator: tpu7x + cloud.google.com/gke-tpu-topology: 2x2x1 + containers: + - name: jax-tpu + image: python:3.12 + securityContext: + privileged: false + env: + - name: JAX_PLATFORMS + value: "tpu,cpu" + - name: TPU_VMODULE + value: "singleton_tpu_system_manager=10,tpu_version_flag=10,device_util=10,device_scanner=10,mesh_builder=10,master=10" + - name: XLA_IR_DEBUG + value: "1" + - name: XLA_HLO_DEBUG + value: "1" + 
command: + - bash + - -c + - | + set -ex + + git clone https://github.com/AI-Hypercomputer/accelerator-microbenchmarks.git + cd accelerator-microbenchmarks + git checkout tpu7x-auto + pip install -r requirements.txt + + GCS_BUCKET_DIR=${GCS_PATH} + python Ironwood/src/run_benchmark.py --config="Ironwood/configs/bmm/single_device_bmm.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR} + resources: + requests: + google.com/tpu: 4 + limits: + google.com/tpu: 4 \ No newline at end of file diff --git a/Ironwood/guides/automation/autoscaling/tpu7x-2x2x1-collectives.yaml b/Ironwood/guides/automation/autoscaling/tpu7x-2x2x1-collectives.yaml new file mode 100644 index 0000000..eb15298 --- /dev/null +++ b/Ironwood/guides/automation/autoscaling/tpu7x-2x2x1-collectives.yaml @@ -0,0 +1,64 @@ +apiVersion: v1 +kind: Service +metadata: + name: headless-svc-${JOB_NAME} +spec: + clusterIP: None + selector: + job-name: ${JOB_NAME} +--- +apiVersion: batch/v1 +kind: Job +metadata: + name: ${JOB_NAME} + labels: + kueue.x-k8s.io/queue-name: user-queue-tpu7x +spec: + completionMode: Indexed + suspend: true + parallelism: 1 + completions: 1 + backoffLimit: 0 + template: + spec: + subdomain: headless-svc-${JOB_NAME} + serviceAccountName: ${GCS_SA_NAME} + restartPolicy: Never + nodeSelector: + cloud.google.com/compute-class: tpuv7-2x2x1-class + cloud.google.com/gke-tpu-accelerator: tpu7x + cloud.google.com/gke-tpu-topology: 2x2x1 + containers: + - name: jax-tpu + image: python:3.12 + securityContext: + privileged: false + env: + - name: JAX_PLATFORMS + value: "tpu,cpu" + - name: TPU_VMODULE + value: "singleton_tpu_system_manager=10,tpu_version_flag=10,device_util=10,device_scanner=10,mesh_builder=10,master=10" + - name: XLA_IR_DEBUG + value: "1" + - name: XLA_HLO_DEBUG + value: "1" + command: + - bash + - -c + - | + set -ex + + git clone https://github.com/AI-Hypercomputer/accelerator-microbenchmarks.git + cd accelerator-microbenchmarks + git checkout tpu7x-auto + pip install -r requirements.txt + 
+ GCS_BUCKET_DIR=${GCS_PATH} + python Ironwood/src/run_benchmark.py --config="Ironwood/configs/collectives/all_gather_tpu7x_2x2x1.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR} + python Ironwood/src/run_benchmark.py --config="Ironwood/configs/collectives/all_reduce_tpu7x_2x2x1.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR} + python Ironwood/src/run_benchmark.py --config="Ironwood/configs/collectives/all_to_all_tpu7x_2x2x1.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR} + resources: + requests: + google.com/tpu: 4 + limits: + google.com/tpu: 4 \ No newline at end of file diff --git a/Ironwood/guides/automation/autoscaling/tpu7x-2x2x1-gemm.yaml b/Ironwood/guides/automation/autoscaling/tpu7x-2x2x1-gemm.yaml new file mode 100644 index 0000000..822a224 --- /dev/null +++ b/Ironwood/guides/automation/autoscaling/tpu7x-2x2x1-gemm.yaml @@ -0,0 +1,62 @@ +apiVersion: v1 +kind: Service +metadata: + name: headless-svc-${JOB_NAME} +spec: + clusterIP: None + selector: + job-name: ${JOB_NAME} +--- +apiVersion: batch/v1 +kind: Job +metadata: + name: ${JOB_NAME} + labels: + kueue.x-k8s.io/queue-name: user-queue-tpu7x +spec: + completionMode: Indexed + suspend: true + parallelism: 1 + completions: 1 + backoffLimit: 0 + template: + spec: + subdomain: headless-svc-${JOB_NAME} + serviceAccountName: ${GCS_SA_NAME} + restartPolicy: Never + nodeSelector: + cloud.google.com/compute-class: tpuv7-2x2x1-class + cloud.google.com/gke-tpu-accelerator: tpu7x + cloud.google.com/gke-tpu-topology: 2x2x1 + containers: + - name: jax-tpu + image: python:3.12 + securityContext: + privileged: false + env: + - name: JAX_PLATFORMS + value: "tpu,cpu" + - name: TPU_VMODULE + value: "singleton_tpu_system_manager=10,tpu_version_flag=10,device_util=10,device_scanner=10,mesh_builder=10,master=10" + - name: XLA_IR_DEBUG + value: "1" + - name: XLA_HLO_DEBUG + value: "1" + command: + - bash + - -c + - | + set -ex + + git clone https://github.com/AI-Hypercomputer/accelerator-microbenchmarks.git + cd 
accelerator-microbenchmarks
+          git checkout tpu7x-auto
+          pip install -r requirements.txt
+
+          GCS_BUCKET_DIR=${GCS_PATH}
+          python Ironwood/src/run_benchmark.py --config="Ironwood/configs/gemm/gemm_multiple_run_more.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR}
+        resources:
+          requests:
+            google.com/tpu: 4
+          limits:
+            google.com/tpu: 4
\ No newline at end of file
diff --git a/Ironwood/guides/automation/autoscaling/tpu7x-2x2x1-gemm_all_reduce.yaml b/Ironwood/guides/automation/autoscaling/tpu7x-2x2x1-gemm_all_reduce.yaml
new file mode 100644
index 0000000..11b1fce
--- /dev/null
+++ b/Ironwood/guides/automation/autoscaling/tpu7x-2x2x1-gemm_all_reduce.yaml
@@ -0,0 +1,62 @@
+apiVersion: v1
+kind: Service
+metadata:
+  name: headless-svc-${JOB_NAME}
+spec:
+  clusterIP: None
+  selector:
+    job-name: ${JOB_NAME}
+---
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: ${JOB_NAME}
+  labels:
+    kueue.x-k8s.io/queue-name: user-queue-tpu7x
+spec:
+  completionMode: Indexed
+  suspend: true
+  parallelism: 1
+  completions: 1
+  backoffLimit: 0
+  template:
+    spec:
+      subdomain: headless-svc-${JOB_NAME}
+      serviceAccountName: ${GCS_SA_NAME}
+      restartPolicy: Never
+      nodeSelector:
+        cloud.google.com/compute-class: tpuv7-2x2x1-class
+        cloud.google.com/gke-tpu-accelerator: tpu7x
+        cloud.google.com/gke-tpu-topology: 2x2x1
+      containers:
+      - name: jax-tpu
+        image: python:3.12
+        securityContext:
+          privileged: false
+        env:
+        - name: JAX_PLATFORMS
+          value: "tpu,cpu"
+        - name: TPU_VMODULE
+          value: "singleton_tpu_system_manager=10,tpu_version_flag=10,device_util=10,device_scanner=10,mesh_builder=10,master=10"
+        - name: XLA_IR_DEBUG
+          value: "1"
+        - name: XLA_HLO_DEBUG
+          value: "1"
+        command:
+        - bash
+        - -c
+        - |
+          set -ex
+
+          git clone https://github.com/AI-Hypercomputer/accelerator-microbenchmarks.git
+          cd accelerator-microbenchmarks
+          git checkout tpu7x-auto
+          pip install -r requirements.txt
+
+          GCS_BUCKET_DIR=${GCS_PATH}
+          python Ironwood/src/run_benchmark.py --config="Ironwood/configs/gemm_all_reduce/gemm_all_reduce.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR}
+        resources:
+          requests:
+            google.com/tpu: 4
+          limits:
+            google.com/tpu: 4
diff --git a/Ironwood/guides/automation/autoscaling/tpu7x-2x2x1-hbm.yaml b/Ironwood/guides/automation/autoscaling/tpu7x-2x2x1-hbm.yaml
new file mode 100644
index 0000000..f589cc9
--- /dev/null
+++ b/Ironwood/guides/automation/autoscaling/tpu7x-2x2x1-hbm.yaml
@@ -0,0 +1,62 @@
+apiVersion: v1
+kind: Service
+metadata:
+  name: headless-svc-${JOB_NAME}
+spec:
+  clusterIP: None
+  selector:
+    job-name: ${JOB_NAME}
+---
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: ${JOB_NAME}
+  labels:
+    kueue.x-k8s.io/queue-name: user-queue-tpu7x
+spec:
+  completionMode: Indexed
+  suspend: true
+  parallelism: 1
+  completions: 1
+  backoffLimit: 0
+  template:
+    spec:
+      subdomain: headless-svc-${JOB_NAME}
+      serviceAccountName: ${GCS_SA_NAME}
+      restartPolicy: Never
+      nodeSelector:
+        cloud.google.com/compute-class: tpuv7-2x2x1-class
+        cloud.google.com/gke-tpu-accelerator: tpu7x
+        cloud.google.com/gke-tpu-topology: 2x2x1
+      containers:
+      - name: jax-tpu
+        image: python:3.12
+        securityContext:
+          privileged: false
+        env:
+        - name: JAX_PLATFORMS
+          value: "tpu,cpu"
+        - name: TPU_VMODULE
+          value: "singleton_tpu_system_manager=10,tpu_version_flag=10,device_util=10,device_scanner=10,mesh_builder=10,master=10"
+        - name: XLA_IR_DEBUG
+          value: "1"
+        - name: XLA_HLO_DEBUG
+          value: "1"
+        command:
+        - bash
+        - -c
+        - |
+          set -ex
+
+          git clone https://github.com/AI-Hypercomputer/accelerator-microbenchmarks.git
+          cd accelerator-microbenchmarks
+          git checkout tpu7x-auto
+          pip install -r requirements.txt
+
+          GCS_BUCKET_DIR=${GCS_PATH}
+          python Ironwood/src/run_benchmark.py --config="Ironwood/configs/hbm/hbm.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR}
+        resources:
+          requests:
+            google.com/tpu: 4
+          limits:
+            google.com/tpu: 4
\ No newline at end of file
diff --git a/Ironwood/guides/automation/autoscaling/tpu7x-2x2x1-host_device.yaml b/Ironwood/guides/automation/autoscaling/tpu7x-2x2x1-host_device.yaml
new file mode 100644
index 0000000..bc9c081
--- /dev/null
+++ b/Ironwood/guides/automation/autoscaling/tpu7x-2x2x1-host_device.yaml
@@ -0,0 +1,62 @@
+apiVersion: v1
+kind: Service
+metadata:
+  name: headless-svc-${JOB_NAME}
+spec:
+  clusterIP: None
+  selector:
+    job-name: ${JOB_NAME}
+---
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: ${JOB_NAME}
+  labels:
+    kueue.x-k8s.io/queue-name: user-queue-tpu7x
+spec:
+  completionMode: Indexed
+  suspend: true
+  parallelism: 1
+  completions: 1
+  backoffLimit: 0
+  template:
+    spec:
+      subdomain: headless-svc-${JOB_NAME}
+      serviceAccountName: ${GCS_SA_NAME}
+      restartPolicy: Never
+      nodeSelector:
+        cloud.google.com/compute-class: tpuv7-2x2x1-class
+        cloud.google.com/gke-tpu-accelerator: tpu7x
+        cloud.google.com/gke-tpu-topology: 2x2x1
+      containers:
+      - name: jax-tpu
+        image: python:3.12
+        securityContext:
+          privileged: false
+        env:
+        - name: JAX_PLATFORMS
+          value: "tpu,cpu"
+        - name: TPU_VMODULE
+          value: "singleton_tpu_system_manager=10,tpu_version_flag=10,device_util=10,device_scanner=10,mesh_builder=10,master=10"
+        - name: XLA_IR_DEBUG
+          value: "1"
+        - name: XLA_HLO_DEBUG
+          value: "1"
+        command:
+        - bash
+        - -c
+        - |
+          set -ex
+
+          git clone https://github.com/AI-Hypercomputer/accelerator-microbenchmarks.git
+          cd accelerator-microbenchmarks
+          git checkout tpu7x-auto
+          pip install -r requirements.txt
+
+          GCS_BUCKET_DIR=${GCS_PATH}
+          python Ironwood/src/run_benchmark.py --config="Ironwood/configs/host_device/host_device.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR}
+        resources:
+          requests:
+            google.com/tpu: 4
+          limits:
+            google.com/tpu: 4
\ No newline at end of file
diff --git a/Ironwood/guides/automation/autoscaling/tpu7x-2x2x2-collectives.yaml b/Ironwood/guides/automation/autoscaling/tpu7x-2x2x2-collectives.yaml
new file mode 100644
index 0000000..7915ff2
--- /dev/null
+++ b/Ironwood/guides/automation/autoscaling/tpu7x-2x2x2-collectives.yaml
@@ -0,0 +1,64 @@
+apiVersion: v1
+kind: Service
+metadata:
+  name: headless-svc-${JOB_NAME}
+spec:
+  clusterIP: None
+  selector:
+    job-name: ${JOB_NAME}
+---
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: ${JOB_NAME}
+  labels:
+    kueue.x-k8s.io/queue-name: user-queue-tpu7x
+spec:
+  completionMode: Indexed
+  suspend: true
+  parallelism: 2
+  completions: 2
+  backoffLimit: 0
+  template:
+    spec:
+      subdomain: headless-svc-${JOB_NAME}
+      serviceAccountName: ${GCS_SA_NAME}
+      restartPolicy: Never
+      nodeSelector:
+        cloud.google.com/compute-class: tpuv7-2x2x2-class
+        cloud.google.com/gke-tpu-accelerator: tpu7x
+        cloud.google.com/gke-tpu-topology: 2x2x2
+      containers:
+      - name: jax-tpu
+        image: python:3.12
+        securityContext:
+          privileged: false
+        env:
+        - name: JAX_PLATFORMS
+          value: "tpu,cpu"
+        - name: TPU_VMODULE
+          value: "singleton_tpu_system_manager=10,tpu_version_flag=10,device_util=10,device_scanner=10,mesh_builder=10,master=10"
+        - name: XLA_IR_DEBUG
+          value: "1"
+        - name: XLA_HLO_DEBUG
+          value: "1"
+        command:
+        - bash
+        - -c
+        - |
+          set -ex
+
+          git clone https://github.com/AI-Hypercomputer/accelerator-microbenchmarks.git
+          cd accelerator-microbenchmarks
+          git checkout tpu7x-auto
+          pip install -r requirements.txt
+
+          GCS_BUCKET_DIR=${GCS_PATH}
+          python Ironwood/src/run_benchmark.py --config="Ironwood/configs/collectives/all_gather_tpu7x_2x2x2.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR}
+          python Ironwood/src/run_benchmark.py --config="Ironwood/configs/collectives/all_reduce_tpu7x_2x2x2.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR}
+          python Ironwood/src/run_benchmark.py --config="Ironwood/configs/collectives/all_to_all_tpu7x_2x2x2.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR}
+        resources:
+          requests:
+            google.com/tpu: 4
+          limits:
+            google.com/tpu: 4
\ No newline at end of file
diff --git a/Ironwood/guides/automation/autoscaling/tpu7x-2x2x4-collectives.yaml b/Ironwood/guides/automation/autoscaling/tpu7x-2x2x4-collectives.yaml
new file mode 100644
index 0000000..6028296
--- /dev/null
+++ b/Ironwood/guides/automation/autoscaling/tpu7x-2x2x4-collectives.yaml
@@ -0,0 +1,64 @@
+apiVersion: v1
+kind: Service
+metadata:
+  name: headless-svc-${JOB_NAME}
+spec:
+  clusterIP: None
+  selector:
+    job-name: ${JOB_NAME}
+---
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: ${JOB_NAME}
+  labels:
+    kueue.x-k8s.io/queue-name: user-queue-tpu7x
+spec:
+  completionMode: Indexed
+  suspend: true
+  parallelism: 4
+  completions: 4
+  backoffLimit: 0
+  template:
+    spec:
+      subdomain: headless-svc-${JOB_NAME}
+      serviceAccountName: ${GCS_SA_NAME}
+      restartPolicy: Never
+      nodeSelector:
+        cloud.google.com/compute-class: tpuv7-2x2x4-class
+        cloud.google.com/gke-tpu-accelerator: tpu7x
+        cloud.google.com/gke-tpu-topology: 2x2x4
+      containers:
+      - name: jax-tpu
+        image: python:3.12
+        securityContext:
+          privileged: false
+        env:
+        - name: JAX_PLATFORMS
+          value: "tpu,cpu"
+        - name: TPU_VMODULE
+          value: "singleton_tpu_system_manager=10,tpu_version_flag=10,device_util=10,device_scanner=10,mesh_builder=10,master=10"
+        - name: XLA_IR_DEBUG
+          value: "1"
+        - name: XLA_HLO_DEBUG
+          value: "1"
+        command:
+        - bash
+        - -c
+        - |
+          set -ex
+
+          git clone https://github.com/AI-Hypercomputer/accelerator-microbenchmarks.git
+          cd accelerator-microbenchmarks
+          git checkout tpu7x-auto
+          pip install -r requirements.txt
+
+          GCS_BUCKET_DIR=${GCS_PATH}
+          python Ironwood/src/run_benchmark.py --config="Ironwood/configs/collectives/all_gather_tpu7x_2x2x4.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR}
+          python Ironwood/src/run_benchmark.py --config="Ironwood/configs/collectives/all_reduce_tpu7x_2x2x4.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR}
+          python Ironwood/src/run_benchmark.py --config="Ironwood/configs/collectives/all_to_all_tpu7x_2x2x4.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR}
+        resources:
+          requests:
+            google.com/tpu: 4
+          limits:
+            google.com/tpu: 4
\ No newline at end of file
diff --git a/Ironwood/guides/automation/autoscaling/tpu7x-2x4x4-collectives.yaml b/Ironwood/guides/automation/autoscaling/tpu7x-2x4x4-collectives.yaml
new file mode 100644
index 0000000..343bbf0
--- /dev/null
+++ b/Ironwood/guides/automation/autoscaling/tpu7x-2x4x4-collectives.yaml
@@ -0,0 +1,64 @@
+apiVersion: v1
+kind: Service
+metadata:
+  name: headless-svc-${JOB_NAME}
+spec:
+  clusterIP: None
+  selector:
+    job-name: ${JOB_NAME}
+---
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: ${JOB_NAME}
+  labels:
+    kueue.x-k8s.io/queue-name: user-queue-tpu7x
+spec:
+  completionMode: Indexed
+  suspend: true
+  parallelism: 8
+  completions: 8
+  backoffLimit: 0
+  template:
+    spec:
+      subdomain: headless-svc-${JOB_NAME}
+      serviceAccountName: ${GCS_SA_NAME}
+      restartPolicy: Never
+      nodeSelector:
+        cloud.google.com/compute-class: tpuv7-2x4x4-class
+        cloud.google.com/gke-tpu-accelerator: tpu7x
+        cloud.google.com/gke-tpu-topology: 2x4x4
+      containers:
+      - name: jax-tpu
+        image: python:3.12
+        securityContext:
+          privileged: false
+        env:
+        - name: JAX_PLATFORMS
+          value: "tpu,cpu"
+        - name: TPU_VMODULE
+          value: "singleton_tpu_system_manager=10,tpu_version_flag=10,device_util=10,device_scanner=10,mesh_builder=10,master=10"
+        - name: XLA_IR_DEBUG
+          value: "1"
+        - name: XLA_HLO_DEBUG
+          value: "1"
+        command:
+        - bash
+        - -c
+        - |
+          set -ex
+
+          git clone https://github.com/AI-Hypercomputer/accelerator-microbenchmarks.git
+          cd accelerator-microbenchmarks
+          git checkout tpu7x-auto
+          pip install -r requirements.txt
+
+          GCS_BUCKET_DIR=${GCS_PATH}
+          python Ironwood/src/run_benchmark.py --config="Ironwood/configs/collectives/all_gather_tpu7x_2x4x4.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR}
+          python Ironwood/src/run_benchmark.py --config="Ironwood/configs/collectives/all_reduce_tpu7x_2x4x4.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR}
+          python Ironwood/src/run_benchmark.py --config="Ironwood/configs/collectives/all_to_all_tpu7x_2x4x4.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR}
+        resources:
+          requests:
+            google.com/tpu: 4
+          limits:
+            google.com/tpu: 4
\ No newline at end of file
diff --git a/Ironwood/guides/automation/autoscaling/tpu7x-4x4x4-collectives.yaml b/Ironwood/guides/automation/autoscaling/tpu7x-4x4x4-collectives.yaml
new file mode 100644
index 0000000..23f0fb3
--- /dev/null
+++ b/Ironwood/guides/automation/autoscaling/tpu7x-4x4x4-collectives.yaml
@@ -0,0 +1,64 @@
+apiVersion: v1
+kind: Service
+metadata:
+  name: headless-svc-${JOB_NAME}
+spec:
+  clusterIP: None
+  selector:
+    job-name: ${JOB_NAME}
+---
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: ${JOB_NAME}
+  labels:
+    kueue.x-k8s.io/queue-name: user-queue-tpu7x
+spec:
+  completionMode: Indexed
+  suspend: true
+  parallelism: 16
+  completions: 16
+  backoffLimit: 0
+  template:
+    spec:
+      subdomain: headless-svc-${JOB_NAME}
+      serviceAccountName: ${GCS_SA_NAME}
+      restartPolicy: Never
+      nodeSelector:
+        cloud.google.com/compute-class: tpuv7-4x4x4-class
+        cloud.google.com/gke-tpu-accelerator: tpu7x
+        cloud.google.com/gke-tpu-topology: 4x4x4
+      containers:
+      - name: jax-tpu
+        image: python:3.12
+        securityContext:
+          privileged: false
+        env:
+        - name: JAX_PLATFORMS
+          value: "tpu,cpu"
+        - name: TPU_VMODULE
+          value: "singleton_tpu_system_manager=10,tpu_version_flag=10,device_util=10,device_scanner=10,mesh_builder=10,master=10"
+        - name: XLA_IR_DEBUG
+          value: "1"
+        - name: XLA_HLO_DEBUG
+          value: "1"
+        command:
+        - bash
+        - -c
+        - |
+          set -ex
+
+          git clone https://github.com/AI-Hypercomputer/accelerator-microbenchmarks.git
+          cd accelerator-microbenchmarks
+          git checkout tpu7x-auto
+          pip install -r requirements.txt
+
+          GCS_BUCKET_DIR=${GCS_PATH}
+          python Ironwood/src/run_benchmark.py --config="Ironwood/configs/collectives/all_gather_tpu7x_4x4x4.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR}
+          python Ironwood/src/run_benchmark.py --config="Ironwood/configs/collectives/all_reduce_tpu7x_4x4x4.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR}
+          python Ironwood/src/run_benchmark.py --config="Ironwood/configs/collectives/all_to_all_tpu7x_4x4x4.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR}
+        resources:
+          requests:
+            google.com/tpu: 4
+          limits:
+            google.com/tpu: 4
\ No newline at end of file
diff --git a/Ironwood/guides/automation/autoscaling/tpu7x-4x4x8-collectives.yaml b/Ironwood/guides/automation/autoscaling/tpu7x-4x4x8-collectives.yaml
new file mode 100644
index 0000000..25655ca
--- /dev/null
+++ b/Ironwood/guides/automation/autoscaling/tpu7x-4x4x8-collectives.yaml
@@ -0,0 +1,64 @@
+apiVersion: v1
+kind: Service
+metadata:
+  name: headless-svc-${JOB_NAME}
+spec:
+  clusterIP: None
+  selector:
+    job-name: ${JOB_NAME}
+---
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: ${JOB_NAME}
+  labels:
+    kueue.x-k8s.io/queue-name: user-queue-tpu7x
+spec:
+  completionMode: Indexed
+  suspend: true
+  parallelism: 32
+  completions: 32
+  backoffLimit: 0
+  template:
+    spec:
+      subdomain: headless-svc-${JOB_NAME}
+      serviceAccountName: ${GCS_SA_NAME}
+      restartPolicy: Never
+      nodeSelector:
+        cloud.google.com/compute-class: tpuv7-4x4x8-class
+        cloud.google.com/gke-tpu-accelerator: tpu7x
+        cloud.google.com/gke-tpu-topology: 4x4x8
+      containers:
+      - name: jax-tpu
+        image: python:3.12
+        securityContext:
+          privileged: false
+        env:
+        - name: JAX_PLATFORMS
+          value: "tpu,cpu"
+        - name: TPU_VMODULE
+          value: "singleton_tpu_system_manager=10,tpu_version_flag=10,device_util=10,device_scanner=10,mesh_builder=10,master=10"
+        - name: XLA_IR_DEBUG
+          value: "1"
+        - name: XLA_HLO_DEBUG
+          value: "1"
+        command:
+        - bash
+        - -c
+        - |
+          set -ex
+
+          git clone https://github.com/AI-Hypercomputer/accelerator-microbenchmarks.git
+          cd accelerator-microbenchmarks
+          git checkout tpu7x-auto
+          pip install -r requirements.txt
+
+          GCS_BUCKET_DIR=${GCS_PATH}
+          python Ironwood/src/run_benchmark.py --config="Ironwood/configs/collectives/all_gather_tpu7x_4x4x8.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR}
+          python Ironwood/src/run_benchmark.py --config="Ironwood/configs/collectives/all_reduce_tpu7x_4x4x8.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR}
+          python Ironwood/src/run_benchmark.py --config="Ironwood/configs/collectives/all_to_all_tpu7x_4x4x8.yaml" --gcs-bucket-csv-dir=${GCS_BUCKET_DIR}
+        resources:
+          requests:
+            google.com/tpu: 4
+          limits:
+            google.com/tpu: 4
\ No newline at end of file
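
The manifests above are templates: `${JOB_NAME}`, `${GCS_SA_NAME}`, and `${GCS_PATH}` must be resolved before the YAML is submitted, and `automation_launch.sh` presumably performs an equivalent substitution. A minimal manual sketch of that rendering step (the `JOB_NAME` value below is illustrative, not a default from the repo):

```shell
# Sketch: resolve a ${VAR} placeholder in a manifest excerpt before applying it.
# The value assigned here is a hypothetical example for demonstration only.
export JOB_NAME=tpu7x-2x2x1-hbm-demo

# Substitute the literal ${JOB_NAME} token with the exported value:
sed "s|\${JOB_NAME}|${JOB_NAME}|g" <<'EOF'
metadata:
  name: ${JOB_NAME}
EOF

# Rendering and submitting a full manifest would then look like (requires
# envsubst from gettext and an authenticated kubectl):
#   envsubst < tpu7x-2x2x1-hbm.yaml | kubectl apply -f -
```

Rendering before `kubectl apply` matters here because unresolved `${...}` tokens would otherwise be sent verbatim to the API server and rejected as invalid resource names.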