feat: add OpenShift platform on Carbide by fabiendupont · Pull Request #129 · NVIDIA/ai-cloud-validation

fabiendupont · 2026-03-10T17:05:53Z

Summary

Depends on #128

OpenShift platform validation on Carbide bare metal. 12 configs covering the full stack:

| Phase | Config | What it validates |
| --- | --- | --- |
| KaaS provision | kaas-provision | Assisted Installer (AI) + Carbide orchestration (iPXE flow) |
| KaaS validation | kaas | Cluster health, GPU operator (shared k8s-inventory.sh) |
| IAM | iam | Keycloak via RHBK operator + OAuth + user lifecycle |
| Network | network | Network Operator, SR-IOV, NetworkPolicy, Multus, RDMA, NCCL |
| Storage | storage | ODF on NVMe (StorageCluster + PVC + pod mount) |
| VM | vm | KubeVirt + GPU passthrough (VFIO) |
| MachineSet | machineset | Carbide Machine API scaling (up/down + GPU workload) |
| Security | security | SELinux, Secure Boot, FIPS, Compliance Operator |
| GPU Health | gpu-health | DCGM exporter, Prometheus metrics, NHC |
| DRA | dra | Dynamic Resource Allocation driver + ResourceClaim |
| ComputeDomain | computedomain | IMEX channels, tray allocation, NVLink NCCL |
| ARM | arm | aarch64 arch, SMMUv3, GICv4.1, hugepages, NUMA |
| Hosted CP | hosted-provision | CAPI provider, BareMetalHosts, second cluster |

Test catalog entries for all 12 OpenShift configs.

Test plan

- All tests pass
- Dry-run succeeds for all configs
- Live test on OpenShift + Carbide lab

🤖 Generated with Claude Code

Override configs can now merge into base check lists instead of replacing them entirely, by including a `{__merge__: true}` marker. Matching checks are deep-merged by key, new checks are appended, and `"__remove__"` drops a check. Without the marker, lists are replaced as before (backward compat).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
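The list-merge rule described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: it assumes each check is a single-key mapping (`{CheckName: params}`), that the marker is a list element `{"__merge__": True}`, and that a check is dropped when its value is `"__remove__"`. The function names and the check names in the example are hypothetical.

```python
from copy import deepcopy

MERGE_MARKER = "__merge__"
REMOVE = "__remove__"


def merge_dicts(base, override):
    """Recursively merge override into a copy of base (nested dicts only)."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_dicts(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


def merge_check_lists(base, override):
    """Merge two lists of single-key check mappings (assumed shape)."""
    has_marker = any(
        isinstance(item, dict) and item.get(MERGE_MARKER) is True
        for item in override
    )
    if not has_marker:
        # Legacy behaviour: the override list replaces the base list.
        return deepcopy(override)

    merged = deepcopy(base)
    index = {next(iter(item)): pos for pos, item in enumerate(merged)}
    removed = set()
    for item in override:
        if MERGE_MARKER in item:
            continue  # the marker itself is not a check
        name, params = next(iter(item.items()))
        if params == REMOVE:
            removed.add(name)
        elif name in index:
            existing = merged[index[name]][name]
            if isinstance(existing, dict) and isinstance(params, dict):
                merged[index[name]] = {name: merge_dicts(existing, params)}
            else:
                merged[index[name]] = {name: deepcopy(params)}
        else:
            index[name] = len(merged)
            merged.append({name: deepcopy(params)})
    return [item for item in merged if next(iter(item)) not in removed]


# Hypothetical check names, for illustration only.
base_checks = [
    {"NodeCount": {"min": 3, "timeout": 60}},
    {"GpuOperator": {"namespace": "gpu-operator"}},
]
override_checks = [
    {"__merge__": True},
    {"NodeCount": {"min": 5}},
    {"GpuOperator": "__remove__"},
    {"Storage": {"class": "cephfs"}},
]
print(merge_check_lists(base_checks, override_checks))
```

Keeping the no-marker path as a plain replacement is what preserves the backward-compatible behaviour the commit mentions.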

Keys set to "__remove__" are deleted during deep_merge, enabling selective removal of inherited values in nested dicts (e.g., dropping a single label from expected_labels without replacing the entire dict).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
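A small sketch of how key removal during a deep merge might look. The PR names the function `deep_merge`, but its real signature is not shown here, so the two-argument form below and the sample config keys are assumptions.

```python
from copy import deepcopy

REMOVE = "__remove__"


def deep_merge(base, override):
    """Return a merged copy of base; keys set to "__remove__" are deleted."""
    result = deepcopy(base)
    for key, value in override.items():
        if value == REMOVE:
            # Drop the inherited key instead of storing the sentinel.
            result.pop(key, None)
        elif isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


# Hypothetical config fragment: drop one label, keep the rest.
base_cfg = {"expected_labels": {"gpu": "true", "zone": "a"}, "count": 2}
override_cfg = {"expected_labels": {"zone": "__remove__"}}
print(deep_merge(base_cfg, override_cfg))
```

Removing a key that does not exist in the base is treated as a no-op here, which keeps overrides reusable across base configs that differ slightly.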

Templates now define WHAT to validate (checks with context variable defaults), while provider configs define HOW to provision (commands and stubs). This enables composable validation:

    isvctl test run -f templates/kaas.yaml -f aws/eks.yaml

Key changes:
- Remove commands block from all 7 templates
- Replace hardcoded values with {{ context.X | default('Y') }}
- Template stubs in stubs/ remain as copy-paste starting points
- Existing self-contained provider configs (aws/) keep working
- Update README with layered usage documentation

The merge engine combines tests from the template with commands from the provider. Context variables flow through Jinja2 rendering into validation parameters at runtime.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>

Adds eks-layered.yaml that supplies only commands (Terraform stubs) and context overrides, designed to pair with templates/kaas.yaml:

    isvctl test run \
      -f isvctl/configs/templates/kaas.yaml \
      -f isvctl/configs/aws/eks-layered.yaml

The existing self-contained eks.yaml is unchanged (backward compat).

Adds 6 integration tests verifying:
- Templates have no commands block
- Layered merge produces both commands and tests
- Context overrides flow through
- Standalone eks.yaml still works
- Layered and standalone have the same validation check names
- All 7 templates are validation-only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>

Implements the control-plane template for the Carbide provider using carbidecli. Maps template concepts to Carbide resources:
- API health → tenant get, site list
- Access keys → SSH key group + SSH key CRUD
- Tenants → VPC CRUD

Includes:
- carbide/control-plane.yaml: layered provider config
- stubs/carbide/common/carbide.py: shared helper (run_carbide, state mgmt)
- stubs/carbide/control-plane/: 10 stub scripts matching template steps

Usage:

    isvctl test run \
      -f isvctl/configs/templates/control-plane.yaml \
      -f isvctl/configs/carbide/control-plane.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>

Implements 3 upstream templates for the Carbide provider:

Network (carbide/network.yaml + 8 stubs): VPC CRUD, subnet configuration, VPC isolation, NSG security rules, connectivity and traffic validation via carbidecli.

Image Registry (carbide/image-registry.yaml + 6 stubs): OperatingSystem CRUD, instance launch from OS image, install config lifecycle. Validates Carbide's image management capabilities.

Bare Metal (carbide/bm.yaml + 7 stubs): Instance launch/describe/list/reboot/teardown via carbidecli. Reinstall is skipped (not supported). NIM deploy/teardown reuse the shared template stubs.

All providers use the layered approach:

    isvctl test run -f templates/<template>.yaml -f carbide/<provider>.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>

IAM provider (carbide/iam.yaml + 3 stubs): Validates API token, checks scope coverage for all templates, proves write access via temp SSH key lifecycle.

Full Carbide API surface (common/carbide.py): CARBIDE_API_RESOURCES maps all 24 resources from the OpenAPI spec (bare-metal-manager-rest) with their operations and scope names.

CRUD library (common/resources.py): Pre-configured CarbideResource instances for all resources: site, vpc, vpc-prefix, subnet, nsg, ipblock, allocation, instance, instance-type, machine, expected-machine, operating-system, infiniband-partition, nvlink-logical-partition, nvlink-interface, dpu-extension-service, sshkeygroup, sshkey, tenant, rack, tray, sku, audit.

Dynamic scope calculation: effective_scopes_for_template() reduces required scopes when pre-existing resources are set via CARBIDE_*_ID env vars. TEMPLATE_REQUIRED_SCOPES defines minimum scopes per template.

Pre-existing resource support in create/teardown stubs: CARBIDE_VPC_ID, CARBIDE_VPC_PREFIX_ID, CARBIDE_SUBNET_ID, CARBIDE_SSH_KEY_GROUP_ID, CARBIDE_OS_ID, CARBIDE_INSTANCE_ID.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
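To illustrate the dynamic scope calculation, here is a rough sketch of how required scopes could shrink when a resource ID is supplied through the environment. The function and table names come from the commit message, but the signature, the scope strings (`vpc:write` and so on), and the env-var-to-scope mapping are all assumptions made for this example; the PR description does not show them.

```python
import os

# Assumed scope names; the real ones come from the Carbide OpenAPI spec.
TEMPLATE_REQUIRED_SCOPES = {
    "network": {"vpc:write", "vpc-prefix:write", "subnet:write", "nsg:write"},
    "bm": {"instance:write", "operating-system:write"},
}

# Assumed mapping: a pre-existing resource removes the need to create it.
PREEXISTING_ENV_SCOPES = {
    "CARBIDE_VPC_ID": "vpc:write",
    "CARBIDE_VPC_PREFIX_ID": "vpc-prefix:write",
    "CARBIDE_SUBNET_ID": "subnet:write",
    "CARBIDE_OS_ID": "operating-system:write",
    "CARBIDE_INSTANCE_ID": "instance:write",
}


def effective_scopes_for_template(template, env=None):
    """Return the scopes still needed once pre-existing resources are counted."""
    env = os.environ if env is None else env
    required = set(TEMPLATE_REQUIRED_SCOPES[template])
    for var, scope in PREEXISTING_ENV_SCOPES.items():
        if env.get(var):  # unset or empty means the stub must create it
            required.discard(scope)
    return required


print(sorted(effective_scopes_for_template("network", {"CARBIDE_VPC_ID": "vpc-123"})))
```

Passing the environment as a parameter keeps the calculation testable without mutating `os.environ`.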

Register Carbide provider configs in PLATFORM_CONFIGS so test coverage tracking knows which validations are used by the Carbide provider (control-plane, network, image-registry, bm, iam).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>

Orchestrates Assisted Installer + Carbide for OpenShift deployment:
1. Create cluster in AI, collect iPXE discovery config
2. Create OperatingSystem in Carbide with iPXE config
3. Batch-create bare-metal instances with OS + InstanceType
4. Collect instance UUIDs + MACs, monitor host registration
5. Start and monitor OpenShift installation, collect kubeconfig
6. Install GPU Operator via OperatorHub

Supports pre-existing Carbide resources (VPC, subnet, Instance Type) via env vars. Deprovision gated by TEARDOWN_ENABLED=true.

Usage:

    isvctl test run \
      -f isvctl/configs/templates/kaas.yaml \
      -f isvctl/configs/openshift/kaas-provision.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
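The teardown gate mentioned above could be as simple as the sketch below. Only the variable name `TEARDOWN_ENABLED` and the value `true` come from the commit message; the helper names, the case-insensitive comparison, and the returned status dictionary are assumptions.

```python
import os


def teardown_enabled(env=None):
    """True only when TEARDOWN_ENABLED is set to "true" (case-insensitive, assumed)."""
    env = os.environ if env is None else env
    return env.get("TEARDOWN_ENABLED", "").strip().lower() == "true"


def deprovision(delete_resources, env=None):
    """Run the destructive callback only when the gate is open."""
    if not teardown_enabled(env):
        return {"status": "skipped", "reason": "TEARDOWN_ENABLED is not true"}
    delete_resources()
    return {"status": "deleted"}


# Example: with the gate closed, the callback is never invoked.
print(deprovision(lambda: print("deleting cluster"), env={}))
```

Defaulting to "skipped" means an accidental run without the variable leaves the cluster and its Carbide instances in place.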

Factored cluster inventory collection into a shared script (stubs/common/k8s-inventory.sh) that uses kubectl for all standard K8s queries: nodes, GPUs, operator namespace, driver version, RuntimeClass, GPU product. Any provider sources this script.

OpenShift setup.sh sources the shared script and adds oc-specific queries (clusterversion, infrastructure name). The JSON output matches the upstream kaas template contract.

Files:
- stubs/common/k8s-inventory.sh: shared kubectl inventory (130 lines)
- openshift/kaas.yaml: layered provider config
- openshift/kaas-overrides.yaml: site-specific tuning template
- stubs/openshift/kaas/setup.sh: sources shared + adds oc queries
- stubs/openshift/kaas/teardown.sh: test namespace cleanup

Usage:

    isvctl test run \
      -f isvctl/configs/templates/kaas.yaml \
      -f isvctl/configs/openshift/kaas.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
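The shared inventory script is written in shell, but the node and GPU part of such an inventory can be sketched in Python for illustration. The example assumes the input is the parsed output of `kubectl get nodes -o json` and counts GPUs from the `nvidia.com/gpu` capacity field; the output keys are hypothetical and are not the kaas template contract, which is not shown in this PR description.

```python
import json


def summarize_nodes(node_list):
    """Summarize node and GPU counts from a parsed `kubectl get nodes -o json` document."""
    nodes = node_list.get("items", [])
    gpu_nodes = []
    total_gpus = 0
    for node in nodes:
        capacity = node.get("status", {}).get("capacity", {})
        # Kubernetes reports extended resources as strings, e.g. "8".
        count = int(capacity.get("nvidia.com/gpu", "0"))
        if count > 0:
            gpu_nodes.append(node["metadata"]["name"])
            total_gpus += count
    return {
        "node_count": len(nodes),
        "gpu_node_count": len(gpu_nodes),
        "gpu_nodes": gpu_nodes,
        "total_gpus": total_gpus,
    }


# Trimmed sample document with made-up node names.
sample = {
    "items": [
        {"metadata": {"name": "worker-0"}, "status": {"capacity": {"cpu": "64", "nvidia.com/gpu": "8"}}},
        {"metadata": {"name": "master-0"}, "status": {"capacity": {"cpu": "16"}}},
    ]
}
print(json.dumps(summarize_nodes(sample), indent=2))
```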

Implements the IAM template for OpenShift using Red Hat Build of Keycloak (rhbk-operator from redhat-operators catalog).

- create_user.py (idempotent): Deploy RHBK operator, create Keycloak instance, configure realm + OIDC client, patch OAuth CR with IdP + ingress CA, create user
- test_credentials.py: oc login with test user, verify identity, RBAC grant/deny checks
- teardown.py: Delete test user from Keycloak + OpenShift identity objects. Keycloak stays deployed for reuse.

Usage:

    isvctl test run \
      -f isvctl/configs/templates/iam.yaml \
      -f isvctl/configs/openshift/iam.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>

Standalone config (not layered on infra network template) for in-cluster GPU networking:
- NVIDIA Network Operator: MOFED drivers, RDMA device plugin
- SR-IOV Operator: VF provisioning for InfiniBand/RoCE
- NetworkPolicy: OVN-Kubernetes enforcement (block/allow)
- Multus: secondary network interface attachment
- RDMA: device availability, GPUDirect nvidia_peermem
- Multi-node NCCL: AllReduce over RDMA (2+ GPU nodes)

Graceful handling: missing operators report as failed checks, NCCL test skips if fewer than 2 GPU nodes available.

Usage:

    isvctl test run -f isvctl/configs/openshift/network.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>

Standalone config for OpenShift Data Foundation validation:

- deploy_odf.py (idempotent): Install ODF operator from OperatorHub, discover local NVMe disks via LocalVolumeDiscovery, create StorageCluster, wait for Ready. Auto-detects worker nodes or uses ODF_STORAGE_NODES env var.
- verify_storage_classes.py: Verify CephRBD and CephFS StorageClasses are provisioned.
- test_pvc_binding.py: Create PVCs for both StorageClasses, verify they reach Bound.
- test_pod_mount.py: Mount RBD PVC in a pod, write data, read back, verify integrity.
- teardown.py: Clean up test PVCs and namespace. ODF stays deployed.

Usage:

    isvctl test run -f isvctl/configs/openshift/storage.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>

Implements the VM template for OpenShift using KubeVirt with GPU passthrough via VFIO. Layers on templates/vm.yaml for validations.

- launch_instance.py: Create VirtualMachine CR with GPU devices, DataVolume from cloud image URL, cloud-init SSH config. Wait for Running + SSH ready. Auto-detects GPU device name from node labels.
- list_instances.py: List VirtualMachineInstances, verify target VM present.
- reboot_instance.py: Restart via virtctl (or VMI delete fallback), wait for SSH, report uptime for reboot verification.
- deploy_nim.py: Skipped by default. Placeholder for NIM-in-VM deployment.
- teardown.py: Delete VM, DataVolume, and test namespace.

Usage:

    VM_IMAGE_URL=<qcow2-url> VM_SSH_PUBKEY="ssh-ed25519 ..." \
      isvctl test run \
      -f isvctl/configs/templates/vm.yaml \
      -f isvctl/configs/openshift/vm.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>

Validates dynamic GPU node provisioning via Carbide Machine API:

- verify_machine_api.py: Check Machine API operator available, Carbide provider configured.
- create_machineset.py: Create GPU MachineSet with Carbide providerSpec (instance type, site ID). Configure MachineAutoscaler for scale-up tests. Wait for machines to provision and nodes to join.
- test_gpu_workload.py: Schedule a GPU pod on a MachineSet-provisioned node, verify nvidia-smi succeeds (proves GPU scheduling on dynamic nodes).
- test_scale_up.py: Increase MachineSet replicas, wait for new machines to provision and join as Ready nodes (proves Carbide create works).
- test_scale_down.py: Reduce replicas back to minimum, verify excess machines are deprovisioned (proves Carbide delete works).
- teardown.py: Delete test namespace. MachineSet deletion gated by TEARDOWN_ENABLED=true.

Usage:

    CARBIDE_INSTANCE_TYPE=<uuid> CARBIDE_SITE_ID=<uuid> \
      isvctl test run -f isvctl/configs/openshift/machineset.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
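The scale-up and scale-down tests both wait for the cluster to converge on a replica count. A generic polling helper for that kind of wait might look like the sketch below. The helper name, its parameters, and the injected `sleep` and `clock` arguments are assumptions for illustration; the stubs in this PR may wait differently.

```python
import time


def wait_for(predicate, timeout_s, interval_s=10.0, sleep=time.sleep, clock=time.monotonic):
    """Poll predicate until it returns True or timeout_s elapses.

    Returns True on success and False on timeout. The predicate is always
    evaluated at least once, even with a zero timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Example with a fake "ready node count" that grows on each poll.
ready_counts = iter([1, 2, 3])
reached = wait_for(lambda: next(ready_counts) >= 3, timeout_s=5, interval_s=0)
print(reached)
```

In a real stub the predicate would query the MachineSet or the Ready node count; injecting `sleep` and `clock` keeps the helper testable without real delays.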

Security (4 read-only checks):
- SELinux: verify Enforcing on all nodes (oc debug)
- Secure Boot: check mokutil --sb-state (warns if unsupported)
- FIPS: check install-config and /proc/sys/crypto/fips_enabled
- Compliance Operator: check scan results if operator installed

GPU Health (4 read-only checks):
- DCGM Exporter: pods running, metrics service, ServiceMonitor
- DCGM Metrics: scrape exporter, validate expected metric names
- NHC Operator: NodeHealthCheck CRD and configs (if installed)
- Prometheus: query Thanos for GPU metrics end-to-end

Both are standalone configs, no infrastructure changes.

Usage:

    isvctl test run -f isvctl/configs/openshift/security.yaml
    isvctl test run -f isvctl/configs/openshift/gpu-health.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
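As an illustration of the "validate expected metric names" step, the sketch below extracts metric names from Prometheus text exposition output and reports which expected names are absent. The helper names and the choice of expected metrics are assumptions; the PR does not list which metrics its check requires.

```python
import re

# A metric name is the leading identifier on a sample line.
_METRIC_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def metric_names(exposition_text):
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = _METRIC_NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition_text, expected):
    """Return the expected metric names that were not scraped, sorted."""
    return sorted(set(expected) - metric_names(exposition_text))


# Shortened sample of exporter output; label values are made up.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="worker-0"} 37
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="worker-0"} 54
"""

print(missing_metrics(SAMPLE, ["DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_FB_USED"]))
```

Comparing names only, and ignoring values and labels, keeps the check read-only and independent of GPU load at scrape time.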

DRA (Dynamic Resource Allocation):
- Verify DRA driver pods running and ResourceClass exists
- Create ResourceClaim + pod, verify GPU allocation works
- Check SCC compatibility with DRA driver pods

Requires GPU Operator switched to DRA mode.

ComputeDomain (IMEX):
- Verify ComputeDomain CRD exists
- Create domain spanning GPU nodes, wait for IMEX channels
- Tray allocation verification
- Multi-job scheduling within domain boundaries
- NCCL AllReduce over NVLink via IMEX channels

Requires DRA mode + NVL72 NVLink fabric.

Usage:

    isvctl test run -f isvctl/configs/openshift/dra.yaml
    isvctl test run -f isvctl/configs/openshift/computedomain.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>

Read-only checks for ARM-specific platform features. All tests skip gracefully on x86_64 clusters.

- arch_check.py: Verify aarch64 architecture on nodes
- smmu_check.py: SMMUv3 detection via dmesg + IOMMU groups
- gicv4_check.py: GICv4.1 interrupt controller detection
- hugepage_check.py: 1Gi/2Mi hugepage capacity on nodes
- topology_check.py: NUMA topology, CPU model, GPU-NUMA affinity

Usage:

    isvctl test run -f isvctl/configs/openshift/arm.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
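The "skip gracefully on x86_64" behaviour can be sketched as a check that first filters nodes by architecture. The example assumes node objects shaped like `kubectl get nodes -o json` items; Kubernetes reports aarch64 hosts as `arm64` in the `kubernetes.io/arch` label and exposes hugepage capacity under `hugepages-1Gi` and `hugepages-2Mi`. The function names and the result dictionary are hypothetical.

```python
def node_arch(node):
    """Read the architecture label Kubernetes sets on every node."""
    return node.get("metadata", {}).get("labels", {}).get("kubernetes.io/arch", "unknown")


def arm_hugepage_check(nodes):
    """Report hugepage capacity on arm64 nodes, or skip when there are none."""
    arm_nodes = [n for n in nodes if node_arch(n) == "arm64"]
    if not arm_nodes:
        return {"status": "skipped", "reason": "no arm64 nodes in cluster"}
    report = {}
    for node in arm_nodes:
        capacity = node.get("status", {}).get("capacity", {})
        report[node["metadata"]["name"]] = {
            # Raw quantity strings as reported by the kubelet, e.g. "16Gi" or "0".
            "hugepages-1Gi": capacity.get("hugepages-1Gi", "0"),
            "hugepages-2Mi": capacity.get("hugepages-2Mi", "0"),
        }
    return {"status": "checked", "nodes": report}


# An x86_64-only cluster is skipped rather than failed.
x86_nodes = [{"metadata": {"name": "w0", "labels": {"kubernetes.io/arch": "amd64"}}}]
print(arm_hugepage_check(x86_nodes))
```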

Deploys a second OpenShift cluster from the management cluster using Cluster API Provider for Carbide:

- deploy_capi_provider.py: Install CAPI core + Carbide infrastructure provider (idempotent).
- provision_baremetal_hosts.py: Batch-create Carbide instances, register as BareMetalHost resources in the management cluster for CAPI consumption.
- create_hosted_realm.py: Create a Keycloak realm on the management cluster for the hosted cluster's OAuth configuration.
- deploy_hosted_cluster.py: Create CAPI Cluster + CarbideCluster CRs, wait for hosted cluster to be provisioned, extract kubeconfig.
- configure_hosted_cluster.py: Install GPU Operator and Network Operator on the hosted cluster via its kubeconfig.
- verify_hosted_cluster.py: Basic health checks: nodes Ready, GPU Operator running, GPU nodes detected. Outputs kubeconfig path for subsequent test runs.
- teardown.py: Delete Cluster CR, BareMetalHosts, Carbide instances, namespaces. Gated by TEARDOWN_ENABLED=true.

After provisioning, re-run validation tests targeting the hosted cluster:

    KUBECONFIG=/tmp/ncp-hosted-kubeconfig \
      isvctl test run \
      -f isvctl/configs/templates/kaas.yaml \
      -f isvctl/configs/openshift/kaas.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>

Register all 12 OpenShift validation configs in PLATFORM_CONFIGS for coverage tracking: kaas, iam, network, storage, vm, machineset, security, gpu-health, dra, computedomain, arm, hosted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>

copy-pr-bot · 2026-03-10T17:05:58Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

fabiendupont and others added 20 commits March 10, 2026 12:03

fabiendupont wants to merge 20 commits into NVIDIA:main from fabiendupont:feat/openshift-carbide
