feat: add OpenShift platform on Carbide #129

Draft
fabiendupont wants to merge 20 commits into NVIDIA:main from fabiendupont:feat/openshift-carbide

Conversation

@fabiendupont

Collaborator

Summary

Depends on #128

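OpenShift platform validation on Carbide bare metal. The table lists 13 configs covering the full stack: the 12 validation configs registered in the test catalog, plus the kaas-provision orchestration config.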

| Phase | Config | What it validates |
|---|---|---|
| KaaS provision | kaas-provision | Assisted Installer (AI) + Carbide orchestration (iPXE flow) |
| KaaS validation | kaas | Cluster health, GPU operator (shared k8s-inventory.sh) |
| IAM | iam | Keycloak via RHBK operator + OAuth + user lifecycle |
| Network | network | Network Operator, SR-IOV, NetworkPolicy, Multus, RDMA, NCCL |
| Storage | storage | ODF on NVMe (StorageCluster + PVC + pod mount) |
| VM | vm | KubeVirt + GPU passthrough (VFIO) |
| MachineSet | machineset | Carbide Machine API scaling (up/down + GPU workload) |
| Security | security | SELinux, Secure Boot, FIPS, Compliance Operator |
| GPU Health | gpu-health | DCGM exporter, Prometheus metrics, NHC |
| DRA | dra | Dynamic Resource Allocation driver + ResourceClaim |
| ComputeDomain | computedomain | IMEX channels, tray allocation, NVLink NCCL |
| ARM | arm | aarch64 arch, SMMUv3, GICv4.1, hugepages, NUMA |
| Hosted CP | hosted-provision | CAPI provider, BareMetalHosts, second cluster |

Adds test catalog entries for the 12 OpenShift validation configs.

Test plan

  • All tests pass
  • Dry-run succeeds for all configs
  • Live test on OpenShift + Carbide lab

🤖 Generated with Claude Code

fabiendupont and others added 20 commits March 10, 2026 12:03
Override configs can now merge into base check lists instead of replacing
them entirely, by including a `{__merge__: true}` marker. Matching checks
are deep-merged by key, new checks are appended, and `"__remove__"` drops
a check. Without the marker, lists are replaced as before (backward compat).
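
The list-merge behaviour described above could look roughly like the sketch below. It is not the merge engine from this PR. It assumes each check entry is a single-key mapping of the form `{check_name: params}`, and the function names `merge_dicts` and `merge_check_lists` are made up for illustration.

```python
MERGE_MARKER = "__merge__"
REMOVE = "__remove__"


def merge_dicts(base, override):
    """Recursively merge override into a copy of base."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_dicts(merged[key], value)
        else:
            merged[key] = value
    return merged


def merge_check_lists(base, override):
    """Merge two check lists.

    Assumption: every entry is a single-key dict {check_name: params}.
    """
    if {MERGE_MARKER: True} not in override:
        # No marker: the override list replaces the base list (old behaviour).
        return list(override)

    merged = {}
    for entry in base:
        merged.update(entry)  # dicts keep insertion order

    for entry in override:
        if entry == {MERGE_MARKER: True}:
            continue  # the marker itself is not a check
        for name, params in entry.items():
            if params == REMOVE:
                merged.pop(name, None)  # drop an inherited check
            elif isinstance(params, dict) and isinstance(merged.get(name), dict):
                merged[name] = merge_dicts(merged[name], params)  # matching check
            else:
                merged[name] = params  # new check, appended at the end
    return [{name: params} for name, params in merged.items()]
```

With a base of `[{"node_count": {"min": 3}}, {"gpu_driver": {"version": "550"}}]`, an override that carries the marker, raises `node_count.min`, and sets `gpu_driver` to `"__remove__"` leaves only the updated `node_count` check plus any new checks from the override.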

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Keys set to "__remove__" are deleted during deep_merge, enabling
selective removal of inherited values in nested dicts (e.g., dropping
a single label from expected_labels without replacing the entire dict).
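
As an illustration of the removal sentinel in nested dicts, here is a minimal sketch. The real `deep_merge` signature is not shown in this PR, so the one below is an assumption, and the label names in the usage note are invented.

```python
REMOVE = "__remove__"


def deep_merge(base, override):
    """Return a new dict: base updated by override, honouring "__remove__"."""
    merged = dict(base)
    for key, value in override.items():
        if value == REMOVE:
            # Sentinel: delete the inherited key instead of storing the marker.
            merged.pop(key, None)
        elif isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```

For example, merging `{"expected_labels": {"zone": "__remove__"}}` over a base with `expected_labels` of `{"zone": "a", "gpu": "true"}` keeps the `gpu` label and drops only `zone`.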

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Templates now define WHAT to validate (checks with context variable
defaults), while provider configs define HOW to provision (commands
and stubs). This enables composable validation:

  isvctl test run -f templates/kaas.yaml -f aws/eks.yaml

Key changes:
- Remove commands block from all 7 templates
- Replace hardcoded values with {{ context.X | default('Y') }}
- Template stubs in stubs/ remain as copy-paste starting points
- Existing self-contained provider configs (aws/) keep working
- Update README with layered usage documentation

The merge engine combines tests from the template with commands
from the provider. Context variables flow through Jinja2 rendering
into validation parameters at runtime.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Adds eks-layered.yaml that supplies only commands (Terraform stubs)
and context overrides, designed to pair with templates/kaas.yaml:

  isvctl test run \
    -f isvctl/configs/templates/kaas.yaml \
    -f isvctl/configs/aws/eks-layered.yaml

The existing self-contained eks.yaml is unchanged (backward compat).

Adds 6 integration tests verifying:
- Templates have no commands block
- Layered merge produces both commands and tests
- Context overrides flow through
- Standalone eks.yaml still works
- Layered and standalone have the same validation check names
- All 7 templates are validation-only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Implements the control-plane template for the Carbide provider using
carbidecli. Maps template concepts to Carbide resources:
  - API health → tenant get, site list
  - Access keys → SSH key group + SSH key CRUD
  - Tenants → VPC CRUD

Includes:
  - carbide/control-plane.yaml: layered provider config
  - stubs/carbide/common/carbide.py: shared helper (run_carbide, state mgmt)
  - stubs/carbide/control-plane/: 10 stub scripts matching template steps

Usage:
  isvctl test run \
    -f isvctl/configs/templates/control-plane.yaml \
    -f isvctl/configs/carbide/control-plane.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Implements 3 upstream templates for the Carbide provider:

Network (carbide/network.yaml + 8 stubs):
  VPC CRUD, subnet configuration, VPC isolation, NSG security rules,
  connectivity and traffic validation via carbidecli.

Image Registry (carbide/image-registry.yaml + 6 stubs):
  OperatingSystem CRUD, instance launch from OS image, install config
  lifecycle — validates Carbide's image management capabilities.

Bare Metal (carbide/bm.yaml + 7 stubs):
  Instance launch/describe/list/reboot/teardown via carbidecli.
  Reinstall is skipped (not supported). NIM deploy/teardown reuse
  the shared template stubs.

All providers use the layered approach:
  isvctl test run -f templates/<template>.yaml -f carbide/<provider>.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
IAM provider (carbide/iam.yaml + 3 stubs):
  Validates API token, checks scope coverage for all templates,
  proves write access via temp SSH key lifecycle.

Full Carbide API surface (common/carbide.py):
  CARBIDE_API_RESOURCES maps all 24 resources from the OpenAPI spec
  (bare-metal-manager-rest) with their operations and scope names.

CRUD library (common/resources.py):
  Pre-configured CarbideResource instances for all resources:
  site, vpc, vpc-prefix, subnet, nsg, ipblock, allocation,
  instance, instance-type, machine, expected-machine, operating-system,
  infiniband-partition, nvlink-logical-partition, nvlink-interface,
  dpu-extension-service, sshkeygroup, sshkey, tenant, rack, tray,
  sku, audit.

Dynamic scope calculation:
  effective_scopes_for_template() reduces required scopes when
  pre-existing resources are set via CARBIDE_*_ID env vars.
  TEMPLATE_REQUIRED_SCOPES defines minimum scopes per template.
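
A rough sketch of how the scope reduction could work. The scope strings, the `PREEXISTING_ENV_TO_SCOPE` mapping, and the `env` parameter are all assumptions made for illustration. Only the names `effective_scopes_for_template` and `TEMPLATE_REQUIRED_SCOPES` come from this commit.

```python
import os

# Hypothetical scope names; the real ones are defined in common/carbide.py.
TEMPLATE_REQUIRED_SCOPES = {
    "network": {"vpc:read", "vpc:write", "subnet:read", "subnet:write"},
}

# Hypothetical mapping: when the env var is set, the resource already exists,
# so the corresponding write scope is assumed to be unnecessary.
PREEXISTING_ENV_TO_SCOPE = {
    "CARBIDE_VPC_ID": "vpc:write",
    "CARBIDE_SUBNET_ID": "subnet:write",
}


def effective_scopes_for_template(template, env=None):
    """Return the scopes a template needs given any pre-existing resources."""
    env = os.environ if env is None else env
    scopes = set(TEMPLATE_REQUIRED_SCOPES[template])
    for var, scope in PREEXISTING_ENV_TO_SCOPE.items():
        if env.get(var):
            scopes.discard(scope)
    return scopes
```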

Pre-existing resource support in create/teardown stubs:
  CARBIDE_VPC_ID, CARBIDE_VPC_PREFIX_ID, CARBIDE_SUBNET_ID,
  CARBIDE_SSH_KEY_GROUP_ID, CARBIDE_OS_ID, CARBIDE_INSTANCE_ID.
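
One way a create/teardown stub pair might honour these variables is sketched below. The helpers `ensure_resource` and `teardown_resource` and the `owned` flag are hypothetical and are not taken from the stubs in this PR.

```python
def ensure_resource(env_var, env, create):
    """Reuse the resource named by env_var if set, otherwise create one."""
    existing = env.get(env_var)
    if existing:
        # Pre-existing resource: record that this run does not own it.
        return {"id": existing, "owned": False}
    return {"id": create(), "owned": True}


def teardown_resource(state, delete):
    """Delete the resource only if this run created it."""
    if not state["owned"]:
        return False
    delete(state["id"])
    return True
```

Tracking ownership in the saved state means a teardown stub cannot delete a VPC or subnet that an operator supplied by ID.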

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Register Carbide provider configs in PLATFORM_CONFIGS so test
coverage tracking knows which validations are used by the
Carbide provider (control-plane, network, image-registry, bm, iam).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Orchestrates Assisted Installer + Carbide for OpenShift deployment:
1. Create cluster in AI, collect iPXE discovery config
2. Create OperatingSystem in Carbide with iPXE config
3. Batch-create bare-metal instances with OS + InstanceType
4. Collect instance UUIDs + MACs, monitor host registration
5. Start and monitor OpenShift installation, collect kubeconfig
6. Install GPU Operator via OperatorHub

Supports pre-existing Carbide resources (VPC, subnet, Instance Type)
via env vars. Deprovision gated by TEARDOWN_ENABLED=true.
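
The teardown gate could be as small as the following sketch. Treating the value case-insensitively and rejecting anything other than `true` is an assumption, and the helper name `teardown_enabled` is invented.

```python
import os


def teardown_enabled(env=None):
    """Return True only when TEARDOWN_ENABLED is explicitly set to "true"."""
    env = os.environ if env is None else env
    # Assumption: surrounding whitespace and letter case are ignored.
    return env.get("TEARDOWN_ENABLED", "").strip().lower() == "true"
```

Defaulting to "disabled" keeps a long-running provisioned cluster safe from an accidental deprovision when the variable is unset.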

Usage:
  isvctl test run \
    -f isvctl/configs/templates/kaas.yaml \
    -f isvctl/configs/openshift/kaas-provision.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Factored cluster inventory collection into a shared script
(stubs/common/k8s-inventory.sh) that uses kubectl for all standard
K8s queries: nodes, GPUs, operator namespace, driver version,
RuntimeClass, GPU product. Any provider can source this script.

OpenShift setup.sh sources the shared script and adds oc-specific
queries (clusterversion, infrastructure name). The JSON output
matches the upstream kaas template contract.

Files:
  stubs/common/k8s-inventory.sh — shared kubectl inventory (130 lines)
  openshift/kaas.yaml — layered provider config
  openshift/kaas-overrides.yaml — site-specific tuning template
  stubs/openshift/kaas/setup.sh — sources shared + adds oc queries
  stubs/openshift/kaas/teardown.sh — test namespace cleanup

Usage:
  isvctl test run \
    -f isvctl/configs/templates/kaas.yaml \
    -f isvctl/configs/openshift/kaas.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Implements the IAM template for OpenShift using Red Hat Build of
Keycloak (rhbk-operator from redhat-operators catalog).

create_user.py (idempotent):
  Deploy RHBK operator, create Keycloak instance, configure realm
  + OIDC client, patch OAuth CR with IdP + ingress CA, create user

test_credentials.py:
  oc login with test user, verify identity, RBAC grant/deny checks

teardown.py:
  Delete test user from Keycloak + OpenShift identity objects.
  Keycloak stays deployed for reuse.

Usage:
  isvctl test run \
    -f isvctl/configs/templates/iam.yaml \
    -f isvctl/configs/openshift/iam.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Standalone config (not layered on infra network template) for
in-cluster GPU networking:

  - NVIDIA Network Operator: MOFED drivers, RDMA device plugin
  - SR-IOV Operator: VF provisioning for InfiniBand/RoCE
  - NetworkPolicy: OVN-Kubernetes enforcement (block/allow)
  - Multus: secondary network interface attachment
  - RDMA: device availability, GPUDirect nvidia_peermem
  - Multi-node NCCL: AllReduce over RDMA (2+ GPU nodes)

Graceful handling: missing operators report as failed checks,
NCCL test skips if fewer than 2 GPU nodes available.
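
The skip rule for the multi-node NCCL test can be sketched as below. The function name `plan_nccl_test` and the result dictionary shape are assumptions, not the format the stubs actually emit.

```python
def plan_nccl_test(gpu_nodes, min_nodes=2):
    """Decide whether the multi-node NCCL test should run or be skipped."""
    nodes = sorted(set(gpu_nodes))
    if len(nodes) < min_nodes:
        return {
            "status": "skipped",
            "reason": f"found {len(nodes)} GPU node(s), need at least {min_nodes}",
        }
    return {"status": "run", "nodes": nodes}
```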

Usage:
  isvctl test run -f isvctl/configs/openshift/network.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Standalone config for OpenShift Data Foundation validation:

  deploy_odf.py (idempotent):
    Install ODF operator from OperatorHub, discover local NVMe disks
    via LocalVolumeDiscovery, create StorageCluster, wait for Ready.
    Auto-detects worker nodes or uses ODF_STORAGE_NODES env var.

  verify_storage_classes.py:
    Verify CephRBD and CephFS StorageClasses are provisioned.

  test_pvc_binding.py:
    Create PVCs for both StorageClasses, verify they reach Bound.

  test_pod_mount.py:
    Mount RBD PVC in a pod, write data, read back, verify integrity.

  teardown.py:
    Clean up test PVCs and namespace. ODF stays deployed.
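
The write, read-back, and integrity step in test_pod_mount.py might follow a pattern like this sketch. It runs against any local directory. The real stub does this inside a pod with the RBD PVC mounted. The file name and the choice of SHA-256 are assumptions.

```python
import hashlib
import os


def write_and_verify(mount_path, size=1024 * 1024):
    """Write random bytes under mount_path, read them back, compare digests."""
    payload = os.urandom(size)
    expected = hashlib.sha256(payload).hexdigest()
    target = os.path.join(mount_path, "integrity-check.bin")

    with open(target, "wb") as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())  # push the data to the backing volume

    with open(target, "rb") as handle:
        actual = hashlib.sha256(handle.read()).hexdigest()

    os.remove(target)
    return expected == actual
```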

Usage:
  isvctl test run -f isvctl/configs/openshift/storage.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Implements the VM template for OpenShift using KubeVirt with GPU
passthrough via VFIO. Layers on templates/vm.yaml for validations.

launch_instance.py:
  Create VirtualMachine CR with GPU devices, DataVolume from cloud
  image URL, cloud-init SSH config. Wait for Running + SSH ready.
  Auto-detects GPU device name from node labels.

list_instances.py:
  List VirtualMachineInstances, verify target VM present.

reboot_instance.py:
  Restart via virtctl (or VMI delete fallback), wait for SSH,
  report uptime for reboot verification.

deploy_nim.py:
  Skipped by default. Placeholder for NIM-in-VM deployment.

teardown.py:
  Delete VM, DataVolume, and test namespace.
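
Uptime-based reboot verification, as mentioned for reboot_instance.py, can be reasoned about as follows: if the guest's uptime after the restart is shorter than the time elapsed since the restart was requested, the guest must have booted in between. The helper names below are hypothetical. The only format relied on is that the first field of `/proc/uptime` is the uptime in seconds.

```python
def parse_uptime(proc_uptime_text):
    """Return uptime in seconds from the contents of /proc/uptime."""
    return float(proc_uptime_text.split()[0])


def reboot_confirmed(uptime_after, seconds_since_request):
    """A reboot happened if the guest has been up for less time than has
    passed since the restart was requested."""
    return uptime_after < seconds_since_request
```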

Usage:
  VM_IMAGE_URL=<qcow2-url> VM_SSH_PUBKEY="ssh-ed25519 ..." \
    isvctl test run \
      -f isvctl/configs/templates/vm.yaml \
      -f isvctl/configs/openshift/vm.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Validates dynamic GPU node provisioning via Carbide Machine API:

  verify_machine_api.py:
    Check Machine API operator available, Carbide provider configured.

  create_machineset.py:
    Create GPU MachineSet with Carbide providerSpec (instance type,
    site ID). Configure MachineAutoscaler for scale-up tests.
    Wait for machines to provision and nodes to join.

  test_gpu_workload.py:
    Schedule a GPU pod on a MachineSet-provisioned node, verify
    nvidia-smi succeeds (proves GPU scheduling on dynamic nodes).

  test_scale_up.py:
    Increase MachineSet replicas, wait for new machines to provision
    and join as Ready nodes (proves Carbide create works).

  test_scale_down.py:
    Reduce replicas back to minimum, verify excess machines are
    deprovisioned (proves Carbide delete works).

  teardown.py:
    Delete test namespace. MachineSet deletion gated by
    TEARDOWN_ENABLED=true.
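
The "wait for machines to provision and nodes to join" steps all reduce to polling with a timeout. A generic sketch follows. `wait_for` is an invented helper, and the injectable `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.

```python
import time


def wait_for(predicate, timeout, interval=10.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns True or timeout seconds elapse."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A scale-up stub could pass a predicate that compares the number of Ready nodes against the requested replica count, and report a failed check when `wait_for` returns False.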

Usage:
  CARBIDE_INSTANCE_TYPE=<uuid> CARBIDE_SITE_ID=<uuid> \
    isvctl test run -f isvctl/configs/openshift/machineset.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Security (4 read-only checks):
  - SELinux: verify Enforcing on all nodes (oc debug)
  - Secure Boot: check mokutil --sb-state (warns if unsupported)
  - FIPS: check install-config and /proc/sys/crypto/fips_enabled
  - Compliance Operator: check scan results if operator installed

GPU Health (4 read-only checks):
  - DCGM Exporter: pods running, metrics service, ServiceMonitor
  - DCGM Metrics: scrape exporter, validate expected metric names
  - NHC Operator: NodeHealthCheck CRD and configs (if installed)
  - Prometheus: query Thanos for GPU metrics end-to-end

Both are standalone configs, no infrastructure changes.
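
For the DCGM Metrics check, validating metric names against a scrape could be done along these lines. The sketch parses the Prometheus text exposition format with a regular expression. The helper names and the list of expected metrics are assumptions. The config's actual expected list is not shown in this PR.

```python
import re

# A metric name starts the line and runs up to "{" or whitespace.
_METRIC_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def metric_names(exposition_text):
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = _METRIC_NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition_text, expected):
    """Return expected metric names that the scrape did not contain."""
    return sorted(set(expected) - metric_names(exposition_text))
```

An empty result from `missing_metrics` would mean every expected metric was present in the exporter output.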

Usage:
  isvctl test run -f isvctl/configs/openshift/security.yaml
  isvctl test run -f isvctl/configs/openshift/gpu-health.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
DRA (Dynamic Resource Allocation):
  - Verify DRA driver pods running and ResourceClass exists
  - Create ResourceClaim + pod, verify GPU allocation works
  - Check SCC compatibility with DRA driver pods
  Requires GPU Operator switched to DRA mode.

ComputeDomain (IMEX):
  - Verify ComputeDomain CRD exists
  - Create domain spanning GPU nodes, wait for IMEX channels
  - Tray allocation verification
  - Multi-job scheduling within domain boundaries
  - NCCL AllReduce over NVLink via IMEX channels
  Requires DRA mode + NVL72 NVLink fabric.

Usage:
  isvctl test run -f isvctl/configs/openshift/dra.yaml
  isvctl test run -f isvctl/configs/openshift/computedomain.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Read-only checks for ARM-specific platform features.
All tests skip gracefully on x86_64 clusters.

  arch_check.py:    Verify aarch64 architecture on nodes
  smmu_check.py:    SMMUv3 detection via dmesg + IOMMU groups
  gicv4_check.py:   GICv4.1 interrupt controller detection
  hugepage_check.py: 1Gi/2Mi hugepage capacity on nodes
  topology_check.py: NUMA topology, CPU model, GPU-NUMA affinity
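
Skipping gracefully on x86_64 clusters could be decided from node labels, as in the sketch below. It assumes the stubs read the standard `kubernetes.io/arch` node label, whose value for aarch64 nodes is `arm64`. The helper names and result shape are invented.

```python
def arm_checks_should_run(node_archs):
    """node_archs maps node name -> value of the kubernetes.io/arch label."""
    return any(arch == "arm64" for arch in node_archs.values())


def run_or_skip(node_archs, check):
    """Run check(arm_nodes) only when the cluster has arm64 nodes."""
    if not arm_checks_should_run(node_archs):
        return {"status": "skipped", "reason": "no arm64 nodes in cluster"}
    arm_nodes = sorted(n for n, a in node_archs.items() if a == "arm64")
    status = "passed" if check(arm_nodes) else "failed"
    return {"status": status, "nodes": arm_nodes}
```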

Usage:
  isvctl test run -f isvctl/configs/openshift/arm.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Deploys a second OpenShift cluster from the management cluster
using Cluster API Provider for Carbide:

  deploy_capi_provider.py:
    Install CAPI core + Carbide infrastructure provider (idempotent).

  provision_baremetal_hosts.py:
    Batch-create Carbide instances, register as BareMetalHost
    resources in the management cluster for CAPI consumption.

  create_hosted_realm.py:
    Create a Keycloak realm on the management cluster for the
    hosted cluster's OAuth configuration.

  deploy_hosted_cluster.py:
    Create CAPI Cluster + CarbideCluster CRs, wait for hosted
    cluster to be provisioned, extract kubeconfig.

  configure_hosted_cluster.py:
    Install GPU Operator and Network Operator on the hosted
    cluster via its kubeconfig.

  verify_hosted_cluster.py:
    Basic health checks: nodes Ready, GPU Operator running,
    GPU nodes detected. Outputs kubeconfig path for subsequent
    test runs.

  teardown.py:
    Delete Cluster CR, BareMetalHosts, Carbide instances,
    namespaces. Gated by TEARDOWN_ENABLED=true.

After provisioning, re-run validation tests targeting the hosted
cluster:
  KUBECONFIG=/tmp/ncp-hosted-kubeconfig \
    isvctl test run \
      -f isvctl/configs/templates/kaas.yaml \
      -f isvctl/configs/openshift/kaas.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Register all 12 OpenShift validation configs in PLATFORM_CONFIGS
for coverage tracking: kaas, iam, network, storage, vm, machineset,
security, gpu-health, dra, computedomain, arm, hosted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
@copy-pr-bot

copy-pr-bot Bot commented Mar 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
