feat: add OpenShift platform on Carbide #129
Draft
fabiendupont wants to merge 20 commits into
Override configs can now merge into base check lists instead of replacing
them entirely, by including a `{__merge__: true}` marker. Matching checks
are deep-merged by key, new checks are appended, and `"__remove__"` drops
a check. Without the marker, lists are replaced as before (backward compat).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
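The list-merge rules in this commit can be sketched in a few lines. This is a minimal illustration, not the project's merge engine: it assumes each check is a single-key mapping of check name to parameters and that the marker is a list element equal to `{"__merge__": True}`. The function names and the sample check names are hypothetical.

```python
from copy import deepcopy

MERGE_MARKER = "__merge__"  # marker key named in the commit message
REMOVE = "__remove__"       # sentinel named in the commit message


def deep_merge(base, override):
    """Recursively merge two dicts; values from override win."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


def is_marker(item):
    """True for the list element that switches on merge mode."""
    return isinstance(item, dict) and item.get(MERGE_MARKER) is True


def merge_check_lists(base, override):
    """Merge an override check list into a base check list.

    Assumed shape (hypothetical): every check is {name: params}.
    """
    if not any(is_marker(item) for item in override):
        return deepcopy(override)  # no marker: replace, as before

    merged = deepcopy(base)
    index = {next(iter(check)): i for i, check in enumerate(merged)}
    removed = set()
    for item in override:
        if is_marker(item):
            continue
        name, params = next(iter(item.items()))
        if params == REMOVE:
            removed.add(name)                      # drop the check
        elif name in index:
            i = index[name]                        # matching check: deep-merge
            merged[i] = {name: deep_merge(merged[i][name], params)}
        else:
            index[name] = len(merged)              # new check: append
            merged.append({name: deepcopy(params)})
    return [check for check in merged if next(iter(check)) not in removed]
```

Keeping the no-marker path as a plain replacement is what preserves the old behavior for existing override files.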
Keys set to "__remove__" are deleted during deep_merge, enabling
selective removal of inherited values in nested dicts (e.g., dropping
a single label from expected_labels without replacing the entire dict).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
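A rough sketch of the key-removal behavior, assuming the sentinel is compared by equality and that the inputs are plain nested dicts. The real `deep_merge` signature and edge-case handling are not shown in this PR text, so treat the details below as assumptions.

```python
from copy import deepcopy

REMOVE = "__remove__"  # sentinel string named in the commit message


def deep_merge(base: dict, override: dict) -> dict:
    """Merge override into a copy of base, deleting keys set to the sentinel."""
    result = deepcopy(base)
    for key, value in override.items():
        if value == REMOVE:
            # Drop the inherited key; a missing key is not an error.
            result.pop(key, None)
        elif isinstance(value, dict):
            # Recurse so sentinels in nested dicts are honored as well.
            current = result.get(key)
            result[key] = deep_merge(current if isinstance(current, dict) else {}, value)
        else:
            result[key] = deepcopy(value)
    return result
```

For example, overriding `{"expected_labels": {"zone": "__remove__"}}` drops only the `zone` label (a hypothetical label name) and leaves the other inherited labels in place.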
Templates now define WHAT to validate (checks with context variable
defaults), while provider configs define HOW to provision (commands
and stubs). This enables composable validation:
isvctl test run -f templates/kaas.yaml -f aws/eks.yaml
Key changes:
- Remove commands block from all 7 templates
- Replace hardcoded values with {{ context.X | default('Y') }}
- Template stubs in stubs/ remain as copy-paste starting points
- Existing self-contained provider configs (aws/) keep working
- Update README with layered usage documentation
The merge engine combines tests from the template with commands
from the provider. Context variables flow through Jinja2 rendering
into validation parameters at runtime.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
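To illustrate how a `{{ context.X | default('Y') }}` placeholder resolves, here is a stdlib-only stand-in. It is not Jinja2 and handles only this one expression shape; the project renders with Jinja2 itself. The `render` helper and the `node_count` variable are hypothetical.

```python
import re

# Matches only the expression shape used in the templates:
#   {{ context.NAME | default('VALUE') }}
_PLACEHOLDER = re.compile(
    r"\{\{\s*context\.(\w+)\s*\|\s*default\('([^']*)'\)\s*\}\}"
)


def render(text: str, context: dict) -> str:
    """Replace each placeholder with the context value, or its default."""
    def substitute(match: re.Match) -> str:
        name, fallback = match.group(1), match.group(2)
        return str(context.get(name, fallback))

    return _PLACEHOLDER.sub(substitute, text)


template_line = "min_nodes: {{ context.node_count | default('2') }}"
print(render(template_line, {}))                 # template default applies
print(render(template_line, {"node_count": 4}))  # provider override applies
```

The template therefore stays valid on its own, and a provider config only needs to set the context values it wants to change.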
Adds eks-layered.yaml that supplies only commands (Terraform stubs)
and context overrides, designed to pair with templates/kaas.yaml:
isvctl test run \
-f isvctl/configs/templates/kaas.yaml \
-f isvctl/configs/aws/eks-layered.yaml
The existing self-contained eks.yaml is unchanged (backward compat).
Adds 6 integration tests verifying:
- Templates have no commands block
- Layered merge produces both commands and tests
- Context overrides flow through
- Standalone eks.yaml still works
- Layered and standalone have the same validation check names
- All 7 templates are validation-only
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Implements the control-plane template for the Carbide provider using
carbidecli. Maps template concepts to Carbide resources:
- API health → tenant get, site list
- Access keys → SSH key group + SSH key CRUD
- Tenants → VPC CRUD
Includes:
- carbide/control-plane.yaml: layered provider config
- stubs/carbide/common/carbide.py: shared helper (run_carbide, state mgmt)
- stubs/carbide/control-plane/: 10 stub scripts matching template steps
Usage:
isvctl test run \
-f isvctl/configs/templates/control-plane.yaml \
-f isvctl/configs/carbide/control-plane.yaml
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Implements 3 upstream templates for the Carbide provider:
Network (carbide/network.yaml + 8 stubs):
VPC CRUD, subnet configuration, VPC isolation, NSG security rules,
connectivity and traffic validation via carbidecli.
Image Registry (carbide/image-registry.yaml + 6 stubs):
OperatingSystem CRUD, instance launch from OS image, install config
lifecycle. Validates Carbide's image management capabilities.
Bare Metal (carbide/bm.yaml + 7 stubs):
Instance launch/describe/list/reboot/teardown via carbidecli.
Reinstall is skipped (not supported). NIM deploy/teardown reuse the
shared template stubs.
All providers use the layered approach:
isvctl test run -f templates/<template>.yaml -f carbide/<provider>.yaml
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
IAM provider (carbide/iam.yaml + 3 stubs):
Validates API token, checks scope coverage for all templates,
proves write access via temp SSH key lifecycle.
Full Carbide API surface (common/carbide.py):
CARBIDE_API_RESOURCES maps all 24 resources from the OpenAPI spec
(bare-metal-manager-rest) with their operations and scope names.
CRUD library (common/resources.py):
Pre-configured CarbideResource instances for all resources:
site, vpc, vpc-prefix, subnet, nsg, ipblock, allocation, instance,
instance-type, machine, expected-machine, operating-system,
infiniband-partition, nvlink-logical-partition, nvlink-interface,
dpu-extension-service, sshkeygroup, sshkey, tenant, rack, tray,
sku, audit.
Dynamic scope calculation:
effective_scopes_for_template() reduces required scopes when
pre-existing resources are set via CARBIDE_*_ID env vars.
TEMPLATE_REQUIRED_SCOPES defines minimum scopes per template.
Pre-existing resource support in create/teardown stubs:
CARBIDE_VPC_ID, CARBIDE_VPC_PREFIX_ID, CARBIDE_SUBNET_ID,
CARBIDE_SSH_KEY_GROUP_ID, CARBIDE_OS_ID, CARBIDE_INSTANCE_ID.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
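The dynamic scope calculation might look roughly like the following. Only the names `effective_scopes_for_template`, `TEMPLATE_REQUIRED_SCOPES`, and the `CARBIDE_*_ID` variables come from the commit message; the scope strings, the `PREEXISTING_WAIVES` table, and the function signature are assumptions made for illustration.

```python
import os

# Hypothetical scope names; the real values live in common/carbide.py.
TEMPLATE_REQUIRED_SCOPES = {
    "network": {"vpc:read", "vpc:write", "subnet:read", "subnet:write"},
}

# Hypothetical table: scopes that are no longer needed once the
# corresponding resource is supplied through an env var.
PREEXISTING_WAIVES = {
    "CARBIDE_VPC_ID": {"vpc:write"},
    "CARBIDE_SUBNET_ID": {"subnet:write"},
}


def effective_scopes_for_template(template, env=None):
    """Return the scopes a template needs given pre-existing resources."""
    env = os.environ if env is None else env
    scopes = set(TEMPLATE_REQUIRED_SCOPES[template])
    for var, waived in PREEXISTING_WAIVES.items():
        if env.get(var):  # set and non-empty: the stub will not create it
            scopes -= waived
    return scopes
```

Passing `env` explicitly keeps the function easy to test without touching the process environment.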
Register Carbide provider configs in PLATFORM_CONFIGS so test coverage
tracking knows which validations are used by the Carbide provider
(control-plane, network, image-registry, bm, iam).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Orchestrates Assisted Installer + Carbide for OpenShift deployment:
1. Create cluster in AI, collect iPXE discovery config
2. Create OperatingSystem in Carbide with iPXE config
3. Batch-create bare-metal instances with OS + InstanceType
4. Collect instance UUIDs + MACs, monitor host registration
5. Start and monitor OpenShift installation, collect kubeconfig
6. Install GPU Operator via OperatorHub
Supports pre-existing Carbide resources (VPC, subnet, Instance Type)
via env vars. Deprovision gated by TEARDOWN_ENABLED=true.
Usage:
isvctl test run \
-f isvctl/configs/templates/kaas.yaml \
-f isvctl/configs/openshift/kaas-provision.yaml
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
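A teardown gate of the kind described above could be as small as this. The helper names are hypothetical, and the exact, case-sensitive comparison with `"true"` is an assumption; the stubs may parse the variable differently.

```python
import os


def teardown_enabled(env=None) -> bool:
    """True only when TEARDOWN_ENABLED is exactly "true" (assumed strict match)."""
    env = os.environ if env is None else env
    return env.get("TEARDOWN_ENABLED", "") == "true"


def run_teardown(delete_resources, env=None) -> str:
    """Call delete_resources() only when the gate is open."""
    if not teardown_enabled(env):
        return "skipped"  # leave provisioned resources in place for reuse
    delete_resources()
    return "deleted"
```

Defaulting to "skipped" means a forgotten variable never destroys a cluster by accident.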
Factored cluster inventory collection into a shared script
(stubs/common/k8s-inventory.sh) that uses kubectl for all standard
K8s queries: nodes, GPUs, operator namespace, driver version,
RuntimeClass, GPU product. Any provider sources this script.
OpenShift setup.sh sources the shared script and adds oc-specific
queries (clusterversion, infrastructure name). The JSON output
matches the upstream kaas template contract.
Files:
stubs/common/k8s-inventory.sh — shared kubectl inventory (130 lines)
openshift/kaas.yaml — layered provider config
openshift/kaas-overrides.yaml — site-specific tuning template
stubs/openshift/kaas/setup.sh — sources shared + adds oc queries
stubs/openshift/kaas/teardown.sh — test namespace cleanup
Usage:
isvctl test run \
-f isvctl/configs/templates/kaas.yaml \
-f isvctl/configs/openshift/kaas.yaml
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Implements the IAM template for OpenShift using Red Hat Build of
Keycloak (rhbk-operator from redhat-operators catalog).
create_user.py (idempotent):
Deploy RHBK operator, create Keycloak instance, configure realm
+ OIDC client, patch OAuth CR with IdP + ingress CA, create user
test_credentials.py:
oc login with test user, verify identity, RBAC grant/deny checks
teardown.py:
Delete test user from Keycloak + OpenShift identity objects.
Keycloak stays deployed for reuse.
Usage:
isvctl test run \
-f isvctl/configs/templates/iam.yaml \
-f isvctl/configs/openshift/iam.yaml
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Standalone config (not layered on infra network template) for
in-cluster GPU networking:
- NVIDIA Network Operator: MOFED drivers, RDMA device plugin
- SR-IOV Operator: VF provisioning for InfiniBand/RoCE
- NetworkPolicy: OVN-Kubernetes enforcement (block/allow)
- Multus: secondary network interface attachment
- RDMA: device availability, GPUDirect nvidia_peermem
- Multi-node NCCL: AllReduce over RDMA (2+ GPU nodes)
Graceful handling: missing operators report as failed checks,
NCCL test skips if fewer than 2 GPU nodes available.
Usage:
isvctl test run -f isvctl/configs/openshift/network.yaml
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Standalone config for OpenShift Data Foundation validation:
deploy_odf.py (idempotent):
Install ODF operator from OperatorHub, discover local NVMe disks
via LocalVolumeDiscovery, create StorageCluster, wait for Ready.
Auto-detects worker nodes or uses ODF_STORAGE_NODES env var.
verify_storage_classes.py:
Verify CephRBD and CephFS StorageClasses are provisioned.
test_pvc_binding.py:
Create PVCs for both StorageClasses, verify they reach Bound.
test_pod_mount.py:
Mount RBD PVC in a pod, write data, read back, verify integrity.
teardown.py:
Clean up test PVCs and namespace. ODF stays deployed.
Usage:
isvctl test run -f isvctl/configs/openshift/storage.yaml
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Implements the VM template for OpenShift using KubeVirt with GPU
passthrough via VFIO. Layers on templates/vm.yaml for validations.
launch_instance.py:
Create VirtualMachine CR with GPU devices, DataVolume from cloud
image URL, cloud-init SSH config. Wait for Running + SSH ready.
Auto-detects GPU device name from node labels.
list_instances.py:
List VirtualMachineInstances, verify target VM present.
reboot_instance.py:
Restart via virtctl (or VMI delete fallback), wait for SSH,
report uptime for reboot verification.
deploy_nim.py:
Skipped by default. Placeholder for NIM-in-VM deployment.
teardown.py:
Delete VM, DataVolume, and test namespace.
Usage:
VM_IMAGE_URL=<qcow2-url> VM_SSH_PUBKEY="ssh-ed25519 ..." \
isvctl test run \
-f isvctl/configs/templates/vm.yaml \
-f isvctl/configs/openshift/vm.yaml
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
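One way to use reported uptime for reboot verification is to compare it with the time elapsed since the restart was requested. This is a sketch of that idea only; how the vm template actually evaluates the value is not stated here, and the function names and the `slack` parameter are hypothetical.

```python
def parse_uptime(proc_uptime_text: str) -> float:
    """Return uptime in seconds from the contents of /proc/uptime.

    The first field of /proc/uptime is seconds since boot.
    """
    return float(proc_uptime_text.split()[0])


def rebooted(uptime_seconds: float, seconds_since_request: float,
             slack: float = 5.0) -> bool:
    """A VM that really restarted cannot have been up longer than the
    time elapsed since the restart was requested (plus clock slack)."""
    return uptime_seconds <= seconds_since_request + slack
```

For instance, an uptime of 35 seconds measured 60 seconds after the restart request indicates a reboot, while an uptime of a full day does not.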
Validates dynamic GPU node provisioning via Carbide Machine API:
verify_machine_api.py:
Check Machine API operator available, Carbide provider configured.
create_machineset.py:
Create GPU MachineSet with Carbide providerSpec (instance type,
site ID). Configure MachineAutoscaler for scale-up tests.
Wait for machines to provision and nodes to join.
test_gpu_workload.py:
Schedule a GPU pod on a MachineSet-provisioned node, verify
nvidia-smi succeeds (proves GPU scheduling on dynamic nodes).
test_scale_up.py:
Increase MachineSet replicas, wait for new machines to provision
and join as Ready nodes (proves Carbide create works).
test_scale_down.py:
Reduce replicas back to minimum, verify excess machines are
deprovisioned (proves Carbide delete works).
teardown.py:
Delete test namespace. MachineSet deletion gated by
TEARDOWN_ENABLED=true.
Usage:
CARBIDE_INSTANCE_TYPE=<uuid> CARBIDE_SITE_ID=<uuid> \
isvctl test run -f isvctl/configs/openshift/machineset.yaml
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Security (4 read-only checks):
- SELinux: verify Enforcing on all nodes (oc debug)
- Secure Boot: check mokutil --sb-state (warns if unsupported)
- FIPS: check install-config and /proc/sys/crypto/fips_enabled
- Compliance Operator: check scan results if operator installed
GPU Health (4 read-only checks):
- DCGM Exporter: pods running, metrics service, ServiceMonitor
- DCGM Metrics: scrape exporter, validate expected metric names
- NHC Operator: NodeHealthCheck CRD and configs (if installed)
- Prometheus: query Thanos for GPU metrics end-to-end
Both are standalone configs, no infrastructure changes.
Usage:
isvctl test run -f isvctl/configs/openshift/security.yaml
isvctl test run -f isvctl/configs/openshift/gpu-health.yaml
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
DRA (Dynamic Resource Allocation):
- Verify DRA driver pods running and ResourceClass exists
- Create ResourceClaim + pod, verify GPU allocation works
- Check SCC compatibility with DRA driver pods
Requires GPU Operator switched to DRA mode.
ComputeDomain (IMEX):
- Verify ComputeDomain CRD exists
- Create domain spanning GPU nodes, wait for IMEX channels
- Tray allocation verification
- Multi-job scheduling within domain boundaries
- NCCL AllReduce over NVLink via IMEX channels
Requires DRA mode + NVL72 NVLink fabric.
Usage:
isvctl test run -f isvctl/configs/openshift/dra.yaml
isvctl test run -f isvctl/configs/openshift/computedomain.yaml
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Read-only checks for ARM-specific platform features. All tests skip
gracefully on x86_64 clusters.
arch_check.py:
Verify aarch64 architecture on nodes
smmu_check.py:
SMMUv3 detection via dmesg + IOMMU groups
gicv4_check.py:
GICv4.1 interrupt controller detection
hugepage_check.py:
1Gi/2Mi hugepage capacity on nodes
topology_check.py:
NUMA topology, CPU model, GPU-NUMA affinity
Usage:
isvctl test run -f isvctl/configs/openshift/arm.yaml
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
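The "skip gracefully on x86_64" decision can be sketched as a pure function over the node architectures. Kubernetes reports architecture in the `kubernetes.io/arch` node label as `arm64` or `amd64`, while `uname -m` reports `aarch64` or `x86_64`, so the sketch accepts both ARM spellings. The function name, the returned message, and the choice to run when at least one ARM node exists are assumptions.

```python
# Both spellings of 64-bit ARM: node label value and `uname -m` output.
ARM_ARCH_NAMES = {"arm64", "aarch64"}


def arm_skip_reason(node_archs):
    """Return None if ARM checks should run, else a skip reason.

    node_archs: architecture strings collected from the cluster nodes.
    Assumption: the checks run when at least one node is ARM.
    """
    archs = {arch.strip().lower() for arch in node_archs}
    if archs & ARM_ARCH_NAMES:
        return None
    found = ", ".join(sorted(archs)) or "no nodes"
    return f"no ARM nodes found ({found}); skipping ARM checks"
```

Returning a reason string rather than a bare boolean lets the stub report why a check was skipped instead of marking it failed.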
Deploys a second OpenShift cluster from the management cluster
using Cluster API Provider for Carbide:
deploy_capi_provider.py:
Install CAPI core + Carbide infrastructure provider (idempotent).
provision_baremetal_hosts.py:
Batch-create Carbide instances, register as BareMetalHost
resources in the management cluster for CAPI consumption.
create_hosted_realm.py:
Create a Keycloak realm on the management cluster for the
hosted cluster's OAuth configuration.
deploy_hosted_cluster.py:
Create CAPI Cluster + CarbideCluster CRs, wait for hosted
cluster to be provisioned, extract kubeconfig.
configure_hosted_cluster.py:
Install GPU Operator and Network Operator on the hosted
cluster via its kubeconfig.
verify_hosted_cluster.py:
Basic health checks: nodes Ready, GPU Operator running,
GPU nodes detected. Outputs kubeconfig path for subsequent
test runs.
teardown.py:
Delete Cluster CR, BareMetalHosts, Carbide instances,
namespaces. Gated by TEARDOWN_ENABLED=true.
After provisioning, re-run validation tests targeting the hosted
cluster:
KUBECONFIG=/tmp/ncp-hosted-kubeconfig \
isvctl test run \
-f isvctl/configs/templates/kaas.yaml \
-f isvctl/configs/openshift/kaas.yaml
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Register all 12 OpenShift validation configs in PLATFORM_CONFIGS for
coverage tracking: kaas, iam, network, storage, vm, machineset,
security, gpu-health, dra, computedomain, arm, hosted.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Summary
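OpenShift platform validation on Carbide bare metal. 12 configs cover the full stack: kaas, iam, network, storage, vm, machineset, security, gpu-health, dra, computedomain, arm, hosted.
Test catalog entries for all 12 OpenShift validation configs.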
Test plan
🤖 Generated with Claude Code