Add Operation CRD and agent daemon operation controller with soft-reboot support #59
Introduce a unified action queue pattern for the agent daemon that processes both Machine CR updates and operation requests sequentially through a single worker goroutine. Operations are delivered via ConfigMaps labeled unbounded.io/agent-op=<machine-name> as a temporary shim until a dedicated Operation CRD is introduced.

- Refactor daemon to use Action-typed workqueue with discriminated dispatch
- Extract machine reconciler into reconciler_machine.go
- Add ConfigMap operation shim (opshim.go) and watch loop (opwatch.go)
- Implement soft-reboot: systemctl restart of nspawn service, avoiding the machinectl disable/enable cycle that breaks re-enablement
- Add kubectl unbounded machine soft-reboot command
- Move Redfish check from getMachine to runReboot where it belongs
- Make nspawn.conf [Files] section unconditional with /lib/modules bind
- Add RBAC for ConfigMap access (annotated as POC/temporary)
- Add tests for opshim, soft-restart reconciler, and machine reconciler
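To make the single-worker pattern described above concrete, here is a minimal sketch. It uses a plain buffered channel rather than a workqueue so that it stays self-contained, and the names `ActionMachineUpdate`, `runWorker`, and `dispatch` are hypothetical, not taken from the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// ActionKind discriminates what the worker should do with an Action.
// The constant names are illustrative assumptions.
type ActionKind int

const (
	ActionMachineUpdate ActionKind = iota // a Machine CR changed
	ActionSoftRestart                     // an operation request arrived
)

// Action is one unit of work on the shared queue.
type Action struct {
	Kind ActionKind
	Name string
}

// dispatch routes an Action to the matching handler by kind.
func dispatch(a Action) string {
	switch a.Kind {
	case ActionMachineUpdate:
		return "reconcile machine " + a.Name
	case ActionSoftRestart:
		return "soft-restart " + a.Name
	default:
		return "ignore " + a.Name
	}
}

// runWorker drains the queue on a single goroutine, so actions are
// handled strictly one at a time, in arrival order.
func runWorker(queue <-chan Action, handle func(Action)) *sync.WaitGroup {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for a := range queue {
			handle(a)
		}
	}()
	return &wg
}

func main() {
	queue := make(chan Action, 8)
	var handled []string
	wg := runWorker(queue, func(a Action) { handled = append(handled, dispatch(a)) })

	// Both producers (Machine watch and operation watch) feed the same queue.
	queue <- Action{Kind: ActionMachineUpdate, Name: "agent"}
	queue <- Action{Kind: ActionSoftRestart, Name: "agent"}
	close(queue)
	wg.Wait()

	for _, line := range handled {
		fmt.Println(line)
	}
}
```

Because only one goroutine consumes the queue, a Machine reconcile and a soft restart for the same machine can never run concurrently, which is the property the commit message relies on.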
Introduce a proper Operation custom resource (unbounded-kube.io/v1alpha3) for managing machine operations, replacing the temporary ConfigMap-based approach. The Operation CRD supports SoftReboot (agent-executed) and HardReboot (reserved for machina controller) types with status subresource, owner references for GC, and TTL-based cleanup of completed operations.

- Add Operation CRD types, deepcopy, and generated CRD YAML
- Rewrite agent operation watcher to watch Operation CRs (was ConfigMaps)
- Replace reconciler_softrestart with reconciler_operation dispatching by type
- Update action queue to use ActionOperation instead of ActionSoftRestart
- Rewrite kubectl soft-reboot to create Operation CR with owner ref and TTL
- Update RBAC from ConfigMap rules to operations and operations/status
- Add 7 operation reconciler tests covering all phases and TTL cleanup
- Remove opshim.go, opshim_test.go, reconciler_softrestart.go and its tests
Align the Operation CRD with the MachineOperation design from PR #46:

- Rename Operation CR to MachineOperation (shortName: mop)
- Rename spec.type to spec.operationName (enum-as-string)
- Rename SoftReboot/HardReboot to Reboot/PowerCycle plus Shutdown, PowerOff, PowerOn, RestartService placeholders
- Rename Completed phase to Complete
- Add spec.parameters map[string]string for operation arguments
- Add unbounded-kube.io/machine label on every MachineOperation for label-selector-based informer scoping in the agent
- Update agent opwatch to use label selector instead of client-side filter
- Update RBAC from operations to machineoperations
- Update kubectl soft-reboot, tests, and e2e
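The switch from a client-side filter to a label selector can be illustrated with two small helpers. This is a sketch under assumptions: `machineSelector` and `matchesMachine` are hypothetical names, and only the label key `unbounded-kube.io/machine` comes from the commit message. In a real agent the selector string would typically be passed as the label selector of the list/watch options so the API server does the filtering.

```go
package main

import "fmt"

// machineLabel is the label key named in the commit message.
const machineLabel = "unbounded-kube.io/machine"

// machineSelector builds an equality-based label selector that scopes a
// list/watch to the operations of one machine (server-side filtering).
func machineSelector(machine string) string {
	return machineLabel + "=" + machine
}

// matchesMachine is the equivalent client-side check that the selector
// replaces: every object is received, then filtered locally.
func matchesMachine(labels map[string]string, machine string) bool {
	v, ok := labels[machineLabel]
	return ok && v == machine
}

func main() {
	fmt.Println(machineSelector("agent"))
	fmt.Println(matchesMachine(map[string]string{machineLabel: "agent"}, "agent"))
	fmt.Println(matchesMachine(map[string]string{machineLabel: "other"}, "agent"))
}
```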
E2E: Kubernetes Version Upgrade via MachineConfiguration (v1.34.3 -> v1.35.1)
Cluster: AKS

Flow
Agent Logs
Controller Logs

Machine CR Status (after)
status:
  phase: Joining
  message: node update completed
  configuration:
    name: upgrade-test
    version: 1
    versionName: upgrade-test-v1
  conditions:
  - type: NodeUpdated
    status: "True"
    reason: Succeeded
    message: node update completed
  operations:
    repaveCounter: 2

MCV Status
spec:
  version: 1
  template:
    kubernetes:
      version: v1.35.1
      nodeLabels:
        kubernetes.azure.com/cluster: bahe-test-nodes
        kubernetes.azure.com/managed: "false"

Node (after)
Summary
kubectl plugin commands tested
…ller, agent MCV resolution, and kubectl config commands

Introduce a Deployment/ReplicaSet-style versioning model for machine configuration. MachineConfiguration acts as the mutable profile; edits automatically create or update MachineConfigurationVersion snapshots (immutable once deployed).

Agent changes:
- reconcileUpdateMachine resolves MCV from Machine.spec.configurationRef instead of reading config from Machine.Spec.Kubernetes/Agent directly
- Fails if configurationRef is missing (no fallback)
- Records applied MCV in Machine.status.configuration after success

kubectl unbounded config commands:
- config create: creates a MachineConfiguration with k8s version, agent image, node labels, taints, and update strategy
- config get: lists MachineConfigurations or shows detail for one
- config versions: lists MCVs for a MachineConfiguration
- config assign: sets configurationRef on a Machine with optional version pin

Also adds machineconfigurationversions RBAC to bootstrapper ClusterRole.

E2E validated on oc-vm3: v1.34.3 -> v1.35.1 upgrade via blue/green repave in ~16 seconds, node rejoined as Ready.
E2E: Node Delete as Repave Trigger
Validated the Node-deletion repave signal.

Resource Setup

MachineConfiguration:
apiVersion: unbounded-kube.io/v1alpha3
kind: MachineConfiguration
metadata:
  name: upgrade-test
spec:
  priority: 0
  revisionHistoryLimit: 10
  template:
    kubernetes:
      version: v1.35.1
      nodeLabels:
        kubernetes.azure.com/cluster: bahe-test-nodes
        kubernetes.azure.com/managed: "false"

MachineConfigurationVersion v1:
apiVersion: unbounded-kube.io/v1alpha3
kind: MachineConfigurationVersion
metadata:
  name: upgrade-test-v1
  labels:
    unbounded-kube.io/configuration: upgrade-test
spec:
  version: 1
  template:
    kubernetes:
      version: v1.35.1
      nodeLabels:
        kubernetes.azure.com/cluster: bahe-test-nodes
        kubernetes.azure.com/managed: "false"

MachineConfigurationVersion v2:
apiVersion: unbounded-kube.io/v1alpha3
kind: MachineConfigurationVersion
metadata:
  name: upgrade-test-v2
  labels:
    unbounded-kube.io/configuration: upgrade-test
spec:
  version: 2
  template:
    kubernetes:
      version: v1.34.3
      nodeLabels:
        kubernetes.azure.com/cluster: bahe-test-nodes
        kubernetes.azure.com/managed: "false"

Machine CR:
apiVersion: unbounded-kube.io/v1alpha3
kind: Machine
metadata:
  name: agent
spec:
  configurationRef:
    name: upgrade-test
    version: 2
  kubernetes:
    bootstrapTokenRef:
      name: bootstrap-token-ftbv20
    nodeLabels:
      kubernetes.azure.com/cluster: bahe-test-nodes
      kubernetes.azure.com/managed: "false"

Test 1: Node delete with no config drift
Assigned MCV v1 (v1.35.1, same as applied), deleted Node
Result: Agent detected Node deletion, bypassed operation counter check (forceRepave=true), but

Test 2: Node delete with version change (v1.35.1 -> v1.34.3)
Assigned MCV v2 (v1.34.3), deleted Node
Total repave time: ~10 seconds (22:49:24 -> 22:49:34). Blue/green: kube2 -> kube1.

Post-repave state
Machine status after repave:
status:
  phase: Joining
  message: node update completed
  configuration:
    name: upgrade-test
    version: 2
    versionName: upgrade-test-v2
  conditions:
  - type: NodeUpdated
    status: "True"
    reason: Succeeded
    message: node update completed
    observedGeneration: 12

Node re-registered with new version:

Agent watchers
On startup, agent now runs three concurrent watch loops:

RBAC
Added:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]

New files
description: OperationName is the operation to perform on the target
  machine.
enum:
- Reboot
The CRD definition is using the values I suggested (Reboot/PowerCycle), but I think the POC is using SoftRestart and HardRestart. I personally prefer Reboot/PowerCycle as I think they're clearer, but either way can we be consistent?
I kind of prefer the Soft/Hard Restart (or Reboot) myself. PowerCycle implies something very specific which may or may not be happening depending on the provider implementation below us.
OK, I'm certainly open to what others think is clearer!
I feel like the
But the config is actually assigned to multiple machines, not sure if putting it as "sub command" under machine causes confusion. @phealy wdyt?
# Conflicts:
#	hack/agent/e2e-kind/e2e.py
Align naming with PR #59 review feedback (phealy/plombardi89):

- OperationPowerCycle -> OperationHardReboot in API types and CRD
- softRestart -> softReboot in agent executor interface and reconciler
- Update all tests and log messages accordingly
E2E Test Results

| Test | Result | Duration | Details |
|---|---|---|---|
| Node-delete repave | PASS | ~10s | v1.34.3 → v1.35.1, kube1 → kube2, MCV v2 → v3 |
| HardReboot operation | PASS (expected fail) | <1s | Correctly rejected: "handled by machina controller, not the agent" |
| Soft Reboot operation | PASS | ~1s | kube2 soft rebooted, node re-registered Ready |
Command Flows
1. MachineConfiguration + MCV lifecycle
# Create a MachineConfiguration
kubectl apply -f - <<YAML
apiVersion: unbounded-kube.io/v1alpha3
kind: MachineConfiguration
metadata:
  name: upgrade-test
spec:
  template:
    kubernetes:
      version: "v1.34.3"
      nodeLabels:
        kubernetes.azure.com/cluster: bahe-test-nodes
YAML

# Create versioned snapshots (MCVs)
kubectl apply -f - <<YAML
apiVersion: unbounded-kube.io/v1alpha3
kind: MachineConfigurationVersion
metadata:
  name: upgrade-test-v1
  labels:
    unbounded-kube.io/configuration: upgrade-test
spec:
  version: 1
  template:
    kubernetes:
      version: "v1.34.3"
      nodeLabels:
        kubernetes.azure.com/cluster: bahe-test-nodes
YAML

# List configurations and versions
kubectl get machineconfigurations
kubectl get machineconfigurationversions

2. Assign a config version to a Machine
# Point Machine to a specific MCV
kubectl patch machine agent --type=merge \
-p '{"spec":{"configurationRef":{"name":"upgrade-test","version":3}}}'
# Verify assignment
kubectl get machine agent -o jsonpath='{.spec.configurationRef}'
kubectl get machine agent -o jsonpath='{.status.configuration}'

3. Trigger repave via Node delete (OnDelete strategy)
# Agent detects drift but waits for Node delete signal.
# Delete the Node to trigger repave:
kubectl delete node oc-vm3
# Agent detects deletion -> repaves to target MCV version.
# Node re-registers automatically after repave (~10-15s).
kubectl get node oc-vm3

4. MachineOperations
# Soft reboot (handled by in-VM agent)
kubectl apply -f - <<YAML
apiVersion: unbounded-kube.io/v1alpha3
kind: MachineOperation
metadata:
  name: reboot-1
  labels:
    unbounded-kube.io/machine: agent
spec:
  machineRef: agent
  operationName: Reboot
YAML

# HardReboot (rejected by agent - handled by machina controller)
kubectl apply -f - <<YAML
apiVersion: unbounded-kube.io/v1alpha3
kind: MachineOperation
metadata:
  name: hardreboot-1
  labels:
    unbounded-kube.io/machine: agent
spec:
  machineRef: agent
  operationName: HardReboot
YAML

# Check operation status
kubectl get machineoperations -o wide

5. kubectl unbounded plugin commands
# Create a configuration
kubectl unbounded config create my-config --k8s-version v1.35.1
# List configurations
kubectl unbounded config get
# List versions for a configuration
kubectl unbounded config versions upgrade-test
# Assign a version to a machine
kubectl unbounded config assign upgrade-test --version 3 --machine agent

Remove Shutdown, PowerOff, PowerOn, and RestartService from the OperationName enum. Agent now silently ignores operations it does not handle (leaving status untouched for the machina controller) instead of marking them Failed.
# Conflicts:
#	api/machina/v1alpha3/machine_types.go
#	cmd/agent/internal/daemon/daemon.go
#	cmd/agent/internal/daemon/update.go
#	cmd/agent/internal/phases/nodestart/persist_config.go
#	hack/agent/e2e-kind/e2e.py
…ersion CRDs

Port custom resource type definitions from hbc/agent-op-cr-poc:

- MachineOperation: discrete operations (Reboot, HardReboot) on machines
- MachineConfiguration: deployment-like config profiles with update strategies
- MachineConfigurationVersion: immutable versioned snapshots of configurations
- Machine CR additions: configurationRef, configuration status, NodeUpdated condition, and configuration version annotation
The reconciler requires spec.configurationRef to resolve a MachineConfigurationVersion. Update install_machine_crd to install MachineConfiguration and MachineConfigurationVersion CRDs, and update trigger_upgrade to create an MCV CR and set configurationRef on the Machine CR before bumping the repaveCounter.
Adopt CRD type definitions from #96 (MachineOperation, MachineConfiguration, MachineConfigurationVersion). Update implementations:

- Rename OperationName -> OperationKind, OperationReboot -> OperationSoftReboot
- Convert RegisterWithTaints from []string to []corev1.Taint in MCV overlay
- Add taint parse/format helpers for kubectl-unbounded
- Remove old operation_types.go (superseded by machineoperation_types.go)
- Add MachineConditionConfigurationPending constant
…tus update

- Skip reconciliation gracefully when Machine CR has no configurationRef (e.g. during initial bootstrap before configuration is assigned)
- Re-read Machine CR before final status updates to avoid resourceVersion conflicts from concurrent reconciliation triggered by Provisioning phase change events
The condition-setting code (condStatus, condReason, SetStatusCondition) was accidentally removed in a previous edit. Without it, the NodeUpdated condition stayed at InProgress even after a successful update, causing the e2e validation to fail.
Move pkg/agent/utilexec back to pkg/agent/internal/utilexec to restore proper encapsulation. To eliminate the cross-boundary import from cmd/agent/, expand the executor interface with machineRun and systemctlRestart methods and implement them on defaultExecutor with local exec helpers. Also restore nspawn.conf from origin/main.
Summary
Replace the temporary ConfigMap-based operation shim with a proper Operation custom resource (unbounded-kube.io/v1alpha3) for managing machine operations like soft-reboot.

Changes
- Add Operation CRD with SoftReboot and HardReboot types, status subresource, owner references for GC, and TTL-based cleanup
- Replace reconciler_softrestart with generic reconciler_operation that dispatches by type
- Rewrite the soft-reboot command to create an Operation CR with owner reference and TTL (default 300s), then watch status until completion
- Update RBAC to cover operations and operations/status verbs

E2E Validation
Validated on oc-vm3 (Ubuntu 24.04, Standard_D2as_v7):

Node returned to Ready after soft reboot completed successfully.