Add Operation CRD and agent daemon operation controller with soft-reboot support by bcho · Pull Request #59 · Azure/unbounded

bcho · 2026-04-20T21:55:40Z

Summary

Replace the temporary ConfigMap-based operation shim with a proper Operation custom resource (unbounded-kube.io/v1alpha3) for managing machine operations like soft-reboot.

Changes

Operation CRD: New cluster-scoped CR with SoftReboot and HardReboot types, status subresource, owner references for GC, and TTL-based cleanup
Agent daemon: Rewrite operation watcher to watch Operation CRs instead of ConfigMaps; replace reconciler_softrestart with generic reconciler_operation that dispatches by type
kubectl plugin: Rewrite soft-reboot command to create an Operation CR with owner reference and TTL (default 300s), then watch status until completion
RBAC: Replace ConfigMap rules with operations and operations/status verbs
Tests: 7 operation reconciler tests covering all phases, TTL cleanup, and edge cases

E2E Validation

Validated on oc-vm3 (Ubuntu 24.04, Standard_D2as_v7):

$ kubectl get machines -o wide
NAME    HOST   PHASE   K8S VERSION   AGE
agent                                14m

$ ./bin/kubectl-unbounded machine soft-reboot agent
  --> Soft-rebooting Machine agent...
  operation         agent-softreboot-1776734492
  --> Operation SoftReboot: agent-softreboot-1776734492 in progress...
  --> Operation SoftReboot: agent-softreboot-1776734492 completed
  ready

$ kubectl get operations -o wide
NAME                          MACHINE   TYPE         PHASE       AGE
agent-softreboot-1776734492   agent     SoftReboot   Completed   4m5s

$ kubectl get operations -o yaml
apiVersion: unbounded-kube.io/v1alpha3
kind: Operation
metadata:
  name: agent-softreboot-1776734492
  ownerReferences:
  - apiVersion: unbounded-kube.io/v1alpha3
    kind: Machine
    name: agent
    uid: d5fc240c-0b2a-4a3f-9468-1d2aabf691bb
spec:
  machineRef: agent
  ttlSecondsAfterFinished: 300
  type: SoftReboot
status:
  completedAt: "2026-04-21T01:21:33Z"
  phase: Completed
  startedAt: "2026-04-21T01:21:32Z"

Node returned to Ready after soft reboot completed successfully.

Introduce a unified action queue pattern for the agent daemon that processes both Machine CR updates and operation requests sequentially through a single worker goroutine. Operations are delivered via ConfigMaps labeled unbounded.io/agent-op=<machine-name> as a temporary shim until a dedicated Operation CRD is introduced.

- Refactor daemon to use Action-typed workqueue with discriminated dispatch
- Extract machine reconciler into reconciler_machine.go
- Add ConfigMap operation shim (opshim.go) and watch loop (opwatch.go)
- Implement soft-reboot: systemctl restart of nspawn service, avoiding the machinectl disable/enable cycle that breaks re-enablement
- Add kubectl unbounded machine soft-reboot command
- Move Redfish check from getMachine to runReboot where it belongs
- Make nspawn.conf [Files] section unconditional with /lib/modules bind
- Add RBAC for ConfigMap access (annotated as POC/temporary)
- Add tests for opshim, soft-restart reconciler, and machine reconciler
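
The single-worker queue described in this commit can be sketched roughly as follows. This version uses a plain buffered channel rather than a workqueue, and `ActionMachineUpdate`, `dispatch`, and `process` are illustrative names rather than the daemon's actual identifiers; only `ActionSoftRestart` appears elsewhere in this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// ActionKind discriminates what the worker should do with an Action.
type ActionKind string

const (
	ActionMachineUpdate ActionKind = "MachineUpdate" // hypothetical name
	ActionSoftRestart   ActionKind = "SoftRestart"
)

// Action is one unit of work; Source names the object that triggered it.
type Action struct {
	Kind   ActionKind
	Source string
}

// dispatch routes an Action to its handler based on Kind.
func dispatch(a Action) string {
	switch a.Kind {
	case ActionMachineUpdate:
		return "reconciled machine " + a.Source
	case ActionSoftRestart:
		return "soft-restarted for " + a.Source
	default:
		return "ignored unknown action " + string(a.Kind)
	}
}

// process feeds all actions through one worker goroutine, so handlers
// never run concurrently and results come back in enqueue order.
func process(actions []Action) []string {
	queue := make(chan Action, len(actions))
	var results []string
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for a := range queue {
			results = append(results, dispatch(a))
		}
	}()
	for _, a := range actions {
		queue <- a
	}
	close(queue)
	wg.Wait()
	return results
}

func main() {
	for _, line := range process([]Action{
		{Kind: ActionMachineUpdate, Source: "agent"},
		{Kind: ActionSoftRestart, Source: "agent-op-1"},
	}) {
		fmt.Println(line)
	}
}
```

Funnelling both event sources into one worker means a reboot can never overlap with a node update on the same machine, which is presumably the reason for the sequential design.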

bcho · 2026-04-21T00:33:33Z

Events:
  Type     Reason                   Age                    From             Message
  ----     ------                   ----                   ----             -------
  Normal   Starting                 2m4s                   kube-proxy
  Normal   Starting                 15s                    kube-proxy
  Normal   Starting                 11m                    kube-proxy
  Normal   NodeHasNoDiskPressure    6m11s (x8 over 11m)    kubelet          Node oc-vm3 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     6m11s (x8 over 11m)    kubelet          Node oc-vm3 status is now: NodeHasSufficientPID
  Normal   NodeHasSufficientMemory  6m11s (x8 over 11m)    kubelet          Node oc-vm3 status is now: NodeHasSufficientMemory
  Normal   NodeHasSufficientMemory  2m28s (x7 over 2m29s)  kubelet          Node oc-vm3 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    2m28s (x7 over 2m29s)  kubelet          Node oc-vm3 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     2m28s (x7 over 2m29s)  kubelet          Node oc-vm3 status is now: NodeHasSufficientPID
  Normal   RegisteredNode           2m24s                  node-controller  Node oc-vm3 event: Registered Node oc-vm3 in Controller
  Normal   NodeReady                113s                   kubelet          Node oc-vm3 status is now: NodeReady
  Normal   Starting                 18s                    kubelet          Starting kubelet.
  Normal   NodeHasSufficientMemory  18s (x2 over 18s)      kubelet          Node oc-vm3 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    18s (x2 over 18s)      kubelet          Node oc-vm3 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     18s (x2 over 18s)      kubelet          Node oc-vm3 status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  18s                    kubelet          Updated Node Allocatable limit across pods
  Warning  Rebooted                 18s                    kubelet          Node oc-vm3 has been rebooted, boot id: 583dc021-bcda-4f75-a08c-4d7156b8364e
  Warning  InvalidDiskCapacity      18s                    kubelet          invalid capacity 0 on image filesystem

Introduce a proper Operation custom resource (unbounded-kube.io/v1alpha3) for managing machine operations, replacing the temporary ConfigMap-based approach. The Operation CRD supports SoftReboot (agent-executed) and HardReboot (reserved for machina controller) types with status subresource, owner references for GC, and TTL-based cleanup of completed operations.

- Add Operation CRD types, deepcopy, and generated CRD YAML
- Rewrite agent operation watcher to watch Operation CRs (was ConfigMaps)
- Replace reconciler_softrestart with reconciler_operation dispatching by type
- Update action queue to use ActionOperation instead of ActionSoftRestart
- Rewrite kubectl soft-reboot to create Operation CR with owner ref and TTL
- Update RBAC from ConfigMap rules to operations and operations/status
- Add 7 operation reconciler tests covering all phases and TTL cleanup
- Remove opshim.go, opshim_test.go, reconciler_softrestart.go and its tests

Align the Operation CRD with the MachineOperation design from PR #46:

- Rename Operation CR to MachineOperation (shortName: mop)
- Rename spec.type to spec.operationName (enum-as-string)
- Rename SoftReboot/HardReboot to Reboot/PowerCycle plus Shutdown, PowerOff, PowerOn, RestartService placeholders
- Rename Completed phase to Complete
- Add spec.parameters map[string]string for operation arguments
- Add unbounded-kube.io/machine label on every MachineOperation for label-selector-based informer scoping in the agent
- Update agent opwatch to use label selector instead of client-side filter
- Update RBAC from operations to machineoperations
- Update kubectl soft-reboot, tests, and e2e

bcho · 2026-04-21T22:10:24Z

E2E: Kubernetes Version Upgrade via MachineConfiguration (v1.34.3 -> v1.35.1)

Cluster: AKS bahe-test-nodes (southcentralus, v1.34.3)
VM: oc-vm3 (Ubuntu 24.04, nspawn blue/green)

Flow

kubectl unbounded config create upgrade-test --kubernetes-version=v1.35.1 --node-labels=...
Controller auto-creates upgrade-test-v1 MCV
kubectl unbounded config assign upgrade-test agent --version=1
Bump spec.operations.repaveCounter to trigger repave
Agent resolves MCV, detects version drift, performs full blue/green node update

Agent Logs

[I] daemon starting [machine_cr=agent] [nspawn_machine=kube1] [applied_version=1.34.3]
[I] operation counter drift detected [current_version=1.34.3] [desired_version=v1.35.1] [mcv=upgrade-test-v1]
[I] starting node update [old_machine=kube1] [new_machine=kube2] [old_version=1.34.3] [new_version=1.35.1]
[I] pulling OCI image [image=ghcr.io/azure/agent-ubuntu2404:v20260409]
[I] downloading kubernetes binary [binary=kubelet] [url=https://dl.k8s.io/v1.35.1/bin/linux/amd64/kubelet]
[I] downloading kubernetes binary [binary=kubectl] [url=https://dl.k8s.io/v1.35.1/bin/linux/amd64/kubectl]
[I] downloading kubernetes binary [binary=kube-proxy] [url=https://dl.k8s.io/v1.35.1/bin/linux/amd64/kube-proxy]
[I] stopping machine [machine=kube1]
[I] kubelet is active [machine=kube2]
[I] removing machine rootfs [machine=kube1]
[I] node update completed [active_machine=kube2] [version=1.35.1]
[I] reconciliation completed [new_version=1.35.1] [mcv=upgrade-test-v1]

Controller Logs

INFO  creating initial MachineConfigurationVersion  {"name": "upgrade-test", "version": 1}

Machine CR Status (after)

status:
  phase: Joining
  message: node update completed
  configuration:
    name: upgrade-test
    version: 1
    versionName: upgrade-test-v1
  conditions:
  - type: NodeUpdated
    status: "True"
    reason: Succeeded
    message: node update completed
  operations:
    repaveCounter: 2

MCV Status

spec:
  version: 1
  template:
    kubernetes:
      version: v1.35.1
      nodeLabels:
        kubernetes.azure.com/cluster: bahe-test-nodes
        kubernetes.azure.com/managed: "false"

Node (after)

NAME     STATUS   ROLES    AGE     VERSION
oc-vm3   Ready    <none>   3h17m   v1.35.1

Summary

Total repave time: ~16 seconds (rootfs provision 13.4s + stop/start/cleanup 2.6s)
Blue/green swap: kube1 (v1.34.3) -> kube2 (v1.35.1)
Node rejoined cluster as Ready with v1.35.1
Machine status.configuration correctly records the applied MCV
RBAC fix: added machineconfigurationversions (get/list/watch) to bootstrapper ClusterRole

kubectl plugin commands tested

kubectl unbounded config create upgrade-test --kubernetes-version=v1.35.1 --node-labels=...
kubectl unbounded config get
kubectl unbounded config get upgrade-test
kubectl unbounded config versions upgrade-test
kubectl unbounded config assign upgrade-test agent --version=1

…ller, agent MCV resolution, and kubectl config commands

Introduce a Deployment/ReplicaSet-style versioning model for machine configuration. MachineConfiguration acts as the mutable profile; edits automatically create or update MachineConfigurationVersion snapshots (immutable once deployed).

Agent changes:
- reconcileUpdateMachine resolves MCV from Machine.spec.configurationRef instead of reading config from Machine.Spec.Kubernetes/Agent directly
- Fails if configurationRef is missing (no fallback)
- Records applied MCV in Machine.status.configuration after success

kubectl unbounded config commands:
- config create: creates a MachineConfiguration with k8s version, agent image, node labels, taints, and update strategy
- config get: lists MachineConfigurations or shows detail for one
- config versions: lists MCVs for a MachineConfiguration
- config assign: sets configurationRef on a Machine with optional version pin

Also adds machineconfigurationversions RBAC to bootstrapper ClusterRole.

E2E validated on oc-vm3: v1.34.3 -> v1.35.1 upgrade via blue/green repave in ~16 seconds, node rejoined as Ready.

bcho · 2026-04-21T22:59:54Z

E2E: Node Delete as Repave Trigger

Validated the Node-deletion repave signal on oc-vm3. The agent now watches the Kubernetes Node object by hostname and, when the Node is deleted, triggers a forced repave that bypasses the operation counter drift check.

Resource Setup

MachineConfiguration (upgrade-test):

apiVersion: unbounded-kube.io/v1alpha3
kind: MachineConfiguration
metadata:
  name: upgrade-test
spec:
  priority: 0
  revisionHistoryLimit: 10
  template:
    kubernetes:
      version: v1.35.1
      nodeLabels:
        kubernetes.azure.com/cluster: bahe-test-nodes
        kubernetes.azure.com/managed: "false"

MachineConfigurationVersion v1 (upgrade-test-v1) - initial version matching applied:

apiVersion: unbounded-kube.io/v1alpha3
kind: MachineConfigurationVersion
metadata:
  name: upgrade-test-v1
  labels:
    unbounded-kube.io/configuration: upgrade-test
spec:
  version: 1
  template:
    kubernetes:
      version: v1.35.1
      nodeLabels:
        kubernetes.azure.com/cluster: bahe-test-nodes
        kubernetes.azure.com/managed: "false"

MachineConfigurationVersion v2 (upgrade-test-v2) - target version for repave:

apiVersion: unbounded-kube.io/v1alpha3
kind: MachineConfigurationVersion
metadata:
  name: upgrade-test-v2
  labels:
    unbounded-kube.io/configuration: upgrade-test
spec:
  version: 2
  template:
    kubernetes:
      version: v1.34.3
      nodeLabels:
        kubernetes.azure.com/cluster: bahe-test-nodes
        kubernetes.azure.com/managed: "false"

Machine CR (agent) - assigned to MCV v2:

apiVersion: unbounded-kube.io/v1alpha3
kind: Machine
metadata:
  name: agent
spec:
  configurationRef:
    name: upgrade-test
    version: 2
  kubernetes:
    bootstrapTokenRef:
      name: bootstrap-token-ftbv20
    nodeLabels:
      kubernetes.azure.com/cluster: bahe-test-nodes
      kubernetes.azure.com/managed: "false"

Test 1: Node delete with no config drift

Assigned MCV v1 (v1.35.1, same as applied), deleted Node oc-vm3:

22:44:17 [I] Node deleted, enqueuing repave [watcher=node] [node=oc-vm3]
22:44:17 [I] Node deleted, forcing repave [action=NodeDeleted] [source=oc-vm3]
                [current_version=1.35.1] [desired_version=v1.35.1] [mcv=upgrade-test-v1]
22:44:17 [I] no config drift detected, skipping node update [action=NodeDeleted] [source=oc-vm3]
22:44:17 [I] reconciliation completed [new_version=1.35.1] [mcv=upgrade-test-v1] [action=NodeDeleted]

Result: Agent detected Node deletion, bypassed operation counter check (forceRepave=true), but updateNode correctly found no config drift and skipped the expensive repave. No unnecessary work.
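
The two-stage decision described here (a trigger check that forceRepave bypasses, followed by a config drift check that always runs) might be sketched as below. The `decide` and `sameVersion` helpers are hypothetical, the drift check compares only the Kubernetes version, and stripping a leading `v` is an assumption based on the log showing `current_version=1.35.1` matched against `desired_version=v1.35.1`.

```go
package main

import (
	"fmt"
	"strings"
)

type decision string

const (
	skipNoTrigger decision = "skip: no trigger"
	skipNoDrift   decision = "skip: no config drift"
	repave        decision = "repave"
)

// sameVersion compares versions while ignoring an optional leading "v".
// (Assumed normalization; the agent's real comparison is not shown in the PR.)
func sameVersion(current, desired string) bool {
	return strings.TrimPrefix(current, "v") == strings.TrimPrefix(desired, "v")
}

// decide applies the trigger check first, then the config drift check.
// forceRepave only bypasses the trigger check, never the drift check.
func decide(forceRepave, counterDrift bool, current, desired string) decision {
	if !forceRepave && !counterDrift {
		return skipNoTrigger
	}
	if sameVersion(current, desired) {
		return skipNoDrift
	}
	return repave
}

func main() {
	// Test 1: Node deleted, same version assigned.
	fmt.Println(decide(true, false, "1.35.1", "v1.35.1")) // skip: no config drift
	// Test 2: Node deleted, different version assigned.
	fmt.Println(decide(true, false, "1.35.1", "v1.34.3")) // repave
}
```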

Test 2: Node delete with version change (v1.35.1 -> v1.34.3)

Assigned MCV v2 (v1.34.3), deleted Node oc-vm3:

22:49:24 [I] Node deleted, enqueuing repave [watcher=node] [node=oc-vm3]
22:49:24 [I] Node deleted, forcing repave [action=NodeDeleted] [source=oc-vm3]
                [current_version=1.35.1] [desired_version=v1.34.3] [mcv=upgrade-test-v2]
22:49:24 [I] starting node update [old_machine=kube2] [new_machine=kube1]
                [old_version=1.35.1] [new_version=1.34.3]
22:49:24 [I] pulling OCI image [image=ghcr.io/azure/agent-ubuntu2404:v20260409]
22:49:26 [I] OCI image extraction complete                              (2.0s)
22:49:28 [I] downloaded kube binaries (v1.34.3)                         (2.0s)
22:49:30 [I] stopping machine [machine=kube2]
22:49:32 [I] [stop-node] completed                                      (1.9s)
22:49:32 [I] [start-nspawn-machine] started (kube1)
22:49:33 [I] kubelet is active [machine=kube1]
22:49:33 [I] removing machine rootfs [machine=kube2]
22:49:34 [I] node update completed [active_machine=kube1] [version=1.34.3]
22:49:34 [I] reconciliation completed [new_version=1.34.3] [mcv=upgrade-test-v2]

Total repave time: ~10 seconds (22:49:24 -> 22:49:34). Blue/green: kube2 -> kube1.

Post-repave state

Machine status after repave:

status:
  phase: Joining
  message: node update completed
  configuration:
    name: upgrade-test
    version: 2
    versionName: upgrade-test-v2
  conditions:
  - type: NodeUpdated
    status: "True"
    reason: Succeeded
    message: node update completed
    observedGeneration: 12

Node re-registered with new version:

NAME     STATUS   ROLES    AGE   VERSION   INTERNAL-IP
oc-vm3   Ready    <none>   9m    v1.34.3   10.1.0.4

Agent watchers

On startup, agent now runs three concurrent watch loops:

22:43:30 [I] daemon starting [machine_cr=agent] [nspawn_machine=kube2] [applied_version=1.35.1]
22:43:30 [I] Node watch starting [watcher=node] [node=oc-vm3]
22:43:30 [I] watching Node [name=oc-vm3]
22:43:30 [I] watching MachineOperation CRs [machineRef=agent]
22:43:30 [I] watching Machine CR [name=agent]

RBAC

Added to 07-bootstrapper-rbac.yaml:

- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]

New files

cmd/agent/internal/daemon/nodewatch.go - Node watch by hostname, enqueues ActionNodeDeleted on delete
cmd/agent/internal/daemon/nodewatch_test.go - 2 tests (delete enqueue, hostname error)
Updated reconciler.go dispatch for ActionNodeDeleted
Updated reconcileUpdateMachine with forceRepave bool parameter (skips operation counter drift check)
2 new reconciler tests (ForceRepaveSkipsDriftCheck, NoDrift_NormalPath)
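
A simplified view of the Node watch handler: only a delete event for the Node matching the agent's hostname enqueues an action. The event type strings mirror the standard Kubernetes watch event names, while `NodeEvent`, `Action`, and `handleNodeEvent` are hypothetical types written for this sketch rather than the contents of nodewatch.go.

```go
package main

import "fmt"

// EventType mirrors the Kubernetes watch event type names.
type EventType string

const (
	EventAdded    EventType = "ADDED"
	EventModified EventType = "MODIFIED"
	EventDeleted  EventType = "DELETED"
)

// NodeEvent is a hypothetical, trimmed-down watch event for a Node.
type NodeEvent struct {
	Type     EventType
	NodeName string
}

// Action is what the watcher hands to the daemon's queue.
type Action struct {
	Kind   string
	Source string
}

// handleNodeEvent enqueues a NodeDeleted action only when the watched
// Node (identified by hostname) is deleted; all other events are ignored.
func handleNodeEvent(ev NodeEvent, hostname string, enqueue func(Action)) {
	if ev.NodeName != hostname || ev.Type != EventDeleted {
		return
	}
	enqueue(Action{Kind: "NodeDeleted", Source: ev.NodeName})
}

func main() {
	var queue []Action
	enqueue := func(a Action) { queue = append(queue, a) }

	events := []NodeEvent{
		{Type: EventModified, NodeName: "oc-vm3"},
		{Type: EventDeleted, NodeName: "other-node"},
		{Type: EventDeleted, NodeName: "oc-vm3"},
	}
	for _, ev := range events {
		handleNodeEvent(ev, "oc-vm3", enqueue)
	}
	fmt.Println(queue) // [{NodeDeleted oc-vm3}]
}
```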

phealy · 2026-04-22T14:18:27Z

+                description: OperationName is the operation to perform on the target
+                  machine.
+                enum:
+                - Reboot


The CRD definition is using the values I suggested (Reboot/PowerCycle), but I think the POC is using SoftRestart and HardRestart. I personally prefer Reboot/PowerCycle as I think they're clearer, but either way, can we be consistent?

I kind of prefer the Soft/Hard Restart (or Reboot) myself. PowerCycle implies something very specific which may or may not be happening depending on the provider implementation below us.

OK, I'm certainly open to what others think is clearer!

plombardi89 · 2026-04-22T14:52:39Z

I feel like `unbounded config` should move to `unbounded machine config` or, if we want a top-level command, `unbounded machine-config`, to express the relationship better. The current name, `unbounded config`, makes me think it's a high-level system configuration.

bcho · 2026-04-22T17:04:37Z

> I feel like `unbounded config` should move to `unbounded machine config` or, if we want a top-level command, `unbounded machine-config`, to express the relationship better. The current name, `unbounded config`, makes me think it's a high-level system configuration.

But the config is actually assigned to multiple machines, so I'm not sure whether putting it as a subcommand under machine would cause confusion. @phealy wdyt?

# Conflicts: # hack/agent/e2e-kind/e2e.py

Align naming with PR #59 review feedback (phealy/plombardi89):

- OperationPowerCycle -> OperationHardReboot in API types and CRD
- softRestart -> softReboot in agent executor interface and reconciler
- Update all tests and log messages accordingly

bcho · 2026-04-22T17:37:44Z

E2E Test Results — `eacfce0` (PowerCycle→HardReboot / softRestart→softReboot rename)

All 3 tests passed on oc-vm3 with the latest agent binary.

Test	Result	Duration	Details
Node-delete repave	PASS	~10s	v1.34.3 → v1.35.1, kube1 → kube2, MCV v2 → v3
HardReboot operation	PASS (expected fail)	<1s	Correctly rejected: "handled by machina controller, not the agent"
Soft Reboot operation	PASS	~1s	kube2 soft rebooted, node re-registered Ready

Command Flows

1. MachineConfiguration + MCV lifecycle

# Create a MachineConfiguration
kubectl apply -f - <<YAML
apiVersion: unbounded-kube.io/v1alpha3
kind: MachineConfiguration
metadata:
  name: upgrade-test
spec:
  template:
    kubernetes:
      version: "v1.34.3"
      nodeLabels:
        kubernetes.azure.com/cluster: bahe-test-nodes
YAML

# Create versioned snapshots (MCVs)
kubectl apply -f - <<YAML
apiVersion: unbounded-kube.io/v1alpha3
kind: MachineConfigurationVersion
metadata:
  name: upgrade-test-v1
  labels:
    unbounded-kube.io/configuration: upgrade-test
spec:
  version: 1
  template:
    kubernetes:
      version: "v1.34.3"
      nodeLabels:
        kubernetes.azure.com/cluster: bahe-test-nodes
YAML

# List configurations and versions
kubectl get machineconfigurations
kubectl get machineconfigurationversions

2. Assign a config version to a Machine

# Point Machine to a specific MCV
kubectl patch machine agent --type=merge \
  -p '{"spec":{"configurationRef":{"name":"upgrade-test","version":3}}}'

# Verify assignment
kubectl get machine agent -o jsonpath='{.spec.configurationRef}'
kubectl get machine agent -o jsonpath='{.status.configuration}'

3. Trigger repave via Node delete (OnDelete strategy)

# Agent detects drift but waits for Node delete signal.
# Delete the Node to trigger repave:
kubectl delete node oc-vm3

# Agent detects deletion -> repaves to target MCV version.
# Node re-registers automatically after repave (~10-15s).
kubectl get node oc-vm3

4. MachineOperations

# Soft reboot (handled by in-VM agent)
kubectl apply -f - <<YAML
apiVersion: unbounded-kube.io/v1alpha3
kind: MachineOperation
metadata:
  name: reboot-1
  labels:
    unbounded-kube.io/machine: agent
spec:
  machineRef: agent
  operationName: Reboot
YAML

# HardReboot (rejected by agent - handled by machina controller)
kubectl apply -f - <<YAML
apiVersion: unbounded-kube.io/v1alpha3
kind: MachineOperation
metadata:
  name: hardreboot-1
  labels:
    unbounded-kube.io/machine: agent
spec:
  machineRef: agent
  operationName: HardReboot
YAML

# Check operation status
kubectl get machineoperations -o wide

5. kubectl unbounded plugin commands

# Create a configuration
kubectl unbounded config create my-config --k8s-version v1.35.1

# List configurations
kubectl unbounded config get

# List versions for a configuration
kubectl unbounded config versions upgrade-test

# Assign a version to a machine
kubectl unbounded config assign upgrade-test --version 3 --machine agent

Remove Shutdown, PowerOff, PowerOn, and RestartService from the OperationName enum. Agent now silently ignores operations it does not handle (leaving status untouched for the machina controller) instead of marking them Failed.

…plugin

# Conflicts: # api/machina/v1alpha3/machine_types.go # cmd/agent/internal/daemon/daemon.go # cmd/agent/internal/daemon/update.go # cmd/agent/internal/phases/nodestart/persist_config.go # hack/agent/e2e-kind/e2e.py

…ersion CRDs

Port custom resource type definitions from hbc/agent-op-cr-poc:

- MachineOperation: discrete operations (Reboot, HardReboot) on machines
- MachineConfiguration: deployment-like config profiles with update strategies
- MachineConfigurationVersion: immutable versioned snapshots of configurations
- Machine CR additions: configurationRef, configuration status, NodeUpdated condition, and configuration version annotation

The reconciler requires spec.configurationRef to resolve a MachineConfigurationVersion. Update install_machine_crd to install MachineConfiguration and MachineConfigurationVersion CRDs, and update trigger_upgrade to create an MCV CR and set configurationRef on the Machine CR before bumping the repaveCounter.

Adopt CRD type definitions from #96 (MachineOperation, MachineConfiguration, MachineConfigurationVersion). Update implementations:

- Rename OperationName -> OperationKind, OperationReboot -> OperationSoftReboot
- Convert RegisterWithTaints from []string to []corev1.Taint in MCV overlay
- Add taint parse/format helpers for kubectl-unbounded
- Remove old operation_types.go (superseded by machineoperation_types.go)
- Add MachineConditionConfigurationPending constant

…tus update

- Skip reconciliation gracefully when Machine CR has no configurationRef (e.g. during initial bootstrap before configuration is assigned)
- Re-read Machine CR before final status updates to avoid resourceVersion conflicts from concurrent reconciliation triggered by Provisioning phase change events

The condition-setting code (condStatus, condReason, SetStatusCondition) was accidentally removed in a previous edit. Without it, the NodeUpdated condition stayed at InProgress even after a successful update, causing the e2e validation to fail.

Move pkg/agent/utilexec back to pkg/agent/internal/utilexec to restore proper encapsulation. To eliminate the cross-boundary import from cmd/agent/, expand the executor interface with machineRun and systemctlRestart methods and implement them on defaultExecutor with local exec helpers. Also restore nspawn.conf from origin/main.

Add Machine CR watch-based trigger for agent daemon

fb02549

bcho force-pushed the hbc/agent-op-cr-poc branch 2 times, most recently from 12852aa to df6331a Compare April 20, 2026 22:54

bcho force-pushed the hbc/agent-op-cr-poc branch from df6331a to b412e35 Compare April 20, 2026 22:57

bcho changed the title ~~POC: ConfigMap-based operation controller with soft-reboot~~ Replace ConfigMap-based operation shim with Operation CRD Apr 21, 2026

bcho changed the title ~~Replace ConfigMap-based operation shim with Operation CRD~~ Add Operation CRD and agent daemon operation controller with soft-reboot support Apr 21, 2026

Add Node deletion as repave trigger for OnDelete update strategy

d913374

phealy reviewed Apr 22, 2026

bcho added 2 commits April 22, 2026 10:11

Merge remote-tracking branch 'origin/main' into hbc/agent-op-cr-poc

126144f

# Conflicts: # hack/agent/e2e-kind/e2e.py

bcho added 6 commits April 22, 2026 10:44

Fix lint: errcheck, capitalized errors, unnecessary type assertion

36d465b

Fix errcheck: properly handle fmt.Fprintf/Fprintln errors in kubectl …

92d6b56

…plugin

Merge remote-tracking branch 'origin/main' into hbc/agent-op-cr-poc

9e26b01

# Conflicts: # api/machina/v1alpha3/machine_types.go # cmd/agent/internal/daemon/daemon.go # cmd/agent/internal/daemon/update.go # cmd/agent/internal/phases/nodestart/persist_config.go # hack/agent/e2e-kind/e2e.py

Fix gofmt issues from utilexec package move

cf0d329

bcho mentioned this pull request Apr 28, 2026

Add MachineOperation, MachineConfiguration, and MachineConfigurationVersion CRDs #96

Merged

bcho added 8 commits April 28, 2026 11:54

Merge remote-tracking branch 'origin/main' into hbc/agent-op-cr-poc

7245f87

Fix import ordering in pkg/agent packages

9029d0b

Merge remote-tracking branch 'origin/main' into hbc/agent-op-cr-poc

6bd4a3a

Simplify executor to delegate to internal/executil

c8cbc8f

bcho wants to merge 23 commits into main from hbc/agent-op-cr-poc


3 participants
