Conversation
plombardi89 left a comment:
Mostly looks good. I'd make sure all the errors are annotated; traceable errors help with debugging when something goes wrong.
Addressed in a follow-up commit:

- Annotate bare error returns in goalstates/resolve.go (plombardi89)
- Add MachineConditionNodeUpdated const to api/v1alpha3 (jveski)
- Use apimeta.SetStatusCondition instead of manual condition loop (jveski)
- Set NodeUpdated condition alongside Provisioning phase update (jveski)
- Drop validate-machine-cr-created step after rejoin in e2e workflow
Pull request overview
Implements an unbounded-agent daemon that watches the Machine CR for the current node and reconciles the local nspawn machine to the desired state, plus supporting changes (applied-config integrity checks, systemd lifecycle, RBAC, and updated e2e flow).
Changes:
- Add a watch-driven agent daemon with async reconciliation (workqueue) and machine self-registration.
- Persist applied config + SHA-256 sidecar, and verify integrity on read.
- Wire daemon into start/reset flows, update RBAC + config paths, and extend agent e2e to cover upgrade + rejoin.
Reviewed changes
Copilot reviewed 46 out of 49 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| internal/provision/assets/unbounded-agent-uninstall.sh | Update uninstall cleanup for new /etc/unbounded/... config directories |
| images/host-ubuntu2404/assets/vendor-data.tmpl | Update cloud-init paths/env var to new agent config location |
| hack/agent/e2e-kind/run-local.sh | Convert e2e runner to single linear flow (join → upgrade → reset/rejoin) |
| hack/agent/e2e-kind/e2e.py | Add daemon validation + upgrade trigger/validation steps; remove old two-case flow |
| go.mod | Bump indirect golang.org/x/oauth2 version |
| go.sum | Update sums for golang.org/x/oauth2 bump |
| docs/static/img/agent-daemon-overview.svg | Add architecture diagram asset for daemon doc |
| docs/content/reference/agent/daemon.md | Add daemon design/architecture/reference documentation |
| deploy/machina/07-bootstrapper-rbac.yaml | Expand bootstrapper RBAC for daemon watch + status updates |
| cmd/kubectl-unbounded/app/machine_manual_bootstrap_test.go | Update expected config file paths in tests |
| cmd/kubectl-unbounded/app/assets/node-bootstrap/cloud-init.yaml | Update config file path + env var path in cloud-init asset |
| cmd/agent/internal/utilexec/exec.go | Add shared utilexec.MachineRun helper |
| cmd/agent/internal/phases/rootfs/provision.go | Extract composite rootfs provisioning task |
| cmd/agent/internal/phases/reset/nspawn.go | Add applied-config removal + composite machine cleanup task |
| cmd/agent/internal/phases/reset/daemon.go | Add reset task to stop/disable/remove daemon unit |
| cmd/agent/internal/phases/reset/cleanup.go | Update reset cleanup paths to new config directory |
| cmd/agent/internal/phases/nodestop/stop.go | Implement graceful node stop (stop services inside machine then stop nspawn) |
| cmd/agent/internal/phases/nodestop/cri.go | Remove unused/no-op stop tasks |
| cmd/agent/internal/phases/nodestart/wait_kubelet.go | Add kubelet active polling task for update verification |
| cmd/agent/internal/phases/nodestart/start.go | Extract composite node-start sequence (includes persisting applied config) |
| cmd/agent/internal/phases/nodestart/persist_config.go | Persist applied config + checksum sidecar after successful node start |
| cmd/agent/internal/phases/nodestart/nvidia.go | Switch to shared utilexec.MachineRun |
| cmd/agent/internal/phases/nodestart/nspawn.go | Switch to shared utilexec.MachineRun and remove local helper |
| cmd/agent/internal/phases/nodestart/machine_test.go | Remove tests for old RegisterMachine phase (moved to daemon) |
| cmd/agent/internal/phases/nodestart/machine.go | Remove old RegisterMachine phase implementation |
| cmd/agent/internal/phases/nodestart/kubelet.go | Switch to shared utilexec.MachineRun |
| cmd/agent/internal/phases/nodestart/cri.go | Switch to shared utilexec.MachineRun |
| cmd/agent/internal/phases/host/enable_daemon.go | Add task to install/enable/start systemd daemon unit |
| cmd/agent/internal/phases/host/assets/unbounded-agent-daemon.service | New systemd unit for unbounded-agent daemon |
| cmd/agent/internal/goalstates/resolve_test.go | Update tests to new goalstates package + exported ResolveOCIImage |
| cmd/agent/internal/goalstates/resolve.go | Add ResolveMachine goalstate resolution (rootfs + nodestart) |
| cmd/agent/internal/goalstates/hostkernel_other.go | Move to goalstates package; adjust error message |
| cmd/agent/internal/goalstates/hostkernel_linux.go | Move to goalstates package |
| cmd/agent/internal/goalstates/constants_test.go | Add tests for alternating machine + applied-config path helpers |
| cmd/agent/internal/goalstates/constants.go | Add new config dirs, daemon unit const, alternate machine + applied-config path helper |
| cmd/agent/internal/goalstates/checksum_test.go | Add checksum/sidecar behavior tests |
| cmd/agent/internal/goalstates/checksum.go | Implement SHA-256 sidecar + verification (missing sidecar tolerated) |
| cmd/agent/internal/daemon/update_test.go | Add drift tests; partial coverage around applied-config handling |
| cmd/agent/internal/daemon/update.go | Implement active-machine discovery, drift detection, and node update orchestration |
| cmd/agent/internal/daemon/scheme.go | Add scheme construction for core + Machine CR |
| cmd/agent/internal/daemon/register_test.go | Add tests for daemon machine self-registration |
| cmd/agent/internal/daemon/daemon.go | Implement watch loop + workqueue reconciliation + status/condition updates |
| cmd/agent/internal/cmd/start.go | Use new goalstate + composite tasks; enable daemon after initial provisioning |
| cmd/agent/internal/cmd/reset.go | Stop daemon first during reset |
| cmd/agent/internal/cmd/daemon.go | Add unbounded-agent daemon CLI subcommand |
| cmd/agent/internal/cmd/cmd.go | Register daemon subcommand |
| api/v1alpha3/machine_types.go | Add NodeUpdated Machine condition constant |
| .gitignore | Ignore hack/bin/ |
| .github/workflows/agent-e2e-kind.yaml | Update CI e2e workflow to linear flow including daemon + upgrade validation |
```go
func TestFindActiveMachine_Kube1(t *testing.T) {
	dir := t.TempDir()

	// Override the config dir for testing.
	origPath := goalstates.AgentConfigDir
	// We can't override the const, so we test the lower-level function
	// by writing directly and reading back.
	cfg := baseConfig()
	data, err := json.MarshalIndent(cfg, "", " ")
	require.NoError(t, err)

	configPath := filepath.Join(dir, "kube1-applied-config.json")
	require.NoError(t, os.WriteFile(configPath, data, 0o600))

	// Read it back to verify the format.
	readData, err := os.ReadFile(configPath)
	require.NoError(t, err)

	var readCfg provision.AgentConfig
	require.NoError(t, json.Unmarshal(readData, &readCfg))
	assert.Equal(t, cfg.MachineName, readCfg.MachineName)
	assert.Equal(t, cfg.Cluster.Version, readCfg.Cluster.Version)

	_ = origPath // Note: findActiveMachine uses the const, so this test
	// validates the serialization/deserialization roundtrip rather than
	// the full findActiveMachine flow (which requires root filesystem access).
}
```
TestFindActiveMachine_Kube1 doesn't exercise findActiveMachine at all (it just round-trips JSON), so the selection logic (kube1 vs kube2), checksum verification, and error paths are currently untested. Consider refactoring findActiveMachine to accept a base directory (or injecting filesystem ops) so the test can create tempdir files and assert the actual behavior.
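One possible shape for that refactor, with the search root injected as a parameter. This is a sketch: `findActiveMachineIn` is a hypothetical name, the file-name pattern is illustrative, and the repo's `ActiveMachine` and `provision.AgentConfig` types (plus the checksum verification the real function performs) are assumed to be in scope:

```go
// findActiveMachineIn searches baseDir instead of the fixed config dir so a
// test can point it at t.TempDir(). findActiveMachine would become a thin
// wrapper that passes the production directory, and tests could then cover
// kube1-vs-kube2 selection, checksum failures, and decode errors directly.
func findActiveMachineIn(baseDir string) (*ActiveMachine, error) {
	for _, name := range []string{"kube1", "kube2"} {
		path := filepath.Join(baseDir, name+"-applied-config.json")
		data, err := os.ReadFile(path)
		if errors.Is(err, os.ErrNotExist) {
			continue // this machine has no applied config; try the next
		}
		if err != nil {
			return nil, fmt.Errorf("read applied config %s: %w", path, err)
		}
		var cfg provision.AgentConfig
		if err := json.Unmarshal(data, &cfg); err != nil {
			return nil, fmt.Errorf("decode applied config %s: %w", path, err)
		}
		return &ActiveMachine{Name: name, Config: &cfg}, nil
	}
	return nil, nil // no active machine found
}
```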
```go
// Worker goroutine: processes items from the queue.
go runWorker(ctx, log, kubeClient, queue)
```
The daemon only reconciles when a watch event enqueues the Machine name. On startup, if spec.operations.* already drifts from status (or if the daemon was down during a counter bump), there may be no subsequent watch event and the drift will never be processed. Consider enqueueing machineName once after registerMachine (or doing an initial handleMachineEvent/GET+reconcile pass) before entering the watch loop so startup always converges state.
Suggested change:

```go
// Always perform one startup reconciliation so any pre-existing drift
// between spec.operations.* and status is converged even if no watch
// event is emitted after registration.
queue.Add(machineName)
```
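For context, the watch/worker split under discussion has roughly this shape. The wiring here is illustrative, not the PR's exact code: `handleMachineEvent` and `machineName` are from this PR, the queue is `k8s.io/client-go/util/workqueue`, and `watch` is `k8s.io/apimachinery/pkg/watch`:

```go
queue := workqueue.NewTypedRateLimitingQueue(
	workqueue.DefaultTypedControllerRateLimiter[string]())
defer queue.ShutDown()

// Worker: one reconciliation at a time; failures retry with backoff.
go func() {
	for {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}
		if err := handleMachineEvent(ctx, log, kubeClient, key); err != nil {
			queue.AddRateLimited(key) // exponential backoff retry
		} else {
			queue.Forget(key) // reset the rate limiter on success
		}
		queue.Done(key)
	}
}()

// Watch loop: never blocks on reconciliation work.
queue.Add(machineName) // startup reconciliation, per the suggestion above
for event := range watcher.ResultChan() {
	if event.Type == watch.Added || event.Type == watch.Modified {
		queue.Add(machineName) // dedupes if already queued
	}
}
```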
```go
// Set status to Provisioning before starting work.
if err := updateMachinePhase(ctx, c, machine, v1alpha3.MachinePhaseProvisioning, "agent daemon reconciling"); err != nil {
	log.Warn("failed to update phase to Provisioning", "error", err)
	// Continue with reconciliation even if status update fails.
}

// Execute the node update with the desired config.
if err := updateNode(ctx, log, active, desired); err != nil {
	// Update status to Failed.
	failMsg := fmt.Sprintf("node update failed: %v", err)
	if updateErr := updateMachineStatus(ctx, c, machine, v1alpha3.MachinePhaseFailed, failMsg, false); updateErr != nil {
		log.Warn("failed to update status after failure", "error", updateErr)
	}

	return fmt.Errorf("node update: %w", err)
}

// Acknowledge operation counters.
acknowledgeOperations(machine)

// Update status to Joining with success.
if err := updateMachineStatus(ctx, c, machine, v1alpha3.MachinePhaseJoining, "node update completed", true); err != nil {
	log.Warn("failed to update status after success", "error", err)
}
```
handleMachineEvent fetches the Machine once, then performs a long-running updateNode, and finally calls Status().Update on the same in-memory object. In practice the Machine status/resourceVersion is likely to change during reconciliation (from the daemon's own Provisioning update or from the controller), so the final status update can conflict and be dropped, leaving spec.operations > status.operations and causing repeated retries. Re-fetch (or use a status Patch with client.MergeFrom) and wrap status writes in retry.RetryOnConflict so phase/conditions and operation acknowledgements are reliably persisted.
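A sketch of that pattern, assuming a controller-runtime client and the PR's `v1alpha3.Machine` type; `setStatusWithRetry` is a hypothetical helper name, and `retry` is `k8s.io/client-go/util/retry`:

```go
func setStatusWithRetry(ctx context.Context, c client.Client, key client.ObjectKey,
	mutate func(m *v1alpha3.Machine)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Re-fetch on every attempt so the write targets the latest
		// resourceVersion rather than the object read before updateNode.
		var m v1alpha3.Machine
		if err := c.Get(ctx, key, &m); err != nil {
			return err
		}
		base := m.DeepCopy()
		mutate(&m) // set phase/conditions, acknowledge operation counters
		return c.Status().Patch(ctx, &m, client.MergeFrom(base))
	})
}
```

With `client.MergeFrom` the patch carries no resourceVersion, so most conflicts disappear outright; the `RetryOnConflict` wrapper covers the remaining cases (for example, if this were switched to `MergeFromWithOptimisticLock`).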
```go
for _, name := range []string{goalstates.NSpawnMachineKube1, goalstates.NSpawnMachineKube2} {
	path := goalstates.AppliedConfigPath(name)

	data, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		continue
	}
	if err != nil {
		return nil, fmt.Errorf("read applied config %s: %w", path, err)
	}

	// Verify the sidecar checksum before trusting the config data.
	checksumPath := goalstates.AppliedConfigChecksumPath(name)
	if err := goalstates.VerifyChecksum(data, checksumPath); err != nil {
		return nil, fmt.Errorf("verify applied config checksum for %s: %w", name, err)
	}

	// If the sidecar file is missing, log a warning so operators
	// know the integrity check was skipped.
	if _, statErr := os.Stat(checksumPath); errors.Is(statErr, os.ErrNotExist) {
		log.Warn("no checksum sidecar found, skipping integrity check",
			"config_path", path,
			"checksum_path", checksumPath,
		)
	}

	var cfg provision.AgentConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, fmt.Errorf("decode applied config %s: %w", path, err)
	}

	return &ActiveMachine{Name: name, Config: &cfg}, nil
}
```
findActiveMachine returns the first applied config it finds (kube1 before kube2). During an update there is a window where both applied configs can exist (new config persisted before old is cleaned up), and a crash/restart in that window would cause the daemon to pick the wrong active machine/config. Consider detecting the "both present" case and either selecting by file mtime / machinectl state, or returning an error that forces operator intervention.
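One way to make that window explicit rather than silently preferring kube1 is to refuse to guess when both configs exist. A sketch of the error-out option (selection by file mtime or `machinectl` state would be the non-failing alternatives):

```go
// Count how many applied configs exist before picking one.
var present []string
for _, name := range []string{goalstates.NSpawnMachineKube1, goalstates.NSpawnMachineKube2} {
	if _, err := os.Stat(goalstates.AppliedConfigPath(name)); err == nil {
		present = append(present, name)
	}
}
if len(present) > 1 {
	// Both configs exist, most likely a crash mid-update; force operator
	// intervention rather than reprovisioning the wrong machine.
	return nil, fmt.Errorf("multiple applied configs present (%v): cannot determine active machine", present)
}
```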
```python
# Patch the Machine CR to set the new version and repaveCounter = 1.
# The CRD requires spec.kubernetes.bootstrapTokenRef when
# spec.kubernetes is present, so we include a placeholder value.
patch = json.dumps({
    "spec": {
        "kubernetes": {
            "version": upgrade_version,
            "bootstrapTokenRef": {
                "name": "not-used-by-agent-daemon",
            },
        },
        "operations": {
            "repaveCounter": 1,
        },
    },
```
This merge patch overwrites spec.kubernetes.bootstrapTokenRef with a placeholder. That can inadvertently destroy the correct bootstrapTokenRef created by the daemon/machina and may affect any controller logic that relies on it. Since the CR already has bootstrapTokenRef, patch only the fields you need (version + operations.repaveCounter), and consider bumping repaveCounter relative to the current value instead of hard-coding it to 1 to make the step rerunnable.
```diff
 # 6. Remove nftables flush service and unbounded-kube config dir
 # -----------------------------------------------------------------
 echo "Removing nftables flush service..."
 systemctl disable --now nftables-flush.service 2>/dev/null || true
 rm -f /etc/systemd/system/nftables-flush.service
-rm -rf /etc/unbounded-kube
+rm -rf /etc/unbounded/kube
```
This uninstall step now removes only the new config directory (/etc/unbounded/kube). If the host was provisioned by an older version (which used /etc/unbounded-kube), those legacy files will be left behind after uninstall. Consider removing both the legacy and new paths for backward-compatible cleanup.
| echo "Removing agent binaries and configuration..." | ||
| rm -f /usr/local/bin/unbounded-agent | ||
| rm -f /usr/local/bin/unbounded-agent-install.sh | ||
| rm -f /usr/local/bin/unbounded-agent-uninstall.sh | ||
| rm -rf /etc/unbounded-agent | ||
| rm -rf /etc/unbounded/agent | ||
| rm -rf /tmp/unbounded-agent |
This uninstall step now removes only /etc/unbounded/agent. If the host was provisioned by an older version that used /etc/unbounded-agent, the legacy directory will remain after uninstall/reset, which can confuse operators and future runs. Consider removing both legacy and new agent config directories.
* **Replace gRPC daemon with Machine CR watch-based daemon**

  Replace the gRPC task-pull daemon with a watch-based daemon that monitors the Machine CR on the control plane and reconciles the local node to match the desired state.

  - Watch the Machine CR for spec drift (version, image) and operation counter drift (reimageCounter, rebootCounter)
  - Perform alternating nspawn machine updates (kube1/kube2) on drift
  - Update Machine CR status: phase, conditions (NodeUpdated), and acknowledge operation counters
  - Authenticate with the bootstrap token from the applied config (not the kubelet kubeconfig, which has nspawn-internal paths)
  - Add RBAC (ClusterRole + ClusterRoleBinding) for system:bootstrappers
  - Add e2e tests: daemon validation, version upgrade via reimageCounter patch, applied config verification
  - Fix double-reconciliation bug: re-GET the Machine CR before drift detection to avoid stale watch events

* **Refactor: extract shared phase helpers and consolidate machineRun**

  Extract rootfs.Provision, nodestart.StartNode, and nodestop.StopNode as shared composite tasks used by both the initial agent start and the node update flow. This removes duplicated phase orchestration from start.go and nodeupdate.Execute. Move the duplicated machineRun helper into utilexec.MachineRun so all packages (nodestart, nodestop, nodeupdate) share a single implementation. Replace the empty nodestop stubs (StopContainerd, StopKubelet) with a real StopNode task that gracefully stops services before nspawn teardown.

* **Refactor agent daemon: consolidate into daemon package and extract phase helpers**

  - Move the nodeupdate package to cmd/agent/internal/daemon and rename Execute to updateNode (unexported). The daemon package now owns the full watch loop, reconciliation, kube client, scheme, drift detection, and node update logic.
  - Slim cmd/daemon.go to just cobra wiring that calls daemon.Run(ctx, log). Delete cmd/scheme.go (moved to the daemon package).
  - Remove the Config field from the NodeStart goal state. PersistAppliedConfig and StartNode now take *provision.AgentConfig as an explicit parameter instead of smuggling it through the goal state struct.
  - Extract WaitForKubelet into phases/nodestart/wait_kubelet.go.
  - Extract PersistAppliedConfig into phases/nodestart/persist_config.go.
  - Add a reset.RemoveAppliedConfig task and a reset.CleanupMachine composite that combines RemoveNSpawnConfig + RemoveMachine + RemoveAppliedConfig. Replace the inline os.Remove in node update with the composite task.

* **Simplify e2e test flow to single linear sequence**

  Remove the two-case structure (pre-existing CR vs no CR with VM recreation) and replace it with a single linear flow: join, validate the self-registered CR, upgrade, reset, rejoin. This removes the need for VM recreation between test cases.

* **Move Machine CR registration from start command to daemon**

  The daemon now registers the Machine CR at startup before entering the watch loop, instead of the start command doing it as a separate phase. This ensures registration happens even on rejoin after reset, and keeps all Machine CR interaction in the daemon package. Remove the now-unused nodestart.RegisterMachine phase and its tests. Add tests for the daemon's registerMachine and buildMachineCR functions.

* **Move EnableDaemon into task list, add machines.target dep, unexport findActiveMachine**

  - Move EnableDaemon from a separate call into the Serial task list so all phases are in one place.
  - Add a machines.target dependency to the daemon systemd unit so it waits for the nspawn machine to be running before starting.
  - Unexport FindActiveMachine since it is only used within the daemon package.
  - Add polling to validate_machine_cr_created in e2e tests since the daemon now registers the Machine CR asynchronously after startup.

* **Fix implicit string concatenation flagged by CodeQL**

  Use explicit + concatenation for the multi-line shell command string in validate_upgrade to avoid the implicit-concatenation-in-list warning.

* **Fix lint: lowercase error string per Go conventions**

* **Update RBAC and daemon doc to match current architecture**

  Add the create verb to the machines resource in the agent RBAC (needed for daemon self-registration). Update daemon.md to accurately describe bootstrap token auth, the system:bootstrappers group, Machine CR registration at startup, operation counter drift as the sole reconciliation trigger, and the machines.target systemd dependency.

* **Address PR #37 review nits and drop post-rejoin CR check**

  - Annotate bare error returns in goalstates/resolve.go (plombardi89)
  - Add MachineConditionNodeUpdated const to api/v1alpha3 (jveski)
  - Use apimeta.SetStatusCondition instead of a manual condition loop (jveski)
  - Set the NodeUpdated condition alongside the Provisioning phase update (jveski)
  - Drop the validate-machine-cr-created step after rejoin in the e2e workflow

* **Decouple watch loop from reconciliation with async worker**

  Move reconciliation to a worker goroutine signalled via a buffered channel (capacity 1). The watch loop now performs a non-blocking send on each MODIFIED/ADDED event and immediately returns to draining the watch stream. This prevents backpressure on the API server's HTTP/2 connection when reconciliation takes time (rootfs provisioning ~15s). The worker calls handleMachineEvent, which re-GETs the Machine CR from the API server, so coalesced signals naturally pick up the latest state. Multiple events arriving during an in-flight reconciliation are merged into a single follow-up reconciliation.

* **Use client-go workqueue for async reconciliation**

  Replace the hand-rolled channel-based worker with client-go's TypedRateLimitingInterface workqueue. This is the standard Kubernetes controller building block and provides deduplication, rate limiting with exponential backoff on failures, and proper shutdown semantics. The watch loop calls queue.Add(machineName) on events; the workqueue deduplicates if the key is already queued or being processed. On reconciliation failure runWorker calls AddRateLimited for a backoff retry; on success it calls Forget to reset the rate limiter.

* **Add TODO for bootstrap token credential strategy in buildKubeClient**

* **Replace global NewKubeClient var with parameter injection**

* **Consolidate agent daemon RBAC into bootstrapper RBAC**

* **Fix flaky e2e: add site label to bootstrap token secret**

* **Stop and remove agent daemon systemd unit during reset**

* **Unify config dirs under /etc/unbounded/{kube,agent}**

* **Add SHA-256 sidecar checksum for applied config integrity**

  Write a .sha256 companion file alongside the applied config JSON to detect on-disk corruption (e.g. bitflips). Each file is written atomically via renameio; a missing sidecar is treated as a warning (older agent or crash between writes), while a present sidecar with a wrong digest returns ErrChecksumMismatch.

  - Write path: PersistAppliedConfig writes the checksum after the config
  - Read path: findActiveMachine verifies the checksum before trusting the data
  - Reset path: RemoveAppliedConfig cleans up the sidecar file
  - Tests: ComputeChecksum, VerifyChecksum (match/mismatch/missing/error)

* **Update daemon doc: consolidate sections, add SVG diagram**

  Combine Drift Detection and Applied Config Integrity into a single Applied Config and Drift Detection section that describes AgentConfig fields, operation counter triggers, and persistence with the integrity guard. Replace the ASCII diagram with an SVG following the existing style. Simplify the systemd and RBAC sections to prose descriptions. Call out bootstrap token auth as temporary.

* **Rename reimage to repave in agent daemon, e2e, and docs**

  Follow the API rename from ReimageCounter to RepaveCounter that landed in main via #36. Update Go field references, JSON patch fields in e2e tests, Python variables, CI comments, and documentation.

* **Consolidate daemonUnit const and remove reboot counter drift check**

* **Implement in-place update of nspawn machine**

  Add alternating (blue/green) nspawn machine update logic and a long-running agent daemon that registers the Machine CR at startup. The daemon discovers the active nspawn machine, builds a kube client from the applied config, ensures a Machine CR exists, then blocks until shutdown. The update logic (updateNode, hasDrift, findActiveMachine) is in place but the trigger mechanism is TBD. Removes the Machine CR watch loop, workqueue, status updates, and operations counter logic from the prior implementation. Also removes daemon-specific e2e steps and the design doc.

* **Add license header and StartLimitIntervalSec=0 to daemon unit file**
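Taken together, these commits leave repave counter drift as the daemon's only reconciliation trigger. In sketch form (the Go field names here are assumptions inferred from the `spec.operations.repaveCounter` and `status.operations.repaveCounter` JSON paths used elsewhere in this PR):

```go
// Reconcile only when spec has moved ahead of the acknowledged status.
func repaveCounterDrift(m *v1alpha3.Machine) bool {
	return m.Spec.Operations.RepaveCounter > m.Status.Operations.RepaveCounter
}
```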
Summary
Implemented a Kubernetes watch-based daemon that monitors the Machine CR on the control plane and reconciles the local node to match the desired state.
What changed
**Daemon** (`cmd/agent/internal/daemon/`)

- Watches the Machine CR via `client.NewWithWatch`, matched by `MachineName` from the applied config
- Authenticates with the bootstrap token (`system:bootstrappers` group), which avoids reading kubeconfig files from inside the nspawn machine (those contain nspawn-internal paths); a TODO marks this as temporary pending a proper agent credential strategy
- Uses repave counter drift (`spec.operations.repaveCounter > status.operations.repaveCounter`) to trigger reconciliation; the reboot counter is not handled by the daemon
- `updateNode` checks actual config drift (version, image) before performing the expensive rootfs reprovision
- On success: sets phase `Joining`, acknowledges the repave counter (spec -> status), sets the `NodeUpdated` condition to `True`/`Succeeded`
- On failure: sets phase `Failed`, sets the `NodeUpdated` condition to `False`/`Failed`
- While reconciling: sets phase `Provisioning`, sets the `NodeUpdated` condition to `False`/`InProgress`
- The kube client is injected (`kubeClientFunc` type) for testability

**Applied config integrity** (`cmd/agent/internal/goalstates/checksum.go`)

- SHA-256 sidecar written atomically alongside the applied config; verified on read, with a missing sidecar tolerated and a wrong digest returning `ErrChecksumMismatch`

**Phase helpers** (extracted from inline code)

- `rootfs.Provision(log, gs)` - composite rootfs provisioning task
- `nodestart.StartNode(log, gs, cfg)` - composite node start task
- `nodestop.StopNode(log, machineName)` - composite node stop task
- `utilexec.MachineRun` - nspawn machine command execution

**Goal state** (`cmd/agent/internal/goalstates/`)

- `ResolveMachine()` discovers the active nspawn machine and loads the applied config
- `MachineGoalState` provides machine name, config, and rootfs paths
- `DaemonUnit` const consolidates the systemd unit name in one place

**RBAC** (`deploy/machina/07-bootstrapper-rbac.yaml`)

- ClusterRole (`unbounded-bootstrapper-machine`) covering `machines`, plus get/update/patch on `machines/status`, bound to the `system:bootstrappers` group

**Systemd** (`unbounded-agent-daemon.service`)

- `After=network-online.target machines.target` ensures the daemon starts after nspawn machines are available

**Config directories**

- `/etc/unbounded/kube` and `/etc/unbounded/agent` (was `/etc/unbounded-kube` and `/etc/unbounded-agent`)

**Design doc** (`docs/content/reference/agent/daemon.md`)

**E2E test flow** (single linear sequence)

**Removed**

- `nodestart.RegisterMachine` phase (moved to daemon)
- `deploy/agent/01-rbac.yaml` (consolidated into bootstrapper RBAC)
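To make the integrity guard concrete, here is a minimal sketch of the sidecar scheme described under "Applied config integrity". The names `ComputeChecksum`, `VerifyChecksum`, and `ErrChecksumMismatch` appear in the commit notes, but the signatures here are assumptions, and the real write path is atomic via renameio:

```go
package goalstates

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"os"
	"strings"
)

var ErrChecksumMismatch = errors.New("applied config checksum mismatch")

// ComputeChecksum returns the hex-encoded SHA-256 digest of data.
func ComputeChecksum(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// VerifyChecksum compares data against the digest stored in the sidecar.
// A missing sidecar is tolerated (older agent or crash between writes);
// a present sidecar with the wrong digest returns ErrChecksumMismatch.
func VerifyChecksum(data []byte, sidecarPath string) error {
	want, err := os.ReadFile(sidecarPath)
	if errors.Is(err, os.ErrNotExist) {
		return nil
	}
	if err != nil {
		return fmt.Errorf("read checksum sidecar %s: %w", sidecarPath, err)
	}
	if strings.TrimSpace(string(want)) != ComputeChecksum(data) {
		return ErrChecksumMismatch
	}
	return nil
}
```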