
[WIP] Machine CR watch-based daemon#37

Closed
bcho wants to merge 3 commits into main from hbc/agent-controlplane-cr

Conversation


@bcho bcho commented Apr 14, 2026

Summary

Implemented a Kubernetes watch-based daemon that monitors the Machine CR on the control plane and reconciles the local node to match the desired state.

What changed

Daemon (cmd/agent/internal/daemon/)

  • Watch-based loop using client.NewWithWatch on the Machine CR matched by MachineName from the applied config
  • Async reconciliation via client-go workqueue (decouples watch events from reconciliation)
  • Authenticates via bootstrap token from the applied config (system:bootstrappers group) - avoids reading kubeconfig files from inside the nspawn machine (which contain nspawn-internal paths). A TODO marks this as temporary pending a proper agent credential strategy.
  • Registers a Machine CR at startup if one does not already exist (supports dynamic/manual-bootstrap environments)
  • Detects operation counter drift (spec.operations.repaveCounter > status.operations.repaveCounter) to trigger reconciliation. Reboot counter is not handled by the daemon.
  • Inside updateNode, checks actual config drift (version, image) before performing the expensive rootfs reprovision
  • On success: sets phase to Joining, acknowledges repave counter (spec -> status), sets NodeUpdated condition to True/Succeeded
  • On failure: sets phase to Failed, sets NodeUpdated condition to False/Failed
  • Before reconciliation: sets phase to Provisioning, sets NodeUpdated condition to False/InProgress
  • Re-GETs Machine CR from API server at start of each event handler to avoid stale watch events causing spurious drift
  • Kube client built via parameter injection (unexported kubeClientFunc type) for testability

Applied config integrity (cmd/agent/internal/goalstates/checksum.go)

  • SHA-256 sidecar checksum file written alongside applied config JSON
  • Verified on read; missing sidecar = warning (proceed), mismatch = error

Phase helpers (extracted from inline code)

  • rootfs.Provision(log, gs) - composite rootfs provisioning task
  • nodestart.StartNode(log, gs, cfg) - composite node start task
  • nodestop.StopNode(log, machineName) - composite node stop task
  • utilexec.MachineRun - nspawn machine command execution

Goal state (cmd/agent/internal/goalstates/)

  • ResolveMachine() discovers active nspawn machine and loads applied config
  • MachineGoalState provides machine name, config, and rootfs paths
  • DaemonUnit const consolidates the systemd unit name in one place

RBAC (deploy/machina/07-bootstrapper-rbac.yaml)

  • Agent daemon RBAC consolidated into the bootstrapper ClusterRole (unbounded-bootstrapper-machine)
  • Grants create/get/list/watch on machines, get/update/patch on machines/status
  • Bound to system:bootstrappers group

Systemd (unbounded-agent-daemon.service)

  • After=network-online.target machines.target ensures daemon starts after nspawn machines are available
  • Enabled in the start command's serial task list after initial provisioning
  • Stopped, disabled, and removed during reset (Step 1 before stopping nspawn machines)

Config directories

  • Unified under /etc/unbounded/kube and /etc/unbounded/agent (was /etc/unbounded-kube and /etc/unbounded-agent)

Design doc (docs/content/reference/agent/daemon.md)

  • Architecture SVG diagram, auth model, RBAC, systemd lifecycle, reconciliation flow

E2E test flow (single linear sequence)

  1. Start node without pre-created Machine CR
  2. Wait for node Ready
  3. Validate Machine CR (daemon self-registers it, polling up to 120s)
  4. Validate daemon is watching
  5. Trigger upgrade: patch Machine CR with new kube version + bump repaveCounter
  6. Validate upgrade completed (daemon performs alternating nspawn machine update)
  7. Wait for node Ready post-upgrade
  8. Validate workload scheduling
  9. Reset agent + delete Machine CR
  10. Rejoin (run agent again on same VM, no VM recreation)
  11. Wait for node Ready
  12. Validate daemon watching
  13. Validate workload

Removed

  • nodestart.RegisterMachine phase (moved to daemon)
  • deploy/agent/01-rbac.yaml (consolidated into bootstrapper RBAC)
  • Two-case e2e structure with VM recreation between cases

@bcho bcho changed the title from "[DONT REVIEW] Replace gRPC daemon with Machine CR watch-based daemon" to "[DONT REVIEW] Machine CR watch-based daemon" Apr 14, 2026
@bcho bcho force-pushed the hbc/agent-controlplane-cr branch from 966d3c3 to 526c54f Compare April 14, 2026 21:31
@bcho bcho force-pushed the hbc/agent-controlplane-cr branch 2 times, most recently from a3603db to 18429a5 Compare April 14, 2026 22:00
@bcho bcho changed the title from "[DONT REVIEW] Machine CR watch-based daemon" to "Machine CR watch-based daemon" Apr 15, 2026

@plombardi89 plombardi89 left a comment


Mostly looks good. I'd make sure all the errors are annotated; having a chain of traceable errors helps with debugging when something goes wrong.

bcho added a commit that referenced this pull request Apr 15, 2026
- Annotate bare error returns in goalstates/resolve.go (plombardi89)
- Add MachineConditionNodeUpdated const to api/v1alpha3 (jveski)
- Use apimeta.SetStatusCondition instead of manual condition loop (jveski)
- Set NodeUpdated condition alongside Provisioning phase update (jveski)
- Drop validate-machine-cr-created step after rejoin in e2e workflow
@bcho bcho marked this pull request as ready for review April 15, 2026 22:32
@bcho bcho requested review from a team and Copilot April 15, 2026 22:32

Copilot AI left a comment


Pull request overview

Implements an unbounded-agent daemon that watches the Machine CR for the current node and reconciles the local nspawn machine to the desired state, plus supporting changes (applied-config integrity checks, systemd lifecycle, RBAC, and updated e2e flow).

Changes:

  • Add a watch-driven agent daemon with async reconciliation (workqueue) and machine self-registration.
  • Persist applied config + SHA-256 sidecar, and verify integrity on read.
  • Wire daemon into start/reset flows, update RBAC + config paths, and extend agent e2e to cover upgrade + rejoin.

Reviewed changes

Copilot reviewed 46 out of 49 changed files in this pull request and generated 8 comments.

File Description
internal/provision/assets/unbounded-agent-uninstall.sh Update uninstall cleanup for new /etc/unbounded/... config directories
images/host-ubuntu2404/assets/vendor-data.tmpl Update cloud-init paths/env var to new agent config location
hack/agent/e2e-kind/run-local.sh Convert e2e runner to single linear flow (join → upgrade → reset/rejoin)
hack/agent/e2e-kind/e2e.py Add daemon validation + upgrade trigger/validation steps; remove old two-case flow
go.mod Bump indirect golang.org/x/oauth2 version
go.sum Update sums for golang.org/x/oauth2 bump
docs/static/img/agent-daemon-overview.svg Add architecture diagram asset for daemon doc
docs/content/reference/agent/daemon.md Add daemon design/architecture/reference documentation
deploy/machina/07-bootstrapper-rbac.yaml Expand bootstrapper RBAC for daemon watch + status updates
cmd/kubectl-unbounded/app/machine_manual_bootstrap_test.go Update expected config file paths in tests
cmd/kubectl-unbounded/app/assets/node-bootstrap/cloud-init.yaml Update config file path + env var path in cloud-init asset
cmd/agent/internal/utilexec/exec.go Add shared utilexec.MachineRun helper
cmd/agent/internal/phases/rootfs/provision.go Extract composite rootfs provisioning task
cmd/agent/internal/phases/reset/nspawn.go Add applied-config removal + composite machine cleanup task
cmd/agent/internal/phases/reset/daemon.go Add reset task to stop/disable/remove daemon unit
cmd/agent/internal/phases/reset/cleanup.go Update reset cleanup paths to new config directory
cmd/agent/internal/phases/nodestop/stop.go Implement graceful node stop (stop services inside machine then stop nspawn)
cmd/agent/internal/phases/nodestop/cri.go Remove unused/no-op stop tasks
cmd/agent/internal/phases/nodestart/wait_kubelet.go Add kubelet active polling task for update verification
cmd/agent/internal/phases/nodestart/start.go Extract composite node-start sequence (includes persisting applied config)
cmd/agent/internal/phases/nodestart/persist_config.go Persist applied config + checksum sidecar after successful node start
cmd/agent/internal/phases/nodestart/nvidia.go Switch to shared utilexec.MachineRun
cmd/agent/internal/phases/nodestart/nspawn.go Switch to shared utilexec.MachineRun and remove local helper
cmd/agent/internal/phases/nodestart/machine_test.go Remove tests for old RegisterMachine phase (moved to daemon)
cmd/agent/internal/phases/nodestart/machine.go Remove old RegisterMachine phase implementation
cmd/agent/internal/phases/nodestart/kubelet.go Switch to shared utilexec.MachineRun
cmd/agent/internal/phases/nodestart/cri.go Switch to shared utilexec.MachineRun
cmd/agent/internal/phases/host/enable_daemon.go Add task to install/enable/start systemd daemon unit
cmd/agent/internal/phases/host/assets/unbounded-agent-daemon.service New systemd unit for unbounded-agent daemon
cmd/agent/internal/goalstates/resolve_test.go Update tests to new goalstates package + exported ResolveOCIImage
cmd/agent/internal/goalstates/resolve.go Add ResolveMachine goalstate resolution (rootfs + nodestart)
cmd/agent/internal/goalstates/hostkernel_other.go Move to goalstates package; adjust error message
cmd/agent/internal/goalstates/hostkernel_linux.go Move to goalstates package
cmd/agent/internal/goalstates/constants_test.go Add tests for alternating machine + applied-config path helpers
cmd/agent/internal/goalstates/constants.go Add new config dirs, daemon unit const, alternate machine + applied-config path helper
cmd/agent/internal/goalstates/checksum_test.go Add checksum/sidecar behavior tests
cmd/agent/internal/goalstates/checksum.go Implement SHA-256 sidecar + verification (missing sidecar tolerated)
cmd/agent/internal/daemon/update_test.go Add drift tests; partial coverage around applied-config handling
cmd/agent/internal/daemon/update.go Implement active-machine discovery, drift detection, and node update orchestration
cmd/agent/internal/daemon/scheme.go Add scheme construction for core + Machine CR
cmd/agent/internal/daemon/register_test.go Add tests for daemon machine self-registration
cmd/agent/internal/daemon/daemon.go Implement watch loop + workqueue reconciliation + status/condition updates
cmd/agent/internal/cmd/start.go Use new goalstate + composite tasks; enable daemon after initial provisioning
cmd/agent/internal/cmd/reset.go Stop daemon first during reset
cmd/agent/internal/cmd/daemon.go Add unbounded-agent daemon CLI subcommand
cmd/agent/internal/cmd/cmd.go Register daemon subcommand
api/v1alpha3/machine_types.go Add NodeUpdated Machine condition constant
.gitignore Ignore hack/bin/
.github/workflows/agent-e2e-kind.yaml Update CI e2e workflow to linear flow including daemon + upgrade validation


Comment on lines +113 to +139
func TestFindActiveMachine_Kube1(t *testing.T) {
dir := t.TempDir()

// Override the config dir for testing.
origPath := goalstates.AgentConfigDir
// We can't override the const, so we test the lower-level function
// by writing directly and reading back.
cfg := baseConfig()
data, err := json.MarshalIndent(cfg, "", " ")
require.NoError(t, err)

configPath := filepath.Join(dir, "kube1-applied-config.json")
require.NoError(t, os.WriteFile(configPath, data, 0o600))

// Read it back to verify the format.
readData, err := os.ReadFile(configPath)
require.NoError(t, err)

var readCfg provision.AgentConfig
require.NoError(t, json.Unmarshal(readData, &readCfg))
assert.Equal(t, cfg.MachineName, readCfg.MachineName)
assert.Equal(t, cfg.Cluster.Version, readCfg.Cluster.Version)

_ = origPath // Note: findActiveMachine uses the const, so this test
// validates the serialization/deserialization roundtrip rather than
// the full findActiveMachine flow (which requires root filesystem access).
}

Copilot AI Apr 15, 2026


TestFindActiveMachine_Kube1 doesn't exercise findActiveMachine at all (it just round-trips JSON), so the selection logic (kube1 vs kube2), checksum verification, and error paths are currently untested. Consider refactoring findActiveMachine to accept a base directory (or injecting filesystem ops) so the test can create tempdir files and assert the actual behavior.

Copilot uses AI. Check for mistakes.

// Worker goroutine: processes items from the queue.
go runWorker(ctx, log, kubeClient, queue)


Copilot AI Apr 15, 2026


The daemon only reconciles when a watch event enqueues the Machine name. On startup, if spec.operations.* already drifts from status (or if the daemon was down during a counter bump), there may be no subsequent watch event and the drift will never be processed. Consider enqueueing machineName once after registerMachine (or doing an initial handleMachineEvent/GET+reconcile pass) before entering the watch loop so startup always converges state.

Suggested change
// Always perform one startup reconciliation so any pre-existing drift
// between spec.operations.* and status is converged even if no watch
// event is emitted after registration.
queue.Add(machineName)

Comment on lines +322 to +345
// Set status to Provisioning before starting work.
if err := updateMachinePhase(ctx, c, machine, v1alpha3.MachinePhaseProvisioning, "agent daemon reconciling"); err != nil {
log.Warn("failed to update phase to Provisioning", "error", err)
// Continue with reconciliation even if status update fails.
}

// Execute the node update with the desired config.
if err := updateNode(ctx, log, active, desired); err != nil {
// Update status to Failed.
failMsg := fmt.Sprintf("node update failed: %v", err)
if updateErr := updateMachineStatus(ctx, c, machine, v1alpha3.MachinePhaseFailed, failMsg, false); updateErr != nil {
log.Warn("failed to update status after failure", "error", updateErr)
}

return fmt.Errorf("node update: %w", err)
}

// Acknowledge operation counters.
acknowledgeOperations(machine)

// Update status to Joining with success.
if err := updateMachineStatus(ctx, c, machine, v1alpha3.MachinePhaseJoining, "node update completed", true); err != nil {
log.Warn("failed to update status after success", "error", err)
}

Copilot AI Apr 15, 2026


handleMachineEvent fetches the Machine once, then performs a long-running updateNode, and finally calls Status().Update on the same in-memory object. In practice the Machine status/resourceVersion is likely to change during reconciliation (from the daemon's own Provisioning update or from the controller), so the final status update can conflict and be dropped, leaving spec.operations > status.operations and causing repeated retries. Re-fetch (or use a status Patch with client.MergeFrom) and wrap status writes in retry.RetryOnConflict so phase/conditions and operation acknowledgements are reliably persisted.

Comment on lines +48 to +81
for _, name := range []string{goalstates.NSpawnMachineKube1, goalstates.NSpawnMachineKube2} {
path := goalstates.AppliedConfigPath(name)

data, err := os.ReadFile(path)
if errors.Is(err, os.ErrNotExist) {
continue
}

if err != nil {
return nil, fmt.Errorf("read applied config %s: %w", path, err)
}

// Verify the sidecar checksum before trusting the config data.
checksumPath := goalstates.AppliedConfigChecksumPath(name)
if err := goalstates.VerifyChecksum(data, checksumPath); err != nil {
return nil, fmt.Errorf("verify applied config checksum for %s: %w", name, err)
}

// If the sidecar file is missing, log a warning so operators
// know the integrity check was skipped.
if _, statErr := os.Stat(checksumPath); errors.Is(statErr, os.ErrNotExist) {
log.Warn("no checksum sidecar found, skipping integrity check",
"config_path", path,
"checksum_path", checksumPath,
)
}

var cfg provision.AgentConfig
if err := json.Unmarshal(data, &cfg); err != nil {
return nil, fmt.Errorf("decode applied config %s: %w", path, err)
}

return &ActiveMachine{Name: name, Config: &cfg}, nil
}

Copilot AI Apr 15, 2026


findActiveMachine returns the first applied config it finds (kube1 before kube2). During an update there is a window where both applied configs can exist (new config persisted before old is cleaned up), and a crash/restart in that window would cause the daemon to pick the wrong active machine/config. Consider detecting the "both present" case and either selecting by file mtime / machinectl state, or returning an error that forces operator intervention.

Comment on lines +1266 to +1280
# Patch the Machine CR to set the new version and repaveCounter = 1.
# The CRD requires spec.kubernetes.bootstrapTokenRef when
# spec.kubernetes is present, so we include a placeholder value.
patch = json.dumps({
"spec": {
"kubernetes": {
"version": upgrade_version,
"bootstrapTokenRef": {
"name": "not-used-by-agent-daemon",
},
},
"operations": {
"repaveCounter": 1,
},
},

Copilot AI Apr 15, 2026


This merge patch overwrites spec.kubernetes.bootstrapTokenRef with a placeholder. That can inadvertently destroy the correct bootstrapTokenRef created by the daemon/machina and may affect any controller logic that relies on it. Since the CR already has bootstrapTokenRef, patch only the fields you need (version + operations.repaveCounter), and consider bumping repaveCounter relative to the current value instead of hard-coding it to 1 to make the step rerunnable.

Comment on lines 96 to 102
# 6. Remove nftables flush service and unbounded-kube config dir
# -----------------------------------------------------------------
echo "Removing nftables flush service..."
systemctl disable --now nftables-flush.service 2>/dev/null || true
rm -f /etc/systemd/system/nftables-flush.service
rm -rf /etc/unbounded-kube
rm -rf /etc/unbounded/kube


Copilot AI Apr 15, 2026


This uninstall step now removes only the new config directory (/etc/unbounded/kube). If the host was provisioned by an older version (which used /etc/unbounded-kube), those legacy files will be left behind after uninstall. Consider removing both the legacy and new paths for backward-compatible cleanup.

Comment on lines 153 to 158
echo "Removing agent binaries and configuration..."
rm -f /usr/local/bin/unbounded-agent
rm -f /usr/local/bin/unbounded-agent-install.sh
rm -f /usr/local/bin/unbounded-agent-uninstall.sh
rm -rf /etc/unbounded-agent
rm -rf /etc/unbounded/agent
rm -rf /tmp/unbounded-agent

Copilot AI Apr 15, 2026


This uninstall step now removes only /etc/unbounded/agent. If the host was provisioned by an older version that used /etc/unbounded-agent, the legacy directory will remain after uninstall/reset, which can confuse operators and future runs. Consider removing both legacy and new agent config directories.

bcho added a commit that referenced this pull request Apr 17, 2026
* Replace gRPC daemon with Machine CR watch-based daemon

Replace the gRPC task-pull daemon with a watch-based daemon that
monitors the Machine CR on the control plane and reconciles the local
node to match the desired state.

- Watch Machine CR for spec drift (version, image) and operation
  counter drift (reimageCounter, rebootCounter)
- Perform alternating nspawn machine updates (kube1/kube2) on drift
- Update Machine CR status: phase, conditions (NodeUpdated), and
  acknowledge operation counters
- Authenticate with bootstrap token from applied config (not kubelet
  kubeconfig which has nspawn-internal paths)
- Add RBAC (ClusterRole + ClusterRoleBinding) for system:bootstrappers
- Add e2e tests: daemon validation, version upgrade via reimageCounter
  patch, applied config verification
- Fix double-reconciliation bug: re-GET Machine CR before drift
  detection to avoid stale watch events

* Refactor: extract shared phase helpers and consolidate machineRun

Extract rootfs.Provision, nodestart.StartNode, and nodestop.StopNode as
shared composite tasks used by both the initial agent start and the node
update flow. This removes duplicated phase orchestration from start.go
and nodeupdate.Execute.

Move the duplicated machineRun helper into utilexec.MachineRun so all
packages (nodestart, nodestop, nodeupdate) share a single implementation.
Replace the empty nodestop stubs (StopContainerd, StopKubelet) with a
real StopNode task that gracefully stops services before nspawn teardown.

* Refactor agent daemon: consolidate into daemon package and extract phase helpers

- Move nodeupdate package to cmd/agent/internal/daemon, rename Execute to
  updateNode (unexported). The daemon package now owns the full watch loop,
  reconciliation, kube client, scheme, drift detection, and node update logic.

- Slim cmd/daemon.go to just cobra wiring that calls daemon.Run(ctx, log).
  Delete cmd/scheme.go (moved to daemon package).

- Remove Config field from NodeStart goal state. PersistAppliedConfig and
  StartNode now take *provision.AgentConfig as an explicit parameter instead
  of smuggling it through the goal state struct.

- Extract WaitForKubelet into phases/nodestart/wait_kubelet.go.

- Extract PersistAppliedConfig into phases/nodestart/persist_config.go.

- Add reset.RemoveAppliedConfig task and reset.CleanupMachine composite
  that combines RemoveNSpawnConfig + RemoveMachine + RemoveAppliedConfig.
  Replace inline os.Remove in node update with the composite task.

* Simplify e2e test flow to single linear sequence

Remove the two-case structure (pre-existing CR vs no CR with VM
recreation) and replace with a single linear flow: join, validate
self-registered CR, upgrade, reset, rejoin. This removes the need
for VM recreation between test cases.

* Move Machine CR registration from start command to daemon

The daemon now registers the Machine CR at startup before entering the
watch loop, instead of the start command doing it as a separate phase.
This ensures registration happens even on rejoin after reset, and keeps
all Machine CR interaction in the daemon package.

Remove the now-unused nodestart.RegisterMachine phase and its tests.
Add tests for the daemon's registerMachine and buildMachineCR functions.

* Move EnableDaemon into task list, add machines.target dep, unexport findActiveMachine

- Move EnableDaemon from a separate call into the Serial task list so
  all phases are in one place.
- Add machines.target dependency to the daemon systemd unit so it waits
  for the nspawn machine to be running before starting.
- Unexport FindActiveMachine since it is only used within the daemon
  package.
- Add polling to validate_machine_cr_created in e2e tests since the
  daemon now registers the Machine CR asynchronously after startup.

* Fix implicit string concatenation flagged by CodeQL

Use explicit + concatenation for the multi-line shell command string
in validate_upgrade to avoid the implicit-concatenation-in-list
warning.

* Fix lint: lowercase error string per Go conventions

* Update RBAC and daemon doc to match current architecture

Add create verb to machines resource in agent RBAC (needed for daemon
self-registration). Update daemon.md to accurately describe bootstrap
token auth, system:bootstrappers group, Machine CR registration at
startup, operation counter drift as sole reconciliation trigger, and
machines.target systemd dependency.

* Address PR #37 review nits and drop post-rejoin CR check

- Annotate bare error returns in goalstates/resolve.go (plombardi89)
- Add MachineConditionNodeUpdated const to api/v1alpha3 (jveski)
- Use apimeta.SetStatusCondition instead of manual condition loop (jveski)
- Set NodeUpdated condition alongside Provisioning phase update (jveski)
- Drop validate-machine-cr-created step after rejoin in e2e workflow

* Decouple watch loop from reconciliation with async worker

Move reconciliation to a worker goroutine signalled via a buffered
channel (capacity 1). The watch loop now performs a non-blocking send
on each MODIFIED/ADDED event and immediately returns to draining the
watch stream. This prevents backpressure on the API server's HTTP/2
connection when reconciliation takes time (rootfs provisioning ~15s).

The worker calls handleMachineEvent which re-GETs the Machine CR from
the API server, so coalesced signals naturally pick up the latest state.
Multiple events arriving during an in-flight reconciliation are merged
into a single follow-up reconciliation.

* Use client-go workqueue for async reconciliation

Replace hand-rolled channel-based worker with client-go's
TypedRateLimitingInterface workqueue. This is the standard Kubernetes
controller building block and provides deduplication, rate limiting
with exponential backoff on failures, and proper shutdown semantics.

The watch loop calls queue.Add(machineName) on events; the workqueue
deduplicates if the key is already queued or being processed. On
reconciliation failure runWorker calls AddRateLimited for backoff
retry; on success it calls Forget to reset the rate limiter.

* Add TODO for bootstrap token credential strategy in buildKubeClient

* Replace global NewKubeClient var with parameter injection

* Consolidate agent daemon RBAC into bootstrapper RBAC

* Fix flaky e2e: add site label to bootstrap token secret

* Stop and remove agent daemon systemd unit during reset

* Unify config dirs under /etc/unbounded/{kube,agent}

* Add SHA-256 sidecar checksum for applied config integrity

Write a .sha256 companion file alongside the applied config JSON to
detect on-disk corruption (e.g. bitflips). Each file is written
atomically via renameio; a missing sidecar is treated as a warning
(older agent or crash between writes), while a present sidecar with
wrong digest returns ErrChecksumMismatch.

- Write path: PersistAppliedConfig writes checksum after config
- Read path: findActiveMachine verifies checksum before trusting data
- Reset path: RemoveAppliedConfig cleans up the sidecar file
- Tests: ComputeChecksum, VerifyChecksum (match/mismatch/missing/error)

* Update daemon doc: consolidate sections, add SVG diagram

Combine Drift Detection and Applied Config Integrity into a single
Applied Config and Drift Detection section that describes AgentConfig
fields, operation counter triggers, and persistence with integrity
guard. Replace ASCII diagram with SVG following existing style.
Simplify systemd and RBAC sections to prose descriptions. Call out
bootstrap token auth as temporary.

* Rename reimage to repave in agent daemon, e2e, and docs

Follow the API rename from ReimageCounter to RepaveCounter that landed
in main via #36. Update Go field references, JSON patch fields in e2e
tests, Python variables, CI comments, and documentation.

* Consolidate daemonUnit const and remove reboot counter drift check

* Implement in-place update of nspawn machine

Add alternating (blue/green) nspawn machine update logic and a long-running
agent daemon that registers the Machine CR at startup.

The daemon discovers the active nspawn machine, builds a kube client from
the applied config, ensures a Machine CR exists, then blocks until shutdown.
The update logic (updateNode, hasDrift, findActiveMachine) is in place but
the trigger mechanism is TBD.

Removes the Machine CR watch loop, workqueue, status updates, and
operations counter logic from the prior implementation. Also removes
daemon-specific e2e steps and design doc.

* Add license header and StartLimitIntervalSec=0 to daemon unit file
@bcho bcho force-pushed the hbc/agent-controlplane-cr branch from ea01f36 to 33b2386 Compare April 17, 2026 21:20
@bcho bcho force-pushed the hbc/agent-controlplane-cr branch from 33b2386 to fd42ce8 Compare April 17, 2026 21:31
@bcho bcho changed the title from "Machine CR watch-based daemon" to "[WIP] Machine CR watch-based daemon" Apr 20, 2026
@bcho bcho marked this pull request as draft April 20, 2026 16:41
@bcho bcho closed this Apr 28, 2026
@bcho bcho deleted the hbc/agent-controlplane-cr branch April 28, 2026 20:18
4 participants