Conversation
plombardi89 left a comment:
Mostly looks good. I'd make sure all the errors are annotated; traceable errors help with debugging when something goes wrong.
Addressed in a follow-up commit:

- Annotate bare error returns in goalstates/resolve.go (plombardi89)
- Add MachineConditionNodeUpdated const to api/v1alpha3 (jveski)
- Use apimeta.SetStatusCondition instead of manual condition loop (jveski)
- Set NodeUpdated condition alongside Provisioning phase update (jveski)
- Drop validate-machine-cr-created step after rejoin in e2e workflow
Pull request overview
Implements an unbounded-agent daemon that watches the Machine CR for the current node and reconciles the local nspawn machine to the desired state, plus supporting changes (applied-config integrity checks, systemd lifecycle, RBAC, and updated e2e flow).
Changes:
- Add a watch-driven agent daemon with async reconciliation (workqueue) and machine self-registration.
- Persist applied config + SHA-256 sidecar, and verify integrity on read.
- Wire daemon into start/reset flows, update RBAC + config paths, and extend agent e2e to cover upgrade + rejoin.
Reviewed changes
Copilot reviewed 46 out of 49 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| internal/provision/assets/unbounded-agent-uninstall.sh | Update uninstall cleanup for new /etc/unbounded/... config directories |
| images/host-ubuntu2404/assets/vendor-data.tmpl | Update cloud-init paths/env var to new agent config location |
| hack/agent/e2e-kind/run-local.sh | Convert e2e runner to single linear flow (join → upgrade → reset/rejoin) |
| hack/agent/e2e-kind/e2e.py | Add daemon validation + upgrade trigger/validation steps; remove old two-case flow |
| go.mod | Bump indirect golang.org/x/oauth2 version |
| go.sum | Update sums for golang.org/x/oauth2 bump |
| docs/static/img/agent-daemon-overview.svg | Add architecture diagram asset for daemon doc |
| docs/content/reference/agent/daemon.md | Add daemon design/architecture/reference documentation |
| deploy/machina/07-bootstrapper-rbac.yaml | Expand bootstrapper RBAC for daemon watch + status updates |
| cmd/kubectl-unbounded/app/machine_manual_bootstrap_test.go | Update expected config file paths in tests |
| cmd/kubectl-unbounded/app/assets/node-bootstrap/cloud-init.yaml | Update config file path + env var path in cloud-init asset |
| cmd/agent/internal/utilexec/exec.go | Add shared utilexec.MachineRun helper |
| cmd/agent/internal/phases/rootfs/provision.go | Extract composite rootfs provisioning task |
| cmd/agent/internal/phases/reset/nspawn.go | Add applied-config removal + composite machine cleanup task |
| cmd/agent/internal/phases/reset/daemon.go | Add reset task to stop/disable/remove daemon unit |
| cmd/agent/internal/phases/reset/cleanup.go | Update reset cleanup paths to new config directory |
| cmd/agent/internal/phases/nodestop/stop.go | Implement graceful node stop (stop services inside machine then stop nspawn) |
| cmd/agent/internal/phases/nodestop/cri.go | Remove unused/no-op stop tasks |
| cmd/agent/internal/phases/nodestart/wait_kubelet.go | Add kubelet active polling task for update verification |
| cmd/agent/internal/phases/nodestart/start.go | Extract composite node-start sequence (includes persisting applied config) |
| cmd/agent/internal/phases/nodestart/persist_config.go | Persist applied config + checksum sidecar after successful node start |
| cmd/agent/internal/phases/nodestart/nvidia.go | Switch to shared utilexec.MachineRun |
| cmd/agent/internal/phases/nodestart/nspawn.go | Switch to shared utilexec.MachineRun and remove local helper |
| cmd/agent/internal/phases/nodestart/machine_test.go | Remove tests for old RegisterMachine phase (moved to daemon) |
| cmd/agent/internal/phases/nodestart/machine.go | Remove old RegisterMachine phase implementation |
| cmd/agent/internal/phases/nodestart/kubelet.go | Switch to shared utilexec.MachineRun |
| cmd/agent/internal/phases/nodestart/cri.go | Switch to shared utilexec.MachineRun |
| cmd/agent/internal/phases/host/enable_daemon.go | Add task to install/enable/start systemd daemon unit |
| cmd/agent/internal/phases/host/assets/unbounded-agent-daemon.service | New systemd unit for unbounded-agent daemon |
| cmd/agent/internal/goalstates/resolve_test.go | Update tests to new goalstates package + exported ResolveOCIImage |
| cmd/agent/internal/goalstates/resolve.go | Add ResolveMachine goalstate resolution (rootfs + nodestart) |
| cmd/agent/internal/goalstates/hostkernel_other.go | Move to goalstates package; adjust error message |
| cmd/agent/internal/goalstates/hostkernel_linux.go | Move to goalstates package |
| cmd/agent/internal/goalstates/constants_test.go | Add tests for alternating machine + applied-config path helpers |
| cmd/agent/internal/goalstates/constants.go | Add new config dirs, daemon unit const, alternate machine + applied-config path helper |
| cmd/agent/internal/goalstates/checksum_test.go | Add checksum/sidecar behavior tests |
| cmd/agent/internal/goalstates/checksum.go | Implement SHA-256 sidecar + verification (missing sidecar tolerated) |
| cmd/agent/internal/daemon/update_test.go | Add drift tests; partial coverage around applied-config handling |
| cmd/agent/internal/daemon/update.go | Implement active-machine discovery, drift detection, and node update orchestration |
| cmd/agent/internal/daemon/scheme.go | Add scheme construction for core + Machine CR |
| cmd/agent/internal/daemon/register_test.go | Add tests for daemon machine self-registration |
| cmd/agent/internal/daemon/daemon.go | Implement watch loop + workqueue reconciliation + status/condition updates |
| cmd/agent/internal/cmd/start.go | Use new goalstate + composite tasks; enable daemon after initial provisioning |
| cmd/agent/internal/cmd/reset.go | Stop daemon first during reset |
| cmd/agent/internal/cmd/daemon.go | Add unbounded-agent daemon CLI subcommand |
| cmd/agent/internal/cmd/cmd.go | Register daemon subcommand |
| api/v1alpha3/machine_types.go | Add NodeUpdated Machine condition constant |
| .gitignore | Ignore hack/bin/ |
| .github/workflows/agent-e2e-kind.yaml | Update CI e2e workflow to linear flow including daemon + upgrade validation |
```go
func TestFindActiveMachine_Kube1(t *testing.T) {
	dir := t.TempDir()

	// Override the config dir for testing.
	origPath := goalstates.AgentConfigDir
	// We can't override the const, so we test the lower-level function
	// by writing directly and reading back.
	cfg := baseConfig()
	data, err := json.MarshalIndent(cfg, "", " ")
	require.NoError(t, err)

	configPath := filepath.Join(dir, "kube1-applied-config.json")
	require.NoError(t, os.WriteFile(configPath, data, 0o600))

	// Read it back to verify the format.
	readData, err := os.ReadFile(configPath)
	require.NoError(t, err)

	var readCfg provision.AgentConfig
	require.NoError(t, json.Unmarshal(readData, &readCfg))
	assert.Equal(t, cfg.MachineName, readCfg.MachineName)
	assert.Equal(t, cfg.Cluster.Version, readCfg.Cluster.Version)

	_ = origPath // Note: findActiveMachine uses the const, so this test
	// validates the serialization/deserialization roundtrip rather than
	// the full findActiveMachine flow (which requires root filesystem access).
}
```
TestFindActiveMachine_Kube1 doesn't exercise findActiveMachine at all (it just round-trips JSON), so the selection logic (kube1 vs kube2), checksum verification, and error paths are currently untested. Consider refactoring findActiveMachine to accept a base directory (or injecting filesystem ops) so the test can create tempdir files and assert the actual behavior.
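One possible shape for that refactor, with the search root injected as a parameter. This is a sketch: `findActiveMachineIn` is a hypothetical name, the file-name pattern is illustrative, and the repo's `ActiveMachine` and `provision.AgentConfig` types (plus the checksum verification the real function performs) are assumed to be in scope:

```go
// findActiveMachineIn searches baseDir instead of the fixed config dir so a
// test can point it at t.TempDir(). findActiveMachine would become a thin
// wrapper that passes the production directory, and tests could then cover
// kube1-vs-kube2 selection, checksum failures, and decode errors directly.
func findActiveMachineIn(baseDir string) (*ActiveMachine, error) {
	for _, name := range []string{"kube1", "kube2"} {
		path := filepath.Join(baseDir, name+"-applied-config.json")
		data, err := os.ReadFile(path)
		if errors.Is(err, os.ErrNotExist) {
			continue // this machine has no applied config; try the next
		}
		if err != nil {
			return nil, fmt.Errorf("read applied config %s: %w", path, err)
		}
		var cfg provision.AgentConfig
		if err := json.Unmarshal(data, &cfg); err != nil {
			return nil, fmt.Errorf("decode applied config %s: %w", path, err)
		}
		return &ActiveMachine{Name: name, Config: &cfg}, nil
	}
	return nil, nil // no active machine found
}
```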
```go
// Worker goroutine: processes items from the queue.
go runWorker(ctx, log, kubeClient, queue)
```
The daemon only reconciles when a watch event enqueues the Machine name. On startup, if spec.operations.* already drifts from status (or if the daemon was down during a counter bump), there may be no subsequent watch event and the drift will never be processed. Consider enqueueing machineName once after registerMachine (or doing an initial handleMachineEvent/GET+reconcile pass) before entering the watch loop so startup always converges state.
Suggested change:

```go
// Always perform one startup reconciliation so any pre-existing drift
// between spec.operations.* and status is converged even if no watch
// event is emitted after registration.
queue.Add(machineName)
```
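For context, the watch/worker split under discussion has roughly this shape. The wiring here is illustrative, not the PR's exact code: `handleMachineEvent` and `machineName` are from this PR, the queue is `k8s.io/client-go/util/workqueue`, and `watch` is `k8s.io/apimachinery/pkg/watch`:

```go
queue := workqueue.NewTypedRateLimitingQueue(
	workqueue.DefaultTypedControllerRateLimiter[string]())
defer queue.ShutDown()

// Worker: one reconciliation at a time; failures retry with backoff.
go func() {
	for {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}
		if err := handleMachineEvent(ctx, log, kubeClient, key); err != nil {
			queue.AddRateLimited(key) // exponential backoff retry
		} else {
			queue.Forget(key) // reset the rate limiter on success
		}
		queue.Done(key)
	}
}()

// Watch loop: never blocks on reconciliation work.
queue.Add(machineName) // startup reconciliation, per the suggestion above
for event := range watcher.ResultChan() {
	if event.Type == watch.Added || event.Type == watch.Modified {
		queue.Add(machineName) // dedupes if already queued
	}
}
```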
```go
// Set status to Provisioning before starting work.
if err := updateMachinePhase(ctx, c, machine, v1alpha3.MachinePhaseProvisioning, "agent daemon reconciling"); err != nil {
	log.Warn("failed to update phase to Provisioning", "error", err)
	// Continue with reconciliation even if status update fails.
}

// Execute the node update with the desired config.
if err := updateNode(ctx, log, active, desired); err != nil {
	// Update status to Failed.
	failMsg := fmt.Sprintf("node update failed: %v", err)
	if updateErr := updateMachineStatus(ctx, c, machine, v1alpha3.MachinePhaseFailed, failMsg, false); updateErr != nil {
		log.Warn("failed to update status after failure", "error", updateErr)
	}

	return fmt.Errorf("node update: %w", err)
}

// Acknowledge operation counters.
acknowledgeOperations(machine)

// Update status to Joining with success.
if err := updateMachineStatus(ctx, c, machine, v1alpha3.MachinePhaseJoining, "node update completed", true); err != nil {
	log.Warn("failed to update status after success", "error", err)
}
```
handleMachineEvent fetches the Machine once, then performs a long-running updateNode, and finally calls Status().Update on the same in-memory object. In practice the Machine status/resourceVersion is likely to change during reconciliation (from the daemon's own Provisioning update or from the controller), so the final status update can conflict and be dropped, leaving spec.operations > status.operations and causing repeated retries. Re-fetch (or use a status Patch with client.MergeFrom) and wrap status writes in retry.RetryOnConflict so phase/conditions and operation acknowledgements are reliably persisted.
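A sketch of that pattern, assuming a controller-runtime client and the PR's `v1alpha3.Machine` type; `setStatusWithRetry` is a hypothetical helper name, and `retry` is `k8s.io/client-go/util/retry`:

```go
func setStatusWithRetry(ctx context.Context, c client.Client, key client.ObjectKey,
	mutate func(m *v1alpha3.Machine)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Re-fetch on every attempt so the write targets the latest
		// resourceVersion rather than the object read before updateNode.
		var m v1alpha3.Machine
		if err := c.Get(ctx, key, &m); err != nil {
			return err
		}
		base := m.DeepCopy()
		mutate(&m) // set phase/conditions, acknowledge operation counters
		return c.Status().Patch(ctx, &m, client.MergeFrom(base))
	})
}
```

With `client.MergeFrom` the patch carries no resourceVersion, so most conflicts disappear outright; the `RetryOnConflict` wrapper covers the remaining cases (for example, if this were switched to `MergeFromWithOptimisticLock`).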
```go
for _, name := range []string{goalstates.NSpawnMachineKube1, goalstates.NSpawnMachineKube2} {
	path := goalstates.AppliedConfigPath(name)

	data, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		continue
	}
	if err != nil {
		return nil, fmt.Errorf("read applied config %s: %w", path, err)
	}

	// Verify the sidecar checksum before trusting the config data.
	checksumPath := goalstates.AppliedConfigChecksumPath(name)
	if err := goalstates.VerifyChecksum(data, checksumPath); err != nil {
		return nil, fmt.Errorf("verify applied config checksum for %s: %w", name, err)
	}

	// If the sidecar file is missing, log a warning so operators
	// know the integrity check was skipped.
	if _, statErr := os.Stat(checksumPath); errors.Is(statErr, os.ErrNotExist) {
		log.Warn("no checksum sidecar found, skipping integrity check",
			"config_path", path,
			"checksum_path", checksumPath,
		)
	}

	var cfg provision.AgentConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, fmt.Errorf("decode applied config %s: %w", path, err)
	}

	return &ActiveMachine{Name: name, Config: &cfg}, nil
}
```
findActiveMachine returns the first applied config it finds (kube1 before kube2). During an update there is a window where both applied configs can exist (new config persisted before old is cleaned up), and a crash/restart in that window would cause the daemon to pick the wrong active machine/config. Consider detecting the "both present" case and either selecting by file mtime / machinectl state, or returning an error that forces operator intervention.
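One way to make that window explicit rather than silently preferring kube1 is to refuse to guess when both configs exist. A sketch of the error-out option (selection by file mtime or `machinectl` state would be the non-failing alternatives):

```go
// Count how many applied configs exist before picking one.
var present []string
for _, name := range []string{goalstates.NSpawnMachineKube1, goalstates.NSpawnMachineKube2} {
	if _, err := os.Stat(goalstates.AppliedConfigPath(name)); err == nil {
		present = append(present, name)
	}
}
if len(present) > 1 {
	// Both configs exist, most likely a crash mid-update; force operator
	// intervention rather than reprovisioning the wrong machine.
	return nil, fmt.Errorf("multiple applied configs present (%v): cannot determine active machine", present)
}
```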
```python
# Patch the Machine CR to set the new version and repaveCounter = 1.
# The CRD requires spec.kubernetes.bootstrapTokenRef when
# spec.kubernetes is present, so we include a placeholder value.
patch = json.dumps({
    "spec": {
        "kubernetes": {
            "version": upgrade_version,
            "bootstrapTokenRef": {
                "name": "not-used-by-agent-daemon",
            },
        },
        "operations": {
            "repaveCounter": 1,
        },
    },
```
This merge patch overwrites spec.kubernetes.bootstrapTokenRef with a placeholder. That can inadvertently destroy the correct bootstrapTokenRef created by the daemon/machina and may affect any controller logic that relies on it. Since the CR already has bootstrapTokenRef, patch only the fields you need (version + operations.repaveCounter), and consider bumping repaveCounter relative to the current value instead of hard-coding it to 1 to make the step rerunnable.
```diff
 # 6. Remove nftables flush service and unbounded-kube config dir
 # -----------------------------------------------------------------
 echo "Removing nftables flush service..."
 systemctl disable --now nftables-flush.service 2>/dev/null || true
 rm -f /etc/systemd/system/nftables-flush.service
-rm -rf /etc/unbounded-kube
+rm -rf /etc/unbounded/kube
```
This uninstall step now removes only the new config directory (/etc/unbounded/kube). If the host was provisioned by an older version (which used /etc/unbounded-kube), those legacy files will be left behind after uninstall. Consider removing both the legacy and new paths for backward-compatible cleanup.
| echo "Removing agent binaries and configuration..." | ||
| rm -f /usr/local/bin/unbounded-agent | ||
| rm -f /usr/local/bin/unbounded-agent-install.sh | ||
| rm -f /usr/local/bin/unbounded-agent-uninstall.sh | ||
| rm -rf /etc/unbounded-agent | ||
| rm -rf /etc/unbounded/agent | ||
| rm -rf /tmp/unbounded-agent |
This uninstall step now removes only /etc/unbounded/agent. If the host was provisioned by an older version that used /etc/unbounded-agent, the legacy directory will remain after uninstall/reset, which can confuse operators and future runs. Consider removing both legacy and new agent config directories.
* **Replace gRPC daemon with Machine CR watch-based daemon**

  Replace the gRPC task-pull daemon with a watch-based daemon that monitors the Machine CR on the control plane and reconciles the local node to match the desired state.

  - Watch the Machine CR for spec drift (version, image) and operation counter drift (reimageCounter, rebootCounter)
  - Perform alternating nspawn machine updates (kube1/kube2) on drift
  - Update Machine CR status: phase, conditions (NodeUpdated), and acknowledge operation counters
  - Authenticate with the bootstrap token from the applied config (not the kubelet kubeconfig, which has nspawn-internal paths)
  - Add RBAC (ClusterRole + ClusterRoleBinding) for system:bootstrappers
  - Add e2e tests: daemon validation, version upgrade via reimageCounter patch, applied config verification
  - Fix double-reconciliation bug: re-GET the Machine CR before drift detection to avoid stale watch events

* **Refactor: extract shared phase helpers and consolidate machineRun**

  Extract rootfs.Provision, nodestart.StartNode, and nodestop.StopNode as shared composite tasks used by both the initial agent start and the node update flow. This removes duplicated phase orchestration from start.go and nodeupdate.Execute. Move the duplicated machineRun helper into utilexec.MachineRun so all packages (nodestart, nodestop, nodeupdate) share a single implementation. Replace the empty nodestop stubs (StopContainerd, StopKubelet) with a real StopNode task that gracefully stops services before nspawn teardown.

* **Refactor agent daemon: consolidate into daemon package and extract phase helpers**

  - Move the nodeupdate package to cmd/agent/internal/daemon and rename Execute to updateNode (unexported). The daemon package now owns the full watch loop, reconciliation, kube client, scheme, drift detection, and node update logic.
  - Slim cmd/daemon.go to just cobra wiring that calls daemon.Run(ctx, log). Delete cmd/scheme.go (moved to the daemon package).
  - Remove the Config field from the NodeStart goal state. PersistAppliedConfig and StartNode now take *provision.AgentConfig as an explicit parameter instead of smuggling it through the goal state struct.
  - Extract WaitForKubelet into phases/nodestart/wait_kubelet.go.
  - Extract PersistAppliedConfig into phases/nodestart/persist_config.go.
  - Add a reset.RemoveAppliedConfig task and a reset.CleanupMachine composite that combines RemoveNSpawnConfig + RemoveMachine + RemoveAppliedConfig. Replace the inline os.Remove in node update with the composite task.

* **Simplify e2e test flow to single linear sequence**

  Remove the two-case structure (pre-existing CR vs no CR with VM recreation) and replace it with a single linear flow: join, validate the self-registered CR, upgrade, reset, rejoin. This removes the need for VM recreation between test cases.

* **Move Machine CR registration from start command to daemon**

  The daemon now registers the Machine CR at startup before entering the watch loop, instead of the start command doing it as a separate phase. This ensures registration happens even on rejoin after reset, and keeps all Machine CR interaction in the daemon package. Remove the now-unused nodestart.RegisterMachine phase and its tests. Add tests for the daemon's registerMachine and buildMachineCR functions.

* **Move EnableDaemon into task list, add machines.target dep, unexport findActiveMachine**

  - Move EnableDaemon from a separate call into the Serial task list so all phases are in one place.
  - Add a machines.target dependency to the daemon systemd unit so it waits for the nspawn machine to be running before starting.
  - Unexport FindActiveMachine since it is only used within the daemon package.
  - Add polling to validate_machine_cr_created in e2e tests since the daemon now registers the Machine CR asynchronously after startup.

* **Fix implicit string concatenation flagged by CodeQL**

  Use explicit + concatenation for the multi-line shell command string in validate_upgrade to avoid the implicit-concatenation-in-list warning.

* **Fix lint: lowercase error string per Go conventions**

* **Update RBAC and daemon doc to match current architecture**

  Add the create verb to the machines resource in the agent RBAC (needed for daemon self-registration). Update daemon.md to accurately describe bootstrap token auth, the system:bootstrappers group, Machine CR registration at startup, operation counter drift as the sole reconciliation trigger, and the machines.target systemd dependency.

* **Address PR #37 review nits and drop post-rejoin CR check**

  - Annotate bare error returns in goalstates/resolve.go (plombardi89)
  - Add MachineConditionNodeUpdated const to api/v1alpha3 (jveski)
  - Use apimeta.SetStatusCondition instead of a manual condition loop (jveski)
  - Set the NodeUpdated condition alongside the Provisioning phase update (jveski)
  - Drop the validate-machine-cr-created step after rejoin in the e2e workflow

* **Decouple watch loop from reconciliation with async worker**

  Move reconciliation to a worker goroutine signalled via a buffered channel (capacity 1). The watch loop now performs a non-blocking send on each MODIFIED/ADDED event and immediately returns to draining the watch stream. This prevents backpressure on the API server's HTTP/2 connection when reconciliation takes time (rootfs provisioning ~15s). The worker calls handleMachineEvent, which re-GETs the Machine CR from the API server, so coalesced signals naturally pick up the latest state. Multiple events arriving during an in-flight reconciliation are merged into a single follow-up reconciliation.

* **Use client-go workqueue for async reconciliation**

  Replace the hand-rolled channel-based worker with client-go's TypedRateLimitingInterface workqueue. This is the standard Kubernetes controller building block and provides deduplication, rate limiting with exponential backoff on failures, and proper shutdown semantics. The watch loop calls queue.Add(machineName) on events; the workqueue deduplicates if the key is already queued or being processed. On reconciliation failure runWorker calls AddRateLimited for a backoff retry; on success it calls Forget to reset the rate limiter.

* **Add TODO for bootstrap token credential strategy in buildKubeClient**

* **Replace global NewKubeClient var with parameter injection**

* **Consolidate agent daemon RBAC into bootstrapper RBAC**

* **Fix flaky e2e: add site label to bootstrap token secret**

* **Stop and remove agent daemon systemd unit during reset**

* **Unify config dirs under /etc/unbounded/{kube,agent}**

* **Add SHA-256 sidecar checksum for applied config integrity**

  Write a .sha256 companion file alongside the applied config JSON to detect on-disk corruption (e.g. bitflips). Each file is written atomically via renameio; a missing sidecar is treated as a warning (older agent or crash between writes), while a present sidecar with a wrong digest returns ErrChecksumMismatch.

  - Write path: PersistAppliedConfig writes the checksum after the config
  - Read path: findActiveMachine verifies the checksum before trusting the data
  - Reset path: RemoveAppliedConfig cleans up the sidecar file
  - Tests: ComputeChecksum, VerifyChecksum (match/mismatch/missing/error)

* **Update daemon doc: consolidate sections, add SVG diagram**

  Combine Drift Detection and Applied Config Integrity into a single Applied Config and Drift Detection section that describes AgentConfig fields, operation counter triggers, and persistence with the integrity guard. Replace the ASCII diagram with an SVG following the existing style. Simplify the systemd and RBAC sections to prose descriptions. Call out bootstrap token auth as temporary.

* **Rename reimage to repave in agent daemon, e2e, and docs**

  Follow the API rename from ReimageCounter to RepaveCounter that landed in main via #36. Update Go field references, JSON patch fields in e2e tests, Python variables, CI comments, and documentation.

* **Consolidate daemonUnit const and remove reboot counter drift check**

* **Implement in-place update of nspawn machine**

  Add alternating (blue/green) nspawn machine update logic and a long-running agent daemon that registers the Machine CR at startup. The daemon discovers the active nspawn machine, builds a kube client from the applied config, ensures a Machine CR exists, then blocks until shutdown. The update logic (updateNode, hasDrift, findActiveMachine) is in place but the trigger mechanism is TBD. Removes the Machine CR watch loop, workqueue, status updates, and operations counter logic from the prior implementation. Also removes daemon-specific e2e steps and the design doc.

* **Add license header and StartLimitIntervalSec=0 to daemon unit file**
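Taken together, these commits leave repave counter drift as the daemon's only reconciliation trigger. In sketch form (the Go field names here are assumptions inferred from the `spec.operations.repaveCounter` and `status.operations.repaveCounter` JSON paths used elsewhere in this PR):

```go
// Reconcile only when spec has moved ahead of the acknowledged status.
func repaveCounterDrift(m *v1alpha3.Machine) bool {
	return m.Spec.Operations.RepaveCounter > m.Status.Operations.RepaveCounter
}
```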
Summary
Implemented a Kubernetes watch-based daemon that monitors the Machine CR on the control plane and reconciles the local node to match the desired state.
What changed
**Daemon** (`cmd/agent/internal/daemon/`)

- Watches the Machine CR via `client.NewWithWatch`, matched by `MachineName` from the applied config
- Authenticates with the bootstrap token (`system:bootstrappers` group), which avoids reading kubeconfig files from inside the nspawn machine (those contain nspawn-internal paths); a TODO marks this as temporary pending a proper agent credential strategy
- Uses repave counter drift (`spec.operations.repaveCounter > status.operations.repaveCounter`) to trigger reconciliation; the reboot counter is not handled by the daemon
- `updateNode` checks actual config drift (version, image) before performing the expensive rootfs reprovision
- On success: sets phase `Joining`, acknowledges the repave counter (spec -> status), sets the `NodeUpdated` condition to `True`/`Succeeded`
- On failure: sets phase `Failed`, sets the `NodeUpdated` condition to `False`/`Failed`
- While reconciling: sets phase `Provisioning`, sets the `NodeUpdated` condition to `False`/`InProgress`
- The kube client is injected (`kubeClientFunc` type) for testability

**Applied config integrity** (`cmd/agent/internal/goalstates/checksum.go`)

- SHA-256 sidecar written atomically alongside the applied config; verified on read, with a missing sidecar tolerated and a wrong digest returning `ErrChecksumMismatch`

**Phase helpers** (extracted from inline code)

- `rootfs.Provision(log, gs)` - composite rootfs provisioning task
- `nodestart.StartNode(log, gs, cfg)` - composite node start task
- `nodestop.StopNode(log, machineName)` - composite node stop task
- `utilexec.MachineRun` - nspawn machine command execution

**Goal state** (`cmd/agent/internal/goalstates/`)

- `ResolveMachine()` discovers the active nspawn machine and loads the applied config
- `MachineGoalState` provides machine name, config, and rootfs paths
- `DaemonUnit` const consolidates the systemd unit name in one place

**RBAC** (`deploy/machina/07-bootstrapper-rbac.yaml`)

- ClusterRole (`unbounded-bootstrapper-machine`) covering `machines`, plus get/update/patch on `machines/status`, bound to the `system:bootstrappers` group

**Systemd** (`unbounded-agent-daemon.service`)

- `After=network-online.target machines.target` ensures the daemon starts after nspawn machines are available

**Config directories**

- `/etc/unbounded/kube` and `/etc/unbounded/agent` (was `/etc/unbounded-kube` and `/etc/unbounded-agent`)

**Design doc** (`docs/content/reference/agent/daemon.md`)

**E2E test flow** (single linear sequence)

**Removed**

- `nodestart.RegisterMachine` phase (moved to daemon)
- `deploy/agent/01-rbac.yaml` (consolidated into bootstrapper RBAC)
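To make the integrity guard concrete, here is a minimal sketch of the sidecar scheme described under "Applied config integrity". The names `ComputeChecksum`, `VerifyChecksum`, and `ErrChecksumMismatch` appear in the commit notes, but the signatures here are assumptions, and the real write path is atomic via renameio:

```go
package goalstates

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"os"
	"strings"
)

var ErrChecksumMismatch = errors.New("applied config checksum mismatch")

// ComputeChecksum returns the hex-encoded SHA-256 digest of data.
func ComputeChecksum(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// VerifyChecksum compares data against the digest stored in the sidecar.
// A missing sidecar is tolerated (older agent or crash between writes);
// a present sidecar with the wrong digest returns ErrChecksumMismatch.
func VerifyChecksum(data []byte, sidecarPath string) error {
	want, err := os.ReadFile(sidecarPath)
	if errors.Is(err, os.ErrNotExist) {
		return nil
	}
	if err != nil {
		return fmt.Errorf("read checksum sidecar %s: %w", sidecarPath, err)
	}
	if strings.TrimSpace(string(want)) != ComputeChecksum(data) {
		return ErrChecksumMismatch
	}
	return nil
}
```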