Skip to content

feat: add WorkerPool CRD for persistent task execution#1318

Open
knechtionscoding wants to merge 1 commit into
kelos-dev:mainfrom
datagravity-ai:feat/worker-pool-persistent-spawning
Open

feat: add WorkerPool CRD for persistent task execution#1318
knechtionscoding wants to merge 1 commit into
kelos-dev:mainfrom
datagravity-ai:feat/worker-pool-persistent-spawning

Conversation

@knechtionscoding

@knechtionscoding knechtionscoding commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds a WorkerPool CRD that manages persistent worker pods for task execution, addressing the cold-start overhead problem identified in #911. Instead of creating a new Job per task (with git clone, dependency install, etc.), tasks can reference a WorkerPool and execute on pre-warmed pods backed by a StatefulSet with persistent storage.

Architecture:

  • A WorkerPool owns a StatefulSet + headless Service. Each pod runs a kelos-worker-runner sidecar that polls for task assignments via pod annotations.
  • The WorkerPoolReconciler handles both infrastructure (StatefulSet lifecycle, status) and task dispatch (assigning queued tasks to available pods, monitoring completion).
  • Tasks opt in via spec.workerPoolRef. The existing Job-based path is unchanged for tasks without a pool reference.

API changes:

  • New CRD: WorkerPool (spec: type, credentials, replicas, storage, workspaceRef, agentConfigRef, podOverrides; status: phase, replicas, readyReplicas)
  • TaskSpec and TaskTemplate gain a worker struct that groups execution-environment fields (type, credentials, image, model, workspaceRef, agentConfigRef, podOverrides). Top-level fields are deprecated with +deprecated markers but remain functional.
  • Effective*() accessor methods on TaskSpec resolve the new worker.* fields over deprecated top-level fields, ensuring all consumers (job builder, controller, CLI, metrics) work with both old and new manifests.
  • spec.workerPoolRef added to both TaskSpec and TaskTemplate.
  • CEL validation enforces the union: exactly one of workerPoolRef, worker.type+credentials, or deprecated type+credentials must be provided.

Components added:

  • api/v1alpha1/workerpool_types.go — CRD types
  • internal/controller/workerpool_controller.go — Reconciler (infrastructure + task dispatch)
  • cmd/kelos-worker-runner/ — Binary that runs in worker pods
  • internal/workerrunner/ — Worker runner core logic
  • internal/cli/get_workerpool.gokelos get workerpools command
  • Helm CRD template, RBAC ClusterRole, Dockerfile
  • Unit tests, integration test, example manifests

Which issue(s) this PR is related to:

Fixes #1310
Fixes #911

Special notes for your reviewer:

  • Backward compatibility: All existing manifests continue to work unchanged. The deprecated top-level fields on TaskSpec/TaskTemplate are wired through Effective*() accessors that every consumer uses.
  • Credentials field changed from value to pointer on TaskSpec/TaskTemplate to make it optional when workerPoolRef is set. All accesses are nil-guarded via the accessor methods.
  • Reconcile dispatch: The WorkerPoolReconciler handles both WorkerPool objects and Tasks (with workerPoolRef). It disambiguates by trying WorkerPool first, falling through to Task only on NotFound.
  • Task assignment protocol: Controller annotates pod with task name → runner detects annotation → runner executes agent → runner writes status annotation → controller reads status and updates Task phase → controller clears annotations.
  • No token refresh yet: The worker runner receives KELOS_TOKEN_SECRET env var for future token refresh support, but the actual refresh logic is deferred.
  • The TaskReconciler has an early return for tasks with workerPoolRef set, so it never creates Jobs for pool-based tasks.

Does this PR introduce a user-facing change?

Add WorkerPool CRD for persistent task execution. WorkerPools manage pre-warmed worker pods that eliminate per-task cold-start overhead. Tasks reference a pool via spec.workerPoolRef to execute on persistent infrastructure instead of creating one-shot Jobs. TaskSpec gains a worker struct for grouping execution-environment fields (type, credentials, image, workspaceRef); top-level fields are deprecated but remain functional.

Summary by cubic

Adds a WorkerPool CRD and controller for persistent, pre-warmed workers to remove per-task cold starts. Tasks can opt in via spec.workerPoolRef; the existing Job path stays as-is.

  • New Features

    • WorkerPool CRD managing a StatefulSet and headless Service with status phases; WorkerPoolReconciler provisions the pool and assigns tasks via a pod-annotation protocol; kelos-worker-runner runs in pods to poll, execute, and report.
    • API: adds WorkerConfig/TaskWorker to group execution fields on Task/TaskTemplate; introduces spec.workerPoolRef; Credentials is now optional (pointer); Effective*() accessors resolve new vs. deprecated fields; CEL validation enforces exactly one execution source and requires credentials.secretRef for api-key/oauth.
    • CLI/Helm/RBAC: kelos get workerpools with shell completion; controller flags --worker-runner-image and --worker-runner-image-pull-policy; chart adds workerpools.kelos.dev CRD and a kelos-worker-runner-role, and updates Task/TaskSpawner CRDs for the new fields/validations; Makefile builds cmd/kelos-worker-runner; examples under examples/15-workerpool.
  • Migration

    • No changes needed for existing manifests; deprecated top-level fields still work.
    • To use pools: upgrade the chart to install the CRD/RBAC, create a WorkerPool, then set spec.workerPoolRef on a Task or TaskSpawner.

Written for commit dc51633. Summary will update on new commits.

Review in cubic

@cubic-dev-ai

cubic-dev-ai Bot commented Jun 8, 2026

Copy link
Copy Markdown

This PR is large and would use a significant portion of your monthly review quota. Comment @cubic-dev-ai review this to confirm that you want cubic to review it.

@knechtionscoding

Copy link
Copy Markdown
Contributor Author

@gjkim42 for your review. The vast majority of this diff is related to generating new CRDs. The rest is reworking #1228 into the new model.

I ran this through the general reviewer and API review locally twice and resolved open issues

@knechtionscoding

Copy link
Copy Markdown
Contributor Author

Due to the size, I am also happy to own running and posting of the reviewer if you would like and then just allow you to review the output to not burn the projects tokens

@knechtionscoding knechtionscoding force-pushed the feat/worker-pool-persistent-spawning branch from a49a324 to 4173e9a Compare June 8, 2026 14:22
@knechtionscoding knechtionscoding force-pushed the feat/worker-pool-persistent-spawning branch from 4173e9a to 0894d29 Compare June 8, 2026 14:26
@gjkim42 gjkim42 self-assigned this Jun 8, 2026

@gjkim42 gjkim42 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have to add e2e tests for the persistent tasks

knechtionscoding

This comment was marked as outdated.

knechtionscoding

This comment was marked as outdated.

@knechtionscoding knechtionscoding force-pushed the feat/worker-pool-persistent-spawning branch from 4407e18 to e9ce834 Compare June 9, 2026 09:22

@knechtionscoding knechtionscoding left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API Design Review

Verdict: APPROVE
Scope: New WorkerPool CRD + WorkerConfig shared type + TaskSpec/TaskTemplate deprecation layer with worker struct and workerPoolRef

Findings

API Conventions ✓

  • All field names correctly use camelCase in Go structs and JSON tags
  • Optional fields properly marked with +optional, pointer types, and omitempty
  • Required fields validated with +kubebuilder:validation:Required
  • No boolean isX/enableX anti-patterns
  • WorkerPoolPhase correctly uses named string type with const block
  • Status fields (Replicas, ReadyReplicas, Phase) appropriately mirror standard Kubernetes status patterns (matching Deployment/StatefulSet conventions)
  • ObservedGeneration included in WorkerPoolStatus — good practice for day-one

Compatibility ✓

  • No breaking changes: deprecated top-level fields remain functional with +deprecated markers, +optional, pointer types, and omitempty
  • CEL union validation correctly accepts all three forms: workerPoolRef, worker.type+credentials, or deprecated type+credentials
  • Credentials value→pointer change is a valid loosening for v1alpha1 (required→optional in CRD schema)
  • Existing manifests in examples/01-simple-task/, self-development/ continue to validate under the updated schema
  • TaskSpawnerSpec workspace validation rule correctly extended to cover taskTemplate.worker.workspaceRef and taskTemplate.workerPoolRef
  • TaskBuilder correctly propagates both Worker and WorkerPoolRef from templates (lines 112-117)

Naming & Consistency ✓

  • WorkerPoolReference follows existing XxxReference pattern (WorkspaceReference, AgentConfigReference, SecretReference)
  • Field naming (workerPoolRef, workspaceRef, agentConfigRef) follows established xxxRef convention
  • replicas, readyReplicas, phase, conditions, observedGeneration all match upstream Kubernetes semantics
  • WorkerConfig correctly reused via type aliases for TaskWorker and TaskTemplateWorker

Validation ✓

  • WorkerPoolSpec credential XValidation rule matches existing pattern from previous TaskSpec
  • agentConfigRef/agentConfigRefs mutual exclusion enforced at all three levels (WorkerPoolSpec, WorkerConfig, TaskSpec)
  • WorkerPoolReference.Name has +kubebuilder:validation:MinLength=1 — good empty-string prevention (better than existing WorkspaceReference which lacks it)
  • WorkerPoolStorage.Size correctly uses resource.Quantity type

Extensibility ✓

  • WorkerPoolStorage struct (rather than bare size scalar on spec) allows future fields (access modes, volume mode) without breaking changes
  • WorkerPoolSpec uses PodOverrides pointer — expandable without schema change
  • WorkerPoolReference is a struct (not bare string) — can add namespace later for cross-namespace refs

Documentation ✓

  • docs/reference.md updated with WorkerPool section, deprecated field markers, new CLI commands
  • Example manifests under examples/15-workerpool/ cover all three usage patterns (pool, task-with-pool, spawner-with-pool)
  • All exported types and fields have godoc comments

Suggestions (non-blocking)

  • WorkerPoolSpec field duplication vs WorkerConfig embedding: WorkerPoolSpec duplicates fields (type, credentials, model, etc.) that WorkerConfig already groups. Consider embedding WorkerConfig in WorkerPoolSpec with different validation markers in a follow-up to reduce maintenance surface — though the required-vs-optional distinction makes direct embedding non-trivial, so the current approach is reasonable.
  • WorkspaceReference and AgentConfigReference lack MinLength=1: The new WorkerPoolReference correctly adds it. Consider adding MinLength=1 to the older reference types in a follow-up for consistency.
  • No CEL warning when both worker.* and deprecated top-level fields are set simultaneously: The Effective*() accessors resolve correctly (worker wins), but dual-specification may confuse users. A webhook admission warning could surface this during the deprecation transition.
  • taskspawner-with-pool.yaml includes workspaceRef alongside workerPoolRef: The comment explains why (spawner discovery loop). Consider noting in the reference docs that spawners with pool-based templates still need workspaceRef for GitHub/Linear source discovery, since the CEL rule on TaskSpawnerSpec accepts workerPoolRef alone but the spawner's runtime behavior may differ.

/kelos needs-input

@knechtionscoding knechtionscoding left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Verdict: COMMENT
Overall correctness: patch is correct
Scope: Adds WorkerPool CRD, controller, worker-runner binary, CLI support, Helm RBAC, TaskSpec deprecation layer via worker struct, and comprehensive tests

Findings Overview

Priority Count File:Line Summary
P0 0 none
P1 0 none
P2 2 See below DependsOn/branch-lock not enforced for pool tasks; kelos logs does not resolve WorkerPodName
P3 1 See below clearPodAssignment uses MergeFrom without optimistic lock

Findings

Correctness

  • [P2] internal/controller/workerpool_controller.go:807 — The TaskReconciler early-returns for tasks with workerPoolRef (task_controller.go:80), skipping checkDependencies and BranchLocker.TryAcquire. The WorkerPoolReconciler.reconcileTask has no equivalent logic — it immediately calls assignTask for any non-terminal task via the default case. A Task with both workerPoolRef and dependsOn (valid per the schema) will execute immediately, ignoring its dependencies. Similarly, branch locking is bypassed. If dependsOn is intentionally unsupported for pool-based tasks, consider adding CEL validation to reject it or documenting the limitation. Otherwise, port the dependency-check logic (or at minimum a TaskPhaseWaiting guard that skips assignment when the task is in Waiting phase).

  • [P3] internal/controller/workerpool_controller.go:954clearPodAssignment uses client.MergeFrom without MergeFromWithOptimisticLock. Lower risk than assignTask (since only the controller clears annotations after completion), but a concurrent runner setTaskStatus call could theoretically lead to a lost update on the tasks-completed counter. Acceptable for now given annotation semantics.

CLI

  • [P2] internal/cli/logs.go:131-144resolveTaskPodName searches for pods labeled kelos.dev/task and falls back to task.Status.PodName. Worker pool pods are not labeled with the task name (assignment is via annotation), so kelos logs <task> will fail to find the pod for pool-based tasks. The function should also check task.Status.WorkerPodName as a fallback:
if task.Status.WorkerPodName != "" {
    return task.Status.WorkerPodName, nil
}

Suggestions

  • [P3] Consider adding an explicit case kelosv1alpha1.TaskPhaseWaiting arm in reconcileTask rather than falling through to the default case — this documents intent and prevents future regressions if dependency logic is added later.
  • [P3] The previous review's P1 findings (optimistic lock on assignTask, Worker field propagation in task builder) have both been addressed — confirmed at lines 856 and 112 respectively.

Key takeaways

  • The core architecture (StatefulSet lifecycle, annotation-based task protocol, optimistic-lock assignment, rollback on patch failure) is solid and well-tested with unit, integration, and e2e coverage.
  • Main gap is dependency/branch-lock handling for pool-based tasks — straightforward to either explicitly forbid via CEL or implement in a follow-up.
  • kelos logs needs a one-line fix to support pool-based tasks.

@knechtionscoding knechtionscoding force-pushed the feat/worker-pool-persistent-spawning branch from e9ce834 to 8a4a840 Compare June 9, 2026 09:38
@knechtionscoding

Copy link
Copy Markdown
Contributor Author

@gjkim42 ready for review.

Introduce a WorkerPool custom resource that manages a fleet of
persistent worker pods via StatefulSet, eliminating cold-start overhead
for tasks. Tasks reference a pool via spec.workerPoolRef and are
dispatched to pre-warmed workers through an annotation-based protocol.

Key components:
- WorkerPool CRD with spec (type, credentials, replicas, storage,
  workspaceRef) and status (phase, readyReplicas)
- WorkerPoolReconciler managing StatefulSet/Service lifecycle and
  task-to-pod assignment
- kelos-worker-runner binary that polls for assignments and delegates
  to the agent entrypoint
- TaskSpec restructured with worker struct + Effective*() accessors
  for field resolution (deprecated top-level fields preserved)
- CLI: kelos get workerpools
- Helm: WorkerPool CRD template + kelos-worker-runner-role ClusterRole

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@knechtionscoding knechtionscoding force-pushed the feat/worker-pool-persistent-spawning branch from 8a4a840 to dc51633 Compare June 9, 2026 09:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/api Categorizes issue or PR as related to API changes kind/feature Categorizes issue or PR as related to a new feature needs-actor priority/important-longterm release-note triage-accepted

Projects

None yet

2 participants