feat: add WorkerPool CRD for persistent task execution#1318
feat: add WorkerPool CRD for persistent task execution#1318knechtionscoding wants to merge 1 commit into
Conversation
|
This PR is large and would use a significant portion of your monthly review quota. Comment |
|
Due to the size, I am also happy to own running and posting of the reviewer if you would like and then just allow you to review the output to not burn the projects tokens |
a49a324 to
4173e9a
Compare
4173e9a to
0894d29
Compare
gjkim42
left a comment
There was a problem hiding this comment.
I think we have to add e2e tests for the persistent tasks
0894d29 to
7944fa0
Compare
7944fa0 to
4407e18
Compare
4407e18 to
e9ce834
Compare
knechtionscoding
left a comment
There was a problem hiding this comment.
API Design Review
Verdict: APPROVE
Scope: New WorkerPool CRD + WorkerConfig shared type + TaskSpec/TaskTemplate deprecation layer with worker struct and workerPoolRef
Findings
API Conventions ✓
- All field names correctly use camelCase in Go structs and JSON tags
- Optional fields properly marked with
+optional, pointer types, andomitempty - Required fields validated with
+kubebuilder:validation:Required - No boolean
isX/enableXanti-patterns WorkerPoolPhasecorrectly uses named string type with const block- Status fields (
Replicas,ReadyReplicas,Phase) appropriately mirror standard Kubernetes status patterns (matching Deployment/StatefulSet conventions) ObservedGenerationincluded inWorkerPoolStatus— good practice for day-one
Compatibility ✓
- No breaking changes: deprecated top-level fields remain functional with
+deprecatedmarkers,+optional, pointer types, andomitempty - CEL union validation correctly accepts all three forms:
workerPoolRef,worker.type+credentials, or deprecatedtype+credentials Credentialsvalue→pointer change is a valid loosening for v1alpha1 (required→optional in CRD schema)- Existing manifests in
examples/01-simple-task/,self-development/continue to validate under the updated schema TaskSpawnerSpecworkspace validation rule correctly extended to covertaskTemplate.worker.workspaceRefandtaskTemplate.workerPoolRefTaskBuildercorrectly propagates bothWorkerandWorkerPoolReffrom templates (lines 112-117)
Naming & Consistency ✓
WorkerPoolReferencefollows existingXxxReferencepattern (WorkspaceReference,AgentConfigReference,SecretReference)- Field naming (
workerPoolRef,workspaceRef,agentConfigRef) follows establishedxxxRefconvention replicas,readyReplicas,phase,conditions,observedGenerationall match upstream Kubernetes semanticsWorkerConfigcorrectly reused via type aliases forTaskWorkerandTaskTemplateWorker
Validation ✓
WorkerPoolSpeccredential XValidation rule matches existing pattern from previousTaskSpecagentConfigRef/agentConfigRefsmutual exclusion enforced at all three levels (WorkerPoolSpec, WorkerConfig, TaskSpec)WorkerPoolReference.Namehas+kubebuilder:validation:MinLength=1— good empty-string prevention (better than existingWorkspaceReferencewhich lacks it)WorkerPoolStorage.Sizecorrectly usesresource.Quantitytype
Extensibility ✓
WorkerPoolStoragestruct (rather than baresizescalar on spec) allows future fields (access modes, volume mode) without breaking changesWorkerPoolSpecusesPodOverridespointer — expandable without schema changeWorkerPoolReferenceis a struct (not bare string) — can addnamespacelater for cross-namespace refs
Documentation ✓
docs/reference.mdupdated with WorkerPool section, deprecated field markers, new CLI commands- Example manifests under
examples/15-workerpool/cover all three usage patterns (pool, task-with-pool, spawner-with-pool) - All exported types and fields have godoc comments
Suggestions (non-blocking)
- WorkerPoolSpec field duplication vs WorkerConfig embedding:
WorkerPoolSpecduplicates fields (type, credentials, model, etc.) thatWorkerConfigalready groups. Consider embeddingWorkerConfiginWorkerPoolSpecwith different validation markers in a follow-up to reduce maintenance surface — though the required-vs-optional distinction makes direct embedding non-trivial, so the current approach is reasonable. WorkspaceReferenceandAgentConfigReferencelackMinLength=1: The newWorkerPoolReferencecorrectly adds it. Consider addingMinLength=1to the older reference types in a follow-up for consistency.- No CEL warning when both
worker.*and deprecated top-level fields are set simultaneously: TheEffective*()accessors resolve correctly (worker wins), but dual-specification may confuse users. A webhook admission warning could surface this during the deprecation transition. taskspawner-with-pool.yamlincludesworkspaceRefalongsideworkerPoolRef: The comment explains why (spawner discovery loop). Consider noting in the reference docs that spawners with pool-based templates still needworkspaceReffor GitHub/Linear source discovery, since the CEL rule onTaskSpawnerSpecacceptsworkerPoolRefalone but the spawner's runtime behavior may differ.
/kelos needs-input
knechtionscoding
left a comment
There was a problem hiding this comment.
Review Summary
Verdict: COMMENT
Overall correctness: patch is correct
Scope: Adds WorkerPool CRD, controller, worker-runner binary, CLI support, Helm RBAC, TaskSpec deprecation layer via worker struct, and comprehensive tests
Findings Overview
| Priority | Count | File:Line | Summary |
|---|---|---|---|
| P0 | 0 | — | none |
| P1 | 0 | — | none |
| P2 | 2 | See below | DependsOn/branch-lock not enforced for pool tasks; kelos logs does not resolve WorkerPodName |
| P3 | 1 | See below | clearPodAssignment uses MergeFrom without optimistic lock |
Findings
Correctness
-
[P2]
internal/controller/workerpool_controller.go:807— TheTaskReconcilerearly-returns for tasks withworkerPoolRef(task_controller.go:80), skippingcheckDependenciesandBranchLocker.TryAcquire. TheWorkerPoolReconciler.reconcileTaskhas no equivalent logic — it immediately callsassignTaskfor any non-terminal task via thedefaultcase. A Task with bothworkerPoolRefanddependsOn(valid per the schema) will execute immediately, ignoring its dependencies. Similarly, branch locking is bypassed. IfdependsOnis intentionally unsupported for pool-based tasks, consider adding CEL validation to reject it or documenting the limitation. Otherwise, port the dependency-check logic (or at minimum aTaskPhaseWaitingguard that skips assignment when the task is in Waiting phase). -
[P3]
internal/controller/workerpool_controller.go:954—clearPodAssignmentusesclient.MergeFromwithoutMergeFromWithOptimisticLock. Lower risk thanassignTask(since only the controller clears annotations after completion), but a concurrent runnersetTaskStatuscall could theoretically lead to a lost update on thetasks-completedcounter. Acceptable for now given annotation semantics.
CLI
- [P2]
internal/cli/logs.go:131-144—resolveTaskPodNamesearches for pods labeledkelos.dev/taskand falls back totask.Status.PodName. Worker pool pods are not labeled with the task name (assignment is via annotation), sokelos logs <task>will fail to find the pod for pool-based tasks. The function should also checktask.Status.WorkerPodNameas a fallback:
if task.Status.WorkerPodName != "" {
return task.Status.WorkerPodName, nil
}Suggestions
- [P3] Consider adding an explicit
case kelosv1alpha1.TaskPhaseWaitingarm inreconcileTaskrather than falling through to thedefaultcase — this documents intent and prevents future regressions if dependency logic is added later. - [P3] The previous review's P1 findings (optimistic lock on
assignTask, Worker field propagation in task builder) have both been addressed — confirmed at lines 856 and 112 respectively.
Key takeaways
- The core architecture (StatefulSet lifecycle, annotation-based task protocol, optimistic-lock assignment, rollback on patch failure) is solid and well-tested with unit, integration, and e2e coverage.
- Main gap is dependency/branch-lock handling for pool-based tasks — straightforward to either explicitly forbid via CEL or implement in a follow-up.
kelos logsneeds a one-line fix to support pool-based tasks.
e9ce834 to
8a4a840
Compare
|
@gjkim42 ready for review. |
Introduce a WorkerPool custom resource that manages a fleet of persistent worker pods via StatefulSet, eliminating cold-start overhead for tasks. Tasks reference a pool via spec.workerPoolRef and are dispatched to pre-warmed workers through an annotation-based protocol. Key components: - WorkerPool CRD with spec (type, credentials, replicas, storage, workspaceRef) and status (phase, readyReplicas) - WorkerPoolReconciler managing StatefulSet/Service lifecycle and task-to-pod assignment - kelos-worker-runner binary that polls for assignments and delegates to the agent entrypoint - TaskSpec restructured with worker struct + Effective*() accessors for field resolution (deprecated top-level fields preserved) - CLI: kelos get workerpools - Helm: WorkerPool CRD template + kelos-worker-runner-role ClusterRole Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8a4a840 to
dc51633
Compare
What type of PR is this?
/kind feature
What this PR does / why we need it:
Adds a
WorkerPoolCRD that manages persistent worker pods for task execution, addressing the cold-start overhead problem identified in #911. Instead of creating a new Job per task (with git clone, dependency install, etc.), tasks can reference aWorkerPooland execute on pre-warmed pods backed by a StatefulSet with persistent storage.Architecture:
WorkerPoolowns a StatefulSet + headless Service. Each pod runs akelos-worker-runnersidecar that polls for task assignments via pod annotations.WorkerPoolReconcilerhandles both infrastructure (StatefulSet lifecycle, status) and task dispatch (assigning queued tasks to available pods, monitoring completion).spec.workerPoolRef. The existing Job-based path is unchanged for tasks without a pool reference.API changes:
WorkerPool(spec: type, credentials, replicas, storage, workspaceRef, agentConfigRef, podOverrides; status: phase, replicas, readyReplicas)TaskSpecandTaskTemplategain aworkerstruct that groups execution-environment fields (type, credentials, image, model, workspaceRef, agentConfigRef, podOverrides). Top-level fields are deprecated with+deprecatedmarkers but remain functional.Effective*()accessor methods onTaskSpecresolve the newworker.*fields over deprecated top-level fields, ensuring all consumers (job builder, controller, CLI, metrics) work with both old and new manifests.spec.workerPoolRefadded to bothTaskSpecandTaskTemplate.workerPoolRef,worker.type+credentials, or deprecatedtype+credentialsmust be provided.Components added:
api/v1alpha1/workerpool_types.go— CRD typesinternal/controller/workerpool_controller.go— Reconciler (infrastructure + task dispatch)cmd/kelos-worker-runner/— Binary that runs in worker podsinternal/workerrunner/— Worker runner core logicinternal/cli/get_workerpool.go—kelos get workerpoolscommandWhich issue(s) this PR is related to:
Fixes #1310
Fixes #911
Special notes for your reviewer:
Effective*()accessors that every consumer uses.Credentialsfield changed from value to pointer on TaskSpec/TaskTemplate to make it optional whenworkerPoolRefis set. All accesses are nil-guarded via the accessor methods.WorkerPoolReconcilerhandles both WorkerPool objects and Tasks (withworkerPoolRef). It disambiguates by trying WorkerPool first, falling through to Task only on NotFound.KELOS_TOKEN_SECRETenv var for future token refresh support, but the actual refresh logic is deferred.TaskReconcilerhas an early return for tasks withworkerPoolRefset, so it never creates Jobs for pool-based tasks.Does this PR introduce a user-facing change?
Summary by cubic
Adds a
WorkerPoolCRD and controller for persistent, pre-warmed workers to remove per-task cold starts. Tasks can opt in viaspec.workerPoolRef; the existing Job path stays as-is.New Features
WorkerPoolCRD managing a StatefulSet and headless Service with status phases;WorkerPoolReconcilerprovisions the pool and assigns tasks via a pod-annotation protocol;kelos-worker-runnerruns in pods to poll, execute, and report.WorkerConfig/TaskWorkerto group execution fields onTask/TaskTemplate; introducesspec.workerPoolRef;Credentialsis now optional (pointer);Effective*()accessors resolve new vs. deprecated fields; CEL validation enforces exactly one execution source and requirescredentials.secretRefforapi-key/oauth.kelos get workerpoolswith shell completion; controller flags--worker-runner-imageand--worker-runner-image-pull-policy; chart addsworkerpools.kelos.devCRD and akelos-worker-runner-role, and updates Task/TaskSpawner CRDs for the new fields/validations; Makefile buildscmd/kelos-worker-runner; examples underexamples/15-workerpool.Migration
WorkerPool, then setspec.workerPoolRefon aTaskorTaskSpawner.Written for commit dc51633. Summary will update on new commits.