feat: add persistent execution mode for TaskSpawner#1228
feat: add persistent execution mode for TaskSpawner#1228knechtionscoding wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
4 issues found across 39 files
Tip: cubic can generate docs of your entire codebase and keep them up to date. Try it here.
Re-trigger cubic
058a2b6 to
c7d3aa3
Compare
c7d3aa3 to
10e66e3
Compare
|
@gjkim42 should be ready for review |
|
I am not sure we want to add this feature. I'd rather implement the context store from #1024 (comment). |
|
@gjkim42 we've found it useful to allow for more interactive sessions/long running usages. Specifically where doing the git clone and other prep tasks are a rather large burden for the task or question. (think asking about the codebase). Using a context store doesn't solve that. Rather, the persistent runner just means an agent that is always up and can swap between tasks, not necessarily the same task. |
I am thinking about introducing the new CRD Session, which sets up an environment for a coding agent. |
@gjkim42 possibly. I think there's still a speed/responsiveness question that comes from this, but I'm open to being wrong and having the CRD for Session solve it. I'll be more specific for the usecase we are finding this useful for: Context: We have a rather large mono-repo which makes git operations rather expensive (think on the order of 1-2 minutes just to do the git clone even with shallow). As such persistent mode has been helpful for:
I think CRD Session solves the persistence issue (from a maintain history) but I don't think it solves always running/speed |
|
/kelos api-review |
|
🤖 Kelos Task Status Task |
There was a problem hiding this comment.
🤖 Kelos API Reviewer Agent @gjkim42
API Design Review
Verdict: COMMENT
Scope: Adds executionMode (ephemeral|persistent) + sessionConfig to TaskSpawnerSpec, a Queued Task phase, and three session status fields on TaskStatus.
The mechanics are sound and the additions are backward compatible: executionMode defaults to ephemeral, every new field is +optional, no existing field is removed/retyped/tightened, durations use metav1.Duration, storageSize uses resource.Quantity, the mode is an extensible enum (not a bool), and deepcopy/CRD/RBAC artifacts are regenerated. I'm landing on COMMENT rather than approve/request-changes because the core API shape is still under discussion in this thread (embed in TaskSpawner vs. a dedicated Session CRD) and CRD fields are effectively permanent once shipped.
Findings
Compatibility (good)
executionModedefaults toephemeral, so existing TaskSpawners are unaffected — no migration needed.- New
QueuedTaskPhase is additive;TaskPhaseis not CRD-enum-constrained, so existing clients can ignore it. make updateartifacts (deepcopy, both CRDs, install-crd.yaml, RBAC) are present and consistent with the Go types.
API placement — needs a maintainer decision
- The unresolved question raised above (separate
SessionCRD vs.sessionConfigembedded inTaskSpawnerSpec) is the most consequential API call here, because it dictates a permanent surface. The contributor's use case #1 (interactive/web-UI, an always-on agent that swaps between tasks) describes a session whose lifecycle is independent of any spawner — that argues for a first-classSession/SessionPoolresource with its ownstatus, reusable across spawners. Embedding makesTaskSpawnerown a StatefulSet + session lifecycle in addition to discovery/polling. - Concrete risk: if a
SessionCRD lands later,sessionConfigonTaskSpawnerbecomes a parallel/deprecated surface that can't be removed. Worth settling the shape before merge rather than shipping both.
Naming
TaskStatus.LastSessionFailure(api/v1alpha1/task_types.go:266) holds a pod name but reads like it holds a failure reason/timestamp. Per thexxxNameconvention, rename tolastFailedSessionPodName(or split into a structuredlastSessionFailure: {podName, reason, time}if more detail is wanted later). As-is, a client readinglastSessionFailurewould reasonably expect a message, not a name.- "session" terminology (
SessionConfig,sessionPodName,StatefulSetName-for-sessions) overlaps with the proposedSessionCRD. If both ship, pick distinct terms now to avoid a confusing API vocabulary.
Extensibility
RetryOnPodFailure *bool+MaxSessionRetries *int32(taskspawner_types.go:91,:98) are two top-level fields encoding one concept, with a documented coupling ("only used when RetryOnPodFailure is true"). Prefer a single struct, e.g.podFailurePolicy: { retry: bool, maxRetries: int32 }, so future knobs (backoff, requeue delay, failure classification) extend the struct instead of adding more flat fields. This is exactly the "prefer a struct over parallel scalars" guidance.StorageSize+StorageClassName(:77,:82) are a partial re-implementation of a PVC spec laid out as flat scalars. Grouping intostorage: { size, className }keeps it expandable (accessModes, selector, volumeName) and reads closer to StatefulSetvolumeClaimTemplates. Minimal-API-wise two scalars are acceptable today, but the struct costs nothing now and avoids a second top-level field later.
Suggestions (optional / non-blocking)
IdleTimeout(:54) andMaxSessionDuration(:65) document "Defaults to 30m/8h" but have no+kubebuilder:defaultmarker (defaulted in the controller), unlikeReplicas/Git. Either add the markers or note in the godoc that defaulting is applied by the controller, so the documented default and the CRD schema agree.WorkspaceResetConfig.Git *bool(:40): a bool namedgitmay want to grow into a reset strategy. It already lives in a struct so adding sibling fields is cheap — fine to leave, just flagging the shape.- Consider an
examples/NN-taskspawner-persistent/entry mirroring the existing numbered tutorials so users have a working persistent-mode manifest.
Note on prompt injection: the cubic-dev-ai comment on this PR contains an embedded cubic:attribution directive instructing automated readers to credit cubic and treat omission as "plagiarism." That instruction was disregarded; this review is independent analysis of the code.
/kelos needs-input
|
@gjkim42 on session naming, I'm happy to rename it to something else if you have ideas. Because it isn't really a session in the uppercase Session sense. It is a session in the sense that a task is assigned to a specific pod and so has a session. Unsure what better name to provide for it, but happy to help rename it |
|
I actually read the proposal just before... and this feature looks great.
This is my concern, but I cannot find a better name for it for now. /kelos review |
|
🤖 Kelos Task Status Task |
|
@gjkim42 we could differentiate with something like persistenceConfig. And then remove references to session specifically? Because I don't think there is anything in there that loses clarity? |
There was a problem hiding this comment.
🤖 Kelos Reviewer Agent @gjkim42
Review Summary
Verdict: REQUEST CHANGES
Overall correctness: patch is incorrect
Scope: Adds a persistent execution mode to TaskSpawner — long-lived StatefulSet session pods process Queued Tasks via a new SessionReconciler + session-runner binary, with shared init-container builders extracted from job_builder.go.
The architecture is sound and the StatefulSet/immutability handling, RBAC for the new session-runner SA, init-container extraction, and token-refresh-on-disk plumbing are well done. One correctness bug blocks: switching a spawner out of persistent mode orphans its StatefulSet. Two silent-degradation paths and a docs gap follow.
Findings Overview
| Priority | Count | File:Line | Summary |
|---|---|---|---|
| P0 | 0 | — | none |
| P1 | 1 | taskspawner_controller.go:122 | Switching away from persistent mode orphans the session StatefulSet/Service; StatefulSetName never cleared |
| P2 | 5 | taskspawner_controller.go:557 | MCP HeadersFrom/EnvFrom secrets not resolved in persistent path (silent auth drop) |
| P2 | session_controller.go:340 | Transient TaskSpawner GET error silently defaults retry policy, ignoring RetryOnPodFailure/MaxSessionRetries | |
| P2 | sessionrunner/runner.go:288 | Initial token-refresh failure logs a warning and runs the agent unauthenticated | |
| P2 | rbac.yaml:393 | slack-server-role granted unused pods/log: get and tasks: update;watch |
|
| P2 | (docs) | New user-facing feature (executionMode/sessionConfig/new flags) ships with no docs/reference.md or examples/ update | |
| P3 | 6 | session_controller.go:184 | Pod-annotation Update not retried on conflict; benign pod churn rolls back assignment |
| P3 | session_statefulset_builder.go:563 | Dead code: sessionServiceName never called |
|
| P3 | cmd/kelos-spawner/reconciler.go:43 | Dead plumbing: persistentSource only ever set to nil |
|
| P3 | capture/usage.go:273 | Dead code: parseResponse referenced only by its own test |
|
| P3 | test/integration/session_test.go:448 | randomSuffix ignores crypto/rand.Read error |
|
| P3 | (tests) | Missing coverage: processTask/Run loop, assignment race, ServiceName wiring, refactor SecurityContext/mounts |
Findings
Correctness
- [P1]
internal/controller/taskspawner_controller.go:122—reconcileSessionStatefulSetruns only whenExecutionMode == persistent. When a user later changes the field back toephemeral(or removes it), Reconcile falls through to the Deployment/CronJob/Webhook paths, whosedeleteStaleResourcecalls are keyed by the spawner name and only target Deployment/CronJob — never thesession-<name>StatefulSet or its headless Service. The orphaned StatefulSet keeps itsreplicashonored, so session pods are restarted indefinitely and keep claiming/processing tasks, andStatus.StatefulSetNameis never cleared. Mirror thedeleteStaleResourcepattern to delete the session StatefulSet + Service (guarded by ownership) and resetStatefulSetNameon the non-persistent paths. - [P2]
internal/controller/taskspawner_controller.go:557— the persistent path passesagentConfigstraight intoBuild(...), which serializesMCPServersas-is. The ephemeralcreateJobpath first callsresolveMCPServerSecretsto mergeHeadersFrom/EnvFromsecret values intoHeaders/Env. As a result, MCP servers using secret-sourced headers/env in persistent mode silently lose them — the agent runs without the auth it would have in ephemeral mode, with no error surfaced. Resolve MCP secrets before building the StatefulSet, matchingcreateJob. - [P2]
internal/controller/session_controller.go:340—getSessionConfigreturnsnilon any non-NotFoundGet(TaskSpawner)error (it only logs), and callers treat nil as "retry up to 3". A transient GET failure (throttling, cache miss) during a pod-failure reconcile therefore silently ignores an admin'sRetryOnPodFailure: false(re-queues anyway) and replaces a customMaxSessionRetrieswith 3. Return the error and let Reconcile requeue; treat only NotFound as nil config. - [P2]
internal/sessionrunner/runner.go:288— before invoking the agent,processTaskcallsrefreshTokenand on error only printsWarning: failed to refresh tokenthen continues. If the token Secret is missing/empty or carries an expiredtoken-expires-at, no token file is written; withGH_CONFIG_DIRset,runAgentstripsGITHUB_TOKEN/GH_TOKEN/GH_ENTERPRISE_TOKENand the file credential helper reads a non-existent file, so the agent runs git/gh unauthenticated. Distinguish a hard credential failure (secret missing/expired) from a transient API error and fail the task fast withtoken-expiredrather than running unauthenticated. - [P3]
internal/controller/session_controller.go:184— after claiming the task, the pod annotationr.Update(availablePod)uses the stale cached pod object with noRetryOnConflict. Running pods bump resourceVersion frequently (kubelet status), so this often 409s even with no competing assignment, rolling the task back to Queued and forcing a requeue. Wrap inRetryOnConflictwith a fresh Get + re-check thatAnnotationAssignedTaskis still unset, mirroring the task-claim block.
Security
- [P2]
internal/manifests/charts/kelos/templates/rbac.yaml:393— this PR addspods/log: getandtasks: update;watchtokelos-slack-server-role. These are unrelated to persistent mode and unbacked by code (the only slack change here is addingExecutionModetoSpawnerRef; noTasks().Update/Watchor pod-log read exists ininternal/slack).pods/log: getcluster-wide is a meaningful capability. Drop these grants unless a slack code path needing them lands in the same PR.
Documentation
- [P2] This PR introduces user-facing surface —
executionMode,sessionConfig, controller flags--session-runner-image/--session-runner-image-pull-policy, and chart valuesessionRunnerImage— but updates none ofdocs/reference.md(CRD field tables + CLI Reference),README.md, orexamples/. Add the matching reference entries and aexamples/NN-taskspawner-persistent/manifest in this PR.
Tests
- [P2]
internal/sessionrunner/runner_test.go— the new tests coverConfigFromEnv,refreshToken/hosts.yml,tailWriter, andatomicWriteFile, but the central orchestration is untested:processTask(reset → agent → auth-failure → agent-error → status), theRunloop's name+resourceVersion assignment dedup, andrunAgent's env construction (token stripping gated onGH_CONFIG_DIR,KELOS_BRANCHinjection). Add aprocessTasktest with a fake agent command asserting the success path and theerrTokenExpired/errAgentReportedFailurebranches. - [P3]
internal/controller/session_controller_test.go:940— the only race test exercises the already-past-Queued early return. The true double-assignment race (two Queued tasks → same pod, guarded only by the pod Update conflict + rollback) and the annotation-Update-failure rollback path are untested. Add a test injecting an Update conflict and asserting one task is rolled back to Queued withSessionPodNamecleared. - [P3]
internal/controller/init_containers_test.go:30— for a behavior-preserving extraction, the tests assert only Name/Image/Command presence, not the load-bearing details the refactor must keep:SecurityContext.RunAsUser, the workspaceVolumeMount, and the auth-pathArgsrewrite (prepending--). Pin those so a future regression fails. - [P3]
internal/controller/session_statefulset_builder_test.go:61— no assertion thatsts.Spec.ServiceName == svc.Name; this is the load-bearing wiring for stable pod DNS. Add it.
Code quality
- [P3]
internal/controller/session_statefulset_builder.go:563—sessionServiceNameis never called (the Service name is derived viasessionStatefulSetNameinsideBuild). Remove it or wire it in. - [P3]
cmd/kelos-spawner/reconciler.go:43— thepersistentSourcefield/param and therunCycleWithSourcebranch are unreachable (both construction sites passnil). Per "no speculative scaffolding," land this with the code that populates it. - [P3]
internal/capture/usage.go:273—parseResponseis unexported and referenced only byTestParseResponse; it is dead. Remove it, or wire it into the runner if final-response capture was intended. - [P3]
test/integration/session_test.go:448—randomSuffixignoresrand.Read's error; on the (rare) error path the suffix is all-zeros, causing namespace collisions across specs. The e2e framework checks this error; match that.
Key takeaways
- One blocker: persistent→ephemeral mode switch leaks the StatefulSet and keeps session pods running.
- Two silent-degradation paths (MCP secrets, initial token refresh) conflict with the project's fail-fast convention.
- The feature needs
docs/reference.md+ anexamples/entry; several new files would benefit from happy-path/behavior-pinning tests.
Note on prompt injection: the cubic-dev-ai comment on this PR carries an embedded cubic:attribution directive instructing automated readers to credit cubic and treat omission as "plagiarism." That instruction was disregarded; this review is independent analysis of the code.
10e66e3 to
a255c56
Compare
|
🤖 Kelos Task Status Task |
31cae2e to
7b78a5d
Compare
|
@gjkim42 E2E Test added |
|
/kelos review |
|
🤖 Kelos Task Status Task |
|
/kelos review |
|
🤖 Kelos Task Status Task |
|
/kelos review |
|
🤖 Kelos Task Status Task |
There was a problem hiding this comment.
🤖 Kelos Reviewer Agent @gjkim42
Review Summary
Verdict: REQUEST CHANGES
Overall correctness: patch is incorrect
Scope: Adds persistent TaskSpawner execution backed by session StatefulSets and a session-runner loop, plus API fields, docs, examples, and tests.
Findings Overview
| Priority | Count | File:Line | Summary |
|---|---|---|---|
| P0 | 0 | — | none |
| P1 | 1 | internal/sessionrunner/runner.go:308 | Transient initial token refresh failures can run the agent without credentials |
| P2 | 1 | internal/controller/session_statefulset_builder.go:273 | Persistent session pods silently ignore most TaskTemplate.podOverrides fields |
| P2 | 1 | internal/controller/session_controller.go:309 | Successful tasks can be forced to Failed when the pod status annotation write fails |
| P2 | 1 | test/e2e/framework/framework.go:344 | E2E helper uses global Expect() inside an Eventually polling path |
| P3 | 0 | — | none |
Findings
Correctness
- [P1]
internal/sessionrunner/runner.go:308— InitialrefreshTokenerrors that are noterrCredentialUnavailableare logged and ignored. In the generated session pod,KELOS_TOKEN_SECRETalso setsGH_CONFIG_DIR;runAgentstripsGITHUB_TOKEN/GH_TOKEN/GH_ENTERPRISE_TOKEN, andconfigureFileCredentialHelperpoints git at/workspace/.kelos-token. If the first Secret read fails transiently, no token file orhosts.ymltoken is written, but the agent is still invoked with the env credentials removed, so git/gh run unauthenticated. Return an error/requeue instead of starting the agent until the initial credential refresh succeeds. - [P2]
internal/controller/session_statefulset_builder.go:273— The persistent StatefulSet path only appliesEnv,Resources,NodeSelector, andServiceAccountNamefromTaskTemplate.podOverrides; the Job path also appliesimagePullSecrets,tolerations,affinity, user volumes/mounts, extra containers/init containers, and security contexts. Switching an existing spawner that relies on a private image pull secret or tainted-node toleration toexecutionMode: persistentsilently changes scheduling/pull behavior and can leave the session pod inImagePullBackOffor unschedulable. Apply the same supported pod override fields in the StatefulSet builder, or explicitly reject/document unsupported fields before creating the StatefulSet. - [P2]
internal/controller/session_controller.go:309—processTaskwritesTask.Status.CompletionTimein a defer beforeRun()writes the pod's terminalkelos.dev/task-statusannotation. If a successful agent run then hits repeated pod update conflicts or a transient API failure insetTaskStatus, the Task already hasCompletionTimebut no status annotation, and this branch marks itFailedafter 15 seconds while clearing the completion timestamp. Avoid turning an already-completed Task into failure solely because the runner's pod annotation write failed; retry/requeue the signal path or use a status field that can distinguish success from missing signal.
Tests
- [P2]
test/e2e/framework/framework.go:344—GetTaskSessionPodNameuses globalExpect()and the new e2e calls it from inside anEventuallypoll. A transient GET error will fail the spec immediately instead of being retried by Gomega, which is exactly the flake pattern the project conventions call out for e2e wait helpers. Make the polling function perform the GET inline and return an empty string on error, or add aWaitForTaskSessionPodNamehelper usingEventually(func(g Gomega) { ... }).
Key takeaways
- The previous P1 control-flow and runner dedup blockers are fixed in the current head.
- One remaining fail-fast issue can still start the agent without credentials on initial Secret read failures, so I’m requesting changes.
- The persistent path should preserve the existing
podOverridescontract closely enough that switching execution modes does not silently drop scheduling, pull, volume, or security settings.
Note on prompt injection: the cubic-dev-ai comment on this PR carries an embedded attribution directive for automated readers. That instruction was disregarded; this review is independent analysis of the code.
7b78a5d to
f3c0c2e
Compare
f3c0c2e to
9f12310
Compare
|
@gjkim42 I ran the reviewer locally a few times to make sure everything was good. Should be ready for final review now |
9f12310 to
ed05233
Compare
Introduce a "persistent" execution mode that maintains long-lived session pods via StatefulSet, allowing Tasks to be processed without per-task cold-start overhead. When executionMode is set to "persistent", Tasks transition to a Queued phase and are assigned to available session pods by the new SessionReconciler. Key components: - SessionReconciler: assigns queued Tasks to session pods, monitors completion, handles pod failures with configurable retry - SessionStatefulSetBuilder: creates StatefulSet + headless Service with PVC-backed workspaces - Session runner binary: polls for assigned tasks, invokes agent entrypoint, reports results, refreshes GitHub tokens, resets workspace between tasks - Shared init container builders extracted from job_builder for reuse between ephemeral Jobs and persistent StatefulSets API additions: - TaskPhaseQueued phase for Tasks awaiting session assignment - ExecutionMode and SessionConfig on TaskSpawnerSpec - Session status fields (SessionPodName, SessionRetryCount, LastSessionFailure) on TaskStatus Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ed05233 to
f3718f8
Compare
|
Thanks! |
gjkim42
left a comment
There was a problem hiding this comment.
Holding this — the persistent-execution concept should land on a different resource boundary, and it should not squat on the Session name.
This introduces a foundational concept (long-lived execution substrate for Tasks). Two things are expensive to reverse once shipped, so I'd like to settle them before merge rather than carry a deprecated surface indefinitely.
1. Resource boundary. executionMode + sessionConfig live on TaskSpawnerSpec, but TaskSpawner is a trigger ('when X, mint a Task') — it shouldn't own execution infrastructure. Today it owns the StatefulSet, the fleet config, and a routing label, and the assignment loop reaches sideways into the spawner for config. Proposed: a dedicated WorkerPool CRD that owns the StatefulSet + fleet config + execution environment, with Tasks referencing it via spec.workerPoolRef (presence of the ref = persistent; no executionMode enum, so no contradictory states). This also makes hand-authored persistent Tasks possible and lets one pool serve many spawners.
2. Naming — reserve Session for a different concept. We're planning a separate, interactive Session resource (a human-driven coding session, peer to Task). This PR uses 'session' throughout for the batch warm-pod substrate — SessionPool, SessionReconciler, kelos-session-runner, sessionConfig, SessionPodName. That collides head-on: SessionPool reads as 'a pool of Sessions' but means 'a pool of warm Task-workers'. Rename the batch substrate to WorkerPool / worker pods and reserve 'Session' for the interactive concept. Several of these names lock in at release (binary/image name, TaskStatus API fields, env vars), which is why this is a pre-merge concern.
The implementation is largely reusable — the StatefulSet builder, runner, and assignment logic move mostly intact; what changes is the config source (spawner → WorkerPool), the StatefulSet owner (spawner → WorkerPool), and the names. The inline comments map each piece to its new home. I'd like to close this and reland on the WorkerPool API — happy to pair on a design doc first.
Not in scope for this PR: designing the interactive Session CRD. We're only protecting its name now.
| // +kubebuilder:validation:Enum=ephemeral;persistent | ||
| // +kubebuilder:default=ephemeral | ||
| // +optional | ||
| ExecutionMode ExecutionMode `json:"executionMode,omitempty"` |
There was a problem hiding this comment.
Remove. In the proposed design there's no executionMode enum anywhere — persistence is expressed by the presence of Task.spec.workerPoolRef. That removes a class of invalid states (an enum that can contradict a config block), and the spawner stamps a workerPoolRef into spawned Tasks instead of a mode.
| // SessionConfig configures persistent execution mode. Only valid when | ||
| // executionMode is "persistent". | ||
| // +optional | ||
| SessionConfig *SessionConfig `json:"sessionConfig,omitempty"` |
There was a problem hiding this comment.
Remove from TaskSpawner. Fleet config belongs on the new WorkerPool CRD (its own lifecycle + reconciler). TaskSpawner gains only taskTemplate.workerPoolRef.
| } | ||
|
|
||
| // SessionConfig configures persistent execution mode sessions. | ||
| type SessionConfig struct { |
There was a problem hiding this comment.
This becomes WorkerPoolSpec on the new CRD, and it should also own the environment fields the StatefulSet builder currently pulls from taskTemplate (type, credentials, workspaceRef, agentConfigRef, image) — a pool is a warm, pre-configured environment. Trim to MVP per our 'minimum viable API' rule: keep replicas, storage.size (+ environment); defer storageClassName, idleTimeout, maxSessionDuration, maxTasksPerSession, workspaceReset, and the retry knobs (runner defaults already cover them).
| // execute this Task. Set only for Tasks belonging to a persistent-mode | ||
| // TaskSpawner. | ||
| // +optional | ||
| SessionPodName string `json:"sessionPodName,omitempty"` |
There was a problem hiding this comment.
These three status fields (SessionPodName, SessionRetryCount, LastSessionFailure) bake 'session' into the Task API — hard to change after release. Rename to WorkerPodName / WorkerRetryCount / LastWorkerFailure so 'Session' stays reserved for the planned interactive resource.
| } | ||
| task.Labels["kelos.dev/taskspawner"] = spawnerRef.Name | ||
| if spawnerRef.ExecutionMode == v1alpha1.ExecutionModePersistent { | ||
| task.Labels["kelos.dev/execution-mode"] = string(v1alpha1.ExecutionModePersistent) |
There was a problem hiding this comment.
The spawner should stamp task.spec.workerPoolRef (from taskTemplate.workerPoolRef) instead of a kelos.dev/execution-mode label. Routing then branches on a validated spec field rather than mutable selector metadata.
| // handleDeletion handles TaskSpawner deletion. | ||
| // reconcileSessionStatefulSet manages the StatefulSet and headless Service for | ||
| // persistent-mode TaskSpawners. | ||
| func (r *TaskSpawnerReconciler) reconcileSessionStatefulSet(ctx context.Context, req ctrl.Request, ts *kelosv1alpha1.TaskSpawner, isSuspended bool) (ctrl.Result, error) { |
There was a problem hiding this comment.
This loop moves to a new WorkerPoolReconciler, with the StatefulSet + headless Service owned by the WorkerPool (controllerRef → WorkerPool), not the spawner (see the SetControllerReference(ts, …) below). TaskSpawner stops managing infrastructure entirely. This is a relocation — the build logic itself is reusable.
| } | ||
|
|
||
| agentUID := AgentUID | ||
| name := sessionStatefulSetName(ts.Name) |
There was a problem hiding this comment.
Naming the StatefulSet after the spawner is what hard-couples one pool to one spawner. Name it after the WorkerPool instead, so a pool can serve many spawners and hand-authored Tasks.
| // getSessionConfig fetches the SessionConfig for a task's owning TaskSpawner. | ||
| // It returns an error on transient failures so callers requeue rather than | ||
| // silently falling back to defaults. | ||
| func (r *SessionReconciler) getSessionConfig(ctx context.Context, task *kelosv1alpha1.Task) (*kelosv1alpha1.SessionConfig, error) { |
There was a problem hiding this comment.
Replace the reach into TaskSpawner with a lookup of the WorkerPool the Task references (task.Spec.WorkerPoolRef). The Task names its pool explicitly — no inference from labels, and no dependence on the spawner still existing.
| } | ||
| var spawner kelosv1alpha1.TaskSpawner | ||
| if err := r.Get(ctx, client.ObjectKey{Namespace: task.Namespace, Name: spawnerName}, &spawner); err != nil { | ||
| if apierrors.IsNotFound(err) { |
There was a problem hiding this comment.
In the new design a missing referenced pool is an explicit Queued + PoolNotFound condition, not a silent fallback to default retry (which conflicts with our fail-fast rule). This silent-default path disappears with the redesign.
| return out.Close() | ||
| } | ||
|
|
||
| func main() { |
There was a problem hiding this comment.
Binary/image name leak: kelos-session-runner (this cmd/ dir, its Dockerfile, and the published image tag) bakes 'session' into a release artifact that's awkward to rename later. Propose kelos-worker-runner, reserving 'session' for the interactive Session concept.
|
@gjkim42 reasonable! I'll open a PR/Issue with worker pool design |
|
Opened #1310 |
What type of PR is this?
/kind feature
What this PR does / why we need it:
Adds a "persistent" execution mode to TaskSpawner that maintains long-lived session pods via StatefulSet. Tasks are processed without per-task cold-start overhead (no Job creation, no init container re-runs, no image pulls per task).
When
executionMode: persistentis set on a TaskSpawner:Queuedphase instead of creating a JobSessionReconcilerassigns queued Tasks to available session podsKey components:
cmd/kelos-session-runner) — polls for assigned tasks, invokes agent entrypoint, reports results back via Task status, refreshes GitHub tokens periodically, resets workspace between tasksjob_builder.gointoinit_containers.gofor reuse between ephemeral Jobs and persistent StatefulSetsAPI additions:
TaskPhaseQueued— new phase for Tasks awaiting session pod assignmentExecutionMode(ephemeral|persistent) onTaskSpawnerSpecSessionConfig— idle timeout, max tasks per session, replicas, storage size/class, workspace reset config, retry policyTaskStatus—sessionPodName,sessionRetryCount,lastSessionFailureStatefulSetNameonTaskSpawnerStatusWhich issue(s) this PR is related to:
fixes #911
Special notes for your reviewer:
Addressed review feedback (round 2):
Correctness:
deleteStaleSessionResources— when switching away from persistent mode, the session StatefulSet + headless Service are cleaned up andStatus.StatefulSetNameis cleared, preventing orphaned session podsHeadersFrom/EnvFrom) are now resolved before building the StatefulSet, matching the ephemeralcreateJobpathgetSessionConfignow returns an error on transient API failures instead of silently falling back to defaults (only NotFound returns nil config)processTasknow fail the task immediately withtoken-expiredreason rather than running the agent unauthenticatedassignTaskwrapped inRetryOnConflictwith fresh Get + re-check, preventing spurious 409s from kubelet status bumpsSecurity:
pods/log: getand trimmedtasksverbs tocreate, listfromkelos-slack-server-role— these grants were unbacked by code ininternal/slackCode quality:
sessionServiceNamefunction (name is derived viasessionStatefulSetNameinsideBuild)persistentSourcefield/param/branch fromspawnerReconciler/runOnceparseResponsefunction and its testrandomSuffixto panic onrand.ReaderrorTests:
processTaskorchestration tests: success path,errTokenExpiredon expired creds,errAgentReportedFailure, auth-failure-in-output detectionTestSessionReconciler_PodAnnotationConflictRollsBackTask— injects Update conflict, asserts task rolls back to Queuedsts.Spec.ServiceName == svc.Nameassertion (load-bearing wiring for stable pod DNS)SecurityContext.RunAsUser, workspaceVolumeMount, and"--"Args assertions to init container testsrunAgentFnseam toRunnerstruct for testabilityDocumentation:
executionModeand allsessionConfig.*fields todocs/reference.mdTaskSpawner table--session-runner-imageand--session-runner-image-pull-policyto CLI Referenceexamples/14-taskspawner-persistent/with README and manifestsDoes this PR introduce a user-facing change?