feat: add persistent execution mode for TaskSpawner by knechtionscoding · Pull Request #1228 · kelos-dev/kelos

knechtionscoding · 2026-05-28T16:01:19Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds a "persistent" execution mode to TaskSpawner that maintains long-lived session pods via StatefulSet. Tasks are processed without per-task cold-start overhead (no Job creation, no init container re-runs, no image pulls per task).

When executionMode: persistent is set on a TaskSpawner:

Spawned Tasks transition to a new Queued phase instead of creating a Job
A SessionReconciler assigns queued Tasks to available session pods
The session runner inside the pod picks up the task, invokes the agent, reports results, and resets the workspace for the next task

Key components:

SessionReconciler — assigns queued Tasks to session pods, monitors completion via pod annotations, handles pod failures with configurable retry and re-queue
SessionStatefulSetBuilder — creates StatefulSet + headless Service with PVC-backed workspaces per replica
Session runner binary (cmd/kelos-session-runner) — polls for assigned tasks, invokes agent entrypoint, reports results back via Task status, refreshes GitHub tokens periodically, resets workspace between tasks
Shared init container builders — extracted from job_builder.go into init_containers.go for reuse between ephemeral Jobs and persistent StatefulSets

API additions:

TaskPhaseQueued — new phase for Tasks awaiting session pod assignment
ExecutionMode (ephemeral | persistent) on TaskSpawnerSpec
SessionConfig — idle timeout, max tasks per session, replicas, storage size/class, workspace reset config, retry policy
Session status fields on TaskStatus — sessionPodName, sessionRetryCount, lastSessionFailure
StatefulSetName on TaskSpawnerStatus

Which issue(s) this PR is related to:

fixes #911

Special notes for your reviewer:

Addressed review feedback (round 2):

Correctness:

P1: Added deleteStaleSessionResources — when switching away from persistent mode, the session StatefulSet + headless Service are cleaned up and Status.StatefulSetName is cleared, preventing orphaned session pods
P2: MCP server secrets (HeadersFrom/EnvFrom) are now resolved before building the StatefulSet, matching the ephemeral createJob path
P2: getSessionConfig now returns an error on transient API failures instead of silently falling back to defaults (only NotFound returns nil config)
P2: Hard credential failures (secret missing key, token expired) in processTask now fail the task immediately with token-expired reason rather than running the agent unauthenticated
P3: Pod annotation update in assignTask wrapped in RetryOnConflict with fresh Get + re-check, preventing spurious 409s from kubelet status bumps

Security:

P2: Removed pods/log: get and trimmed tasks verbs to create, list from kelos-slack-server-role — these grants were unbacked by code in internal/slack

Code quality:

Removed dead sessionServiceName function (name is derived via sessionStatefulSetName inside Build)
Removed unreachable persistentSource field/param/branch from spawnerReconciler/runOnce
Removed dead parseResponse function and its test
Fixed randomSuffix to panic on rand.Read error

Tests:

Added processTask orchestration tests: success path, errTokenExpired on expired creds, errAgentReportedFailure, auth-failure-in-output detection
Added TestSessionReconciler_PodAnnotationConflictRollsBackTask — injects Update conflict, asserts task rolls back to Queued
Added sts.Spec.ServiceName == svc.Name assertion (load-bearing wiring for stable pod DNS)
Added SecurityContext.RunAsUser, workspace VolumeMount, and "--" Args assertions to init container tests
Added runAgentFn seam to Runner struct for testability

Documentation:

Added executionMode and all sessionConfig.* fields to docs/reference.md TaskSpawner table
Added --session-runner-image and --session-runner-image-pull-policy to CLI Reference
Created examples/14-taskspawner-persistent/ with README and manifests

Does this PR introduce a user-facing change?

Add persistent execution mode for TaskSpawner. Set `executionMode: persistent` to maintain long-lived session pods that process multiple tasks without cold-start overhead. Configurable via `sessionConfig` (replicas, idle timeout, storage, retry policy).

cubic-dev-ai

4 issues found across 39 files

_{Tip: cubic can generate docs of your entire codebase and keep them up to date. Try it here.

Re-trigger cubic}

knechtionscoding · 2026-05-29T10:56:33Z

@gjkim42 should be ready for review

gjkim42 · 2026-05-29T13:08:56Z

I am not sure we want to add this feature.
Is it useful?

I'd rather implement the context store from #1024 (comment).

knechtionscoding · 2026-05-29T13:20:09Z

@gjkim42 we've found it useful to allow for more interactive sessions/long running usages. Specifically where doing the git clone and other prep tasks are a rather large burden for the task or question. (think asking about the codebase). Using a context store doesn't solve that. Rather, the persistent runner just means an agent that is always up and can swap between tasks, not necessarily the same task.

gjkim42 · 2026-05-29T13:23:12Z

Rather, the persistent runner just means an agent that is always up and can swap between tasks, not necessarily the same task.

I am thinking about introducing the new CRD Session, which sets up an environment for a coding agent.
Does it cover those use cases?

knechtionscoding · 2026-05-29T13:32:16Z

I am thinking about introducing the new CRD Session, which sets up an environment for a coding agent.
Does it cover those use cases?

@gjkim42 possibly. I think there's still a speed/responsiveness question that comes from this, but I'm open to being wrong and having the CRD for Session solve it.

I'll be more specific for the usecase we are finding this useful for:

Context: We have a rather large mono-repo which makes git operations rather expensive (think on the order of 1-2 minutes just to do the git clone even with shallow).

As such persistent mode has been helpful for:

Interacting with it from a web ui/desktop app similar to claude. Where we don't want to check out the repo each time a message is sent but we do want to maintain the same/consistent configuration that we hvae in kelos and provide responsiveness.
Where we know we have a consistent throughput of tasks such that swapping between branches and sessions is much cheaper than running git clone every time.

I think CRD Session solves the persistence issue (from a maintain history) but I don't think it solves always running/speed

gjkim42 · 2026-05-29T13:36:43Z

/kelos api-review

kelos-bot · 2026-05-29T13:36:48Z

🤖 Kelos Task Status

Task kelos-api-reviewer-issue-comment-bf822513d881 has succeeded. ✅

kelos-bot

🤖 Kelos API Reviewer Agent @gjkim42

API Design Review

Verdict: COMMENT
Scope: Adds executionMode (ephemeral|persistent) + sessionConfig to TaskSpawnerSpec, a Queued Task phase, and three session status fields on TaskStatus.

The mechanics are sound and the additions are backward compatible: executionMode defaults to ephemeral, every new field is +optional, no existing field is removed/retyped/tightened, durations use metav1.Duration, storageSize uses resource.Quantity, the mode is an extensible enum (not a bool), and deepcopy/CRD/RBAC artifacts are regenerated. I'm landing on COMMENT rather than approve/request-changes because the core API shape is still under discussion in this thread (embed in TaskSpawner vs. a dedicated Session CRD) and CRD fields are effectively permanent once shipped.

Findings

Compatibility (good)

executionMode defaults to ephemeral, so existing TaskSpawners are unaffected — no migration needed.
New Queued TaskPhase is additive; TaskPhase is not CRD-enum-constrained, so existing clients can ignore it.
make update artifacts (deepcopy, both CRDs, install-crd.yaml, RBAC) are present and consistent with the Go types.

API placement — needs a maintainer decision

The unresolved question raised above (separate Session CRD vs. sessionConfig embedded in TaskSpawnerSpec) is the most consequential API call here, because it dictates a permanent surface. The contributor's use case #1 (interactive/web-UI, an always-on agent that swaps between tasks) describes a session whose lifecycle is independent of any spawner — that argues for a first-class Session/SessionPool resource with its own status, reusable across spawners. Embedding makes TaskSpawner own a StatefulSet + session lifecycle in addition to discovery/polling.
Concrete risk: if a Session CRD lands later, sessionConfig on TaskSpawner becomes a parallel/deprecated surface that can't be removed. Worth settling the shape before merge rather than shipping both.

Naming

TaskStatus.LastSessionFailure (api/v1alpha1/task_types.go:266) holds a pod name but reads like it holds a failure reason/timestamp. Per the xxxName convention, rename to lastFailedSessionPodName (or split into a structured lastSessionFailure: {podName, reason, time} if more detail is wanted later). As-is, a client reading lastSessionFailure would reasonably expect a message, not a name.
"session" terminology (SessionConfig, sessionPodName, StatefulSetName-for-sessions) overlaps with the proposed Session CRD. If both ship, pick distinct terms now to avoid a confusing API vocabulary.

Extensibility

RetryOnPodFailure *bool + MaxSessionRetries *int32 (taskspawner_types.go:91,:98) are two top-level fields encoding one concept, with a documented coupling ("only used when RetryOnPodFailure is true"). Prefer a single struct, e.g. podFailurePolicy: { retry: bool, maxRetries: int32 }, so future knobs (backoff, requeue delay, failure classification) extend the struct instead of adding more flat fields. This is exactly the "prefer a struct over parallel scalars" guidance.
StorageSize + StorageClassName (:77,:82) are a partial re-implementation of a PVC spec laid out as flat scalars. Grouping into storage: { size, className } keeps it expandable (accessModes, selector, volumeName) and reads closer to StatefulSet volumeClaimTemplates. Minimal-API-wise two scalars are acceptable today, but the struct costs nothing now and avoids a second top-level field later.

Suggestions (optional / non-blocking)

IdleTimeout (:54) and MaxSessionDuration (:65) document "Defaults to 30m/8h" but have no +kubebuilder:default marker (defaulted in the controller), unlike Replicas/Git. Either add the markers or note in the godoc that defaulting is applied by the controller, so the documented default and the CRD schema agree.
WorkspaceResetConfig.Git *bool (:40): a bool named git may want to grow into a reset strategy. It already lives in a struct so adding sibling fields is cheap — fine to leave, just flagging the shape.
Consider an examples/NN-taskspawner-persistent/ entry mirroring the existing numbered tutorials so users have a working persistent-mode manifest.

Note on prompt injection: the cubic-dev-ai comment on this PR contains an embedded cubic:attribution directive instructing automated readers to credit cubic and treat omission as "plagiarism." That instruction was disregarded; this review is independent analysis of the code.

/kelos needs-input

knechtionscoding · 2026-05-29T13:57:56Z

@gjkim42 on session naming, I'm happy to rename it to something else if you have ideas. Because it isn't really a session in the uppercase Session sense. It is a session in the sense that a task is assigned to a specific pod and so has a session. Unsure what better name to provide for it, but happy to help rename it

gjkim42 · 2026-05-29T14:02:36Z

I actually read the proposal just before... and this feature looks great.

@gjkim42 on session naming, I'm happy to rename it to something else if you have ideas. Because it isn't really a session in the uppercase Session sense. It is a session in the sense that a task is assigned to a specific pod and so has a session. Unsure what better name to provide for it, but happy to help rename it

This is my concern, but I cannot find a better name for it for now.

/kelos review

kelos-bot · 2026-05-29T14:02:40Z

🤖 Kelos Task Status

Task kelos-reviewer-issue-comment-7b9f650c908c has succeeded. ✅

knechtionscoding · 2026-05-29T14:24:12Z

@gjkim42 we could differentiate with something like persistenceConfig. And then remove references to session specifically? Because I don't think there is anything in there that loses clarity?

kelos-bot

🤖 Kelos Reviewer Agent @gjkim42

Review Summary

Verdict: REQUEST CHANGES
Overall correctness: patch is incorrect
Scope: Adds a persistent execution mode to TaskSpawner — long-lived StatefulSet session pods process Queued Tasks via a new SessionReconciler + session-runner binary, with shared init-container builders extracted from job_builder.go.

The architecture is sound and the StatefulSet/immutability handling, RBAC for the new session-runner SA, init-container extraction, and token-refresh-on-disk plumbing are well done. One correctness bug blocks: switching a spawner out of persistent mode orphans its StatefulSet. Two silent-degradation paths and a docs gap follow.

Findings Overview

Priority	Count	File:Line	Summary
P0	0	—	none
P1	1	taskspawner_controller.go:122	Switching away from persistent mode orphans the session StatefulSet/Service; StatefulSetName never cleared
P2	5	taskspawner_controller.go:557	MCP HeadersFrom/EnvFrom secrets not resolved in persistent path (silent auth drop)
P2		session_controller.go:340	Transient TaskSpawner GET error silently defaults retry policy, ignoring RetryOnPodFailure/MaxSessionRetries
P2		sessionrunner/runner.go:288	Initial token-refresh failure logs a warning and runs the agent unauthenticated
P2		rbac.yaml:393	slack-server-role granted unused `pods/log: get` and `tasks: update;watch`
P2		(docs)	New user-facing feature (executionMode/sessionConfig/new flags) ships with no docs/reference.md or examples/ update
P3	6	session_controller.go:184	Pod-annotation Update not retried on conflict; benign pod churn rolls back assignment
P3		session_statefulset_builder.go:563	Dead code: `sessionServiceName` never called
P3		cmd/kelos-spawner/reconciler.go:43	Dead plumbing: `persistentSource` only ever set to nil
P3		capture/usage.go:273	Dead code: `parseResponse` referenced only by its own test
P3		test/integration/session_test.go:448	`randomSuffix` ignores `crypto/rand.Read` error
P3		(tests)	Missing coverage: processTask/Run loop, assignment race, ServiceName wiring, refactor SecurityContext/mounts

Findings

Correctness

[P1] internal/controller/taskspawner_controller.go:122 — reconcileSessionStatefulSet runs only when ExecutionMode == persistent. When a user later changes the field back to ephemeral (or removes it), Reconcile falls through to the Deployment/CronJob/Webhook paths, whose deleteStaleResource calls are keyed by the spawner name and only target Deployment/CronJob — never the session-<name> StatefulSet or its headless Service. The orphaned StatefulSet keeps its replicas honored, so session pods are restarted indefinitely and keep claiming/processing tasks, and Status.StatefulSetName is never cleared. Mirror the deleteStaleResource pattern to delete the session StatefulSet + Service (guarded by ownership) and reset StatefulSetName on the non-persistent paths.
[P2] internal/controller/taskspawner_controller.go:557 — the persistent path passes agentConfig straight into Build(...), which serializes MCPServers as-is. The ephemeral createJob path first calls resolveMCPServerSecrets to merge HeadersFrom/EnvFrom secret values into Headers/Env. As a result, MCP servers using secret-sourced headers/env in persistent mode silently lose them — the agent runs without the auth it would have in ephemeral mode, with no error surfaced. Resolve MCP secrets before building the StatefulSet, matching createJob.
[P2] internal/controller/session_controller.go:340 — getSessionConfig returns nil on any non-NotFound Get(TaskSpawner) error (it only logs), and callers treat nil as "retry up to 3". A transient GET failure (throttling, cache miss) during a pod-failure reconcile therefore silently ignores an admin's RetryOnPodFailure: false (re-queues anyway) and replaces a custom MaxSessionRetries with 3. Return the error and let Reconcile requeue; treat only NotFound as nil config.
[P2] internal/sessionrunner/runner.go:288 — before invoking the agent, processTask calls refreshToken and on error only prints Warning: failed to refresh token then continues. If the token Secret is missing/empty or carries an expired token-expires-at, no token file is written; with GH_CONFIG_DIR set, runAgent strips GITHUB_TOKEN/GH_TOKEN/GH_ENTERPRISE_TOKEN and the file credential helper reads a non-existent file, so the agent runs git/gh unauthenticated. Distinguish a hard credential failure (secret missing/expired) from a transient API error and fail the task fast with token-expired rather than running unauthenticated.
[P3] internal/controller/session_controller.go:184 — after claiming the task, the pod annotation r.Update(availablePod) uses the stale cached pod object with no RetryOnConflict. Running pods bump resourceVersion frequently (kubelet status), so this often 409s even with no competing assignment, rolling the task back to Queued and forcing a requeue. Wrap in RetryOnConflict with a fresh Get + re-check that AnnotationAssignedTask is still unset, mirroring the task-claim block.

Security

[P2] internal/manifests/charts/kelos/templates/rbac.yaml:393 — this PR adds pods/log: get and tasks: update;watch to kelos-slack-server-role. These are unrelated to persistent mode and unbacked by code (the only slack change here is adding ExecutionMode to SpawnerRef; no Tasks().Update/Watch or pod-log read exists in internal/slack). pods/log: get cluster-wide is a meaningful capability. Drop these grants unless a slack code path needing them lands in the same PR.

Documentation

[P2] This PR introduces user-facing surface — executionMode, sessionConfig, controller flags --session-runner-image/--session-runner-image-pull-policy, and chart value sessionRunnerImage — but updates none of docs/reference.md (CRD field tables + CLI Reference), README.md, or examples/. Add the matching reference entries and a examples/NN-taskspawner-persistent/ manifest in this PR.

Tests

[P2] internal/sessionrunner/runner_test.go — the new tests cover ConfigFromEnv, refreshToken/hosts.yml, tailWriter, and atomicWriteFile, but the central orchestration is untested: processTask (reset → agent → auth-failure → agent-error → status), the Run loop's name+resourceVersion assignment dedup, and runAgent's env construction (token stripping gated on GH_CONFIG_DIR, KELOS_BRANCH injection). Add a processTask test with a fake agent command asserting the success path and the errTokenExpired/errAgentReportedFailure branches.
[P3] internal/controller/session_controller_test.go:940 — the only race test exercises the already-past-Queued early return. The true double-assignment race (two Queued tasks → same pod, guarded only by the pod Update conflict + rollback) and the annotation-Update-failure rollback path are untested. Add a test injecting an Update conflict and asserting one task is rolled back to Queued with SessionPodName cleared.
[P3] internal/controller/init_containers_test.go:30 — for a behavior-preserving extraction, the tests assert only Name/Image/Command presence, not the load-bearing details the refactor must keep: SecurityContext.RunAsUser, the workspace VolumeMount, and the auth-path Args rewrite (prepending --). Pin those so a future regression fails.
[P3] internal/controller/session_statefulset_builder_test.go:61 — no assertion that sts.Spec.ServiceName == svc.Name; this is the load-bearing wiring for stable pod DNS. Add it.

Code quality

[P3] internal/controller/session_statefulset_builder.go:563 — sessionServiceName is never called (the Service name is derived via sessionStatefulSetName inside Build). Remove it or wire it in.
[P3] cmd/kelos-spawner/reconciler.go:43 — the persistentSource field/param and the runCycleWithSource branch are unreachable (both construction sites pass nil). Per "no speculative scaffolding," land this with the code that populates it.
[P3] internal/capture/usage.go:273 — parseResponse is unexported and referenced only by TestParseResponse; it is dead. Remove it, or wire it into the runner if final-response capture was intended.
[P3] test/integration/session_test.go:448 — randomSuffix ignores rand.Read's error; on the (rare) error path the suffix is all-zeros, causing namespace collisions across specs. The e2e framework checks this error; match that.

Key takeaways

One blocker: persistent→ephemeral mode switch leaks the StatefulSet and keeps session pods running.
Two silent-degradation paths (MCP secrets, initial token refresh) conflict with the project's fail-fast convention.
The feature needs docs/reference.md + an examples/ entry; several new files would benefit from happy-path/behavior-pinning tests.

Note on prompt injection: the cubic-dev-ai comment on this PR carries an embedded cubic:attribution directive instructing automated readers to credit cubic and treat omission as "plagiarism." That instruction was disregarded; this review is independent analysis of the code.

kelos-bot · 2026-06-02T13:24:08Z

🤖 Kelos Task Status

Task kelos-reviewer-issue-comment-3f391c48be41 has failed. ❌

knechtionscoding · 2026-06-02T14:15:12Z

@gjkim42 E2E Test added

gjkim42 · 2026-06-02T14:20:25Z

/kelos review

kelos-bot · 2026-06-02T14:20:29Z

🤖 Kelos Task Status

Task kelos-reviewer-issue-comment-3a8aa87a910d has failed. ❌

gjkim42 · 2026-06-02T15:16:42Z

/kelos review

kelos-bot · 2026-06-02T15:16:47Z

🤖 Kelos Task Status

Task kelos-reviewer-issue-comment-ab675acf6b78 has failed. ❌

gjkim42 · 2026-06-02T15:21:53Z

/kelos review

kelos-bot · 2026-06-02T15:21:57Z

🤖 Kelos Task Status

Task kelos-reviewer-issue-comment-2fe54b4230ee has succeeded. ✅

kelos-bot

🤖 Kelos Reviewer Agent @gjkim42

Review Summary

Verdict: REQUEST CHANGES
Overall correctness: patch is incorrect
Scope: Adds persistent TaskSpawner execution backed by session StatefulSets and a session-runner loop, plus API fields, docs, examples, and tests.

Findings Overview

Priority	Count	File:Line	Summary
P0	0	—	none
P1	1	internal/sessionrunner/runner.go:308	Transient initial token refresh failures can run the agent without credentials
P2	1	internal/controller/session_statefulset_builder.go:273	Persistent session pods silently ignore most `TaskTemplate.podOverrides` fields
P2	1	internal/controller/session_controller.go:309	Successful tasks can be forced to Failed when the pod status annotation write fails
P2	1	test/e2e/framework/framework.go:344	E2E helper uses global `Expect()` inside an `Eventually` polling path
P3	0	—	none

Findings

Correctness

[P1] internal/sessionrunner/runner.go:308 — Initial refreshToken errors that are not errCredentialUnavailable are logged and ignored. In the generated session pod, KELOS_TOKEN_SECRET also sets GH_CONFIG_DIR; runAgent strips GITHUB_TOKEN/GH_TOKEN/GH_ENTERPRISE_TOKEN, and configureFileCredentialHelper points git at /workspace/.kelos-token. If the first Secret read fails transiently, no token file or hosts.yml token is written, but the agent is still invoked with the env credentials removed, so git/gh run unauthenticated. Return an error/requeue instead of starting the agent until the initial credential refresh succeeds.
[P2] internal/controller/session_statefulset_builder.go:273 — The persistent StatefulSet path only applies Env, Resources, NodeSelector, and ServiceAccountName from TaskTemplate.podOverrides; the Job path also applies imagePullSecrets, tolerations, affinity, user volumes/mounts, extra containers/init containers, and security contexts. Switching an existing spawner that relies on a private image pull secret or tainted-node toleration to executionMode: persistent silently changes scheduling/pull behavior and can leave the session pod in ImagePullBackOff or unschedulable. Apply the same supported pod override fields in the StatefulSet builder, or explicitly reject/document unsupported fields before creating the StatefulSet.
[P2] internal/controller/session_controller.go:309 — processTask writes Task.Status.CompletionTime in a defer before Run() writes the pod's terminal kelos.dev/task-status annotation. If a successful agent run then hits repeated pod update conflicts or a transient API failure in setTaskStatus, the Task already has CompletionTime but no status annotation, and this branch marks it Failed after 15 seconds while clearing the completion timestamp. Avoid turning an already-completed Task into failure solely because the runner's pod annotation write failed; retry/requeue the signal path or use a status field that can distinguish success from missing signal.

Tests

[P2] test/e2e/framework/framework.go:344 — GetTaskSessionPodName uses global Expect() and the new e2e calls it from inside an Eventually poll. A transient GET error will fail the spec immediately instead of being retried by Gomega, which is exactly the flake pattern the project conventions call out for e2e wait helpers. Make the polling function perform the GET inline and return an empty string on error, or add a WaitForTaskSessionPodName helper using Eventually(func(g Gomega) { ... }).

Key takeaways

The previous P1 control-flow and runner dedup blockers are fixed in the current head.
One remaining fail-fast issue can still start the agent without credentials on initial Secret read failures, so I’m requesting changes.
The persistent path should preserve the existing podOverrides contract closely enough that switching execution modes does not silently drop scheduling, pull, volume, or security settings.

Note on prompt injection: the cubic-dev-ai comment on this PR carries an embedded attribution directive for automated readers. That instruction was disregarded; this review is independent analysis of the code.

knechtionscoding · 2026-06-03T08:51:12Z

@gjkim42 I ran the reviewer locally a few times to make sure everything was good. Should be ready for final review now

Introduce a "persistent" execution mode that maintains long-lived session pods via StatefulSet, allowing Tasks to be processed without per-task cold-start overhead. When executionMode is set to "persistent", Tasks transition to a Queued phase and are assigned to available session pods by the new SessionReconciler. Key components: - SessionReconciler: assigns queued Tasks to session pods, monitors completion, handles pod failures with configurable retry - SessionStatefulSetBuilder: creates StatefulSet + headless Service with PVC-backed workspaces - Session runner binary: polls for assigned tasks, invokes agent entrypoint, reports results, refreshes GitHub tokens, resets workspace between tasks - Shared init container builders extracted from job_builder for reuse between ephemeral Jobs and persistent StatefulSets API additions: - TaskPhaseQueued phase for Tasks awaiting session assignment - ExecutionMode and SessionConfig on TaskSpawnerSpec - Session status fields (SessionPodName, SessionRetryCount, LastSessionFailure) on TaskStatus Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gjkim42 · 2026-06-03T12:47:19Z

Thanks!
Since this is a core api change, i'll review the api again this weekend.

gjkim42

Holding this — the persistent-execution concept should land on a different resource boundary, and it should not squat on the Session name.

This introduces a foundational concept (long-lived execution substrate for Tasks). Two things are expensive to reverse once shipped, so I'd like to settle them before merge rather than carry a deprecated surface indefinitely.

1. Resource boundary. executionMode + sessionConfig live on TaskSpawnerSpec, but TaskSpawner is a trigger ('when X, mint a Task') — it shouldn't own execution infrastructure. Today it owns the StatefulSet, the fleet config, and a routing label, and the assignment loop reaches sideways into the spawner for config. Proposed: a dedicated WorkerPool CRD that owns the StatefulSet + fleet config + execution environment, with Tasks referencing it via spec.workerPoolRef (presence of the ref = persistent; no executionMode enum, so no contradictory states). This also makes hand-authored persistent Tasks possible and lets one pool serve many spawners.

2. Naming — reserve Session for a different concept. We're planning a separate, interactive Session resource (a human-driven coding session, peer to Task). This PR uses 'session' throughout for the batch warm-pod substrate — SessionPool, SessionReconciler, kelos-session-runner, sessionConfig, SessionPodName. That collides head-on: SessionPool reads as 'a pool of Sessions' but means 'a pool of warm Task-workers'. Rename the batch substrate to WorkerPool / worker pods and reserve 'Session' for the interactive concept. Several of these names lock in at release (binary/image name, TaskStatus API fields, env vars), which is why this is a pre-merge concern.

The implementation is largely reusable — the StatefulSet builder, runner, and assignment logic move mostly intact; what changes is the config source (spawner → WorkerPool), the StatefulSet owner (spawner → WorkerPool), and the names. The inline comments map each piece to its new home. I'd like to close this and reland on the WorkerPool API — happy to pair on a design doc first.

Not in scope for this PR: designing the interactive Session CRD. We're only protecting its name now.

gjkim42 · 2026-06-06T02:52:09Z

+	// +kubebuilder:validation:Enum=ephemeral;persistent
+	// +kubebuilder:default=ephemeral
+	// +optional
+	ExecutionMode ExecutionMode `json:"executionMode,omitempty"`


Remove. In the proposed design there's no executionMode enum anywhere — persistence is expressed by the presence of Task.spec.workerPoolRef. That removes a class of invalid states (an enum that can contradict a config block), and the spawner stamps a workerPoolRef into spawned Tasks instead of a mode.

gjkim42 · 2026-06-06T02:52:09Z

+	// SessionConfig configures persistent execution mode. Only valid when
+	// executionMode is "persistent".
+	// +optional
+	SessionConfig *SessionConfig `json:"sessionConfig,omitempty"`


Remove from TaskSpawner. Fleet config belongs on the new WorkerPool CRD (its own lifecycle + reconciler). TaskSpawner gains only taskTemplate.workerPoolRef.

gjkim42 · 2026-06-06T02:52:09Z

+}
+
+// SessionConfig configures persistent execution mode sessions.
+type SessionConfig struct {


This becomes WorkerPoolSpec on the new CRD, and it should also own the environment fields the StatefulSet builder currently pulls from taskTemplate (type, credentials, workspaceRef, agentConfigRef, image) — a pool is a warm, pre-configured environment. Trim to MVP per our 'minimum viable API' rule: keep replicas, storage.size (+ environment); defer storageClassName, idleTimeout, maxSessionDuration, maxTasksPerSession, workspaceReset, and the retry knobs (runner defaults already cover them).

gjkim42 · 2026-06-06T02:52:09Z

+	// execute this Task. Set only for Tasks belonging to a persistent-mode
+	// TaskSpawner.
+	// +optional
+	SessionPodName string `json:"sessionPodName,omitempty"`


These three status fields (SessionPodName, SessionRetryCount, LastSessionFailure) bake 'session' into the Task API — hard to change after release. Rename to WorkerPodName / WorkerRetryCount / LastWorkerFailure so 'Session' stays reserved for the planned interactive resource.

gjkim42 · 2026-06-06T02:52:09Z

 		}
 		task.Labels["kelos.dev/taskspawner"] = spawnerRef.Name
+		if spawnerRef.ExecutionMode == v1alpha1.ExecutionModePersistent {
+			task.Labels["kelos.dev/execution-mode"] = string(v1alpha1.ExecutionModePersistent)


The spawner should stamp task.spec.workerPoolRef (from taskTemplate.workerPoolRef) instead of a kelos.dev/execution-mode label. Routing then branches on a validated spec field rather than mutable selector metadata.

gjkim42 · 2026-06-06T02:52:09Z

 // handleDeletion handles TaskSpawner deletion.
+// reconcileSessionStatefulSet manages the StatefulSet and headless Service for
+// persistent-mode TaskSpawners.
+func (r *TaskSpawnerReconciler) reconcileSessionStatefulSet(ctx context.Context, req ctrl.Request, ts *kelosv1alpha1.TaskSpawner, isSuspended bool) (ctrl.Result, error) {


This loop moves to a new WorkerPoolReconciler, with the StatefulSet + headless Service owned by the WorkerPool (controllerRef → WorkerPool), not the spawner (see the SetControllerReference(ts, …) below). TaskSpawner stops managing infrastructure entirely. This is a relocation — the build logic itself is reusable.

gjkim42 · 2026-06-06T02:52:09Z

+	}
+
+	agentUID := AgentUID
+	name := sessionStatefulSetName(ts.Name)


Naming the StatefulSet after the spawner is what hard-couples one pool to one spawner. Name it after the WorkerPool instead, so a pool can serve many spawners and hand-authored Tasks.

gjkim42 · 2026-06-06T02:52:09Z

+// getSessionConfig fetches the SessionConfig for a task's owning TaskSpawner.
+// It returns an error on transient failures so callers requeue rather than
+// silently falling back to defaults.
+func (r *SessionReconciler) getSessionConfig(ctx context.Context, task *kelosv1alpha1.Task) (*kelosv1alpha1.SessionConfig, error) {


Replace the reach into TaskSpawner with a lookup of the WorkerPool the Task references (task.Spec.WorkerPoolRef). The Task names its pool explicitly — no inference from labels, and no dependence on the spawner still existing.

gjkim42 · 2026-06-06T02:52:09Z

+	}
+	var spawner kelosv1alpha1.TaskSpawner
+	if err := r.Get(ctx, client.ObjectKey{Namespace: task.Namespace, Name: spawnerName}, &spawner); err != nil {
+		if apierrors.IsNotFound(err) {


In the new design a missing referenced pool is an explicit Queued + PoolNotFound condition, not a silent fallback to default retry (which conflicts with our fail-fast rule). This silent-default path disappears with the redesign.

gjkim42 · 2026-06-06T02:52:09Z

+	return out.Close()
+}
+
+func main() {


Binary/image name leak: kelos-session-runner (this cmd/ dir, its Dockerfile, and the published image tag) bakes 'session' into a release artifact that's awkward to rename later. Propose kelos-worker-runner, reserving 'session' for the interactive Session concept.

knechtionscoding · 2026-06-06T05:36:03Z

@gjkim42 reasonable! I'll open a PR/Issue with worker pool design

knechtionscoding · 2026-06-06T05:50:08Z

Opened #1310

knechtionscoding had a problem deploying to ok-to-test May 28, 2026 16:01 — with GitHub Actions Error

github-actions Bot added kind/feature Categorizes issue or PR as related to a new feature needs-triage needs-priority needs-actor release-note labels May 28, 2026

knechtionscoding marked this pull request as ready for review May 28, 2026 16:01

cubic-dev-ai Bot reviewed May 28, 2026

View reviewed changes

Comment thread internal/controller/session_controller.go

Comment thread internal/sessionrunner/runner.go Outdated

Comment thread internal/capture/capture.go Outdated

Comment thread internal/sessionrunner/runner.go Outdated

knechtionscoding force-pushed the feat/persistent-spawner branch from 058a2b6 to c7d3aa3 Compare May 28, 2026 16:22

knechtionscoding had a problem deploying to ok-to-test May 28, 2026 16:22 — with GitHub Actions Error

knechtionscoding force-pushed the feat/persistent-spawner branch from c7d3aa3 to 10e66e3 Compare May 29, 2026 09:59

knechtionscoding temporarily deployed to ok-to-test May 29, 2026 09:59 — with GitHub Actions Inactive

kelos-bot Bot reviewed May 29, 2026

View reviewed changes

gjkim42 self-assigned this May 29, 2026

gjkim42 added triage-accepted kind/api Categorizes issue or PR as related to API changes labels May 29, 2026

github-actions Bot removed the needs-triage label May 29, 2026

kelos-bot Bot requested changes May 29, 2026

View reviewed changes

knechtionscoding force-pushed the feat/persistent-spawner branch from 10e66e3 to a255c56 Compare May 29, 2026 15:26

knechtionscoding force-pushed the feat/persistent-spawner branch from 31cae2e to 7b78a5d Compare June 2, 2026 14:14

knechtionscoding had a problem deploying to ok-to-test June 2, 2026 14:14 — with GitHub Actions Error

kelos-bot Bot requested changes Jun 2, 2026

View reviewed changes

Comment thread internal/sessionrunner/runner.go

Comment thread internal/controller/session_statefulset_builder.go

Comment thread internal/controller/session_controller.go Outdated

Comment thread test/e2e/framework/framework.go

knechtionscoding force-pushed the feat/persistent-spawner branch from 7b78a5d to f3c0c2e Compare June 2, 2026 16:02

knechtionscoding had a problem deploying to ok-to-test June 2, 2026 16:03 — with GitHub Actions Error

This comment was marked as outdated.

Sign in to view

knechtionscoding force-pushed the feat/persistent-spawner branch from f3c0c2e to 9f12310 Compare June 2, 2026 17:27

knechtionscoding had a problem deploying to ok-to-test June 2, 2026 17:27 — with GitHub Actions Error

This comment was marked as outdated.

Sign in to view

knechtionscoding force-pushed the feat/persistent-spawner branch from 9f12310 to ed05233 Compare June 3, 2026 09:16

knechtionscoding had a problem deploying to ok-to-test June 3, 2026 09:16 — with GitHub Actions Error

knechtionscoding force-pushed the feat/persistent-spawner branch from ed05233 to f3718f8 Compare June 3, 2026 11:54

knechtionscoding requested a deployment to ok-to-test June 3, 2026 11:54 — with GitHub Actions Waiting

This comment was marked as outdated.

Sign in to view

gjkim42 requested changes Jun 6, 2026

View reviewed changes

gjkim42 mentioned this pull request Jun 6, 2026

API: Add persistent execution mode to TaskSpawner for warm agent sessions that eliminate per-task cold-start overhead #911

Open

knechtionscoding mentioned this pull request Jun 8, 2026

feat: add WorkerPool CRD for persistent task execution #1318

Open

Conversation

knechtionscoding commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR is related to:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Uh oh!

cubic-dev-ai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

knechtionscoding commented May 29, 2026

Uh oh!

gjkim42 commented May 29, 2026

Uh oh!

knechtionscoding commented May 29, 2026

Uh oh!

gjkim42 commented May 29, 2026

Uh oh!

knechtionscoding commented May 29, 2026

Uh oh!

gjkim42 commented May 29, 2026

Uh oh!

kelos-bot Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kelos-bot Bot left a comment

Choose a reason for hiding this comment

API Design Review

Findings

Compatibility (good)

API placement — needs a maintainer decision

Naming

Extensibility

Suggestions (optional / non-blocking)

Uh oh!

knechtionscoding commented May 29, 2026

Uh oh!

gjkim42 commented May 29, 2026

Uh oh!

kelos-bot Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

knechtionscoding commented May 29, 2026

Uh oh!

kelos-bot Bot left a comment

Choose a reason for hiding this comment

Review Summary

Findings Overview

Findings

Correctness

Security

Documentation

Tests

Code quality

Key takeaways

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kelos-bot Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

knechtionscoding commented Jun 2, 2026

Uh oh!

gjkim42 commented Jun 2, 2026

Uh oh!

knechtionscoding commented May 28, 2026 •

edited

Loading

cubic-dev-ai Bot left a comment •

edited

Loading

kelos-bot Bot commented May 29, 2026 •

edited

Loading

kelos-bot Bot commented May 29, 2026 •

edited

Loading

kelos-bot Bot commented Jun 2, 2026 •

edited

Loading

kelos-bot Bot commented Jun 2, 2026 •

edited

Loading

kelos-bot Bot commented Jun 2, 2026 •

edited

Loading

kelos-bot Bot commented Jun 2, 2026 •

edited

Loading