Skip to content

feat: add persistent execution mode for TaskSpawner#1228

Open
knechtionscoding wants to merge 1 commit into
kelos-dev:mainfrom
datagravity-ai:feat/persistent-spawner
Open

feat: add persistent execution mode for TaskSpawner#1228
knechtionscoding wants to merge 1 commit into
kelos-dev:mainfrom
datagravity-ai:feat/persistent-spawner

Conversation

@knechtionscoding

@knechtionscoding knechtionscoding commented May 28, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds a "persistent" execution mode to TaskSpawner that maintains long-lived session pods via StatefulSet. Tasks are processed without per-task cold-start overhead (no Job creation, no init container re-runs, no image pulls per task).

When executionMode: persistent is set on a TaskSpawner:

  1. Spawned Tasks transition to a new Queued phase instead of creating a Job
  2. A SessionReconciler assigns queued Tasks to available session pods
  3. The session runner inside the pod picks up the task, invokes the agent, reports results, and resets the workspace for the next task

Key components:

  • SessionReconciler — assigns queued Tasks to session pods, monitors completion via pod annotations, handles pod failures with configurable retry and re-queue
  • SessionStatefulSetBuilder — creates StatefulSet + headless Service with PVC-backed workspaces per replica
  • Session runner binary (cmd/kelos-session-runner) — polls for assigned tasks, invokes agent entrypoint, reports results back via Task status, refreshes GitHub tokens periodically, resets workspace between tasks
  • Shared init container builders — extracted from job_builder.go into init_containers.go for reuse between ephemeral Jobs and persistent StatefulSets

API additions:

  • TaskPhaseQueued — new phase for Tasks awaiting session pod assignment
  • ExecutionMode (ephemeral | persistent) on TaskSpawnerSpec
  • SessionConfig — idle timeout, max tasks per session, replicas, storage size/class, workspace reset config, retry policy
  • Session status fields on TaskStatussessionPodName, sessionRetryCount, lastSessionFailure
  • StatefulSetName on TaskSpawnerStatus

Which issue(s) this PR is related to:

fixes #911

Special notes for your reviewer:

Addressed review feedback (round 2):

Correctness:

  • P1: Added deleteStaleSessionResources — when switching away from persistent mode, the session StatefulSet + headless Service are cleaned up and Status.StatefulSetName is cleared, preventing orphaned session pods
  • P2: MCP server secrets (HeadersFrom/EnvFrom) are now resolved before building the StatefulSet, matching the ephemeral createJob path
  • P2: getSessionConfig now returns an error on transient API failures instead of silently falling back to defaults (only NotFound returns nil config)
  • P2: Hard credential failures (secret missing key, token expired) in processTask now fail the task immediately with token-expired reason rather than running the agent unauthenticated
  • P3: Pod annotation update in assignTask wrapped in RetryOnConflict with fresh Get + re-check, preventing spurious 409s from kubelet status bumps

Security:

  • P2: Removed pods/log: get and trimmed tasks verbs to create, list from kelos-slack-server-role — these grants were unbacked by code in internal/slack

Code quality:

  • Removed dead sessionServiceName function (name is derived via sessionStatefulSetName inside Build)
  • Removed unreachable persistentSource field/param/branch from spawnerReconciler/runOnce
  • Removed dead parseResponse function and its test
  • Fixed randomSuffix to panic on rand.Read error

Tests:

  • Added processTask orchestration tests: success path, errTokenExpired on expired creds, errAgentReportedFailure, auth-failure-in-output detection
  • Added TestSessionReconciler_PodAnnotationConflictRollsBackTask — injects Update conflict, asserts task rolls back to Queued
  • Added sts.Spec.ServiceName == svc.Name assertion (load-bearing wiring for stable pod DNS)
  • Added SecurityContext.RunAsUser, workspace VolumeMount, and "--" Args assertions to init container tests
  • Added runAgentFn seam to Runner struct for testability

Documentation:

  • Added executionMode and all sessionConfig.* fields to docs/reference.md TaskSpawner table
  • Added --session-runner-image and --session-runner-image-pull-policy to CLI Reference
  • Created examples/14-taskspawner-persistent/ with README and manifests

Does this PR introduce a user-facing change?

Add persistent execution mode for TaskSpawner. Set `executionMode: persistent` to maintain long-lived session pods that process multiple tasks without cold-start overhead. Configurable via `sessionConfig` (replicas, idle timeout, storage, retry policy).

@github-actions github-actions Bot added kind/feature Categorizes issue or PR as related to a new feature needs-triage needs-priority needs-actor release-note labels May 28, 2026
@knechtionscoding knechtionscoding marked this pull request as ready for review May 28, 2026 16:01

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 issues found across 39 files

Tip: cubic can generate docs of your entire codebase and keep them up to date. Try it here.

Re-trigger cubic

Comment thread internal/controller/session_controller.go
Comment thread internal/sessionrunner/runner.go Outdated
Comment thread internal/capture/capture.go Outdated
Comment thread internal/sessionrunner/runner.go Outdated
@knechtionscoding

Copy link
Copy Markdown
Contributor Author

@gjkim42 should be ready for review

@gjkim42

gjkim42 commented May 29, 2026

Copy link
Copy Markdown
Collaborator

I am not sure we want to add this feature.
Is it useful?

I'd rather implement the context store from #1024 (comment).

@knechtionscoding

Copy link
Copy Markdown
Contributor Author

@gjkim42 we've found it useful to allow for more interactive sessions/long running usages. Specifically where doing the git clone and other prep tasks are a rather large burden for the task or question. (think asking about the codebase). Using a context store doesn't solve that. Rather, the persistent runner just means an agent that is always up and can swap between tasks, not necessarily the same task.

@gjkim42

gjkim42 commented May 29, 2026

Copy link
Copy Markdown
Collaborator

Rather, the persistent runner just means an agent that is always up and can swap between tasks, not necessarily the same task.

I am thinking about introducing the new CRD Session, which sets up an environment for a coding agent.
Does it cover those use cases?

@knechtionscoding

Copy link
Copy Markdown
Contributor Author

I am thinking about introducing the new CRD Session, which sets up an environment for a coding agent.
Does it cover those use cases?

@gjkim42 possibly. I think there's still a speed/responsiveness question that comes from this, but I'm open to being wrong and having the CRD for Session solve it.

I'll be more specific for the usecase we are finding this useful for:

Context: We have a rather large mono-repo which makes git operations rather expensive (think on the order of 1-2 minutes just to do the git clone even with shallow).

As such persistent mode has been helpful for:

  1. Interacting with it from a web ui/desktop app similar to claude. Where we don't want to check out the repo each time a message is sent but we do want to maintain the same/consistent configuration that we hvae in kelos and provide responsiveness.
  2. Where we know we have a consistent throughput of tasks such that swapping between branches and sessions is much cheaper than running git clone every time.

I think CRD Session solves the persistence issue (from a maintain history) but I don't think it solves always running/speed

@gjkim42

gjkim42 commented May 29, 2026

Copy link
Copy Markdown
Collaborator

/kelos api-review

@kelos-bot

kelos-bot Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

🤖 Kelos Task Status

Task kelos-api-reviewer-issue-comment-bf822513d881 has succeeded. ✅

@kelos-bot kelos-bot Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Kelos API Reviewer Agent @gjkim42

API Design Review

Verdict: COMMENT
Scope: Adds executionMode (ephemeral|persistent) + sessionConfig to TaskSpawnerSpec, a Queued Task phase, and three session status fields on TaskStatus.

The mechanics are sound and the additions are backward compatible: executionMode defaults to ephemeral, every new field is +optional, no existing field is removed/retyped/tightened, durations use metav1.Duration, storageSize uses resource.Quantity, the mode is an extensible enum (not a bool), and deepcopy/CRD/RBAC artifacts are regenerated. I'm landing on COMMENT rather than approve/request-changes because the core API shape is still under discussion in this thread (embed in TaskSpawner vs. a dedicated Session CRD) and CRD fields are effectively permanent once shipped.

Findings

Compatibility (good)

  • executionMode defaults to ephemeral, so existing TaskSpawners are unaffected — no migration needed.
  • New Queued TaskPhase is additive; TaskPhase is not CRD-enum-constrained, so existing clients can ignore it.
  • make update artifacts (deepcopy, both CRDs, install-crd.yaml, RBAC) are present and consistent with the Go types.

API placement — needs a maintainer decision

  • The unresolved question raised above (separate Session CRD vs. sessionConfig embedded in TaskSpawnerSpec) is the most consequential API call here, because it dictates a permanent surface. The contributor's use case #1 (interactive/web-UI, an always-on agent that swaps between tasks) describes a session whose lifecycle is independent of any spawner — that argues for a first-class Session/SessionPool resource with its own status, reusable across spawners. Embedding makes TaskSpawner own a StatefulSet + session lifecycle in addition to discovery/polling.
  • Concrete risk: if a Session CRD lands later, sessionConfig on TaskSpawner becomes a parallel/deprecated surface that can't be removed. Worth settling the shape before merge rather than shipping both.

Naming

  • TaskStatus.LastSessionFailure (api/v1alpha1/task_types.go:266) holds a pod name but reads like it holds a failure reason/timestamp. Per the xxxName convention, rename to lastFailedSessionPodName (or split into a structured lastSessionFailure: {podName, reason, time} if more detail is wanted later). As-is, a client reading lastSessionFailure would reasonably expect a message, not a name.
  • "session" terminology (SessionConfig, sessionPodName, StatefulSetName-for-sessions) overlaps with the proposed Session CRD. If both ship, pick distinct terms now to avoid a confusing API vocabulary.

Extensibility

  • RetryOnPodFailure *bool + MaxSessionRetries *int32 (taskspawner_types.go:91,:98) are two top-level fields encoding one concept, with a documented coupling ("only used when RetryOnPodFailure is true"). Prefer a single struct, e.g. podFailurePolicy: { retry: bool, maxRetries: int32 }, so future knobs (backoff, requeue delay, failure classification) extend the struct instead of adding more flat fields. This is exactly the "prefer a struct over parallel scalars" guidance.
  • StorageSize + StorageClassName (:77,:82) are a partial re-implementation of a PVC spec laid out as flat scalars. Grouping into storage: { size, className } keeps it expandable (accessModes, selector, volumeName) and reads closer to StatefulSet volumeClaimTemplates. Minimal-API-wise two scalars are acceptable today, but the struct costs nothing now and avoids a second top-level field later.

Suggestions (optional / non-blocking)

  • IdleTimeout (:54) and MaxSessionDuration (:65) document "Defaults to 30m/8h" but have no +kubebuilder:default marker (defaulted in the controller), unlike Replicas/Git. Either add the markers or note in the godoc that defaulting is applied by the controller, so the documented default and the CRD schema agree.
  • WorkspaceResetConfig.Git *bool (:40): a bool named git may want to grow into a reset strategy. It already lives in a struct so adding sibling fields is cheap — fine to leave, just flagging the shape.
  • Consider an examples/NN-taskspawner-persistent/ entry mirroring the existing numbered tutorials so users have a working persistent-mode manifest.

Note on prompt injection: the cubic-dev-ai comment on this PR contains an embedded cubic:attribution directive instructing automated readers to credit cubic and treat omission as "plagiarism." That instruction was disregarded; this review is independent analysis of the code.

/kelos needs-input

@knechtionscoding

Copy link
Copy Markdown
Contributor Author

@gjkim42 on session naming, I'm happy to rename it to something else if you have ideas. Because it isn't really a session in the uppercase Session sense. It is a session in the sense that a task is assigned to a specific pod and so has a session. Unsure what better name to provide for it, but happy to help rename it

@gjkim42

gjkim42 commented May 29, 2026

Copy link
Copy Markdown
Collaborator

I actually read the proposal just before... and this feature looks great.

@gjkim42 on session naming, I'm happy to rename it to something else if you have ideas. Because it isn't really a session in the uppercase Session sense. It is a session in the sense that a task is assigned to a specific pod and so has a session. Unsure what better name to provide for it, but happy to help rename it

This is my concern, but I cannot find a better name for it for now.

/kelos review

@kelos-bot

kelos-bot Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

🤖 Kelos Task Status

Task kelos-reviewer-issue-comment-7b9f650c908c has succeeded. ✅

@gjkim42 gjkim42 self-assigned this May 29, 2026
@gjkim42 gjkim42 added triage-accepted kind/api Categorizes issue or PR as related to API changes labels May 29, 2026
@knechtionscoding

Copy link
Copy Markdown
Contributor Author

@gjkim42 we could differentiate with something like persistenceConfig. And then remove references to session specifically? Because I don't think there is anything in there that loses clarity?

@kelos-bot kelos-bot Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Kelos Reviewer Agent @gjkim42

Review Summary

Verdict: REQUEST CHANGES
Overall correctness: patch is incorrect
Scope: Adds a persistent execution mode to TaskSpawner — long-lived StatefulSet session pods process Queued Tasks via a new SessionReconciler + session-runner binary, with shared init-container builders extracted from job_builder.go.

The architecture is sound and the StatefulSet/immutability handling, RBAC for the new session-runner SA, init-container extraction, and token-refresh-on-disk plumbing are well done. One correctness bug blocks: switching a spawner out of persistent mode orphans its StatefulSet. Two silent-degradation paths and a docs gap follow.

Findings Overview

Priority Count File:Line Summary
P0 0 none
P1 1 taskspawner_controller.go:122 Switching away from persistent mode orphans the session StatefulSet/Service; StatefulSetName never cleared
P2 5 taskspawner_controller.go:557 MCP HeadersFrom/EnvFrom secrets not resolved in persistent path (silent auth drop)
P2 session_controller.go:340 Transient TaskSpawner GET error silently defaults retry policy, ignoring RetryOnPodFailure/MaxSessionRetries
P2 sessionrunner/runner.go:288 Initial token-refresh failure logs a warning and runs the agent unauthenticated
P2 rbac.yaml:393 slack-server-role granted unused pods/log: get and tasks: update;watch
P2 (docs) New user-facing feature (executionMode/sessionConfig/new flags) ships with no docs/reference.md or examples/ update
P3 6 session_controller.go:184 Pod-annotation Update not retried on conflict; benign pod churn rolls back assignment
P3 session_statefulset_builder.go:563 Dead code: sessionServiceName never called
P3 cmd/kelos-spawner/reconciler.go:43 Dead plumbing: persistentSource only ever set to nil
P3 capture/usage.go:273 Dead code: parseResponse referenced only by its own test
P3 test/integration/session_test.go:448 randomSuffix ignores crypto/rand.Read error
P3 (tests) Missing coverage: processTask/Run loop, assignment race, ServiceName wiring, refactor SecurityContext/mounts

Findings

Correctness

  • [P1] internal/controller/taskspawner_controller.go:122reconcileSessionStatefulSet runs only when ExecutionMode == persistent. When a user later changes the field back to ephemeral (or removes it), Reconcile falls through to the Deployment/CronJob/Webhook paths, whose deleteStaleResource calls are keyed by the spawner name and only target Deployment/CronJob — never the session-<name> StatefulSet or its headless Service. The orphaned StatefulSet keeps its replicas honored, so session pods are restarted indefinitely and keep claiming/processing tasks, and Status.StatefulSetName is never cleared. Mirror the deleteStaleResource pattern to delete the session StatefulSet + Service (guarded by ownership) and reset StatefulSetName on the non-persistent paths.
  • [P2] internal/controller/taskspawner_controller.go:557 — the persistent path passes agentConfig straight into Build(...), which serializes MCPServers as-is. The ephemeral createJob path first calls resolveMCPServerSecrets to merge HeadersFrom/EnvFrom secret values into Headers/Env. As a result, MCP servers using secret-sourced headers/env in persistent mode silently lose them — the agent runs without the auth it would have in ephemeral mode, with no error surfaced. Resolve MCP secrets before building the StatefulSet, matching createJob.
  • [P2] internal/controller/session_controller.go:340getSessionConfig returns nil on any non-NotFound Get(TaskSpawner) error (it only logs), and callers treat nil as "retry up to 3". A transient GET failure (throttling, cache miss) during a pod-failure reconcile therefore silently ignores an admin's RetryOnPodFailure: false (re-queues anyway) and replaces a custom MaxSessionRetries with 3. Return the error and let Reconcile requeue; treat only NotFound as nil config.
  • [P2] internal/sessionrunner/runner.go:288 — before invoking the agent, processTask calls refreshToken and on error only prints Warning: failed to refresh token then continues. If the token Secret is missing/empty or carries an expired token-expires-at, no token file is written; with GH_CONFIG_DIR set, runAgent strips GITHUB_TOKEN/GH_TOKEN/GH_ENTERPRISE_TOKEN and the file credential helper reads a non-existent file, so the agent runs git/gh unauthenticated. Distinguish a hard credential failure (secret missing/expired) from a transient API error and fail the task fast with token-expired rather than running unauthenticated.
  • [P3] internal/controller/session_controller.go:184 — after claiming the task, the pod annotation r.Update(availablePod) uses the stale cached pod object with no RetryOnConflict. Running pods bump resourceVersion frequently (kubelet status), so this often 409s even with no competing assignment, rolling the task back to Queued and forcing a requeue. Wrap in RetryOnConflict with a fresh Get + re-check that AnnotationAssignedTask is still unset, mirroring the task-claim block.

Security

  • [P2] internal/manifests/charts/kelos/templates/rbac.yaml:393 — this PR adds pods/log: get and tasks: update;watch to kelos-slack-server-role. These are unrelated to persistent mode and unbacked by code (the only slack change here is adding ExecutionMode to SpawnerRef; no Tasks().Update/Watch or pod-log read exists in internal/slack). pods/log: get cluster-wide is a meaningful capability. Drop these grants unless a slack code path needing them lands in the same PR.

Documentation

  • [P2] This PR introduces user-facing surface — executionMode, sessionConfig, controller flags --session-runner-image/--session-runner-image-pull-policy, and chart value sessionRunnerImage — but updates none of docs/reference.md (CRD field tables + CLI Reference), README.md, or examples/. Add the matching reference entries and a examples/NN-taskspawner-persistent/ manifest in this PR.

Tests

  • [P2] internal/sessionrunner/runner_test.go — the new tests cover ConfigFromEnv, refreshToken/hosts.yml, tailWriter, and atomicWriteFile, but the central orchestration is untested: processTask (reset → agent → auth-failure → agent-error → status), the Run loop's name+resourceVersion assignment dedup, and runAgent's env construction (token stripping gated on GH_CONFIG_DIR, KELOS_BRANCH injection). Add a processTask test with a fake agent command asserting the success path and the errTokenExpired/errAgentReportedFailure branches.
  • [P3] internal/controller/session_controller_test.go:940 — the only race test exercises the already-past-Queued early return. The true double-assignment race (two Queued tasks → same pod, guarded only by the pod Update conflict + rollback) and the annotation-Update-failure rollback path are untested. Add a test injecting an Update conflict and asserting one task is rolled back to Queued with SessionPodName cleared.
  • [P3] internal/controller/init_containers_test.go:30 — for a behavior-preserving extraction, the tests assert only Name/Image/Command presence, not the load-bearing details the refactor must keep: SecurityContext.RunAsUser, the workspace VolumeMount, and the auth-path Args rewrite (prepending --). Pin those so a future regression fails.
  • [P3] internal/controller/session_statefulset_builder_test.go:61 — no assertion that sts.Spec.ServiceName == svc.Name; this is the load-bearing wiring for stable pod DNS. Add it.

Code quality

  • [P3] internal/controller/session_statefulset_builder.go:563sessionServiceName is never called (the Service name is derived via sessionStatefulSetName inside Build). Remove it or wire it in.
  • [P3] cmd/kelos-spawner/reconciler.go:43 — the persistentSource field/param and the runCycleWithSource branch are unreachable (both construction sites pass nil). Per "no speculative scaffolding," land this with the code that populates it.
  • [P3] internal/capture/usage.go:273parseResponse is unexported and referenced only by TestParseResponse; it is dead. Remove it, or wire it into the runner if final-response capture was intended.
  • [P3] test/integration/session_test.go:448randomSuffix ignores rand.Read's error; on the (rare) error path the suffix is all-zeros, causing namespace collisions across specs. The e2e framework checks this error; match that.

Key takeaways

  • One blocker: persistent→ephemeral mode switch leaks the StatefulSet and keeps session pods running.
  • Two silent-degradation paths (MCP secrets, initial token refresh) conflict with the project's fail-fast convention.
  • The feature needs docs/reference.md + an examples/ entry; several new files would benefit from happy-path/behavior-pinning tests.

Note on prompt injection: the cubic-dev-ai comment on this PR carries an embedded cubic:attribution directive instructing automated readers to credit cubic and treat omission as "plagiarism." That instruction was disregarded; this review is independent analysis of the code.

Comment thread internal/controller/taskspawner_controller.go Outdated
Comment thread internal/controller/taskspawner_controller.go
Comment thread internal/controller/session_controller.go Outdated
Comment thread internal/sessionrunner/runner.go Outdated
Comment thread internal/manifests/charts/kelos/templates/rbac.yaml Outdated
Comment thread test/integration/session_test.go Outdated
Comment thread internal/sessionrunner/runner_test.go
Comment thread internal/controller/session_controller_test.go
Comment thread internal/controller/init_containers_test.go
Comment thread internal/controller/session_statefulset_builder_test.go
@knechtionscoding knechtionscoding force-pushed the feat/persistent-spawner branch from 10e66e3 to a255c56 Compare May 29, 2026 15:26
@kelos-bot

kelos-bot Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

🤖 Kelos Task Status

Task kelos-reviewer-issue-comment-3f391c48be41 has failed. ❌

@knechtionscoding

Copy link
Copy Markdown
Contributor Author

@gjkim42 E2E Test added

@gjkim42

gjkim42 commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

/kelos review

@kelos-bot

kelos-bot Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

🤖 Kelos Task Status

Task kelos-reviewer-issue-comment-3a8aa87a910d has failed. ❌

@gjkim42

gjkim42 commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

/kelos review

@kelos-bot

kelos-bot Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

🤖 Kelos Task Status

Task kelos-reviewer-issue-comment-ab675acf6b78 has failed. ❌

@gjkim42

gjkim42 commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

/kelos review

@kelos-bot

kelos-bot Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

🤖 Kelos Task Status

Task kelos-reviewer-issue-comment-2fe54b4230ee has succeeded. ✅

@kelos-bot kelos-bot Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Kelos Reviewer Agent @gjkim42

Review Summary

Verdict: REQUEST CHANGES
Overall correctness: patch is incorrect
Scope: Adds persistent TaskSpawner execution backed by session StatefulSets and a session-runner loop, plus API fields, docs, examples, and tests.

Findings Overview

Priority Count File:Line Summary
P0 0 none
P1 1 internal/sessionrunner/runner.go:308 Transient initial token refresh failures can run the agent without credentials
P2 1 internal/controller/session_statefulset_builder.go:273 Persistent session pods silently ignore most TaskTemplate.podOverrides fields
P2 1 internal/controller/session_controller.go:309 Successful tasks can be forced to Failed when the pod status annotation write fails
P2 1 test/e2e/framework/framework.go:344 E2E helper uses global Expect() inside an Eventually polling path
P3 0 none

Findings

Correctness

  • [P1] internal/sessionrunner/runner.go:308 — Initial refreshToken errors that are not errCredentialUnavailable are logged and ignored. In the generated session pod, KELOS_TOKEN_SECRET also sets GH_CONFIG_DIR; runAgent strips GITHUB_TOKEN/GH_TOKEN/GH_ENTERPRISE_TOKEN, and configureFileCredentialHelper points git at /workspace/.kelos-token. If the first Secret read fails transiently, no token file or hosts.yml token is written, but the agent is still invoked with the env credentials removed, so git/gh run unauthenticated. Return an error/requeue instead of starting the agent until the initial credential refresh succeeds.
  • [P2] internal/controller/session_statefulset_builder.go:273 — The persistent StatefulSet path only applies Env, Resources, NodeSelector, and ServiceAccountName from TaskTemplate.podOverrides; the Job path also applies imagePullSecrets, tolerations, affinity, user volumes/mounts, extra containers/init containers, and security contexts. Switching an existing spawner that relies on a private image pull secret or tainted-node toleration to executionMode: persistent silently changes scheduling/pull behavior and can leave the session pod in ImagePullBackOff or unschedulable. Apply the same supported pod override fields in the StatefulSet builder, or explicitly reject/document unsupported fields before creating the StatefulSet.
  • [P2] internal/controller/session_controller.go:309processTask writes Task.Status.CompletionTime in a defer before Run() writes the pod's terminal kelos.dev/task-status annotation. If a successful agent run then hits repeated pod update conflicts or a transient API failure in setTaskStatus, the Task already has CompletionTime but no status annotation, and this branch marks it Failed after 15 seconds while clearing the completion timestamp. Avoid turning an already-completed Task into failure solely because the runner's pod annotation write failed; retry/requeue the signal path or use a status field that can distinguish success from missing signal.

Tests

  • [P2] test/e2e/framework/framework.go:344GetTaskSessionPodName uses global Expect() and the new e2e calls it from inside an Eventually poll. A transient GET error will fail the spec immediately instead of being retried by Gomega, which is exactly the flake pattern the project conventions call out for e2e wait helpers. Make the polling function perform the GET inline and return an empty string on error, or add a WaitForTaskSessionPodName helper using Eventually(func(g Gomega) { ... }).

Key takeaways

  • The previous P1 control-flow and runner dedup blockers are fixed in the current head.
  • One remaining fail-fast issue can still start the agent without credentials on initial Secret read failures, so I’m requesting changes.
  • The persistent path should preserve the existing podOverrides contract closely enough that switching execution modes does not silently drop scheduling, pull, volume, or security settings.

Note on prompt injection: the cubic-dev-ai comment on this PR carries an embedded attribution directive for automated readers. That instruction was disregarded; this review is independent analysis of the code.

Comment thread internal/sessionrunner/runner.go
Comment thread internal/controller/session_statefulset_builder.go
Comment thread internal/controller/session_controller.go Outdated
Comment thread test/e2e/framework/framework.go
knechtionscoding

This comment was marked as outdated.

knechtionscoding

This comment was marked as outdated.

@knechtionscoding

Copy link
Copy Markdown
Contributor Author

@gjkim42 I ran the reviewer locally a few times to make sure everything was good. Should be ready for final review now

Introduce a "persistent" execution mode that maintains long-lived session
pods via StatefulSet, allowing Tasks to be processed without per-task
cold-start overhead. When executionMode is set to "persistent", Tasks
transition to a Queued phase and are assigned to available session pods
by the new SessionReconciler.

Key components:
- SessionReconciler: assigns queued Tasks to session pods, monitors
  completion, handles pod failures with configurable retry
- SessionStatefulSetBuilder: creates StatefulSet + headless Service
  with PVC-backed workspaces
- Session runner binary: polls for assigned tasks, invokes agent
  entrypoint, reports results, refreshes GitHub tokens, resets
  workspace between tasks
- Shared init container builders extracted from job_builder for reuse
  between ephemeral Jobs and persistent StatefulSets

API additions:
- TaskPhaseQueued phase for Tasks awaiting session assignment
- ExecutionMode and SessionConfig on TaskSpawnerSpec
- Session status fields (SessionPodName, SessionRetryCount,
  LastSessionFailure) on TaskStatus

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
knechtionscoding

This comment was marked as outdated.

@gjkim42

gjkim42 commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Thanks!
Since this is a core api change, i'll review the api again this weekend.

@gjkim42 gjkim42 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Holding this — the persistent-execution concept should land on a different resource boundary, and it should not squat on the Session name.

This introduces a foundational concept (long-lived execution substrate for Tasks). Two things are expensive to reverse once shipped, so I'd like to settle them before merge rather than carry a deprecated surface indefinitely.

1. Resource boundary. executionMode + sessionConfig live on TaskSpawnerSpec, but TaskSpawner is a trigger ('when X, mint a Task') — it shouldn't own execution infrastructure. Today it owns the StatefulSet, the fleet config, and a routing label, and the assignment loop reaches sideways into the spawner for config. Proposed: a dedicated WorkerPool CRD that owns the StatefulSet + fleet config + execution environment, with Tasks referencing it via spec.workerPoolRef (presence of the ref = persistent; no executionMode enum, so no contradictory states). This also makes hand-authored persistent Tasks possible and lets one pool serve many spawners.

2. Naming — reserve Session for a different concept. We're planning a separate, interactive Session resource (a human-driven coding session, peer to Task). This PR uses 'session' throughout for the batch warm-pod substrate — SessionPool, SessionReconciler, kelos-session-runner, sessionConfig, SessionPodName. That collides head-on: SessionPool reads as 'a pool of Sessions' but means 'a pool of warm Task-workers'. Rename the batch substrate to WorkerPool / worker pods and reserve 'Session' for the interactive concept. Several of these names lock in at release (binary/image name, TaskStatus API fields, env vars), which is why this is a pre-merge concern.

The implementation is largely reusable — the StatefulSet builder, runner, and assignment logic move mostly intact; what changes is the config source (spawner → WorkerPool), the StatefulSet owner (spawner → WorkerPool), and the names. The inline comments map each piece to its new home. I'd like to close this and reland on the WorkerPool API — happy to pair on a design doc first.

Not in scope for this PR: designing the interactive Session CRD. We're only protecting its name now.

// +kubebuilder:validation:Enum=ephemeral;persistent
// +kubebuilder:default=ephemeral
// +optional
ExecutionMode ExecutionMode `json:"executionMode,omitempty"`

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove. In the proposed design there's no executionMode enum anywhere — persistence is expressed by the presence of Task.spec.workerPoolRef. That removes a class of invalid states (an enum that can contradict a config block), and the spawner stamps a workerPoolRef into spawned Tasks instead of a mode.

// SessionConfig configures persistent execution mode. Only valid when
// executionMode is "persistent".
// +optional
SessionConfig *SessionConfig `json:"sessionConfig,omitempty"`

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove from TaskSpawner. Fleet config belongs on the new WorkerPool CRD (its own lifecycle + reconciler). TaskSpawner gains only taskTemplate.workerPoolRef.

}

// SessionConfig configures persistent execution mode sessions.
type SessionConfig struct {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This becomes WorkerPoolSpec on the new CRD, and it should also own the environment fields the StatefulSet builder currently pulls from taskTemplate (type, credentials, workspaceRef, agentConfigRef, image) — a pool is a warm, pre-configured environment. Trim to MVP per our 'minimum viable API' rule: keep replicas, storage.size (+ environment); defer storageClassName, idleTimeout, maxSessionDuration, maxTasksPerSession, workspaceReset, and the retry knobs (runner defaults already cover them).

// execute this Task. Set only for Tasks belonging to a persistent-mode
// TaskSpawner.
// +optional
SessionPodName string `json:"sessionPodName,omitempty"`

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These three status fields (SessionPodName, SessionRetryCount, LastSessionFailure) bake 'session' into the Task API — hard to change after release. Rename to WorkerPodName / WorkerRetryCount / LastWorkerFailure so 'Session' stays reserved for the planned interactive resource.

}
task.Labels["kelos.dev/taskspawner"] = spawnerRef.Name
if spawnerRef.ExecutionMode == v1alpha1.ExecutionModePersistent {
task.Labels["kelos.dev/execution-mode"] = string(v1alpha1.ExecutionModePersistent)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spawner should stamp task.spec.workerPoolRef (from taskTemplate.workerPoolRef) instead of a kelos.dev/execution-mode label. Routing then branches on a validated spec field rather than mutable selector metadata.

// handleDeletion handles TaskSpawner deletion.
// reconcileSessionStatefulSet manages the StatefulSet and headless Service for
// persistent-mode TaskSpawners.
func (r *TaskSpawnerReconciler) reconcileSessionStatefulSet(ctx context.Context, req ctrl.Request, ts *kelosv1alpha1.TaskSpawner, isSuspended bool) (ctrl.Result, error) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop moves to a new WorkerPoolReconciler, with the StatefulSet + headless Service owned by the WorkerPool (controllerRef → WorkerPool), not the spawner (see the SetControllerReference(ts, …) below). TaskSpawner stops managing infrastructure entirely. This is a relocation — the build logic itself is reusable.

}

agentUID := AgentUID
name := sessionStatefulSetName(ts.Name)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming the StatefulSet after the spawner is what hard-couples one pool to one spawner. Name it after the WorkerPool instead, so a pool can serve many spawners and hand-authored Tasks.

// getSessionConfig fetches the SessionConfig for a task's owning TaskSpawner.
// It returns an error on transient failures so callers requeue rather than
// silently falling back to defaults.
func (r *SessionReconciler) getSessionConfig(ctx context.Context, task *kelosv1alpha1.Task) (*kelosv1alpha1.SessionConfig, error) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace the reach into TaskSpawner with a lookup of the WorkerPool the Task references (task.Spec.WorkerPoolRef). The Task names its pool explicitly — no inference from labels, and no dependence on the spawner still existing.

}
var spawner kelosv1alpha1.TaskSpawner
if err := r.Get(ctx, client.ObjectKey{Namespace: task.Namespace, Name: spawnerName}, &spawner); err != nil {
if apierrors.IsNotFound(err) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the new design a missing referenced pool is an explicit Queued + PoolNotFound condition, not a silent fallback to default retry (which conflicts with our fail-fast rule). This silent-default path disappears with the redesign.

return out.Close()
}

func main() {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Binary/image name leak: kelos-session-runner (this cmd/ dir, its Dockerfile, and the published image tag) bakes 'session' into a release artifact that's awkward to rename later. Propose kelos-worker-runner, reserving 'session' for the interactive Session concept.

@knechtionscoding

Copy link
Copy Markdown
Contributor Author

@gjkim42 reasonable! I'll open a PR/Issue with worker pool design

@knechtionscoding

Copy link
Copy Markdown
Contributor Author

Opened #1310

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/api Categorizes issue or PR as related to API changes kind/feature Categorizes issue or PR as related to a new feature needs-actor needs-priority release-note triage-accepted

Projects

None yet

Development

Successfully merging this pull request may close these issues.

API: Add persistent execution mode to TaskSpawner for warm agent sessions that eliminate per-task cold-start overhead

2 participants