Skip to content

fix: refresh GitHub App installation token for long-running tasks#1196

Open
kelos-bot[bot] wants to merge 4 commits into
mainfrom
kelos-task-1182
Open

fix: refresh GitHub App installation token for long-running tasks#1196
kelos-bot[bot] wants to merge 4 commits into
mainfrom
kelos-task-1182

Conversation

@kelos-bot

@kelos-bot kelos-bot Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

GitHub App installation tokens expire about 1 hour after they are minted. Previously the controller minted a token at task admission and injected it through pod env vars, which are fixed for the life of the pod. Long-running tasks could then lose the ability to fetch, push, comment, or update checks once that token expired.

This PR makes the per-task GitHub token Secret refreshable while the task Job is running:

  • Mounts the workspace token Secret as a kubelet-synced file at /kelos/github-token/GITHUB_TOKEN in the agent and supporting init containers.
  • Keeps GITHUB_TOKEN, GH_TOKEN, and GH_ENTERPRISE_TOKEN env vars as compatibility fallbacks.
  • Exposes KELOS_GITHUB_TOKEN_FILE so custom agents can read the current token from the mounted file.
  • Updates the injected git credential helper to read the token file on each invocation, with $GITHUB_TOKEN fallback.
  • Installs a shared /usr/local/bin/gh wrapper in the reference agent images so gh reads the token file on each invocation.
  • Refreshes GitHub App-derived per-task token Secrets while the Job is Active, using kelos.dev/github-app-secret and kelos.dev/token-expires-at annotations to identify and schedule refreshes.
  • Retries failed token refreshes on a short backoff, shortened further near expiry, so transient failures get another attempt before the current token expires.
  • Documents and rejects collisions with the managed kelos-github-token volume name so PodOverrides.Volumes fail during job construction instead of producing an invalid pod spec.

PAT-backed workspaces keep the same token lifetime, but use the same file/env plumbing for consistency.

Which issue(s) this PR is related to:

Fixes #1182

Special notes for your reviewer:

  • Custom agent images should prefer KELOS_GITHUB_TOKEN_FILE for long-running GitHub API calls. The env vars remain for compatibility but are fixed at pod start.
  • The refresh loop stops after the Job is no longer active. The token Secret remains owner-referenced to the Task for garbage collection.
  • Failed refreshes requeue sooner than the 5-minute refresh margin; near expiry, the retry delay is capped at half the remaining token lifetime.
  • Tests cover token volume/env wiring, file-based git credential helper behavior, token Secret annotations, refresh/no-op paths, refresh-failure retry scheduling, and reserved PodOverrides volume-name validation.

Does this PR introduce a user-facing change?

Long-running tasks no longer fail mid-run when the GitHub App installation token expires. The controller now mounts the workspace token Secret as a file at `/kelos/github-token/GITHUB_TOKEN`, refreshes GitHub App-derived per-task token Secrets in place while the Job is running, retries refresh failures before the current token expires, and reference agent images read the file for `git` and `gh` calls. The env vars `GITHUB_TOKEN`, `GH_TOKEN`, and `GH_ENTERPRISE_TOKEN` are still set as compatibility fallbacks, and `KELOS_GITHUB_TOKEN_FILE` exposes the mounted token path for custom agents. `PodOverrides.Volumes` may not use the reserved `kelos-github-token` volume name.

@github-actions github-actions Bot added needs-triage kind/bug Categorizes issue or PR as related to a bug needs-kind Indicates an issue or PR lacks a kind/* label needs-priority needs-actor release-note and removed needs-kind Indicates an issue or PR lacks a kind/* label labels May 22, 2026
@gjkim42

gjkim42 commented May 22, 2026

Copy link
Copy Markdown
Collaborator

/kelos review

@kelos-bot

kelos-bot Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor Author

🤖 Kelos Task Status

Task kelos-reviewer-issue-comment-806de4e6c834 has failed. ❌

@gjkim42

gjkim42 commented May 22, 2026

Copy link
Copy Markdown
Collaborator

/kelos pick-up

@kelos-bot

kelos-bot Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor Author

🤖 Kelos Task Status

Task kelos-pr-responder-issue-comment-eee273417975 has succeeded. ✅

@kelos-bot

kelos-bot Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor Author

🤖 Kelos Agent @gjkim42

Fixed the failing test-integration check.

The existing Task Controller integration spec for workspaces with a secretRef asserted six env vars on the main container and two on the git-clone init container. The fix in this PR adds KELOS_GITHUB_TOKEN_FILE to both sets and mounts the per-task token Secret as a refreshable volume, so the old length assertions tripped under CI. The test now asserts the additional env var, verifies the kelos-github-token volume is declared on the pod, and confirms it is mounted on both the main and init containers at the expected path.

make verify, make test, and make test-integration all pass locally. All CI checks pass on 2bcf1c0. Ready for re-review.

@gjkim42 gjkim42 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/kelos pick-up

Comment thread docs/agent-image-interface.md Outdated
Comment on lines +39 to +41
| `GITHUB_TOKEN` | GitHub token for workspace access; captured at pod start and not refreshed in-process — long-running consumers should read `KELOS_GITHUB_TOKEN_FILE` instead | When workspace has a `secretRef` |
| `GH_TOKEN` | GitHub token for `gh` CLI (github.com); not refreshed in-process (the bundled `gh` wrapper overrides it from the token file on each call) | When workspace has a `secretRef` and repo is on github.com |
| `GH_ENTERPRISE_TOKEN` | GitHub token for `gh` CLI (GitHub Enterprise); same in-process freshness caveat as `GH_TOKEN` | When workspace has a `secretRef` and repo is on a GitHub Enterprise host |

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it better to recommend image authors to use KELOS_GITHUB_TOKEN_FILE?

@kelos-bot

kelos-bot Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor Author

🤖 Kelos Task Status

Task kelos-pr-responder-pull-request-review-4e35f262a8cf has succeeded. ✅

@kelos-bot kelos-bot Bot force-pushed the kelos-task-1182 branch from 2bcf1c0 to 26f87ea Compare May 23, 2026 06:02
@gjkim42

gjkim42 commented May 23, 2026

Copy link
Copy Markdown
Collaborator

/kelos review

@kelos-bot

kelos-bot Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor Author

🤖 Kelos Task Status

Task kelos-reviewer-issue-comment-2da56a73d620 has succeeded. ✅

@gjkim42

gjkim42 commented May 23, 2026

Copy link
Copy Markdown
Collaborator

/kelos pick-up
Then squash commits

@kelos-bot

kelos-bot Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor Author

🤖 Kelos Task Status

Task kelos-pr-responder-issue-comment-3ef1756b6132 has succeeded. ✅

@kelos-bot kelos-bot Bot left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Kelos Reviewer Agent @gjkim42

Review Summary

Verdict: COMMENT (would be APPROVE — GitHub blocks self-approval)
Overall correctness: patch is correct
Scope: Adds in-place GitHub App installation token refresh by mounting the per-task token Secret as a kubelet-synced file, teaching git/gh/agents to read it on each call, and a controller refresh loop keyed off Job Active state.

Findings Overview

Priority Count File:Line Summary
P0 0 none
P1 0 none
P2 0 none
P3 4 multiple minor consistency/observability nits

Findings

Correctness / Observability

  • [P3] internal/controller/task_controller.go:531-539 — When refreshGitHubAppTokenIfNeeded cannot fetch the Workspace (deleted, transient API error, cache lag), the host-derivation branch is silently skipped and tc.BaseURL defaults to the controller-wide value. For enterprise installations whose default TokenClient.BaseURL is https://api.github.com, the refresh will then mint against the wrong endpoint and fail with a 401 that surfaces only as a generic GitHubTokenRefreshFailed event. Consider logging the r.Get error (even at V(1)) so operators can correlate the refresh failure with the underlying cause, or persist the resolved API base URL on the token Secret alongside kelos.dev/github-app-secret so refresh does not depend on the Workspace still being readable.

Code Quality

  • [P3] internal/controller/task_controller.go:546-552 vs internal/controller/task_controller.go:457-466 — Inconsistent secret-data update technique: resolveGitHubAppToken writes the new token via existing.StringData = tokenSecret.StringData, while refreshGitHubAppTokenIfNeeded writes into tokenSecret.Data directly and deletes the (nil-from-Get) StringData entry. Both paths persist correctly but use opposite conventions, which makes the code harder to reason about. Pick one (StringData for writes is the idiomatic choice).
  • [P3] internal/controller/job_builder.go:567mainContainer.VolumeMounts = append([]corev1.VolumeMount(nil), workspaceVolumeMounts...) defensively clones the slice for the main container, but the init containers (git-clone, branch-setup) at :454 and :530 reference workspaceVolumeMounts directly. If any later code path appends to one of those containers' VolumeMounts, the shared backing array could leak across containers. Either share consistently or clone consistently.

Tests

  • [P3] internal/controller/job_builder_test.go:836-845 — In TestBuildClaudeCodeJob_CredentialHelperReadsFromTokenFile, the two OR'd strings.Contains checks both evaluate to the identical literal cat "/kelos/github-token/GITHUB_TOKEN", so the second branch is dead code. If the intent was to tolerate alternate quoting (e.g. cat '/kelos/...'), one of the literals should use single quotes; otherwise drop the redundant check.

Key takeaways

  • The refresh path is correctly gated on job.Status.Active > 0 and only fires for App-derived Secrets (via the kelos.dev/github-app-secret annotation), which keeps PAT-backed workspaces untouched and avoids burning rate limit on finished Jobs.
  • The credential-helper file fallback to $GITHUB_TOKEN keeps custom agent images that have not adopted KELOS_GITHUB_TOKEN_FILE working for short tasks, matching the migration note in the docs.
  • Happy-path coverage is in place (TestRefreshGitHubAppTokenIfNeeded_RemintsExpiringToken asserts the API call, the rewritten Data["GITHUB_TOKEN"], and the refreshed expiry annotation).

@kelos-bot kelos-bot Bot force-pushed the kelos-task-1182 branch from 26f87ea to 3efcc6e Compare May 23, 2026 10:37
@kelos-bot

kelos-bot Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor Author

🤖 Kelos Agent @gjkim42

All four P3 review findings were already addressed in the current commit (3efcc6e), and the branch is already a single squashed commit. CI is green. Ready for re-review.

@gjkim42

gjkim42 commented May 23, 2026

Copy link
Copy Markdown
Collaborator

/kelos pick-up

@kelos-bot

kelos-bot Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor Author

🤖 Kelos Task Status

Task kelos-pr-responder-issue-comment-423c726ad82f has succeeded. ✅

@kelos-bot

kelos-bot Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor Author

🤖 Kelos Agent @gjkim42

Re-checked on pick-up: no new review feedback or comments since the 10:51 UTC status, the current commit (3efcc6e) still addresses all four P3 findings from the prior /kelos review, the branch is a single squashed commit, and CI is green. Holding for your re-review.

@gjkim42

gjkim42 commented May 29, 2026

Copy link
Copy Markdown
Collaborator

/kelos review

@kelos-bot

kelos-bot Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor Author

🤖 Kelos Task Status

Task kelos-reviewer-issue-comment-8bef20f790bf has succeeded. ✅

@kelos-bot kelos-bot Bot left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Kelos Reviewer Agent @gjkim42

Review Summary

Verdict: COMMENT (would be APPROVE — GitHub blocks self-approval)
Overall correctness: patch is correct
Scope: Mounts the per-task GitHub token Secret as a kubelet-synced file, teaches git/gh/agents to re-read it each invocation, and adds a controller refresh loop that re-mints the GitHub App installation token in place while the Job is Active.

Findings Overview

Priority Count File:Line Summary
P0 0 none
P1 0 none
P2 0 none
P3 0 none

Findings

No qualifying issues. The four P3 nits from the prior round (workspace-fetch fallback logging, StringData vs Data write inconsistency, VolumeMount slice cloning, and the duplicated strings.Contains assertion) are all resolved in 3efcc6e.

Key takeaways

  • Refresh is correctly gated on job.Status.Active > 0 and the kelos.dev/github-app-secret annotation, and early-returns on time.Now().Before(refreshAt) before touching TokenClient — so frequent reconciles (e.g. output polling) do not spam the GitHub API; the install token is only re-minted inside the 5-minute margin.
  • The credential helper and gh wrapper both read $KELOS_GITHUB_TOKEN_FILE lazily with a $GITHUB_TOKEN env fallback, and the volume mounts the resolved per-task Secret (<task>-github-token), so kubelet sync delivers refreshed tokens without a pod restart. PAT-backed workspaces inherit the file plumbing but are never refreshed, matching prior behavior.
  • On workspace-fetch failure during refresh the controller falls back to the default API base URL and logs at V(1); for an Enterprise host this surfaces as a GitHubTokenRefreshFailed warning event + retry rather than a silently broken token, which is acceptable.
  • Tests cover the happy path (TestRefreshGitHubAppTokenIfNeeded_RemintsExpiringToken asserts the API call count, rewritten token, and refreshed expiry annotation) plus the skip/no-op branches and the volume/mount/env wiring in unit and integration suites.

@kelos-bot kelos-bot Bot mentioned this pull request Jun 1, 2026
@gjkim42

gjkim42 commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

/kelos review

@kelos-bot

kelos-bot Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

🤖 Kelos Task Status

Task kelos-reviewer-issue-comment-46eeaed30a96 has succeeded. ✅

@kelos-bot kelos-bot Bot left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Kelos Reviewer Agent @gjkim42

Review Summary

Verdict: COMMENT
Overall correctness: patch is correct
Scope: Mounts per-task GitHub tokens as kubelet-synced files, refreshes GitHub App installation tokens while Jobs are active, and updates agent images/docs/tests so git and gh re-read refreshed tokens.

Findings Overview

Priority Count File:Line Summary
P0 0 none
P1 0 none
P2 1 internal/controller/job_builder.go:870 New managed token volume is not reserved for PodOverrides
P3 0 none

Findings

Correctness

  • [P2] internal/controller/job_builder.go:870 — Add GitHubTokenVolumeName to reservedVolumeNames (and update the related PodOverrides docs/tests). When a Task uses a workspace secret, the builder now always appends an internal kelos-github-token volume before appending PodOverrides.Volumes; because this new internal name is not rejected by validateUserVolumes, a user override with the same name creates duplicate pod volume names and the Job is rejected by the API server instead of failing fast during job construction.

Key takeaways

  • The refresh loop, token annotations, file mount, credential helper, and gh wrapper behavior look consistent with the PR description.
  • The remaining issue is limited to the PodOverrides collision path and does not make the main token-refresh patch incorrect.

Comment thread internal/controller/job_builder.go
@gjkim42

gjkim42 commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator

/kelos pick-up
rebase on origin/main and fix the latest review comments

@kelos-bot

kelos-bot Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor Author

🤖 Kelos Task Status

Task kelos-pr-responder-issue-comment-8ca8afb94a4d has failed. ❌

@kelos-bot kelos-bot Bot force-pushed the kelos-task-1182 branch from 3efcc6e to 8015e6e Compare June 6, 2026 02:23
@gjkim42

gjkim42 commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator

/kelos review

@kelos-bot

kelos-bot Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor Author

🤖 Kelos Task Status

Task kelos-reviewer-issue-comment-afae0e75e56a has succeeded. ✅

@kelos-bot kelos-bot Bot left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Kelos Reviewer Agent @gjkim42

Review Summary

Verdict: COMMENT
Overall correctness: patch is correct
Scope: Mounts per-task GitHub tokens as kubelet-synced files, refreshes GitHub App installation tokens while Jobs are active, and updates agent images/docs/tests so git and gh re-read refreshed tokens.

Findings Overview

Priority Count File:Line Summary
P0 0 none
P1 0 none
P2 1 internal/controller/task_controller.go:175 Retry token refresh sooner after a failed refresh
P3 0 none

Findings

Correctness

  • [P2] internal/controller/task_controller.go:175 — Retry failed token refreshes sooner than the full 5-minute refresh margin. This branch only runs once the stored token is already within that margin, so setting RequeueAfter to tokenRefreshMargin can schedule the next attempt at or after the old token's expiry; a single transient GitHub/API-server error can therefore still leave a long-running task with an expired token. Use a shorter backoff or compute the retry from the stored expiry so at least one more attempt happens before expiration.

Key takeaways

  • The reserved kelos-github-token PodOverrides collision from the prior review is fixed and the obsolete Kelos-bot review thread was resolved.
  • The main file/env plumbing, App-derived Secret annotations, Enterprise API base URL handling, and docs/tests otherwise match the PR scope.

Comment thread internal/controller/task_controller.go Outdated
logger.Error(refreshErr, "Unable to refresh GitHub App installation token")
r.recordEvent(&task, corev1.EventTypeWarning, "GitHubTokenRefreshFailed", "Failed to refresh GitHub App installation token: %v", refreshErr)
if result.RequeueAfter == 0 || tokenRefreshMargin < result.RequeueAfter {
result.RequeueAfter = tokenRefreshMargin

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] Retry sooner than the full refresh margin after a failed refresh. Once this branch runs we are already inside (or past) expiresAt - tokenRefreshMargin, so scheduling the next attempt for tokenRefreshMargin can put the retry at or after the old token's expiry; a single transient GitHub/API-server error then leaves long-running tasks with an expired token until the next reconcile. Please requeue on a shorter backoff, or compute it from the stored expiry, so at least one more refresh attempt happens before expiration.

@gjkim42

gjkim42 commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator

/kelos pick-up
address latest review comments

@kelos-bot

kelos-bot Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor Author

🤖 Kelos Task Status

Task kelos-pr-responder-issue-comment-1757c41c1c1f has succeeded. ✅

@kelos-bot

kelos-bot Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor Author

🤖 Kelos Agent @gjkim42

Addressed the latest P2 review comment in 0b0422f by retrying failed GitHub App token refreshes sooner than the 5-minute refresh margin, with the retry delay shortened near token expiry so another attempt is scheduled before the current token expires. Added a reconcile-level regression test for that failure path.

Verification: make test and make verify pass locally. The requested local /review --base origin/main command is not available in this container (/bin/bash: /review: No such file or directory). GitHub CI is green, including e2e. Ready for re-review.

@gjkim42

gjkim42 commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator

/kelos pick-up
rebase on origin/main

@gjkim42

gjkim42 commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator

/kelos pick-up

@kelos-bot

kelos-bot Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor Author

🤖 Kelos Task Status

Task kelos-pr-responder-issue-comment-7eae25a6900c has succeeded. ✅

gjkim42 and others added 4 commits June 14, 2026 22:21
GitHub App installation tokens expire ~1h after they are minted, but the
controller previously only minted a token at task admission and injected
it as an envFrom secret reference. Tasks that ran longer than 1h lost
the ability to push commits, comment, or update PR status for the
remainder of the run.

This change makes the per-task GitHub token Secret a refreshable
resource without restarting the agent pod:

- Mount the workspace token Secret as a kubelet-synced volume at
  /kelos/github-token/GITHUB_TOKEN in the agent and supporting init
  containers, alongside the existing env vars (which stay as a
  fallback for PAT-backed workspaces).
- Expose the mount path via KELOS_GITHUB_TOKEN_FILE so downstream
  consumers can re-read on every invocation.
- Update the git credential helper to source the token from that file
  on each git operation, falling back to $GITHUB_TOKEN when the file
  is unreadable.
- Ship a /usr/local/bin/gh wrapper in every reference agent image
  that re-exports GH_TOKEN/GH_ENTERPRISE_TOKEN from the file before
  exec'ing the distro gh.
- Add a controller refresh loop: while the Job is Active and the
  workspace uses GitHub App creds, re-mint the installation token
  before the 5-minute margin and update the per-task Secret in
  place. The Secret is annotated with the source App secret name
  and the current expiry so refresh decisions are cheap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@kelos-bot kelos-bot Bot force-pushed the kelos-task-1182 branch from 0b0422f to f9b9a9a Compare June 14, 2026 22:42
@kelos-bot

kelos-bot Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor Author

🤖 Kelos Agent @gjkim42

Rebased the PR branch on current origin/main and resolved the rebase fallout. I kept the token-refresh behavior intact, aligned the v1alpha2 CRD descriptions with the new kelos-github-token managed volume, and cleaned up stale test aliases introduced by the rebase.

Verification: make test and make verify pass locally. The requested local /review --base origin/main command is still unavailable in this container (/review: No such file or directory). GitHub CI is green, including integration and e2e. Ready for re-review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Long-running task fails to push or comment after GITHUB_TOKEN expires mid-run

1 participant