fix: refresh GitHub App installation token for long-running tasks#1196
fix: refresh GitHub App installation token for long-running tasks#1196kelos-bot[bot] wants to merge 4 commits into
Conversation
|
/kelos review |
|
🤖 Kelos Task Status Task |
|
/kelos pick-up |
|
🤖 Kelos Task Status Task |
|
🤖 Kelos Agent @gjkim42 Fixed the failing test-integration check. The existing Task Controller integration spec for workspaces with a secretRef asserted six env vars on the main container and two on the git-clone init container. The fix in this PR adds KELOS_GITHUB_TOKEN_FILE to both sets and mounts the per-task token Secret as a refreshable volume, so the old length assertions tripped under CI. The test now asserts the additional env var, verifies the kelos-github-token volume is declared on the pod, and confirms it is mounted on both the main and init containers at the expected path. make verify, make test, and make test-integration all pass locally. All CI checks pass on 2bcf1c0. Ready for re-review. |
| | `GITHUB_TOKEN` | GitHub token for workspace access; captured at pod start and not refreshed in-process — long-running consumers should read `KELOS_GITHUB_TOKEN_FILE` instead | When workspace has a `secretRef` | | ||
| | `GH_TOKEN` | GitHub token for `gh` CLI (github.com); not refreshed in-process (the bundled `gh` wrapper overrides it from the token file on each call) | When workspace has a `secretRef` and repo is on github.com | | ||
| | `GH_ENTERPRISE_TOKEN` | GitHub token for `gh` CLI (GitHub Enterprise); same in-process freshness caveat as `GH_TOKEN` | When workspace has a `secretRef` and repo is on a GitHub Enterprise host | |
There was a problem hiding this comment.
is it better to recommend image authors to use KELOS_GITHUB_TOKEN_FILE?
|
🤖 Kelos Task Status Task |
2bcf1c0 to
26f87ea
Compare
|
/kelos review |
|
🤖 Kelos Task Status Task |
|
/kelos pick-up |
|
🤖 Kelos Task Status Task |
There was a problem hiding this comment.
🤖 Kelos Reviewer Agent @gjkim42
Review Summary
Verdict: COMMENT (would be APPROVE — GitHub blocks self-approval)
Overall correctness: patch is correct
Scope: Adds in-place GitHub App installation token refresh by mounting the per-task token Secret as a kubelet-synced file, teaching git/gh/agents to read it on each call, and a controller refresh loop keyed off Job Active state.
Findings Overview
| Priority | Count | File:Line | Summary |
|---|---|---|---|
| P0 | 0 | — | none |
| P1 | 0 | — | none |
| P2 | 0 | — | none |
| P3 | 4 | multiple | minor consistency/observability nits |
Findings
Correctness / Observability
- [P3]
internal/controller/task_controller.go:531-539— WhenrefreshGitHubAppTokenIfNeededcannot fetch theWorkspace(deleted, transient API error, cache lag), the host-derivation branch is silently skipped andtc.BaseURLdefaults to the controller-wide value. For enterprise installations whose defaultTokenClient.BaseURLishttps://api.github.com, the refresh will then mint against the wrong endpoint and fail with a 401 that surfaces only as a genericGitHubTokenRefreshFailedevent. Consider logging ther.Geterror (even at V(1)) so operators can correlate the refresh failure with the underlying cause, or persist the resolved API base URL on the token Secret alongsidekelos.dev/github-app-secretso refresh does not depend on the Workspace still being readable.
Code Quality
- [P3]
internal/controller/task_controller.go:546-552vsinternal/controller/task_controller.go:457-466— Inconsistent secret-data update technique:resolveGitHubAppTokenwrites the new token viaexisting.StringData = tokenSecret.StringData, whilerefreshGitHubAppTokenIfNeededwrites intotokenSecret.Datadirectly anddeletes the (nil-from-Get)StringDataentry. Both paths persist correctly but use opposite conventions, which makes the code harder to reason about. Pick one (StringDatafor writes is the idiomatic choice). - [P3]
internal/controller/job_builder.go:567—mainContainer.VolumeMounts = append([]corev1.VolumeMount(nil), workspaceVolumeMounts...)defensively clones the slice for the main container, but the init containers (git-clone,branch-setup) at:454and:530referenceworkspaceVolumeMountsdirectly. If any later code path appends to one of those containers'VolumeMounts, the shared backing array could leak across containers. Either share consistently or clone consistently.
Tests
- [P3]
internal/controller/job_builder_test.go:836-845— InTestBuildClaudeCodeJob_CredentialHelperReadsFromTokenFile, the two OR'dstrings.Containschecks both evaluate to the identical literalcat "/kelos/github-token/GITHUB_TOKEN", so the second branch is dead code. If the intent was to tolerate alternate quoting (e.g.cat '/kelos/...'), one of the literals should use single quotes; otherwise drop the redundant check.
Key takeaways
- The refresh path is correctly gated on
job.Status.Active > 0and only fires for App-derived Secrets (via thekelos.dev/github-app-secretannotation), which keeps PAT-backed workspaces untouched and avoids burning rate limit on finished Jobs. - The credential-helper file fallback to
$GITHUB_TOKENkeeps custom agent images that have not adoptedKELOS_GITHUB_TOKEN_FILEworking for short tasks, matching the migration note in the docs. - Happy-path coverage is in place (
TestRefreshGitHubAppTokenIfNeeded_RemintsExpiringTokenasserts the API call, the rewrittenData["GITHUB_TOKEN"], and the refreshed expiry annotation).
26f87ea to
3efcc6e
Compare
|
🤖 Kelos Agent @gjkim42 All four P3 review findings were already addressed in the current commit ( |
|
/kelos pick-up |
|
🤖 Kelos Task Status Task |
|
🤖 Kelos Agent @gjkim42 Re-checked on pick-up: no new review feedback or comments since the 10:51 UTC status, the current commit ( |
|
/kelos review |
|
🤖 Kelos Task Status Task |
There was a problem hiding this comment.
🤖 Kelos Reviewer Agent @gjkim42
Review Summary
Verdict: COMMENT (would be APPROVE — GitHub blocks self-approval)
Overall correctness: patch is correct
Scope: Mounts the per-task GitHub token Secret as a kubelet-synced file, teaches git/gh/agents to re-read it each invocation, and adds a controller refresh loop that re-mints the GitHub App installation token in place while the Job is Active.
Findings Overview
| Priority | Count | File:Line | Summary |
|---|---|---|---|
| P0 | 0 | — | none |
| P1 | 0 | — | none |
| P2 | 0 | — | none |
| P3 | 0 | — | none |
Findings
No qualifying issues. The four P3 nits from the prior round (workspace-fetch fallback logging, StringData vs Data write inconsistency, VolumeMount slice cloning, and the duplicated strings.Contains assertion) are all resolved in 3efcc6e.
Key takeaways
- Refresh is correctly gated on
job.Status.Active > 0and thekelos.dev/github-app-secretannotation, and early-returns ontime.Now().Before(refreshAt)before touchingTokenClient— so frequent reconciles (e.g. output polling) do not spam the GitHub API; the install token is only re-minted inside the 5-minute margin. - The credential helper and
ghwrapper both read$KELOS_GITHUB_TOKEN_FILElazily with a$GITHUB_TOKENenv fallback, and the volume mounts the resolved per-task Secret (<task>-github-token), so kubelet sync delivers refreshed tokens without a pod restart. PAT-backed workspaces inherit the file plumbing but are never refreshed, matching prior behavior. - On workspace-fetch failure during refresh the controller falls back to the default API base URL and logs at V(1); for an Enterprise host this surfaces as a
GitHubTokenRefreshFailedwarning event + retry rather than a silently broken token, which is acceptable. - Tests cover the happy path (
TestRefreshGitHubAppTokenIfNeeded_RemintsExpiringTokenasserts the API call count, rewritten token, and refreshed expiry annotation) plus the skip/no-op branches and the volume/mount/env wiring in unit and integration suites.
|
/kelos review |
|
🤖 Kelos Task Status Task |
There was a problem hiding this comment.
🤖 Kelos Reviewer Agent @gjkim42
Review Summary
Verdict: COMMENT
Overall correctness: patch is correct
Scope: Mounts per-task GitHub tokens as kubelet-synced files, refreshes GitHub App installation tokens while Jobs are active, and updates agent images/docs/tests so git and gh re-read refreshed tokens.
Findings Overview
| Priority | Count | File:Line | Summary |
|---|---|---|---|
| P0 | 0 | — | none |
| P1 | 0 | — | none |
| P2 | 1 | internal/controller/job_builder.go:870 | New managed token volume is not reserved for PodOverrides |
| P3 | 0 | — | none |
Findings
Correctness
- [P2] internal/controller/job_builder.go:870 — Add
GitHubTokenVolumeNametoreservedVolumeNames(and update the related PodOverrides docs/tests). When a Task uses a workspace secret, the builder now always appends an internalkelos-github-tokenvolume before appendingPodOverrides.Volumes; because this new internal name is not rejected byvalidateUserVolumes, a user override with the same name creates duplicate pod volume names and the Job is rejected by the API server instead of failing fast during job construction.
Key takeaways
- The refresh loop, token annotations, file mount, credential helper, and gh wrapper behavior look consistent with the PR description.
- The remaining issue is limited to the PodOverrides collision path and does not make the main token-refresh patch incorrect.
|
/kelos pick-up |
|
🤖 Kelos Task Status Task |
3efcc6e to
8015e6e
Compare
|
/kelos review |
|
🤖 Kelos Task Status Task |
There was a problem hiding this comment.
🤖 Kelos Reviewer Agent @gjkim42
Review Summary
Verdict: COMMENT
Overall correctness: patch is correct
Scope: Mounts per-task GitHub tokens as kubelet-synced files, refreshes GitHub App installation tokens while Jobs are active, and updates agent images/docs/tests so git and gh re-read refreshed tokens.
Findings Overview
| Priority | Count | File:Line | Summary |
|---|---|---|---|
| P0 | 0 | — | none |
| P1 | 0 | — | none |
| P2 | 1 | internal/controller/task_controller.go:175 | Retry token refresh sooner after a failed refresh |
| P3 | 0 | — | none |
Findings
Correctness
- [P2] internal/controller/task_controller.go:175 — Retry failed token refreshes sooner than the full 5-minute refresh margin. This branch only runs once the stored token is already within that margin, so setting
RequeueAftertotokenRefreshMargincan schedule the next attempt at or after the old token's expiry; a single transient GitHub/API-server error can therefore still leave a long-running task with an expired token. Use a shorter backoff or compute the retry from the stored expiry so at least one more attempt happens before expiration.
Key takeaways
- The reserved
kelos-github-tokenPodOverrides collision from the prior review is fixed and the obsolete Kelos-bot review thread was resolved. - The main file/env plumbing, App-derived Secret annotations, Enterprise API base URL handling, and docs/tests otherwise match the PR scope.
| logger.Error(refreshErr, "Unable to refresh GitHub App installation token") | ||
| r.recordEvent(&task, corev1.EventTypeWarning, "GitHubTokenRefreshFailed", "Failed to refresh GitHub App installation token: %v", refreshErr) | ||
| if result.RequeueAfter == 0 || tokenRefreshMargin < result.RequeueAfter { | ||
| result.RequeueAfter = tokenRefreshMargin |
There was a problem hiding this comment.
[P2] Retry sooner than the full refresh margin after a failed refresh. Once this branch runs we are already inside (or past) expiresAt - tokenRefreshMargin, so scheduling the next attempt for tokenRefreshMargin can put the retry at or after the old token's expiry; a single transient GitHub/API-server error then leaves long-running tasks with an expired token until the next reconcile. Please requeue on a shorter backoff, or compute it from the stored expiry, so at least one more refresh attempt happens before expiration.
|
/kelos pick-up |
|
🤖 Kelos Task Status Task |
|
🤖 Kelos Agent @gjkim42 Addressed the latest P2 review comment in Verification: |
|
/kelos pick-up |
|
/kelos pick-up |
|
🤖 Kelos Task Status Task |
GitHub App installation tokens expire ~1h after they are minted, but the controller previously only minted a token at task admission and injected it as an envFrom secret reference. Tasks that ran longer than 1h lost the ability to push commits, comment, or update PR status for the remainder of the run. This change makes the per-task GitHub token Secret a refreshable resource without restarting the agent pod: - Mount the workspace token Secret as a kubelet-synced volume at /kelos/github-token/GITHUB_TOKEN in the agent and supporting init containers, alongside the existing env vars (which stay as a fallback for PAT-backed workspaces). - Expose the mount path via KELOS_GITHUB_TOKEN_FILE so downstream consumers can re-read on every invocation. - Update the git credential helper to source the token from that file on each git operation, falling back to $GITHUB_TOKEN when the file is unreadable. - Ship a /usr/local/bin/gh wrapper in every reference agent image that re-exports GH_TOKEN/GH_ENTERPRISE_TOKEN from the file before exec'ing the distro gh. - Add a controller refresh loop: while the Job is Active and the workspace uses GitHub App creds, re-mint the installation token before the 5-minute margin and update the per-task Secret in place. The Secret is annotated with the source App secret name and the current expiry so refresh decisions are cheap. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
0b0422f to
f9b9a9a
Compare
|
🤖 Kelos Agent @gjkim42 Rebased the PR branch on current Verification: |
What type of PR is this?
/kind bug
What this PR does / why we need it:
GitHub App installation tokens expire about 1 hour after they are minted. Previously the controller minted a token at task admission and injected it through pod env vars, which are fixed for the life of the pod. Long-running tasks could then lose the ability to fetch, push, comment, or update checks once that token expired.
This PR makes the per-task GitHub token Secret refreshable while the task Job is running:
/kelos/github-token/GITHUB_TOKENin the agent and supporting init containers.GITHUB_TOKEN,GH_TOKEN, andGH_ENTERPRISE_TOKENenv vars as compatibility fallbacks.KELOS_GITHUB_TOKEN_FILEso custom agents can read the current token from the mounted file.$GITHUB_TOKENfallback./usr/local/bin/ghwrapper in the reference agent images soghreads the token file on each invocation.Active, usingkelos.dev/github-app-secretandkelos.dev/token-expires-atannotations to identify and schedule refreshes.kelos-github-tokenvolume name soPodOverrides.Volumesfail during job construction instead of producing an invalid pod spec.PAT-backed workspaces keep the same token lifetime, but use the same file/env plumbing for consistency.
Which issue(s) this PR is related to:
Fixes #1182
Special notes for your reviewer:
KELOS_GITHUB_TOKEN_FILEfor long-running GitHub API calls. The env vars remain for compatibility but are fixed at pod start.Does this PR introduce a user-facing change?